Chapter 6 Missing data and Imputation
Imputation is the process of estimating missing data points. This can be done in any number of ways, and as usual, the “best” way depends on the problem.
Of course, you don’t have to impute data. You can also simply delete any indicator or unit that has missing values, although in many cases this can be too restrictive. Reasonable results can still be obtained despite small amounts of missing data, although if too much data is missing, the uncertainty can be too high to give a meaningful analysis. As usual, it is a balance.
A good first step is to check how much data is missing, and where (see below). Units with very high amounts of missing data can be screened out. Small amounts of missing data can then be imputed.
6.1 Concept
The simplest imputation approach is to use values of other units to estimate the missing point. Typically, this could involve the sample mean or median of the indicator. Here’s some data from the ASEM data set regarding the average connection speed of each country. Towards the end there are some missing values, so let’s view the last few rows:
library(COINr6)
<- data.frame(Country = ASEMIndData$UnitName, ConSpeed = ASEMIndData$ConSpeed)
Ind1 40:nrow(Ind1),]
Ind1[## Country ConSpeed
## 40 Korea 28.6
## 41 Lao PDR NA
## 42 Malaysia 8.9
## 43 Mongolia NA
## 44 Myanmar NA
## 45 New Zealand 14.7
## 46 Pakistan NA
## 47 Philippines 5.5
## 48 Russian Federation 11.8
## 49 Singapore 20.3
## 50 Thailand 16.0
## 51 Vietnam 9.5
Using our simple imputation method, we just replace the NA
values with the sample mean.
$ConSpeed <- replace(Ind1$ConSpeed, is.na(Ind1$ConSpeed), mean(Ind1$ConSpeed, na.rm = T))
Ind140:nrow(Ind1),]
Ind1[## Country ConSpeed
## 40 Korea 28.60000
## 41 Lao PDR 14.28605
## 42 Malaysia 8.90000
## 43 Mongolia 14.28605
## 44 Myanmar 14.28605
## 45 New Zealand 14.70000
## 46 Pakistan 14.28605
## 47 Philippines 5.50000
## 48 Russian Federation 11.80000
## 49 Singapore 20.30000
## 50 Thailand 16.00000
## 51 Vietnam 9.50000
This approach can be reasonable if countries are somehow similar in that indicator. However, other perhaps more informative ways are available:
- Substituting by the mean or median of the country group (e.g. income group or continent)
- If time series data is available, use the latest known data point
Then there is another class of methods which use data from other indicators to estimate the missing point. The core idea here is that if the indicators are related to one another, it is possible to guess the missing point by using known values of other indicators. This can take the form of:
- Simply substituting the mean or median of the normalised values of the other indicators
- Subsituting the mean or median of normalised values of the other indicators within the aggregation group
- Using a more formal approach, based on regression or more generally on statistical modelling
Let’s explore the options available in COINr
6.2 Data checks and screening
A first step is to check in detail how much data are missing. The function checkData()
does this, and also has the option to screen units based on data availability.
# Assemble ASEM data first
<- assemble(IndData = COINr6::ASEMIndData, IndMeta = COINr6::ASEMIndMeta,
ASEM AggMeta = COINr6::ASEMAggMeta)
## -----------------
## Denominators detected - stored in .$Input$Denominators
## -----------------
## -----------------
## Indicator codes cross-checked and OK.
## -----------------
## Number of indicators = 49
## Number of units = 51
## Number of aggregation levels = 3 above indicator level.
## -----------------
## Aggregation level 1 with 8 aggregate groups: Physical, ConEcFin, Political, Instit, P2P, Environ, Social, SusEcFin
## Cross-check between metadata and framework = OK.
## Aggregation level 2 with 2 aggregate groups: Conn, Sust
## Cross-check between metadata and framework = OK.
## Aggregation level 3 with 1 aggregate groups: Index
## Cross-check between metadata and framework = OK.
## -----------------
# Missing data check on the raw data set
<- checkData(ASEM, dset = "Raw")
ASEM head(ASEM$Analysis$Raw$MissDatSummary)
## UnitCode N_missing N_zero N_miss_or_zero PrcDataAll PrcNonZero LowDataAll
## 1 AUT 0 2 2 100 95.91837 FALSE
## 2 BEL 0 2 2 100 95.91837 FALSE
## 3 BGR 0 0 0 100 100.00000 FALSE
## 4 HRV 0 1 1 100 97.95918 FALSE
## 5 CYP 0 3 3 100 93.87755 FALSE
## 6 CZE 0 3 3 100 93.87755 FALSE
## ZeroFlag LowDatOrZeroFlag Included
## 1 FALSE FALSE TRUE
## 2 FALSE FALSE TRUE
## 3 FALSE FALSE TRUE
## 4 FALSE FALSE TRUE
## 5 FALSE FALSE TRUE
## 6 FALSE FALSE TRUE
The MissDataSummary
table shows the unit code, number of missing observations, and overall percentage data availability. Finally, the LowDataAll
column can be used as a flag for units with data availability lower than a set amount ind_thresh
which is one of the input arguments to checkData()
. ASEM indicators were already chosen to fulfill minimum data requirements, for which reason the data availability is all above the default of 2/3.
We will set the minimum data threshold a bit higher, and in doing so also demonstrate another feature of checkData()
, which is to automatically screen units based on data availability. Setting unit_screen = "byNA"
will generate a new data set .$Data$Screened
which only includes units with data availability above the set threshold. This data set can then be passed on to subsequent operations.
# Missing data check on the raw data set
<- checkData(ASEM, dset = "Raw", ind_thresh = 0.85, unit_screen = "byNA")
ASEM $Analysis$Raw$RemovedUnits
ASEM## [1] "BRN" "LAO" "MMR"
This generates a new data set .$Data$Screened
with the removed units recorded in .$Analysis$Raw$RemovedUnits
, which in this case are Brunei, Laos and Myanmar. It is also possible to screen units based on the number of zero values (unit_screen = "byzeros"
) or both (unit_screen = "byNAandzeros"
). Screening by zeros can be useful to exclude units which have zero values for all or most indicators, and can skew distributions.
As a final option to mention, you can manually include or exclude units. So for example, a unit that doesn’t have sufficient data coverage could be manually included anyway, and another unit could be excluded.
# Countries to include and exclude
<- data.frame(UnitCode = c("AUT","BRN"), Status = c("Exclude", "Include"))
df
# Missing data check on the raw data set
<- checkData(ASEM, dset = "Raw", ind_thresh = 0.85, unit_screen = "byNA", Force = df)
ASEM $Analysis$Raw$RemovedUnits
ASEM## [1] "AUT" "LAO" "MMR"
Now AUT has been excluded, even though it had sufficient data coverage, and BRN has been manually included, even though it didn’t. The intention is to give some flexibility to make exceptions for hard rules.
The other output of interest from checkData()
is missing data by groups:
head(ASEM$Analysis$Raw$MissDatByGroup[30:40,])
## UnitCode ConEcFin Instit P2P Physical Political Environ Social
## 30 GBR 100 100.00000 100 100.0 100 100 100.00000
## 31 AUS 100 100.00000 100 100.0 100 100 100.00000
## 32 BGD 100 83.33333 75 87.5 100 100 88.88889
## 33 BRN 80 100.00000 75 87.5 100 100 55.55556
## 34 KHM 80 100.00000 75 87.5 100 100 77.77778
## 35 CHN 100 100.00000 100 100.0 100 100 100.00000
## SusEcFin Conn Sust Index
## 30 100 100.00000 100.00000 100.00000
## 31 100 100.00000 100.00000 100.00000
## 32 80 86.66667 89.47368 87.75510
## 33 60 86.66667 68.42105 79.59184
## 34 100 86.66667 89.47368 87.75510
## 35 80 100.00000 94.73684 97.95918
This gives the percentage indicator data availability inside each aggregation group. It’s worth mentioning here that on aggregation, it is possible to set a data availability threshold for aggregation groups, below which an aggregate score returns NA
. This is explained more in the chapter on Aggregation.
6.3 Imputation in COINr
Now we turn to estimating missing data points in COINr. The function of interest is impute()
, which gives a number of options, corresponding to the types of imputation discussed above. Briefly, COINr can impute using:
- The indicator mean, either across all countries or within a specified group
- The indicator median, either across all countries or within a specified group
- The mean or median, within the aggregation group
- The expectation maximisation algorithm, via the AMELIA package
6.3.1 By column
The impute()
function follows a similar logic to other COINr functions. We can enter a COIN or a data frame, and get a COIN or a data frame back. Let’s impute on a COIN first, using the raw ASEM data.
# Assemble ASEM data if not done already (uncomment the following)
# ASEM <- assemble(IndData = COINr6::ASEMIndData, IndMeta = COINr6::ASEMIndMeta,
# AggMeta = COINr6::ASEMAggMeta)
# check how many NAs we have
sum(is.na(ASEM$Data$Raw))
## [1] 63
# impute using indicator mean
<- impute(ASEM, dset = "Raw", imtype = "ind_mean")
ASEM ## Missing data points detected = 63
## Missing data points imputed = 63, using method = ind_mean
# check how many NAs after imputation
sum(is.na(ASEM$Data$Imputed))
## [1] 0
If the input is a COIN, by default the output is an updated COIN, with a new data set .$Data$Imputed
. We can also set out2 = "df"
to output the imputed data frame on its own.
The above example replaces NA
values with the mean of the indicator, and setting imtype = ind_median
will instead use the indicator median. In this respect, we are using information from other units, inside the same indicator, to estimate the missing values.
Imputing by group may be more informative if you have meaningful grouping variables. For example, the ASEM data set has groupings by GDP, population, GDP per capita and whether the country is European or Asian. We might hypothesise that it is better to replace NA
values with the median inside a group, say by GDP group. This would imply that countries with a similar GDP are likely to have similar indicator values.
# impute using GDP group median
<- impute(ASEM, dset = "Raw", imtype = "indgroup_median", groupvar = "Group_GDP")
ASEM ## Missing data points detected = 63
## Missing data points imputed = 63, using method = indgroup_median
6.3.2 By row
So far, the imputation has been using values from the same column (indicator). Another possibility is to use values from other indicators, i.e operate row-wise. COINr offers two basic options in this respect: either to take the mean or median of the other indicators inside the aggregation group, and this can be done by setting imtype = "agg_mean"
or imtype = "agg_median"
respectively.
Both of these imputation methods will, for a given unit, take the mean or median of the normalised values of the other indicators in the aggregation group in the level above indicator level. The process is as follows, for each NA
value:
- Find indicators in the same aggregation group
- Normalise these indicators using the min-max method, so they each have a minimum of zero and a maximum of one.
- Replace the
NA
with the mean or median of the normalised values, for the unit in question. - Reverse the min-max transformation so that the indicators are returned to their original scale.
It’s important to realise that this is equivalent to aggregating with the mean or median without imputing first. This is because in the aggregation step, if we take the mean of a group of indicators, and there is a NA
present, this value is excluded from the mean calculation. Doing this is mathematically equivalent to assigning the mean to that missing value. Therefore, one reason to use this imputation method is to see which values are being implicitly assigned as a result of excluding missing values from the aggregation step. This is sometimes known as “shadow imputation”.
6.3.3 By column and row
Finally, we can use a regression-based expectation maximisation algorithm, imported via the AMELIA
package. This estimates missing values using the values of the other indicators and the other units. This effectively involves regressing each indicator on other indicators, and predicting missing values.
A catch of this approach is that if the number of units is small compared to the number of indicators, there will not be enough observations to estimate the parameters of the model. This is definitely the case if you have more indicators than observations. To use the EM algorithm in this case, imputation is performed on aggregation groups. So, to use the EM approach in COINr
, you have to specify the aggregation level to group indicators by.
# impute using EM, grouping indicators inside their pillars (level 2)
<- impute(ASEM, dset = "Raw", imtype = "EM", EMaglev = 2)
ASEM ## Missing data points detected = 63
## Missing data points imputed = 63, using method = EM
This works, because the ASEM data set has 51 units and no more than eight indicators per pillar. However, if we were to use EMaglev = 3
, i.e. try imputing indicators grouped into the two sub-indexes, it does not work.
Finding the right aggregation level to impute at might involve a bit of trial and error, but the advantage of imputing by groups of indicators is that aggregation groups are usually composed of indicators that are similar in some respect. This makes it more likely that they are good predictors of the other indicators in their group.
The AMELIA
package offers many more options for imputation that are not available through the impute()
function, such as bootstrapping, time-series imputation and more. As with plotting packages COINr
aims to provide an easy and quick interface to standard imputation in AMELIA
, but if you want to have full control you should use the AMELIA
package directly. Other good imputation options are the MICE
package (Multivariate Imputation via Chained Equations) and the missForest
package (using random forests).
Finally, if you are interested in imputing data over time, i.e. using the most recent data point available with panel data, go back to the Putting everything together section, which explains how assemble()
can be used to extract a single time point from panel data and impute using the latest year. This is deliberately done at the assembly stage because once constructed, a COIN represents only a single year of data.