Unit Screening • COINr

Unit screening filters units based on data availability rules. As with indicators (columns), when a unit (row) has very few data points available, it may make sense to remove it. This avoids drawing conclusions about units from very little data. It also typically increases the percentage data availability of the indicators, because the removed units are the ones contributing the most missing values.

The COINr function Screen() is a generic function with methods for data frames, coins and purses. It is a building function: when applied to a coin, it creates a new data set in .$Data as its output.

Data frames

We begin with data frames. Let’s take a subset of the inbuilt example data for demonstration. I cherry-pick some rows and columns which have some missing values.

library(COINr)

# example data
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]

iData
#>    uCode Research   Pat CultServ CultGood
#> 40   KOR    20437 249.8  1.79800       NA
#> 41   LAO      175    NA       NA       NA
#> 42   MYS     8080  64.2  1.15292    7.555
#> 43   MNG      293   0.3  0.00266    0.046
#> 44   MMR      299    NA  0.08905       NA
#> 45   NZL     7731  46.5  0.34615    1.213
#> 46   PAK     7122   7.2  0.03553    1.256
#> 47   PHL     1361  11.3  0.29555    3.185
#> 48   RUS    16182 141.5  1.44633    8.379
#> 49   SGP    11411 270.5  0.92780   14.507
#> 50   THA     5317  53.6  0.08969    6.661
#> 51   VNM     3618    NA       NA       NA

The data has four indicators, plus an identifier column “uCode”. Looking at each unit, the data availability is variable. We have 12 units in total.

Now let’s use Screen() to screen out some of these units. Specifically, we will remove any units that have less than 75% data availability (that is, fewer than 3 of the 4 indicators with non-NA values):

l_scr <- Screen(iData, unit_screen = "byNA", dat_thresh = 0.75)

The output of Screen() is a list:

str(l_scr, max.level = 1)
#> List of 3
#>  $ ScreenedData:'data.frame':    9 obs. of  5 variables:
#>  $ DataSummary :'data.frame':    12 obs. of  10 variables:
#>  $ RemovedUnits: chr [1:3] "LAO" "MMR" "VNM"

We can see already that the “RemovedUnits” entry tells us that three units were removed based on our specifications. We now have our new screened data set:

l_scr$ScreenedData
#>    uCode Research   Pat CultServ CultGood
#> 40   KOR    20437 249.8  1.79800       NA
#> 42   MYS     8080  64.2  1.15292    7.555
#> 43   MNG      293   0.3  0.00266    0.046
#> 45   NZL     7731  46.5  0.34615    1.213
#> 46   PAK     7122   7.2  0.03553    1.256
#> 47   PHL     1361  11.3  0.29555    3.185
#> 48   RUS    16182 141.5  1.44633    8.379
#> 49   SGP    11411 270.5  0.92780   14.507
#> 50   THA     5317  53.6  0.08969    6.661
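
As a sanity check, the 75% rule can be reproduced in a few lines of base R. This is only a sketch of the rule as described above, not COINr's internal implementation. The data frame is retyped from the printed table so that the snippet runs without COINr, and the names ind_cols, dat_avail and removed are illustrative.

```r
# Example rows retyped from the printed table above (no COINr needed)
iData <- data.frame(
  uCode    = c("KOR", "LAO", "MYS", "MNG", "MMR", "NZL",
               "PAK", "PHL", "RUS", "SGP", "THA", "VNM"),
  Research = c(20437, 175, 8080, 293, 299, 7731,
               7122, 1361, 16182, 11411, 5317, 3618),
  Pat      = c(249.8, NA, 64.2, 0.3, NA, 46.5,
               7.2, 11.3, 141.5, 270.5, 53.6, NA),
  CultServ = c(1.79800, NA, 1.15292, 0.00266, 0.08905, 0.34615,
               0.03553, 0.29555, 1.44633, 0.92780, 0.08969, NA),
  CultGood = c(NA, NA, 7.555, 0.046, NA, 1.213,
               1.256, 3.185, 8.379, 14.507, 6.661, NA),
  stringsAsFactors = FALSE
)

# Indicator columns are everything except the identifier
ind_cols <- setdiff(names(iData), "uCode")

# Fraction of non-missing indicator values for each unit (row)
dat_avail <- rowMeans(!is.na(iData[ind_cols]))

# Units strictly below the 75% threshold are the ones screened out
removed <- iData$uCode[dat_avail < 0.75]
removed
#> [1] "LAO" "MMR" "VNM"
```

KOR has exactly 75% availability (one missing value out of four), so it sits on the threshold and is kept, which matches the output of Screen() above.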

We also have a per-unit summary of data availability and zero counts, with flags showing whether each unit was included:

head(l_scr$DataSummary)
#>    uCode N_missing N_zero N_miss_or_zero Dat_Avail Non_Zero LowData LowNonZero
#> 40   KOR         1      0              1      0.75        1   FALSE      FALSE
#> 41   LAO         3      0              3      0.25        1    TRUE      FALSE
#> 42   MYS         0      0              0      1.00        1   FALSE      FALSE
#> 43   MNG         0      0              0      1.00        1   FALSE      FALSE
#> 44   MMR         2      0              2      0.50        1    TRUE      FALSE
#> 45   NZL         0      0              0      1.00        1   FALSE      FALSE
#>    LowDatOrZeroFlag Included
#> 40            FALSE     TRUE
#> 41             TRUE    FALSE
#> 42            FALSE     TRUE
#> 43            FALSE     TRUE
#> 44             TRUE    FALSE
#> 45            FALSE     TRUE

This table is in fact generated by get_data_avail() - some more details can be found in the Analysis vignette.
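
The summary table is per unit, but the effect of screening on each indicator can be checked in a similar way. The following base R sketch compares the share of non-missing values in each indicator before and after dropping the three removed units. The values are retyped from the printed table so that it runs without COINr, and the object names are illustrative rather than part of the package.

```r
# Example rows retyped from the printed table (no COINr needed)
iData <- data.frame(
  uCode    = c("KOR", "LAO", "MYS", "MNG", "MMR", "NZL",
               "PAK", "PHL", "RUS", "SGP", "THA", "VNM"),
  Research = c(20437, 175, 8080, 293, 299, 7731,
               7122, 1361, 16182, 11411, 5317, 3618),
  Pat      = c(249.8, NA, 64.2, 0.3, NA, 46.5,
               7.2, 11.3, 141.5, 270.5, 53.6, NA),
  CultServ = c(1.79800, NA, 1.15292, 0.00266, 0.08905, 0.34615,
               0.03553, 0.29555, 1.44633, 0.92780, 0.08969, NA),
  CultGood = c(NA, NA, 7.555, 0.046, NA, 1.213,
               1.256, 3.185, 8.379, 14.507, 6.661, NA),
  stringsAsFactors = FALSE
)
ind_cols <- setdiff(names(iData), "uCode")

# Share of non-missing values per indicator, all 12 units
avail_before <- colMeans(!is.na(iData[ind_cols]))

# Same calculation after dropping the units that Screen() removed
keep <- !(iData$uCode %in% c("LAO", "MMR", "VNM"))
avail_after <- colMeans(!is.na(iData[keep, ind_cols]))

# Side-by-side comparison, rounded for display
round(rbind(before = avail_before, after = avail_after), 2)
```

In this subset, Research is complete both before and after, Pat and CultServ become complete, and CultGood rises from 8 of 12 to 8 of 9 available values because KOR is retained with its missing value.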

Other than data availability, units can also be screened based on the presence of zeros, or on both - this is specified by the unit_screen argument. Use the Force argument to override the screening rules for specified units if required (either to force inclusion or force exclusion).
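
As a rough illustration of forcing a unit, the sketch below keeps MMR even though it falls below the threshold. It assumes that Force accepts a data frame with a uCode column and a logical Include column; that format is not shown in this vignette, so check the Screen() documentation before relying on it. The name force_df is purely illustrative.

```r
library(COINr)

# Same example subset as above
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]

# Assumption: Force is a data frame with columns uCode and Include,
# where Include = TRUE forces inclusion and FALSE forces exclusion
force_df <- data.frame(uCode = "MMR", Include = TRUE)

# Screen as before, but override the rule for MMR
l_forced <- Screen(iData, unit_screen = "byNA", dat_thresh = 0.75,
                   Force = force_df)

# Inspect which units were still removed
l_forced$RemovedUnits
```

If the assumption about Force holds, MMR should no longer appear among the removed units, while LAO and VNM are still screened out by the data availability rule.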

Coins

Screening on coins is very similar to data frames, because the coin method extracts the relevant data set, passes it to the data frame method, and then puts the output back as a new data set. This means the arguments are almost the same. The only differences are that we specify which data set to screen, the name to give the new data set, and whether to output a coin or a list.

We’ll build the example coin, then screen the raw data set with a threshold of 85% data availability and also name the new data set something different rather than “Screened” (the default):

# build example coin
coin <- build_example_coin(up_to = "new_coin", quietly = TRUE)

# screen units from raw dset
coin <- Screen(coin, dset = "Raw", unit_screen = "byNA", dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc

# some details about the coin by calling its print method
coin
#> --------------
#> A coin with...
#> --------------
#> Input:
#>   Units: 51 (AUS, AUT, BEL, ...)
#>   Indicators: 49 (Goods, Services, FDI, ...)
#>   Denominators: 4 (Area, Energy, GDP, ...)
#>   Groups: 4 (GDP_group, GDPpc_group, Pop_group, ...)
#> 
#> Structure:
#>   Level 1 Indicator: 49 indicators (FDI, ForPort, Goods, ...) 
#>   Level 2 Pillar: 8 groups (ConEcFin, Instit, P2P, ...) 
#>   Level 3 Sub-index: 2 groups (Conn, Sust) 
#>   Level 4 Index: 1 groups (Index) 
#> 
#> Data sets:
#>   Raw (51 units)
#>   Filtered_85pc (48 units)

The printed summary shows that the new data set only has 48 units, compared to the raw data set with 51. We can find which units were filtered because this is stored in the coin’s “Analysis” sub-list:

coin$Analysis$Filtered_85pc$RemovedUnits
#> [1] "BRN" "LAO" "MMR"

The Analysis sub-list also contains the data availability table that is output by Screen(). As with the data frame method, we can also choose to screen units by the presence of zeros, or by a combination of zeros and missing values.
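
To work with the screened data directly, it can be pulled out of the coin. The sketch below assumes get_dset() is used with its default arguments, and it lists the names stored under the Analysis entry rather than assuming what they are called. The object name df_filtered is illustrative.

```r
library(COINr)

# Rebuild and screen the example coin as in the text above
coin <- build_example_coin(up_to = "new_coin", quietly = TRUE)
coin <- Screen(coin, dset = "Raw", unit_screen = "byNA",
               dat_thresh = 0.85, write_to = "Filtered_85pc")

# Retrieve the screened data set as a data frame
df_filtered <- get_dset(coin, dset = "Filtered_85pc")
nrow(df_filtered)

# See what Screen() stored alongside the new data set
names(coin$Analysis$Filtered_85pc)
```

The row count should agree with the 48 units reported by the coin's print method.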

Purses

For completeness, we also demonstrate the purse method. Like most purse methods, it simply applies the coin method to each coin in the purse, without any special features. Here, we repeat the example from the coin section, but on a purse of coins:

# build example purse
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)

# screen units in all coins to 85% data availability
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA",
                dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
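
Since each coin in the purse is screened separately, the removed units may differ from one coin to the next. The sketch below collects them for inspection. It assumes the purse is a data frame with a list column named coin and a Time column, which is not shown in this vignette, and the name removed_by_coin is illustrative.

```r
library(COINr)

# Rebuild and screen the example purse as in the text above
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA",
                dat_thresh = 0.85, write_to = "Filtered_85pc")

# Assumption: purse$coin is a list of coins and purse$Time labels them
removed_by_coin <- lapply(purse$coin, function(coin) {
  coin$Analysis$Filtered_85pc$RemovedUnits
})
names(removed_by_coin) <- purse$Time

# One entry per coin, each a character vector of removed unit codes
removed_by_coin
```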