Unit screening filters units according to data availability rules. Just as with indicators (columns), when a unit (row) has very few data points available, it may make sense to remove it. This avoids drawing conclusions about units on the basis of very little data. Removing such units will typically also raise the percentage data availability of each indicator.
The COINr function Screen() is a generic function with methods for data frames, coins and purses. It is a building function in that it creates a new data set in .$Data as its output.
Data frames
We begin with data frames. Let’s take a subset of the inbuilt example data for demonstration. I cherry-pick some rows and columns which have some missing values.
library(COINr)
# example data
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]
iData
#> uCode Research Pat CultServ CultGood
#> 40 KOR 20437 249.8 1.79800 NA
#> 41 LAO 175 NA NA NA
#> 42 MYS 8080 64.2 1.15292 7.555
#> 43 MNG 293 0.3 0.00266 0.046
#> 44 MMR 299 NA 0.08905 NA
#> 45 NZL 7731 46.5 0.34615 1.213
#> 46 PAK 7122 7.2 0.03553 1.256
#> 47 PHL 1361 11.3 0.29555 3.185
#> 48 RUS 16182 141.5 1.44633 8.379
#> 49 SGP 11411 270.5 0.92780 14.507
#> 50 THA 5317 53.6 0.08969 6.661
#> 51 VNM 3618 NA NA NA
The data has four indicators, plus an identifier column “uCode”. Looking at each unit, the data availability is variable. We have 12 units in total.
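To make the notion of per-unit data availability concrete, the following base R sketch computes the fraction of non-NA indicator values for each unit and flags units below an illustrative cutoff of 0.75. The data frame is retyped by hand from the printout above so that the sketch runs without COINr. The names iData_manual, avail and low_units are made up for this illustration, and this is not how COINr computes the figures internally.

```r
# Rebuild the example rows by hand (values copied from the printout above)
iData_manual <- data.frame(
  uCode    = c("KOR", "LAO", "MYS", "MNG", "MMR", "NZL",
               "PAK", "PHL", "RUS", "SGP", "THA", "VNM"),
  Research = c(20437, 175, 8080, 293, 299, 7731,
               7122, 1361, 16182, 11411, 5317, 3618),
  Pat      = c(249.8, NA, 64.2, 0.3, NA, 46.5,
               7.2, 11.3, 141.5, 270.5, 53.6, NA),
  CultServ = c(1.79800, NA, 1.15292, 0.00266, 0.08905, 0.34615,
               0.03553, 0.29555, 1.44633, 0.92780, 0.08969, NA),
  CultGood = c(NA, NA, 7.555, 0.046, NA, 1.213,
               1.256, 3.185, 8.379, 14.507, 6.661, NA)
)

# Fraction of non-NA indicator values per unit (identifier column excluded)
avail <- rowMeans(!is.na(iData_manual[, -1]))
names(avail) <- iData_manual$uCode

# Units whose availability is strictly below the illustrative cutoff
low_units <- names(avail)[avail < 0.75]
low_units
#> [1] "LAO" "MMR" "VNM"
```

Note that KOR has exactly 75% availability (one missing value out of four), so a strict "less than" comparison keeps it.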
Now let’s use Screen() to screen out some of these units. Specifically, we will remove any units that have less than 75% data availability, which means a unit is kept only if at least 3 of its 4 indicators have non-NA values:
l_scr <- Screen(iData, unit_screen = "byNA", dat_thresh = 0.75)
The output of Screen() is a list:
str(l_scr, max.level = 1)
#> List of 3
#> $ ScreenedData:'data.frame': 9 obs. of 5 variables:
#> $ DataSummary :'data.frame': 12 obs. of 10 variables:
#> $ RemovedUnits: chr [1:3] "LAO" "MMR" "VNM"
The “RemovedUnits” entry already shows that three units were removed based on our specifications. We now have our new screened data set:
l_scr$ScreenedData
#> uCode Research Pat CultServ CultGood
#> 40 KOR 20437 249.8 1.79800 NA
#> 42 MYS 8080 64.2 1.15292 7.555
#> 43 MNG 293 0.3 0.00266 0.046
#> 45 NZL 7731 46.5 0.34615 1.213
#> 46 PAK 7122 7.2 0.03553 1.256
#> 47 PHL 1361 11.3 0.29555 3.185
#> 48 RUS 16182 141.5 1.44633 8.379
#> 49 SGP 11411 270.5 0.92780 14.507
#> 50 THA 5317 53.6 0.08969 6.661
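We also have a summary table which records, for each unit, the number of missing and zero values, the resulting data availability, and whether the unit was included: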
head(l_scr$DataSummary)
#> uCode N_missing N_zero N_miss_or_zero Dat_Avail Non_Zero LowData LowNonZero
#> 40 KOR 1 0 1 0.75 1 FALSE FALSE
#> 41 LAO 3 0 3 0.25 1 TRUE FALSE
#> 42 MYS 0 0 0 1.00 1 FALSE FALSE
#> 43 MNG 0 0 0 1.00 1 FALSE FALSE
#> 44 MMR 2 0 2 0.50 1 TRUE FALSE
#> 45 NZL 0 0 0 1.00 1 FALSE FALSE
#> LowDatOrZeroFlag Included
#> 40 FALSE TRUE
#> 41 TRUE FALSE
#> 42 FALSE TRUE
#> 43 FALSE TRUE
#> 44 TRUE FALSE
#> 45 FALSE TRUE
The DataSummary table is in fact generated by get_data_avail(). Some more details can be found in the Analysis vignette.
Other than data availability, units can also be screened based on the presence of zeros, or on both. This is specified by the unit_screen argument. Use the Force argument to override the screening rules for specified units if required (either to force inclusion or to force exclusion).
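As a hedged sketch of overriding the rules, the example below forces LAO to be kept despite its low data availability and forces KOR to be removed despite passing the threshold. It assumes that Force takes a data frame with a uCode column and a logical Include column, which is my reading of the function documentation. Check ?Screen for the exact specification before relying on it. The names force_df and l_forced are illustrative.

```r
library(COINr)

# same example data as above
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]

# ASSUMPTION: Force is a data frame with columns "uCode" and "Include"
# (TRUE = always keep the unit, FALSE = always remove it)
force_df <- data.frame(uCode = c("LAO", "KOR"), Include = c(TRUE, FALSE))

# screen by data availability, then apply the overrides
l_forced <- Screen(iData, unit_screen = "byNA", dat_thresh = 0.75, Force = force_df)

# inspect which units were removed after the overrides
l_forced$RemovedUnits
```

Forcing units in or out is best kept to documented exceptions, since it bypasses the data availability rule for those units.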
Coins
Screening on coins is very similar to screening data frames, because the coin method extracts the relevant data set, passes it to the data frame method, and then puts the output back as a new data set. This means the arguments are almost the same. The only differences are that we also specify which data set to screen, the name to give the new data set, and whether to output a coin or a list.
We’ll build the example coin, then screen the raw data set with a threshold of 85% data availability and also name the new data set something different rather than “Screened” (the default):
# build example coin
coin <- build_example_coin(up_to = "new_coin", quietly = TRUE)
# screen units from raw dset
coin <- Screen(coin, dset = "Raw", unit_screen = "byNA", dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc
# some details about the coin by calling its print method
coin
#> --------------
#> A coin with...
#> --------------
#> Input:
#> Units: 51 (AUS, AUT, BEL, ...)
#> Indicators: 49 (Goods, Services, FDI, ...)
#> Denominators: 4 (Area, Energy, GDP, ...)
#> Groups: 4 (GDP_group, GDPpc_group, Pop_group, ...)
#>
#> Structure:
#> Level 1 Indicator: 49 indicators (FDI, ForPort, Goods, ...)
#> Level 2 Pillar: 8 groups (ConEcFin, Instit, P2P, ...)
#> Level 3 Sub-index: 2 groups (Conn, Sust)
#> Level 4 Index: 1 groups (Index)
#>
#> Data sets:
#> Raw (51 units)
#> Filtered_85pc (48 units)
The printed summary shows that the new data set only has 48 units, compared to the raw data set with 51. We can find which units were filtered because this is stored in the coin’s “Analysis” sub-list:
coin$Analysis$Filtered_85pc$RemovedUnits
#> [1] "BRN" "LAO" "MMR"
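As a cross-check, the removed units can also be recovered by comparing the unit codes in the two data sets directly. The sketch below rebuilds the coin so that it runs on its own and uses get_dset() to extract each data set as a data frame. The names raw_df, filtered_df and dropped are illustrative.

```r
library(COINr)

# rebuild the example coin and apply the same screening as above
coin <- build_example_coin(up_to = "new_coin", quietly = TRUE)
coin <- Screen(coin, dset = "Raw", unit_screen = "byNA",
               dat_thresh = 0.85, write_to = "Filtered_85pc")

# extract both data sets as data frames (unit codes are in the uCode column)
raw_df      <- get_dset(coin, dset = "Raw")
filtered_df <- get_dset(coin, dset = "Filtered_85pc")

# units present in the raw data but absent after screening
dropped <- setdiff(raw_df$uCode, filtered_df$uCode)

# compare against the record stored in the Analysis sub-list
setequal(dropped, coin$Analysis$Filtered_85pc$RemovedUnits)
```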
The Analysis sub-list also contains the data availability table that is output by Screen(). As with the data frame method, we can also choose to screen units by the presence of zeros, or by a combination of zeros and missing values.
Purses
For completeness we also demonstrate the purse method. Like most purse methods, this simply applies the coin method to each coin in the purse, without any special features. Here, we perform the same example as in the coin section, but on a purse of coins:
# build example purse
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)
# screen units in all coins to 85% data availability
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA",
                dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
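Since the purse method applies the coin method to each coin, the removed units for every coin should be stored in the same place as in the single-coin case. The sketch below collects them into one list. It assumes the purse is a data frame with a Time column and a list column named coin. The name removed_by_time is illustrative.

```r
library(COINr)

# rebuild the example purse and apply the same screening as above
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA",
                dat_thresh = 0.85, write_to = "Filtered_85pc")

# ASSUMPTION: purse$coin is a list of coins and purse$Time labels each one
removed_by_time <- lapply(purse$coin, function(x) {
  x$Analysis$Filtered_85pc$RemovedUnits
})
names(removed_by_time) <- purse$Time

# one entry per coin, each holding the unit codes removed from that coin
removed_by_time
```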