Chapter 15 Helper functions
15.1 R Interfaces
15.1.1 Data import
COINr has a couple of useful functions to help import and export data and metadata. You might have heard of the COIN Tool which is an Excel-based tool for building and analysing composite indicators, similar in fact to COINr6. With the coinToolIn()
function you can import data directly from the COIN Tool to cross check or extend your analysis in COINr.
To demonstrate, we can take the example version of the COIN Tool, which you can download here. Then it’s as simple as running:
library(COINr6)
# This is the file path and name where the COIN Tool is downloaded to
# You could also just put it in your project directory.
<- "C:/Users/becke/Downloads/COIN_Tool_v1_LITE_exampledata.xlsm"
fname
<- COINToolIn(fname)
dflist ## Imported 15 indicators and 28 units.
The output of this function is a list with the three data frame inputs to assemble()
.
dflist## $IndData
## # A tibble: 28 x 17
## UnitName UnitCode ind.01 ind.02 ind.03 ind.04 ind.05 ind.06 ind.07 ind.08
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austria AT 13.3 80.4 492. 14.9 68.8 34 3.7 87.6
## 2 Belgium BE 15.1 71.8 503. 7 59.6 24 5.2 81.2
## 3 Bulgaria BG 12.3 78.1 440. 2.2 51.3 15 10.6 72
## 4 Cyprus CY 14 76 438. 6.9 16.7 23 3.6 73.4
## 5 Czech Repub~ CZ 13.5 87.6 491. 8.8 73.2 27 3.9 86.7
## 6 Germany DE 9.7 80.2 508. 8.5 46.3 30 5.6 90.1
## 7 Denmark DK 9.7 73 504. 27.7 40.6 39 3.8 83.9
## 8 Estonia EE 8.6 83.3 524. 15.7 35.7 37 4.1 77.1
## 9 Greece EL 11.8 70 458. 4 31.5 30 3.9 49.2
## 10 Spain ES 14.9 57.4 491. 9.4 34.8 33 11.4 68
## # ... with 18 more rows, and 7 more variables: ind.09 <dbl>, ind.10 <dbl>,
## # ind.11 <dbl>, ind.12 <dbl>, ind.13 <dbl>, ind.14 <dbl>, ind.15 <dbl>
##
## $IndMeta
## # A tibble: 15 x 9
## IndCode IndName GPupper GPlower Direction IndWeight Agg1 Agg2 Agg3
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 ind.01 Pre-primary pu~ NA NA -1 0.4 sp.01 p.01 Index
## 2 ind.02 Share of popul~ NA NA 1 0.3 sp.01 p.01 Index
## 3 ind.03 Reading, maths~ NA NA 1 0.3 sp.01 p.01 Index
## 4 ind.04 Recent training NA NA 1 0.3 sp.02 p.01 Index
## 5 ind.05 VET students NA NA 1 0.35 sp.02 p.01 Index
## 6 ind.06 High computer ~ NA NA 1 0.35 sp.02 p.01 Index
## 7 ind.07 Early leavers ~ NA NA -1 0.7 sp.03 p.02 Index
## 8 ind.08 Recent graduat~ NA NA 1 0.3 sp.03 p.02 Index
## 9 ind.09 Activity rate ~ NA NA 1 0.5 sp.04 p.02 Index
## 10 ind.10 Activity rate ~ NA NA 1 0.5 sp.04 p.02 Index
## 11 ind.11 Long-term unem~ NA NA -1 0.4 sp.05 p.03 Index
## 12 ind.12 Underemployed ~ NA NA -1 0.6 sp.05 p.03 Index
## 13 ind.13 Higher educati~ NA NA -1 0.4 sp.06 p.03 Index
## 14 ind.14 ISCED 5-8 prop~ NA NA -1 0.1 sp.06 p.03 Index
## 15 ind.15 Qualification ~ NA NA -1 0.5 sp.06 p.03 Index
##
## $AggMeta
## # A tibble: 10 x 4
## AgLevel Code Name Weight
## <dbl> <chr> <chr> <dbl>
## 1 4 Index European Skills Index 1
## 2 3 p.01 Skills Development 0.3
## 3 3 p.02 Skills Activation 0.3
## 4 3 p.03 Skills Matching 0.4
## 5 2 sp.01 Compulsory education 0.5
## 6 2 sp.02 Training and tertiary education 0.5
## 7 2 sp.03 Transition to work 0.5
## 8 2 sp.04 Activity rates 0.5
## 9 2 sp.05 Unemployment 0.4
## 10 2 sp.06 Skills mismatch 0.6
Because the COIN Tool uses numeric codes for indicators such as ind.01
, you might want slightly more informative codes. The best way to do this is to name the codes yourself, but a quick solution is to set makecodes = TRUE
in COINToolIn()
. This generates short codes based on the indicator names. It will not yield perfect results, but for a quick analysis it might be sufficient. At least, you could use this and then modify the results by hand.
<- COINToolIn(fname, makecodes = TRUE)
dflist ## Imported 15 indicators and 28 units.
$IndMeta$IndCode
dflist## [1] "Pre-Pupi" "SharPopu" "ReadMath" "ReceTrai" "Stud"
## [6] "HighComp" "EarlLeav" "ReceGrad" "ActiRate" "ActiRate_1"
## [11] "LongUnem" "UndePart" "HighEduc" "Isce5-Pr" "QualMism"
While the codes could certainly be improved, it’s a lot better than uninformative numbers. Finally, we can assemble the output into a COIN and begin the construction.
<- assemble(dflist[[1]],dflist[[2]],dflist[[3]],)
ESI ## -----------------
## No denominators detected.
## -----------------
## -----------------
## Indicator codes cross-checked and OK.
## -----------------
## Number of indicators = 15
## Number of units = 28
## Number of aggregation levels = 3 above indicator level.
## -----------------
## Aggregation level 1 with 6 aggregate groups: sp.01, sp.02, sp.03, sp.04, sp.05, sp.06
## Cross-check between metadata and framework = OK.
## Aggregation level 2 with 3 aggregate groups: p.01, p.02, p.03
## Cross-check between metadata and framework = OK.
## Aggregation level 3 with 1 aggregate groups: Index
## Cross-check between metadata and framework = OK.
## -----------------
15.1.2 Export to Excel
Trigger warning for R purists! Sometimes it’s easier to look at your data in Excel. There, I said it. R is great for doing all kinds of complicated tasks, but if you just want to look at big tables of numbers and play around with them, maybe make a few quick graphs, then Excel is a great tool.
Actually Excel is kind of underrated by many people who are used to programming in R or Python, Matlab or even Stata. It has a lot of clever tools that not many people know about. But more importantly, Excel is a lingua franca between all kinds of professions - you can pass an Excel spreadsheet to almost anyone and they will be able to take a look at it and use the data. Try doing that with an R or Python script.
It just boils down to using the right tool for the right job. Anyway, with that aside, let’s look at COINr’s coin2Excel()
function. You put in your COIN, and it will write a spreadsheet.
# Build ASEM index
<- build_ASEM()
ASEM # Get some statistics
<- getStats(ASEM, dset = "Raw")
ASEM # write to Excel
coin2Excel(ASEM, fname = "ASEMresults.xlsx")
The spreadsheet will contain a number of tabs, including:
- The indicator data, metadata and aggregation metadata that was input to COINr
- All data sets in the
.$Data
folder, e.g. raw, treated, normalised, aggregated, etc. - (almost) All data frames found in the
.$Analysis
folder, i.e. statistics tables, outlier flags, correlation tables.
15.2 Selecting data sets and indicators
The getIn()
function is widely used by many COINr
functions. It is used for selecting specific data sets, and returning subsets of indicators. While some of this can be achieved fairly easily with base R, or dplyr::select()
, subsetting in a hierarchical context can be more awkward. That’s where getIn()
steps in to help.
Although it was made to be used internally, it might also help in other contexts. Note that this can work on COINs or data frames, but is most useful with COINs.
Let’s take some examples. First, we can get a whole data set. getIn()
will retrieve any of the data sets in the .$Data
folder, as well as the denominators.
# Build data set first, if not already done
<- build_ASEM()
ASEM ## -----------------
## Denominators detected - stored in .$Input$Denominators
## -----------------
## -----------------
## Indicator codes cross-checked and OK.
## -----------------
## Number of indicators = 49
## Number of units = 51
## Number of aggregation levels = 3 above indicator level.
## -----------------
## Aggregation level 1 with 8 aggregate groups: Physical, ConEcFin, Political, Instit, P2P, Environ, Social, SusEcFin
## Cross-check between metadata and framework = OK.
## Aggregation level 2 with 2 aggregate groups: Conn, Sust
## Cross-check between metadata and framework = OK.
## Aggregation level 3 with 1 aggregate groups: Index
## Cross-check between metadata and framework = OK.
## -----------------
## Missing data points detected = 65
## Missing data points imputed = 65, using method = indgroup_mean
# Get raw data set
<- getIn(ASEM, dset = "Raw")
datalist $ind_data_only
datalist## # A tibble: 51 x 49
## Goods Services FDI PRemit ForPort CostImpEx Tariff TBTs TIRcon RTAs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 278. 108. 5 5.07 809. 0 1.6 1144 1 30
## 2 598. 216. 5.71 13.4 1574. 0 1.6 1348 1 30
## 3 42.8 13.0 1.35 1.04 15.5 52 1.6 1140 1 30
## 4 28.4 17.4 0.387 1.56 16.9 0 1.6 1179 1 30
## 5 8.77 15.2 1.23 0.477 40.8 100 1.6 1141 1 30
## 6 274. 43.5 3.88 4.69 108. 0 1.6 1456 1 30
## 7 147. 114. 9.1 2.19 1021. 0 1.6 1393 1 30
## 8 28.2 10.2 0.58 0.589 17.4 0 1.6 1153 1 30
## 9 102. 53.8 6.03 1.51 748. 70 1.6 1215 1 30
## 10 849. 471. 30.9 30.2 6745. 0 1.6 1385 1 30
## # ... with 41 more rows, and 39 more variables: Visa <dbl>, StMob <dbl>,
## # Research <dbl>, Pat <dbl>, CultServ <dbl>, CultGood <dbl>, Tourist <dbl>,
## # MigStock <dbl>, Lang <dbl>, LPI <dbl>, Flights <dbl>, Ship <dbl>,
## # Bord <dbl>, Elec <dbl>, Gas <dbl>, ConSpeed <dbl>, Cov4G <dbl>, Embs <dbl>,
## # IGOs <dbl>, UNVote <dbl>, Renew <dbl>, PrimEner <dbl>, CO2 <dbl>,
## # MatCon <dbl>, Forest <dbl>, Poverty <dbl>, Palma <dbl>, TertGrad <dbl>,
## # FreePress <dbl>, TolMin <dbl>, NGOs <dbl>, CPI <dbl>, FemLab <dbl>, ...
The output, here datalist
is a list containing the full data set .$ind_data
, the data set .$ind_data_only
only including numerical (indicator) columns, as well as unit codes, indicator codes and names, and the object type.
More usefully, we can get specific indicators:
# Get raw data set
<- getIn(ASEM, dset = "Raw", icodes = c("Flights", "LPI"))
datalist $ind_data_only
datalist## # A tibble: 51 x 2
## Flights LPI
## <dbl> <dbl>
## 1 29.0 4.10
## 2 31.9 4.11
## 3 9.24 2.81
## 4 9.25 3.16
## 5 8.75 3.00
## 6 15.3 3.67
## 7 32.8 3.82
## 8 3.13 3.36
## 9 18.9 3.92
## 10 97.6 3.90
## # ... with 41 more rows
More usefully still, we can get groups of indicators based on their groupings. For example, we can ask for indicators that belong to the “Political” group:
# Get raw data set
<- getIn(ASEM, dset = "Raw", icodes = "Political", aglev = 1)
datalist $ind_data_only
datalist## # A tibble: 51 x 3
## Embs IGOs UNVote
## <dbl> <dbl> <dbl>
## 1 88 227 42.6
## 2 84 248 43.0
## 3 67 209 43.0
## 4 62 197 42.7
## 5 43 172 42.3
## 6 84 201 42.2
## 7 77 259 42.8
## 8 46 194 42.9
## 9 74 269 42.7
## 10 100 329 40.4
## # ... with 41 more rows
To do this, you have to specify the aglev
argument, which specifies which level to retrieve the indicators from. Before the data set is aggregated, you can anyway only select the indicators, but for the aggregated data set, the situation is more complex. To illustrate, we can call the Connectivity sub-index, first asking for all indicators:
# Get raw data set
<- getIn(ASEM, dset = "Aggregated", icodes = "Conn", aglev = 1)
datalist $ind_data_only
datalist## # A tibble: 51 x 30
## Goods Services FDI PRemit ForPort CostImpEx Tariff TBTs TIRcon RTAs Visa
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40.7 33.0 6.24 16.8 34.2 100 79.2 34.8 100 64.4 86.8
## 2 80.1 58.7 5.79 40.9 55.7 100 79.2 23.2 100 64.4 86.8
## 3 47.1 28.4 15.8 26.9 4.82 91.3 79.2 35.0 100 64.4 85.7
## 4 29.7 41.6 2.26 43.9 5.45 100 79.2 32.8 100 64.4 84.6
## 5 21.6 100 43.1 33.6 33.6 83.2 79.2 35.0 100 64.4 85.7
## 6 88.9 25.5 11.6 33.9 9.12 100 79.2 17.0 100 64.4 85.7
## 7 24.4 46.0 19.0 7.73 55.1 100 79.2 20.6 100 64.4 87.9
## 8 75.4 55.4 15.4 35.8 12.3 100 79.2 34.3 100 64.4 85.7
## 9 20.8 25.9 15.7 6.50 51.9 88.2 79.2 30.7 100 64.4 87.9
## 10 15.1 21.1 6.04 15.7 45.3 100 79.2 21.0 100 64.4 87.9
## # ... with 41 more rows, and 19 more variables: StMob <dbl>, Research <dbl>,
## # Pat <dbl>, CultServ <dbl>, CultGood <dbl>, Tourist <dbl>, MigStock <dbl>,
## # Lang <dbl>, LPI <dbl>, Flights <dbl>, Ship <dbl>, Bord <dbl>, Elec <dbl>,
## # Gas <dbl>, ConSpeed <dbl>, Cov4G <dbl>, Embs <dbl>, IGOs <dbl>,
## # UNVote <dbl>
Or we can call all the pillars belonging to “Connectivity”, i.e. the level below:
# Get raw data set
<- getIn(ASEM, dset = "Aggregated", icodes = "Conn", aglev = 2)
datalist $ind_data_only
datalist## # A tibble: 51 x 5
## ConEcFin Instit P2P Physical Political
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 26.2 77.5 54.1 41.1 78.2
## 2 48.2 75.6 43.3 72.0 80.8
## 3 24.6 75.9 27.1 28.4 67.5
## 4 24.6 76.8 39.1 47.6 62.6
## 5 46.4 74.6 59.8 29.5 48.7
## 6 33.8 74.4 39.6 41.5 71.0
## 7 30.4 75.4 51.0 50.5 78.1
## 8 38.9 77.3 53.5 42.5 55.4
## 9 24.1 75.1 39.1 44.5 77.9
## 10 20.6 75.4 28.4 42.4 87.6
## # ... with 41 more rows
Finally, if we want “Conn” itself, we can just call it directly with no aglev
specified.
# Get raw data set
<- getIn(ASEM, dset = "Aggregated", icodes = "Conn")
datalist $ind_data_only
datalist## # A tibble: 51 x 1
## Conn
## <dbl>
## 1 55.4
## 2 64.0
## 3 44.7
## 4 50.1
## 5 51.8
## 6 52.1
## 7 57.1
## 8 53.5
## 9 52.1
## 10 50.9
## # ... with 41 more rows
We can also use getIn()
with data frames, and it will behave in more or less the same way, except a data frame has no information about the structure of the index. Here, getIn()
returns what it can, and arguments like dset
and aglev
are ignored.
# use the ASEM indicator data frame directly
<- getIn(ASEMIndData, icodes = c("LPI", "Goods"))
datalist $ind_data_only
datalist## # A tibble: 51 x 2
## LPI Goods
## <dbl> <dbl>
## 1 4.10 278.
## 2 4.11 598.
## 3 2.81 42.8
## 4 3.16 28.4
## 5 3.00 8.77
## 6 3.67 274.
## 7 3.82 147.
## 8 3.36 28.2
## 9 3.92 102.
## 10 3.90 849.
## # ... with 41 more rows
15.3 Data frame tools
The roundDF()
function is a small helper function for rounding data frames that contain a mix of numeric and non-numeric columns. This is very handy for presenting tables generated by COINr in documents.
# use the ASEM indicator data frame directly
|>
ASEMIndData roundDF(decimals = 3) |>
::reactable(defaultPageSize = 5, highlight = TRUE, wrap = F) reactable
By default, numbers are rounded to two decimal places.
A similar function, rankDF()
, converts a data frame of numbers/scores to ranks, ignoring any non-numeric columns.
c("UnitCode", "LPI", "Goods")] |>
ASEMIndData[rankDF() |>
head(10)
## UnitCode LPI Goods
## 1 AUT 7 18
## 2 BEL 6 8
## 3 BGR 44 39
## 4 HRV 35 43
## 5 CYP 37 49
## 6 CZE 20 20
## 7 DNK 13 25
## 8 EST 29 45
## 9 FIN 11 32
## 10 FRA 12 3
Note that there are different ways to rank values, depending on how you deal with ties. The rankDF()
function uses so-called “sport” ranking, where ties are assigned the highest rank value. See ?rank
for some information on this.
In-group ranking is also possible using rankDF()
by specifying a column of the data frame to use as a grouping variable.
c("UnitCode", "Group_GDP", "Goods", "LPI")] |>
ASEMIndData[rankDF(use_group = "Group_GDP") |>
head(10)
## # A tibble: 10 x 4
## UnitCode Group_GDP Goods LPI
## <chr> <chr> <dbl> <dbl>
## 1 AUT L 7 5
## 2 BEL L 2 4
## 3 BGR S 2 10
## 4 HRV S 5 6
## 5 CYP S 11 8
## 6 CZE M 1 2
## 7 DNK L 10 7
## 8 EST S 7 3
## 9 FIN M 7 1
## 10 FRA XL 3 4
A final helpful function is compareDF()
, which gives a detailed comparison of two similar data frames. This is particularly useful when comparing results from one calculation to another. For example, let’s say you have built a composite indicator in Excel, but want to replicate it in COINr, to use some of the COINr tools. But having built it, you want to be sure that the results are exactly the same. From experience, it is actually quite common for differences to appear, which could be due to calculation errors, but also sometimes due to differences in the way that Excel calculates certain things. Anyway, how can you check that the two sets of results are the same, and if not, what are the differences?
The compareDF()
function is intended to compare two data frames that have the same variables and rows, but possibly in a different order. Moreover it allows to compare at a specified number of significant figures and will tell you specifically what the differences are. This is probably best clarified with an example.
# take a sample of indicator data (including the UnitCode column)
<- ASEMIndData[c(2,12:15)]
data1 # copy the data
<- data1
data2 # make a change: replace one value in data2 by NA
1,2] <- NA
data2[# compare data frames
compareDF(data1, data2, matchcol = "UnitCode")
## $Same
## [1] FALSE
##
## $Details
## Column TheSame Comment NDifferent
## 1 UnitCode TRUE Non-numerical and identical 0
## 2 LPI FALSE Numerical and different at 5 sf. 1
## 3 Flights TRUE Numerical and identical to 5 sf. 0
## 4 Ship TRUE Numerical and identical to 5 sf. 0
## 5 Bord TRUE Numerical and identical to 5 sf. 0
##
## $Differences
## $Differences$LPI
## UnitCode df1 df2
## 1 AUT 4.097985 NA
The compareDF()
function requires a column to be specified which is used to match rows. It will then check:
- Whether the two data frames are the same size
- Whether the column names are the same, and that the matching column has the same entries
- Column by column that the elements are the same, after sorting according to the matching column
The output summarises whether the two data frames are the same (TRUE
/FALSE
), which columns are different and how, and finally identifies specific data points which are different and returns both values.
Full disclosure, I was also involved in the development of the COIN Tool↩︎