Chapter 15 Helper functions

15.1 R Interfaces

15.1.1 Data import

COINr has a couple of useful functions to help import and export data and metadata. You might have heard of the COIN Tool which is an Excel-based tool for building and analysing composite indicators, similar in fact to COINr⁶. With the coinToolIn() function you can import data directly from the COIN Tool to cross check or extend your analysis in COINr.

To demonstrate, we can take the example version of the COIN Tool, which you can download here. Then it’s as simple as running:

library(COINr6)

# This is the file path and name where the COIN Tool is downloaded to
# You could also just put it in your project directory.
fname <- "C:/Users/becke/Downloads/COIN_Tool_v1_LITE_exampledata.xlsm"

dflist <- COINToolIn(fname)
## Imported 15 indicators and 28 units.

The output of this function is a list with the three data frame inputs to assemble().

dflist
## $IndData
## # A tibble: 28 x 17
##    UnitName     UnitCode ind.01 ind.02 ind.03 ind.04 ind.05 ind.06 ind.07 ind.08
##    <chr>        <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Austria      AT         13.3   80.4   492.   14.9   68.8     34    3.7   87.6
##  2 Belgium      BE         15.1   71.8   503.    7     59.6     24    5.2   81.2
##  3 Bulgaria     BG         12.3   78.1   440.    2.2   51.3     15   10.6   72  
##  4 Cyprus       CY         14     76     438.    6.9   16.7     23    3.6   73.4
##  5 Czech Repub~ CZ         13.5   87.6   491.    8.8   73.2     27    3.9   86.7
##  6 Germany      DE          9.7   80.2   508.    8.5   46.3     30    5.6   90.1
##  7 Denmark      DK          9.7   73     504.   27.7   40.6     39    3.8   83.9
##  8 Estonia      EE          8.6   83.3   524.   15.7   35.7     37    4.1   77.1
##  9 Greece       EL         11.8   70     458.    4     31.5     30    3.9   49.2
## 10 Spain        ES         14.9   57.4   491.    9.4   34.8     33   11.4   68  
## # ... with 18 more rows, and 7 more variables: ind.09 <dbl>, ind.10 <dbl>,
## #   ind.11 <dbl>, ind.12 <dbl>, ind.13 <dbl>, ind.14 <dbl>, ind.15 <dbl>
## 
## $IndMeta
## # A tibble: 15 x 9
##    IndCode IndName         GPupper GPlower Direction IndWeight Agg1  Agg2  Agg3 
##    <chr>   <chr>             <dbl>   <dbl>     <dbl>     <dbl> <chr> <chr> <chr>
##  1 ind.01  Pre-primary pu~      NA      NA        -1      0.4  sp.01 p.01  Index
##  2 ind.02  Share of popul~      NA      NA         1      0.3  sp.01 p.01  Index
##  3 ind.03  Reading, maths~      NA      NA         1      0.3  sp.01 p.01  Index
##  4 ind.04  Recent training      NA      NA         1      0.3  sp.02 p.01  Index
##  5 ind.05  VET students         NA      NA         1      0.35 sp.02 p.01  Index
##  6 ind.06  High computer ~      NA      NA         1      0.35 sp.02 p.01  Index
##  7 ind.07  Early leavers ~      NA      NA        -1      0.7  sp.03 p.02  Index
##  8 ind.08  Recent graduat~      NA      NA         1      0.3  sp.03 p.02  Index
##  9 ind.09  Activity rate ~      NA      NA         1      0.5  sp.04 p.02  Index
## 10 ind.10  Activity rate ~      NA      NA         1      0.5  sp.04 p.02  Index
## 11 ind.11  Long-term unem~      NA      NA        -1      0.4  sp.05 p.03  Index
## 12 ind.12  Underemployed ~      NA      NA        -1      0.6  sp.05 p.03  Index
## 13 ind.13  Higher educati~      NA      NA        -1      0.4  sp.06 p.03  Index
## 14 ind.14  ISCED 5-8 prop~      NA      NA        -1      0.1  sp.06 p.03  Index
## 15 ind.15  Qualification ~      NA      NA        -1      0.5  sp.06 p.03  Index
## 
## $AggMeta
## # A tibble: 10 x 4
##    AgLevel Code  Name                            Weight
##      <dbl> <chr> <chr>                            <dbl>
##  1       4 Index European Skills Index              1  
##  2       3 p.01  Skills Development                 0.3
##  3       3 p.02  Skills Activation                  0.3
##  4       3 p.03  Skills Matching                    0.4
##  5       2 sp.01 Compulsory education               0.5
##  6       2 sp.02 Training and tertiary education    0.5
##  7       2 sp.03 Transition to work                 0.5
##  8       2 sp.04 Activity rates                     0.5
##  9       2 sp.05 Unemployment                       0.4
## 10       2 sp.06 Skills mismatch                    0.6

Because the COIN Tool uses numeric codes for indicators such as ind.01, you might want slightly more informative codes. The best way to do this is to name the codes yourself, but a quick solution is to set makecodes = TRUE in COINToolIn(). This generates short codes based on the indicator names. It will not yield perfect results, but for a quick analysis it might be sufficient. At least, you could use this and then modify the results by hand.

dflist <- COINToolIn(fname, makecodes = TRUE)
## Imported 15 indicators and 28 units.
dflist$IndMeta$IndCode
##  [1] "Pre-Pupi"   "SharPopu"   "ReadMath"   "ReceTrai"   "Stud"      
##  [6] "HighComp"   "EarlLeav"   "ReceGrad"   "ActiRate"   "ActiRate_1"
## [11] "LongUnem"   "UndePart"   "HighEduc"   "Isce5-Pr"   "QualMism"

While the codes could certainly be improved, it’s a lot better than uninformative numbers. Finally, we can assemble the output into a COIN and begin the construction.

ESI <- assemble(dflist[[1]],dflist[[2]],dflist[[3]],)
## -----------------
## No denominators detected.
## -----------------
## -----------------
## Indicator codes cross-checked and OK.
## -----------------
## Number of indicators = 15
## Number of units = 28
## Number of aggregation levels = 3 above indicator level.
## -----------------
## Aggregation level 1 with 6 aggregate groups: sp.01, sp.02, sp.03, sp.04, sp.05, sp.06
## Cross-check between metadata and framework = OK.
## Aggregation level 2 with 3 aggregate groups: p.01, p.02, p.03
## Cross-check between metadata and framework = OK.
## Aggregation level 3 with 1 aggregate groups: Index
## Cross-check between metadata and framework = OK.
## -----------------

15.1.2 Export to Excel

Trigger warning for R purists! Sometimes it’s easier to look at your data in Excel. There, I said it. R is great for doing all kinds of complicated tasks, but if you just want to look at big tables of numbers and play around with them, maybe make a few quick graphs, then Excel is a great tool.

Actually Excel is kind of underrated by many people who are used to programming in R or Python, Matlab or even Stata. It has a lot of clever tools that not many people know about. But more importantly, Excel is a lingua franca between all kinds of professions - you can pass an Excel spreadsheet to almost anyone and they will be able to take a look at it and use the data. Try doing that with an R or Python script.

It just boils down to using the right tool for the right job. Anyway, with that aside, let’s look at COINr’s coin2Excel() function. You put in your COIN, and it will write a spreadsheet.

# Build ASEM index
ASEM <- build_ASEM()
# Get some statistics
ASEM <- getStats(ASEM, dset = "Raw")
# write to Excel
coin2Excel(ASEM, fname = "ASEMresults.xlsx")

The spreadsheet will contain a number of tabs, including:

The indicator data, metadata and aggregation metadata that was input to COINr
All data sets in the .$Data folder, e.g. raw, treated, normalised, aggregated, etc.
(almost) All data frames found in the .$Analysis folder, i.e. statistics tables, outlier flags, correlation tables.

15.2 Selecting data sets and indicators

The getIn() function is widely used by many COINr functions. It is used for selecting specific data sets, and returning subsets of indicators. While some of this can be achieved fairly easily with base R, or dplyr::select(), subsetting in a hierarchical context can be more awkward. That’s where getIn() steps in to help.

Although it was made to be used internally, it might also help in other contexts. Note that this can work on COINs or data frames, but is most useful with COINs.

Let’s take some examples. First, we can get a whole data set. getIn() will retrieve any of the data sets in the .$Data folder, as well as the denominators.

# Build data set first, if not already done
ASEM <- build_ASEM()
## -----------------
## Denominators detected - stored in .$Input$Denominators
## -----------------
## -----------------
## Indicator codes cross-checked and OK.
## -----------------
## Number of indicators = 49
## Number of units = 51
## Number of aggregation levels = 3 above indicator level.
## -----------------
## Aggregation level 1 with 8 aggregate groups: Physical, ConEcFin, Political, Instit, P2P, Environ, Social, SusEcFin
## Cross-check between metadata and framework = OK.
## Aggregation level 2 with 2 aggregate groups: Conn, Sust
## Cross-check between metadata and framework = OK.
## Aggregation level 3 with 1 aggregate groups: Index
## Cross-check between metadata and framework = OK.
## -----------------
## Missing data points detected = 65
## Missing data points imputed = 65, using method = indgroup_mean

# Get raw data set
datalist <- getIn(ASEM, dset = "Raw")
datalist$ind_data_only
## # A tibble: 51 x 49
##     Goods Services    FDI PRemit ForPort CostImpEx Tariff  TBTs TIRcon  RTAs
##     <dbl>    <dbl>  <dbl>  <dbl>   <dbl>     <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 278.      108.   5      5.07    809.          0    1.6  1144      1    30
##  2 598.      216.   5.71  13.4    1574.          0    1.6  1348      1    30
##  3  42.8      13.0  1.35   1.04     15.5        52    1.6  1140      1    30
##  4  28.4      17.4  0.387  1.56     16.9         0    1.6  1179      1    30
##  5   8.77     15.2  1.23   0.477    40.8       100    1.6  1141      1    30
##  6 274.       43.5  3.88   4.69    108.          0    1.6  1456      1    30
##  7 147.      114.   9.1    2.19   1021.          0    1.6  1393      1    30
##  8  28.2      10.2  0.58   0.589    17.4         0    1.6  1153      1    30
##  9 102.       53.8  6.03   1.51    748.         70    1.6  1215      1    30
## 10 849.      471.  30.9   30.2    6745.          0    1.6  1385      1    30
## # ... with 41 more rows, and 39 more variables: Visa <dbl>, StMob <dbl>,
## #   Research <dbl>, Pat <dbl>, CultServ <dbl>, CultGood <dbl>, Tourist <dbl>,
## #   MigStock <dbl>, Lang <dbl>, LPI <dbl>, Flights <dbl>, Ship <dbl>,
## #   Bord <dbl>, Elec <dbl>, Gas <dbl>, ConSpeed <dbl>, Cov4G <dbl>, Embs <dbl>,
## #   IGOs <dbl>, UNVote <dbl>, Renew <dbl>, PrimEner <dbl>, CO2 <dbl>,
## #   MatCon <dbl>, Forest <dbl>, Poverty <dbl>, Palma <dbl>, TertGrad <dbl>,
## #   FreePress <dbl>, TolMin <dbl>, NGOs <dbl>, CPI <dbl>, FemLab <dbl>, ...

The output, here datalist is a list containing the full data set .$ind_data, the data set .$ind_data_only only including numerical (indicator) columns, as well as unit codes, indicator codes and names, and the object type.

More usefully, we can get specific indicators:

# Get raw data set
datalist <- getIn(ASEM, dset = "Raw", icodes = c("Flights", "LPI"))
datalist$ind_data_only
## # A tibble: 51 x 2
##    Flights   LPI
##      <dbl> <dbl>
##  1   29.0   4.10
##  2   31.9   4.11
##  3    9.24  2.81
##  4    9.25  3.16
##  5    8.75  3.00
##  6   15.3   3.67
##  7   32.8   3.82
##  8    3.13  3.36
##  9   18.9   3.92
## 10   97.6   3.90
## # ... with 41 more rows

More usefully still, we can get groups of indicators based on their groupings. For example, we can ask for indicators that belong to the “Political” group:

# Get raw data set
datalist <- getIn(ASEM, dset = "Raw", icodes = "Political", aglev = 1)
datalist$ind_data_only
## # A tibble: 51 x 3
##     Embs  IGOs UNVote
##    <dbl> <dbl>  <dbl>
##  1    88   227   42.6
##  2    84   248   43.0
##  3    67   209   43.0
##  4    62   197   42.7
##  5    43   172   42.3
##  6    84   201   42.2
##  7    77   259   42.8
##  8    46   194   42.9
##  9    74   269   42.7
## 10   100   329   40.4
## # ... with 41 more rows

To do this, you have to specify the aglev argument, which specifies which level to retrieve the indicators from. Before the data set is aggregated, you can anyway only select the indicators, but for the aggregated data set, the situation is more complex. To illustrate, we can call the Connectivity sub-index, first asking for all indicators:

# Get raw data set
datalist <- getIn(ASEM, dset = "Aggregated", icodes = "Conn", aglev = 1)
datalist$ind_data_only
## # A tibble: 51 x 30
##    Goods Services   FDI PRemit ForPort CostImpEx Tariff  TBTs TIRcon  RTAs  Visa
##    <dbl>    <dbl> <dbl>  <dbl>   <dbl>     <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1  40.7     33.0  6.24  16.8    34.2      100     79.2  34.8    100  64.4  86.8
##  2  80.1     58.7  5.79  40.9    55.7      100     79.2  23.2    100  64.4  86.8
##  3  47.1     28.4 15.8   26.9     4.82      91.3   79.2  35.0    100  64.4  85.7
##  4  29.7     41.6  2.26  43.9     5.45     100     79.2  32.8    100  64.4  84.6
##  5  21.6    100   43.1   33.6    33.6       83.2   79.2  35.0    100  64.4  85.7
##  6  88.9     25.5 11.6   33.9     9.12     100     79.2  17.0    100  64.4  85.7
##  7  24.4     46.0 19.0    7.73   55.1      100     79.2  20.6    100  64.4  87.9
##  8  75.4     55.4 15.4   35.8    12.3      100     79.2  34.3    100  64.4  85.7
##  9  20.8     25.9 15.7    6.50   51.9       88.2   79.2  30.7    100  64.4  87.9
## 10  15.1     21.1  6.04  15.7    45.3      100     79.2  21.0    100  64.4  87.9
## # ... with 41 more rows, and 19 more variables: StMob <dbl>, Research <dbl>,
## #   Pat <dbl>, CultServ <dbl>, CultGood <dbl>, Tourist <dbl>, MigStock <dbl>,
## #   Lang <dbl>, LPI <dbl>, Flights <dbl>, Ship <dbl>, Bord <dbl>, Elec <dbl>,
## #   Gas <dbl>, ConSpeed <dbl>, Cov4G <dbl>, Embs <dbl>, IGOs <dbl>,
## #   UNVote <dbl>

Or we can call all the pillars belonging to “Connectivity”, i.e. the level below:

# Get raw data set
datalist <- getIn(ASEM, dset = "Aggregated", icodes = "Conn", aglev = 2)
datalist$ind_data_only
## # A tibble: 51 x 5
##    ConEcFin Instit   P2P Physical Political
##       <dbl>  <dbl> <dbl>    <dbl>     <dbl>
##  1     26.2   77.5  54.1     41.1      78.2
##  2     48.2   75.6  43.3     72.0      80.8
##  3     24.6   75.9  27.1     28.4      67.5
##  4     24.6   76.8  39.1     47.6      62.6
##  5     46.4   74.6  59.8     29.5      48.7
##  6     33.8   74.4  39.6     41.5      71.0
##  7     30.4   75.4  51.0     50.5      78.1
##  8     38.9   77.3  53.5     42.5      55.4
##  9     24.1   75.1  39.1     44.5      77.9
## 10     20.6   75.4  28.4     42.4      87.6
## # ... with 41 more rows

Finally, if we want “Conn” itself, we can just call it directly with no aglev specified.

# Get raw data set
datalist <- getIn(ASEM, dset = "Aggregated", icodes = "Conn")
datalist$ind_data_only
## # A tibble: 51 x 1
##     Conn
##    <dbl>
##  1  55.4
##  2  64.0
##  3  44.7
##  4  50.1
##  5  51.8
##  6  52.1
##  7  57.1
##  8  53.5
##  9  52.1
## 10  50.9
## # ... with 41 more rows

We can also use getIn() with data frames, and it will behave in more or less the same way, except a data frame has no information about the structure of the index. Here, getIn() returns what it can, and arguments like dset and aglev are ignored.

# use the ASEM indicator data frame directly
datalist <- getIn(ASEMIndData, icodes = c("LPI", "Goods"))
datalist$ind_data_only
## # A tibble: 51 x 2
##      LPI  Goods
##    <dbl>  <dbl>
##  1  4.10 278.  
##  2  4.11 598.  
##  3  2.81  42.8 
##  4  3.16  28.4 
##  5  3.00   8.77
##  6  3.67 274.  
##  7  3.82 147.  
##  8  3.36  28.2 
##  9  3.92 102.  
## 10  3.90 849.  
## # ... with 41 more rows

15.3 Data frame tools

The roundDF() function is a small helper function for rounding data frames that contain a mix of numeric and non-numeric columns. This is very handy for presenting tables generated by COINr in documents.

# use the ASEM indicator data frame directly
ASEMIndData |>
  roundDF(decimals = 3) |>
  reactable::reactable(defaultPageSize = 5, highlight = TRUE, wrap = F)

By default, numbers are rounded to two decimal places.

A similar function, rankDF(), converts a data frame of numbers/scores to ranks, ignoring any non-numeric columns.

ASEMIndData[c("UnitCode", "LPI", "Goods")] |>
  rankDF() |>
  head(10)
##    UnitCode LPI Goods
## 1       AUT   7    18
## 2       BEL   6     8
## 3       BGR  44    39
## 4       HRV  35    43
## 5       CYP  37    49
## 6       CZE  20    20
## 7       DNK  13    25
## 8       EST  29    45
## 9       FIN  11    32
## 10      FRA  12     3

Note that there are different ways to rank values, depending on how you deal with ties. The rankDF() function uses so-called “sport” ranking, where ties are assigned the highest rank value. See ?rank for some information on this.

In-group ranking is also possible using rankDF() by specifying a column of the data frame to use as a grouping variable.

ASEMIndData[c("UnitCode", "Group_GDP", "Goods", "LPI")] |>
  rankDF(use_group = "Group_GDP") |>
  head(10)
## # A tibble: 10 x 4
##    UnitCode Group_GDP Goods   LPI
##    <chr>    <chr>     <dbl> <dbl>
##  1 AUT      L             7     5
##  2 BEL      L             2     4
##  3 BGR      S             2    10
##  4 HRV      S             5     6
##  5 CYP      S            11     8
##  6 CZE      M             1     2
##  7 DNK      L            10     7
##  8 EST      S             7     3
##  9 FIN      M             7     1
## 10 FRA      XL            3     4

A final helpful function is compareDF(), which gives a detailed comparison of two similar data frames. This is particularly useful when comparing results from one calculation to another. For example, let’s say you have built a composite indicator in Excel, but want to replicate it in COINr, to use some of the COINr tools. But having built it, you want to be sure that the results are exactly the same. From experience, it is actually quite common for differences to appear, which could be due to calculation errors, but also sometimes due to differences in the way that Excel calculates certain things. Anyway, how can you check that the two sets of results are the same, and if not, what are the differences?

The compareDF() function is intended to compare two data frames that have the same variables and rows, but possibly in a different order. Moreover it allows to compare at a specified number of significant figures and will tell you specifically what the differences are. This is probably best clarified with an example.

# take a sample of indicator data (including the UnitCode column)
data1 <- ASEMIndData[c(2,12:15)]
# copy the data
data2 <- data1
# make a change: replace one value in data2 by NA
data2[1,2] <- NA
# compare data frames
compareDF(data1, data2, matchcol = "UnitCode")
## $Same
## [1] FALSE
## 
## $Details
##     Column TheSame                          Comment NDifferent
## 1 UnitCode    TRUE      Non-numerical and identical          0
## 2      LPI   FALSE Numerical and different at 5 sf.          1
## 3  Flights    TRUE Numerical and identical to 5 sf.          0
## 4     Ship    TRUE Numerical and identical to 5 sf.          0
## 5     Bord    TRUE Numerical and identical to 5 sf.          0
## 
## $Differences
## $Differences$LPI
##   UnitCode      df1 df2
## 1      AUT 4.097985  NA

The compareDF() function requires a column to be specified which is used to match rows. It will then check:

Whether the two data frames are the same size
Whether the column names are the same, and that the matching column has the same entries
Column by column that the elements are the same, after sorting according to the matching column

The output summarises whether the two data frames are the same (TRUE/FALSE), which columns are different and how, and finally identifies specific data points which are different and returns both values.

Full disclosure, I was also involved in the development of the COIN Tool↩︎