This vignette describes how to retrieve data from a coin. The main
functions to do this are get_dset() and the more flexible get_data().
These functions are important to understand, because many COINr functions use them to retrieve data for plotting, analysis and other tasks. Both functions are generics, which means that they have methods for coins and purses.
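As a minimal sketch of the purse method, the same get_dset() call can be applied to a purse (a time-indexed collection of coins). This assumes that the example purse builder build_example_purse() and its up_to argument are available in your installed version of COINr; the object names are illustrative only.

```r
library(COINr)

# build an example purse (a collection of coins, one per time point);
# assumption: up_to = "new_coin" stops after assembling the coins,
# so only the "Raw" data set is present
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)

# same generic, but dispatched to the purse method
dset_raw_all <- get_dset(purse, dset = "Raw")

# inspect the first few rows and columns
head(dset_raw_all[1:5], 5)
```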
Data sets
Every time a “building” operation is applied to a coin, such as
Treat(), Screen(), Normalise() and so on, a new data set is created.
Data sets live in the .$Data sub-list of the coin. We can retrieve a
data set at any time using the get_dset() function:
library(COINr)
# build full example coin
coin <- build_example_coin(quietly = TRUE)
# retrieve normalised data set
dset_norm <- get_dset(coin, dset = "Normalised")
# view first few rows and cols
head(dset_norm[1:5], 5)
#> uCode LPI Flights Ship Bord
#> 1 AUS 79.96112 12.3223217 66.14497 0.00000
#> 2 AUT 94.07137 27.8763185 0.00000 42.01269
#> 3 BEL 94.56023 23.3967426 97.14314 100.00000
#> 4 BGD 27.63906 0.1243185 45.80661 10.85013
#> 5 BGR 34.29965 10.8828790 37.40495 16.34359
By default, a data set in the coin consists of indicator columns plus
the “uCode” column, which is the unique identifier of each row. You can
also ask to attach unit metadata columns, such as unit names, groups,
and anything else that was input when building the coin, using the
also_get argument:
# retrieve normalised data set, with unit names and GDP groups attached
dset_norm2 <- get_dset(coin, dset = "Normalised", also_get = c("uName", "GDP_group"))
# view first few rows and cols
head(dset_norm2[1:5], 5)
#> uCode uName GDP_group LPI Flights
#> 1 AUS Australia XL 79.96112 12.3223217
#> 2 AUT Austria L 94.07137 27.8763185
#> 3 BEL Belgium L 94.56023 23.3967426
#> 4 BGD Bangladesh M 27.63906 0.1243185
#> 5 BGR Bulgaria S 34.29965 10.8828790
Data subsets
While get_dset() is a quick way to retrieve an entire data set and
metadata, the get_data() function is a generalisation: it can return a
whole data set, but also subsets of data, based on e.g. indicator
selection and grouping (columns), as well as unit selection and
grouping (rows).
Indicators/columns
A simple example is to extract one or more named indicators from a target data set:
x <- get_data(coin, dset = "Raw", iCodes = c("Flights", "LPI"))
# see first few rows
head(x, 5)
#> uCode Flights LPI
#> 31 AUS 36.05498 3.793385
#> 1 AUT 29.01725 4.097985
#> 2 BEL 31.88546 4.108538
#> 32 BGD 4.27955 2.663902
#> 3 BGR 9.23588 2.807685
By default, get_data() returns the requested indicators, plus the uCode
identifier column. We can also set also_get = "none" to return only the
indicator columns.
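A short sketch of the also_get = "none" option described above, reusing the example coin. The object name x_vals is illustrative only.

```r
library(COINr)

# build full example coin
coin <- build_example_coin(quietly = TRUE)

# request two indicators without the uCode identifier column
x_vals <- get_data(coin, dset = "Raw", iCodes = c("Flights", "LPI"),
                   also_get = "none")

# check which columns were returned
names(x_vals)
```

This form can be convenient when the result is passed straight to a function that expects purely numeric columns.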
The iCodes argument can also accept groups of indicators, based on the
structure of the index. In our example, indicators (level 1) are
aggregated in groups into “pillars” (level 2). We can name an
aggregation group and extract the underlying indicators:
x <- get_data(coin, dset = "Raw", iCodes = "Political", Level = 1)
head(x, 5)
#> uCode Embs IGOs UNVote
#> 31 AUS 82 196 38.46245
#> 1 AUT 88 227 42.63920
#> 2 BEL 84 248 43.00308
#> 32 BGD 52 145 38.60601
#> 3 BGR 67 209 42.95986
Here we have requested all the indicators in level 1 (the indicator level) that belong to the group called “Political” (one of the pillars). Specifying the level becomes more relevant when we look at the aggregated data set, which also includes the pillar, sub-index and index scores. Here, for example, we can ask for all the pillar scores (level 2) which belong to the sustainability sub-index (level 3):
x <- get_data(coin, dset = "Aggregated", iCodes = "Sust", Level = 2)
head(x, 5)
#> uCode Environ Social SusEcFin
#> 1 AUS 31.92211 71.88108 55.69987
#> 2 AUT 69.47511 72.76415 62.88150
#> 3 BEL 53.00859 86.16783 50.09020
#> 4 BGD 81.66988 27.51138 64.58884
#> 5 BGR 55.69922 53.30489 61.68677
If this isn’t clear, look at the structure of the example index using
e.g. plot_framework(coin). If we wanted to select all the indicators
within the “Sust” sub-index we would set Level = 1. If we wanted to
select the sub-index scores themselves we would set Level = 3, and so
on.
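The two alternatives just mentioned can be sketched as follows, again using the example coin. The object names sust_inds and sust_score are illustrative only.

```r
library(COINr)

# build full example coin
coin <- build_example_coin(quietly = TRUE)

# all indicators (level 1) that sit under the "Sust" sub-index
sust_inds <- get_data(coin, dset = "Aggregated", iCodes = "Sust", Level = 1)

# the "Sust" sub-index score itself (level 3)
sust_score <- get_data(coin, dset = "Aggregated", iCodes = "Sust", Level = 3)

# compare what was returned in each case
names(sust_inds)
head(sust_score, 5)
```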
The idea of selecting indicators and aggregates based on the
structure of the index is useful in many places in COINr, for example
examining correlations within aggregation groups using plot_corr().
Units/rows
Units (rows) of the data set can also be selected (also in combination with selecting indicators). Starting with a simple example, let’s select specified units for a specific indicator:
get_data(coin, dset = "Raw", iCodes = "Goods", uCodes = c("AUT", "VNM"))
#> uCode Goods
#> 1 AUT 278.4264
#> 51 VNM 269.0766
Rows can also be subsetted using groups, i.e. unit groupings that
are defined using variables input with iMeta$Type = "Group" when
building the coin. Recall that our example coin has several groups
(a reminder that you can see some details about the coin using its
print method):
coin
#> --------------
#> A coin with...
#> --------------
#> Input:
#> Units: 51 (AUS, AUT, BEL, ...)
#> Indicators: 49 (Goods, Services, FDI, ...)
#> Denominators: 4 (Area, Energy, GDP, ...)
#> Groups: 4 (GDP_group, GDPpc_group, Pop_group, ...)
#>
#> Structure:
#> Level 1 Indicator: 49 indicators (FDI, ForPort, Goods, ...)
#> Level 2 Pillar: 8 groups (ConEcFin, Instit, P2P, ...)
#> Level 3 Sub-index: 2 groups (Conn, Sust)
#> Level 4 Index: 1 groups (Index)
#>
#> Data sets:
#> Raw (51 units)
#> Denominated (51 units)
#> Imputed (51 units)
#> Screened (51 units)
#> Treated (51 units)
#> Normalised (51 units)
#> Aggregated (51 units)
The first way to subset by unit group is to name a grouping variable, and a group within that variable to select. For example, say we want to know the values of the “Goods” indicator for all the countries in the “XL” GDP group:
get_data(coin, dset = "Raw", iCodes = "Goods", use_group = list(GDP_group = "XL"))
#> uCode GDP_group Goods
#> 1 AUS XL 288.4893
#> 8 CHN XL 1713.6190
#> 11 DEU XL 1919.1940
#> 13 ESP XL 447.1229
#> 16 FRA XL 849.3303
#> 17 GBR XL 778.9052
#> 21 IDN XL 222.4186
#> 22 IND XL 288.9806
#> 24 ITA XL 658.1981
#> 25 JPN XL 732.2078
#> 28 KOR XL 568.9920
#> 45 RUS XL 343.8504
Since we have subsetted by group, this also returns the group column which was used.
Another way of subsetting is to combine uCodes and use_group. When
these two arguments are both specified, the result is to return the
full group(s) to which the specified uCodes belong. This can be used to
put a unit in context with its peers within a group. For example, we
might want to see the values of the “Flights” indicator for a specific
unit, as well as all other units within the same population group:
get_data(coin, dset = "Raw", iCodes = "Flights", uCodes = "MLT", use_group = "Pop_group")
#> uCode Pop_group Flights
#> 6 BRN S 2.01900
#> 9 CYP S 8.75467
#> 14 EST S 3.12946
#> 19 HRV S 9.24529
#> 23 IRL S 34.17721
#> 30 LTU S 5.37919
#> 31 LUX S 4.84458
#> 32 LVA S 6.77976
#> 33 MLT S 6.75251
#> 35 MNG S 0.98951
#> 38 NOR S 25.64994
#> 39 NZL S 13.37242
#> 48 SVN S 1.51736
Here, we have to specify use_group simply as a string rather than a
list. Since MLT is in the “S” population group, it returns all units
within that group.
Overall, the idea of get_data() is to flexibly return subsets of
indicator data, based on the structure of the index and unit groups.
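As a brief sketch of combining both kinds of selection, the call below asks for the indicators in the “Political” pillar, restricted to units in the “XL” GDP group. It only reuses arguments already shown above; the object name x_pol_xl is illustrative.

```r
library(COINr)

# build full example coin
coin <- build_example_coin(quietly = TRUE)

# columns: indicators (level 1) in the "Political" pillar
# rows: units belonging to the "XL" GDP group
x_pol_xl <- get_data(coin, dset = "Raw", iCodes = "Political", Level = 1,
                     use_group = list(GDP_group = "XL"))

x_pol_xl
```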
Manual selection
As a final point, it’s worth pointing out that a coin is simply a list of R objects such as data frames, other lists, vectors and so on. It has a particular format which allows things to be easily accessed by COINr functions. But other than that, it’s an ordinary R object. This means that even without the helper functions mentioned, you can get at the data simply by exploring the coin yourself.
The data sets live in the .$Data sub-list of the coin:
names(coin$Data)
#> [1] "Raw" "Denominated" "Imputed" "Screened" "Treated"
#> [6] "Normalised" "Aggregated"
And we can access any of these directly:
data_raw <- coin$Data$Raw
head(data_raw[1:5], 5)
#> uCode LPI Flights Ship Bord
#> 31 AUS 3.793385 36.05498 14.004198 0
#> 1 AUT 4.097985 29.01725 0.000000 35
#> 2 BEL 4.108538 31.88546 20.567121 48
#> 32 BGD 2.663902 4.27955 9.698165 16
#> 3 BGR 2.807685 9.23588 7.919366 18
The metadata lives in the .$Meta sub-list. For example, the unit
metadata, which includes groups, names etc:
str(coin$Meta$Unit)
#> 'data.frame': 51 obs. of 11 variables:
#> $ uCode : chr "AUS" "AUT" "BEL" "BGD" ...
#> $ uName : chr "Australia" "Austria" "Belgium" "Bangladesh" ...
#> $ GDP_group : chr "XL" "L" "L" "M" ...
#> $ GDPpc_group : chr "XL" "XL" "L" "S" ...
#> $ Pop_group : chr "L" "M" "L" "XL" ...
#> $ EurAsia_group: chr "Asia" "Europe" "Europe" "Asia" ...
#> $ Time : num 2018 2018 2018 2018 2018 ...
#> $ Area : num 7741220 83871 30528 148460 110879 ...
#> $ Energy : num 81.3 27 41.83 27.92 9.96 ...
#> $ GDP : num 1304.5 390.8 468 220.8 53.2 ...
#> $ Population : num 24451 8735 11429 164670 7085 ...
The point is that if COINr tools don’t get you where you want to go, knowing your way around the coin allows you to access the data exactly how you want.
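As one possible illustration of manual access, the sketch below joins a raw indicator column to the unit names using only base R, without any COINr helper. The object names raw_sub, meta_sub and flights_named are illustrative only.

```r
library(COINr)

# build full example coin
coin <- build_example_coin(quietly = TRUE)

# take the identifier plus one indicator from the raw data set
raw_sub <- coin$Data$Raw[c("uCode", "Flights")]

# take the identifier plus unit names from the unit metadata
meta_sub <- coin$Meta$Unit[c("uCode", "uName")]

# join the two on the unit identifier using base R
flights_named <- merge(meta_sub, raw_sub, by = "uCode")

head(flights_named, 5)
```

In this particular case get_data() with also_get = "uName" would be the shorter route; the manual version is mainly useful as a pattern for cases the helpers do not cover.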