Introduction

Tables as places where data tables are recorded can be pretty messy. The ‘tidy’ paradigm in R proposes that data are organised so that variables are recorded in columns, observations in rows and that there is only one value per cell (Wickham 2014). This, however, is only one interpretation of how data should be organised and especially when scraping data from multiple heterogeneous sources, one frequently encounters messy data that don’t follow this paradigm.

The tidyr package is one of the most popular tools to bring data into a tidy format. However, up until today it is limited to tables that are already organised into topologically coherent (rectangular) chunks. In tabshiftr we try to describe and work with a further dimension of messiness, where data are available as so-called disorganised (messy) data, data that are not yet arranged into rectangular form.

The basics

Data can be disorganised according to multiple dimensions. To understand those dimensions, we first need to understand the nature of data. Data "of the same kind" are collected in a variable, which is always a combination of a name and the values. In a table, names are typically in the topmost row and values are in the column below that name (Tab. 1). Conceptually, there are two types of variables in any table:

  1. Variables that contain categorical levels that identify the units for which values have been observed (they are called identifying variables here).
  2. Variables that have been measured or observed and that consequently represent the target values of that measurement, be they continuous or categorical (they are called observed variables here).

Moreover, a table is part of a series when other tables of that series contain the same variables, irrespective of how the distinct tables of that series are arranged.

Table 1: An example table containing one identifying and two observed variables, with the variable names in the topmost row and the values in all consecutive rows.
identifying variable observed variable (continuous) observed variable (categorical)
sample 1 10 blue
sample 2 20 black
sample 3 30 white

Dimensions of disorganisation

Several of the following dimensions of disorganisation can co-occur in one table. Examples of each of them are presented in the section Table types.

  1. Each level of an identifying variable, or a categorical observed variable are organised into separate topologically coherent chunks (clusters) of data, instead of all variables in one cluster/table.
  2. The names of observed variables are treated as if they were the levels of an identifying variable (they are listed), instead of each observed variable in its own column. This is also called "long" data.
  3. The levels of one or more identifying variable are spread over several columns, instead of all levels in only one column. This is also called "wide" data.
  4. Several identifying variables are stored in the same column or one identifying variable is split up into several columns.
  5. A table is organised so that one or more variables are "not in one piece" (e.g., empty rows and/or columns).
  6. One or more variables are merely implicitly available (e.g., as table or file name or in a "meta-data header")
  7. Tables that are part of the same series (i.e., that contain the same variables), contain variables in different columns in each table.
  8. Some columns of a particular chunk are missplaced vertically or some rows are missplaced horizontally so that variables and observations don’t match across the whole table (not yet explicitly solved here).

Schema descriptions

The approach of tabshiftr is based on capturing the arrangement of a table in a so-called schema description. Typically there is an input and an output schema, describing the arrangement of the input and output tables, respectively. The advantage of this procedure is that input and output tables exist explicitly and the schema maps the transformation of the data. As we want to end up with tidy tables, the output schema is pre-determined by a tidy table of the included variables. The input schema description, however, needs to be provided by the user.

To set up a schema description, the following steps should be taken:

  1. Variables: Clarify which are the identifying variables and which are the observed variables. Make sure not to mistake a listed observed variable as identifying variable.

  2. Clusters: Determine whether there are clusters and if so, find the origin (top left cell) of each cluster and provide the required information in setCluster(top = ..., left = ...) (Tab. 9, Tab. 10 & Tab. 11). It is advised to treat a table that contains meta-data in the top rows as cluster, as this is often the case with implicit variables (Tab. 6). All variables need to be specified in each cluster (in case clusters are all organised in the same arrangement), or relative = TRUE can be used. Data may be organised into clusters a) whenever a set of variables occurs more than once in the same table, nested into another variable, or b) when the data are organised into separate spreadsheets or files according to one of the variables (depending on the context, these issues can also be solved differently). In both cases the variable responsible for clustering (the cluster ID) can be either an identifying variable, or a categorical observed variable:

  3. Identifying variables: Determine the following:

    • is the variable available at all? This is particularly important when the data are split up into tables that are in spreadsheets or files. Often the variable that splits up the data (and thus identifies the clusters) is not explicitly available in the table anymore. In such a case, provide the value in setIDVar(..., value = ...) (Tab. 6).
    • all columns in which the variable values sit.
    • in case the variable is in several columns, determine additionally the row in which its values sit. In this case, the values will look like they are part of the header (Tab. 4).
    • in case the variable must be split off of another column, provide a regular expression that results in the target subset via setIDVar(..., split = ...) (Tab. 5).
    • in case the variable is distinct from the main table, provide the explicit (non-relative) position and set setIDVar(..., distinct = TRUE) (Tab. 12).

  4. Observed variable: Determine the following:

    • all columns in which the values of the variable sit (Tab. 7).
    • the unit and conversion factor.
    • in case the variable is not tidy, go through the following cases one after the other:
      • in case the variable is nested in a wide identifying variable, determine in addition to the columns in which the values sit also the rows in which the variable name sits (Tab. 3 & Tab. 4).
      • in case the names of the variable are given as a value of an identifying variable (Tab. 8), give the column name as key, together with the name of the respective observed variable (as it appears in the table) in values.
      • in case the name of the variable is the ID of clusters, specify key = "cluster" and in values the cluster number the variable refers to (Tab. 11).

  5. Meta-data: Provide potentially information about the table header (setHeader()) and format (setFormat()).

Table types

In this section we look at some examples of disorganised data, discuss the dimension along which they are disorganised and show which schema description can be used to reorganise them. This is certainly not an exhaustive list of possible table arrangements, but it covers certain basic building blocks of the most common operations and should be extensible enough to capture many mutations of the presented tables.

Ideally, we read in tables while treating any header rows as data, i.e., by not setting the first row as header. This makes sense, because in disorganised tables it’s often not only the first row that is part of the header.

Table contains one cluster

Separated observed variables

In case the observed variables are arranged into individual columns (Tab. 2), we have tidy data (Wickham 2014), which are already in the target arrangement. The variables in a tidy table may however, still need different names, units and transformation factors.

Table 2: A tidy table.
X1 X2 X3 X4 X5
territories period commodities harvested production
unit 1 year 1 soybean 1111 1112
unit 1 year 1 maize 1121 1122
unit 1 year 2 soybean 1211 1212
unit 1 year 2 maize 1221 1222
unit 2 year 1 soybean 2111 2112
unit 2 year 1 maize 2121 2122
unit 2 year 2 soybean 2211 2212
unit 2 year 2 maize 2221 2222
schema <- setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", unit = "ha", columns = 4) %>%
  setObsVar(name = "production", unit = "t", columns = 5)

In case identifying variables are factors with a small number of levels, those levels may falsely be used as names of (several) other variables, where they would be next to each other and thus "wide". The other variables (both identifying and observed variables) would then be "nested" in those wide identifying variables. In those cases we have to record the columns and the row in which the values of the identifying variable (which appear to be names) are found. Beware that in those cases you also need to adapt the header specification, so that this row is excluded from the respective nested variable.

Table 3: The values variables are nested within the identifying variable commodities.
X1 X2 X3 X4 X5 X6
territories period soybean maize
harvested production harvested production
unit 1 year 1 1111 1112 1121 1122
unit 1 year 2 1211 1212 1221 1222
unit 2 year 1 2111 2112 2112 2122
unit 2 year 2 2211 2212 2121 2222
schema <- setHeader(rows = c(1, 2)) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = c(3, 5), row = 1) %>%
  setObsVar(name = "harvested", unit = "ha", columns = c(3, 5), row = 2) %>%
  setObsVar(name = "production", unit = "t", columns = c(4, 6), row = 2)

In case several variables are nested within other variables, we have to specify for all nested or nesting variables in which rows their values sit and adapt the header.

Table 4: The identifying variable commodities is nested in the identifying variable period. The target variable is spread across those nested columns.
X1 X2 X3 X4 X5 X6 X7 X8 X9
territories year 1 year 2
soybean maize soybean maize
harvested production harvested production harvested production harvested production
unit 1 1111 1112 1121 1122 1211 1212 1221 1222
unit 2 2111 2211 2121 2221 2112 2212 2122 2222
schema <- setHeader(rows = c(1:3)) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = c(2, 6), row = 1) %>%
  setIDVar(name = "commodities", columns = c(2, 4, 6, 8), row = 2) %>%
  setObsVar(name = "harvested", unit = "ha", columns = c(2, 4, 6, 8), row = 3) %>%
  setObsVar(name = "production", unit = "t", columns = c(3, 5, 7, 9), row = 3)

Several variables in one column

Sometimes it may be the case that several variables are stored in the same column, for example when a territorial unit is given together with the year of observation (e.g. unit1, year1). In those cases, the resulting variables in the output schema need to be specified so that they point to that same column but extract information via a regular expression (find more information here). For example, .+?(?=_) gives everything up until the first _ and (?<=\\_).* everything after the _.

Table 5: The variables year and commodities are stored in the same column.
X1 X2 X3 X4
territories unit harvested production
unit 1 year 1_soybean 1111 1112
unit 1 year 1_maize 1121 1122
unit 1 year 2_soybean 1211 1212
unit 1 year 2_maize 1221 1222
unit 2 year 1_soybean 2111 2112
unit 2 year 1_maize 2121 2122
unit 2 year 2_soybean 2211 2212
unit 2 year 2_maize 2221 2222
schema <- setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2, split = ".+?(?=_)") %>%
  setIDVar(name = "commodities", columns = 2, split = "(?<=\\_).*") %>%
  setObsVar(name = "harvested", unit = "ha", columns = 3) %>%
  setObsVar(name = "production", unit = "t", columns = 4)

Implicit variables

When data are split up into clusters, the identifying variable is often recorded not in the table as an explicit variable, but is only provided in the file or table name. In those cases, we have to register the variable nevertheless, and provide the value of that variable in value = of the identifying variable.

Table 6: The information about which territory we are dealing with is missing or implied by some meta-data.
X1 X2 X3 X4
some metadata, which does not tell us that this is in fact unit 1
commodities harvested production period
soybean 1111 1112 year 1
maize 1121 1122 year 1
soybean 1211 1212 year 2
maize 1221 1222 year 2
schema <- setCluster(id = "territories", top = 2, left = 1) %>%
  setHeader(rows = 1, relative = TRUE) %>%
  setIDVar(name = "territories", value = "unit 1") %>%
  setIDVar(name = "year", columns = 4, relative = TRUE) %>%
  setIDVar(name = "commodities", columns = 1, relative = TRUE) %>%
  setObsVar(name = "harvested", unit = "ha", columns = 2, relative = TRUE) %>%
  setObsVar(name = "production", unit = "t", columns = 3, relative = TRUE)

Listed observed variables

Some tables contain a column where the names of observed variables (harvested and production) are treated as if they were the values of an identifying variable (dimension), while the values are presented in only one column (values) (Tab. 7). To end up with tidy data in those cases, we need to extract the values associated with the observed variables. Thus, we define the observed variables and specify the key = in which the variable names sit, and the value = the variable name has, to extract that variable.

Table 7: The variable names of the observed variable are treated as if they were the values of the identifying variable dimension.
X1 X2 X3 X4 X5
territories period commodities dimension values
unit 1 year 1 soybean harvested 1111
unit 1 year 1 maize harvested 1121
unit 1 year 1 soybean production 1112
unit 1 year 1 maize production 1122
unit 1 year 2 soybean harvested 1211
unit 1 year 2 maize harvested 1221
unit 1 year 2 soybean production 1212
unit 1 year 2 maize production 1222
unit 2 year 1 soybean harvested 2111
unit 2 year 1 maize harvested 2121
unit 2 year 1 soybean production 2112
unit 2 year 1 maize production 2122
unit 2 year 2 soybean harvested 2211
unit 2 year 2 maize harvested 2221
unit 2 year 2 soybean production 2212
unit 2 year 2 maize production 2222
schema <- setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", unit = "ha", columns = 5,
            key = "dimension", value = "harvested") %>%
  setObsVar(name = "production", unit = "t", columns = 5,
            key = "dimension", value = "production")

Moreover, (several) identifying variables may be wide additionally and we have to proceed as mentioned above, by providing the columns and the rows of the variable values (which appear to be names).

Table 8: The identifying variable commodities is treated as if it were the observed variables while the variable names of the observed variable are treated as if they were the values of the identifying variable dimension.
X1 X2 X3 X4 X5
territories period dimension soybean maize
unit 1 year 1 harvested 1111 1121
unit 1 year 1 production 1112 1122
unit 1 year 2 harvested 1211 1221
unit 1 year 2 production 1212 1222
unit 2 year 1 harvested 2111 2121
unit 2 year 1 production 2112 2122
unit 2 year 2 harvested 2211 2221
unit 2 year 2 production 2212 2222
schema <- setHeader(rows = 1) %>%
    setIDVar(name = "territories", columns = 1) %>%
    setIDVar(name = "year", columns = 2) %>%
    setIDVar(name = "commodities", columns = c(4, 5), row = 1) %>%
    setObsVar(name = "harvested", unit = "ha", columns = c(4, 5),
              key = "dimension", value = "harvested") %>%
    setObsVar(name = "production", unit = "t", columns = c(4, 5),
              key = "dimension", value = "production")

Fuzzy column names

WIP

Missplaced columns or rows

WIP

Table contains several clusters

Clusters are often of the same arrangement within one table, they can be repeated along rows (horizontally) or along columns (vertically). A table should be treated like a cluster also when the table contains not only the table, but perhaps also text that may be scattered across the document and that does not allow the table to start at the table origin in the topmost left cell.

To reorganise those data into tidy form, each cluster is "cut out", rearranged individually and appended to the end of an output table by reorganise().

Horizontal clusters

In case clusters are sitting right next to each other in the same origin row (Tab. 9), it is sufficient to provide the topmost row and all leftmost columns at which a new cluster starts. In case there is some arbitrary horizontal space between clusters, also the width (of each cluster) needs to be provided.

Table 9: Horizontal clusters of the identifying variable period.
X1 X2 X3 X4 X5 X6 X7
period commodities harvested production commodities harvested production
unit 1 unit 2
year 1 soybean 1111 1112 soybean 2111 2112
year 1 maize 1121 1122 maize 2121 2122
year 2 soybean 1211 1212 soybean 2211 2212
year 2 maize 1221 1222 maize 2221 2222
schema <- setCluster(id = "territories", left = c(2, 5), top = 2) %>%
  setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = c(2, 5), row = 2) %>%
  setIDVar(name = "year", columns = 1) %>%
  setIDVar(name = "commodities", columns = c(2, 5)) %>%
  setObsVar(name = "harvested", columns = c(3, 6), unit = "ha") %>%
  setObsVar(name = "production", columns = c(4, 7), unit = "t")

Vertical clusters

For vertically arranged clusters (Tab. 10), just like for the horizontal case, the respective rows, columns (and heights) need to be provided.

Table 10: Vertical clusters of the identifying variable period.
X1 X2 X3 X4 X5
territories period commodities harvested production
unit 1
year 1 soybean 1111 1112
year 1 maize 1121 1122
year 2 soybean 1211 1212
year 2 maize 1221 1222
unit 2
year 1 soybean 2111 2112
year 1 maize 2121 2122
year 2 soybean 2211 2212
year 2 maize 2221 2222
schema <- setCluster(id = "territories", top = c(3, 8), left = 2) %>%
  setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = 1, row = c(2, 7)) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", columns = 4, unit = "ha") %>%
  setObsVar(name = "production", columns = 5, unit = "t")

Messy clusters

In case several clusters are neither aligned along a row nor a column, and are all of differing size, the respective information need to be provided at the same index of the respective property. For example, three clusters, where the first cluster starts at (1,1) and is 3 by 4 cells in size, where the second clusters starts at (5,2) and is 5 by 5 cells in size, and so on, needs to be specified as below.

schema <- setCluster(top = c(1, 5, 1), left = c(1, 2, 5), 
                     width = c(3, 5, 2), height = c(4, 5, 3), id = ...) %>% 
  setHeader(rows = 1) %>% 
  setIDVar(name = "territories", columns = 1, relative = TRUE) %>% 
  ...

Additionally, given that at least the tables within each cluster are all arranged in the same way, the contained variables can be specified so that their row and column indices are given relative to the cluster position (relative = TRUE). If also that is not the case, the row and column values for each cluster need to be provided for the respective variables in the same way as for cluster positions.

Clusters of observed variables

The previous two types of clusters are clusters of identifying variables, but it may also be the case that the observed variables are split up into distinct clusters. Here, we need to specify first of all in clusters "id = "observed" to indicate that the observed variable is the cluster ID. Next, we need to set up the observed variables so that they contain "key = "cluster" and in value the number of the cluster this variable can be found in. Moreover, we provide the column(s) and all rows that contain the values of each observed variable. This is in contrast to nested observed variables, where the rows of the variable names need to be provided.

Table 11: Vertical clusters of the observed variables.
X1 X2 X3 X4 X5
territories period commodities values
harvested
unit 1 year 1 soybean 1111
unit 1 year 1 maize 1121
unit 1 year 2 soybean 1211
unit 1 year 2 maize 1221
unit 2 year 1 soybean 2111
unit 2 year 1 maize 2121
unit 2 year 2 soybean 2211
unit 2 year 2 maize 2221
production
unit 1 year 1 soybean 1112
unit 1 year 1 maize 1122
unit 1 year 2 soybean 1212
unit 1 year 2 maize 1222
unit 2 year 1 soybean 2112
unit 2 year 1 maize 2122
unit 2 year 2 soybean 2212
unit 2 year 2 maize 2222
schema <- setCluster(id = "observed", left = 2, top = c(3, 12), height = 8) %>%
  setHeader(rows = 1) %>%
  setIDVar(name = "territories", columns = 2) %>%
  setIDVar(name = "year", columns = 3) %>%
  setIDVar(name = "commodities", columns = 4) %>%
  setObsVar(name = "harvested", columns = 5, unit = "ha",
            key = "cluster", value = 1) %>%
  setObsVar(name = "production", columns = 5, unit = "t",
            key = "cluster", value = 2)

Variables that are distinct from a cluster

When not all identifying variables can be provided relative to the cluster origin, for example because they are missing for some clusters, it makes more sense to define such a variable as a distinct variable. This is done by providing row and col as absolute values and setting distinct = TRUE. Other variables that are all arranged in the same way in each cluster can be specified so that their row and column indices are given relative to the cluster position (relative = TRUE).

Table 12: Several clusters where one variable is not available for each cluster, but distinct of them.
X1 X2 X3 X4 X5 X6 X7
commodities harvested production
unit 1
soybean 1111 1112 year 1
maize 1121 1122 year 1
soybean 1211 1212 year 2
maize 1221 1222 year 2
commodities harvested production commodities harvested production
unit 2 unit 3
soybean 2111 2112 soybean 3111 3112 year 1
maize 2121 2122 maize 3121 3122 year 1
soybean 2211 2212 soybean 3211 3212 year 2
maize 2221 2222 maize 3221 3222 year 2
schema <- setCluster(id = "territories", top = c(1, 8, 8), left = c(1, 1, 4),
                     width = 3, height = 6) %>%
  setHeader(rows = 1, relative = TRUE) %>%
  setIDVar(name = "territories", columns = 1, row = 2, relative = TRUE) %>%
  setIDVar(name = "year", columns = 4, row = c(3:6), distinct = TRUE) %>%
  setIDVar(name = "commodities", columns = 1, relative = TRUE) %>%
  setObsVar(name = "harvested", unit = "ha", columns = 2, relative = TRUE) %>%
  setObsVar(name = "production", unit = "t", columns = 3, relative = TRUE)

Fuzzy cluster positions

WIP

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.