Overview

This package is designed to build sequences of bits for any setting where large amounts of non-complex data have to be stored in an efficient way. This can also be useful when documenting the metadata of any tabular dataset by collecting information throughout the dataset creation process. The resulting data structure is referred to as a “bit field”, which can be stored as a sequence of 0s and 1s, or as an integer, reducing the size of the contained information drastically. This is commonly used in MODIS dataproducts to document layer quality.

Think of a bit as a switch representing off and on states. A combination of a pair of bits can store four states, and n bits can accommodate 2^n states. In R, integers are typically 32-bit values, allowing a single integer to store 32 switches (called flags here) and 2^32 states. These states could be the outcomes of functions returning boolean values (each using one bit) or functions returning a small set of cases (using the corresponding number of bits).

In essence, bitfield allows you to capture a diverse range of information into a single value, like a column in a table or a raster layer accompanying a modelled gridded dataset. This is beneficial not only for reporting quality metrics, provenance, or other metadata but also for simplifying the reuse of complex ancillary data in (semi) automated script-based workflows.

Installation

Install the official version from CRAN:

# install.packages("bitfield")

Install the latest development version from github:

devtools::install_github("luckinet/bitfield")

Examples

library(bitfield)

library(dplyr, warn.conflicts = FALSE); library(CoordinateCleaner); library(stringr)

Let’s first build an example dataset

input <- tibble(x = sample(seq(23.3, 28.1, 0.1), 10),
                y = sample(seq(57.5, 59.6, 0.1), 10),
                commodity = rep(c("soybean", "maize"), 5),
                yield = rnorm(10, mean = 10, sd = 2),
                year = rep(2021, 10))

validComm <- c("soybean", "maize")

And make it have some non-ordinary values

input$x[5] <- 259
input$x[9] <- 0
# input$x[10] <- "23.546"
input$y[10] <- NA_real_
input$y[9] <- 0
input$commodity[c(3, 5)] <- c(NA_character_, "honey")
input$year[c(2:3)] <- c(NA, "2021r")

kable(input)

x	y	commodity	yield	year
24.1	59.4	soybean	10.109169	2021
24.0	58.8	maize	10.617382	NA
27.3	58.4	NA	9.268043	2021r
26.8	58.1	maize	8.948955	2021
259.0	57.9	honey	9.859947	2021
23.7	59.1	maize	6.318179	2021
24.2	58.7	soybean	11.860984	2021
25.1	59.0	maize	11.278150	2021
0.0	0.0	soybean	9.949784	2021
24.7	NA	maize	9.670879	2021

The first step is in creating what is called registry in bitfield. This registry captures all the information required to build the bitfield

newRegistry <- bf_create(width = 12, length = dim(input)[1])

The width = specifies how many bits are in the registry.
The lenght = specifies how long the output table is. This is usually taken from an input.
The name = specifies the label of the registry, which becomes very important when publishing, because registry and output table are stored in different files and it must be possible to unambiguously associate them to one another.

Then, individual bit flags need to be grown by specifying a mapping function and which position of the bitfield should be modified. To help with growing bits, various naming-rules are important to keep in mind

if your mapping function returns a boolean value, the bit flags will be FALSE == 0 and TRUE == 1.
if your mapping function returns cases, they will be assigned a sequence of numbers that are encoded by their respective binary representation, i.e. if there are 3 cases (which takes up 2 bits), the bit flags will be case 1 = 00, case 2 == 01 and case 3 == 10, and so on.

A flag is declared by calling a suitable function, some of which are provided here, but some of which are already available elsewhere (more below). For example bf_na(x = input, test = "x") will test whether the column x in the table input has NA-values. These functions are provided to bf_grow(), where the bitfield is characterised.

newRegistry <- newRegistry %>%
  # tests for coordinates ...
  bf_grow(flags = bf_na(x = input, test = "x"),
          pos = 1, registry = .) %>%
  bf_grow(flags =  bf_range(x = input, test = "x", min = -180, max = 180),
          pos = 2, registry = .) %>%
  bf_grow(flags = bf_decimals(x = input, test = "x"),
          pos = 10:11, registry = .) %>%

  # ... or override NA test
  bf_grow(flags = bf_range(x = input, test = "y", min = -90, max = 90),
          pos = 3, na_val = FALSE, registry = .)  %>%

  # test for matches with an external vector
  bf_grow(flags = bf_match(x = input, test = "commodity", against = validComm),
          pos = 4, na_val = FALSE, registry = .) %>%
  
  # define cases
  bf_grow(flags = bf_case(x = input, exclusive = FALSE, 
                          high = yield > 11, 
                          medium = yield < 11 & yield > 9, 
                          small = yield < 9), 
          pos = 8:9, registry = .)

It is also possible to use other functions that return flags, where it is required to provide a name and a concise yet expressive description, which is otherwise automatically provided by the bf_* function. Then you need to keep in mind:

chose name and description so that they reflect the outcome of the mapping function. If the function tests whether a value is NA and returns TRUE if the value is NA, the name and description should indicate that the bit flag is TRUE == 1 when an NA value has been found.
A concise rule to name flags should follow the same rule used by the bf_* functions, where the functional aspect is followed by the variable that is tested, for example distinct_x_y when columns x and y shall have distinct values.

newRegistry <- newRegistry %>%
  # use external functions, such as from CoordinateCleaner ...
  bf_grow(flags = cc_equ(x = input, lon = "x", lat = "y", value = "flagged"), 
          name = "distinct_x_y", desc = c("x and y coordinates are not identical, NAs are FALSE"),
          pos = 5, na_val = FALSE, registry = .) %>%
  
  # ... or stringr ...
  bf_grow(flags = str_detect(input$year, "r"), 
          name = "flag_year", desc = c("year values do have a flag, NAs are FALSE"),
          pos = 6, na_val = FALSE, registry = .) %>%
  
  # ... or even base R
  bf_grow(flags = !is.na(as.integer(input$year)), 
          name = "valid_year", desc = c("year values are valid integers"),
          pos = 7, registry = .)
#> Testing equal lat/lon
#> Flagged NA records.
#> Warning in bf_grow(flags = !is.na(as.integer(input$year)), name = "valid_year",
#> : NAs durch Umwandlung erzeugt

The resulting strcuture is basically a record of all the things that are grown on the bitfield.

newRegistry

Finally the registry needs to be combined (note: input data vectors have been stored into the environment bf_env). This will result in a vector of integers.

(intBit <- bf_combine(registry = newRegistry))
#>  [1] 735 159 695 863 213 863 607 607 207 715

As mentioned above, the registry is a record of things, which is required to decode the bitfield (similar to a key). Together with the legend, the bit flags can then be converted back to human readable text or used in any downstream workflow.

bitfield <- bf_unpack(x = intBit, registry = newRegistry, sep = "-")
#> # A tibble: 9 × 4
#>   name            flags pos   description                                       
#>   <chr>           <int> <chr> <chr>                                             
#> 1 not_na_x            2 1     the values in column 'x' do not contain any NAs   
#> 2 range_x             2 2     the values in column 'x' range between [-180,180] 
#> 3 range_y             2 3     the values in column 'y' range between [-90,90]   
#> 4 match_commodity     2 4     the values in column 'commodity' are contained in…
#> 5 distinct_x_y        2 5     x and y coordinates are not identical, NAs are FA…
#> 6 flag_year           2 6     year values do have a flag, NAs are FALSE         
#> 7 valid_year          2 7     year values are valid integers                    
#> 8 cases               3 8:9   the values are split into the following cases [1:…
#> 9 decimals            2 10:11 the values in 'x' have 0|1 decimals

# -> prints legend by default, which is also available in bf_env$legend

input %>% 
  bind_cols(bitfield) %>% 
  kable()

x	y	commodity	yield	year	bf_int	bf_binary
24.1	59.4	soybean	10.109169	2021	735	1-1-1-1-1-0-1-10-10
24.0	58.8	maize	10.617382	NA	159	1-1-1-1-1-0-0-10-00
27.3	58.4	NA	9.268043	2021r	695	1-1-1-0-1-1-0-10-10
26.8	58.1	maize	8.948955	2021	863	1-1-1-1-1-0-1-01-10
259.0	57.9	honey	9.859947	2021	213	1-0-1-0-1-0-1-10-00
23.7	59.1	maize	6.318179	2021	863	1-1-1-1-1-0-1-01-10
24.2	58.7	soybean	11.860984	2021	607	1-1-1-1-1-0-1-00-10
25.1	59.0	maize	11.278150	2021	607	1-1-1-1-1-0-1-00-10
0.0	0.0	soybean	9.949784	2021	207	1-1-1-1-0-0-1-10-00
24.7	NA	maize	9.670879	2021	715	1-1-0-1-0-0-1-10-10

Together with the rules mentioned above, we can read the binary representation on step at a time. For example, considering the second position, with the description the values in column 'x' range between [-180,180], we see that row five has the value 0, which means according to naming-rule 1 (FALSE == 0), that the x-value here should be outside of the range of [-180, 180], which we can confirm.

Bitfields for other data-types

This example here shows how to compute quality bits for tabular data, but this technique is especially helpful for raster data. To keep this package as simple as possible, no specific methods for rasters were developed (so far), they instead need to be converted to tabular form and joined to the attributes or meta data that should be added to the QB, for example like this

library(terra)

raster <- rast(matrix(data = 1:25, nrow = 5, ncol = 5))

input <- values(raster) %>% 
  as_tibble() %>% 
  rename(values = lyr.1) %>% 
  bind_cols(crds(raster), .)

# from here we can continue creating a bitfield and growing bits on it just like shown above...
intBit <- bf_combine(...)

# ... and then converting it back to a raster
QB_rast <- crds(raster) %>% 
  bind_cols(intBit) %>% 
  rast(type = "xyz", crs = crs(raster), extent = ext(raster))

To Do

write registry show method
include MD5 sum for a bitfield and update it each time the bitfield is grown further

bitfield

Overview

Installation

Examples

Bitfields for other data-types

To Do

Links

License

Citation

Developers