This package is designed to build sequences of bits for any setting where large amounts of non-complex data have to be stored efficiently. This can also be useful when documenting the metadata of any tabular dataset by collecting information throughout the dataset creation process. The resulting data structure is referred to as a “bit field”, which can be stored as a sequence of 0s and 1s, or as an integer, reducing the size of the contained information drastically. This approach is commonly used in MODIS data products to document layer quality.
Think of a bit as a switch representing off and on states. A pair of bits can store four states, and n bits can accommodate 2^n states. In R, integers are typically 32-bit values, allowing a single integer to store 32 switches (called flags here) and 2^32 states. These states could be the outcomes of functions returning boolean values (each using one bit) or functions returning a small set of cases (using the corresponding number of bits).
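The switch analogy can be sketched in a few lines of base R. This is only an illustration of the idea, not how bitfield works internally: the flag names are hypothetical, and treating position 1 as the least significant bit is an assumption made for this sketch.

```r
# Three hypothetical boolean flags; position 1 is treated as the least
# significant bit (an assumption of this sketch).
flags <- c(is_complete = TRUE, in_range = FALSE, is_valid = TRUE)
pos <- seq_along(flags)

# Pack: a TRUE flag at position p contributes 2^(p - 1) to the integer.
packed <- as.integer(sum(flags * 2^(pos - 1)))

# Unpack: integer division and modulo recover each bit again.
unpacked <- (packed %/% 2^(pos - 1)) %% 2 == 1

# n bits can represent 2^n distinct states.
n_states <- 2^length(flags)
```

Here the three flags collapse into a single integer, and the same integer can be expanded back into the original three values without loss.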
In essence, bitfield allows you to capture a diverse range of information in a single value, like a column in a table or a raster layer accompanying a modelled gridded dataset. This is beneficial not only for reporting quality metrics, provenance, or other metadata but also for simplifying the reuse of complex ancillary data in (semi) automated script-based workflows.
Install the official version from CRAN:
# install.packages("bitfield")
Install the latest development version from GitHub:
devtools::install_github("luckinet/bitfield")
library(bitfield)
library(dplyr, warn.conflicts = FALSE)
library(CoordinateCleaner)
library(stringr)
library(knitr) # provides kable(), used below to print tables
Let’s first build an example dataset:
input <- tibble(x = sample(seq(23.3, 28.1, 0.1), 10),
                y = sample(seq(57.5, 59.6, 0.1), 10),
                commodity = rep(c("soybean", "maize"), 5),
                yield = rnorm(10, mean = 10, sd = 2),
                year = rep(2021, 10))
validComm <- c("soybean", "maize")
Then introduce some irregular values:
input$x[5] <- 259
input$x[9] <- 0
# input$x[10] <- "23.546"
input$y[10] <- NA_real_
input$y[9] <- 0
input$commodity[c(3, 5)] <- c(NA_character_, "honey")
input$year[c(2:3)] <- c(NA, "2021r")
kable(input)
x | y | commodity | yield | year |
---|---|---|---|---|
24.1 | 59.4 | soybean | 10.109169 | 2021 |
24.0 | 58.8 | maize | 10.617382 | NA |
27.3 | 58.4 | NA | 9.268043 | 2021r |
26.8 | 58.1 | maize | 8.948955 | 2021 |
259.0 | 57.9 | honey | 9.859947 | 2021 |
23.7 | 59.1 | maize | 6.318179 | 2021 |
24.2 | 58.7 | soybean | 11.860984 | 2021 |
25.1 | 59.0 | maize | 11.278150 | 2021 |
0.0 | 0.0 | soybean | 9.949784 | 2021 |
24.7 | NA | maize | 9.670879 | 2021 |
The first step is to create what bitfield calls a registry. This registry captures all the information required to build the bitfield:

- width = specifies how many bits are in the registry.
- length = specifies how long the output table is. This is usually taken from an input.
- name = specifies the label of the registry, which becomes very important when publishing, because registry and output table are stored in different files and it must be possible to unambiguously associate them with one another.

Then, individual bit flags need to be grown by specifying a mapping function and the position of the bitfield that should be modified. To help with growing bits, keep the following naming-rules in mind:

1. Boolean values are encoded as FALSE == 0 and TRUE == 1.
2. Cases are encoded as case 1 == 00, case 2 == 01 and case 3 == 10, and so on.

A flag is declared by calling a suitable function, some of which are provided here, while others are already available elsewhere (more below). For example, bf_na(x = input, test = "x") will test whether the column x in the table input has NA-values. These functions are provided to bf_grow(), where the bitfield is characterised.
newRegistry <- newRegistry %>%
  # tests for coordinates ...
  bf_grow(flags = bf_na(x = input, test = "x"),
          pos = 1, registry = .) %>%
  bf_grow(flags = bf_range(x = input, test = "x", min = -180, max = 180),
          pos = 2, registry = .) %>%
  bf_grow(flags = bf_decimals(x = input, test = "x"),
          pos = 10:11, registry = .) %>%
  # ... or override NA test
  bf_grow(flags = bf_range(x = input, test = "y", min = -90, max = 90),
          pos = 3, na_val = FALSE, registry = .) %>%
  # test for matches with an external vector
  bf_grow(flags = bf_match(x = input, test = "commodity", against = validComm),
          pos = 4, na_val = FALSE, registry = .) %>%
  # define cases
  bf_grow(flags = bf_case(x = input, exclusive = FALSE,
                          high = yield > 11,
                          medium = yield < 11 & yield > 9,
                          small = yield < 9),
          pos = 8:9, registry = .)
It is also possible to use other functions that return flags. In that case you must provide a name and a concise yet expressive description, which the bf_* functions otherwise provide automatically. Keep in mind:

- If a function tests for NA and returns TRUE if the value is NA, the name and description should indicate that the bit flag is TRUE == 1 when an NA value has been found.
- Names should follow the convention of the bf_* functions, where the functional aspect is followed by the variable that is tested, for example distinct_x_y when columns x and y shall have distinct values.
newRegistry <- newRegistry %>%
  # use external functions, such as from CoordinateCleaner ...
  bf_grow(flags = cc_equ(x = input, lon = "x", lat = "y", value = "flagged"),
          name = "distinct_x_y", desc = c("x and y coordinates are not identical, NAs are FALSE"),
          pos = 5, na_val = FALSE, registry = .) %>%
  # ... or stringr ...
  bf_grow(flags = str_detect(input$year, "r"),
          name = "flag_year", desc = c("year values do have a flag, NAs are FALSE"),
          pos = 6, na_val = FALSE, registry = .) %>%
  # ... or even base R
  bf_grow(flags = !is.na(as.integer(input$year)),
          name = "valid_year", desc = c("year values are valid integers"),
          pos = 7, registry = .)
#> Testing equal lat/lon
#> Flagged NA records.
#> Warning in bf_grow(flags = !is.na(as.integer(input$year)), name = "valid_year",
#> : NAs introduced by coercion
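To see what kind of vectors these external calls hand to bf_grow(), the following base-R sketch reproduces the two year flags for the first three rows of the example table. It uses grepl() in place of str_detect() to stay free of dependencies, and it assumes that na_val = FALSE simply stores missing flag values as FALSE, as the descriptions above suggest; the package may handle this differently.

```r
# The first three year values from the example table.
year <- c("2021", NA, "2021r")

# A flag that is NA wherever the input is NA, mimicking str_detect().
flag_year <- ifelse(is.na(year), NA, grepl("r", year, fixed = TRUE))

# Assumed effect of na_val = FALSE: missing flags are stored as FALSE.
flag_year[is.na(flag_year)] <- FALSE

# Non-integer strings become NA on coercion (with a warning), so this flag
# never contains NA itself and needs no na_val.
valid_year <- !is.na(suppressWarnings(as.integer(year)))
```

The coercion warning printed above comes from the as.integer() call; it is suppressed in this sketch only to keep the output clean.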
The resulting structure is basically a record of everything that has been grown on the bitfield.
newRegistry
Finally, the registry needs to be combined (note: input data vectors have been stored in the environment bf_env). This will result in a vector of integers.
(intBit <- bf_combine(registry = newRegistry))
#> [1] 735 159 695 863 213 863 607 607 207 715
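How a row of flags turns into one of these integers can be checked by hand. The sketch below takes the bits of the first row as they appear in the bf_binary column of the table in this README and assumes that position 1 is the least significant bit, which is consistent with the values shown here but is not taken from the package source.

```r
# Bits of the first row at positions 1 to 11, as printed in bf_binary
# ("1-1-1-1-1-0-1-10-10").
bits <- c(1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0)

# Each set bit at position p adds 2^(p - 1) to the integer.
first_int <- as.integer(sum(bits * 2^(seq_along(bits) - 1)))
first_int
```

Under this assumption the result equals the first element of the vector returned by bf_combine().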
As mentioned above, the registry is a record of things, which is required to decode the bitfield (similar to a key). Together with the legend, the bit flags can then be converted back to human readable text or used in any downstream workflow.
bitfield <- bf_unpack(x = intBit, registry = newRegistry, sep = "-")
#> # A tibble: 9 × 4
#> name flags pos description
#> <chr> <int> <chr> <chr>
#> 1 not_na_x 2 1 the values in column 'x' do not contain any NAs
#> 2 range_x 2 2 the values in column 'x' range between [-180,180]
#> 3 range_y 2 3 the values in column 'y' range between [-90,90]
#> 4 match_commodity 2 4 the values in column 'commodity' are contained in…
#> 5 distinct_x_y 2 5 x and y coordinates are not identical, NAs are FA…
#> 6 flag_year 2 6 year values do have a flag, NAs are FALSE
#> 7 valid_year 2 7 year values are valid integers
#> 8 cases 3 8:9 the values are split into the following cases [1:…
#> 9 decimals 2 10:11 the values in 'x' have 0|1 decimals
# -> prints legend by default, which is also available in bf_env$legend
input %>%
bind_cols(bitfield) %>%
kable()
x | y | commodity | yield | year | bf_int | bf_binary |
---|---|---|---|---|---|---|
24.1 | 59.4 | soybean | 10.109169 | 2021 | 735 | 1-1-1-1-1-0-1-10-10 |
24.0 | 58.8 | maize | 10.617382 | NA | 159 | 1-1-1-1-1-0-0-10-00 |
27.3 | 58.4 | NA | 9.268043 | 2021r | 695 | 1-1-1-0-1-1-0-10-10 |
26.8 | 58.1 | maize | 8.948955 | 2021 | 863 | 1-1-1-1-1-0-1-01-10 |
259.0 | 57.9 | honey | 9.859947 | 2021 | 213 | 1-0-1-0-1-0-1-10-00 |
23.7 | 59.1 | maize | 6.318179 | 2021 | 863 | 1-1-1-1-1-0-1-01-10 |
24.2 | 58.7 | soybean | 11.860984 | 2021 | 607 | 1-1-1-1-1-0-1-00-10 |
25.1 | 59.0 | maize | 11.278150 | 2021 | 607 | 1-1-1-1-1-0-1-00-10 |
0.0 | 0.0 | soybean | 9.949784 | 2021 | 207 | 1-1-1-1-0-0-1-10-00 |
24.7 | NA | maize | 9.670879 | 2021 | 715 | 1-1-0-1-0-0-1-10-10 |
Together with the rules mentioned above, we can read the binary representation one step at a time. For example, considering the second position, with the description the values in column 'x' range between [-180,180], we see that row five has the value 0, which means, according to naming-rule 1 (FALSE == 0), that the x-value here should be outside the range [-180, 180], which we can confirm.
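The same check can be done programmatically on the integers alone. This is a minimal base-R sketch: get_bit() is a hypothetical helper written for this illustration, not a bitfield function, and it again assumes that position 1 is the least significant bit.

```r
# Integers as returned by bf_combine() above.
intBit <- c(735L, 159L, 695L, 863L, 213L, 863L, 607L, 607L, 207L, 715L)

# Hypothetical helper: extract the bit at a given position
# (position 1 = least significant bit).
get_bit <- function(x, pos) (x %/% 2^(pos - 1)) %% 2

# Position 2 holds the range test for column x.
range_x <- get_bit(intBit, pos = 2)

# Rows where the x-value falls outside [-180, 180].
out_of_range <- which(range_x == 0)
out_of_range
```

In a downstream workflow, a filter like this makes it possible to select or drop records based on a single quality flag without unpacking the whole bitfield.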
This example shows how to compute quality bits (QB) for tabular data, but the technique is especially helpful for raster data. To keep this package as simple as possible, no specific methods for rasters have been developed so far. Instead, rasters need to be converted to tabular form and joined to the attributes or metadata that should be added to the QB, for example like this:
library(terra)
raster <- rast(matrix(data = 1:25, nrow = 5, ncol = 5))
input <- values(raster) %>%
as_tibble() %>%
rename(values = lyr.1) %>%
bind_cols(crds(raster), .)
# from here we can continue creating a bitfield and growing bits on it just like shown above...
intBit <- bf_combine(...)
# ... and then converting it back to a raster
QB_rast <- crds(raster) %>%
bind_cols(intBit) %>%
rast(type = "xyz", crs = crs(raster), extent = ext(raster))