Aggregate/Scale a detailed level of a hierarchical variable to an aggregate level

Aggregate counts or probabilities from a detailed level of a hierarchical variable to an aggregate level, or scale the detailed level values so that, when aggregated together, they equal the aggregate level value.

agg(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping,
  agg_function = sum,
  missing_dt_severity = "stop",
  present_agg_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  quiet = FALSE
)

scale(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping = NULL,
  agg_function = sum,
  missing_dt_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  collapse_missing = FALSE,
  quiet = FALSE
)

Arguments

dt: [data.table()]
Data to be aggregated or scaled.
id_cols: [character()]
ID columns that uniquely identify each row of dt.
value_cols: [character()]
Value columns that should be aggregated.
col_stem: [character(1)]
The name of the variable to be aggregated or scaled over. For an 'interval' variable, this should not include the '_start' or '_end' suffix.
col_type: [character(1)]
The type of variable that is being aggregated or scaled over. Can be either 'categorical' or 'interval'.
mapping: [data.table()]
For 'categorical' variables, defines how different levels of the hierarchical variable relate to each other. For 'interval' variables, it specifies the intervals to aggregate to when aggregating; when scaling, the mapping is inferred from the available intervals in dt.
agg_function: [function()]
Function to use when aggregating. Can be either sum (for counts) or prod (for probabilities).
missing_dt_severity: [character(1)]
What should happen when dt is missing levels of col_stem that prevent aggregation or scaling from occurring? Can be either 'skip', 'stop', 'warning', 'message', or 'none'. Default is 'stop'. See section on 'Severity Arguments' for more information.
present_agg_severity: [character(1)]
What should happen when dt already has requested aggregates (from mapping)? Can be either 'skip', 'stop', 'warning', 'message', or 'none'. Default is 'stop'. See section on 'Severity Arguments' for more information.
overlapping_dt_severity: [character(1)]
When aggregating/scaling an interval variable, or when collapse_interval_cols = TRUE, what should happen when overlapping intervals are identified? Can be either 'skip', 'stop', 'warning', 'message', or 'none'. Default is 'stop'. See section on 'Severity Arguments' for more information.
na_value_severity: [character(1)]
What should happen when 'NA' values are present in value_cols? Can be either 'skip', 'stop', 'warning', 'message', or 'none'. Default is 'stop'. See section on 'Severity Arguments' for more information.
collapse_interval_cols: [logical(1)]
Whether to collapse interval id_cols (not including col_stem if it is an interval variable). Default is FALSE. If set to TRUE, the interval columns are collapsed to the most detailed common intervals, and by default the function errors out if there are overlapping intervals (see overlapping_dt_severity). See details or vignettes for more information.
quiet: [logical(1)]
Should progress messages be suppressed as the function is run? Default is FALSE.
collapse_missing: [logical(1)]
When scaling a categorical variable, whether to collapse missing intermediate levels in mapping. Default is FALSE, in which case the function errors out due to the missing data.

Value

[data.table()] with id_cols and value_cols columns for requested aggregates or with scaled values.

Details

The agg function can be used to aggregate to different levels of a predefined hierarchy. For example, for a categorical variable like location you can aggregate from the country level to the global level, and for a numeric 'interval' variable like age you can aggregate from five-year age groups to all ages combined.

The scale function can be used to scale different levels of hierarchical variables like location, so that the sub-national level values aggregated together equal the national level value. Similarly, it can be used to scale a numeric 'interval' variable like age so that the five-year age groups aggregated together equal the all-ages value.
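To make the idea of scaling concrete, here is a minimal sketch of proportional scaling for count data, written with plain data.table operations. The data, location names, and the proportional rule are illustrative assumptions; this is not necessarily how scale() is implemented internally.

```r
library(data.table)

# Hypothetical data: two sub-national units and their national total.
dt <- data.table(
  location = c("Province A", "Province B", "Country"),
  year = 2011,
  value = c(30, 50, 100)
)

children <- dt[location != "Country"]
parent_total <- dt[location == "Country", value]

# Proportional scaling (an assumption for this sketch): multiply every
# child value by parent / sum(children) so the children sum to the parent.
scaling_factor <- parent_total / sum(children$value)
children[, scaled_value := value * scaling_factor]

print(children)
```

After scaling, the two provinces sum to the national value while keeping their relative sizes.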

If 'location' is the variable to be aggregated or scaled then col_stem = 'location' and 'location' must be included in id_cols. If 'age' is the variable to be aggregated or scaled then col_stem = 'age' and 'age_start' and 'age_end' must be included in id_cols since both variables are needed to represent interval variables.

The mapping argument defines how different levels of the hierarchical variable relate to each other. For numeric interval variables the hierarchy can be inferred while for categorical variables the full hierarchy needs to be provided.

mapping for categorical variables must have columns called 'parent' and 'child' that represent how the levels of the variable relate to each other. For example, if aggregating or scaling locations, then mapping needs to define how each child location relates to each parent location. It is then assumed that each parent location in the mapping hierarchy will need to be aggregated to.
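As an illustration of the 'parent'/'child' layout, the sketch below builds a small categorical mapping by hand. The location names are hypothetical and exist only for this example.

```r
library(data.table)

# Hypothetical three-level hierarchy: country -> regions -> districts.
mapping <- data.table(
  child  = c("Region 1", "Region 2",
             "District 1a", "District 1b", "District 2a"),
  parent = c("Country", "Country",
             "Region 1", "Region 1", "Region 2")
)

# Every distinct parent is a level that would be aggregated to.
aggregate_levels <- unique(mapping$parent)

# Levels that never appear as a parent are the most detailed ones.
leaf_levels <- setdiff(mapping$child, mapping$parent)

print(aggregate_levels)
print(leaf_levels)
```

A table shaped like this could then be passed as the mapping argument together with col_type = "categorical".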

mapping for numeric interval variables is only needed when aggregating data, to define exactly which aggregates are needed. It must have columns for '{col_stem}_start' and '{col_stem}_end' defining the start and end of each aggregate interval that is needed. There can be an optional 'include_NA' logical column that allows 'NA' col_stem values to be included in the aggregate for certain requested aggregates. When scaling data, mapping should be NULL since the hierarchy can be inferred from the available intervals in dt.
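A minimal sketch of an interval mapping that follows the column naming described above, including the optional 'include_NA' column. The chosen intervals, and the decision to include 'NA' ages only in the all-ages aggregate, are assumptions made for illustration.

```r
library(data.table)

col_stem <- "age"

# Requested aggregates: [0, 5), [15, 60) and all ages [0, Inf).
# Only the all-ages aggregate is flagged to include NA ages.
age_mapping <- data.table(
  start = c(0, 15, 0),
  end = c(5, 60, Inf),
  include_NA = c(FALSE, FALSE, TRUE)
)

# Rename to the '{col_stem}_start' / '{col_stem}_end' convention.
setnames(age_mapping, c("start", "end"), paste0(col_stem, c("_start", "_end")))

print(age_mapping)
```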

agg and scale work even if dt is not a square dataset, meaning it is okay if different combinations of id_cols have different col_stem values available. For example, if making age aggregates, it is okay if some location-years have 5-year age groups while other location-years have 1-year age groups.

If collapse_interval_cols = TRUE, it is okay if the interval variables included in id_cols are not all exactly the same; agg and scale will collapse to the most detailed common intervals (see collapse_common_intervals()) prior to aggregation or scaling. An example of this is when aggregating subnational data to the national level (so col_stem is 'location' and col_type is 'categorical') but each subnational location contains different age groups. agg() and scale() first aggregate to the most detailed common age groups before making location aggregates.

The agg and scale functions currently only work when combining counts or probabilities. If the data are in rate space, you need to convert to count space first, aggregate/scale, and then convert back to rate space.
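The round trip through count space can be sketched as follows. The column names 'rate' and 'population' and the data values are hypothetical, and a plain grouped sum stands in for the call to agg() so that the example stays self-contained.

```r
library(data.table)

# Hypothetical rate data with the population needed to convert to counts.
dt <- data.table(
  location = c("Province A", "Province B"),
  year = 2011,
  rate = c(0.010, 0.020),
  population = c(1000, 3000)
)

# 1. Convert rates to counts.
dt[, count := rate * population]

# 2. Aggregate in count space (a plain sum standing in for agg()).
national <- dt[, .(count = sum(count), population = sum(population)),
               by = year]

# 3. Convert the aggregated counts back to a rate.
national[, rate := count / population]

print(national)
```

The resulting national rate is the population-weighted value, which differs from the simple average of the two provincial rates.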

Severity Arguments

missing_dt_severity:

Check for missing levels of col_stem, the variable being aggregated or scaled over.

stop: throw error (this is the default).
warning or message: throw warning/message and continue with aggregation/scaling for requested aggregations/scalings where expected input data in dt is available.
none: don't throw error or warning, continue with aggregation/scaling for requested aggregations/scalings where expected input data in dt is available.
skip: skip this check and continue with aggregation/scaling.

present_agg_severity (agg only):

Check for requested aggregates in mapping that are already present in dt.

stop: throw error (this is the default).
warning or message: throw warning/message, drop aggregates and continue with aggregation.
none: don't throw error or warning, drop aggregates and continue with aggregation.
skip: skip this check and add to the values already present for the aggregates.

na_value_severity:

Check for 'NA' values in the value_cols.

stop: throw error (this is the default).
warning or message: throw warning/message, drop missing values and continue with aggregation/scaling where possible (this likely will cause another error because of missing_dt_severity, consider setting missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
none: don't throw error or warning, drop missing values and continue with aggregation/scaling where possible (this likely will cause another error because of missing_dt_severity, consider setting missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
skip: skip this check and propagate NA values through aggregation/scaling.

overlapping_dt_severity:

Check for overlapping intervals that prevent collapsing to the most detailed common set of intervals, or for overlapping intervals in col_stem when aggregating/scaling.

stop: throw error (this is the default).
warning or message: throw warning/message, drop overlapping intervals and continue with aggregation/scaling where possible (this may cause another error because of missing_dt_severity).
none: don't throw error or warning, drop overlapping intervals and continue with aggregation/scaling where possible (this may cause another error because of missing_dt_severity).
skip: skip this check and continue with aggregation/scaling.

Examples

# aggregate count data from present day Iran provinces to historical
# provinces and Iran as a whole
input_dt <- data.table::CJ(location = iran_mapping[!grepl("[0-9]+", child),
                                                   child],
                           year = 2011,
                           value = 1)
output_dt <- agg(dt = input_dt,
                 id_cols = c("location", "year"),
                 value_cols = "value",
                 col_stem = "location",
                 col_type = "categorical",
                 mapping = iran_mapping)
#> Aggregating location
#> Aggregate 1 of 15: Tehran 2006
#> Aggregate 2 of 15: Tehran 1986-1995
#> Aggregate 3 of 15: Markazi 1966-1976
#> Aggregate 4 of 15: Markazi 1956
#> Aggregate 5 of 15: Zanjan 1976-1996
#> Aggregate 6 of 15: Gilan 1956-1966
#> Aggregate 7 of 15: Mazandaran 1956-1996
#> Aggregate 8 of 15: East Azarbayejan 1956-1986
#> Aggregate 9 of 15: Kermanshahan 1956
#> Aggregate 10 of 15: Khuzestan and Lorestan 1956
#> Aggregate 11 of 15: Fars and Ports 1956
#> Aggregate 12 of 15: Khorasan 1956-1996
#> Aggregate 13 of 15: Isfahan and Yazd 1966
#> Aggregate 14 of 15: Isfahan and Yazd 1956
#> Aggregate 15 of 15: Iran (Islamic Republic of)

# scale count data from present day Iran provinces to Iran national value
input_dt <- data.table::CJ(location = iran_mapping[!grepl("[0-9]+", child),
                                                   child],
                           year = 2011,
                           value = 1)
input_dt_agg <- data.table::data.table(
  location = "Iran (Islamic Republic of)",
  year = 2011, value = 62
)
input_dt <- rbind(input_dt, input_dt_agg)
output_dt <- scale(dt = input_dt,
                   id_cols = c("location", "year"),
                   value_cols = "value",
                   col_stem = "location",
                   col_type = "categorical",
                   mapping = iran_mapping,
                   collapse_missing = TRUE)
#> Scaling location
#> Scaling 1 of 1: Iran (Islamic Republic of)

# aggregate age-specific count data
input_dt <- data.table::data.table(year = 2010,
                        age_start = seq(0, 95, 1),
                        value1 = 1, value2 = 2)
gen_end(input_dt, id_cols = c("year", "age_start"), col_stem = "age")
age_mapping <- data.table::data.table(age_start = c(0, 15, 85),
                                      age_end = c(5, 60, Inf))
output_dt <- agg(dt = input_dt,
                 id_cols = c("year", "age_start", "age_end"),
                 value_cols = c("value1", "value2"),
                 col_stem = "age",
                 col_type = "interval",
                 mapping = age_mapping)
#> Aggregating age
#> Interval group 1 of 1: [0, 1),[1, 2),[2, 3),[3, 4),[4, 5),[5, 6),[6, 7),[7, 8),[8, 9),[9, 10),[10, 11),[11, 12),[12, 13),[13, 14),[14, 15),[15, 16),[16, 17),[17, 18),[18, 19),[19, 20),[20, 21),[21, 22),[22, 23),[23, 24),[24, 25),[25, 26),[26, 27),[27, 28),[28, 29),[29, 30),[30, 31),[31, 32),[32, 33),[33, 34),[34, 35),[35, 36),[36, 37),[37, 38),[38, 39),[39, 40),[40, 41),[41, 42),[42, 43),[43, 44),[44, 45),[45, 46),[46, 47),[47, 48),[48, 49),[49, 50),[50, 51),[51, 52),[52, 53),[53, 54),[54, 55),[55, 56),[56, 57),[57, 58),[58, 59),[59, 60),[60, 61),[61, 62),[62, 63),[63, 64),[64, 65),[65, 66),[66, 67),[67, 68),[68, 69),[69, 70),[70, 71),[71, 72),[72, 73),[73, 74),[74, 75),[75, 76),[76, 77),[77, 78),[78, 79),[79, 80),[80, 81),[81, 82),[82, 83),[83, 84),[84, 85),[85, 86),[86, 87),[87, 88),[88, 89),[89, 90),[90, 91),[91, 92),[92, 93),[93, 94),[94, 95),[95, Inf)
#> Aggregate 1 of 3: [0, 5)
#> Aggregate 2 of 3: [15, 60)
#> Aggregate 3 of 3: [85, Inf)

# scale age-specific probability data