R/agg_scale.R
agg_scale.Rd
Aggregate counts or probabilities from a detailed level of a hierarchical variable to an aggregate level, or scale the detailed-level values so that, aggregated together, they equal the aggregate-level value.
agg(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping,
  agg_function = sum,
  missing_dt_severity = "stop",
  present_agg_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  quiet = FALSE
)
scale(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping = NULL,
  agg_function = sum,
  missing_dt_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  collapse_missing = FALSE,
  quiet = FALSE
)
dt
[data.table()]
Data to be aggregated or scaled.
id_cols
[character()]
ID columns that uniquely identify each row of dt.
value_cols
[character()]
Value columns that should be aggregated or scaled.
col_stem
[character(1)]
The name of the variable to be aggregated or scaled over. If aggregating an
'interval' variable, this should not include the '_start' or '_end' suffix.
col_type
[character(1)]
The type of variable that is being aggregated or scaled over. Can be either
'categorical' or 'interval'.
mapping
[data.table()]
For 'categorical' variables, defines how different levels of the
hierarchical variable relate to each other. For 'interval' variables, it
specifies the intervals to aggregate to when aggregating; when scaling, the
mapping is inferred from the available intervals in dt.
agg_function
[function()]
Function to use when aggregating. Can be either sum (for counts) or prod
(for probabilities).
missing_dt_severity
[character(1)]
What should happen when dt is missing levels of col_stem that prevent
aggregation or scaling from occurring? Can be either 'skip', 'stop',
'warning', 'message', or 'none'. Default is 'stop'. See section on
'Severity Arguments' for more information.
present_agg_severity
[character(1)]
What should happen when dt already has requested aggregates (from mapping)?
Can be either 'skip', 'stop', 'warning', 'message', or 'none'. Default is
'stop'. See section on 'Severity Arguments' for more information.
overlapping_dt_severity
[character(1)]
When aggregating/scaling an interval variable, or when
collapse_interval_cols = TRUE, what should happen when overlapping intervals
are identified? Can be either 'skip', 'stop', 'warning', 'message', or
'none'. Default is 'stop'. See section on 'Severity Arguments' for more
information.
na_value_severity
[character(1)]
What should happen when 'NA' values are present in value_cols? Can be
either 'skip', 'stop', 'warning', 'message', or 'none'. Default is 'stop'.
See section on 'Severity Arguments' for more information.
collapse_interval_cols
[logical(1)]
Whether to collapse interval id_cols (not including col_stem if it is an
interval variable). Default is FALSE. If set to TRUE, the interval columns
are collapsed to the most detailed common intervals, and the function will
error out if there are overlapping intervals. See details or vignettes for
more information.
quiet
[logical(1)]
Should progress messages be suppressed as the function is run? Default is
FALSE.
collapse_missing
[logical(1)]
When scaling a categorical variable, whether to collapse missing
intermediate levels in mapping. Default is FALSE, in which case the function
errors out due to missing data.
[data.table()] with id_cols and value_cols columns for requested aggregates
or with scaled values.
The agg function can be used to aggregate to different levels of a
predefined hierarchy. For example, for a categorical variable like location
you can aggregate from the country level to the global level, and for a
numeric 'interval' variable like age you can aggregate from five-year age
groups to all ages combined.
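As a rough illustration of why agg_function differs for counts and probabilities, the base R sketch below combines two detailed age intervals into one aggregate. It does not call agg(); the numbers, and the reading of the probabilities as conditional survival probabilities, are assumptions made for illustration.

```r
# Hypothetical values for two adjacent detailed intervals, [0, 1) and [1, 5).
deaths   <- c(30, 10)       # counts: combine by adding
survival <- c(0.97, 0.99)   # conditional survival probabilities: combine by multiplying

deaths_0_5   <- sum(deaths)     # total count for [0, 5)
survival_0_5 <- prod(survival)  # probability of surviving all of [0, 5)

print(deaths_0_5)
print(survival_0_5)
```

Counts add across intervals, while conditional probabilities multiply, which is why sum and prod are the two options described for agg_function.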
The scale function can be used to scale different levels of hierarchical
variables like location, so that the sub-national level values aggregated
together equal the national level value. Similarly, it can be used to scale
a numeric 'interval' variable like age so that the five-year age groups
aggregated together equal the all-ages value.
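The idea of scaling counts can be sketched with plain arithmetic. The example below assumes simple proportional scaling of made-up subnational counts to a made-up national total; it is an illustration of the concept, not necessarily the exact computation scale() performs internally.

```r
# Hypothetical subnational counts and an independent national total.
subnational <- c(province_a = 40, province_b = 60, province_c = 100)
national <- 220

# Proportional scaling: multiply every detailed value by the same factor so
# that the detailed values sum to the aggregate value.
scaling_factor <- national / sum(subnational)
scaled <- subnational * scaling_factor

print(scaled)
print(sum(scaled))
```

After scaling, the relative sizes of the provinces are unchanged and their sum matches the national value.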
If 'location' is the variable to be aggregated or scaled then
col_stem = 'location' and 'location' must be included in id_cols. If 'age'
is the variable to be aggregated or scaled then col_stem = 'age' and
'age_start' and 'age_end' must be included in id_cols, since both variables
are needed to represent interval variables.
The mapping argument defines how different levels of the hierarchical
variable relate to each other. For numeric interval variables the hierarchy
can be inferred, while for categorical variables the full hierarchy needs to
be provided.
mapping for categorical variables must have columns called 'parent' and
'child' that represent how the levels of the variable relate to each other.
For example, if aggregating or scaling locations then mapping needs to
define how each child location relates to each parent location. It is then
assumed that every parent location in the mapping hierarchy needs to be
aggregated to.
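A minimal sketch of a categorical mapping, and of what aggregating over it means, is shown below. The location names and values are hypothetical, and the manual merge-and-sum is only a conceptual stand-in for agg(), not the package's implementation.

```r
library(data.table)

# Hypothetical two-level hierarchy: three provinces nested within one country.
mapping <- data.table(
  parent = c("Country", "Country", "Country"),
  child  = c("Province A", "Province B", "Province C")
)

# Detailed-level input data, one row per child location and year.
input_dt <- data.table(
  location = c("Province A", "Province B", "Province C"),
  year = 2011,
  value = c(10, 20, 30)
)

# Attach each child's parent, then sum the children within each parent.
merged <- merge(input_dt, mapping, by.x = "location", by.y = "child")
manual_agg <- merged[, .(value = sum(value)), by = .(location = parent, year)]

print(manual_agg)
```

Each distinct value in the 'parent' column becomes one requested aggregate, here a single row for "Country".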
mapping for numeric interval variables is only needed when aggregating data,
to define exactly which aggregates are needed. It must have columns
'{col_stem}_start' and '{col_stem}_end' defining the start and end of each
aggregate interval that is needed. There can be an optional 'include_NA'
logical column that allows 'NA' col_stem values to be included in the
aggregate for certain requested aggregates. When scaling data, mapping
should be NULL since the hierarchy can be inferred from the available
intervals in dt.
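The sketch below builds an interval mapping for col_stem = "age" and counts which detailed intervals fall inside each requested aggregate. It assumes left-closed, right-open intervals such as [0, 5), and the all-FALSE include_NA column is only there to show where the optional column would go; the containment check is an illustration, not the package's internal logic.

```r
library(data.table)

# Hypothetical aggregate intervals; [0, 5) covers ages 0 through 4.
age_mapping <- data.table(
  age_start = c(0, 15, 85),
  age_end = c(5, 60, Inf),
  include_NA = c(FALSE, FALSE, FALSE)  # optional column, shown for illustration
)

# Detailed single-year intervals [0, 1), [1, 2), ..., [94, 95), [95, Inf).
detailed <- data.table(age_start = 0:95, age_end = c(1:95, Inf))

# A detailed interval contributes to an aggregate when it lies entirely inside it.
n_inside <- sapply(seq_len(nrow(age_mapping)), function(i) {
  sum(detailed$age_start >= age_mapping$age_start[i] &
        detailed$age_end <= age_mapping$age_end[i])
})

print(n_inside)
```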
agg and scale work even if dt is not a square dataset, meaning it is okay if
different combinations of id_cols have different col_stem values available.
For example, if making age aggregates, it is okay if some location-years
have 5-year age groups while other location-years have 1-year age groups.
If collapse_interval_cols = TRUE it is okay if the interval variables
included in id_cols are not all exactly the same; agg and scale will
collapse to the most detailed common intervals (see
collapse_common_intervals()) prior to aggregation or scaling. An example of
this is when aggregating subnational data to the national level (so col_stem
is 'location' and col_type is 'categorical') but each subnational location
contains different age groups. agg() and scale() first aggregate to the most
detailed common age groups before making location aggregates.
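One way to picture "most detailed common intervals" is as the intervals formed by the breakpoints that every group shares. The base R sketch below assumes two hypothetical locations whose age groups are contiguous, non-overlapping, and cover the same overall range; it is not the algorithm used by collapse_common_intervals().

```r
# Hypothetical age groups available for two subnational locations.
starts_a <- c(0, 5, 10, 15); ends_a <- c(5, 10, 15, 20)
starts_b <- c(0, 10);        ends_b <- c(10, 20)

# Breakpoints shared by both locations define the most detailed intervals
# that each location can be aggregated to without splitting a group.
breaks_a <- sort(unique(c(starts_a, ends_a)))
breaks_b <- sort(unique(c(starts_b, ends_b)))
common_breaks <- intersect(breaks_a, breaks_b)

common <- data.frame(
  age_start = head(common_breaks, -1),
  age_end = tail(common_breaks, -1)
)

print(common)
```

Here the five-year groups of the first location would be combined into the ten-year groups of the second before any location aggregates are made.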
The agg and scale functions currently only work when combining counts or
probabilities. If the data are in rate space then you need to convert to
count space first, aggregate/scale, and then convert back.
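The rate-space round trip can be sketched in a few lines of base R. The rates and population counts are made up, and the plain sum() stands in for the aggregation step that agg() would perform.

```r
# Hypothetical age-specific mortality rates and the matching population counts.
rates      <- c(0.010, 0.002, 0.050)  # deaths per person
population <- c(1000, 5000, 400)

# 1. Convert to count space.
deaths <- rates * population

# 2. Aggregate the counts (this is the step agg() would perform).
total_deaths <- sum(deaths)
total_population <- sum(population)

# 3. Convert back to rate space.
all_age_rate <- total_deaths / total_population

print(all_age_rate)
```

Summing the rates directly would ignore the different population sizes behind each rate, which is why the conversion to counts is needed.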
missing_dt_severity:
Check for missing levels of col_stem, the variable being aggregated or
scaled over.
stop: throw error (this is the default).
warning or message: throw warning/message and continue with
aggregation/scaling for requested aggregations/scalings where expected input
data in dt is available.
none: don't throw error or warning, continue with aggregation/scaling for
requested aggregations/scalings where expected input data in dt is
available.
skip: skip this check and continue with aggregation/scaling.
present_agg_severity (agg only):
Check for requested aggregates in mapping that are already present in dt.
stop: throw error (this is the default).
warning or message: throw warning/message, drop aggregates and continue
with aggregation.
none: don't throw error or warning, drop aggregates and continue with
aggregation.
skip: skip this check and add to the values already present for the
aggregates.
na_value_severity:
Check for 'NA' values in the value_cols.
stop: throw error (this is the default).
warning or message: throw warning/message, drop missing values and continue
with aggregation/scaling where possible (this will likely cause another
error because of missing_dt_severity; consider setting
missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
none: don't throw error or warning, drop missing values and continue with
aggregation/scaling where possible (this will likely cause another error
because of missing_dt_severity; consider setting
missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
skip: skip this check and propagate NA values through aggregation/scaling.
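The difference between propagating and dropping 'NA' values can be seen with base R's sum(). This is only an analogy for the behaviours described above, using a made-up vector, and does not call agg() or scale().

```r
values <- c(1, NA, 3)

# Propagating NA, as the 'skip' option is described to do: the aggregate is NA.
propagated <- sum(values)

# Dropping NA before aggregating, comparable to na.rm = TRUE.
dropped <- sum(values, na.rm = TRUE)

print(propagated)
print(dropped)
```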
overlapping_dt_severity:
Check for overlapping intervals that prevent collapsing to the most detailed
common set of intervals, or for overlapping intervals in col_stem when
aggregating/scaling.
stop: throw error (this is the default).
warning or message: throw warning/message, drop overlapping intervals and
continue with aggregation/scaling where possible (this may cause another
error because of missing_dt_severity).
none: don't throw error or warning, drop overlapping intervals and continue
with aggregation/scaling where possible (this may cause another error
because of missing_dt_severity).
skip: skip this check and continue with aggregation/scaling.
# aggregate count data from present day Iran provinces to historical
# provinces and Iran as a whole
input_dt <- data.table::CJ(
  location = iran_mapping[!grepl("[0-9]+", child), child],
  year = 2011,
  value = 1
)
output_dt <- agg(
  dt = input_dt,
  id_cols = c("location", "year"),
  value_cols = "value",
  col_stem = "location",
  col_type = "categorical",
  mapping = iran_mapping
)
#> Aggregating location
#> Aggregate 1 of 15: Tehran 2006
#> Aggregate 2 of 15: Tehran 1986-1995
#> Aggregate 3 of 15: Markazi 1966-1976
#> Aggregate 4 of 15: Markazi 1956
#> Aggregate 5 of 15: Zanjan 1976-1996
#> Aggregate 6 of 15: Gilan 1956-1966
#> Aggregate 7 of 15: Mazandaran 1956-1996
#> Aggregate 8 of 15: East Azarbayejan 1956-1986
#> Aggregate 9 of 15: Kermanshahan 1956
#> Aggregate 10 of 15: Khuzestan and Lorestan 1956
#> Aggregate 11 of 15: Fars and Ports 1956
#> Aggregate 12 of 15: Khorasan 1956-1996
#> Aggregate 13 of 15: Isfahan and Yazd 1966
#> Aggregate 14 of 15: Isfahan and Yazd 1956
#> Aggregate 15 of 15: Iran (Islamic Republic of)
# scale count data from present day Iran provinces to Iran national value
input_dt <- data.table::CJ(
  location = iran_mapping[!grepl("[0-9]+", child), child],
  year = 2011,
  value = 1
)
input_dt_agg <- data.table::data.table(
  location = "Iran (Islamic Republic of)",
  year = 2011, value = 62
)
input_dt <- rbind(input_dt, input_dt_agg)
output_dt <- scale(
  dt = input_dt,
  id_cols = c("location", "year"),
  value_cols = "value",
  col_stem = "location",
  col_type = "categorical",
  mapping = iran_mapping,
  collapse_missing = TRUE
)
#> Scaling location
#> Scaling 1 of 1: Iran (Islamic Republic of)
# aggregate age-specific count data
input_dt <- data.table::data.table(
  year = 2010,
  age_start = seq(0, 95, 1),
  value1 = 1, value2 = 2
)
gen_end(input_dt, id_cols = c("year", "age_start"), col_stem = "age")
age_mapping <- data.table::data.table(
  age_start = c(0, 15, 85),
  age_end = c(5, 60, Inf)
)
output_dt <- agg(
  dt = input_dt,
  id_cols = c("year", "age_start", "age_end"),
  value_cols = c("value1", "value2"),
  col_stem = "age",
  col_type = "interval",
  mapping = age_mapping
)
#> Aggregating age
#> Interval group 1 of 1: [0, 1),[1, 2),[2, 3),[3, 4),[4, 5),[5, 6),[6, 7),[7, 8),[8, 9),[9, 10),[10, 11),[11, 12),[12, 13),[13, 14),[14, 15),[15, 16),[16, 17),[17, 18),[18, 19),[19, 20),[20, 21),[21, 22),[22, 23),[23, 24),[24, 25),[25, 26),[26, 27),[27, 28),[28, 29),[29, 30),[30, 31),[31, 32),[32, 33),[33, 34),[34, 35),[35, 36),[36, 37),[37, 38),[38, 39),[39, 40),[40, 41),[41, 42),[42, 43),[43, 44),[44, 45),[45, 46),[46, 47),[47, 48),[48, 49),[49, 50),[50, 51),[51, 52),[52, 53),[53, 54),[54, 55),[55, 56),[56, 57),[57, 58),[58, 59),[59, 60),[60, 61),[61, 62),[62, 63),[63, 64),[64, 65),[65, 66),[66, 67),[67, 68),[68, 69),[69, 70),[70, 71),[71, 72),[72, 73),[73, 74),[74, 75),[75, 76),[76, 77),[77, 78),[78, 79),[79, 80),[80, 81),[81, 82),[82, 83),[83, 84),[84, 85),[85, 86),[86, 87),[87, 88),[88, 89),[89, 90),[90, 91),[91, 92),[92, 93),[93, 94),[94, 95),[95, Inf)
#> Aggregate 1 of 3: [0, 5)
#> Aggregate 2 of 3: [15, 60)
#> Aggregate 3 of 3: [85, Inf)
# scale age-specific probability data