R/interval_formatting.R
gen_interval_cols.Rd
hierarchyUtils assumes numeric interval variables are grouped into left-closed, right-open intervals, \(a \le x < b\). Each interval can be described by its endpoints, from which interval lengths and nicely formatted interval names can be created.
gen_end(dt, id_cols, col_stem, right_most_endpoint = Inf)
gen_length(dt, col_stem)
gen_name(
  dt,
  col_stem,
  format = "to",
  format_infinite = "plus",
  right_most_endpoint = Inf
)
dt
[data.table()]
col_stem-specific data.
id_cols
[character()]
ID columns that uniquely identify each row of dt. This must include
'{col_stem}_start'.
col_stem
[character(1)]
Base name of the numeric variable column. Does not include the '_start',
'_end' etc. suffix.
right_most_endpoint
[numeric(1)]
Assumed right most endpoint of '{col_stem}_end'. Default is Inf.
format
[character(1)]
Formatting style for the interval names. Default is 'to'; can also be
'interval' or 'dash'.
format_infinite
[character(1)]
Formatting style for infinite endpoint intervals. Default is 'plus'; can
also be '+'. Ignored when format = "interval".
Invisibly returns a reference to the modified dt.
gen_end generates a new column '{col_stem}_end' for the
right-open endpoint of each interval from a series of left-closed endpoints
'{col_stem}_start'.
gen_end assumes that only the most detailed intervals are present in the
input dataset; including overlapping intervals will not return the expected
results. For example, if you had intervals of 0-5, 5-10, 2-7, 10+ in dt
(but only the start of each interval is provided to the dataset), then the
inferred intervals would be 0-2, 2-5, 5-10, 10+, because each end is taken
from the next available start (0, 2, 5, 10).
Input data dt for gen_end must:
Contain all columns specified in id_cols.
Have a column called '{col_stem}_start'.
Have each row uniquely identified by each combination of id_cols.
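The endpoint inference described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: infer_end is a hypothetical helper, and it handles a single group of intervals, whereas gen_end works on a data.table identified by id_cols.

```r
# Hypothetical helper (not part of hierarchyUtils): infer right-open
# endpoints from a vector of left-closed interval starts.
infer_end <- function(start, right_most_endpoint = Inf) {
  ord <- order(start)
  sorted_start <- start[ord]
  # Each interval ends where the next one starts; the last interval
  # ends at right_most_endpoint.
  sorted_end <- c(sorted_start[-1], right_most_endpoint)
  # Put the endpoints back in the original row order.
  end <- numeric(length(start))
  end[ord] <- sorted_end
  end
}

age_start <- c(10, 0, 5, 1)
age_end <- infer_end(age_start)
data.frame(age_start, age_end)
```

Because each end is taken from the next start in sorted order, any overlapping interval whose start is present in the data splits its neighbours, which is why only the most detailed intervals should be supplied.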
gen_length generates a new column {col_stem}_length for the length of
each interval. Input data dt for gen_length must contain
'{col_stem}_start' and '{col_stem}_end' columns.
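As a minimal sketch, assuming the length is simply the end minus the start of each interval, the calculation looks like this in base R. The vectors here are made-up illustration data, not output from gen_length.

```r
# Illustration data: starts and ends of four left-closed, right-open intervals.
age_start <- c(0, 1, 5, 10)
age_end <- c(1, 5, 10, Inf)

# Length of each interval; an interval with an infinite endpoint has
# infinite length.
age_length <- age_end - age_start
age_length
```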
gen_name generates a new column {col_stem}_name describing each interval.
Formatting style for intervals:
\([a, b)\) interval notation is used when format = "interval".
a to b is used when format = "to".
a-b is used when format = "dash".
Formatting style for infinite endpoint interval:
\([a, Inf)\) interval notation is used when format = "interval".
a plus is used when format_infinite = "plus".
a+ is used when format_infinite = "+".
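The naming rules listed above can be sketched in base R as follows. format_interval is a hypothetical helper written for illustration; it mirrors the documented format and format_infinite options but is not the code behind gen_name.

```r
# Hypothetical helper (not part of hierarchyUtils): build interval names
# from start and end vectors following the documented formatting rules.
format_interval <- function(start, end, format = "to",
                            format_infinite = "plus") {
  finite_name <- switch(format,
    interval = paste0("[", start, ", ", end, ")"),
    to = paste(start, "to", end),
    dash = paste0(start, "-", end),
    stop("unknown format: ", format)
  )
  # Interval notation already covers infinite endpoints, so
  # format_infinite is ignored in that case.
  if (format == "interval") {
    return(finite_name)
  }
  infinite_name <- switch(format_infinite,
    plus = paste(start, "plus"),
    "+" = paste0(start, "+"),
    stop("unknown format_infinite: ", format_infinite)
  )
  ifelse(is.infinite(end), infinite_name, finite_name)
}

start <- c(0, 5, 10)
end <- c(5, 10, Inf)
format_interval(start, end)
format_interval(start, end, format = "dash", format_infinite = "+")
format_interval(start, end, format = "interval")
```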
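library(hierarchyUtils)

# One row per single-year age group, identified by the start of each interval.
input_dt <- data.table::data.table(
  location = "France", year = 2010, sex = "female",
  age_start = 0:95,
  value1 = 1, value2 = 2
)
id_cols <- c("location", "year", "sex", "age_start")

# Each call adds a new column to input_dt by reference.
gen_end(input_dt, id_cols, col_stem = "age")
gen_length(input_dt, col_stem = "age")
gen_name(input_dt, col_stem = "age")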