Generate columns to help describe numeric variable intervals

hierarchyUtils assumes numeric interval variables are grouped into left-closed, right-open intervals, \(a \le x < b\). Each interval can be described by its endpoints, from which interval lengths and nicely formatted interval names can be created.

gen_end(dt, id_cols, col_stem, right_most_endpoint = Inf)

gen_length(dt, col_stem)

gen_name(
  dt,
  col_stem,
  format = "to",
  format_infinite = "plus",
  right_most_endpoint = Inf
)

Arguments

dt: [data.table()]
col_stem-specific data.
id_cols: [character()]
ID columns that uniquely identify each row of dt. This must include '{col_stem}_start'.
col_stem: [character(1)]
Base name of the numeric variable column. Does not include the '_start', '_end' etc. suffix.
right_most_endpoint: [numeric(1)]
Assumed right most endpoint of '{col_stem}_end'. Default is Inf.
format: [character(1)]
Formatting style for the interval names. Default is 'to'; can also be 'interval' or 'dash'.
format_infinite: [character(1)]
Formatting style for infinite endpoint intervals. Default is 'plus'; can also be '+'. Ignored when format = "interval".

Value

Invisibly returns a reference to the modified dt.

Details

gen_end generates a new column '{col_stem}_end' for the right-open endpoint of each interval from a series of left-closed endpoints '{col_stem}_start'.

gen_end assumes that only the most detailed intervals are present in the input dataset; including overlapping intervals will not return the expected results. For example, if you had intervals of 0-5, 5-10, 2-7, 10+ in dt (but only the start of each interval is provided to the dataset), the starts would be 0, 2, 5 and 10, so the inferred intervals would be 0-2, 2-5, 5-10, 10+.
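The way end points follow from start points can be sketched with plain data.table calls. This is a minimal illustration under the assumption that each end is simply the next-largest start, with the last interval closed off by the assumed rightmost endpoint; it is not the package's implementation, and the toy column values are made up for the example.

```r
library(data.table)

# Illustrative data only: unsorted interval starts for a single group.
dt <- data.table(age_start = c(5, 0, 10, 1))

# Sort by the start column so that neighbouring rows are adjacent intervals.
setkeyv(dt, "age_start")

# Each interval ends where the next one starts; the last interval gets the
# assumed rightmost endpoint (Inf here, mirroring the documented default).
dt[, age_end := shift(age_start, type = "lead", fill = Inf)]

print(dt)
```

With more ID columns, the same shift would presumably be applied within each combination of the remaining ID columns (for example via `by =`), which is why every row must be uniquely identified.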

Input data dt for gen_end must:

Contain all columns specified in id_cols.
Have a column called '{col_stem}_start'.
Have each row uniquely identified by each combination of id_cols.

gen_length generates a new column '{col_stem}_length' for the length of each interval. Input data dt for gen_length must contain '{col_stem}_start' and '{col_stem}_end' columns.
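As a rough sketch of what an interval length means here, assuming it is simply the end minus the start (the column values below are illustrative, and this is not the package's own code):

```r
library(data.table)

# Illustrative intervals: [0, 1), [1, 5) and the open-ended [5, Inf).
dt <- data.table(age_start = c(0, 1, 5), age_end = c(1, 5, Inf))

# Length of a left-closed, right-open interval is end minus start;
# an interval with an infinite end has infinite length.
dt[, age_length := age_end - age_start]

print(dt)
```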

gen_name generates a new column '{col_stem}_name' describing each interval.

Formatting style for intervals:

\([a, b)\) interval notation is used when format = "interval".
a to b is used when format = "to".
a-b is used when format = "dash".

Formatting style for infinite endpoint interval:

\([a, Inf)\) interval notation is used when format = "interval".
a plus is used when format_infinite = "plus".
a+ is used when format_infinite = "+".
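The formatting rules above can be mimicked with a small standalone helper. `name_interval` is a hypothetical function written for this illustration, not part of hierarchyUtils, and the exact strings produced by gen_name may differ in detail.

```r
# Hypothetical helper that mirrors the documented naming rules.
name_interval <- function(start, end, format = "to", format_infinite = "plus") {
  # Interval notation covers finite and infinite ends alike,
  # so format_infinite is ignored in this branch.
  if (format == "interval") {
    return(paste0("[", start, ", ", end, ")"))
  }
  # Separator for finite intervals: "a to b" or "a-b".
  sep <- if (format == "to") " to " else "-"
  # Suffix for open-ended intervals: "a plus" or "a+".
  inf_suffix <- if (format_infinite == "plus") " plus" else "+"
  ifelse(is.infinite(end),
         paste0(start, inf_suffix),
         paste0(start, sep, end))
}

name_interval(c(0, 5), c(5, Inf))
name_interval(c(0, 5), c(5, Inf), format = "dash", format_infinite = "+")
name_interval(c(0, 5), c(5, Inf), format = "interval")
```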

Examples

input_dt <- data.table::data.table(location = "France", year = 2010,
                                   sex = "female",
                                   age_start = 0:95,
                                   value1 = 1, value2 = 2)
id_cols <- c("location", "year", "sex", "age_start")
gen_end(input_dt, id_cols, col_stem = "age")
gen_length(input_dt, col_stem = "age")
gen_name(input_dt, col_stem = "age")