hierarchyUtils assumes numeric interval variables are grouped into left-closed, right-open intervals, \(a \le x < b\). Each interval is described by its endpoints, from which interval lengths and more readable interval names can be generated.

gen_end(dt, id_cols, col_stem, right_most_endpoint = Inf)

gen_length(dt, col_stem)

gen_name(
  dt,
  col_stem,
  format = "to",
  format_infinite = "plus",
  right_most_endpoint = Inf
)

Arguments

dt

[data.table()]
col_stem-specific data.

id_cols

[character()]
ID columns that uniquely identify each row of dt. This must include '{col_stem}_start'.

col_stem

[character(1)]
Base name of the numeric variable column. Does not include the '_start', '_end' etc. suffix.

right_most_endpoint

[numeric(1)]
Assumed right most endpoint of '{col_stem}_end'. Default is Inf.

format

[character(1)]
Formatting style for the interval names. Default is 'to'; can also be 'interval' or 'dash'.

format_infinite

[character(1)]
Formatting style for infinite endpoint intervals. Default is 'plus'; can also be '+'. Ignored when format = "interval".

Value

Invisibly returns a reference to the modified dt.

Details

gen_end generates a new column '{col_stem}_end' for the right-open endpoint of each interval from a series of left-closed endpoints '{col_stem}_start'.
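
As a rough illustration of that inference (a base R sketch, not the package's implementation), each interval can be taken to end where the next sorted start begins, with the last interval closed off by right_most_endpoint. The names starts and inferred are hypothetical and used only here.

```r
# Hypothetical left-closed start points for a single group of id columns.
starts <- c(0, 1, 5, 10)
right_most_endpoint <- Inf  # mirrors the documented default

# Each interval ends where the next sorted start begins;
# the final interval ends at right_most_endpoint.
sorted_starts <- sort(starts)
ends <- c(sorted_starts[-1], right_most_endpoint)

inferred <- data.frame(age_start = sorted_starts, age_end = ends)
print(inferred)
```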

gen_end assumes that only the most detailed intervals are present in the input dataset; overlapping intervals will not return the expected results. For example, if dt represented the intervals 0-5, 5-10, 2-7 and 10+, only the start of each interval (0, 5, 2, 10) would be available to gen_end, so the inferred intervals would be 0-2, 2-5, 5-10 and 10+ rather than the original ones.

Input data dt for gen_end must:

  • Contain all columns specified in id_cols.

  • Have a column called '{col_stem}_start'.

  • Have each row uniquely identified by each combination of id_cols.
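
The three requirements above could be checked before calling gen_end with something like the following base R sketch. The data frame and the variable names are hypothetical, and the package may perform its own, different validation.

```r
# Hypothetical input and id columns, for illustration only.
dt <- data.frame(location = "France", year = 2010, age_start = c(0, 5, 10))
id_cols <- c("location", "year", "age_start")
col_stem <- "age"
start_col <- paste0(col_stem, "_start")

has_id_cols <- all(id_cols %in% names(dt))        # every id column is present
has_start   <- start_col %in% names(dt)           # '{col_stem}_start' exists
start_in_id <- start_col %in% id_cols             # id_cols includes the start column
rows_unique <- anyDuplicated(dt[, id_cols]) == 0  # id_cols identify each row

stopifnot(has_id_cols, has_start, start_in_id, rows_unique)
```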

gen_length generates a new column '{col_stem}_length' for the length of each interval. Input data dt for gen_length must contain '{col_stem}_start' and '{col_stem}_end' columns.
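
A minimal sketch of the arithmetic, assuming the length is simply the end minus the start (so an interval with an infinite endpoint has infinite length); the vectors below are hypothetical.

```r
# Hypothetical endpoints for four left-closed, right-open intervals.
age_start <- c(0, 1, 5, 10)
age_end   <- c(1, 5, 10, Inf)

# Interval length as end minus start.
age_length <- age_end - age_start
print(age_length)
```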

gen_name generates a new column '{col_stem}_name' describing each interval.

Formatting style for intervals:

  • \([a, b)\) interval notation is used when format = "interval".

  • a to b is used when format = "to".

  • a-b is used when format = "dash".

Formatting style for infinite endpoint interval:

  • \([a, Inf)\) interval notation is used when format = "interval".

  • a plus is used when format_infinite = "plus".

  • a+ is used when format_infinite = "+".
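
The naming rules listed above can be sketched in base R as follows. sketch_name is a hypothetical helper written only for illustration; it is not part of hierarchyUtils and may differ from how gen_name builds its labels.

```r
# Hypothetical helper, for illustration only; not part of hierarchyUtils.
sketch_name <- function(start, end, format = "to", format_infinite = "plus") {
  if (format == "interval") {
    # Interval notation covers finite and infinite endpoints alike,
    # so format_infinite is ignored here.
    return(paste0("[", start, ", ", end, ")"))
  }
  sep <- if (format == "to") " to " else "-"
  finite_name <- paste0(start, sep, end)
  infinite_name <- if (format_infinite == "plus") {
    paste(start, "plus")
  } else {
    paste0(start, "+")
  }
  # Use the infinite-endpoint style only where the end is infinite.
  ifelse(is.infinite(end), infinite_name, finite_name)
}

sketch_name(c(0, 5, 95), c(5, 10, Inf))
sketch_name(c(0, 5, 95), c(5, 10, Inf), format = "dash", format_infinite = "+")
sketch_name(c(0, 5, 95), c(5, 10, Inf), format = "interval")
```

With these inputs the three calls give "0 to 5", "5 to 10", "95 plus"; then "0-5", "5-10", "95+"; then "[0, 5)", "[5, 10)", "[95, Inf)".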

Examples

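library(hierarchyUtils)

# One row per single-year age group; only the interval starts are supplied
input_dt <- data.table::data.table(location = "France", year = 2010,
                                   sex = "female",
                                   age_start = 0:95,
                                   value1 = 1, value2 = 2)
id_cols <- c("location", "year", "sex", "age_start")

# Each call modifies input_dt by reference
gen_end(input_dt, id_cols, col_stem = "age")  # adds age_end
gen_length(input_dt, col_stem = "age")        # adds age_length
gen_name(input_dt, col_stem = "age")          # adds age_name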