R/interval_formatting.R
gen_interval_cols.Rd
hierarchyUtils assumes numeric interval variables are grouped into left-closed, right-open intervals, \(a \le x < b\). Each interval is described by its endpoints, from which interval lengths and nicely formatted interval names can be generated.
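As a quick illustration of the left-closed, right-open convention, the sketch below uses base R's findInterval(), whose default settings use the same convention. The break points and values are made up for the example and hierarchyUtils is not involved.

```r
# Illustrative break points: intervals [0, 5), [5, 10), [10, Inf)
breaks <- c(0, 5, 10)
x <- c(0, 4.9, 5, 10)

# findInterval() is left-closed, right-open by default, so a value equal
# to a break point falls into the interval that starts at that break point
bin <- findInterval(x, breaks)
data.frame(x = x, interval_start = breaks[bin])
```

Note that 5 is assigned to the interval starting at 5, not to the one ending at 5.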
gen_end(dt, id_cols, col_stem, right_most_endpoint = Inf)
gen_length(dt, col_stem)
gen_name(
  dt,
  col_stem,
  format = "to",
  format_infinite = "plus",
  right_most_endpoint = Inf
)
dt
[data.table()] col_stem-specific data.
id_cols
[character()] ID columns that uniquely identify each row of dt. This must include '{col_stem}_start'.
col_stem
[character(1)] Base name of the numeric variable column. Does not include the '_start', '_end', etc. suffix.
right_most_endpoint
[numeric(1)] Assumed right-most endpoint of '{col_stem}_end'. Default is Inf.
format
[character(1)] Formatting style for the interval names. Default is 'to'; can also be 'interval' or 'dash'.
format_infinite
[character(1)] Formatting style for infinite endpoint intervals. Default is 'plus'; can also be '+'. Ignored when format = "interval".
Invisibly returns a reference to the modified dt.
gen_end generates a new column '{col_stem}_end' for the right-open endpoint of each interval from a series of left-closed endpoints '{col_stem}_start'.
gen_end assumes that only the most detailed intervals are present in the input dataset; overlapping intervals will not return the expected results. For example, if dt held the intervals 0-5, 5-10, 2-7 and 10+ (with only the start of each interval provided), the inferred intervals would be 0-2, 2-5, 5-10 and 10+.
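The effect of overlapping intervals can be sketched in base R. This is an illustration only, not the hierarchyUtils implementation: infer_ends is a hypothetical helper, and it assumes that each end is simply the next larger unique start.

```r
# Hypothetical helper (not part of hierarchyUtils): infer right-open ends
# from left-closed starts by taking the next larger unique start.
infer_ends <- function(starts, right_most_endpoint = Inf) {
  unique_starts <- sort(unique(starts))
  # the end of each interval is the following start; the last interval
  # ends at `right_most_endpoint`
  unique_ends <- c(unique_starts[-1], right_most_endpoint)
  unique_ends[match(starts, unique_starts)]
}

# starts of the overlapping intervals 0-5, 5-10, 2-7, 10+
starts <- c(0, 5, 2, 10)
ends <- infer_ends(starts)
data.frame(start = starts, end = ends)
```

Because only the starts are known, the original 0-5 and 2-7 intervals are not recovered; the overlap is silently split at the start values.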
Input data dt for gen_end must:
Contain all columns specified in id_cols.
Have a column called '{col_stem}_start'.
Have each row uniquely identified by each combination of id_cols.
gen_length generates a new column '{col_stem}_length' for the length of each interval. Input data dt for gen_length must contain '{col_stem}_start' and '{col_stem}_end' columns.
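A minimal base R sketch of the length calculation, assuming the length is the difference between the two endpoints (the vectors below are illustrative and hierarchyUtils is not called):

```r
# Illustrative endpoints for the intervals [0, 1), [1, 5), [5, 10), [95, Inf)
age_start <- c(0, 1, 5, 95)
age_end <- c(1, 5, 10, Inf)

# length of a left-closed, right-open interval is end minus start;
# an interval with an infinite right endpoint has infinite length
age_length <- age_end - age_start
data.frame(age_start, age_end, age_length)
```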
gen_name generates a new column '{col_stem}_name' describing each interval.
Formatting style for intervals:
\([a, b)\) interval notation is used when format = "interval".
a to b is used when format = "to".
a-b is used when format = "dash".
Formatting style for infinite endpoint intervals:
\([a, Inf)\) interval notation is used when format = "interval".
a plus is used when format_infinite = "plus".
a+ is used when format_infinite = "+".
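The naming styles can be mimicked with a few lines of base R. The helper below, format_interval_name, is hypothetical and written only to illustrate the rules listed above; it is not how gen_name is implemented and it does no input validation.

```r
# Hypothetical helper (not part of hierarchyUtils) that mimics the naming
# styles described above for vectors of interval endpoints.
format_interval_name <- function(start, end, format = "to",
                                 format_infinite = "plus") {
  if (format == "interval") {
    # interval notation is used for finite and infinite endpoints alike,
    # so `format_infinite` is ignored here
    return(paste0("[", start, ", ", end, ")"))
  }
  separator <- if (format == "to") " to " else "-"
  suffix <- if (format_infinite == "plus") " plus" else "+"
  ifelse(is.infinite(end),
         paste0(start, suffix),
         paste0(start, separator, end))
}

# intervals [0, 5) and [5, Inf) in each style
format_interval_name(c(0, 5), c(5, Inf))
format_interval_name(c(0, 5), c(5, Inf), format = "dash", format_infinite = "+")
format_interval_name(c(0, 5), c(5, Inf), format = "interval")
```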
# single-year age groups starting at 0, 1, ..., 95
input_dt <- data.table::data.table(
  location = "France", year = 2010,
  sex = "female",
  age_start = 0:95,
  value1 = 1, value2 = 2
)
id_cols <- c("location", "year", "sex", "age_start")
# each call adds its column to input_dt by reference
gen_end(input_dt, id_cols, col_stem = "age")
gen_length(input_dt, col_stem = "age")
gen_name(input_dt, col_stem = "age")