Rank non-unique rows in a data.table using defined priority orders

prioritize_dt(
  dt,
  rank_by_cols,
  unique_id_cols = rank_by_cols,
  rank_order,
  warn_missing_levels = FALSE,
  warn_non_unique_priority = FALSE,
  check_top_priority_unique_only = FALSE
)

Arguments

dt	[`data.table()`] Data to determine rank priority for.
rank_by_cols	[`character()`] Apply `rank_order` priorities to each unique combination of `rank_by_cols` in `dt`. This should be equal to or a subset of `unique_id_cols`.
unique_id_cols	[`character()`] ID columns that, once ranked by priority, uniquely identify rows of `dt` in combination with the priority column. This should be equal to or a superset of `rank_by_cols`. Default is equal to `rank_by_cols`.
rank_order	[`list()`] Named [`list()`] defining the priority order to use when ranking non-unique rows. Each element of `rank_order` corresponds to a column in `dt`, and the prioritization is applied in the order of the elements in `rank_order`. Possible values for each element are '1' (ascending), '-1' (descending), or ordered factor levels when the column is not numeric. See details for more information.
warn_missing_levels	[`logical(1)`] Whether to warn about missing levels for elements of `rank_order` instead of throwing an error. Default is 'FALSE', which errors out if there are missing levels.
warn_non_unique_priority	[`logical(1)`] Whether to warn, instead of throwing an error, when the specified `rank_by_cols` & `rank_order` lead to non-unique rows of `dt` after generating the 'priority' column. Default is 'FALSE', which errors out if there are non-unique rows.
check_top_priority_unique_only	[`logical(1)`] When checking for non-unique rows of `dt` after generating the 'priority' column with the `rank_by_cols` & names of `rank_order`, only check the priority=1 rows. This is useful when the specified `rank_order` levels are not exhaustive, leading to 'NA' priorities for some rows. Default is 'FALSE', which errors out if there are any non-unique rows.

Value

dt with a new 'priority' column generated using the rules specified in rank_order. A 'priority' equal to 1 is the highest priority.

Details

prioritize_dt uses data.table::setorderv to order dt according to rank_order. prioritize_dt takes three possible values to specify the order of a column in dt.

'1', order a numeric column in ascending order (smaller values have higher priority).
'-1', order a numeric column in descending order (larger values have higher priority).
factor levels, to order a categorical column in a custom order with the first level having highest priority. When not all present values of the column are defined in the levels, the priority for those rows will be NA; by default this is an error, or a warning if warn_missing_levels = TRUE.

The order of elements in rank_order matters. The more important rules should be placed earlier in rank_order so that they are applied first.
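
The ordering rules above can be sketched with data.table alone. This is not the implementation of prioritize_dt, only an illustration under stated assumptions: the toy table, the `method_rank` helper column, and the use of `match()` to turn factor levels into ranks are all hypothetical choices made for this sketch.

```r
library(data.table)

# Toy data (hypothetical): one location-year with several candidate sources.
toy <- data.table(
  location = "USA",
  year = 2000,
  method = c("de jure", "de facto", "de facto", "census"),
  n_age_groups = c(9, 1, 9, 9)
)

# Factor-level rule: the position of a value in the level vector is its rank.
# Values not listed in the levels ("census") get an NA rank.
method_levels <- c("de facto", "de jure")
toy[, method_rank := match(method, method_levels)]

# Apply the rules in order: method first (ascending rank), then
# n_age_groups descending (-1). Rows with NA ranks are sorted last.
setorderv(
  toy,
  cols = c("location", "year", "method_rank", "n_age_groups"),
  order = c(1, 1, 1, -1),
  na.last = TRUE
)

# Number the rows within each rank_by group; 1 is the highest priority.
# Rows whose value was not in the levels keep an NA priority (assumption
# made here to mirror the behavior described above).
toy[, priority := seq_len(.N), by = c("location", "year")]
toy[is.na(method_rank), priority := NA_integer_]
toy[, method_rank := NULL]

print(toy)
```

Because `method` is listed before `n_age_groups`, a 'de facto' row with a single age group still outranks a 'de jure' row with nine; swapping the two rules would reverse that.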

Examples

# preliminary data with only total population
dt_total <- data.table::CJ(
  location = "USA", year = 2000, age_start = 0, age_end = Inf,
  method = c("de facto", "de jure"),
  status = c("preliminary")
)
# final data in 10 year age groups
dt_10_yr_groups <- data.table::CJ(
  location = "USA", year = 2000, age_start = seq(0, 80, 10),
  method = c("de facto", "de jure"),
  status = c("final")
)
dt_10_yr_groups[, age_end := age_start + 10]
#>     location year age_start   method status age_end
#>  1:      USA 2000         0 de facto  final      10
#>  2:      USA 2000         0  de jure  final      10
#>  3:      USA 2000        10 de facto  final      20
#>  4:      USA 2000        10  de jure  final      20
#>  5:      USA 2000        20 de facto  final      30
#>  6:      USA 2000        20  de jure  final      30
#>  7:      USA 2000        30 de facto  final      40
#>  8:      USA 2000        30  de jure  final      40
#>  9:      USA 2000        40 de facto  final      50
#> 10:      USA 2000        40  de jure  final      50
#> 11:      USA 2000        50 de facto  final      60
#> 12:      USA 2000        50  de jure  final      60
#> 13:      USA 2000        60 de facto  final      70
#> 14:      USA 2000        60  de jure  final      70
#> 15:      USA 2000        70 de facto  final      80
#> 16:      USA 2000        70  de jure  final      80
#> 17:      USA 2000        80 de facto  final      90
#> 18:      USA 2000        80  de jure  final      90
dt_10_yr_groups[age_start == 80, age_end := Inf]
#>     location year age_start   method status age_end
#>  1:      USA 2000         0 de facto  final      10
#>  2:      USA 2000         0  de jure  final      10
#>  3:      USA 2000        10 de facto  final      20
#>  4:      USA 2000        10  de jure  final      20
#>  5:      USA 2000        20 de facto  final      30
#>  6:      USA 2000        20  de jure  final      30
#>  7:      USA 2000        30 de facto  final      40
#>  8:      USA 2000        30  de jure  final      40
#>  9:      USA 2000        40 de facto  final      50
#> 10:      USA 2000        40  de jure  final      50
#> 11:      USA 2000        50 de facto  final      60
#> 12:      USA 2000        50  de jure  final      60
#> 13:      USA 2000        60 de facto  final      70
#> 14:      USA 2000        60  de jure  final      70
#> 15:      USA 2000        70 de facto  final      80
#> 16:      USA 2000        70  de jure  final      80
#> 17:      USA 2000        80 de facto  final     Inf
#> 18:      USA 2000        80  de jure  final     Inf

input_dt <- rbind(dt_total, dt_10_yr_groups)
input_dt[, n_age_groups := .N, by = setdiff(names(input_dt), c("age_start", "age_end"))]
#>     location year age_start age_end   method      status n_age_groups
#>  1:      USA 2000         0     Inf de facto preliminary            1
#>  2:      USA 2000         0     Inf  de jure preliminary            1
#>  3:      USA 2000         0      10 de facto       final            9
#>  4:      USA 2000         0      10  de jure       final            9
#>  5:      USA 2000        10      20 de facto       final            9
#>  6:      USA 2000        10      20  de jure       final            9
#>  7:      USA 2000        20      30 de facto       final            9
#>  8:      USA 2000        20      30  de jure       final            9
#>  9:      USA 2000        30      40 de facto       final            9
#> 10:      USA 2000        30      40  de jure       final            9
#> 11:      USA 2000        40      50 de facto       final            9
#> 12:      USA 2000        40      50  de jure       final            9
#> 13:      USA 2000        50      60 de facto       final            9
#> 14:      USA 2000        50      60  de jure       final            9
#> 15:      USA 2000        60      70 de facto       final            9
#> 16:      USA 2000        60      70  de jure       final            9
#> 17:      USA 2000        70      80 de facto       final            9
#> 18:      USA 2000        70      80  de jure       final            9
#> 19:      USA 2000        80     Inf de facto       final            9
#> 20:      USA 2000        80     Inf  de jure       final            9

output_dt <- prioritize_dt(
  dt = input_dt,
  rank_by_cols = c("location", "year"),
  unique_id_cols = c("location", "year", "age_start", "age_end"),
  rank_order = list(
    method = c("de facto", "de jure"), # prioritize 'de facto' sources highest
    n_age_groups = -1 # prioritize sources with more age groups
  )
)