Simulate a high-cardinality feature and a binary response

Usage

sim_postcode_samples(
  df_levels,
  n = 2000L,
  threshold = 1000,
  prob = c(0.3, 0.1),
  seed = 1001
)

Arguments

df_levels

A data frame of postal code levels, such as the output of sim_postcode_levels().

n

Number of samples.

threshold

The threshold for determining if a postal code is rare.

prob

A numeric vector of length 2 giving the probability of the class 1 event: the first element applies to rare postal codes, the second to non-rare postal codes.

seed

Random seed.

Value

A data frame with one row per sample and three columns: postcode (the postal code), label (the binary response), and is_rare (whether the postal code is a rare level).

Note

The code is derived from the example described in the "rare levels" vignette in the vtreat package.

Examples

df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
df_postcode <- sim_postcode_samples(
  df_levels,
  n = 10000, threshold = 3000, prob = c(0.2, 0.1), seed = 43
)
head(df_postcode)
#>   postcode label is_rare
#> 1   z03139     0   FALSE
#> 2   z01208     0   FALSE
#> 3   z04309     0    TRUE
#> 4   z00303     0   FALSE
#> 5   z03428     0   FALSE
#> 6   z01759     0   FALSE
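
The role of prob can be illustrated with a minimal base R sketch that does not call the package. It assumes that each label is an independent Bernoulli draw whose probability is prob[1] for rare postal codes and prob[2] for non-rare ones, as the argument description suggests; the actual implementation may differ. The rarity flags are drawn at random purely for illustration (the real function derives them from df_levels and threshold), and the names p_event, df_toy, and rates are hypothetical.

```r
# Base R sketch only; it does not reproduce sim_postcode_samples() itself.
set.seed(43)

n <- 10000L
prob <- c(0.2, 0.1) # assumed order: c(rare, non-rare)

# Toy rarity flags (assumption): roughly 10% of samples fall in rare levels.
is_rare <- sample(c(TRUE, FALSE), size = n, replace = TRUE, prob = c(0.1, 0.9))

# One Bernoulli draw per sample: prob[1] for rare rows, prob[2] for the rest.
p_event <- ifelse(is_rare, prob[1], prob[2])
label <- rbinom(n, size = 1L, prob = p_event)

df_toy <- data.frame(label = label, is_rare = is_rare)

# Empirical class 1 rate within each rarity group.
rates <- tapply(df_toy$label, df_toy$is_rare, mean)
print(rates)
```

With enough samples, the rate in the TRUE group should be close to prob[1] and the rate in the FALSE group close to prob[2]. The same tapply() check can be run on the label and is_rare columns of the data frame returned by sim_postcode_samples().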