Simulate a high-cardinality feature and a binary response
Usage
sim_postcode_samples(
  df_levels,
  n = 2000L,
  threshold = 1000,
  prob = c(0.3, 0.1),
  seed = 1001
)
Arguments
- df_levels
A data frame of postal code levels, such as the one returned by sim_postcode_levels() (see Examples).
- n
Number of samples.
- threshold
The threshold for determining if a postal code is rare.
- prob
Numeric vector of length 2 giving the probability of the class 1 event in rare and non-rare postal codes, in that order.
- seed
Random seed.
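As a rough illustration of how threshold and prob could interact, the base R sketch below flags levels as rare and then draws labels with a level-dependent probability. It is not the package's implementation. The helper name sketch_postcode_samples(), the stand-in table levels_tbl with its postcode and size columns, and the rule that a level counts as rare when its size is below threshold are all assumptions made for this example.

```r
# Hypothetical stand-in for the levels table; the column names are assumptions.
set.seed(42)
levels_tbl <- data.frame(
  postcode = sprintf("z%05d", sample.int(99999, size = 50)),
  size = round(rlnorm(50, meanlog = log(4000), sdlog = 1)),
  stringsAsFactors = FALSE
)

# Illustrative helper, not the package function.
sketch_postcode_samples <- function(levels_tbl, n, threshold, prob, seed) {
  set.seed(seed)
  # Assumption: a level is rare when its size is below `threshold`.
  rare_codes <- levels_tbl$postcode[levels_tbl$size < threshold]
  # Draw levels with probability proportional to their size.
  idx <- sample.int(
    nrow(levels_tbl), size = n, replace = TRUE, prob = levels_tbl$size
  )
  postcode <- levels_tbl$postcode[idx]
  is_rare <- postcode %in% rare_codes
  # prob[1] applies to rare levels, prob[2] to all other levels.
  p <- ifelse(is_rare, prob[1], prob[2])
  data.frame(
    postcode = postcode,
    label = rbinom(n, size = 1, prob = p),
    is_rare = is_rare,
    stringsAsFactors = FALSE
  )
}

df_sketch <- sketch_postcode_samples(
  levels_tbl,
  n = 2000L, threshold = 1000, prob = c(0.3, 0.1), seed = 1001
)
head(df_sketch)
```

Under these assumptions, raising threshold marks more levels as rare, so a larger share of rows is labelled using prob[1] instead of prob[2].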
Note
The code is derived from the example described in the "rare levels" vignette in the vtreat package.
Examples
df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
df_postcode <- sim_postcode_samples(
  df_levels,
  n = 10000, threshold = 3000, prob = c(0.2, 0.1), seed = 43
)
head(df_postcode)
#>   postcode label is_rare
#> 1   z03139     0   FALSE
#> 2   z01208     0   FALSE
#> 3   z04309     0    TRUE
#> 4   z00303 0   FALSE
#> 5   z03428     0   FALSE
#> 6   z01759     0   FALSE
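One way to check the effect of prob is to compare the share of class 1 labels between rare and non-rare rows. The sketch below does this in base R on a stand-in data frame, df_check, whose values are simulated here purely for illustration; only the label and is_rare column names are taken from the output above.

```r
# Stand-in data with the same label and is_rare columns (values are made up).
set.seed(1)
is_rare <- runif(10000) < 0.15
df_check <- data.frame(
  label = rbinom(10000, size = 1, prob = ifelse(is_rare, 0.2, 0.1)),
  is_rare = is_rare
)

# Share of class 1 among non-rare (FALSE) and rare (TRUE) rows.
rates <- tapply(df_check$label, df_check$is_rare, mean)
rates
```

The same tapply() call should work on df_postcode, where the two shares would be expected to lie near the values passed in prob.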