Sparse Index Tracking with msaenet and CVXR: A Two-Stage Regression Approach

The code to reproduce the results in this post is also available here.

The illusion of choice. Photo by Kenny Eliason.

Disclaimer

The content in this blog post is for educational and research purposes only. It is not intended to be investment advice, and the author is not a licensed financial advisor. Any investment decisions should be based on your own analysis and consultation with a licensed financial advisor. The author is not responsible for any financial losses or damages resulting from the use of this information.

Background

I have always appreciated recommender system algorithms because they tackle information overload, a problem that is prevalent in modern life. Sparse index tracking has a similar flavor: it aims to track a financial index using only a limited number of equities. It addresses a problem similar to portfolio selection and optimization, with further implications for balancing investment returns and controlling risk.

The topic of sparse portfolios has been discussed in, for example, Brodie et al. (2009) and Benidis, Feng, and Palomar (2018). In the Benidis paper and the accompanying (amazing) R package sparseIndexTracking, the sparse index tracking problem is formulated as a constrained, \(\ell_0\)-norm penalized regression:

\[ \begin{array}{c} \min\limits_{\beta} \| y - X\beta \|_2^2 + \lambda \|\beta\|_0\\ \text{s.t.} \ \sum_i \beta_i = 1, \ \beta_i \geq 0 \end{array} \]

where \(y\) can be the daily returns of the index we are tracking, \(X\) is the matrix of daily returns of the (large number of) candidate assets, and \(\beta\) is the vector of parameters (the portfolio weights) to be estimated. We can also optimize for different types of tracking errors in this framework.
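
To make the objective concrete, here is a minimal base R sketch that evaluates it for a candidate weight vector. The helper name `l0_objective`, the toy data, and the tolerance used to decide which weights count as non-zero are all assumptions made for illustration; this is not how sparseIndexTracking solves the problem.

```r
# Evaluate the penalized objective for a given weight vector (illustrative only).
l0_objective <- function(X, y, beta, lambda, tol = 1e-8) {
  # Feasibility: long-only weights that sum to one
  stopifnot(all(beta >= 0), abs(sum(beta) - 1) < 1e-8)
  rss <- sum((y - X %*% beta)^2) # squared tracking residuals
  l0 <- sum(abs(beta) > tol)     # number of assets actually held
  rss + lambda * l0
}

# Toy data: 3 assets, 4 days; the "index" is an equal mix of assets 1 and 2.
X <- matrix(
  c(0.01, -0.02, 0.03, 0.00,
    0.02, 0.01, -0.01, 0.01,
    -0.01, 0.00, 0.02, -0.02),
  ncol = 3
)
y <- as.vector(X %*% c(0.5, 0.5, 0))

# Zero residual, two assets held
obj_two <- l0_objective(X, y, beta = c(0.5, 0.5, 0), lambda = 0.1)
# One asset held, but a non-zero residual
obj_one <- l0_objective(X, y, beta = c(1, 0, 0), lambda = 0.1)
c(obj_two, obj_one)
```

The penalty term counts held assets, so \(\lambda\) sets how much extra residual we are willing to accept for each asset dropped from the portfolio.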

In the experiments demonstrated by the package on S&P 500 returns from 2010, they were able to track the index with 40–50 assets. I have always been curious about two things, though. First, can we get an even sparser portfolio, since 40 assets still feels like a lot to me? Second, could we use the simpler (optimization-wise) two-step approach briefly described in their paper, where asset selection and capital allocation are done separately? At the end of the analysis below, my conclusion is that both are possible, but not without potential trade-offs, at least in terms of tracking error.

Data

The dataset is the 2010 S&P 500 returns used in the sparseIndexTracking vignette. Load the data and split it into a training set and a test set, each covering half of the 252 trading days:

library("xts")

index2010 <- sparseIndexTracking::INDEX_2010
x_tr <- index2010$X[1:126]
x_te <- index2010$X[127:252]
r_tr <- index2010$SP500[1:126]
r_te <- index2010$SP500[127:252]

Create wrapper functions for P&L plotting using ggplot2, cowplot, and ggsci:

df_pnl <- function(x_te, beta, y_test, title) {
  df <- cbind(
    cumprod(1 + x_te %*% beta),
    cumprod(1 + y_test)
  )
  names(df)[1] <- title
  df
}

plot_pnl <- function(object, title = "Cumulative P&L") {
  ggplot2::autoplot(object, facets = NULL) +
    ggplot2::ggtitle(title) +
    cowplot::theme_minimal_hgrid() +
    ggsci::scale_color_d3() +
    ggplot2::scale_y_continuous(breaks = seq(1, 1.5, .05), limits = c(1, 1.4)) +
    ggplot2::scale_x_date(date_breaks = "1 months", date_labels = "%b") +
    ggplot2::theme(
      axis.title.x = ggplot2::element_blank(),
      legend.position = "bottom"
    )
}
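
Both series in `df_pnl()` are compounded with `cumprod(1 + r)`, which turns simple daily returns into the growth of one unit of capital. A tiny base R illustration with made-up returns (the numbers are invented, not taken from the dataset):

```r
# Made-up daily simple returns, purely for illustration.
r <- c(0.10, -0.50, 0.20)

# Growth of 1 unit of capital, compounding day by day.
wealth <- cumprod(1 + r)
wealth
```

Note that simply summing the returns would not give the same final value, which is why the plots use compounding.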

Previous models

The original vignette used an \(\ell_0\) penalty and optimized for four different tracking measures: empirical tracking error, downside risk, Huber empirical tracking error, and Huber downside risk. I will only show the empirical tracking error (ETE) model here:

w_ete <- sparseIndexTracking::spIndexTrack(
  x_tr,
  r = r_tr, lambda = 1e-7, u = 0.5, measure = "ete"
)

names(w_ete[w_ete > 1e-6])
#>  [1] "AMZN UW Equity"  "ANTM UN Equity"  "AXP UN Equity"   "BAC UN Equity"  
#>  [5] "BLL UN Equity"   "CCL UN Equity"   "CSCO UW Equity"  "CVX UN Equity"  
#>  [9] "EQT UN Equity"   "EXPD UW Equity"  "FCX UN Equity"   "FOXA UW Equity" 
#> [13] "GIS UN Equity"   "GOOGL UW Equity" "HBAN UW Equity"  "HPQ UN Equity"  
#> [17] "IBM UN Equity"   "INTC UW Equity"  "IP UN Equity"    "JNJ UN Equity"  
#> [21] "JPM UN Equity"   "KO UN Equity"    "LNC UN Equity"   "LOW UN Equity"  
#> [25] "MCD UN Equity"   "NVDA UW Equity"  "OXY UN Equity"   "PAYX UW Equity" 
#> [29] "PEP UN Equity"   "PFE UN Equity"   "PNW UN Equity"   "QCOM UW Equity" 
#> [33] "RAI UN Equity"   "SE UN Equity"    "SJM UN Equity"   "SNDK UW Equity" 
#> [37] "SPLS UW Equity"  "SYMC UW Equity"  "T UN Equity"     "THC UN Equity"  
#> [41] "UTX UN Equity"   "VZ UN Equity"    "WFC UN Equity"   "XRAY UW Equity" 
#> [45] "ZION UW Equity"

The detailed portfolio, with weights sorted in decreasing order:

data.frame(sort(w_ete, decreasing = TRUE), fix.empty.names = FALSE)
#>                               
#> JPM UN Equity      0.055246838
#> CVX UN Equity      0.051502618
#> JNJ UN Equity      0.041348636
#> T UN Equity        0.038607714
#> SJM UN Equity      0.035814772
#> MCD UN Equity      0.035779333
#> PFE UN Equity      0.035644779
#> HPQ UN Equity      0.032115541
#> PNW UN Equity      0.029731833
#> LOW UN Equity      0.029536601
#> GOOGL UW Equity    0.029102407
#> FOXA UW Equity     0.029005445
#> SE UN Equity       0.028251061
#> CSCO UW Equity     0.026028571
#> UTX UN Equity      0.024787528
#> PEP UN Equity      0.024660300
#> XRAY UW Equity     0.024074370
#> VZ UN Equity       0.023977678
#> FCX UN Equity      0.023237096
#> SYMC UW Equity     0.022017250
#> INTC UW Equity     0.020711075
#> PAYX UW Equity     0.020476856
#> IBM UN Equity      0.020418841
#> ANTM UN Equity     0.019794073
#> OXY UN Equity      0.019656921
#> LNC UN Equity      0.018175226
#> KO UN Equity       0.017535734
#> EQT UN Equity      0.017435204
#> BAC UN Equity      0.017309960
#> IP UN Equity       0.016389945
#> NVDA UW Equity     0.016203891
#> CCL UN Equity      0.016144022
#> QCOM UW Equity     0.015572908
#> GIS UN Equity      0.015380622
#> AXP UN Equity      0.014981645
#> SPLS UW Equity     0.012327110
#> BLL UN Equity      0.012045603
#> HBAN UW Equity     0.011914526
#> AMZN UW Equity     0.011789461
#> ZION UW Equity     0.010449785
#> EXPD UW Equity     0.008405398
#> WFC UN Equity      0.008395477
#> THC UN Equity      0.007807481
#> SNDK UW Equity     0.005869511
#> RAI UN Equity      0.004338352
#> 1436513D UN Equity 0.000000000
#> 1500785D UN Equity 0.000000000
#> 1518855D US Equity 0.000000000
#> 9876566D UN Equity 0.000000000
#> A UN Equity        0.000000000
#> AA UN Equity       0.000000000
#> AAPL UW Equity     0.000000000
#> ABC UN Equity      0.000000000
#> ABT UN Equity      0.000000000
#> ADBE UW Equity     0.000000000
#> ADM UN Equity      0.000000000
#> ADP UW Equity      0.000000000
#> ADSK UW Equity     0.000000000
#> AEE UN Equity      0.000000000
#> AEP UN Equity      0.000000000
#> AES UN Equity      0.000000000
#> AET UN Equity      0.000000000
#> AFL UN Equity      0.000000000
#> AGN UN Equity      0.000000000
#> AIG UN Equity      0.000000000
#> AIV UN Equity      0.000000000
#> AIZ UN Equity      0.000000000
#> AKAM UW Equity     0.000000000
#> ALL UN Equity      0.000000000
#> ALTR UW Equity     0.000000000
#> AMAT UW Equity     0.000000000
#> AMGN UW Equity     0.000000000
#> AMP UN Equity      0.000000000
#> AMT UN Equity      0.000000000
#> AN UN Equity       0.000000000
#> AON UN Equity      0.000000000
#> APA UN Equity      0.000000000
#> APC UN Equity      0.000000000
#> APD UN Equity      0.000000000
#> APH UN Equity      0.000000000
#> ARG UN Equity      0.000000000
#> AVB UN Equity      0.000000000
#> AVY UN Equity      0.000000000
#> AZO UN Equity      0.000000000
#> BA UN Equity       0.000000000
#> BAX UN Equity      0.000000000
#> BBBY UW Equity     0.000000000
#> BBT UN Equity      0.000000000
#> BBY UN Equity      0.000000000
#> BCR UN Equity      0.000000000
#> BDX UN Equity      0.000000000
#> BEN UN Equity      0.000000000
#> BF/B UN Equity     0.000000000
#> BHI UN Equity      0.000000000
#> BIIB UW Equity     0.000000000
#> BK UN Equity       0.000000000
#> BMY UN Equity      0.000000000
#> BRCM UW Equity     0.000000000
#> BSX UN Equity      0.000000000
#> BXP UN Equity      0.000000000
#> C UN Equity        0.000000000
#> CA UW Equity       0.000000000
#> CAG UN Equity      0.000000000
#> CAH UN Equity      0.000000000
#> CAM UN Equity      0.000000000
#> CAT UN Equity      0.000000000
#> CBG UN Equity      0.000000000
#> CBS UN Equity      0.000000000
#> CCE UN Equity      0.000000000
#> CELG UW Equity     0.000000000
#> CF UN Equity       0.000000000
#> CHK UN Equity      0.000000000
#> CHRW UW Equity     0.000000000
#> CI UN Equity       0.000000000
#> CINF UW Equity     0.000000000
#> CL UN Equity       0.000000000
#> CLX UN Equity      0.000000000
#> CMA UN Equity      0.000000000
#> CMCSA UW Equity    0.000000000
#> CME UW Equity      0.000000000
#> CMI UN Equity      0.000000000
#> CMS UN Equity      0.000000000
#> CNP UN Equity      0.000000000
#> CNX UN Equity      0.000000000
#> COF UN Equity      0.000000000
#> COG UN Equity      0.000000000
#> COH UN Equity      0.000000000
#> COL UN Equity      0.000000000
#> COP UN Equity      0.000000000
#> COST UW Equity     0.000000000
#> CPB UN Equity      0.000000000
#> CRM UN Equity      0.000000000
#> CSX UN Equity      0.000000000
#> CTAS UW Equity     0.000000000
#> CTL UN Equity      0.000000000
#> CTSH UW Equity     0.000000000
#> CTXS UW Equity     0.000000000
#> CVS UN Equity      0.000000000
#> D UN Equity        0.000000000
#> DD UN Equity       0.000000000
#> DE UN Equity       0.000000000
#> DFS UN Equity      0.000000000
#> DGX UN Equity      0.000000000
#> DHI UN Equity      0.000000000
#> DHR UN Equity      0.000000000
#> DIS UN Equity      0.000000000
#> DNB UN Equity      0.000000000
#> DO UN Equity       0.000000000
#> DOV UN Equity      0.000000000
#> DOW UN Equity      0.000000000
#> DPS UN Equity      0.000000000
#> DRI UN Equity      0.000000000
#> DTE UN Equity      0.000000000
#> DUK UN Equity      0.000000000
#> DVA UN Equity      0.000000000
#> DVN UN Equity      0.000000000
#> EA UW Equity       0.000000000
#> EBAY UW Equity     0.000000000
#> ECL UN Equity      0.000000000
#> ED UN Equity       0.000000000
#> EFX UN Equity      0.000000000
#> EIX UN Equity      0.000000000
#> EL UN Equity       0.000000000
#> EMC UN Equity      0.000000000
#> EMN UN Equity      0.000000000
#> EMR UN Equity      0.000000000
#> EOG UN Equity      0.000000000
#> EQR UN Equity      0.000000000
#> ES UN Equity       0.000000000
#> ESRX UW Equity     0.000000000
#> ETFC UW Equity     0.000000000
#> ETN UN Equity      0.000000000
#> ETR UN Equity      0.000000000
#> EXC UN Equity      0.000000000
#> EXPE UW Equity     0.000000000
#> F UN Equity        0.000000000
#> FAST UW Equity     0.000000000
#> FDX UN Equity      0.000000000
#> FE UN Equity       0.000000000
#> FIS UN Equity      0.000000000
#> FISV UW Equity     0.000000000
#> FITB UW Equity     0.000000000
#> FLIR UW Equity     0.000000000
#> FLR UN Equity      0.000000000
#> FLS UN Equity      0.000000000
#> FMC UN Equity      0.000000000
#> FSLR UW Equity     0.000000000
#> FTI UN Equity      0.000000000
#> GD UN Equity       0.000000000
#> GE UN Equity       0.000000000
#> GILD UW Equity     0.000000000
#> GLW UN Equity      0.000000000
#> GME UN Equity      0.000000000
#> GPC UN Equity      0.000000000
#> GPS UN Equity      0.000000000
#> GS UN Equity       0.000000000
#> GWW UN Equity      0.000000000
#> HAL UN Equity      0.000000000
#> HAR UN Equity      0.000000000
#> HCN UN Equity      0.000000000
#> HCP UN Equity      0.000000000
#> HD UN Equity       0.000000000
#> HES UN Equity      0.000000000
#> HIG UN Equity      0.000000000
#> HOG UN Equity      0.000000000
#> HON UN Equity      0.000000000
#> HOT UN Equity      0.000000000
#> HRB UN Equity      0.000000000
#> HRL UN Equity      0.000000000
#> HRS UN Equity      0.000000000
#> HST UN Equity      0.000000000
#> HSY UN Equity      0.000000000
#> HUM UN Equity      0.000000000
#> ICE UN Equity      0.000000000
#> IFF UN Equity      0.000000000
#> INTU UW Equity     0.000000000
#> IPG UN Equity      0.000000000
#> IRM UN Equity      0.000000000
#> ISRG UW Equity     0.000000000
#> ITW UN Equity      0.000000000
#> IVZ UN Equity      0.000000000
#> JEC UN Equity      0.000000000
#> JNPR UN Equity     0.000000000
#> JWN UN Equity      0.000000000
#> K UN Equity        0.000000000
#> KEY UN Equity      0.000000000
#> KIM UN Equity      0.000000000
#> KLAC UW Equity     0.000000000
#> KMB UN Equity      0.000000000
#> KR UN Equity       0.000000000
#> KSS UN Equity      0.000000000
#> L UN Equity        0.000000000
#> LB UN Equity       0.000000000
#> LEG UN Equity      0.000000000
#> LEN UN Equity      0.000000000
#> LH UN Equity       0.000000000
#> LLL UN Equity      0.000000000
#> LLTC UW Equity     0.000000000
#> LLY UN Equity      0.000000000
#> LM UN Equity       0.000000000
#> LMT UN Equity      0.000000000
#> LUK UN Equity      0.000000000
#> LUV UN Equity      0.000000000
#> M UN Equity        0.000000000
#> MA UN Equity       0.000000000
#> MAS UN Equity      0.000000000
#> MAT UW Equity      0.000000000
#> MCHP UW Equity     0.000000000
#> MCK UN Equity      0.000000000
#> MCO UN Equity      0.000000000
#> MDT UN Equity      0.000000000
#> MET UN Equity      0.000000000
#> MJN UN Equity      0.000000000
#> MKC UN Equity      0.000000000
#> MMC UN Equity      0.000000000
#> MMM UN Equity      0.000000000
#> MO UN Equity       0.000000000
#> MON UN Equity      0.000000000
#> MRK UN Equity      0.000000000
#> MRO UN Equity      0.000000000
#> MS UN Equity       0.000000000
#> MSFT UW Equity     0.000000000
#> MSI UN Equity      0.000000000
#> MTB UN Equity      0.000000000
#> MU UW Equity       0.000000000
#> MUR UN Equity      0.000000000
#> MYL UW Equity      0.000000000
#> NBL UN Equity      0.000000000
#> NDAQ UW Equity     0.000000000
#> NEE UN Equity      0.000000000
#> NEM UN Equity      0.000000000
#> NI UN Equity       0.000000000
#> NKE UN Equity      0.000000000
#> NOC UN Equity      0.000000000
#> NOV UN Equity      0.000000000
#> NSC UN Equity      0.000000000
#> NTAP UW Equity     0.000000000
#> NTRS UW Equity     0.000000000
#> NUE UN Equity      0.000000000
#> NWL UN Equity      0.000000000
#> OI UN Equity       0.000000000
#> OMC UN Equity      0.000000000
#> ORLY UW Equity     0.000000000
#> PBCT UW Equity     0.000000000
#> PBI UN Equity      0.000000000
#> PCAR UW Equity     0.000000000
#> PCG UN Equity      0.000000000
#> PCL UN Equity      0.000000000
#> PCLN UW Equity     0.000000000
#> PCP UN Equity      0.000000000
#> PDCO UW Equity     0.000000000
#> PEG UN Equity      0.000000000
#> PFG UN Equity      0.000000000
#> PG UN Equity       0.000000000
#> PGR UN Equity      0.000000000
#> PH UN Equity       0.000000000
#> PHM UN Equity      0.000000000
#> PKI UN Equity      0.000000000
#> PM UN Equity       0.000000000
#> PNC UN Equity      0.000000000
#> POM UN Equity      0.000000000
#> PPG UN Equity      0.000000000
#> PPL UN Equity      0.000000000
#> PRU UN Equity      0.000000000
#> PSA UN Equity      0.000000000
#> PWR UN Equity      0.000000000
#> PX UN Equity       0.000000000
#> PXD UN Equity      0.000000000
#> R UN Equity        0.000000000
#> RF UN Equity       0.000000000
#> RHI UN Equity      0.000000000
#> RHT UN Equity      0.000000000
#> RL UN Equity       0.000000000
#> ROK UN Equity      0.000000000
#> ROP UN Equity      0.000000000
#> ROST UW Equity     0.000000000
#> RRC UN Equity      0.000000000
#> RSG UN Equity      0.000000000
#> RTN UN Equity      0.000000000
#> SBUX UW Equity     0.000000000
#> SCG UN Equity      0.000000000
#> SEE UN Equity      0.000000000
#> SHW UN Equity      0.000000000
#> SLB UN Equity      0.000000000
#> SNA UN Equity      0.000000000
#> SNI UN Equity      0.000000000
#> SO UN Equity       0.000000000
#> SPG UN Equity      0.000000000
#> SPGI UN Equity     0.000000000
#> SRCL UW Equity     0.000000000
#> SRE UN Equity      0.000000000
#> STI UN Equity      0.000000000
#> STJ UN Equity      0.000000000
#> STT UN Equity      0.000000000
#> STZ UN Equity      0.000000000
#> SWK UN Equity      0.000000000
#> SWN UN Equity      0.000000000
#> SYK UN Equity      0.000000000
#> SYY UN Equity      0.000000000
#> TAP UN Equity      0.000000000
#> TDC UN Equity      0.000000000
#> TGNA UN Equity     0.000000000
#> TGT UN Equity      0.000000000
#> TIF UN Equity      0.000000000
#> TJX UN Equity      0.000000000
#> TMK UN Equity      0.000000000
#> TMO UN Equity      0.000000000
#> TROW UW Equity     0.000000000
#> TRV UN Equity      0.000000000
#> TSN UN Equity      0.000000000
#> TSO UN Equity      0.000000000
#> TSS UN Equity      0.000000000
#> TWC UN Equity      0.000000000
#> TWX UN Equity      0.000000000
#> TXT UN Equity      0.000000000
#> UNH UN Equity      0.000000000
#> UNM UN Equity      0.000000000
#> UNP UN Equity      0.000000000
#> UPS UN Equity      0.000000000
#> USB UN Equity      0.000000000
#> V UN Equity        0.000000000
#> VAR UN Equity      0.000000000
#> VFC UN Equity      0.000000000
#> VLO UN Equity      0.000000000
#> VMC UN Equity      0.000000000
#> VNO UN Equity      0.000000000
#> VRSN UW Equity     0.000000000
#> VTR UN Equity      0.000000000
#> WAT UN Equity      0.000000000
#> WEC UN Equity      0.000000000
#> WFM UW Equity      0.000000000
#> WHR UN Equity      0.000000000
#> WM UN Equity       0.000000000
#> WMB UN Equity      0.000000000
#> WMT UN Equity      0.000000000
#> WU UN Equity       0.000000000
#> WY UN Equity       0.000000000
#> WYN UN Equity      0.000000000
#> WYNN UW Equity     0.000000000
#> XEL UN Equity      0.000000000
#> XL UN Equity       0.000000000
#> XLNX UW Equity     0.000000000
#> XOM UN Equity      0.000000000
#> XRX UN Equity      0.000000000
#> YUM UN Equity      0.000000000
#> ZBH UN Equity      0.000000000

df_pnl(x_te, w_ete, r_te, "PortfolioETE") |> plot_pnl()

Figure 1: ETE portfolio and SP500 cumulative P&L.

A two-step algorithm for parsimonious solutions

Revisiting my questions: can we get a more parsimonious model, with fewer assets selected, while still maintaining reasonably good tracking performance? Additionally, instead of optimizing a constrained sparse regression objective, we could potentially use a two-stage algorithm. Following these ideas, things should work like this:

  1. Use an unconstrained sparse regression for selecting the assets, preferably yielding sparser solutions.
  2. Use a constrained OLS regression on the selected assets where the coefficients are non-negative and sum to 1.

This is obviously a “poor man’s approach”, but it is simple to implement (even I can do it in no time), as shown below.

Stage 1

Create wrapper functions for extracting \(\beta\)s from msaenet models:

get_beta <- function(object) {
  beta <- coef(object)
  names(beta) <- if (inherits(object, "msaenet.msaenet")) {
    rownames(object$beta)
  } else {
    colnames(object$model$X)
  }
  beta
}

get_nzv <- function(object) {
  beta <- get_beta(object)
  names(beta[which(beta != 0)])
}

Stage 2

Use CVXR by Fu et al. (2020) to do a constrained sum-to-one OLS regression on the selected variables to get the weights:

library("CVXR")

storeg <- function(x, y) {
  x <- as.matrix(x)
  y <- as.vector(y)
  p <- ncol(x)
  beta <- Variable(p) # one weight per selected asset

  # Least squares objective: sum of squared tracking residuals
  obj <- sum((y - x %*% beta)^2)
  # Long-only weights that sum to one (fully invested)
  constr <- list(beta >= 0, sum(beta) == 1)
  prob <- Problem(Minimize(obj), constr)
  result <- solve(prob)

  structure(
    list(
      result = result,
      beta = result$getValue(beta)
    ),
    class = "storeg"
  )
}
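
In symbols, and writing \(S\) for the set of assets selected in stage 1 (the notation \(S\), \(X_S\), and \(\beta_S\) is introduced here for illustration and is not taken from the paper), the problem that storeg() hands to CVXR can be written as:

\[ \begin{array}{c} \min\limits_{\beta_S} \| y - X_S \beta_S \|_2^2\\ \text{s.t.} \ \sum_{i \in S} \beta_i = 1, \ \beta_i \geq 0 \end{array} \]

Compared with the formulation in the Background section, the \(\ell_0\) penalty is gone because the sparsity pattern has already been fixed by the selection step. What remains is a convex quadratic program.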

Get asset names and coefficients in the portfolio:

get_portfolio <- function(beta, nzv) {
  as.vector(beta) |>
    setNames(nzv) |>
    sort(decreasing = TRUE) |>
    data.frame(fix.empty.names = FALSE)
}

Chaining things together:

fit_enet <- msaenet::msaenet(
  x = as.matrix(x_tr),
  y = as.vector(r_tr),
  family = "gaussian",
  init = "ridge",
  tune = "cv",
  nsteps = 10,
  lower.limits = 0,
  verbose = FALSE,
  seed = 42
)

fit_enet |> get_nzv()
#> [1] "CTAS UW Equity" "HON UN Equity"  "HPQ UN Equity"  "L UN Equity"   
#> [5] "XOM UN Equity"

fit_enet_sto <- storeg(x_tr[, get_nzv(fit_enet)], r_tr)

get_portfolio(fit_enet_sto$beta, get_nzv(fit_enet))
#>                         
#> L UN Equity    0.2518363
#> XOM UN Equity  0.2370656
#> CTAS UW Equity 0.1945193
#> HON UN Equity  0.1630314
#> HPQ UN Equity  0.1535473

df_pnl(x_te[, get_nzv(fit_enet)], fit_enet_sto$beta, r_te, "Portfolio.msaenet") |>
  plot_pnl()

Figure 2: msaenet portfolio and SP500 cumulative P&L.

We have selected far fewer assets (five, compared with 45 in the ETE portfolio) to track the index, and the tracking error over the test period still looks reasonably small.
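
One way to put a number on this, rather than reading it off the plot, is to compute the root mean squared difference between the portfolio's and the index's daily returns over the test period. The helper below is a hypothetical sketch: the name `rmse_te` and the toy inputs are mine, and this is only one of several possible tracking error definitions.

```r
# Root mean squared difference between portfolio and index daily returns.
rmse_te <- function(x, beta, y) {
  active <- as.vector(as.matrix(x) %*% as.vector(beta)) - as.vector(y)
  sqrt(mean(active^2))
}

# Toy check with made-up returns: two assets, three days.
x_toy <- matrix(c(0.01, 0.02, -0.01, 0.03, 0.00, 0.01), ncol = 2)
y_toy <- c(0.02, 0.01, 0.01)
te_toy <- rmse_te(x_toy, beta = c(0.5, 0.5), y = y_toy)
te_toy
```

With the objects from this post, a call might look like `rmse_te(x_te[, get_nzv(fit_enet)], fit_enet_sto$beta, r_te)`, and the same function could be applied to `x_te` and `w_ete` for a like-for-like comparison.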

Stability of the solution

Since we used cross-validation to tune the hyperparameters, there is a possibility that we were “lucky” in getting a good model. It is therefore useful to check the stability of the selection results by repeating the fit many times with different seeds and, consequently, different cross-validation splits of the training data.

library("doParallel")
registerDoParallel(detectCores())

fit_rep <- foreach::foreach(seed = 1:100) %dopar% {
  msaenet::msaenet(
    x = as.matrix(x_tr),
    y = as.vector(r_tr),
    family = "gaussian",
    init = "ridge",
    tune = "cv",
    nsteps = 10,
    lower.limits = 0,
    verbose = FALSE,
    seed = seed
  )
}
as.data.frame(table(unlist(sapply(fit_rep, get_nzv)))) |>
  ggplot2::ggplot(ggplot2::aes(x = Freq, y = Var1)) +
  ggplot2::geom_point(size = 3, color = ggsci::pal_d3()(1)) +
  ggplot2::scale_x_continuous(
    name = "Selection frequency out of 100 experiments",
    limits = c(0, 100),
    expand = c(0, 5)
  ) +
  ggplot2::scale_y_discrete(name = NULL, expand = c(0, 0.5)) +
  cowplot::theme_minimal_grid()

Figure 3: Dot plot of asset selection frequency out of 100 experiments.
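
The tabulation step in the pipeline above flattens the per-seed selections and counts how often each asset appears. A self-contained sketch with made-up selections (the tickers and the three "runs" are invented for illustration):

```r
# Made-up selection results from three hypothetical runs.
sel <- list(
  c("AAA", "BBB"),
  c("AAA", "CCC"),
  c("AAA", "BBB", "CCC")
)

# Flatten and count; sort so the most frequently selected assets come first.
freq <- sort(table(unlist(sel)), decreasing = TRUE)
freq

# Share of runs in which each asset was selected.
share <- freq / length(sel)
share
```

Assets with a share close to 1 are selected regardless of the seed, while low shares suggest picks that depend on a particular cross-validation split.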

Even sparser solutions

We can plug in different penalized regression methods in the first stage to get different portfolios with different degrees of sparsity and asset structure. For example, we can try to get an even sparser solution:

fit_snet <- suppressWarnings(msaenet::msasnet(
  x = as.matrix(x_tr),
  y = as.vector(r_tr),
  family = "gaussian",
  init = "ridge",
  tune = "cv",
  nsteps = 10,
  verbose = FALSE,
  seed = 42
))

fit_snet |> get_nzv()
#> [1] "CINF UW Equity" "COP UN Equity"  "HON UN Equity"

fit_snet_sto <- storeg(x_tr[, get_nzv(fit_snet)], r_tr)

get_portfolio(fit_snet_sto$beta, get_nzv(fit_snet))
#>                         
#> CINF UW Equity 0.4773397
#> HON UN Equity  0.3016434
#> COP UN Equity  0.2210169

df_pnl(x_te[, get_nzv(fit_snet)], fit_snet_sto$beta, r_te, "Portfolio.msasnet") |>
  plot_pnl()

Figure 4: Sparse portfolio using multi-step SCAD-net and SP500 cumulative P&L.

As we further reduce the number of assets from five to three, the tracking error is acceptable at the beginning of the test period but grows in the latter part. The situation is similar for another attempt, this time using the multi-step MCP-net:

fit_mnet <- suppressWarnings(msaenet::msamnet(
  x = as.matrix(x_tr),
  y = as.vector(r_tr),
  family = "gaussian",
  init = "ridge",
  tune = "cv",
  nsteps = 10,
  verbose = FALSE,
  seed = 42
))

fit_mnet |> get_nzv()
#> [1] "CINF UW Equity" "HES UN Equity"  "HON UN Equity"

fit_mnet_sto <- storeg(x_tr[, get_nzv(fit_mnet)], r_tr)

get_portfolio(fit_mnet_sto$beta, get_nzv(fit_mnet))
#>                         
#> CINF UW Equity 0.5383087
#> HON UN Equity  0.3042618
#> HES UN Equity  0.1574295

df_pnl(x_te[, get_nzv(fit_mnet)], fit_mnet_sto$beta, r_te, "Portfolio.msamnet") |>
  plot_pnl()

Figure 5: Sparse portfolio using multi-step MCP-net and SP500 cumulative P&L.

Summary

While there are still open challenges in sparse index tracking such as determining rebalancing frequency, refining backtest strategies, and ensuring portfolio stability, this two-stage regression approach provides a simpler and potentially more parsimonious model for tackling this problem.