Fast RomanoWolf Multiple Testing Corrections for fixest đș
For the final chapter of my dissertation, I had examined the effects of ordinal class rank on the academic achievement of Danish primary school students (following a popular identification strategy introduced in a paper by Murphy and Weinhard). Because of the richness of the Danish register data, I had a large number of potential outcome variables at my disposal, and as a result, I was able to examine literally all the outcomes that the previous literature had covered in individual studies  the effect of rank on achievement, personality, risky behaviour, mental health, parental investment, etc.  in one paper.
But with (too) many outcome variables comes a risk: inflated type 1 error rates, or an increased risk of âfalse positivesâ. So I was wondering: were all the significant effects I estimated  shown in the figure above  ârealâ, or was I simply being fooled by randomness?
A common way to control the risk of false positive when testing multiple hypothesis is to use methods that control the âfamilywiseâ error rate, i.e.Â the risk of at least one false positive in a family of S hypotheses.
Among such methods, Romano and Wolfâs correction is particularly popular, because it is âuniformly the most powerfulâ. Without going into too much detail, Iâll just say that if you have a choice between a number of methods that control the familywise error rate at a desired level \(\alpha\), you might want to choose the âmost powerfulâ one, i.e.Â the one that has the highest success rate of finding a true effect.
Now, there is actually an amazing Stata package for the RomanoWolf method called rwolf
.
But this is an R blog, and Iâm an R guy âŠ In addition, my regression involved several million rows and some highdimensional fixed effects, and rwolf
quickly showed its limitations. It just wasnât fast enough!
While playing around with the rwolf
package, I finally did my due diligence on the method it implements, and after a little background reading, I realized that both the Romano and Wolf method  as well as its main rival, the method proposed by Westfall and Young  are based on the bootstrap!
But wait! Had I not just spent a lot of time porting a superfast bootstrap algorithm from R to Stata? Could I not use Roodman et alâs âfast and wildâ cluster bootstrap algorithm for bootstrapbased multiple hypothesis correction?
Of course it was inevitable: I ended up writing an R package. Today I am happy to present the first MVP version of the wildwrwolf
package!
The wildrwolf package
You can simply install the package from github or runiverse by typing
# install.packages("devtools")
devtools::install_github("s3alfisc/wildrwolf")
# from runiverse (windows & mac, compiled R > 4.0 required)
install.packages('wildrwolf', repos ='https://s3alfisc.runiverse.dev')
The Romano Wolf correction in wildrwolf
is implemented as a postestimation commands for multiple estimation objects from the fabulous fixest
package.
To demonstrate how wildrwolf's
main function, rwolf
, works, letâs first simulate some data:
library(wildrwolf)
library(fixest)
set.seed(1412)
library(wildrwolf)
library(fixest)
set.seed(1412)
N < 5000
X1 < rnorm(N)
X2 < rnorm(N)
rho < 0.5
sigma < matrix(rho, 4, 4); diag(sigma) < 1
u < MASS::mvrnorm(n = N, mu = rep(0, 4), Sigma = sigma)
Y1 < 1 + 1 * X1 + X2
Y2 < 1 + 0.01 * X1 + X2
Y3 < 1 + 0.4 * X1 + X2
Y4 < 1 + 0.02 * X1 + X2
for(x in 1:4){
var_char < paste0("Y", x)
assign(var_char, get(var_char) + u[,x])
}
group_id < sample(1:100, N, TRUE)
data < data.frame(Y1 = Y1,
Y2 = Y2,
Y3 = Y3,
Y4 = Y4,
X1 = X1,
X2 = X2,
group_id = group_id,
splitvar = sample(1:2, N, TRUE))
We now estimate a regression via the fixest
package:
fit < feols(c(Y1, Y2, Y3, Y4) ~ csw(X1,X2),
data = data,
cluster = ~group_id,
ssc = ssc(cluster.adj = TRUE))
# clean workspace except for res & data
rm(list= ls()[!(ls() %in% c('fit','data'))])
For all eight estimated regressions, we want to apply the RomanoWolf correction to the parameter of interest, X1
. We simply need to provide the regression object of type fixest_multi
(it is also possible to simply provide a list of objects of type fixest
), the parameter of interest, the number of bootstrap draws, and possibly a seed (for replicability). Thatâs it! If the regressions use clustered standard errors, rwolf
will pick this up and run a wild cluster bootstrap, otherwise just a robust wild bootstrap.
pracma::tic()
res_rwolf < wildrwolf::rwolf(
models = fit,
param = "X1",
B = 9999,
seed = 23
)
##

  0%

=========  12%

==================  25%

==========================  38%

===================================  50%

============================================  62%

====================================================  75%

=============================================================  88%

====================================================================== 100%
pracma::toc()
## elapsed time is 3.980000 seconds
For \(N=5000\) observations with \(G=100\) clusters, \(S=8\) hypothesis and \(B=9999\) bootstrap draws, the calculation of RomanoWolf corrected pvalues takes less than 5 seconds. If you ask me, that is pretty fast! =) đ
We can now compare the results of rwolf
with the uncorrected pvalues and another popular multiple testing adjustment, Holmâs method.
pvals < lapply(fit, function(x) pvalue(x)["X1"]) > unlist()
df <
data.frame(
"uncorrected" = pvals,
"Holm" = p.adjust(pvals, method = "holm"),
"rwolf" = res_rwolf$pval
)
rownames(df) < NULL
round(df, 3)
## uncorrected Holm rwolf
## 1 0.000 0.000 0.000
## 2 0.000 0.000 0.000
## 3 0.140 0.420 0.367
## 4 0.033 0.132 0.128
## 5 0.000 0.000 0.000
## 6 0.000 0.000 0.000
## 7 0.398 0.420 0.394
## 8 0.152 0.420 0.367
Both Holmâs method and rwolf
produce similar corrected pvalues, which  in general  are larger than the uncorrected pvalues.
But does it actually work? Monte Carlo Experiments
We test \(S=6\) hypotheses and generate data as
\[Y_{i,s,g} = \beta_{0} + \beta_{1,s} D_{i} + u_{i,g} + \epsilon_{i,s} \] where \(D_i = 1(U_i > 0.5)\) and \(U_i\) is drawn from a uniform distribution, \(u_{i,g}\) is a cluster level shock with intracluster correlation \(0.5\), and the idiosyncratic error term is drawn from a multivariate random normal distribution with mean \(0_S\) and covariance matrix
S < 6
rho < 0.5
Sigma < matrix(1, 6, 6)
diag(Sigma) < rho
Sigma
This experiment imposes a data generating process as in equation (9) in Clarke, Romano and Wolf, with an additional error term \(u_g\) for \(G=20\) clusters and intracluster correlation 0.5 and \(N=1000\) observations.
You can run the simulations via the run_fwer_sim()
function attached
in the package.
# note that this will take some time
res < run_fwer_sim(
seed = 232123,
n_sims = 500,
B = 499,
N = 1000,
s = 6,
rho = 0.5 #correlation between hypotheses, not intracluster!
)
# > res
# reject_5 reject_10 rho
# fit_pvalue 0.274 0.460 0.5
# fit_pvalue_holm 0.046 0.112 0.5
# fit_padjust_rw 0.052 0.110 0.5
Both Holmâs method and wildrwolf
control the family wise error rates, at both the 5 and 10% significance level. Very cool!
The method by Westfall and Young
A package for Westfall and Youngâs (1993) method is currently in development. I am not yet fully convinced that it works as intented  in simulations, I regularly find that wildwyoung
overrejects.
Literature
Clarke, Damian, Joseph P. Romano, and Michael Wolf. âThe RomanoâWolf multiplehypothesis correction in Stata.â The Stata Journal 20.4 (2020): 812843.
Murphy, Richard, and Felix Weinhardt. âTop of the class: The importance of ordinal rank.â The Review of Economic Studies 87.6 (2020): 27772826.
Romano, Joseph P., and Michael Wolf. âStepwise multiple testing as formalized data snooping.â Econometrica 73.4 (2005): 12371282.
Roodman, David, et al.Â âFast and wild: Bootstrap inference in Stata using boottest.â The Stata Journal 19.1 (2019): 460.
Westfall, Peter H., and S. Stanley Young. Resamplingbased multiple testing: Examples and methods for pvalue adjustment. Vol. 279. John Wiley & Sons, 1993.