Plan Sample Size

The significance of the unique effect of one or a set of predictors in the regression model is determined by (1) PRE (Proportional Reduction in Error, also called partial eta_squared in ANOVA, or partial R_squared in regression), (2) number of parameters in the regression model, and (3) sample size. As a result, given PRE, the number of parameters in the regression model, and expected statistical power, we can plan the sample size for one or a set of predictors to reach the expected statistical power (usually 0.80) and the expected significance level (usually 0.05). This is just power_lm()’s mission.

To compare the power of r2 and PRE, Keng also gives power_r() to conduct power analysis for Pearson’s r. power_r() has readable arguments and is easy to use, hence is not detailed in vignettes.

Understand the arguments of power_lm()

Among the arguments of power_lm(), PRE, PC, and PA merit further explanation.

PRE

PRE is partial R_squared. Partial R_squared is the square of the partial correlation. You could calculate PRE from the partial correlation. Other statistical software or R packages often plan sample size for regression models through Cohen’s f_squared, or its square root, Cohen’s f. power_lm() use PRE here because PRE and its square root, partial correlation, are more meaningful. The partial correlation is the net correlation between the outcome of regression (e.g., depression) and the predictor (e.g., problem-focused coping) or set of predictors (e.g., the dummy codes of class) of interest. Put differently, the partial correlation is the pure correlation between the outcome and the predictor or set of predictors of interest after controlling for all other predictors, no matter how many they are. We should give a nice guess about the partial correlation. For example, we may guess that, after controlling for other predictors, the partial correlation between the outcome depression and the predictor problem-focused coping is 0.2, then PRE = 0.22 = 0.04. You may get the effect size Cohen’s f_squared or f of problem-focused coping predicting depression, and in this case you could convert Cohen’s f_squared or f to PRE. Keng provides a function calc_PRE() to help users to convert the partial correlation r_p, or f_squared or f to PRE.

PA and PC

Suppose that your regression model has m predictors totally. This model is the augmented model (Model A), and has both the focal predictors (e.g., gender, the dummy codes of class) and other less-important predictors like covariates. The number of parameters of this augmented model (Model A), PA, is m + 1, since the intercept is also a parameter. PA should be at least 1.

The model without the focal predictors is the compact model (Model C). Suppose that the number of the focal predictors is k, the resulting number of parameters of the compact model (Model C), PC, is m + 1 - k. PC should be at least 0.

Note that power_lm() follows Aberson’s (2019), and the planned sample size is more conservative than other statistical software like G*power. However, the difference is small and negligible.

Application

Given that regression analysis is equivalent to t-test and ANOVA, power_lm() could plan the sample size for perhaps all common research designs.

A set of predictors

You may be interested in the power and required sample size of the full regression model, you could treat all predictors as a set. The Model C is the intercept-only model, hence PC = 1. Suppose your regression model has m predictors, hence PA = m + 1.

m predictors’ total PRE is actually R2. Assuming m predictors’ total PRE is 0.02, m is 10, we plan the sample size using the following code:

library(Keng)
power_lm(PRE = 0.02, PC = 1, PA = 11)
#> -- Given ---------------------------
#> PRE = 0.02
#> f_squared = 0.02040816
#> PC = 1
#> PA = 11
#> sig_level = 0.05
#> -- Post-hoc power analysis --------
#> n = NULL
#> power_post = NULL
#> -- Sample size planning -----------
#> Expected power = 0.8
#> Minimum sample size = 816

As shown above, this design needs at least 816 cases/observations.

A continuous predictor

You may be interested in the power and required sample size of one continuous predictor. Suppose your regression model has m predictors, in this case PA = m + 1, PC = (m + 1) - 1.

Moderation model

You may be interested in the two-way moderation model. In the two-way moderation model, the focal predictor is actually the two-way interaction term. Suppose your regression model has m predictors, in this case PA = m + 1, PC = (m + 1) - 1.

You may be interested in the three-way moderation model. In the three-way moderation model, the focal predictors are two two-way interaction terms and one three-way interaction term. Suppose your regression model has m predictors, in this case PA = m + 1, PC = (m + 1) - 3.

One-sample t -test or an intercept-only regression

If you are interested in the difference between the mean of one group and 0, you may turn to the one-sample t -test. Or, you could establish an intercept-only model. Then the focal parameter is the intercept. In this case PA = 1, PC = 0.

Note that in this case you must use the CORRECT PRE to yield the correct power and planned sample size. Do not compute PRE from Cohen’s one-sample d ; instead, compute PRE from the t value of the one-sample t -test by converting t to r, and then converting r to PRE. You could also compute the correct PRE using compare_lm() function.

library(Keng)
data("depress")
# WRONG PRE for the one-sample t-test
(Cohens_d <- effectsize::cohens_d(dm1 ~ 1, data = depress)$Cohens_d)
#> [1] 4.985268
(r_from_d <- effectsize::d_to_r(Cohens_d, n1 = 94))
#> [1] 0.9287831
(PRE_from_d <- r_from_d^2)
#> [1] 0.862638
# CORRECT PRE for the one-sample t-test
out <- t.test(dm1 ~ 1, depress)
(r_from_t <- effectsize::t_to_r(t = out$statistic, df_error = out$parameter)$r)
#> [1] 0.9806709
(PRE_from_t <- r_from_t^2)
#> [1] 0.9617153
# PRE from lm()
fit0 <- lm(dm1 ~ 0, depress)
fit1 <- lm(dm1 ~ 1, depress)
compare_lm(fit0, fit1)[7, 4]
#> [1] NA

Two-sample t -test or a binary predictor

If you are interested in the difference between two groups (e.g., experimental vs control), you may turn to the t -test. Or, you could treat the group variable as a binary predictor and conduct regression analysis. Then the focal predictor is the binary group predictor. Suppose your regression model has m predictors, in this case PA = m + 1, PC = (m + 1) - 1.

ANOVA or a multicategorical predictor

If you are interested in the difference between multiple groups, you may turn to ANOVA. Or, you could treat the group variable as a multicategorical independent variable. Then you could code it using a coding schema like dummy coding. No matter which coding schema you use, for a multicategorical independent variable with j levels, it should be coded into (j - 1) predictors, which are the set of focal predictors. Suppose your regression model has m predictors, among which there are (j - 1) codes, in this case PA = m + 1, PC = (m + 1) - (j - 1).

ANOVA concerning repeated measures

If you are interested in the outcome that were repeatedly measured, you may turn to repeated-measures-ANOVA. In essence, repeated-measures-ANOVA computes the difference score of interest (contrasts), and then conducts between-factor ANOVA. Similarly, you could compute the difference score of interest, and then conduct regression analyses.

A special case is there is no between-subject factor. Under this circumstance, treat the difference score as the outcome and establish an intercept-only model like one-sample t -test.

Reference

Aberson, C. L. (2019). Applied power analysis for the behavioral sciences. Routledge.