What NOT to do with NON-“null” results – Part III: Underpowered study, but significant results
I continue the series on how to interpret experiments that do not go according to plan. In Part I and Part II, I alluded to an important but often overlooked issue in research: when you have an underpowered study but a statistically significant result. If this happens, a typical response may be:
Hooray!!! I did it, even with N = 10 I found the effect I expected! I proved all the naysayers wrong! And I saved a bunch of time and money by not having to collect all the data that annoying “power analysis” spat out! And to top it off, my effect is massive! Way bigger than the literature suggested! Nature, here I come!

While you may believe this is cause for celebration, unfortunately, things are not as they seem. Unlike in the previous parts, this time I say that what you should NOT do is simply report “there is an effect of X, p < .05”. While you did reach that coveted statistical significance threshold, your inference is likely wrong and incomplete. In this post, I will explain why, and what you should report instead. The focus will be on explaining Type M and Type S errors (see Fig 1).
MAIN
I recently checked within my circles of academics and non-academics how well the concepts are known:
Glossing over the lack of engagement (I’m not great at social media), I would conclude that people aren't very familiar with Type M and Type S error. These errors relate to why finding a significant p-value in an underpowered study (too few participants) will mislead you in estimating the potential effect. For a comprehensive read (or to cite), I recommend Gelman and Carlin (2014), Fraley and Vazire (2014), and Ioannidis (2008).
So, what are these errors?
- Type M error: effects found in small sample studies are likely to be inflated (error of magnitude) (aka. the winner’s curse; van Zwet & Cator, 2021), given that the test is statistically significant.
- Type S error: effects found in small sample studies have a higher chance of being in the wrong direction (error of sign), given that the test is statistically significant.
Note: small sample studies are also less likely to find true effects (i.e., Type II error). In underpowered studies it is difficult to differentiate between an inflated but true (non-null) effect and a false (null) one.
Note 2: while often forgotten, Type I error is also (indirectly) affected by power. With a small sample, it is also more likely that a “found” effect is a statistical artefact rather than a true effect. As the power of a study decreases, the ratio of false positives to true positives among significant results increases! The relevant quantity here is the Positive Predictive Value (PPV): the proportion of true positives (TP) among all positive results (TP + false positives [FP]). In underpowered studies, the PPV is lower. See Update 1.
Note 3: Such misleading and inflated effects are caused by the use of an (arbitrary) significance cut-off in research. When you have a “true” discovery but the study is underpowered and the p-value is at the edge of significance, the estimate is likely inflated (Ioannidis, 2008). This is why decisions like optional (early) stopping, data peeking, or multiple testing are so dangerous (although principled solutions do exist, like sequential (interim) analyses and false discovery rate corrections).
A small refresher on statistical power (1 − Type II error rate): power (e.g., 80%, 95%) is the probability that you correctly reject the null hypothesis (i.e., how often, in the long run, you will detect the presence of a true effect). Power is computed based on, among other things, the effect size of interest. This is why having an understanding of the effect you wish to measure is so important, and why reporting and relying on inflated effects can be so damaging to science.
Aside: Type M and S errors occur both when using small samples (underpowered studies) and when studying small effects with noisy measures (high measurement error); but the latter is more difficult to illustrate.
The inflation effect
I mostly focus on Type M error, as this is more common and difficult to resolve even with increased power. The effect low power has on inflating effect sizes is termed the inflation effect (Fig 2; Albers & Lakens, 2018).
If you don't find the above convincing, you can also play around with some R code:
library(pwr)

# Set parameters
n_a <- n_b <- 10  # Sample size for both groups
d <- 0.2          # True effect size (Cohen's d)
alpha <- 0.05     # Significance level

# Simulate 100 studies and store effect sizes for significant results
significant_effects <- c()
for (i in 1:100) {
  # Generate data
  group_a <- rnorm(n = n_a, mean = 0, sd = 1)
  group_b <- rnorm(n = n_b, mean = d, sd = 1)
  # Perform t-test and store effect size if significant
  result <- t.test(group_b, group_a, var.equal = TRUE)
  if (result$p.value < alpha) {
    significant_effects <- c(significant_effects, abs(result$estimate))
  }
}

# Compute average effect size of significant results
avg_effect <- mean(significant_effects)
# Compute average inflation of effect size compared to true effect size
avg_inflation <- mean(significant_effects) / d

# Print results
cat(paste0("Average effect size of significant results: ", round(avg_effect, 2), "\n"))
cat(paste0("Average inflation of effect size compared to true effect size: ", round(avg_inflation, 2)))
The code samples from two groups with a d = 0.20 difference between them, and computes the average estimated effect size across all significant tests out of 100 simulations. It also provides the average inflation of the estimated effect compared to the true effect. For a d = 0.20, I would need N = 788 (n = 394 per group) to achieve 80% power; if I only collect N = 20 (n = 10 per group), my power is 7%. Here, although I set the true effect at d = 0.20, the avg. effect across the significant results in my run was d = 0.56 (inflation of effect size: 2.81x).
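If you want to check those sample size and power numbers for yourself, the pwr package (loaded above) can reproduce them; this is just a sanity check on the design, not part of the simulation:

library(pwr)

# Sample size per group needed for 80% power at d = 0.20 (~394 per group, N ~ 788)
pwr.t.test(d = 0.2, power = 0.80, sig.level = 0.05, type = "two.sample")

# Power actually achieved with only n = 10 per group (~7%)
pwr.t.test(n = 10, d = 0.2, sig.level = 0.05, type = "two.sample")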
If the simulations and theoretical explanations were not convincing, I'll give you a simple example of why this occurs. If two groups are different by only a small degree, and you collect only a handful of data points (participants), the only way the t-test will come back significant is if the data is sampled from the tail-ends of the two distributions but in opposite directions, i.e., you are measuring the most extreme low scores in Group A versus the most extreme high scores in Group B (Fig 3).
Fig 3 illustrates two things: 1) why you get these inflated effects in underpowered studies, and 2) why you can easily get something completely different if power is low.
I will use one of my favourite teaching tools, the ESCI p-value dance by Geoff Cumming (link). This brilliant app shows just how random p-values and our estimates are given the power of the study. I highly recommend readers play around with the app, and watch the YouTube videos on this topic. It will change the way you see p-values and the value (or lack thereof) of a single experiment.
Say you anticipate and plan for a SESOI of d = 0.40 in a one-sample design. While you know that you need around N = 84 (for 95% power), you fail to meet your target (or worse, stop collecting data once the p-value is "significant"). You only collect a sample of N = 10. This gives the study 20% power.
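As a quick check of these numbers, here is what pwr gives for this scenario; I am assuming a one-sample, two-tailed t-test at alpha = .05, which is what reproduces the figures above:

library(pwr)

# Sample size needed for d = 0.40 at 95% power (~84)
pwr.t.test(d = 0.4, power = 0.95, sig.level = 0.05, type = "one.sample")

# Power actually achieved with only N = 10 (~20%)
pwr.t.test(n = 10, d = 0.4, sig.level = 0.05, type = "one.sample")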
Looking at Fig 4, I want you to imagine that each line is one experiment you could have run but didn't. Take the top line as the data you collected on Monday. You had 10 participants come into the lab, they did your task, you analysed the data, and got p = .011 (sig). You went home happy 😄. Now, imagine you had been sick on Monday and ran the study on Tuesday instead: you collect 10 participants, analyse the data, and get p = .537 (ns). Uh-oh, that's not good 😱. Now you have a "failed" study.
This is what happens with underpowered studies. The amount of variability between potential experiments using the same design but low power (here 20%) is huge! Also, take note of the width of the 95% CIs; they are all massive, providing very imprecise estimates of your effect size (large margins of error [MoE]), which makes interpretation difficult. Each one of those lines could be a different universe in which you ran your study. This is why power is so important.
Say we ran the same study, but we planned for N = 80 (~80% power). Let's see how the universe would look now:
With a properly powered study, not only are we more likely to get a significant effect (80% of the time if we are investigating a true effect), but our estimates are also more precise (narrower 95% CIs). In this universe, it wouldn't matter what day of the week you ran the study; you would have reason to celebrate at the end 😎.
OK, now back to the situation at hand. You ran an underpowered study, and you got a significant p-value. What can you do to ensure the inferences you make are realistic and your readers understand the issues with this seemingly successful study? You have some options for the analyses and for what you report. Unfortunately, nothing can overcome the limitations of an underpowered study; you can only make things transparent.
Design Analysis
Gelman and Carlin suggest running a design calculation, either prospectively or retrospectively, using the retrodesign() and sim_plot() functions in the retrodesign package. If you can define the true effect in your field, or a range of plausible effects, you can check how likely you are to obtain an inflated or reversed effect in a well-powered versus a poorly powered study. As an example, retrodesign(.40, .141) assumes the true effect in the population is d = 0.40 and the SE = .141, which gives us around 80% power. At this power level, the probability of a Type S error is almost 0%, and the expected exaggeration (inflation) factor is 1.12x. You can visually see what this would look like in Fig 1, where the triangles are Type S error studies and the squares are Type M error studies. While I'm still making sense of the paper, my best interpretation is that it can be used as a tool alongside your power analysis to estimate, if you do get a significant result, how much you can trust your effect estimate. For instance, if the expected effect and standard error were such that you have 20% power, you would get a 0.4% probability of a Type S error, but a 2.17x exaggeration factor. If you then collect an even smaller sample than you planned, these issues would be even worse!
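To make the calculation concrete, below is a minimal, self-contained sketch of this kind of design analysis. It follows the logic of Gelman and Carlin's retrodesign() (power, probability of a sign error given significance, and expected exaggeration given significance), but uses a normal approximation and my own function name (retro_sketch), so treat it as an illustration rather than the package's implementation:

# A = assumed true effect, s = standard error of its estimate
retro_sketch <- function(A, s, alpha = 0.05, n_sims = 1e5) {
  z <- qnorm(1 - alpha / 2)
  power  <- 1 - pnorm(z - A / s) + pnorm(-z - A / s)  # P(significant result)
  type_s <- pnorm(-z - A / s) / power                 # P(wrong sign | significant)
  est    <- A + s * rnorm(n_sims)                     # simulate estimates around the true effect
  sig    <- abs(est) > s * z                          # which estimates pass the cut-off
  type_m <- mean(abs(est[sig])) / A                   # expected exaggeration | significant
  list(power = power, type_s = type_s, exaggeration = type_m)
}

retro_sketch(0.40, 0.141)  # ~80% power: Type S near 0, exaggeration around 1.1x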
Sensitivity Analysis and Power Curve
A solution which I think is more intuitive, and which would avoid the "sample size samba" issue (see footnotes), is to pre-register your power analysis plus a power curve. A power curve simply plots the power level at different sample sizes for your design. This would give your readers an idea of the power drop (Type II error increase) from your planned sample (e.g., N = 50; 80% power) to your obtained sample (e.g., N = 20; 40% power). See Fig 6.
Fig 6: Power curve (G*Power) for a one-sample t-test assuming a d = 0.40. It shows the sample size on the y-axis and the power level on the x-axis.
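Fig 6 was produced in G*Power; if you prefer R, a comparable curve can be sketched with the pwr package (assuming the same one-sample, two-tailed design with d = 0.40):

library(pwr)

# Power at a range of sample sizes for a one-sample t-test, d = 0.40, alpha = .05
ns <- seq(5, 120, by = 5)
power_at_n <- sapply(ns, function(n)
  pwr.t.test(n = n, d = 0.4, sig.level = 0.05, type = "one.sample")$power)

plot(ns, power_at_n, type = "b", xlab = "Sample size (N)", ylab = "Power")
abline(h = 0.80, lty = 2)  # conventional 80% power reference line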
Alongside this power curve, which will hopefully dissuade people from playing with the values post hoc until they get an OK-looking estimated sample size, you can also report a sensitivity analysis. Briefly, there are 4 values that we juggle when considering power in a design: alpha threshold, effect size, N, and power level. You always hold 3 fixed to estimate the 4th. In power analyses you hold alpha, effect size, and power fixed, and estimate N. In sensitivity analyses, you hold alpha and power fixed, input the N you obtained, and estimate the effect size. The output is the smallest effect size you can reliably detect at your required power given your sample (aka. the minimum detectable effect; MDE). See Fig 7.
Fig 7: Sensitivity analysis (G*Power) for a final sample of N = 10. The smallest effect that could be detected with 95% power at 5% alpha is d = 1.29. Anything smaller, and power is lower.
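The same sensitivity analysis can be run in R; again this assumes a one-sample, two-tailed test, and simply asks pwr to solve for the effect size:

library(pwr)

# Minimum detectable effect with N = 10, 95% power, alpha = .05 (~1.3; G*Power reports 1.29)
pwr.t.test(n = 10, power = 0.95, sig.level = 0.05, type = "one.sample")$d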
You might also want to show the sensitivity plot, so the smallest reliably detectable d at different sample sizes can be easily understood.
Fig 8: Plot of the effect size against different possible sample sizes when power = 95% and alpha = 0.05.
Or you can plot sensitivity between effect size and power, showing different power levels for smaller or larger effects.
Fig 9: Plot of the effect size against the desired power when n = 10 and alpha = 0.05.
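If you want to build plots like Figs 8 and 9 yourself, here is a rough R sketch under the same assumptions (one-sample, two-tailed, alpha = 0.05); the ranges chosen for n and power are arbitrary:

library(pwr)

# Fig 8 analogue: minimum detectable d across sample sizes, at 95% power
ns <- seq(5, 100, by = 5)
d_by_n <- sapply(ns, function(n)
  pwr.t.test(n = n, power = 0.95, sig.level = 0.05, type = "one.sample")$d)

# Fig 9 analogue: minimum detectable d across power levels, at n = 10
powers <- seq(0.30, 0.99, by = 0.01)
d_by_power <- sapply(powers, function(p)
  pwr.t.test(n = 10, power = p, sig.level = 0.05, type = "one.sample")$d)

par(mfrow = c(1, 2))
plot(ns, d_by_n, type = "l", xlab = "Sample size (N)", ylab = "Minimum detectable d")
plot(powers, d_by_power, type = "l", xlab = "Power", ylab = "Minimum detectable d")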
Thus, at the end of your experiment, if a Reviewer or Editor demands a sample size justification ("Why did you collect only 32 people? Explain yourself! You call this science!?"), you can report the MDE (here, d = 1.29). If your effect is larger than or equal to the MDE (e.g., d = 1.50), then you can argue that you had sufficient power to detect it. But if your effect size is smaller (e.g., d = 0.90), your study was most likely underpowered, and the effect could be a Type I error or a combination of errors, so caution is advised.
Compromise power analysis (β/α ratio)
There is also a compromise power analysis, which keeps N and effect size fixed and calculates the optimal Type I and II error rates for your experiment. I am not a fan of this approach, as it can result in the use (excuse) of non-standard cut-offs, e.g., alpha = 0.35 and power = 64% (in the above example, for N = 10 and d = 0.4). There is nothing wrong with the approach per se, but it moves the focus from long-term error rates in a field (e.g., alpha = 5%, power = 95%) to per-study inferences (see justify your alpha; Maier & Lakens, 2022). In this toot thread, Lakens contends that this approach is beneficial if you *need* to make a decision in an applied scenario.
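For the curious, here is a rough sketch of what a compromise analysis does under the hood, assuming the one-sample N = 10, d = 0.4 example and a β/α ratio of 1 (both errors treated as equally costly); the exact numbers G*Power reports may differ slightly from this approximation:

# Find the critical t value at which beta / alpha equals the chosen ratio q
n <- 10; d <- 0.4; q <- 1        # q = beta/alpha ratio
df <- n - 1
ncp <- d * sqrt(n)               # noncentrality parameter for a one-sample t-test

err_rates <- function(crit) {
  alpha <- 2 * (1 - pt(crit, df))                      # two-tailed Type I error rate
  power <- 1 - pt(crit, df, ncp) + pt(-crit, df, ncp)  # P(|t| > crit | true effect d)
  c(alpha = alpha, beta = 1 - power)
}

crit <- uniroot(function(cv) {
  e <- err_rates(cv)
  e["beta"] - q * e["alpha"]
}, interval = c(0.1, 5))$root

err_rates(crit)  # implied alpha and beta; roughly alpha ~ .35 and power ~ 64%, as quoted above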
Note: Remember, post hoc ("observed", retrospective) power is not a thing! And hopefully now you see why: it is simply a transformation of the obtained p-value, which itself is contingent on the effect size in your sample. But if, in underpowered studies, the effect size is inflated, this "power" calculation is misleading (overestimated)!
Additional inferential metrics: False Positive Risk (FPR₅₀) and Vovk-Sellke Maximum p-Ratio (VS-MPR)
If you are willing to dip your toes into a more "subjective" realm of frequentist statistics, you can compute a few metrics that allow you to make stronger per-study inferences. Recall that frequentist statistics concerns itself with long-term inferences, which is why research fields set standard cut-offs for Type I and II errors. But, if you are willing to sacrifice a bit of objectivity, you can report two additional metrics: FPR₅₀ and VS-MPR.
FPR₅₀ (Colquhoun, 2014). "If you observe a “significant” p-value after doing a single unbiased experiment, what is the probability that your result is a false positive?" That is the question this metric aims to answer. I encourage readers to see the full paper by Colquhoun or watch this excellent video (here), as I won't cover the full explanation. Briefly, if we assume a 50:50 probability that our hypothesis is correct before we do the study (even if in your mind you are *sure* you are right), we can calculate, from your obtained p-value, the probability that your result occurred by chance. You can use this online calculator. Say you ran a study: 20% power, true (anticipated) d = 0.40, and you obtained p = .05. You can compute a False Positive Risk based on this, FPR₅₀ = 0.29, i.e., a 29% false positive risk with this sample. If you claimed a discovery based on this p-value, you would be "making a fool of yourself" almost a third of the time.
VS-MPR (Vovk-Sellke Maximum p-Ratio). This weird-sounding metric (sometimes called the Bayes Factor Bound) offers a frequentist analogue of a Bayes Factor in favour of an effect vs. no effect. Importantly, it is not in favour of *your* effect, but of any effect at the obtained p-value. As with the FPR, this is a transformation of the p-value that compares H₀ (no effect) to all possible H₁ (alternative hypotheses) and picks the best one. It finds the alternative under which your observed p-value is most likely, and says how much more evidence there is for that state of the universe than for the one where there is no effect. If in a study you obtain p = .05 (exactly), this translates to a VS-MPR = 2.46; so your p-value is, at best, only 2.46 times more likely under the best-case H₁ than under H₀. This metric is useful for hypothesis testing - inferring the presence/absence of an effect - but not for estimating an effect. I mention it as it is reported by JASP in all frequentist tests, so you may have seen it around; or you can compute it using the spida2 package. You can learn more about it here.
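For reference, the bound has a simple closed form that you can compute yourself from the p-value (valid for p < 1/e); this reproduces the value quoted above:

# Vovk-Sellke maximum p-ratio: 1 / (-e * p * ln(p)) for p < 1/e
vs_mpr <- function(p) 1 / (-exp(1) * p * log(p))
vs_mpr(0.05)  # ~2.46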
Non-solution: bootstrapping, while very useful (and I use it here), is not a cure for small sample sizes, regardless of what some say (see Algina et al., 2006). If your data is non-representative and sparse, bootstrapping will only make the problem worse. You can check the reliability of your estimate by increasing the number of bootstraps and seeing how much the values vary (see MADLIB), as sketched below. If they are unstable, mention this in your results, to prevent incorrect inferences.
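A minimal sketch of that stability check, using the same data-generating values as the simulated example described below (the variable names and the percentile CI choice are mine):

set.seed(1)
control   <- rnorm(10, mean = 100, sd = 80)
treatment <- rnorm(10, mean = 120, sd = 80)

# Percentile 95% CI of the mean difference for a given number of bootstrap resamples
boot_ci <- function(B) {
  diffs <- replicate(B, mean(sample(treatment, replace = TRUE)) -
                        mean(sample(control,   replace = TRUE)))
  quantile(diffs, c(0.025, 0.975))
}

lapply(c(1000, 2000, 5000, 10000), boot_ci)  # do the CI limits stay stable as B increases?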
[To demonstrate the issue, I simulated the following data: Control (M = 100, SD = 80) and Treatment (M = 120, SD = 80), normally distributed. This provides a d = 0.40 difference. I then sampled n = 10 from each group]
Assuming you planned the study properly, I would expect your Method section to read something like this for the Sample:
"An a priori power analysis for an independent samples comparison, two-tailed, at α = 0.05 and power = 80%, for a d = .40, determined that 200 participants needed to be recruited; Control (n = 100) and Treatment (n = 100)."
If you failed to reach this sample size, in your Results section you can write this:
"Due to [reason for small sample] only N = 20 (n = 10 per condition) could be recruited. To estimate the minimum detectable effect (MDE), a sensitivity analysis considering the final sample size was conducted. This revealed that effect sizes of d = 1.32 or larger could be detected with 80% power at an α = 0.05."
Such a sensitivity statement will give your readers some expectations as to what counts as a feasible, reliable discovery in such a study (not necessarily your study!). Specifically, if this effect is impossibly large for your field, such a study might provide no useful insights. You could also provide a sensitivity plot in your Supplementary Materials (SM). FYI, the current sample, N = 20, has only 13% power to detect the planned effect. If you report a design analysis, at this power level the probability of a Type S error is 1.8%, and the expected exaggeration (inflation) factor is 2.87x.
For write-up you would say (Descriptives):
Moving on to tests, you have several options. First, you can naively run a Welch's t-test. Second, you could run a bootstrapped t-test. Third, you could run a randomization (permutation) test. Fourth, as in Part I, you could run no significance test at all (recommended), and just interpret the effect size and confidence intervals. There are several variations of all of these, and some are more appropriate given your data and assumptions. Here, I go with the randomization test, as it has the fewest assumptions and provides an exact p-value, not an asymptotic one (the latter is the default in most software).
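This is not the exact code I used, but a minimal Monte Carlo version of such a randomization test looks like this (data re-simulated with the same parameters as above; full enumeration of all choose(20, 10) = 184,756 group re-labellings would give the exact p-value):

set.seed(42)
control   <- rnorm(10, mean = 100, sd = 80)
treatment <- rnorm(10, mean = 120, sd = 80)

obs_diff <- mean(treatment) - mean(control)
pooled   <- c(control, treatment)

n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  relabel <- sample(pooled)                  # shuffle the group labels
  mean(relabel[1:10]) - mean(relabel[11:20])
})

# Two-sided p-value: proportion of re-labellings at least as extreme as the observed difference
(sum(abs(perm_diffs) >= abs(obs_diff)) + 1) / (n_perm + 1)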
For write-up you would say (Hypothesis testing):
And that should be sufficient for this single result. If the journal permits SM, I would also add any meaningful descriptives and plots there. Like these (from bmbstats):
The aim with these is not to bombard the reader with information, or try to make your paper seem more technical and robust than it is, but to provide clarity regarding the (poor) data.
[Note on tests: I ran each of the tests mentioned above, and here are the results:
The bmbstats test, bizarrely, reports a much lower p-value, but I can't tell why, as they should all be more-or-less the same. I ran it with 1, 2, 5, and 10k boots and it just oscillated between .006 and .009. This difference might seem small, but if you convert it to an s-value you can see how it might affect how people interpret the result: s = 7.39 (i.e., a result as surprising as 7 heads in a row), while the Welch t-test gives s = 5.44 (i.e., as surprising as 5 heads in a row)]. Update: After speaking with the package author, this may be due to the calculation used by bmbstats; as such, I would not rely on this specific result.
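For reference, the s-value is just the Shannon information of the p-value, s = −log2(p), so you can convert between the two directly:

-log2(0.006)  # ~7.4 bits: like 7 heads in a row
-log2(0.023)  # ~5.4 bits: the Welch result above corresponds to a p of roughly .02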
In your discussion, to interpret the above figure and data, you would need to focus strongly on the Limitations, as readers are likely to overinterpret results - like how you seemingly found a large health benefit of your treatment plan - and not see the issues:
That should be sufficient for the Discussion section, as long as you provide the data, so it may be used in the future. Pooling similar studies can provide more accurate/realistic estimates of effects (meta-analyses).
To properly communicate underpowered studies, you must be as transparent as possible with your data and as tempered as possible in your inferences. Confidence intervals, effect size estimates, plotting the data, and making it available are crucial elements of this process.
Remember that the replicability of a study is not determined by the p-value (e.g., a p = .001 does not mean it is “more likely to replicate” than a p = .05). It is determined by the power of the study. So, if your study had 20% power and you happen to get lucky and get a significant result, regardless of the p-value obtained, the next replication using the same study design also has 20% power to detect the effect. As Fraley and Vazire put it “[t]he power of a design […] is statistically independent of the outcome of any one study”.
Importantly, reporting effects from underpowered studies will have a knock-on effect when others attempt to replicate or plan their research based on such estimates, as their studies will also be underpowered (e.g., if you rely on an inflated effect of d = 1.40 while the “real” effect is d = 0.40, your sample size will be off by a factor of ~11), and can result in a false non-replication. To quote Fraley and Vazire again, “a research domain based on underpowered studies is likely to be full of ‘failures to replicate’ and ‘inconsistent findings’”.
Consider just the data I generated compared to what I ended up reporting (“finding”). The means and SDs of the two groups do not reflect my data-generating parameters, and the estimated effect size (Cohen's d = 1.13) is 2.83x what it should be given the population effect (d = 0.40); which, funnily enough, is close to the estimated exaggeration (inflation) factor from the design analysis!
Statistical power tells you how often, in the long run, you would expect to find a significant effect if you are studying a true effect, but also how often you will miss one (so you don't despair when you do). On the flipside, it should also make you very sceptical when you see a paper that reports multiple significant results in a row; see Schimmack's Incredibility Index.
Depending on your field of study, a Type M or S error can mean that you overestimate the effectiveness of a treatment (e.g., the response to a medicine) or even get its direction wrong (e.g., the medicine is making things worse, but due to a sign error you believe it helps). The latter is less common, as the Type S error rate decreases rapidly with increased power, while Type M error is more difficult to reduce (Lu et al., 2018).
If you're curious, the average power of psychology studies seems to be 30-50%...
Some might argue, “Well, how can I know the ‘true’ effect in advance? Maybe it really is that big, or even bigger!”. Sure, it may be that you happen to be studying an effect that is extremely large; however, you can take an educated guess as to how large effects in your field tend to be (e.g., in my sub-field I typically see 0.20 < d < 1.00), or, if you plan for a SESOI, then over repeated studies your estimate will get closer to the true value and you can adjust accordingly. It is better to err on the side of caution with exploratory research.
One thing I am sure none of my readers would ever do, but does exist as a practice, is the “sample size samba” (Schulz & Grimes, 2005) where authors – in wanting to comply with journal requirements for sample size justification – will “adjust” (dishonestly) the effect size estimate after-the-fact to justify their final sample size.
Another thing to note: the options and solutions I mention here are for existing data. There are, of course, more principled things you can do if you know in advance what your sample recruitment constraints are (e.g., increase the precision of your measure or use more trials, pre-register a directional hypothesis, or use a within-subjects design).
Finally, I did not cover other sources of such inflated effects in published research, but I will name some extra ones: publication bias, selective reporting, p-hacking, flexibility in analysis, researcher degrees of freedom. All of these (and more) contribute to inflated effects. With small samples any modification to the analysis can have drastic effects (e.g., the effect of 1 outlier in a sample of N = 10 vs. N = 1000).
PS: I’ve avoided the elephant in the room, one solution to small samples is using Bayesian statistics (with informed priors). Bayesian estimation typically requires fewer data points per parameter (van de Schoot, et al., 2015).
Updates
#1: I have adjusted Note 2 to reflect more clearly that the effect of (under)power on Type I error refers to the PPV, and not to the α level itself. PPV = TP / (FP + TP) and is the complement of the FPR discussed above. It reflects "the probability that a 'positive' research finding reflects a true effect" (Button et al., 2013). Lower power decreases the true positive term while the false positive term stays fixed, producing a lower PPV. The PPV formula can be rewritten to include the Type I and II error rates like this: PPV = (1 − β) × R / [(1 − β) × R + α × (1 − R)], where R is the prior probability of the hypothesis being true. Hence, although the rate of false positives (Type I error) is fixed, the rate of true positives (power) decreases, lowering the relative proportion of true positives among all positives. Thus, if our only metric for publishing a study is that it is "significant" (positive), a reported result is less likely to be a true positive when the study is underpowered. I thank Daniel Lakens for highlighting this point.
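If it helps, the formula is easy to play with in R; this little sketch (my own function name, with α = .05 and a 50:50 prior as assumptions) shows the drop in PPV as power falls:

# PPV following the formula above: R = prior probability that the hypothesis is true
ppv <- function(power, alpha = 0.05, R = 0.5) {
  (power * R) / (power * R + alpha * (1 - R))
}

ppv(power = 0.95)  # ~0.95: most "significant" results are true positives
ppv(power = 0.20)  # ~0.80: a noticeably larger share of "significant" results are false positives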
Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology, 74, 187–195. https://doi.org/10.1016/j.jesp.2017.09.004
Algina, J., Keselman, H. J., & Penfield, R. D. (2006). Confidence Interval Coverage for Cohen’s Effect Size Statistic. Educational and Psychological Measurement, 66(6), 945–960. https://doi.org/10.1177/0013164406288161
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Colquhoun D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. https://doi.org/10.1098/rsos.140216
Fraley, R. C., & Vazire, S. (2014). The N-Pact Factor: Evaluating the Quality of Empirical Journals with Respect to Sample Size and Statistical Power. PLoS ONE, 9(10), e109019. https://doi.org/10.1371/journal.pone.0109019
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
Ioannidis, J. P. A. (2008). Why Most Discovered True Associations Are Inflated. Epidemiology, 19(5), 640–648. http://www.jstor.org/stable/25662607
Lu, J., Qiu, Y., & Deng, A. (2018). A note on Type S/M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology, 72(1), 1–17. https://doi.org/10.1111/bmsp.12132
Maier, M., & Lakens, D. (2022). Justify your alpha: A primer on two practical approaches. Advances in Methods and Practices in Psychological Science, 5(2). https://doi.org/10.1177/25152459221080396
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487
Schulz, K. F., & Grimes, D. A. (2005). Sample size calculations in randomised trials: mandatory and mystical. The Lancet, 365(9467), 1348–1353. https://doi.org/10.1016/s0140-6736(05)61034-3
van de Schoot, R., Broere, J. J., Perryck, K. H., Zondervan-Zwijnenburg, M., & van Loey, N. E. (2015). Analyzing small data sets using Bayesian estimation: The case of posttraumatic stress symptoms following mechanical ventilation in burn survivors. European Journal of Psychotraumatology, 6, 25216. https://doi.org/10.3402/ejpt.v6.25216
van Zwet, E. W., & Cator, E. A. (2021). The significance filter, the winner's curse and the need to shrink. Statistica Neerlandica, 75(4), 437-452. https://doi.org/10.1111/stan.12241