What to do with "null" results
Part I: Nonsignificant and underpowered
PREAMBLE
- I assume you have some familiarity with frequentist stats and NHST
- I will default to general conventions to avoid unnecessary verbiage (e.g., "p > .05" instead of saying "a p-value higher than your pre-selected alpha criterion for long-run Type I error control")
- When possible, I will explain a concept briefly instead of just pointing to a 300+ page statistics book, with the obvious risk that my explanation will be limited.
- This is a guide (I explain something), not a tutorial (I show you how to do it); I can do a tutorial if people want one (let me know in the comments).
- I will only use open-license software to explain concepts, and no R code unless I have to; I know people might just want to implement solutions directly in tools like JASP and G*Power.
PREMISE
A common scenario you might face is when the test you just ran reports a p-value bigger than .05: the dreaded "nonsignificant" result. So, what do you do next?
One thing you shouldn't do is report "there is no effect of X, p > .05". This is both wrong and incomplete. In this post, I will explain why, and what you should report instead.
MAIN
I posted a question about this topic on Mastodon [follow me here]. While the question is not solely about nonsignificant results, it illustrates a common issue when running studies (i.e., a sample size that is too small for what you planned to investigate, and a nonsignificant result).
For the nonsignificant scenario, some suggestions are:
1. Gardner-Altman plot for the data and difference.
2. Equivalence plot (Lakens, 2017) (or just put a ROPE range in the G-A plot); testing might be pushing it (see the sketch after this list).
3. compatibility curve (Rafi and Greenland, 2020)
4. explain the data is less appropriate for a per study inference approach, but could be useful in a meta-analysis (so report raw data, eff size, CIs).
5. 🤷‍♂️
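To make suggestion 2 concrete: if you do decide an equivalence test is defensible, the call itself is short. Here is a minimal sketch using the TOSTER package that accompanies Lakens (2017); the summary statistics are made up for illustration, and the ±0.5 d equivalence bounds are a placeholder you would need to justify for your own field. (Newer TOSTER releases deprecate TOSTtwo() in favour of tsum_TOST(), but the older call still shows the idea.)

# Equivalence test (TOST) for two independent groups, from summary statistics
# install.packages("TOSTER")
library(TOSTER)

TOSTtwo(
  m1 = 5.2, sd1 = 1.1, n1 = 20,   # Control (hypothetical values)
  m2 = 5.0, sd2 = 1.2, n2 = 20,   # Treatment (hypothetical values)
  low_eqbound_d  = -0.5,          # lower equivalence bound in Cohen's d
  high_eqbound_d =  0.5,          # upper equivalence bound in Cohen's d
  alpha = 0.05
)

With only 20 participants per group, though, this test has very little power against bounds like these, which is exactly why "testing might be pushing it".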
If you follow the toot thread (yes, that's what they are called), you will see a range of opinions on what can be done.
I expect the Sample part of your Method section to read something like this:
"An a priori power analysis for an independent samples comparison, two-tailed, at α = 0.05 and power = 80%, for a d = .40, determined that 200 participants needed to be recruited; Control (n = 100) and Treatment (n = 100)."
Now, in your Results section you can write this:
"Due to [reason for small sample] only N = 40 (n = 20 per condition) could be recruited. To estimate the minimum detectable effect (MDE), a sensitivity analysis considering the final sample size was conducted. This revealed that effect sizes of d = .91 or larger could be detected with 80% power and at an α = 0.05."
This tells the reader (1) that you understand the limitations on inference this problem produces, and (2) what an alternative result could have looked like in your data (e.g., maybe in your field d > .90 is almost unheard of or even impossible, so the study had no chance of finding something of interest).
Now, on to the results. I will take Cumming's suggestion and say don't report any significance tests. [insert surprised Pikachu face] We can rely solely on estimation statistics to communicate the findings of the study.

In Jamovi, you can install the ESCI module for estimation statistics. This will allow you to plot the data you have and (using bootstrapping) estimate the difference between groups. The plot will contain all your data, the mean difference (the effect size), and confidence intervals around this difference (with a distribution that results from the bootstrapping procedure; it is not the same as the posterior of a credible interval in a Bayesian test). My plot looks a bit different because the ESCI module makes uneditable plots, so I made mine in R using the dabestr package.
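For completeness, here is roughly what that looks like in code. This is a minimal sketch assuming the pre-2023 dabestr interface (dabest() piped into mean_diff() and plot(); releases from 2023 onward changed the interface), with simulated data standing in for the real study:

# Gardner-Altman estimation plot with a bootstrapped mean difference (dabestr)
library(dabestr)

# Simulated stand-in data: two independent groups of n = 20
set.seed(1)
dat <- data.frame(
  group = rep(c("Control", "Treatment"), each = 20),
  score = c(rnorm(20, mean = 5.0, sd = 1.2),
            rnorm(20, mean = 5.3, sd = 1.2))
)

est <- dabest(dat, group, score,
              idx = c("Control", "Treatment"),
              paired = FALSE)

est_diff <- mean_diff(est)   # bootstrapped mean difference and 95% CI
plot(est_diff)               # raw data, effect size, and CI in one figure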
Footnotes:
One thing to note: if your sample is too small AND unbalanced (e.g., n1 = 10, n2 = 30), then running any tests is not advised at all. Even the bootstrapped estimation approach above should probably be skipped, as the estimates are unlikely to be stable (see the MADLIB approach in this tweet).
The reasoning I currently have for not reporting compatibility curves (p-value functions) in this scenario is that they rely on both the assumptions of the model (e.g., Student's t-test) and the data; with a sample this small, those assumptions are unlikely to hold and would be difficult to test outright.
References:
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Lakens, D. (2017). Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 1–13.