 What to do with "null" results

Part I: Nonsignificant and underpowered

PREAMBLE

  • I assume you have some familiarity with frequentist stats and NHST
  • I will default to general conventions to avoid unnecessary verbiage (e.g., "p > .05" instead of "a p-value higher than your pre-selected alpha criterion for long-run Type I error control")
  • When possible, I will explain a concept briefly instead of just pointing to a 300+ page statistics book, with the attendant risk that my explanation will be limited.
  • This is a guide (I explain something), not a tutorial (I show you how to do it); I can do a tutorial if people want (let me know in the comments).
  • I will only use open-license software to explain concepts and keep R code to a minimum; I know people might just want to implement solutions with tools like JASP and G*Power directly.

PREMISE

A common scenario you might face is when the test you just ran reports a p-value bigger than .05: the dreaded "nonsignificant" result. So, what do you do next?

One thing you shouldn't do is report "there is no effect of X, p > .05". This is both wrong and incomplete. In this post, I will explain why, and what you should report instead.

MAIN

I posted a question about this topic on Mastodon [follow me here]. While the question is not solely about nonsignificant results, it illustrates a common issue when running studies (i.e., a sample size that is too small for what you planned to investigate, and a nonsignificant result).

For the nonsignificant scenario, some suggestions are:

1. Gardner-Altman plot for the data and difference.

2. Equivalence plot (Lakens, 2017) (or just put a ROPE range in the G-A plot); testing might be pushing it.

3. Compatibility curve (Rafi and Greenland, 2020).

4. Acknowledge that the data are less suited to a per-study inference approach, but could be useful in a meta-analysis (so report the raw data, effect sizes, and CIs).

5. 🤷‍♂️

If you follow the toot thread (yes, that's what they are called), you will see differences of opinion on what can be done.

Geoff Cumming (of The New Statistics) argues that if you failed to collect sufficient data for your intended test, there is no justified reason to even report, say, a t-test. You might read that and wonder "why not?", but the answer is built into the NHST framework. To test a hypothesis you need to set some guidelines in advance (a priori), such as your accepted level of long-run false positive error (alpha, e.g., .05), the smallest effect you would care about reliably detecting in the long run (SESOI, e.g., Cohen's d = .40), and the long-run rate at which you are willing to miss such an effect (the false negative rate; its complement is power, e.g., 80%).

I will pause here for a second to make sure everyone understands what the issue is, as this is where people get NHST wrong, and why you get nasty comments on your manuscripts from people like me saying that what you are claiming is wrong or nonsensical.

NHST is a frequentist framework for doing science. What that means is that we don't care about a single study per se, but about a way of doing science that has acceptable long-run error rates for discoveries (hence the name frequentist). So yes, NHST is not built to tell you whether your current study is correct or incorrect. You read that right: on its own, a single study cannot tell you whether you have truly found something or not. The purpose of NHST is to ensure that if you kept doing the study that way (e.g., 10,000 studies) and looked at all the results together, they would produce false positives 5% of the time and false negatives 20% of the time (in my scenario). You will never know whether your particular study is one of those errors!
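If you want to see this long-run logic for yourself, here is a minimal simulation sketch in base R (the numbers mirror my scenario: n = 100 per group, a true effect of d = .40 in the "effect exists" world, and α = .05; the seed and object names are just illustrative):

    # Long-run error rates across many hypothetical studies (base R only)
    set.seed(2023)                 # arbitrary seed, for reproducibility
    n_sims  <- 10000               # number of hypothetical studies per "world"
    n_group <- 100                 # participants per group in each study

    # World 1: the null is true (d = 0)
    p_null <- replicate(n_sims, t.test(rnorm(n_group), rnorm(n_group))$p.value)

    # World 2: a true effect of d = 0.40 exists
    p_alt <- replicate(n_sims, t.test(rnorm(n_group), rnorm(n_group, mean = 0.4))$p.value)

    mean(p_null < .05)   # long-run false positive rate, ~ .05
    mean(p_alt >= .05)   # long-run false negative rate, ~ .20 (i.e., ~80% power)

No single simulated study tells you which world it came from; only the whole pile of studies recovers the 5% and 20% rates.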

(If you are now asking "if this is true, then why do people even interpret their results? Can't we just cut the discussion sections and add the data to some pile?": well, yes and no. Over-interpreting one study is silly, but people do it because it's the culture of publishing; I doubt you get published if you just write "here's some data, bye". However, Neyman (and Pearson) argued that this approach to science is a form of inductive behaviour (not inference): if we obtain the result we expect, we act as if it is right until we are given a reason to change our position. I leave the Bayesian perspective for another post.)

OK, with that sorted, I will now split the post into two scenarios: (1) you have insufficient data and a nonsignificant result, and (2) you meet your planned N and have a nonsignificant result. (In the future, I might cover the third option of insufficient data but a significant result, and talk about the winner's curse and Type M error.)

 Scenario 1: Nobody wanted to do your study (undersampled, nonsignificant)

The main reason for writing this post was to tackle this exact issue, as I rarely see any guidance on what should be done, or on how one should report such results. A typical response, say, from the grumpy old Prof. down the hall is "just don't publish it" (i.e., the file drawer). Sometimes this might be justifiable, e.g., all your data are corrupted, your apparatus failed, or you only managed to record 2 of your 400 planned participants, but even then I would argue you should write it up, as it can be useful for others to know. Why did it fail? Was the task too confusing/hard/painful? Did you put the electrodes on backwards? You could then try submitting it to JOTE as a Reflection Article, or just put it up as a working paper on PsyArXiv.

But let's say the problem wasn't that bad: you simply failed to meet your target because of financial reasons, time constraints, or a small available population for your target effect. What now?

Maybe you have heard that if you get a nonsignificant result you can run equivalence tests to see if there really is "no effect" or just noisy data (if you haven't heard of these, don't worry, I will get to them in the next scenario; Part II). However, as Cumming commented, this may not be feasible here. Hypothesis testing and equivalence testing both assume you collected sufficient data to answer your question; only then can these protocols sort out what is going on in your data. In the absence of sufficient data to even theoretically test what you expected, you have no way of separating signal from noise. If you needed N = 200 (n = 100 per group) to reliably detect a d = .40, but you only collected N = 40 (n = 20 per group), then your power drops from 80% to 23%. While your false positive rate is still 5%, your false negative rate is now 77%. Unfortunately, once you have collected the data, you cannot determine IF your study is underpowered, just that it MIGHT be one explanation for your nonsignificant result. "Post hoc" power is not a thing [see this blog post]. After the fact, the only thing you can compute, and should, is a sensitivity analysis (see below). This allows you to ask "given that I want a false positive rate of 5% and 80% power in the long run, what is the smallest effect I could still reliably detect with the N I have?"; here the answer is d = .91. This is mostly useful to know for Scenario 3 (undersampled but significant), which I will cover in the future.
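To check these numbers yourself, G*Power will do it point-and-click; in R, the pwr package (assuming you have it installed) gives the same answers. A minimal sketch:

    # install.packages("pwr")   # if needed
    library(pwr)

    # Power you actually achieved with n = 20 per group for d = .40
    pwr.t.test(n = 20, d = 0.40, sig.level = .05, type = "two.sample")
    # power ~ .23, so the long-run false negative rate is ~ .77

    # Sensitivity analysis: smallest effect detectable with 80% power at this n
    pwr.t.test(n = 20, power = .80, sig.level = .05, type = "two.sample")
    # d ~ .91, the minimum detectable effect (MDE) for the sample you have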

Thus, in an undersampled scenario, a nonsignificant p-value is consistent with either (1) there being no effect or (2) the study being underpowered to detect one. The problem is that we have no way of determining which is the case.

What to do and what to report:

I expect that your Method section reads like this for Sample:

"An a priori power analysis for an independent samples comparison, two-tailed, at α = 0.05 and power = 80%, for a d = .40, determined that 200 participants needed to be recruited; Control (n = 100) and Treatment (n = 100)."

Now, in your Results section you can write this:

 "Due to [reason for small sample] only N = 40 (n = 20 per condition) could be recruited. To estimate the minimum detectable effect (MDE), a sensitivity analysis considering the final sample size was conducted. This revealed that effect sizes of d = .91 or larger could be detected with 80% power and at an α = 0.05."

This tells the reader (1) that you understand the limitations on inference this problem produces, and (2) what an alternative result could have looked like in your data (e.g., maybe in your field d > .90 is almost unheard of or even impossible, so the study had no chance of finding something of interest).

Now, on to the results. I will take Cumming's suggestion and say don't report any significance tests. [insert surprised Pikachu face] We can rely solely on estimation statistics to communicate the findings of the study. In Jamovi, you can install the ESCI module for estimation statistics. This will allow you to plot the data you have and (using bootstrapping) estimate the difference between groups. The plot will contain all your data, the mean difference (effect size), and confidence intervals around this difference (with a distribution that results from the bootstrapping procedure; note that this is not the same as the posterior behind a credible interval in a Bayesian test). My plot looks a bit different because the ESCI module produces uneditable plots, so I made mine in R using the dabestr package (a sketch is given below).
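For what it is worth, here is a rough sketch of the dabestr calls I mean; treat it as illustrative rather than copy-paste-ready: the data frame (my_data) and its columns (Group, Health) are made up for the example, and the dabestr API has changed across versions (newer releases use load() and dabest_plot() instead), so check the documentation for your installed version.

    library(dabestr)   # assumes the older (<= 0.3.x) dabestr interface

    # my_data: hypothetical data frame with a Group column ("Control"/"Treatment")
    # and a Health column (the outcome score)
    ga <- dabest(my_data, Group, Health, idx = c("Control", "Treatment"))

    plot(mean_diff(ga))   # Gardner-Altman plot, unstandardized mean difference
    plot(hedges_g(ga))    # the same plot in standardized units (Hedges' g)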

Gardner-Altman plot for two independent groups

You may also plot the mean difference in terms of standardized units (here, Hedges' g).
 
Gardner-Altman plot for two independent groups with Hedges g difference plot

For the write-up, you could say:
 
"Given the sample size, the result of the manipulation are presented using a Gardner-Altman plot for a difference between two independent groups (Fig x). Participants in the Control condition (M = 51.37, SD = 30.44, n = 20) show a pattern of lower health scores than participants in the Treatment condition (M = 66.54, SD = 35.48, n = 20), Mdiff = 15.2, 95% CI [-3.29, 36.40], Hedges' g =  0.45 95% CI [-0.17, 1.03]. While the result is consistent with the hypothesis, there is high uncertainty in its estimation, with the data being compatible with a large range of values, from highly beneficial, 36.40 points (g = 1.03), to slightly detrimental, -3.29 points (g = -0.17), and even 0."

Push back against a Reviewer or Editor who suggests you "need" to also report a p-value (here, it was p = 0.155) or talk in terms of "significant" and "nonsignificant". Estimation statistics relies on interpreting the data holistically. In your discussion, to interpret the above figure and data, you could write something like this:

"The results, while in the predicted direction, are too uncertain for a clear conclusion to be made, as they are compatible with beneficial (and large enough to be of theoretical/clinical relevance) effects of treatment on health, but also with detrimental effects on health. The present result and the data, available on OSF [!], may be used to inform future investigations and meta-analyses. We refrain from further speculation."

And there you go. A simple way to present nonsignificant and underpowered results. Report your data in full, provide a copy of it on OSF or another open repository, and don't over-interpret any result from a single study.
 

Footnotes:

One thing to note: if your sample is too small AND unbalanced (e.g., n1 = 10, n2 = 30), then running any tests is not advised at all. Even the bootstrapped approach presented here may best be avoided, as the estimates are unlikely to be stable (see the MADLIB approach in this tweet).

The reasoning I currently have for not reporting compatibility curves (p-value functions) in this scenario is that they rely on both the assumptions of the model (e.g., a Student's t-test) and the data; with so little data, those assumptions are unlikely to hold and would be difficult to check outright.

References:

Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 1–13.


 
