MAGIC in reporting statistical results
Notes on Robert Abelson’s Statistics as Principled Argument
MAGIC is an acronym for a set of principles in reporting and evaluating results.
Magnitude
Magnitude refers to the size of the effect and its practical importance. Reporting effect sizes is something we should routinely do, in addition to (or maybe even instead of?) reporting statistical significance. A variety of effect sizes are available, and most can be converted from one to another, which helps a lot with meta-analysis.
Cohen’s rubric for small, medium, and large effect sizes is a start, but there are also alternative effect sizes worth considering. Evidence Based Medicine developed the Number Needed to Treat (NNT), Number Needed to Harm (NNH), and similar effect sizes for categorical outcomes. Rosenthal’s Binomial Effect Size Display (BESD) and Cohen’s non-overlap measures are other ways of re-expressing effect sizes.
The other key point is practical significance. If the outcome is life or death, then even small effects have clinical relevance. The protective effect of low-dose aspirin against heart attack is a frequently cited example: expressed as a point-biserial correlation, r is only about .03 (so r² is about .001), yet the effect is statistically significant, and extrapolated across millions of people it translates into hundreds or thousands of lives saved.
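As a small illustration of converting between effect sizes, the sketch below moves between Cohen's d and the point-biserial r. It assumes two groups of equal size, which is the simplest textbook case; the function names are made up for this example, and unequal group sizes need a different formula.

```python
import math


def d_to_r(d: float) -> float:
    """Convert Cohen's d to a point-biserial r (assumes equal group sizes)."""
    return d / math.sqrt(d * d + 4.0)


def r_to_d(r: float) -> float:
    """Convert a point-biserial r back to Cohen's d (assumes equal group sizes)."""
    return 2.0 * r / math.sqrt(1.0 - r * r)


# Cohen's conventional small, medium, and large values of d.
for d in (0.2, 0.5, 0.8):
    r = d_to_r(d)
    print(f"d = {d:.1f}  ->  r = {r:.3f}  ->  back to d = {r_to_d(r):.2f}")
```

Putting every study on one metric like this is what lets a meta-analysis pool results that were originally reported in different ways.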
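The alternative effect sizes named above are all simple to compute. The sketch below is illustrative only: the function names and the input values are invented for the example, and the overlap calculations assume two normal distributions with equal variances.

```python
from statistics import NormalDist


def number_needed_to_treat(control_event_rate: float, treated_event_rate: float) -> float:
    """NNT = 1 / absolute risk reduction.

    A negative result means the treatment increased the event rate;
    its absolute value is then read as a Number Needed to Harm.
    """
    absolute_risk_reduction = control_event_rate - treated_event_rate
    if absolute_risk_reduction == 0:
        raise ValueError("No difference in event rates; NNT is undefined.")
    return 1.0 / absolute_risk_reduction


def besd(r: float) -> tuple[float, float]:
    """Binomial Effect Size Display: 'success' rates implied by a correlation r."""
    return 0.5 + r / 2.0, 0.5 - r / 2.0


def cohen_u3(d: float) -> float:
    """Cohen's U3: share of the lower group falling below the upper group's mean."""
    return NormalDist().cdf(d)


def overlap(d: float) -> float:
    """Overlapping coefficient of two unit-variance normals whose means differ by d."""
    return 2.0 * NormalDist().cdf(-abs(d) / 2.0)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(f"NNT for 10% vs 5% event rates: {number_needed_to_treat(0.10, 0.05):.0f}")
print(f"BESD for r = .30: {besd(0.30)}")
print(f"U3 for d = 0.5: {cohen_u3(0.5):.3f}; overlap: {overlap(0.5):.3f}")
```

Each function re-expresses the same underlying difference in a form that may be easier for a clinical or lay audience to interpret.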
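To see how a tiny correlation can coexist with an effect that matters, here is a sketch using hypothetical round counts. They are not the published trial data; they are chosen only to be of a similar order of magnitude. The correlation for a 2x2 table is computed as the phi coefficient.

```python
import math


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi (Pearson r for a 2x2 table): rows are groups, columns are event / no event."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Hypothetical counts for illustration only (NOT the real trial data).
treated_events, treated_n = 100, 11_000
control_events, control_n = 190, 11_000

r = phi_coefficient(
    treated_events, treated_n - treated_events,
    control_events, control_n - control_events,
)
risk_treated = treated_events / treated_n
risk_control = control_events / control_n
relative_risk_reduction = 1.0 - risk_treated / risk_control
events_prevented_per_million = (risk_control - risk_treated) * 1_000_000

print(f"r = {r:.3f}, r squared = {r * r:.4f}")
print(f"relative risk reduction = {relative_risk_reduction:.0%}")
print(f"events prevented per million treated = {events_prevented_per_million:.0f}")
```

With these made-up counts the squared correlation is close to .001, yet the treated group has roughly half the event rate of the control group, which is why the choice of effect size metric matters so much for rare outcomes.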
Articulation
How detailed are the analyses? Is there sufficient supporting detail to contextualize the findings? Did the investigator look for possible interactions (with sex, or with other subgroups)? A good degree of articulation adds depth and nuance without undermining the narrative. The opposite would be “slicing the salami” thinly: publishing a series of papers that each present a narrow subset of analyses or subgroups, the “least publishable unit.”
Generalizability
Are the results likely to replicate? Do they apply to a general population? Or are they likely to be limited to a particular sample? A common critique of a lot of psychology research (especially social psychology and personality work) is that it has been the study of undergraduate psychology majors. Ouch.
Interest
Is the topic intrinsically interesting? Is there a practical application? Do the authors do a good job of engaging the reader and conveying the value of the work? The ABT (“And, But, Therefore”) idea from Olson is a technique for clarifying the message. Abelson’s idea of “interest” also focuses on the practical importance of the finding.
Credibility
Does the work look trustworthy, or is it “fishy”? Abelson describes a continuum from conservative to liberal approaches to analysis. The conservative extreme would be to write down the hypothesis and the analytic plan ahead of time (a priori), test only one primary outcome (or use a conservative adjustment for multiple testing, such as a Bonferroni correction or alpha < .01), and not run or report any additional analyses. The pre-registration movement (ClinicalTrials.gov, and the replication project at OSF) is an example of the conservative approach.
The liberal approach is to do lots of exploration, sensitivity analyses, and subgroup analyses, and to report them. If framed as “exploratory analysis” or “context of discovery,” this can be okay. Machine learning algorithms and “data mining” usually take this more liberal approach.
It gets sketchy when people run lots of analyses (liberal methodology) but report them as if they were a conservative analysis. Reporting only the most significant results as if they were the a priori hypotheses, or failing to disclose how many analyses were run, is where “liberal” turns into p-hacking.
Abelson was writing before there was a replication crisis and before OSF existed. The ability to share code, share data, and publish the full set of analyses makes it possible to adopt a more liberal approach to analysis, yet also do it responsibly.
Machine learning methods are another development in the field. The computer automates running a wide range of models, and then looks at both the fit (described as accuracy of prediction, or reducing “bias” in prediction) and the stability of the model across resamplings (often described as the variance of the estimates across k-fold cross validation). This is a way of trying to balance “discovery” with reproducibility.
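The Bonferroni correction mentioned above simply divides the overall alpha by the number of tests. A minimal sketch follows; the function name and the p-values are hypothetical, made up for illustration.

```python
def bonferroni_decisions(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return True for each test that survives a Bonferroni-adjusted threshold."""
    per_test_alpha = alpha / len(p_values)
    return [p <= per_test_alpha for p in p_values]


# Hypothetical p-values from four outcome measures in one study.
p_values = [0.003, 0.020, 0.049, 0.300]
decisions = bonferroni_decisions(p_values)

print(f"per-test threshold = {0.05 / len(p_values):.4f}")
for p, keep in zip(p_values, decisions):
    print(f"p = {p:.3f}  ->  {'significant' if keep else 'not significant'}")
```

Three of these four results would pass an uncorrected .05 threshold, but only one survives the correction, which is the sense in which the approach is conservative.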
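A short simulation shows why undisclosed multiple testing is a problem. The sketch assumes every null hypothesis is true and that the tests are independent, so each p-value is a uniform random draw between 0 and 1; real analyses on the same data set are usually correlated, so the exact numbers will differ. The function names are invented for this example.

```python
import random


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_best_p(n_tests: int, n_studies: int = 10_000,
                    alpha: float = 0.05, seed: int = 1) -> float:
    """Share of simulated null studies whose smallest p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under the null, each p-value is uniform on [0, 1).
        smallest_p = min(rng.random() for _ in range(n_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / n_studies


for n_tests in (1, 5, 20):
    print(f"{n_tests:>2} tests: analytic = {familywise_error_rate(n_tests):.3f}, "
          f"simulated = {simulate_best_p(n_tests):.3f}")
```

Under these assumptions, a researcher who runs 20 tests and reports only the best one will find something “significant” in well over half of studies where nothing is going on.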
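As a rough illustration of k-fold cross validation, the sketch below fits a straight line to simulated data and scores it on each held-out fold. Everything here is an assumption made for the example: the data are generated from a known line plus noise, the model is ordinary least squares with one predictor, and the helper names are invented. In practice a library such as scikit-learn would normally handle the splitting.

```python
import random
from statistics import mean, stdev


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def k_fold_mse(xs: list[float], ys: list[float], k: int = 5, seed: int = 0) -> list[float]:
    """Mean squared prediction error on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(mean(errors))
    return scores


# Simulated data: y = 2x + 1 plus standard normal noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

scores = k_fold_mse(xs, ys, k=5)
print("fold MSEs:", [round(s, 2) for s in scores])
print(f"mean MSE = {mean(scores):.2f}, spread across folds (SD) = {stdev(scores):.2f}")
```

The mean error across folds speaks to how well the model predicts new cases, and the spread across folds gives a sense of how stable that performance is from one resample to the next.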
External Links
- Wikipedia page for Statistical significance
- Wikipedia page for Number Needed to Treat
- Wikipedia page for Number Needed to Harm
- Rosenthal's original article on Binomial Effect Size Display
- Interactive shiny website that displays overlap with Cohen's d
- Further description and interaction of overlap with Cohen's d
- Wikipedia page on the Bonferroni correction
- Wikipedia page on Machine learning
- Wikipedia page on Data mining
- Wikipedia page on P-hacking also known as data dredging
- Wikipedia page on Cross validation with a section focused on k-fold cross validation