ABCT Bipolar SIG/Listserv Topics/Inter-rater reliability

This thread, on concerns about interrater reliability (IRR), started on the listserv after the publication of a review article (open access!) on the topic by June Gruber and Lauren Weinstock. They posted some discussion questions to the email list, and this page collects some of the responses. Feel free to add to it!

Paper in question

Gruber, J. & Weinstock, L. M. (2018). Interrater reliability in bipolar disorder research: Current practices and suggestions for enhancing best practices. International Journal of Bipolar Disorders, 6, 1-3. https://link.springer.com/article/10.1186/s40345-017-0111-7

Surprisingly, there are no formal guidelines describing this process, nor the common variations within and between labs across different features of IRR (e.g., training, consensus meetings, process for coding interviews, and reporting standards). We highlight this gap and briefly summarize each of these four features and the variety of ways researchers in our field approach them. We thought this would be a good group to share this with and have an open discussion together. We put a few key questions below in the spirit of increasing dialogue and moving forward together as a field. We would love to hear others' thoughts!

Question 1: Do you feel there are clear standards in our field re: acceptable practices for IRR?

EY: No, I think that there is not clarity about either the methods or the benchmarks, and there isn't much reflection about why IRR might be important (with the review article being a nice exception!).

Question 2: Have you encountered confusion in your own work or with trainees about IRR?

EY: Yes, it has felt like a rabbit hole.

Question 3: What aspects of the IRR process would you like to see more discussion about or find the most confusion on?

EY: I think that it would be good to have more transparency in reporting, more detail about the methods used, and more thought about the implications for our findings and the patterns of results we see in the literature.

For this to work, there would need to be an "amnesty" of sorts, so that it was okay for authors to accurately describe what they did and what the product was, without papers getting rejected out of hand or people throwing bombs at each other's methodology. Otherwise the peer review process creates a situation where the incentives are to use weak methods and tests to make reliability "look good," or to omit or misrepresent the actual state of affairs.

For this to really happen, it would require some journal editors making a strong commitment to encourage authors to report more accurately (including additional space? or mandatory supplemental materials? but that creates an incentive to send the paper somewhere that is less work) and to support them in the review process.

The DSM-5 field trials are an interesting case study. The IRRs were low by any traditional rubric, and there was a lot of debate about "what happened?" and whether to publish them. Kraemer was thoughtful about the methods used, which were not structured interviews, and did not have rater training or feedback, so it was not surprising that the estimates were lower. They created a new rubric and rolled it out with the results in the same paper.

If readers forget about the differences in the method or process, then the new rubric is analogous to grade inflation. Suddenly lower performance gets a meritorious label. Kraemer's point was probably intended to be that we should have different benchmarks for different designs or goals, but that is a degree of nuance that the field will probably gloss over. Certainly it is helpful to authors to have a reputable citation that they can invoke to re-label their kappa as "good."

Question 4: How do you think we can increase transparency moving forward?

EY: I think that there are several options, but none is perfect.

The "publish more details" and "be forthright and transparent" approach is going to create a "kick me here!" situation, where our candor will be met by reviewers attacking us for what we just disclosed. If only bipolar researchers try to be more frank about it, it could also lead to the perception that bipolar research is a mess, when actually it would be us holding ourselves to higher standards. It might be possible to get specific journal editors and reviewers to support the change (as mentioned above), but it would be a slow process with no guarantee of success.

A second option might be to publish the information on websites (here on Wikiversity? on lab web pages?), where it would not be subject to peer review but could be accessed and cited.

There are probably other options. Looking forward to hearing others' thoughts!

More Questions! (posted by June Gruber; first responses by Eric Youngstrom)

Question: In your opinion, for studies that are not focused on measurement development (i.e., in which IRR itself is not the primary or secondary study outcome), would calculating IRR before or after doing primary study outcome analysis have any bearing on the primary results? Or would you view calculation of IRR as a separate analysis from the analysis of study outcome, and thus the timing would have little bearing?

EY: It seems to me that reliability has a direct bearing on any analysis that uses the variable. Unreliability attenuates the effect sizes we observe. If the true correlation is .6, and both variables are measured with 80% reliability, we would expect to observe a correlation of .6 * .8 = .48. If we measured them with 50% reliability instead, we would get a correlation of .3, all else being equal. If we are talking about outcome variables, then the lower reliability attenuates the treatment effect the same way. If we are focused on diagnoses, then imperfect reliability means we are mixing in cases that don't have what we thought they did (reducing diagnostic accuracy in ROC studies; and boosting the placebo effect and reducing treatment effects in trials).

If people reported reliability consistently, then it would be possible to disattenuate the effect sizes across studies. Hunter and Schmidt advocated this in all of their writing about meta-analysis. It was a reasonable recommendation in the industrial-organizational literature, where they worked. In every meta-analysis I have done, I have tried to capture the reliability information. I have not yet been able to use the reliability estimates in any meta-analysis, because the majority of the articles reviewed don't report it.
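The attenuation arithmetic in this answer can be sketched with the classical (Spearman) correction for attenuation, in which the expected observed correlation is the true correlation times the square root of the product of the two reliabilities. This is only a minimal illustration under classical test theory assumptions; the function names are made up for this page and are not from the original post. Multiplying by the reliability itself (.6 * .8 = .48) corresponds to both variables having the same reliability; if only one variable is measured with error, the factor is the square root of that variable's reliability, so the attenuation is milder.

```python
import math


def _check_reliability(value):
    # Reliabilities are proportions of true-score variance: 0 < value <= 1.
    if not 0.0 < value <= 1.0:
        raise ValueError("reliability must be in the interval (0, 1]")


def attenuated_r(true_r, rel_x, rel_y=1.0):
    """Expected observed correlation, given the true correlation and the
    reliabilities of the two measures (rel_y=1.0 means y is error-free)."""
    _check_reliability(rel_x)
    _check_reliability(rel_y)
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_r(observed_r, rel_x, rel_y=1.0):
    """Estimate of the true correlation from an observed one (the
    correction that meta-analysts could apply if reliability were reported)."""
    _check_reliability(rel_x)
    _check_reliability(rel_y)
    return observed_r / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    # Both variables measured with the same reliability.
    print(round(attenuated_r(0.6, 0.8, 0.8), 3))  # 0.48
    print(round(attenuated_r(0.6, 0.5, 0.5), 3))  # 0.3
    # Only one variable measured with error.
    print(round(attenuated_r(0.6, 0.8), 3))       # 0.537
    print(round(attenuated_r(0.6, 0.5), 3))       # 0.424
    # Working backward from an observed correlation.
    print(round(disattenuated_r(0.48, 0.8, 0.8), 3))  # 0.6
```

The second function is the piece that depends on consistent reporting: without a reliability estimate from each study, there is nothing to plug in for rel_x and rel_y.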

Not checking reliability doesn't solve anything -- the problems are still baked into the data whether we choose to look at it or report it or not. And checking the reliability in a way that makes rosy assumptions and gives shiny coefficients usually is just papering over the problems that actually are hurting the primary analyses for the study.

From a purist's point of view, the best way to evaluate reliability would be to get an accurate reading of it during the time that the outcome variables are constructed, and measure it in a way that reflects the key facets that matter in the reproducibility of the score. Don't just report Cronbach's alpha when the core analysis hinges on the accuracy of a diagnosis, or the severity of the score, for example. Reliability should be like a readout on the car dashboard, indicating whether the motor is running okay or having serious problems. Taking a picture of the engine thermometer before the trip or after it is over is not going to provide the same utility.
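As one way to picture the "dashboard" idea for a diagnosis-based analysis, here is a rough sketch that computes Cohen's kappa (a common chance-corrected agreement statistic for categorical ratings) separately for each batch of double-coded interviews, so that drift would show up during the study rather than afterward. The choice of kappa, the batch labels, the diagnosis codes, and the ratings are all hypothetical illustrations, not data or methods from the discussion above.

```python
import math
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Proportion of cases where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


def kappa_by_batch(batches):
    """Map each batch label to the kappa for that batch's paired ratings."""
    return {label: cohen_kappa(a, b) for label, (a, b) in batches.items()}


if __name__ == "__main__":
    # Hypothetical double-coded diagnoses, grouped by when they were rated.
    batches = {
        "month_1": (["BP", "MDD", "BP", "MDD"], ["BP", "MDD", "BP", "MDD"]),
        "month_2": (["BP", "BP", "MDD", "MDD"], ["BP", "BP", "MDD", "BP"]),
    }
    for label, kappa in kappa_by_batch(batches).items():
        print(label, round(kappa, 2))
```

In the made-up second batch, observed agreement is .75 and chance agreement is .5, so kappa drops to .5. Real batches would need far more than four cases for the estimate to be stable.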

But that is an idealistic view. When 70% of studies don't report reliability at all, and more than half of those that do are presenting what really is a heroic best case scenario (or a fiction), then pretty much any reporting would be a step in the right direction.

Question: Do you think it would be inappropriate to use data from raters still in training at various stages for IRR analyses? My understanding is the role of training in IRR protocols is not yet well established and varies somewhat between labs, and that sometimes consensus meetings can be held between raters to help solidify their training (e.g., discuss discrepancies and correct errors, or address rater drift over time) and that in these cases it can be appropriate to report the scores obtained during these consensus meetings. Would doing this pose any major issue in your opinion (e.g., signal a serious research error)?

EY: I certainly don't see this as a black and white issue. I think that it would be fine to include, especially provided that it is described accurately. (Good luck figuring out how to do that and navigate peer review! But if I am a reviewer, I promise that I will not only be sympathetic, but advocate for the transparency in my comments to the editor, too). It would not be a "major issue," and it is not likely to have a huge effect on the results; it would be a nice increase in transparency.

Implicit in my response to the earlier question is that an ideal would be continuous monitoring of reliability, or at least the facet of it most important to the goals of the study. Reliability is not a box to check off to satisfy reviewers (or it shouldn't be); it is a way of checking that we measured our key variables in a way that supports all of the analyses downstream. Without adequate reliability, the study will fail (or get a Type I error!).