Talk:WikiJournal Preprints/Comparison between the Lund-Browder chart and the BurnCase 3D® for consistency in estimating total body surface area burned
Article information
Yoo, K; Woo, G; Jang, T; Song, J.
You can follow its progress through the peer review process at this tracking page.
Suggested (provisional) citation format:
Yoo, K; Woo, G; Jang, T; Song, J (2020). "Comparison between the Lund-Browder chart and the BurnCase 3D® for consistency in estimating total body surface area burned". WikiJournal Preprints. https://en.wikiversity.org/wiki/WikiJournal_Preprints/Comparison_between_the_Lund-Browder_chart_and_the_BurnCase_3D%C2%AE_for_consistency_in_estimating_total_body_surface_area_burned.
License: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction, provided the original author and source are credited.
This is the pre-publication public peer review for the article Comparison between the Lund-Browder chart and the BurnCase 3D® for consistency in estimating total body surface area burned
Copyvio check
- Pass. Report from WMF copyvios tool: 0% Plagiarism, 100% unique. Mikael Häggström (discuss • contribs) 15:02, 16 November 2018 (UTC)
Initial editorial comments
Comments by Thomas Shafee
These comments were submitted on , and refer to this previous version of the article
As a minor edit, I would recommend omitting the "®" symbol after the first use, since it is thereafter implied. The layout, format, and linking are all appropriate.
Thank you so much for reviewing the manuscript. The following are my responses and the revised text reflecting your comments.
First peer reviewer
Review by anonymous peer reviewer, MD PhD burns researcher
This review was submitted on , and refers to this previous version of the article
This is an interesting study and an important one. There are however several concerns that need to be addressed.
First, 3D imaging is not novel, and similar studies have been conducted before. What is new and different about this study?
Second, you only have small burns, and I think this introduces a bias into the study. Smaller burns are easier to estimate than burns over 20% TBSA. You need to add some larger burn sizes.
Third, we are aware of the challenge of estimating burn size in adipose patients. What was the BMI of your subjects?
Fourth, I am concerned about having only 3 surgeons. Again, why only 3 and not more?
Fifth, there are many limitations that require mentioning, but the main one is what is novel and different compared to other already published trials.
Thank you so much for reviewing the manuscript. The following are my responses and the revised text reflecting your comments.
Answer to the first and fifth issues: difference between this article and other research
As you pointed out, burn size (TBSA) estimation using computers is an active research topic globally. However, to the best of our knowledge, no paper on the topic has been written by South Korean scholars. One motivation for our study was that this paper, essentially a preliminary study, would hopefully trigger subsequent research on estimating burn area using computers. The revised portions of the paper are as follows:
A sentence at the last part of introduction section was fixed :
“...In this paper, we compare the difference of the estimation for %TBSA produced with the BurnCase 3D and the conventional Lund-Browder chart as observed on burn patients’ pictures, as a preliminary study, this paper investigates pros and cons of adoption."
A new paragraph was added in Discussion section :
“Despite sufficiently wide spread of EMR (electronic medical record) systems in South Korea, little work has been published on computer-aided burn size estimation. In writing this paper as a preliminary study, we hoped to trigger future research activities. We did not include statistically sufficient number of study participants or patients, however, we were able to demonstrate differences of provisional vs. computer-aided burn area estimation methods.”
A first sentence of conclusion section was changed :
As a preliminary study, this paper observed some variance in %TBSA estimation among participants when using the Lund-Browder chart and when using the BurnCase 3D.
Answer to the second issue: need for more subjects with burns larger than 20% TBSA
Thank you for this important comment. Between 2013 and 2017, 52 burn patients were admitted to one of the authors' hospitals, and a total of 2,244 photos were taken of these patients. However, we ended up with only 263 photos from 26 patients for this paper after discarding pictures that could not be used for various reasons: burn size or location could not be determined; the presence of blisters or slough made estimation impossible; or the photos were taken after surgery. We agree that the numbers are too small for making reliable claims; however, our consultation with a statistician indicated that they are large enough for conducting a preliminary study. Even so, we strongly agree with you that we need to collect more photos from more patients.
Answer to the third issue : comment about BMI of subjects
In the chart we provided BMI data yet we omitted explanation on it in the text; the following has been added to the Result section.
Additional descriptions were added after the third sentence of Result section :
"...and their body mass index was 24.6±3.4 kg/m². The World Health Organization (WHO) generally defines a BMI of 25 kg/m² and 30 kg/m² as overweight and obesity, respectively. However, the Korean Society for the Study of Obesity (KOSSO) and the Asia-Pacific section of the WHO have agreed to use lower indices of 23 kg/m² and 25 kg/m² for overweight and obesity because the prevalence of diabetes and cardiovascular diseases doubles above 25 kg/m².[1][2][3] With the regular WHO standard, 13 patients, or 50% of the 26, were overweight and none were in the category of obesity. With the WHO standard for Asians, then, 6 patients (23.1%) were overweight and 13 patients (50%) obese."
- Bray, GA; Bouchard, C. (2004). Handbook of obesity. New York, NY: Marcel Dekker, Inc.
- WHO (1997). Obesity: Preventing and managing the global epidemic. Report on a WHO consultation on obesity.
- WHO/IASO/IOTF (2000). The Asia-Pacific perspective: redefining obesity and its treatment. Health communications Australia: Melbourne.
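The two sets of cut-offs quoted above can be sketched as a small classifier. This is only an illustration: the function names, the category labels, and the choice to treat each cut-off as inclusive (BMI at or above the threshold) are assumptions, and the example patient is hypothetical rather than taken from the study.

```python
# Cut-offs in kg/m^2 as quoted in the response: (overweight, obese).
CUTOFFS = {
    "who": (25.0, 30.0),           # general WHO standard
    "asia_pacific": (23.0, 25.0),  # WHO Asia-Pacific / KOSSO standard
}


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def classify_bmi(value, standard="who"):
    """Classify a BMI value; thresholds are treated as inclusive (an assumption)."""
    overweight, obese = CUTOFFS[standard]
    if value >= obese:
        return "obese"
    if value >= overweight:
        return "overweight"
    return "normal or below"


# Hypothetical patient: 72 kg, 1.70 m (BMI just under 25).
example = bmi(72.0, 1.70)
print(round(example, 1),
      classify_bmi(example, "who"),
      classify_bmi(example, "asia_pacific"))
```

The same patient can fall into different categories under the two standards, which is why the overweight and obese counts reported above differ between them.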
Answer to the fourth issue: small number of observers
Thank you for this important comment. From the early stage of our study, this concern led us to consult a statistician. We settled on 3 observers in order to reduce inter-observer variation. We revised the manuscript by adding a sentence on inter-observer variation to the Material and methods section.
"... poor quality were discarded. After consulting a statistician, the number of participant was determined to 3 in order to reduce inter-observer variation.”
Other issue
The reason why Participant No. 3’s evaluation results deviate from those of the other participants is that he was going through the learning curve for the software during the experiment. We revised the manuscript to include this factor.
The fifth sentence at 3rd paragraph in discussion section is changed
from “This hypothesis merits additional research with a large number of participants”
to “We believe discussing the learning curve for BurnCase 3D will merit additional research”.
--Stonegaze (discuss • contribs) 05:15, 26 November 2018 (UTC)
Second peer review
Review by Herbert Haller, HLMedConsult
This review was submitted on , and refers to this previous version of the article
Competing interests declared: medical advisor to the BurnCase 3D project without financial benefits
This paper compares computer-assisted TBSA calculation using BurnCase3D with manual estimation using Lund-Browder charts. The study’s main findings include significant differences in TBSA estimates between the two methods in two out of three participants. No such differences were found in the third participant, who was an experienced BurnCase3D user. The study also found differences in time requirements, with estimation using BurnCase3D taking significantly more time.
As a medical advisor to the BurnCase project, I am happy to see these questions receiving research attention. The study seems to provide evidence that the use of BurnCase can lead to improved accuracy in TBSA estimates at the cost of increased time requirements. Unfortunately, I find two methodological weak points that may lead the authors to underestimate the advantage of computer-assisted methods.
The first weak point is that participants were trained on the use of BurnCase3D, including the use of 5 tutorial cases, before doing any other estimation. It would have been preferable to train participants after they completed estimates using Lund-Browder, since we have found that use of BurnCase3D leads to a “calibration effect” which increases accuracy in non-computer-assisted estimation methods. Some evidence for this effect is provided by the paper itself, which showed that the experienced BurnCase3D user among the three participants showed no differences in TBSA estimates when using BurnCase3D and when using Lund-Browder. I believe that this leads to the authors underestimating variance between estimates completed using the two methods. I would encourage discussion of this point, and it would be interesting to see whether in fact there is evidence of increasing calibration in participants across the different example cases.
The second weak point, which may be methodological or simply a lack of discussion, is regarding the timing estimates for the use of BurnCase3D. There are two ways of using BurnCase3D. The first involves manually fitting a photograph to a 3D model in the application, which takes a lot of time but allows for very precise marking of burns on the 3D model. The second method involves letting users simply mark the burned surface area on the 3D model with reference to the photos, but without explicitly fitting the photos to the 3D model. This second method is much faster and we have found that it is nearly as accurate as the first. Experience shows that clinical practitioners eventually adopt the second method.
Based on the large time requirements reported in the study, I would expect that a mix of these two techniques was used. I think this leads to an overestimation of the effective time requirements for BurnCase3D in clinical practice. Ideally, I would have liked to see a comparison of the two methods of using BurnCase3D as well, but at the very least, the paper in its current state is missing an in-depth discussion of the precise technique used and how it relates to clinical practice. This is especially important since the paper mentions that it aims at analyzing the pros and cons of using computer-assisted methods in practice.
I’d also have been interested to see whether estimates differ based on wound location since one would expect that 3D measurements are superior to 2D methods in the case of lateral wounds. I would also have liked a discussion of some of the large differences in estimates between Lund-Browder and BurnCase3D in cases where TBSA is large (e.g. >20%). While the paper mentions that there is “some difference” in estimates between the two methods, the rather extreme magnitude of those differences in cases of high TBSA seems worthy of separate mention.
Thank you for this insightful comment. We have revised the manuscript to reflect your comments.
Response to issue No. 1: TBSA estimation discrepancy in BurnCase 3D
Thanks to your insightful comment, we realized we should have paid more attention to this issue during our research, and consequently we looked into the literature. We could not find studies that squarely addressed this specific question; however, a few works we reviewed after your comment were relevant to this issue to varying degrees, and we added them to our Discussion section as follows:
After the sentence “Participant 3 is an author of this study and had some experience of using the software, which may explain the insignificant statistical difference.”
“There is a possibility that in the case of Participant 3, such previous experiences could have improved the burn size estimation. In other words, the experience of using BurnCase 3D and checking the output may have enabled Participant 3 to ‘calibrate’ the result for when not using BurnCase3D. This suggests that results obtained using BurnCase3D are more positive and closer to being objective when compared with classical methods^{[5][20][21]}. This implies that the discrepancy of burn-size estimation could be reduced among burn specialist, medical students, and research associates if BurnCase3D is widely used^{[5]}. However, we believe research on this calibration effect and the learning curve for BurnCase 3D merits additional research."
We also deleted the sentence “Ruling out Participant 3, we could hypothesize that there is a meaningful difference in burn estimation between the Lund-Browder chart method and the BurnCase3D” and “We believe discussing the learning curve for BurnCase 3D will merit additional research“.
Response to Issue No. 2: area estimation via BurnCase 3D
As you pointed out, BurnCase 3D offers two methods of displaying a patient's wounds on the 3D model using photos. We have included this point in the manuscript and revised it as follows:
"BurnCase 3D has two methods of estimating %TBSA based on patient’s photos - one that involves superimposing the photo on the 3D model and the other that does not. In the non-superimposition method, the wound and the 3D model are merely displayed separately. Superimposing photos on the 3D model creates a more accurate wound representation at the cost of more estimation time by users. On the other hand, without superimposition, %TBSA estimation can be performed faster and the difference is very marked when the burn area is large. Participants spent around 2.5 minutes for burn estimation with the Lund-Browder chart and more than 9 minutes with the BurnCase 3D, and more than 10 minutes for cases when %TBSA was thought to be greater than 10%. The greater the number of burn images, the more time our participants took for estimation. For our experiment, all of the photos were superimposed, and this may have contributed to the increased estimation time for each participant. Had we provided only the non-superimposition method, their estimation time may have been reduced; however, this factor was not anticipated in the experiment design phase.
In our future work, we will adopt both methods for %TBSA estimation."
We also deleted the sentence “We did not perform a post-hoc analysis on this phenomenon but we suspect degree of expertise and the number of burn images could be the culprit”.
Response to Issue No. 3: subjects' body parts for estimation
The accuracy of BurnCase 3D seems to vary depending on the location of the patient’s wound. This was not mentioned in our manuscript, so we have now added the text below. It appears after the 8th sentence of the Result section, which ends with “and 8 by electric burn.”, and we also included a new table:
"Out of the 26 subjects, the wounds appeared on different body parts as following: 8 cases in the head; 8 cases in the neck; 16 cases in the trunk (Ant. & Post.); 2 cases in the right buttock; 3 cases in left buttock; 4 cases in the genitalia; 10 cases in the right arm (except hand); 11 cases in the left arm (except hand); 2 cases in the right hand; 3 cases in the left hand; 8 cases in the right leg; 7 cases in the left leg; 4 cases in the right foot; 3 cases in the left foot. Only 3 cases had wounds near the anus or the perineum (Table 2)."
The following sentence has been newly added to Discussion section, after the second paragraph. "One known shortcoming of BurnCase3D is that estimation result can be inaccurate when the wound is located where different body parts overlap, e.g., perineum or the medial side in the buttock. Only 3 out of the 26 patient cases had wounds near the buttock or the perineum in our experiment. In other words, the sample size is too small for claiming that BurnCase 3D has a weakness in estimating burn areas around either areas because 23 of our cases were outside of either regions."
Response to Issue No. 4: Grouping patients with TBSA threshold of 20% and comparing them
In our study, only 5 of the 26 subjects had TBSA greater than 20% as determined with the Lund-Browder chart. We agree that the BurnCase3D estimates may diverge above the TBSA threshold of 20%; however, our statistician, who performed the post-hoc analysis, does not agree. We concluded that our future work needs a larger number of subjects and participants, and this was stated in the Discussion as follows:
"Out of the 26 subjects, 5 had TBSA greater than 20% according to the Lund-Browder chart and the remainder, 21 subjects had TBSA less than 20%.
Discussion with our statistician for post-hoc analysis has concluded that we may not know with confidence whether the two groups are different. Our future work will research with more subjects and more participants."
--Stonegaze (discuss • contribs) 03:09, 9 January 2019 (UTC)
Editorial comments
Comments by Thomas Shafee
These comments were submitted on , and refer to this previous version of the article
Thank you for your responses to the peer reviewer comments. A couple of minor points have been raised by the editorial board:
- The abbreviation TBSA would be best written in full in the title and the first time in the abstract. Ditto PC in the introduction, etc. Consider participants versus participant in the methods section and other such copy editing changes.
- There are six tables. Are any presentable as figures e.g. the Inter-observer variability tables?
- The journal uses the style of punctuation-before-references.
- We recommend adding ORCIDs for the remaining authors (registration for those currently without).
- Where possible, tables should be made sortable for added functionality.
- The conclusions should include at least a sentence, and maybe a paragraph, about “Limitations and future directions”.
- “Participants” is an ambiguous term in the tables. I would suggest switching to “Rater” or “Observer” for the column headings to keep the role distinct from patients (who are also contributing to the data).
Thank you for this insightful comment. We have revised the manuscript to reflect your comments.
- Yes, that's correct. I made changes to the manuscript as you recommended. However, I am embarrassed to ask, but is it possible to change the title? My understanding is that it is not.
- Can your question be paraphrased as ‘Pick a table that represents all of the 6 tables’? If so, then Table 6 is representative of all. Please let me know if I am mistaken.
- Thank you for pointing out this mistake. I corrected the punctuation.
- We added ORCID for all authors.
- I noticed that you already sorted the table. Thank you!
- Reflecting your comment, we revised the manuscript as follows:
...when the BurnCase 3D was used. One limitation of this study was that there were very few patients with TBSA 20% or above and few had burns on perineum, making our post-hoc analysis less robust for evaluating BurnCase 3D. The meaningful discrepancy in user proficiency for using BurnCase 3D among observers and weak diversity of burn areas were additional factors that could be improved in the future. The next research would include a larger number of observers with varied proficiencies; more in-depth evaluation method for BurnCase 3D; diverse burn areas as well as specific body parts. Since accuracy of %TBSA estimation is related to prognosis, correlation of %TBSA estimation versus mortality may meaningfully contribute to future work.
- We are now using 'observer' only and it enhances clarity.
Comments by Thomas Shafee
These comments were submitted on , and refer to this previous version of the article
Thank you for addressing all of these. The title edit is no problem at all - I've now implemented via a "page move".
Comments by Mark Worthen
These comments were submitted on , and refer to this previous version of the article
Copy edit suggestion
CURRENT:
Background: The objective of this study is finding out time requirement and discrepancy in estimation for TBSA burned among burn experts when using the Lund-Browder chart and the BurnCase 3D®.
SUGGESTED REVISION:
Objective: Measure time required to determine total body surface area (TBSA) burned (%TBSA) using the Lund-Browder chart and BurnCase 3D®, and calculate variance between the two methods' %TBSA estimates.
Inter-rater reliability
Since the article promises to compare the *consistency* of the Lund-Browder Chart and the BurnCase 3D®, I want to make sure the appropriate statistics were computed to measure such consistency. (I should acknowledge right away that I'm not a "stats guy", so my apologies in advance if my concerns are unwarranted.)
In the Material and methods section, the authors state, "Repeated ANOVA measure was used for evaluating inter-observer variability. Statistical analysis was performed using the SAS® version 9.4 for Windows." And in the Results section, the authors report, " Inter-individual variation was tested with Repeated ANOVA measure and we found no statistical significance (p = 0.31)."
If I understand correctly, reporting "no statistical significance" is not what we need to know. We need to know the inter-rater reliability, computed as an intraclass correlation coefficient (ICC). Here's a quote from Hallgren (2012):
"Significance test results are not typically reported in IRR studies, as it is expected that IRR estimates will typically be greater than 0 for trained coders (Davies & Fleiss, 1982)." (p. 11 of the author's manuscript on PMC)
The ICC should be readily available from the authors' SAS output, although we should have someone well-versed in determining inter-rater reliability check if the correct model, type, unit, and effect were chosen for the analysis.
So that it's clear what I mean, here's my non-statistician guess at what should have been chosen for the analysis:
model: two-way
type: absolute agreement
unit: average-measures
effect: mixed
References
Hallgren KA. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol. 2012 ; 8(1): 23–34. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics. 1982; 38(4):1047–1051.
Thank you for your comment about Material and Methods section. We agree with your comment. Following is our revision in Material and Methods:
Statistical analysis was performed using the SAS® version 9.4 for Windows. In order to compare the results based on Lund-Browder chart and the BurnCase 3D, we performed a paired t-test. In consideration of inter-observer variability, we used repeated ANOVA measure, in which we opted for the default setting. To verify differences among the three groups, we used one-way model, in which we chose average and nested-effects for unit and effect, respectively.
Statistical Review
I commend the authors on a clearly written study on an important topic. Strengths of the paper include a clear statement of the clinical and practical importance of accurate burn wound size estimation, inclusion of the raw data, and the combination of literature citations and Wiki links.
What follows are recommendations for improving the statistical presentation and enhancing the value of the report. In recognition of the fact that I have been very delayed in my review due to personal circumstances, as well as that the source data were available, I have gone ahead and run additional analyses that I would suggest for consideration, rather than just recommending them and having the authors consult their own statistician. For these analyses, I used SPSS v25, and I am including the syntax as an appendix to the review to maximize reproducibility.
Report ICCs:
In the discussion section, the authors mention some ICCs reported in prior work. ICCs are a clear and well recognized way of quantifying agreement. I went ahead and estimated the ICC for absolute agreement using a random effects model (which assumes that the raters are sampled from a larger universe of raters). This is the most conservative model for estimating reliability, but also the most appropriate – a consistency (vs. absolute) agreement model would ignore any difference in rater calibration, which clearly would be clinically important (e.g., if a less experienced rater tended to under-estimate burn size, that would matter; and the ICC should reflect any such tendency). The goal of the analysis is to estimate ICC for raters in general, not only when used by those 3 exact clinicians (so a random effects estimate is preferable to a fixed effects one) (McGraw & Wong, 1996).
Even using the conservative ICC model, both methods produce excellent inter-rater reliability: ICC(A,1) = .97 for the Lund-Browder chart and .92 for the BurnCase 3D. Both are considered “excellent” according to a variety of rubrics (Cicchetti & Sparrow, 1981; Fleiss, 1981; Landis & Koch, 1977; Regier et al., 2012).
Note that my recommendation is the same as Dr. Worthen's, except for the last detail: I do not think that "average measures" is what we want here, because that would estimate the reliability if all three judges read every patient's results, and we used the average of the three ratings as the final estimate of burn size. This maximizes reliability, but in this case would triple the cost and time involved (e.g., having three clinicians look at every burn patient). Averaging multiple ratings is occasionally done in research, and almost never done in practice.
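To make the single-measure versus average-measures distinction concrete, here is a minimal sketch that computes both absolute-agreement ICCs from a subjects × raters table, using the two-way mean squares in the forms given by McGraw & Wong (1996). The function name and the toy ratings are illustrative assumptions; this is not the study data and not a reproduction of the SPSS output.

```python
def icc_absolute_agreement(ratings):
    """Return (ICC(A,1), ICC(A,k)) for a list of rows, one row per subject,
    one column per rater (two-way model, absolute agreement)."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Toy ratings: 6 subjects rated by 4 raters (illustrative only).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc_absolute_agreement(toy)
print(round(icc_single, 2), round(icc_average, 2))
```

The average-measures value is always at least as large as the single-measure one when raters agree at all, which is the sense in which averaging "maximizes reliability" at the cost of needing every rater for every patient.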
Test difference of methods pooling raters:
Paired t-test provides a clearer take-home message than the analyses broken down separately by rater do. When doing the analysis as a paired t-test across the 78 ratings, the Lund-Browder ratings are significantly larger on average, by 1.92 (p <.0005).
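For readers who want to see what the pooled paired comparison computes, the following is a minimal sketch of the paired t statistic using only the Python standard library. The function name and the five toy pairs are assumptions for illustration; the actual analysis used 78 paired ratings in SPSS. Obtaining a p-value would additionally require the t distribution (for example via `scipy.stats.ttest_rel`), which is left out to keep the sketch dependency-free.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (mean difference, t statistic, degrees of freedom)
    for paired observations first[i] vs. second[i]."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = mean(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample SD of the differences
    return d_bar, d_bar / standard_error, n - 1


# Toy %TBSA estimates (illustrative only, not the study data).
lund_toy = [10, 12, 8, 15, 20]
burncase_toy = [9, 10, 8, 12, 17]
mean_diff, t_stat, df = paired_t(lund_toy, burncase_toy)
print(mean_diff, round(t_stat, 2), df)
```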
I also made a Bland-Altman plot to examine the reproducibility of the scores graphically. The authors are welcome to rebuild it or use it in their article (I uploaded it to WikiMedia Commons for their convenience).
The Bland-Altman plot (Bland & Altman, 1986) suggests that the most influential disagreement was an outlier where the BurnCase 3D actually produced a larger estimate. Regression analyses predicting Lund-Browder ratings, using BurnCase 3D ratings as the predictor, confirmed that this was an influential outlier based on Studentized deleted residuals (-6.0, where +/- 3 would be concerning), Mahalanobis’ distance (20.3, where > 4.0 would be concerning without post hoc correction, and values >10 would be concerning after Bonferroni or Sidak correction for 26 observations), and Cook’s Distance (4.7, also concerning). Dropping this case, the difference between Lund-Browder and BurnCase 3D gets larger, but not to a clinically meaningful degree (i.e., the mean difference on the remaining pairs is still 1.9).
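The three diagnostics named above can be sketched for a one-predictor regression as follows. This is a hedged illustration using standard textbook formulas (leverage, Cook's distance, the externally studentized or "deleted" residual, and a Mahalanobis distance derived from leverage); the function name and toy data are assumptions, and the numbers will not match the SPSS output reported in the review.

```python
import math


def simple_regression_diagnostics(x, y):
    """Per-case leverage, Cook's distance, studentized deleted residual and
    Mahalanobis distance for the regression y = a + b*x."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    mse = sse / (n - p)

    rows = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        cooks = (e * e / (p * mse)) * leverage / (1 - leverage) ** 2
        # Residual scaled by the error variance estimated without this case.
        deleted = e * math.sqrt((n - p - 1) / (sse * (1 - leverage) - e * e))
        mahalanobis = (n - 1) * (leverage - 1 / n)
        rows.append({"leverage": leverage, "cooks": cooks,
                     "sdresid": deleted, "mahal": mahalanobis})
    return rows


# Toy data with one deliberately discrepant last case (illustrative only).
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 14.0]
diagnostics = simple_regression_diagnostics(x_toy, y_toy)
worst = max(range(len(diagnostics)), key=lambda i: diagnostics[i]["cooks"])
print(worst, round(diagnostics[worst]["cooks"], 2))
```

A case is usually worth a second look when several of these measures flag it at once, which is the pattern described for the outlier in the review.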
The other trend the Bland-Altman plot shows is a fan-shaped pattern where the discrepancies between methods get larger as the size of the burn gets larger.
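The quantities behind a Bland-Altman plot are simple to compute: the per-pair average and difference, the mean difference (bias), and the limits of agreement at ±1.96 SD of the differences. A minimal sketch follows; the function name and toy values are assumptions, and plotting is omitted so the example stays self-contained.

```python
from statistics import mean, stdev


def bland_altman(first, second, z=1.96):
    """Return the points and summary lines of a Bland-Altman plot
    for paired measurements first[i] vs. second[i]."""
    averages = [(a + b) / 2 for a, b in zip(first, second)]
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)
    sd = stdev(differences)
    lower, upper = bias - z * sd, bias + z * sd
    # Indices of pairs falling outside the limits of agreement.
    outside = [i for i, d in enumerate(differences) if d < lower or d > upper]
    return {"averages": averages, "differences": differences,
            "bias": bias, "lower": lower, "upper": upper, "outside": outside}


# Toy %TBSA estimates (illustrative only, not the study data).
lund_ba = [5, 10, 15, 20, 30, 40, 12, 8, 25, 18]
burncase_ba = [4, 9, 13, 18, 27, 25, 11, 8, 23, 16]
ba = bland_altman(lund_ba, burncase_ba)
print(round(ba["bias"], 2), round(ba["lower"], 2), round(ba["upper"], 2), ba["outside"])
```

Plotting `differences` against `averages` with horizontal lines at `bias`, `lower`, and `upper` gives the usual figure; a fan shape appears when the spread of differences grows with the average.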
I am not in a position to comment whether a 2% difference in TBSA is clinically meaningful, but I would invite the authors to weigh in. The result requires re-writing part of the conclusions. I have suggested language for that, which the authors can use or revise.
Food for thought: I agree with the authors’ statistician that the number of “tough to measure” burn sites is not large enough for statistical significance testing. However, the regression framework or the Bland-Altman plot provide exploratory ways of investigating the effects of these hypotheses. With the Bland-Altman plot, there were four estimates that were “out of bounds” (>1.96 SD) with the Lund-Browder method providing larger estimates. If two or more of these were burns on the buttocks or perineum, that would be strongly suggestive of their hypothesis, which was based on prior data and experience, not a post hoc explanation based on the current data. Similarly, the authors could create a dummy code for whether the burn site was one of the “difficult to measure” sites, and include that as a predictor in the regression – that would provide an estimate of the extent to which BurnCase 3D estimates differed at those sites.
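As one simplified way to explore the dummy-code idea, the sketch below regresses the between-method difference on a 0/1 indicator for "difficult to measure" sites. This is a variant of the suggestion above (which places the dummy in the regression predicting Lund-Browder), chosen because with a single 0/1 predictor the slope equals the difference in mean discrepancy between the two groups. All names and values are hypothetical.

```python
def regress_on_dummy(differences, dummy):
    """Ordinary least squares of differences on a 0/1 dummy.
    Returns (intercept, slope)."""
    n = len(differences)
    x_bar = sum(dummy) / n
    y_bar = sum(differences) / n
    sxx = sum((d - x_bar) ** 2 for d in dummy)
    sxy = sum((d - x_bar) * (y - y_bar) for d, y in zip(dummy, differences))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical Lund-Browder minus BurnCase 3D differences, and a dummy coded 1
# when the burn involves a hard-to-measure site (e.g., buttock or perineum).
diff_toy = [1, 2, 1, 3, 6, 7]
difficult_site = [0, 0, 0, 0, 1, 1]
intercept, slope = regress_on_dummy(diff_toy, difficult_site)
print(intercept, slope)
```

Here the intercept is the mean discrepancy at ordinary sites and the slope is the extra discrepancy at the flagged sites; with only a handful of flagged cases this is exploratory, as the review notes.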
Administration Time
I was impressed that the authors tracked the administration time. A paired t-test (or a repeated measures ANOVA) comparing the 26 x 3 = 78 paired observations (Lund vs. BurnCase 3D) again would probably indicate significant time differences. The authors are welcome to run these additional analyses and comment accordingly.
Intra-observer and Inter-Observer significance
The current presentation in Tables 4 & 5 and in associated text describes “Intra-Observer” significance and Inter-Observer significance. Several thoughts here:
"Intra-observer" appears to describe differences between Lund-Browder and BurnCase 3D across 26 cases within each rater. This might be interesting as a supplemental analysis, but it is less important than the general question of “how well do these methods agree in general, when used by different raters?” That question is better addressed by methods such as a paired t-test and correlation, Bland-Altman plots, or extensions of ICC models.
The “inter-observer significance” label is confusing to me. I think that it refers to a test of differences across the 3 x 26 = 78 paired observations, but I cannot replicate the p value (I get a highly significant result with any approach that treats the observations as paired – paired t-test or repeated measures ANOVA). The p-value is much less important than the effect sizes, too: What is the level of agreement (ICC?)? What is the typical discrepancy between methods? Please revise accordingly.
Discussion and Conclusions
I have suggested an additional sentence restating the goals of the study. Additional revisions could talk more about the time component and costs involved.
The significant difference in estimates between the two methods (based on the paired t-test or Bland-Altman plot) is a new finding based on re-analysis of the data the authors provided, and it leads to different substantive conclusions (e.g., that the BurnCase 3D tends to produce smaller estimates of burn size, and that the bias gets larger with larger burn sizes: r = .26, p < .05; r = .48 if the influential outlier is dropped, which seems alarming to me).
Here, I am looking forward to the authors’ comments on the patterns in the new analyses, as they are clearly thoughtful researchers and experienced clinicians, and I have no content expertise vis-à-vis burns.
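The correlation referred to here is between the average of the two methods and their difference, which is one common way to check whether the bias grows with burn size. A minimal standard-library sketch, with hypothetical names and toy values rather than the study data:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy paired estimates (illustrative only).
lund_pb = [1.5, 2.5, 4.0, 5.0, 6.5]
burncase_pb = [0.5, 1.5, 2.0, 3.0, 3.5]
pair_average = [(a + b) / 2 for a, b in zip(lund_pb, burncase_pb)]
pair_difference = [a - b for a, b in zip(lund_pb, burncase_pb)]
r_bias = pearson_r(pair_average, pair_difference)
print(round(r_bias, 3))
```

A positive value means the Lund-Browder estimate exceeds the BurnCase 3D estimate by more as the burn gets larger, which corresponds to the fan-shaped pattern in the plot.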
Other Minor Suggestions
I have made minor suggestions to changes in wording to enhance the readability of the article.
Apology
I apologize for my lengthening of the review process. I am relatively new to reviewing on the Wikipedia Journal platform (as opposed to in traditional journals, where I have several decades of experience). A combination of differences in convention, as well as lack of skill with some technical aspects, slowed the process. Personal circumstances this year also contributed. I am sorry, and I have gone into more detail than I normally would with analyses and suggestions in the hopes of offering some amends. The intent is not to “micro-manage” nor to be overly directive, but rather to share my thinking and to make it easier and faster for you to incorporate whatever you find helpful.
References
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1, 307-310.
Cicchetti, D. V., & Sparrow, S. A. (1981). Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 86, 127-137.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions (1st ed.). Wiley.
Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33, 363-374.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30-46. https://doi.org/10.1037/1082-989X.1.1.30
Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2012). DSM-5 Field Trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170, 59–70. https://doi.org/10.1176/appi.ajp.2012.12070999
Appendix – SPSS Syntax for Additional Analyses
The data file I used for the analyses is available at OSF.IO here. I restructured the data so that the three raters were stacked on top of each other (tall/molten format), and added variable labels so that the relationship of the old and new variables is hopefully easier to follow. The .sav file is the labeled SPSS version; there also is a .CSV version that keeps the same variable names but does not preserve the labels.
title 'Burns data for WJM Review'.
subtitle 'Code by Eric A. Youngstrom, PhD'.
* Note: The data are copied from Table 3 of the submission.
* I created new variables, lundtall & burncasetall, that stacked the three raters.
* I also made a very tall, skinny pair of variables, area and method, to play with this in R, but did not use these variables in the analyses for the review.
desc /var lund1 burncase1 lund2 burncase2 lund3 burncase3 lundtall burncasetall area method.

* Is there a difference in methods? .
t-test pairs lundtall with burncasetall (paired).

* Bland - Altman plot.
compute average= mean(lundtall,burncasetall).
compute dlundburn= lundtall-burncasetall.
var labels average 'Average of the two methods'
  /dlundburn 'Difference between Lund-Browder and BurnCase 3D estimates'.
desc /var average dlundburn.
GGRAPH
  /GRAPHDATASET NAME="graphdataset"
    VARIABLES=dlundburn[LEVEL=scale] average[LEVEL=scale] Rater[LEVEL=nominal]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=VIZTEMPLATE(NAME="Scatterplot"[LOCATION=LOCAL]
    MAPPING( "color"="Rater"[DATASET="graphdataset"]
      "x"="average"[DATASET="graphdataset"]
      "y"="dlundburn"[DATASET="graphdataset"]
      "Title"='Bland-Altman plot of agreement between methods'))
    VIZSTYLESHEET="Traditional"[LOCATION=LOCAL]
    LABEL='SCATTERPLOT: average-dlundburn'
    DEFAULTTEMPLATE=NO.
corr /var average dlundburn.
regression /dep lundtall /enter burncasetall
  /residuals outliers (cook, sdresid, mahal) id(case).
* Case #2 is an outlier according to all three criteria.
temp.
select if case ne 2.
t-test pairs lundtall with burncasetall (paired).
* Eliminating the outlier makes the discrepancy more significant, but does not change the size of the mean discrepancy much.
temp.
select if case ne 2.
corr /var average dlundburn.
* Looks like there is an influential outlier.

* ICC (a,1) for Lund-Browder.
* SPSS code assumes cast/wide format, with each rater as a separate column.
RELIABILITY /VARIABLES=lund1 lund2 lund3
  /ICC =MODEL(RANDOM) TYPE(absolute) CIN=95 TESTVAL=0.98 .
* ICC = .972.
* ICC (a,1) for BurnCase 3D.
RELIABILITY /VARIABLES=burncase1 burncase2 burncase3
  /ICC =MODEL(RANDOM) TYPE(absolute) CIN=95 TESTVAL=0.98 .
* ICC = .917, not sig different from published value.
RELIABILITY /VARIABLES=lund1 burncase1 lund2 burncase2 lund3 burncase3
  /ICC =MODEL(RANDOM) TYPE(absolute) CIN=95 TESTVAL=0.98 .

*********************
* Other calculations to feed into McGraw & Wong formulae to manually calculate ICC.
manova area by method (1 2) rater (1 3) /print cellinfo.
* No significant differences between rater or method (nor an interaction).
correlation lund1 lund2 lund3 burncase1 burncase2 burncase3 lundtall burncasetall.
manova lundtall by rater (1 3) case (1 26) /print cellinfo.
manova burncasetall by rater (1 3) case (1 26) /print cellinfo.
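The manual ICC calculation mentioned in the syntax comments can be sketched in Python as follows. This is an illustrative implementation of the McGraw & Wong (1996) single-measure, absolute-agreement coefficient, ICC(A,1), computed from the two-way mean squares; it is assumed (not verified against SPSS output here) to correspond to the "single measures" value that RELIABILITY reports for MODEL(RANDOM) TYPE(absolute). The function name `icc_a1` and the toy ratings are hypothetical and are not the study data.

```python
def icc_a1(ratings):
    """ICC(A,1) for an n-cases x k-raters table given as a list of rows."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 cases and >= 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-cases mean square
    ms_cols = ss_cols / (k - 1)                 # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    # Undefined (division by zero) if every rating in the table is identical.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )


# Toy example (NOT the study data): the second rater is always 1 unit higher,
# so consistency is perfect but absolute agreement is not.
toy = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(icc_a1(toy))  # roughly 0.667
```

The toy example shows why the absolute-agreement form matters for method comparison: a constant offset between raters or methods lowers ICC(A,1) even when the rank ordering of cases is identical.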
Eyoungstrom (discuss • contribs) 15:57, 17 December 2019 (UTC)