Evidence-based assessment/BipolarYouthMeta

From Wikiversity


Completion status: this resource is ~50% complete.
Subject classification: this is a psychology resource.

WikiJournal Preprints
Open access • Publication charge free • Public peer review

WikiJournal User Group is a publishing group of open-access, free-to-publish, Wikipedia-integrated academic journals.


Article information

Author: Eric A. Youngstrom




Multivariate Meta-Analysis of the Discriminative Validity of Caregiver, Youth, and Teacher Rating Scales for Bipolar Disorder in Youths: Mother Still Knows Best about Mania[edit | edit source]

Eric A. Youngstrom, Joshua A. Langfus, Caroline Vincent, Jacquelynne Genzlinger

University of North Carolina at Chapel Hill

Avery Loeb (Chapel Hill High School or NCSSM?), X. Nicholas Fogg (Loyola), Emma G. Choplin (U of Miami), TBD....

Gregory Egerton

University at Buffalo, The State University of New York

Anna Van Meter

Northwell Hospital/Zucker Hillside

First published version: Archives of Scientific Psychology (2015).

Author Note

Eric A. Youngstrom, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill; Jacquelynne E. Genzlinger, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill; Gregory A. Egerton, Department of Psychology, University at Buffalo, The State University of New York; Anna R. Van Meter, xxxx

This work was supported in part by a grant from the Lindquist Foundation. We thank Camille Sowder, Mian-Li Ong, Karen Bourne, and Ericka McKinney for their help with coding and contributions to preliminary analyses. We thank Angela Bardeen, Ph.D., for consultation with the search strategy and tracking of coding. Thanks also to Wolfgang Viechtbauer, Ph.D., for consultation around the metafor software package and technical aspects of multivariate meta-analysis.

Correspondence concerning this article should be addressed to Eric A. Youngstrom, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, CB #3270, Davie Hall, Chapel Hill, NC 27599.



Plain English Abstract[edit | edit source]

The past two decades have seen a rapid increase in the amount of research on bipolar disorder in children and adolescents, including studies that look at the accuracy of symptom checklists as a way of telling if a youth might have bipolar disorder. How accurate are these checklists? Does accuracy change if they are completed by the youth or a teacher instead of the primary caregiver? Are checklists that focus specifically on symptoms of mania more accurate than checklists with more general content—typical of older measures? How much does the performance of checklists change depending on whether the sample only includes youths seeking treatment, versus including a healthy comparison group? We addressed these research questions by systematically reviewing major publication databases (PsycINFO, PubMed, and Google Scholar) and looking at 4094 hits based on our search. We looked for studies that reported enough information to (1) estimate the size of the difference in checklist scores (“effect size”) between cases with versus without research diagnoses of bipolar disorder for (2) youths 18 years of age or younger, (3) including at least 10 cases with bipolar disorder. Because we wanted to compare caregiver, teacher, and youth report on the same measures, we used a newer statistical technique, multivariate meta-analysis, to combine and compare results within as well as across studies. We found 63 effect sizes from 8 checklists used in 27 separate samples, including 11,941 youths, of whom 1,834 had diagnoses of bipolar disorder. Overall, checklists did a good job separating cases with bipolar disorder from other youths, with an effect size of 1.05, meaning that bipolar cases scored more than a standard deviation higher. Caregiver report was the most accurate across all checklists, performing significantly better than youth or teacher report. Scales focusing on manic symptoms also outperformed general symptom checklists. Sample composition also changed the accuracy of the checklists a great deal: Many studies either included healthy children or excluded youths with diagnoses that are difficult to tell apart from bipolar disorder. These studies gave an overly optimistic sense of how well the checklist might do at identifying youths with bipolar disorder in most clinical settings. Three checklists have shown validity in multiple studies and appear accurate enough to be helpful in improving diagnosis in clinical practice.


Abstract[edit | edit source]

Objective: To meta-analyze the diagnostic efficiency of checklists for discriminating pediatric bipolar disorder (PBD) from other conditions. Hypothesized moderators included (a) informant – we predicted caregiver report would produce larger effects than youth or teacher report; (b) scale content – scales that include manic symptoms should be more discriminating; and (c) sample design – samples that include healthy control cases or impose stringent exclusion criteria are likely to produce inflated effect sizes.

Methods: Searches in PsycINFO, PubMed, and Google Scholar generated 4094 hits. Inclusion criteria were (1) sufficient statistics to estimate a standardized effect size, (2) participants aged 18 years or younger, (3) at least 10 cases with PBD, and (4) diagnoses of PBD based on semi-structured diagnostic interview. Multivariate mixed regression models accounted for nesting of multiple effect sizes from different informants or scales within the same sample.

Results: Data included 63 effect sizes from 8 rating scales across 27 separate samples (N=11,941 youths, 1,834 with PBD). The average effect size was g=1.05. Random-effects variance components both within and between studies were significant, ps<.00005. Informant, scale content, and sample design all explained significant unique variance, even after controlling for design and reporting quality.

Discussion: Checklists have clinical utility for assessing PBD. Caregiver reports discriminated PBD significantly better than teacher and youth self-report, although all three showed discriminative validity. Studies using “distilled” designs with healthy control comparison groups, or stringent exclusion criteria, produced significantly larger effect size estimates that could lead to inflated false positive rates if used as described in clinical practice.

Keywords: Bipolar disorder; children and adolescents; sensitivity and specificity; meta-analysis; mania

Multivariate Meta-Analysis of the Discriminative Validity of Caregiver, Youth, and Teacher Rating Scales for Bipolar Disorder in Youths: Mother Still Knows Best about Mania[edit | edit source]

The diagnosis of bipolar disorder in children and adolescents has been one of the most contentious issues in child mental health over the past two decades.[1][2] The questions of whether bipolar disorder could manifest before puberty, whether the same criteria should be used for children as for adults, and the validity and importance of collateral reports about mood and behavior by caregivers and teachers have guided a growing body of research.[3][4][5] Given these longstanding debates and the growing literature on the topic of pediatric bipolar disorder, it is opportune to undertake a quantitative review of the literature on assessment of pediatric bipolar disorder. A meta-analysis also can address larger themes of cross-informant validity and fundamental research design issues that cut across the wider domains of clinical assessment.

Importance of Accurate Identification of Bipolar Disorder[edit | edit source]

A substantial portion of mood disorders fall along the spectrum of bipolar disorders, which includes not only bipolar I, but also bipolar II, cyclothymic disorder, and bipolar not otherwise specified (now “other specified bipolar and related disorders").[6] Both longitudinal and epidemiological studies indicate that at least a third of serious mood disorders follow a bipolar course.[7][8][9] Bipolar disorder also requires different treatment strategies.[10] It is not just the difference between bipolar and unipolar depression that matters for prognosis or treatment prescription. Disruptive behavior disorders—oppositional defiant disorder, conduct disorder, and the new diagnosis of disruptive mood dysregulation disorder—also are challenging to distinguish from bipolar disorder,[11] and they would indicate substantially different approaches to treatment.[12][13] The same is true of attention-deficit/hyperactivity disorder (ADHD), which has been the nexus of extensive debate both because of overlapping symptoms and because of concerns about possible iatrogenic effects of using stimulants when the person has bipolar disorder.[14][15][16]

Despite the risks associated with misdiagnosis, clinical practice often does an exceptionally poor job of recognizing bipolar disorder. A meta-analysis comparing clinical diagnoses of children and adolescents to structured or semi-structured diagnoses found an average kappa of .27.[17] Dismayingly, the kappa was even lower for bipolar disorder, κ = .08. Similarly, comparisons of the accuracy of clinical diagnoses versus research consensus diagnoses—arguably even more valid than semi-structured interviews alone[18]—have found that bipolar diagnoses are among the least accurate, particularly among ethnic minority groups.[19][20] Misdiagnosis reduces the likelihood of appropriate intervention.[21]

Potential Role of Rating Scales for Discriminating Between Bipolar and Other Diagnoses[edit | edit source]

Rating scales and checklists can potentially improve diagnosis: they are inexpensive, depend less on training to be implemented consistently across settings,[22][23] and often have good psychometric properties within the populations and settings where they are used. Some also offer age-based norms, providing an empirical method for comparing behavior and emotions against milestones of normative development.[24] If diagnostic efficiency statistics, such as sensitivity and specificity or diagnostic likelihood ratios, are available, then it is possible to combine information from checklist scores with estimates of baseline probability and other risk factors to arrive at a revised probability of diagnosis.[25] The assessment methods advocated by Evidence Based Medicine (EBM) use Bayesian techniques, packaged in a way that is accessible to clinicians, to integrate the information from test results with other available clinical data.[26]
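To make the Bayesian updating concrete, the sketch below shows the probability-to-odds arithmetic that EBM packages for clinicians. The base rate and likelihood ratio are hypothetical placeholder values for illustration, not estimates from this meta-analysis.

```r
# Minimal sketch of the EBM probability update, assuming a hypothetical
# base rate and diagnostic likelihood ratio (DLR).
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

prior_prob <- 0.08       # assumed base rate of PBD in a clinic (illustrative)
dlr_high   <- 5          # assumed DLR attached to a high checklist score

posterior_odds <- prob_to_odds(prior_prob) * dlr_high
odds_to_prob(posterior_odds)   # ~.30: revised probability after the test result
```

The same update chains across additional risk factors by multiplying their likelihood ratios onto the prior odds, assuming the pieces of information are approximately independent.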

The application of these methods to the specific problem of diagnosing pediatric bipolar disorder has already shown large effect sizes for changing clinical practice by making estimates more accurate, eliminating a bias towards overestimating the probability of a bipolar diagnosis, and improving the consistency of agreement (i.e., reducing the range of opinion between clinicians).[23] The cumulative effect is enhanced agreement about the next clinical action to recommend for a given case.[27]

Various instruments are now available that assess clusters of symptoms related to bipolar disorder. Some of these, such as the Achenbach System of Empirically Based Assessment,[28] do not include a mania scale, but contain subscales measuring other empirically derived symptom dimensions that bipolar disorder influences, such as attention problems, aggressive behavior, and anxious/depressed symptoms.[29] Several other checklists were originally written for adults and then tested for use with adolescents,[30][31] or adapted for parents to report about their child’s mood and behavior.[32][31][33] A few were originally conceived and designed for use with pediatric samples.[34][35] Though the development of new measures is progress for the field, it also complicates the instrument selection process. With few head-to-head comparison studies, it is difficult to compare the performance of these instruments. It is timely to do a meta-analysis to compare measure performance, and to identify conceptually and clinically meaningful moderators of measure performance.

Potential Moderators of Diagnostic Accuracy of Measures Used with Youths[edit | edit source]

Several design issues likely complicate interpretation of the literature on PBD assessment.

Differences in Informant[edit | edit source]

It is axiomatic that assessment of youths should involve multiple informants:[36] parents and teachers observe the youth in different, important developmental contexts, and they have different implicit expectations for typical youth behavior. Youths have their own perspective on their lives, and have privileged access to their internal states; but they also show large developmental changes in verbal ability, metacognition, and their degree of psychological mindedness, all of which change the reliability and validity of their responses to rating scales.[37] For all of these reasons, the correlation between caregiver, teacher, and youth ratings tends to be only moderate (e.g., r ~.2 to .3) across a broad range of psychopathology constructs.[38][37] The moderate degree of agreement has a big impact on the definition of clinical “caseness”—a clinical elevation according to one informant will typically be linked with only modest elevations according to other observers.[39][40] A practical consideration for both researchers and clinicians is deciding how to proceed when informants do not agree. Requiring unanimity among caregivers, teachers, and youths (e.g., using the AND rule) identifies the most impaired cases[41] and sharply reduces false positive rates; but it also identifies only a quarter as many cases as would meet the definition of caseness according to any one of the three informants (e.g., using the Boolean OR rule).[39] This was directly pertinent to the debates during the DSM-5 revision process about whether to require impairment in multiple settings as part of the criteria for establishing a manic episode.[42][43]
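A small simulation illustrates why the choice of combination rule matters so much. The cutoff, sample size, and cross-informant correlation below are assumptions chosen to mirror the values discussed above; real data would yield different exact counts, but the same direction of effect.

```r
# Sketch: "AND" versus "OR" caseness rules across three informants.
# Assumes T-scores (M = 50, SD = 10), a conventional cutoff of 65, and
# cross-informant correlations of about .25, per the text above.
set.seed(1)
R_corr <- matrix(.25, 3, 3); diag(R_corr) <- 1          # correlation matrix
scores <- MASS::mvrnorm(1000, mu = rep(50, 3), Sigma = 100 * R_corr)
elevated <- scores >= 65                                # clinical elevation per informant
sum(rowSums(elevated) == 3)   # AND rule: all three informants agree (few cases)
sum(rowSums(elevated) >= 1)   # OR rule: any informant elevated (many more cases)
```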

Informant issues are especially salient in the area of pediatric bipolar disorder. Clinicians often give more credence to youth report of internalizing problems, because youths have more direct access to their own subjective mood states.[44][45] However, whenever studies have compared youth and caregiver report in the same sample, caregiver report has produced larger effect sizes for discriminating cases with bipolar disorder from other conditions.[46][47][31][48] This might be due to bipolar disorder also creating substantial externalizing problems, which collateral informants often notice earlier and find more bothersome than the youth does. Mania and hypomania include other symptoms—such as pressured speech or flight of ideas—that others find worrisome sooner than the person experiencing them. Parents notice irritable mood at significantly lower levels of mania than the youth, who may notice symptoms of increased energy, hypersexuality, and decreased need for sleep sooner instead.[49] There is evidence with both youths[46] and adults[50] that hypomania and mania compromise peoples’ insight into their behavior and how it is perceived by others, possibly further undermining the credibility[44] and validity of youth report.

Beyond caregiver and youth report, some experts argue that teacher report is important for “corroborating” mania—that manic symptoms are more credible and likely to be more impairing when observed by multiple informants across multiple settings.[51] Conversely, others assert that teacher report should not be included in the decision-making algorithm for diagnosing pediatric bipolar disorder, as it has less validity than caregiver report and possibly youth report, and has failed to demonstrate incremental validity for the purpose of predicting diagnosis.[52] Unfortunately, less work has evaluated teacher report (relative to caregiver and youth report) with regard to pediatric bipolar disorder. Several studies found that the Achenbach Teacher Report Form (TRF) is significantly elevated on multiple scales in the presence of pediatric bipolar disorder compared to ADHD or to healthy controls.[53][48] However, the effect sizes tend to be smaller for teacher report than caregiver report, and the effect sizes shrink further when the comparison group is also treatment seeking instead of healthy controls.[54] Although they have not yet been compared head-to-head in the same sample, teacher report on manic symptom scales also produced smaller effect sizes than caregiver report on the same instruments.[55]

Scale Content[edit | edit source]

Another potential moderator is the item content of the scale. Widely used measures such as the Achenbach System of Empirically Based Assessment[28] do not include a mania scale, and they often do not have items assessing symptoms that might be specific to mania, such as elated mood or grandiosity. The omissions reflect the time period when the item pool was written, predating consideration that bipolar disorder might manifest in childhood.[56] Subsequent research using these scales found that youths with bipolar disorder showed elevations on multiple clinical syndrome scales.[29] These scales are often elevated in the context of other diagnoses besides bipolar disorder, indicating that they are not specific to bipolar. Although there was initial enthusiasm for a “bipolar profile” consisting of elevations on multiple scales, subsequent research found that many cases showing the profile did not meet criteria for bipolar disorder.[57][58] Other analyses found that the Externalizing score captured most of the diagnostic information from the ASEBA with regard to potential bipolar disorder, and there was no incremental value in adding the syndrome scores after looking first at Externalizing.[54][57][59]

In contrast, other scales focus on symptoms of mania, either using the DSM symptoms as the basis of the items (e.g.),[35][60] or even expanding the item pool to include other clinical features that might be associated with hypomania or mania in addition to the canonized DSM symptoms (e.g.).[34][61] These scales are likely to be more diagnostically sensitive to bipolar disorder because they ask directly about the relevant symptoms. They may also be more diagnostically specific to bipolar disorder inasmuch as they include distinctive symptoms.

Differences in Interview Strategy[edit | edit source]

In addition to the question of who completes the rating scale to describe the youth’s emotions and behavior, it also is vital to consider how we arrive at our diagnoses. Mental health lacks the equivalent of an autopsy or pathology report that can conclusively establish a diagnosis. In a field where a “gold standard” diagnosis is impossible, perhaps the best we can do is a “LEAD” standard—the Longitudinal, Expert evaluation of All Data—including history of development, prior treatment and response, family history of pathology, and integration of collateral informant perspectives as well as direct observation of behavior.[18] Many research studies approximate the LEAD standard by combining a semi-structured interview with expert clinician review and sometimes unstructured interviewing to fill in gaps or probe alternate hypotheses. Semi-structured interviews are much less likely to be used in clinical practice because of length, as well as practitioners valuing autonomy.[62] As we include fewer additional sources of information in the diagnostic process, we must place greater weight on the remaining ones.

The least common denominator in clinical diagnoses of children is an interview with the primary caregiver. The caregiver is most likely to initiate the referral for outpatient services, and young children are unlikely to have the patience, focus, or meta-cognition needed to complete many semi-structured interviews. If the interview is redesigned to be developmentally appropriate for young children, it is difficult to connect with adolescent and adult interviews or diagnostic nosologies.[63][64] However, if the diagnostic formulation is based solely on the caregiver interview, then there is no source of potentially disconfirming information. Several factors can undermine the validity of caregiver report, including the caregiver’s own stress or psychopathology,[37][65] seeking disability or educational accommodations (“secondary gains”), or complex interactions around issues with the juvenile justice system or child custody.[36] Even if the youth does not complete a semi-structured interview, direct interaction and observation provide key data about mental status, the presence or absence of stereotypic behavior, and a variety of other factors that can change diagnoses.[66][67]

The issue of interview informants has prompted much discussion within the field of pediatric bipolar disorder research. Although most research groups gravitated towards using some version of the Kiddie Schedule for Affective Disorders and Schizophrenia [68][69][70] as the core semi-structured diagnostic interview,[71] some groups relied primarily or solely on parent interviews when the case was a youth younger than 12 years.[58][72][73][74] Others insisted on also interviewing the youth (Findling, Youngstrom, et al., 2005).[53][75] When groups reported different rates of comorbid pervasive developmental disorders, anxiety disorders, or family histories of antisocial personality and other parental diagnoses (see Kowatch, Youngstrom, Danielyan, & Findling, 2005, for reviews),[76][5][77] it became important to isolate the source of the differences. In addition to differences in the content and organization of mood items in the different interviews used,[78] differences in training, or differences in ascertainment and referral patterns, interviewing only the parent may inflate the association between caregiver-reported checklists and diagnoses—even when the diagnosis is blind to the checklist and based on semi-structured interview.[1] Unlike factors reviewed above, interviewing only the parent likely affects both sensitivity and specificity by exaggerating the degree of separation between the distributions for those with versus without the diagnosis. Leaning heavily on the caregiver for both the criterion and the predictor will exaggerate the apparent effect size.

Study Design Features Especially Potent in Diagnostic Efficiency Studies[edit | edit source]

Experts have developed standardized guidelines for reporting and critically evaluating the design features of studies evaluating diagnostic tests (e.g., STARD; [79]) as well as general reports of empirical studies.[80] Here we will focus on factors that (a) affect the severity of the target condition, thus altering the diagnostic sensitivity; and (b) affect the composition of the comparison group, thereby changing the diagnostic specificity.

Design factors changing the diagnostic sensitivity of a measure[edit | edit source]

The more severe the illness, the easier it is to distinguish from other conditions. Factors affecting the severity of the target condition include the stage of illness, the severity of the presentation, and the use of broad or narrow target definitions.[81] For bipolar disorder, the variability in mood states further complicates the picture because the same illness may manifest with periods of euthymia, hypomania, mania, dysthymia, depression, or mixed mood presentations. Additionally, unlike most diagnoses, bipolar diagnoses persist even after the person recovers from an episode, technically being coded as “in remission.” Therefore, bipolar disorder is heterogeneous—spanning from high functioning people in remission all the way to severely disorganized behavior requiring psychiatric hospitalization. In practice, the severity of illness correlates with participants’ recruitment setting: inpatient samples have the highest average degree of mania, community samples the lowest average,[82][83] and outpatient samples usually fall in between. All else being equal, mania will be easier than hypomania to tell apart from ADHD or depression. Diagnostic sensitivity of tests will vary as a direct function of the severity of the illness.[81]

Similarly, the “broad” versus “narrow” definition of diagnosis plays a prominent role in pediatric bipolar disorder. The narrowest research operational definitions require the presence of elated mood and/or grandiosity, whereas irritable mood would be sufficient using DSM-IV and DSM-5 criteria (sometimes characterized as the “intermediate” phenotype).[84] At the other extreme, some groups may have relaxed the requirement of distinct episodes of change in mood or energy, potentially stretching the “broad phenotype” to include cases that do not share core features of bipolar illness.[85][86][43] Consequently, youth diagnosed using “broad” definitions of PBD are likely to be more difficult to distinguish from other cases than youth diagnosed using “narrow” criteria.

Another wrinkle within bipolar disorder comes from the diagnoses of cyclothymic disorder and bipolar Not Otherwise Specified (NOS – the term used in DSM-IV) or Other Specified Bipolar and Related Disorders (OS-BRD – the DSM-5 parlance; American Psychiatric Association, 2013). Cyclothymic disorder is rarely used in clinical practice in the USA[87] yet is more common than bipolar I in epidemiological samples (Van Meter, Moreira, & Youngstrom, 2011)[88] and is associated with a high degree of impairment in youths (Van Meter, Youngstrom, Demeter, & Findling, 2013; Van Meter, Youngstrom, Youngstrom, Feeny, & Findling, 2011). Similarly, bipolar NOS appears more common than bipolar I in outpatient youth samples, and is associated with a high degree of impairment (Findling, Youngstrom, et al., 2005).[89] However, both cyclothymic disorder and bipolar NOS, by definition, have less severe manic symptoms than bipolar I or II and may be harder to identify using manic symptom checklists. Further complicating the matter, both cyclothymia and bipolar NOS progress to bipolar I or II at high rates during prospective follow-up, raising questions about whether these are prodromes or early stages of illness rather than distinct disorders (Van Meter, Youngstrom, & Findling, 2012).[90][91] This creates tension between internal versus external validity: samples focusing on acutely manic bipolar I presentations will produce higher sensitivity estimates, but the results will generalize less well to applications where the goal is to identify bipolar spectrum disorder or earlier, milder stages of bipolar disorder.

All of these considerations change the diagnostic sensitivity of the test because they change the distribution of scores among the target group. Sensitivity is defined as the percentage of cases having the target diagnosis that also score above a designated threshold on the test of interest. The average score on the scale will be higher if the severity of presentation is more extreme. As the distribution shifts towards higher scores, a larger percentage of people will score above any given threshold, increasing the sensitivity of the test. For our purposes, studies with greater rates of bipolar I, more cases with current manic episodes, or drawing larger percentages from inpatient settings are all likely to have higher average scores on scales intended to detect mania. More subtly, samples including broader definitions of bipolar disorder, or enrolling people in varying states of illness, will tend to have more variation in scores. In addition to altering the sensitivity of the scale, the greater variance within the bipolar group also increases the overlap in score distribution with the comparison group, reducing the scale’s diagnostic accuracy.
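The dependence of sensitivity on severity can be expressed directly. The sketch below assumes a T-scored mania scale and a fixed screening threshold; the group means are illustrative, not estimates from the included studies.

```r
# Sketch: diagnostic sensitivity as a function of the target group's mean
# score, holding the threshold fixed (T-score metric and values assumed).
threshold <- 65
sens <- function(mu, sd = 10) 1 - pnorm(threshold, mean = mu, sd = sd)
sens(mu = 70)   # severe, inpatient-like presentations: ~.69 exceed the cutoff
sens(mu = 60)   # milder, spectrum presentations: ~.31 exceed the cutoff
```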

Design factors changing the diagnostic specificity of a measure[edit | edit source]

The composition of the comparison group directly affects the diagnostic specificity of the measure. Anything that lowers the mean or decreases the variability of scale scores in the comparison group will increase the effect size and decrease the amount of overlap between the bipolar and nonbipolar score distributions. One common design element that would have this effect is the inclusion of healthy controls. Healthy controls, by definition, will have low scores on any symptom measure. Adding them to the sample will shift the mean lower. More subtly, because healthy controls tend to show a floor effect on clinical measures, they bunch together at the lower end of the scale and increase skew.
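A brief simulation shows the size of this artifact. All three distributions below are assumptions chosen only to illustrate the mechanism; none are estimates from the studies reviewed.

```r
# Sketch: adding healthy controls to the comparison group inflates the
# standardized mean difference (all distributions are illustrative).
set.seed(2)
bipolar  <- rnorm(100, mean = 70, sd = 10)
clinical <- rnorm(200, mean = 62, sd = 10)   # treatment-seeking comparison youths
healthy  <- rnorm(200, mean = 50, sd = 7)    # floor effect: lower mean and SD
pooled_d <- function(x, y) (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
pooled_d(bipolar, clinical)                  # ~0.8: clinically generalizable contrast
pooled_d(bipolar, c(clinical, healthy))      # ~1.4: "distilled" mixture looks far better
```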

Another design element that can affect specificity is excluding cases with diagnoses that mimic aspects of bipolar disorder. Unipolar depression and bipolar depression look quite similar, for example. ADHD has multiple symptoms and features that overlap with symptoms of hypomania and mania, including high activity, distractibility, and impulsivity.[92] Oppositional defiant disorder and conduct disorder also entail high degrees of irritable mood, aggressive behavior, and rule-breaking that can look like the mood or impulsive risky behavior of mania.[93] Post-traumatic stress disorder and schizophrenia can produce symptoms that overlap with mania, shading into the more psychotic presentations. Symptom overlap raises the average score on scales where the content includes symptoms that multiple disorders “share.” Endorsing the symptoms due to other disorders raises the average score in the comparison group, increasing the percentage of “false positive” results and directly reducing the specificity of tests.[81][40] Bias may be stronger for scales that mostly contain nonspecific items, such as irritability and distraction.

If the research design exaggerates diagnostic specificity, a weak test could appear better than a stronger one evaluated in a more generalizable sample. Then subsequent applications of the test, under more clinically realistic conditions, would produce systematically higher rates of false positives (the converse of lower diagnostic specificity), creating upwardly biased posterior probabilities. Flawed research designs could reintroduce the same bias that rating scales were intended to fix.

Nested Effect Sizes and Technical Issues in Establishing Relative Superiority[edit | edit source]

Another issue is trying to determine whether some measures perform significantly better than others, after accounting for differences due to informant or research design. It is possible to test the difference between effect sizes from different samples, comparing the discrepancy to what would be expected under the null hypothesis of no difference beyond sampling error.[94] More statistically powerful tests are possible when the effect sizes come from the same sample,[95] and a few reports have already directly tested performance differences between measures in the same sample.[40][54] These analyses control for illness severity, comparison group composition, interviewer training, and a host of other factors that could differ between studies.[81] Unfortunately there is no “master linking sample” that compares all of the contending measures against each other head to head.

However, a mixed effects meta-analysis model can account for the fact that some measures may be confounded with design features in the available literature (e.g., if only one research group has published results with a particular measure, then it will be harder to tease apart characteristics of the measure from the set of design factors used by the group). Multivariate meta-analyses can disentangle the effects of design artifact from the differences between measures, allowing direct comparison of measure performance (Viechtbauer, 2010). Clinicians would want to know which measure to use for high-stakes decisions, and researchers could enhance studies by switching to the more valid measure.

Research Questions/Hypotheses[edit | edit source]

We expected the average effect size to be large, reflecting a big standardized difference in mean scores for those with bipolar versus other conditions. However, we also expected the effect sizes to show significant heterogeneity, and we had specific hypotheses about moderating variables. Based on the few prior within-sample comparisons, we predicted that caregiver report would show larger average effect sizes than youth or teacher report. We also hypothesized that scale content would matter: measures that ask about symptoms more specific to mania should show larger overall effect sizes than measures that focus on externalizing symptoms in general, or that combine components originally designed to assess depression, attention problems, and aggressive behavior.[29][96] A third hypothesis was that studies that only directly interviewed the caregiver would produce larger effect size estimates than those that included direct interview and observation of the youth as part of the criterion diagnosis, due to shared source variance. A fourth hypothesis was that the use of more “distilled” samples that identify more homogeneous and symptomatic cases of bipolar, exclude diagnoses that frequently are difficult to distinguish from bipolar, or that include healthy controls in the comparison group, would yield much larger effect sizes.

We hypothesized that all moderators would remain significant when entered together in the regression models. If “distilled design” was a significant moderator, we would give primacy to the “nondistilled” estimates of effect size as more clinically generalizable, and treat them as the main focus of discussion. Similarly, if interview strategy (including the youth versus relying only on the caregiver) moderated results, then the results based on integrated interviews would take precedence, as they would be less affected by shared source variance. We explored whether there were significant differences between scales after controlling for moderators, but anticipated that the variability between samples, combined with the number of multiple comparisons, would make those results tentative. Sensitivity analyses examined whether results changed substantively after controlling for quality of design (following the scheme used in Kowatch et al., 2005) or quality of reporting (using the recommended QUADAS-2 tool developed to operationalize the STARD Guidelines).[97]  

Method[edit | edit source]

Inclusion and Exclusion Criteria[edit | edit source]

Studies were included if they reported (a) cases with a diagnosis of a bipolar spectrum disorder made via a structured or semi-structured interview, (b) as well as a comparison group, (c) with both groups completing the same checklists assessing manic, hypomanic, or externalizing symptoms, (d) with data reported for participants 18 years or younger. Cases could be drawn from clinical or community samples. Exclusion criteria included having fewer than 10 cases with bipolar diagnoses (per Kraemer, 1992; to provide reasonably stable estimates of diagnostic sensitivity) (examples of excluded cases are cited),[98] not including a rating scale,[99] not publishing results in English (note that we did not find any studies with usable effect sizes published in other languages), only having data for the bipolar group and no comparison group,[100] and only reporting clinical diagnoses based on chart review or unstructured interviews.[87] We limited the search period to 1993 and later so that the DSM-IV criteria would be available and used. Functionally, there were no group comparison studies published prior to then on the topic anyway; only case reports.[101] There were no geographical or cultural restrictions. Studies with adult samples were included only if they reported sufficient information about the subset of cases 18 years and younger. This resulted in exclusion of several studies that were reviewed in Waugh et al. (exclusions here: [102][103][104][105]). We excluded effect sizes and studies where the groups were not defined by diagnostic interviews, but instead by proxy definitions of bipolarity based on rating scales, such as the CBCL proxy,[106][107] elevated scores on a parent-reported mania scale,[108] or “corroborated” mania reported by multiple informants on the same rating scale.[41] These scenarios involve criterion contamination, where the scores on the measure contributed directly to the determination of the criterion “diagnosis” definition.[79][81] Per DSM-IV, bipolar spectrum diagnoses could include bipolar I, bipolar II, cyclothymic disorder, and bipolar Not Otherwise Specified (NOS). All studies included in the analysis reported that they used DSM-IV criteria, but the publications did not report effect sizes separately for the different bipolar diagnoses, so it was not possible to estimate effect sizes for each type of bipolar disorder.

Moderator Definitions[edit | edit source]

We created a coding manual in Microsoft Excel, where the variable names, definitions, value labels, and examples were in rows or comment boxes next to the coding area. In addition to publication year, country of data collection, clinical setting (epidemiological/general community, outpatient, acute tertiary setting), and variables necessary for coding study design and reporting quality (detailed below), we also coded several potential moderator variables.

Informant[edit | edit source]

For each effect size, we coded whether the informant completing the checklist or scale was the caregiver (including foster parents or custodial relatives, although in the vast majority of cases across all samples it was the biological mother), the teacher, or the youth. Analyses used dummy codes with caregiver as the reference category.

Type of Scale[edit | edit source]

For each effect size, we coded whether the scale contained symptoms specific to mania versus comprising items or subscales originally designed to measure other pathology. For example, the “bipolar profile” from the Achenbach System of Empirically Based Assessment (ASEBA) instruments[28] consists of a combination of the Aggressive Behavior, Attention Problems, and Anxious/Depressed scales. There is no “mania” scale on the ASEBA; the manic items it contains are those that overlap with other disorders, and thus factor analyses assigned them to other subscales. The meta-analyses used a dummy code that defined nonspecific scales as the reference category (testing whether there was an advantage in using scales with more mania-specific content).

Interview Strategy[edit | edit source]

For each sample, we coded whether the criterion diagnoses derived from interviews solely with the primary caregiver versus also involving direct interview of the proband youth. One study also included interview with the teacher on an inpatient/residential unit as an additional source.[75] We included this study in the “not relying solely on the caregiver” category.

Distilled Sample Design[edit | edit source]

This dichotomous variable coded whether the original study used a design likely to inflate the observed effect sizes. This was coded “yes” if the sample included healthy controls as part of the comparison group, lowering the mean score for the comparison group and also potentially lowering the standard deviation. It also was coded “yes” if the design excluded diagnoses likely to share symptoms similar to those characteristic of bipolar disorder, such as unipolar depression, ADHD, conduct disorder, or psychosis. Many studies of phenomenology of bipolar disorder relied on healthy controls or groups with ADHD but excluded comorbid mood disorder as comparison conditions.

Search Strategies[edit | edit source]

As recommended in PRISMA,[109] we consulted with a social sciences reference librarian while designing and revising the search strategy. Reference and citation databases searched included PubMed, PsycINFO, SSCI, ERIC, and Google Scholar. We piloted the search protocol, consulted with a reference librarian, and implemented the revised protocol. Either PubMed or PsycINFO indexed all of the published reports that met inclusion criteria. The core search string was: (Pediatric OR juvenile OR child* OR adolescen*) AND (“bipolar disorder” OR mani* OR cyclothymi*) AND [(Sensitivity AND Specificity) OR comparison] (PubMed search here). Review articles and chapters were checked for additional sources. This generated 1342 hits in PsycINFO and 4094 hits in PubMed when the search was updated on September 1, 2014. We pulled hits into a RefWorks database, where we could sort them and annotate them to track disposition. Four relevance judges completed the search training (including review of guidelines and a session of orientation and consultation with a reference librarian about search optimization) and then conducted and reviewed the searches. A content expert (EAY) reviewed all ambiguous cases and instances of disagreement. After reviewing titles and abstracts, and initial elimination of multiple publications using the same dataset, we retrieved 69 articles for detailed review and coding. We examined the reference lists in all studies that met inclusion criteria, along with scrutinizing the bibliographies of recent reviews (Geller & DelBello, 2003).[110][29][111][112] (***Put in 2020 chapters here!) The Mick et al. (2003) paper identified two additional samples meeting inclusion criteria,[113] and a chapter in one edited volume provided sufficient information to add another sample and effect size.[114] A review of papers found five datasets where the article captured by the search did not include sufficient information, but a second paper by the same group included the necessary information.[115][116] We did not locate any primary reports published in languages other than English, although some reports published in English language journals gathered data using versions of the measures translated into Korean/Hangul,[117] French,[118] and Dutch.[74] The final dataset included 25 distinct reports reporting 27 samples (two reports published data on two samples). The initial PubMed search identified 12 of 25 usable sources (48% search sensitivity) and indirectly identified 3 more samples (60% search sensitivity, broadly defined); PsycINFO identified 11 sources directly (44% search sensitivity) and 5 more indirectly (64% search sensitivity, broadly defined). The low search sensitivity is partly an artifact of our decision to include studies that reported sufficient statistics even if they did not report diagnostic sensitivity and specificity in the article, as few research groups have used receiver operating characteristic analyses in this literature until recently. In three cases our group obtained access to the primary data and estimated effect sizes directly from the raw data. Figure 1 shows the flow diagram for the search process.

Coding Procedures[edit | edit source]

Coders were undergraduate psychology majors, doctoral students, and the senior investigator. Training included reading methodology papers (QUADAS; PRISMA; STARD),[79][109][97] sample meta-analyses focused on pediatric bipolar disorder (Kowatch et al., 2005; Van Meter et al., 2011), orientation to diagnostic efficiency statistics,[119] and then coding two articles and comparing scores to those of a content expert, resolving discrepancies and clarifying concepts. We double-coded all studies for effect sizes, moderator variables, and reporting quality. Some articles reported sufficient statistics to estimate the effect size several different ways, contributing to small discrepancies in effect size estimates when the coders used different methods. The content expert reviewed all discrepancies and assigned a final code after perusing the source material, using the method that made the fewest distributional assumptions to estimate the effect size. In three cases, the raw data were available and analyzed to provide more extensive information than provided in the primary publication, which often was a preliminary report.

Quality Ratings[edit | edit source]

We used two systems to code the quality of the study design and reporting. The first was based on a prior meta-analysis of pediatric bipolar disorder (Kowatch et al., 2005). It assigned points for adequate sample size (N>30), interviewing both caregiver and youth (versus one informant only), using a formal consensus process, following DSM criteria, including spectrum diagnoses (e.g., cyclothymic disorder, bipolar NOS), recording comorbid diagnoses, and systematically asking about lifetime episodes. Higher scores indicated more comprehensive assessment of the bipolar phenotype.

Rater Reliability[edit | edit source]

Inter-rater agreement was good (ICC for absolute agreement > .87 for demographics and moderator variables, > .95 for effect size metrics, and > .80 for quality ratings). The most likely source of disagreement was when raters selected different formulae for estimating effect sizes, or when one coder was aware of algebraic methods that could transform reported information into something that could be extracted and coded, and the other rater had coded the parameter as missing.

Statistical Methods[edit | edit source]

We used Hedges’ g, a standardized mean difference that corrects Cohen’s d for a slight upward bias in small samples, as our summary effect size.[120] There are three advantages to using the standardized mean difference for the purposes of meta-analysis: (a) the studies reviewed more often reported Cohen’s d than AUC; (b) meta-analytic techniques are more highly developed for the standardized mean difference than for combining AUCs; and (c) analysis of sensitivity and specificity creates technical challenges avoided by focusing on other metrics.[81][121] We used standard formulae to convert sufficient statistics into g (see Lipsey & Wilson, 2001, for list and formulae). AUC converts directly to Cohen’s d, and then to g. If only sensitivity and specificity were reported, these could be converted to an AUC estimate,[121] and then to a d and finally a g. Sample means, standard deviations, and n for the bipolar and comparison group also were sufficient for direct estimation of g. Study variance estimate calculations followed standard methods (Viechtbauer, 2010). All estimates used inverse variance weighting, and we report 95% CIs for the weighted effect sizes.
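The conversion chain described above can be sketched in a few lines using the standard formulas. The AUC-to-d step assumes equal-variance binormal score distributions, and the sample sizes below are placeholders.

```r
# Sketch of the AUC -> d -> g conversion chain (standard formulas; the
# AUC-to-d step assumes equal-variance binormal distributions).
auc_to_d <- function(auc) sqrt(2) * qnorm(auc)
d_to_g <- function(d, n1, n2) {
  J <- 1 - 3 / (4 * (n1 + n2 - 2) - 1)    # Hedges' small-sample correction
  J * d
}
d_to_g(auc_to_d(.77), n1 = 30, n2 = 120)  # an AUC of ~.77 corresponds to g ~ 1.0
```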

Most studies reported multiple relevant effect sizes. The nesting of several effect sizes in the same sample could occur because of the use of multiple informants (e.g., caregiver, youth, or teacher report), comparison of multiple scales in the same study, or re-analysis of data to examine the influence of sampling design (distilled or not) on effect size estimates. When studies reported multiple scales from the same measure, analyses used a single estimate: Externalizing was the preferred CBCL scale because it has tied or outperformed “bipolar profiles” in multiple samples.[57][59] We used brief versions instead of full-length versions when both were reported because they are more likely to be used in practice. The metafor package (Viechtbauer, 2010b) in R[122] was the platform for all analyses, as it is one of the few meta-analysis programs that currently handles nested effect sizes within the same sample (Viechtbauer, 2010a), allowing us to test key moderator variables.
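The sketch below illustrates this modeling setup in metafor. The data frame and column names (dat, yi, vi, sample, es_id, and the moderator codes) are placeholders standing in for the coded dataset, not the authors' actual script.

```r
# Sketch of the multivariate mixed-effects models described in the text,
# using hypothetical column names for the coded dataset.
library(metafor)

# Baseline model: random effects for samples and for effect sizes nested
# within samples (the within- and between-study variance components).
base <- rma.mv(yi, vi, random = ~ 1 | sample/es_id, data = dat)

# Fully augmented model: dummy-coded moderators entered simultaneously.
full <- rma.mv(yi, vi,
               mods   = ~ informant + mania_content + parent_only + distilled,
               random = ~ 1 | sample/es_id,
               data   = dat)
summary(full)
```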

Analyses used mixed meta-regression models. We had several hypothesis-driven moderators of interest, but also wanted to preserve generalizability, so a mixed approach was best (Viechtbauer, 2010a). We examined each moderator separately, but also created a fully augmented model to test whether each moderator showed a unique incremental effect. Cochran’s Q tested homogeneity of effect sizes, along with graphical methods (e.g., forest plots – see Figure 2). Nonsignificant Q values indicate little heterogeneity beyond sampling error. We used a mixed model extension of Egger’s test for publication bias, although we expected publication bias to be low because diagnostic efficiency requires large effect sizes, making statistical significance a relatively low bar to exceed. We examined standardized residuals from the fitted models, instead of funnel plots, as a way of testing for influential outliers while accounting for the nested structure of the data (Viechtbauer, 2010a).
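One common way to implement a mixed-model extension of Egger's test is to enter each effect size's standard error as a moderator in the same multilevel model; a significant slope would indicate small-study effects. This is a sketch under the same hypothetical column names as above, not necessarily the authors' exact specification.

```r
# Sketch: Egger-type regression test within the multilevel model, entering
# the standard error of each effect size as a moderator.
egger <- rma.mv(yi, vi, mods = ~ sqrt(vi),
                random = ~ 1 | sample/es_id, data = dat)
summary(egger)   # a significant sqrt(vi) slope suggests small-study effects
```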

Similarly, the Meta-Analysis Reporting Standards (MARS;)[80] suggest estimating power when conducting meta-analyses. Power exceeded 99.9% to reject the null hypothesis of g ~ 0, because effect sizes need to be g > .5 to begin to provide diagnostically useful information, and preferably much larger.[123] To account for nesting, we bracketed power estimates by using the number of independent samples (27) as a low end and the number of effect sizes as the high end. Power to detect moderate heterogeneity (e.g., values of .67) was between .64 and .91 based on 27 independent samples and 63 disaggregated effects, respectively.[124] Power was between .86 and .99 for large heterogeneity. We used outlier diagnostics to identify influential cases (Viechtbauer, 2010a), and we conducted robustness sensitivity analyses to examine their effects on parameter estimates.

Results[edit | edit source]

Figure 1 presents a flow diagram showing the search process. We identified 27 distinct samples from 25 reports published between 1995 and 2014, contributing 63 effect sizes. Of the effect sizes, 38 used caregiver report on a total of 10,232 youths between the ages of 5 and 18 years: 1719 with research interview diagnoses of bipolar disorder, 3150 healthy controls or youths from the general community, and 5363 with other disorders besides bipolar spectrum diagnoses. Youth report generated 14 effect sizes (based on 448 youths with bipolar diagnoses, 1028 healthy youths, and 1542 with other diagnoses), and teacher report had 11 effect sizes (based on 377 cases with bipolar diagnoses, 58 healthy youths, and 855 with other diagnoses). All child and teacher effect sizes were nested within subsets of caregiver data with the exception of the Lewinsohn et al. (2003) chapter, which only included youth report. In terms of other candidate moderator variables, 14 of the 27 samples (52%) used distilled sample designs, and 31 of 63 effect sizes (49%) were based on scales with mania symptom content. Effect sizes came from samples in seven countries, most from the USA, but with each informant contributing effect sizes in at least two countries (see Table 1 for a summary of sample-level characteristics). Eight different checklists contributed effect sizes: the ASEBA contributed 25 effect sizes; the General Behavior Inventory (GBI)[61] contributed nine; the Mood Disorders Questionnaire (MDQ)[60] added eight; the Conners (1999) contributed seven; the questionnaire version of the Young Mania Rating Scale (YMRS)[33] added six; the Child Mania Rating Scale (CMRS)[35] provided four; and the Child and Adolescent Symptom Inventory (CASI)[125][126] and Child Bipolar Questionnaire (CBQ)[34] had two each. Table 2 reports the effect sizes, along with the moderator variables and other statistics at the level of the effect size. Because some effect sizes used full-length scales and others used short forms, Table 2 also reports the number of items constituting the scale for each effect size. If effect sizes were based on different sample sizes, due to changes in informant or missing data, then we used the N for each effect size rather than a single estimate or weight for the whole sample. Figure 2 displays the forest plot for the raw effect sizes, sorted by informant and magnitude of effect, and using shading to show whether the sample used a distilled design.

Assessment of study quality[edit | edit source]

We used two a priori measures of study quality. The Kowatch system rated design features important for the investigation of pediatric bipolar disorder. All studies reported sufficient information to code all of the Kowatch criteria (except for one item from Hazell et al., 1998). Scaled as percent of maximum possible score, study quality ranged from 50% to 100%, with an average of 83%. The overall quality of the studies included was good in terms of using semi-structured interviews, implementing DSM criteria, capturing comorbid and confounding diagnoses, and other features that enhance confidence in the robustness of findings.

In terms of the quality of reporting results, scores ranged from 45% to 95%, with an average of 73% on the QUADAS-2. Published reports often omitted QUADAS-2 elements: only 12 samples clearly reported the time interval between the diagnostic interview and gathering the rating scales; only 20 clearly specified whether or not the diagnoses were blind to the rating scales; only 21 made clear whether the rating scale was interpreted without prior knowledge of the diagnosis; and only 9 of 25 studies reported all suggested elements. No study included a flow diagram.

Overall Summary of Effect Sizes[edit | edit source]

We used a multivariate meta-regression (rma.mv in metafor), modeling the nesting of the effect sizes in the 27 samples and treating both the within-study and between-study variance estimates as random effects. The overall estimate of effect size was g = 1.05. There was tremendous heterogeneity, Cochran’s Q(62 df) = 738.25, p < .00005. There were substantial variance components both for the within-sample nesting of effect sizes (level 1 in a hierarchical linear model framework), σ² = .13, and between samples (level 2), σ² = .20. This became the baseline model for exploration of moderators and covariates. Table 3 reports the variance estimates and Cochran’s Q for this and the subsequent augmented multivariate meta-regression models.

Moderator Analyses[edit | edit source]

Informant: Caregiver versus Youth or Teacher Report[edit | edit source]

Our primary moderator of conceptual interest was the informant who completed the scale. Multivariate meta-regression used two dummy codes, comparing youth versus caregiver and teacher versus caregiver. The multivariate framework allowed simultaneous inclusion of all effect sizes and studies in the analysis rather than needing to run analyses separately by informant on different subsets (thus the number of studies and number of cases are consistent across all moderator analyses).

Informant type explained a significant amount of the heterogeneity, Q(2 df) = 53.84, p < .00005. Also, the within-study variance estimate dropped to σ² = .04 when including informant in the model (see Table 3). The parameter estimates indicated that caregiver report produced the largest effect size, g = 1.20, with youth report averaging 0.48 units lower and teacher report 0.65 units lower (all ps < .00005).

Mania scale content[edit | edit source]

Scales with mania-specific content should reduce false positive response rates in other diagnostic groups, increasing the effect size. Multivariate meta-regression using dummy-coded mania scale content as the sole moderator did not produce significant improvement in fit, Q(1 df) = 1.98, p = .159. However, mania scale content made a significant incremental contribution after controlling for any of the other moderators (including in the fully augmented model, below). Because this was a hypothesized moderator, we retained it in subsequent analyses.

Parent-only diagnostic interview[edit | edit source]

Another candidate moderator was whether the diagnostic interview relied solely on the parent, without the interviewer also talking directly to the youth in question. This occurred in six of the 27 samples, all of which only reported effect sizes using caregiver-rated scales. Consistent with expectations about shared source variance inflating the predictor-criterion association, interviewing only the parent produced significantly higher g estimates, b = 0.62, Q(1 df) = 5.80, p = .016.

Distilled versus Clinically Representative Samples[edit | edit source]

The final moderator of interest focused on the impact of sampling design. A dummy code contrasting “distilled” versus clinically generalizable designs accounted for significant variance in the effect sizes. Entered as the sole moderator in a multivariate meta-regression, it yielded Q(1 df) = 6.64, p = .010, with distilled samples averaging g values .50 higher than the nondistilled, more generalizable samples.

Fully augmented model[edit | edit source]

A fully augmented model included all the moderators of interest simultaneously. This model accounted for substantial variance, Q(5 df) = 84.03, p < .00005. It also reduced the random effect variance components at both level 1 (within samples; σ² = .03, versus .13 for the model with no moderators) and level 2 (between samples; σ² = .12, versus .20 in the initial model; see Table 3). There still was significant remaining heterogeneity, Cochran’s Q(57 df) = 279.55, p < .00005. The profile likelihood plots indicated that the model provided accurate estimates, and the intra-class correlation between the estimated and true effects was 0.22.[127]

Table 5 presents the regression weights and confidence intervals for the fully augmented model. The intercept was b = 0.72, p < .00005, meaning that the average effect size for caregiver report from a nondistilled, generalizable sample, using a measure without specific manic symptom content, and interviewing both the youth and the caregiver, would be g ~ .7. All moderators remained significant in the augmented model. Teacher report was associated with significantly lower effect sizes, b = -0.61, p < .00005, as was youth report, b = -0.46, p < .00005. Scales with specific mania item content generated moderately larger effect sizes, b = 0.28, p = .004. Relying only on the caregiver during the diagnostic interview significantly inflated effect sizes, b = 0.42, p = .002. As hypothesized, the use of distilled samples produced much larger effect sizes than generalizable samples, b = 0.54, p = .002. Figure 2 shows the distribution of effect sizes broken down by informant and by whether the sample used a distilled design, graphically illustrating the two most potent moderators.
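
A sketch of the fully augmented model follows, again using the hypothetical column names from the earlier examples (mania_content, parent_only, and distilled are assumed 0/1 indicator columns); profile() draws the profile likelihood plots mentioned above.

<syntaxhighlight lang="r">
# Fully augmented model: informant dummy codes plus the three binary moderators
# (5 coefficients in total, matching the Q test with 5 df reported above).
full <- rma.mv(yi, vi,
               mods   = ~ informant + mania_content + parent_only + distilled,
               random = ~ 1 | sample_id / es_id,
               data   = dat,
               method = "ML")
summary(full)   # regression weights analogous to Table 5
profile(full)   # profile likelihood plots for the two variance components
</syntaxhighlight>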

Testing the Robustness of the Meta-Regression Models[edit | edit source]

Outlier analyses[edit | edit source]

Preliminary analyses identified one study as a highly influential outlier. Re-examination of the article revealed that the authors had reported a standard error as if it were a standard deviation (with the T-score SD being less than 3 instead of close to 10). After correcting this and recalculating the effect size, the study no longer was an outlier. We made this correction before running the models reported above.

Standardized residuals flagged two studies as potential outliers in the multivariate analyses: Marchand et al. (2005) reported an effect size 1.04 g units larger than would be predicted based on the meta-regression model, z = 2.60, p < .01. Papachristou et al. (2013) reported an effect size 1.21 g units smaller than predicted based on the model, z = -3.35, p < .01. Marchand et al. (2005) used the YMRS, which has shown highly variable results across other samples. The article did not explicitly report whether the diagnoses were blind to the scale, and details about the diagnostic procedures also were sparse. Papachristou et al. used the CBCL. Re-running the model with those two studies excluded did not change the substantive pattern of findings; all moderators remained significant with similar coefficient sizes (results available upon request from the authors). Controlling for year of publication, comorbidity, sample size, and a variety of other parameters that meta-analysis guidelines[80][109] recommend testing also did not alter the significance or substantive pattern of results.
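
A sketch of this outlier screening, continuing the model object from the example above: metafor’s rstandard() returns standardized residuals, and |z| > 2.58 corresponds to p < .01.

<syntaxhighlight lang="r">
# Flag potential outliers via standardized residuals from the fitted model.
z_resid <- rstandard(full)$z
dat[abs(z_resid) > 2.58, c("study", "yi")]  # effect sizes with |z| > 2.58 (p < .01)
</syntaxhighlight>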

Publication Bias[edit | edit source]

All of the analyses described above checked for influential outliers and examined the effects of omitting them in sensitivity analyses. We also used the random effects, mixed model extension of Egger’s regression test of publication bias (Viechtbauer, 2010a). There was no evidence of publication bias, p > .81 in the fully augmented model, and p > .50 in the model with all moderators plus ratings of reporting and design quality.
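
One way to implement this mixed-model extension (regtest() in metafor applies to univariate rma objects) is to add the standard error of each effect size as an additional moderator; a nonsignificant coefficient for sqrt(vi) is consistent with the absence of small-study effects. This is a sketch under the same assumed column names as above, not the authors’ exact code.

<syntaxhighlight lang="r">
# Egger-type test for the multivariate model: include the standard error
# sqrt(vi) as a moderator alongside the substantive moderators.
egger <- rma.mv(yi, vi,
                mods   = ~ informant + mania_content + parent_only + distilled + sqrt(vi),
                random = ~ 1 | sample_id / es_id,
                data   = dat,
                method = "ML")
summary(egger)  # inspect the coefficient and p value for sqrt(vi)
</syntaxhighlight>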

Because the MARS reporting guidelines ask for fail-safe N, we estimated it using three methods, separately for each informant to reduce the effects of nesting within samples. Table 4 reports the results. By every method, it appears highly unlikely that publication bias threatens conclusions about the validity of the effect sizes. In summary, all three informants produced statistically significant differences between the score distributions for bipolar versus comparison groups, with teacher report producing a medium effect size in Cohen’s (1988) rubric, youth report yielding a large effect size, and caregiver report an effect size more than 50% larger than what conventionally is considered “large.”
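
Fail-safe N estimates of the sort reported in Table 4 can be computed with metafor’s fsn(); the subsetting by informant shown here assumes the same hypothetical column names as the earlier sketches.

<syntaxhighlight lang="r">
# File drawer analyses for caregiver report (repeat for youth and teacher).
cg <- subset(dat, informant == "Caregiver")

fsn(yi, vi, data = cg, type = "Rosenthal")              # significance-based estimate
fsn(yi, vi, data = cg, type = "Rosenberg")              # weighted variant
fsn(yi, vi, data = cg, type = "Orwin", target = 0.20)   # studies needed to shrink mean g to .20
</syntaxhighlight>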

Other Sensitivity Analyses[edit | edit source]

Neither design quality (operationalized by the Kowatch scoring) nor reporting quality (operationalized as the QUADAS-2 total) moderated the observed effect sizes, either in isolation or after controlling for the other moderators (all p > .20); nor did including them change the significance of any of the other moderators. We also checked whether the number of scale items, the year of publication, the percentage of cases with ADHD in the sample, or pharmaceutical company sponsorship had any association with the observed effects; none did, either by itself or after controlling for the a priori hypothesized moderators.

Exploratory Comparison of Specific Measures[edit | edit source]

We ran an exploratory version of the multivariate regression model to see whether the literature supported any generalizations about the relative performance of measures. We used the effect size based on the ASEBA Externalizing score[28] as the comparator in a set of dummy codes that tested all scales contributing at least two independent effect sizes: the Child and Adolescent Symptom Inventory (CASI),[125] the Child Bipolar Questionnaire (CBQ),[34] the Child Mania Rating Scale (CMRS), the Conners (1999), the General Behavior Inventory (GBI),[61] the Mood Disorder Questionnaire (MDQ),[60] and the rating scale version of the Young Mania Rating Scale (YMRS).[33] We chose the ASEBA Externalizing score as the comparator because (a) the ASEBA is the most studied scale in this review, (b) the ASEBA contributed the most effect sizes across levels of the informant and distilled moderator variables, (c) the Externalizing scale is easier for clinicians to use than other putative “bipolar” profiles, because it is a standard part of the ASEBA scoring algorithm, (d) Externalizing is not subject to concerns about overfitting to a particular sample, whereas putative “bipolar” profiles necessarily were developed post hoc on the initial sample, and (e) in samples reporting effect sizes based on both Externalizing and multi-scale profiles, the effect size based on Externalizing was larger in all but two cases. The Externalizing score essentially is an “incumbent” measure that any challenger would need to defeat to supplant it in research or practice. We did not include the dummy code for whether or not the scale contained specific manic item content, because it would have been highly collinear with the scale dummy codes.
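
A sketch of this exploratory model, assuming a hypothetical measure column whose reference level is the ASEBA Externalizing score; the seven measure dummy codes plus the informant, parent-only, and distilled terms give the 11 df tested below.

<syntaxhighlight lang="r">
# Exploratory scale comparison with ASEBA Externalizing as the "incumbent."
dat$measure <- relevel(factor(dat$measure), ref = "ASEBA Externalizing")

mod_scales <- rma.mv(yi, vi,
                     mods   = ~ informant + parent_only + distilled + measure,
                     random = ~ 1 | sample_id / es_id,
                     data   = dat,
                     method = "ML")
summary(mod_scales)  # each measure coefficient contrasts that scale with ASEBA Externalizing
</syntaxhighlight>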

This model accounted for substantial variance, Q(11 df) = 111.17, p < .00005. It also reduced the random effects variance components both at level 1, sigma2 = .02 versus .13 for the model with no moderators, and at level 2, sigma2 = .14 versus .20 in the initial model. There still was significant remaining heterogeneity, Cochran’s Q(51 df) = 253.85, p < .00005.

The model intercept was b = 0.72, 95% CI [.44, 1.00]; this was the estimated true effect size for the caregiver CBCL Externalizing score from a nondistilled, generalizable sample, including both the youth and caregiver in the diagnostic interview. All four moderators remained significant even after controlling for the scales used. Adjusting for all variables, teacher report was associated with significantly lower effect sizes, b = -0.60, 95% CI [-.77, -.44], p < .00005, as was youth report when compared to caregiver report, b = -0.48, 95% CI [-.63, -.33], p < .00005. Three scales demonstrated significant differences compared to the ASEBA Externalizing score: the MDQ averaged g = 0.39 higher than the Externalizing effect size, p = .003; the CMRS averaged g = 0.33 higher after controlling for all variables, p = .024; and the GBI averaged g = 0.31 higher, p = .002. See Table 6 for the full model.

Clinical Interpretability[edit | edit source]

To provide a more clinically meaningful description of results, we saved the predicted values from the meta-regression and then converted each g into an estimated ROC AUC (using formula #4 from Hasselblad & Hedges, 1995). We also report the predicted sensitivity that could be expected for a threshold chosen to have specificity = .90 (Hasselblad & Hedges, 1995, formula #13), along with the corresponding diagnostic likelihood ratio for scoring above the specificity = .90 threshold. We report the upper and lower bounds based on the confidence intervals of the mixed model regressions. See Table 7. Figure 3 shows the estimated ROC curves for caregiver, teacher, and youth report for clinically generalizable (nondistilled) designs. The figure also includes three reference curves as benchmarks. The diagonal line represents chance performance (AUC = .50). If the base rate of bipolar disorder were 10%, then randomly diagnosing cases by “betting the base rate” would get 10% of the bipolar cases correct (sensitivity = 10%) and 90% of the nonbipolar cases correct (specificity = 90%).
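
Under the binormal, equal-variance ROC model that underlies the Hasselblad and Hedges (1995) conversions, these quantities reduce to simple normal-distribution calculations. The helper function below is our own illustration of those formulas, not code from the article.

<syntaxhighlight lang="r">
# Convert an effect size g to AUC, sensitivity at a chosen specificity, and the
# corresponding positive diagnostic likelihood ratio (binormal model).
g_to_clinical <- function(g, spec = 0.90) {
  auc  <- pnorm(g / sqrt(2))      # AUC = Phi(g / sqrt(2))
  sens <- pnorm(g - qnorm(spec))  # sensitivity at the threshold giving `spec`
  dlr  <- sens / (1 - spec)       # likelihood ratio for a positive result
  c(AUC = auc, sensitivity = sens, DLR_positive = dlr)
}

g_to_clinical(1.05)  # e.g., the overall effect size: AUC ~ .77, sensitivity ~ .41
</syntaxhighlight>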

The shaded gray space marks the performance of clinical diagnosis as usual, with the accuracy of clinical diagnoses of bipolar disorder pegged at kappa ~ .1 based on the converging estimates from both meta-analysis[128] as well as recently published data.[17][19][129] Combining the kappa with the estimated base rate of the target disorder allows estimation of the “diagnosis receiver operating characteristic curve”.[130] With a base rate of 10%, consistent with reports from many outpatient settings, clinical diagnoses would deliver sensitivity of ~ .19, specificity of ~ .91, and an AUC of .55. The sensitivity almost doubles chance performance compared to clinicians “betting the base rate,” but it still is far from adequate.

Because studies do not have an objective gold standard, and even semi-structured diagnostic interviews are not perfect, the reliability of the criterion diagnosis also creates a ceiling that limits test performance.[81][130] If a “perfect” predictor of bipolar disorder existed, it would still appear to be “wrong” whenever it disagreed with the results of the (imperfect) semi-structured interview. Using a kappa of .80, close to the nominal inter-rater reliability reported in the studies that included reliability information, the diagnosis curve for semi-structured interviews would yield sensitivity of .82 and specificity of .98, with an AUC of .90.

Table 1: Summary of sample-level characteristics of studies included in meta-analysis[edit | edit source]

Study Nested Effects Country Mean Age Setting Type Interviewed Kowatch Total QUADAS Total
Wagner (2006)[31] 2 USA 14.5 Outpatient Parent and self 79% 76%
Doerfler (2010)[131] 1 USA 10.7 Outpatient Parent and self 79% 66%
Marchand (2005)[132] 1 USA 10.7 Community Parent 79% 55%
Henry (2008)[116] 1 USA 10.3 Community Parent and self 71% 87%
Dienes (2002)[72] 1 USA 11.9 At risk Parent 86% 66%
Meyer (2009)[58] 1 USA 13.3 At risk Parent 86% 76%
Tillman (2005)[133] 1 USA 10.5 Outpatient Parent and self 79% 79%
Diler (2009)[57] 1 USA 9.4 Academic Parent and self 86% 68%
Rucklidge (2008)[134] 9 New Zealand 15.4 Community Parent and self 86% 45%
Geller (1998) (reports male, female separately) 3 USA 8.5 Outpatient Parent and self 93% 66%
Papolos (2006)[34] 1 USA Outpatient Parent 93% 74%
Serrano (2012)[135] 2 Spain 11.0 Outpatient Parent and self 86% 76%
Lewinsohn (1995) 1 USA 16.6 Community Parent and self 93% 92%
Faraone (2005)[136] 1 USA Outpatient Parent and self 79% 84%
Faraone (2005)[136] 1 USA Outpatient Parent and self 79% 84%
Hazell (1998) 1 Australia Community Parent and self 50% 47%
Biederman (1995)[73] 1 USA 8.7 Outpatient Parent 64% 74%
Miguez (2013)[118] 3 Switzerland 16.0 6 outpatient/2 inpatient clinics Parent and self 79% 95%
Papachristou (2013)[74] 1 Netherlands 16.0 Community Parent 57% 58%
Ong (2018) 3 USA 6.0 Outpatient Parent and self 100% 92%
Lee et al. (2014)[117] 4 South Korea 13.7 Outpatient + community Parent and self 86% 55%
Carlson & Loney (1998)[75] 2 USA 8.0 Outpatient Parent, teacher, and child 86% 68%
Biederman (1996)[113] 1 USA 11.0 Outpatient + community Parent and self 100% 58%
Uchida (2014)[137] 1 USA 10.3 Outpatient + community Parent and self 79% 82%
Cordeiro (2015) 2 Brazil 8.9 Outpatient Parent and self 79% 74%
Pan (2015) 1 Taiwan 15.6 Community Youth 50% 92%
Findling et al. (2005)* 18 USA 11.3 Outpatient Parent and self 100% 92%
Youngstrom et al., 2005/2020 (Academic) 27 USA 11.7 Outpatient Parent and self 100% 95%
Youngstrom et al., 2005/2020 (Community) 27 USA 10.6 Outpatient Parent and self 100% 95%
Shim (2018) 1 South Korea 14.9 Outpatient and community Parent and self 50% 58%

Note. Kowatch Total and QUADAS Total are study quality ratings. *Effect sizes estimated using raw data for these samples.


Table 2. Effect size level characteristics and moderators[edit | edit source]

Study Hedges' g Weight Informant Mania Scale Distilled Design Measure Scale Items N Bipolar N Healthy N Other Nested Effects
Wagner (2006) 1.20 0.05 C Y MDQ Total 15 41 0 63 2
Wagner (2006) 0.24 0.04 Y Y MDQ Total 15 41 0 63 2
Doerfler (2010) 0.63 0.04 C ASEBA Externalizing 35 27 0 249 1
Marchand (2005) 2.46 0.05 C Y YMRS Total 11 64 0 66 1
Henry (2008) 1.94 0.04 C Y Y CMRS 10M 10 50 50 50 1
Dienes (2002) 1.10 0.10 C ASEBA Externalizing 35 16 18 24 1
Meyer (2009) 1.22 0.13 C ASEBA Sum 3T 41 9 42 46 1
Tillman (2005) 1.36 0.02 C Y Conners 2 2 93 94 81 1
Diler (2009) 0.58 0.01 C ASEBA Externalizing 35 157 0 228 1
Rucklidge (2008) 0.85 0.06 C Y ASEBA Externalizing 35 25 28 29 9
Rucklidge (2008) 0.65 0.06 Y Y ASEBA Externalizing 32 25 28 29 9
Rucklidge (2008) 0.45 0.06 T Y ASEBA Externalizing 32 25 28 29 9
Rucklidge (2008) 1.10 0.06 C Y Conners Inattentive 9 25 28 29 9
Rucklidge (2008) 0.88 0.06 C Y Conners Hyper/impulsive 9 25 28 29 9
Rucklidge (2008) 0.75 0.06 Y Y Conners Inattentive 9 25 28 29 9
Rucklidge (2008) 0.69 0.06 Y Y Conners Hyper/impulsive 9 25 28 29 9
Rucklidge (2008) 0.76 0.06 T Y Conners Inattentive 9 25 28 29 9
Rucklidge (2008) 0.61 0.06 T Y Conners Hyper/impulsive 9 25 28 29 9
Geller (1998) 1.80 0.09 C Y ASEBA Externalizing 35 27 0 38 3
Geller (1998) 0.72 0.12 T Y ASEBA Externalizing 32 13 0 30 3
Geller (1998) 1.19 0.19 C Y ASEBA Externalizing 35 12 0 13 3
Papolos (2006) 2.54 0.05 C Y Y CBQ Total 84 76 38 21 1
Serrano (2012) 0.74 0.09 C Y CMRS Total 21 28 0 86 2
Serrano (2012) 0.93 0.09 C Y YMRS Total 11 28 0 86 2
Lewinsohn (1995) 0.70 0.01 Y Y GBI 12 12 115 845 749 1
Faraone (2005) 1.75 0.05 C Y ASEBA Sum 3T 41 22 242 229 1
Faraone (2005) 1.45 0.08 C Y ASEBA Sum 3T 41 14 109 287 1
Hazell (1998) 0.55 0.05 C Y ASEBA Externalizing 35 25 27 99 1
Biederman (1995) 1.58 0.04 C Y ASEBA Externalizing 35 31 77 120 1
Miguez (2013) 0.58 0.14 Y Y MDQ Total 15 8 0 68 3
Miguez (2013) 1.34 0.15 C Y MDQ Total 15 8 0 68 3
Miguez (2013) 0.56 0.14 C Y CBQ Total 65 8 0 68 3
Papachristou (2013) 1.12 0.02 C Y ASEBA Externalizing 35 56 1201 973 1
Ong (2018) 0.02 0.01 T Y CASI Total 9 100 0 367 3
Ong (2018) 0.82 0.01 C Y CASI Total 9 162 0 530 3
Ong (2018) 0.87 0.01 C Y GBI 10M 10 162 0 530 3
Lee et al. (2014) 0.65 0.05 Y Y GBI Total 28 25 125 73 4
Lee et al. (2014) 0.49 0.05 Y Y MDQ Total 13 25 125 73 4
Lee et al. (2014) 0.40 0.05 C Y GBI 10M 10 25 125 73 4
Lee et al. (2014) 0.27 0.05 C Y MDQ Total 13 25 125 73 4
Carlson & Loney 0.64 0.07 C ASEBA Externalizing 35 23 0 46 2
Carlson & Loney 0.86 0.07 T ASEBA Sum 3T 45 23 0 46 2
Biederman (1996) 1.17 0.04 C Y ASEBA Externalizing 35 30 107 99 1
Uchida (2014) 1.98 0.02 C Y ASEBA Sum 3T 41 140 117 83 1
Cordeiro (2015) 1.45 0.03 C Y ASEBA Sum 3T 41 75 77 2
Cordeiro (2015) 2.20 0.06 C Y Y YMRS Total 11 75 77 2
Pan (2015) 0.47 0.05 Y Y MDQ Total 13 23 161 1
Findling et al. (2005)* 1.16 0.01 Caregiver Y ASEBA Externalizing 35 218 41 207 18
Findling et al. (2005)* 1.05 0.01 Caregiver Y ASEBA Sum 3T 41 218 41 207 18
Findling et al. (2005)* 1.50 0.02 Caregiver Y Y YMRS Total 11 139 23 133 18
Findling et al. (2005)* 0.45 0.02 Teacher Y ASEBA Externalizing 32 122 12 100 18
Findling et al. (2005)* 1.52 0.01 Caregiver Y Y GBI 10M 10 259 46 242 18
Findling et al. (2005)* 0.53 0.02 Youth Y ASEBA Sum 3T 39 95 18 116 18
Findling et al. (2005)* 0.33 0.02 Teacher Y ASEBA Sum 3T 122 12 100 18
Findling et al. (2005)* 0.74 0.02 Youth Y Y GBI Total 28 85 16 113 18
Findling et al. (2005)* 0.67 0.02 Youth Y ASEBA Externalizing 32 95 18 116 18
Findling et al. (2005)* 0.75 0.02 Youth Y Y GBI 10M 10 85 15 114 18
Findling et al. (2005)* 1.35 0.01 Caregiver Y Y GBI Total 28 258 47 241 18
Findling et al. (2005)* 1.12 0.01 Caregiver Y ASEBA Ad hoc 19 207 37 205 18
Findling et al. (2005)* 0.67 0.03 Youth Y ASEBA Ad hoc 17 57 13 75 18
Findling et al. (2005)* 0.30 0.04 Teacher Y ASEBA Ad hoc 15 66 5 39 18
Findling et al. (2005)* 0.92 0.01 Caregiver Y ASEBA 2 2 205 37 204 18
Findling et al. (2005)* 0.45 0.03 Youth Y ASEBA 2 2 57 12 76 18
Findling et al. (2005)* 0.29 0.04 Teacher Y ASEBA 2 2 66 5 39 18
Findling et al. (2005)* 0.68 0.02 Youth Y Y GBI 7 Up 7 86 16 115 18
Youngstrom et al., 2005/2020 (Academic) 0.69 0.03 Caregiver ASEBA Externalizing 35 65 0 118 27
Youngstrom et al., 2005/2020 (Academic) 0.72 0.03 Caregiver ASEBA Sum 3T 41 65 0 118 27
Youngstrom et al., 2005/2020 (Academic) 0.60 0.02 Caregiver Y YMRS Total 11 74 0 127 27
Youngstrom et al., 2005/2020 (Academic) 0.45 0.05 Teacher Y YMRS Total 10 36 0 55 27
Youngstrom et al., 2005/2020 (Academic) 0.35 0.05 Teacher ASEBA Externalizing 32 37 0 57 27
Youngstrom et al., 2005/2020 (Academic) 0.36 0.03 Youth Y YMRS Total 11 52 0 99 27
Youngstrom et al., 2005/2020 (Academic) 0.17 0.03 Youth ASEBA Sum 3T 39 46 0 88 27
Youngstrom et al., 2005/2020 (Academic) 0.59 0.05 Teacher ASEBA Sum 3T 41 37 0 57 27
Youngstrom et al., 2005/2020 (Academic) 0.12 0.03 Youth Y GBI Total 28 52 0 97 27
Youngstrom et al., 2005/2020 (Academic) 0.21 0.03 Youth ASEBA Externalizing 32 46 0 88 27
Youngstrom et al., 2005/2020 (Academic) 0.11 0.03 Youth Y GBI 10M 10 52 0 98 27
Youngstrom et al., 2005/2020 (Academic) 0.20 0.03 Youth Y MDQ Total 15 52 0 98 27
Youngstrom et al., 2005/2020 (Academic) 0.52 0.06 Teacher Y CMRS Total 20 28 0 39 27
Youngstrom et al., 2005/2020 (Academic) 1.11 0.03 Caregiver Y CMRS Total 21 55 0 90 27
Youngstrom et al., 2005/2020 (Academic) 0.97 0.02 Caregiver Y GBI Total 28 74 0 130 27
Youngstrom et al., 2005/2020 (Academic) 1.13 0.03 Caregiver Y CMRS 10M 10?? 55 0 91 27
Youngstrom et al., 2005/2020 (Academic) 1.06 0.02 Caregiver Y GBI 10M 10 74 0 130 27
Youngstrom et al., 2005/2020 (Academic) 0.80 0.06 Teacher Y GBI Total 16 29 0 48 27
Youngstrom et al., 2005/2020 (Academic) 0.68 0.05 Teacher Y GBI 10 10 36 0 57 27
Youngstrom et al., 2005/2020 (Academic) 1.14 0.02 Caregiver Y MDQ Total 15 75 0 130 27
Youngstrom et al., 2005/2020 (Academic) 0.71 0.03 Caregiver ASEBA Ad hoc 19 62 0 116 27
Youngstrom et al., 2005/2020 (Academic) 0.23 0.03 Youth ASEBA Ad hoc 17 46 0 88 27
Youngstrom et al., 2005/2020 (Academic) 0.41 0.05 Teacher ASEBA Ad hoc 15 36 0 57 27
Youngstrom et al., 2005/2020 (Academic) 0.46 0.03 Caregiver ASEBA 2 2 63 0 115 27
Youngstrom et al., 2005/2020 (Academic) -0.05 0.03 Youth ASEBA 2 2 46 0 87 27
Youngstrom et al., 2005/2020 (Academic) 0.69 0.05 Teacher ASEBA 2 2 37 0 57 27
Youngstrom et al., 2005/2020 (Academic) 0.07 0.03 Youth Y GBI 7 Up 7 52 0 98 27
Youngstrom et al., 2005/2020 (Community) 0.64 0.02 Caregiver ASEBA Externalizing 35 75 0 520 27
Youngstrom et al., 2005/2020 (Community) 0.74 0.02 Caregiver ASEBA Sum 3T 41 75 0 520 27
Youngstrom et al., 2005/2020 (Community) 0.83 0.02 Caregiver Y YMRS Total 11 72 0 530 27
Youngstrom et al., 2005/2020 (Community) -0.20 0.04 Teacher Y YMRS Total 10 29 0 165 27
Youngstrom et al., 2005/2020 (Community) 0.01 0.04 Teacher ASEBA Externalizing 32 28 0 172 27
Youngstrom et al., 2005/2020 (Community) 0.09 0.03 Youth Y YMRS Total 11 42 0 277 27
Youngstrom et al., 2005/2020 (Community) 0.39 0.03 Youth ASEBA Sum 3T 39 39 0 272 27
Youngstrom et al., 2005/2020 (Community) 0.12 0.04 Teacher ASEBA Sum 3T 41 28 0 172 27
Youngstrom et al., 2005/2020 (Community) 0.45 0.03 Youth Y GBI Total 28 41 0 275 27
Youngstrom et al., 2005/2020 (Community) 0.39 0.03 Youth ASEBA Externalizing 32 39 0 272 27
Youngstrom et al., 2005/2020 (Community) 0.42 0.03 Youth Y GBI 10M 10 41 0 274 27
Youngstrom et al., 2005/2020 (Community) 0.44 0.03 Youth Y MDQ Total 15 42 0 276 27
Youngstrom et al., 2005/2020 (Community) 0.15 0.06 Teacher Y CMRS Total 20 21 0 123 27
Youngstrom et al., 2005/2020 (Community) 0.97 0.02 Caregiver Y CMRS Total 21 50 0 347 27
Youngstrom et al., 2005/2020 (Community) 1.02 0.02 Caregiver Y GBI Total 28 75 0 535 27
Youngstrom et al., 2005/2020 (Community) 1.00 0.02 Caregiver Y CMRS 10M 10 51 0 348 27
Youngstrom et al., 2005/2020 (Community) 1.09 0.02 Caregiver Y GBI 10M 10 75 0 535 27
Youngstrom et al., 2005/2020 (Community) 0.19 0.05 Teacher Y GBI Total 16 22 0 134 27
Youngstrom et al., 2005/2020 (Community) 0.22 0.04 Teacher Y GBI 10M 10 29 0 169 27
Youngstrom et al., 2005/2020 (Community) 1.26 0.02 Caregiver Y MDQ Total 15 76 0 535 27
Youngstrom et al., 2005/2020 (Community) 0.80 0.02 Caregiver ASEBA Ad hoc 19 65 0 497 27
Youngstrom et al., 2005/2020 (Community) 0.53 0.03 Youth ASEBA Ad hoc 17 39 0 272 27
Youngstrom et al., 2005/2020 (Community) 0.05 0.04 Teacher ASEBA Ad hoc 15 28 0 172 27
Youngstrom et al., 2005/2020 (Community) 0.76 0.02 Caregiver ASEBA 2 2 68 0 493 27
Youngstrom et al., 2005/2020 (Community) 0.24 0.03 Youth ASEBA 2 2 38 0 266 27
Youngstrom et al., 2005/2020 (Community) 0.11 0.04 Teacher ASEBA 2 2 28 0 173 27
Youngstrom et al., 2005/2020 (Community) 0.42 0.03 Youth Y GBI 7 Up 7 41 0 275 27
Shim (2018) 2.54 0.03 Caregiver Y Y MDQ Total 15 102 106 0 1

Note. Hedges’ g is an effect size that adjusts Cohen’s d to correct for upward bias in small samples. Informant, Mania Scale, and Distilled Design are moderators; Measure and Scale Items describe the predictor; the N columns describe the participants.


Table 3. Tests of homogeneity and estimates of random effects variances between effect sizes within study (level 1) and between samples (level 2) for multivariate meta-regression models using maximum likelihood estimation[edit | edit source]

Model Level 1 Variance Level 2 Variance Q Residual (df) Q Model (df)
Base Model: No moderators, but model nesting in sample as a random effect .112 .275 1164.59 (119)**** --
Moderator: Informant .042 .209 401.58 (60)**** 53.84 (2)****
Moderator: Mania scale content .117 .223 728.38 (61)**** 1.98 (1)n.s.
Moderator: Only parent interviewed .136 .128 707.65 (61)**** 5.80 (1)**
Moderator: Distilled design .133 .126 625.24 (61)**** 6.64 (1)**
Moderators: All simultaneously .034 .119 279.55 (57)**** 84.03 (5)****
All moderators plus Kowatch and QUADAS2 Quality .032 .106 273.63 (55)**** 92.20 (7)****
Compare measures against ASEBA, controlling for informant, distilled design, and only parent interviewed .022 .143 253.85 (51)**** 111.17 (11)****

Note. The augmented model including all moderators, but not the quality ratings, produced the best fit. Although “mania scale content” was not a significant moderator by itself, it became significant after controlling for other moderators. Because quality ratings did not improve model fit, subsequent analyses are based on the model with all other moderators, but not quality. Results controlling for quality did not change substantively. * p<.05, ** p<.005, *** p<.0005, ****p<.00005, two-tailed.

Table 4. File drawer estimates for disaggregated effect size estimates grouped by informant[edit | edit source]

Effect Size Source Mean g Rosenberg (p = .05) Rosenthal (p = .05) Orwin (g = .20)

Full set (N=120 effects) .79 86,492 104,524 340
Caregiver 1.07 46,210 60,057 278
Youth .44 2017 2,571 39
Teacher .32 440 718 24

Note. The Rosenberg (2005) method estimates the number of unpublished studies with null findings that would be necessary to reduce the average effect size to non-significance, i.e., no longer able to reject the null hypothesis that overall g = 0 at p < .05. The Orwin (1983) method estimates the number of unpublished studies with null results needed to reduce the average effect size to an a priori target magnitude. We selected g = .20 as the target because it is generally considered a “small” effect size and would produce negligible performance in diagnostic applications (corresponding AUC = .56, “poor”).

Table 5. Multivariate meta-regression estimates of the effects of moderators entered together in the model[edit | edit source]

Variable b SE 95% CI
Intercept 0.72*** 0.14 (0.44 to 0.99)
Youth report (vs. caregiver) -0.46*** 0.09 (-0.63 to -0.29)
Teacher report (vs. caregiver) -0.61*** 0.09 (-0.79 to -0.42)
Mania scale content 0.28** 0.10 (0.09 to 0.47)
Only parent interviewed 0.42* 0.21 (0.01 to 0.83)
Distilled design 0.54** 0.17 (0.20 to 0.87)

* p<.05, ** p<.005, *** p<.0005, two-tailed.

Discussion[edit | edit source]

The goal of the present study was to meta-analyze the effect sizes for scales used to distinguish youths with pediatric bipolar disorder from other youths. The topic also provided an opportunity to investigate potential moderators of broad conceptual interest. These include informant effects, such as testing whether there were significant differences between youth or teacher report as compared to the responses of the primary caregiver for the youth.[1][40][139][43] We also investigated design features, such as whether or not the scale included specific symptoms of mania (Geller et al., 1998),[138][49] whether the diagnostic interview included only the parent versus directly interviewing parent and youth, and the effects of sampling design on the observed effect sizes.[139] These design issues also apply more broadly to investigations of psychological assessment[37] and diagnostic efficiency in general,[79][81] not only in the context of pediatric bipolar disorder.

Moderators of Diagnostic Accuracy Effect Sizes[edit | edit source]

Informant Effects[edit | edit source]

Caregiver report produced substantially larger effect sizes than teacher or youth report, whether the model included no other variables or all hypothesized moderators, and across all sensitivity analyses. Youth report averaged g = 0.46 lower than caregiver report after controlling for all other moderators; the gap between caregiver and teacher report was slightly larger, at 0.61. The confidence intervals for youth and teacher report overlapped to a large extent, indicating that they performed similarly. The finding that caregiver report yielded larger effect sizes than youth or teacher report is consistent with the few prior studies that directly tested differences between the accuracy of informants.[47][54] Among all the samples that contained effect sizes based on both caregiver and youth report, the caregiver effect size was larger in every instance except for the sample from South Korea.[117] Similarly, in every sample providing both caregiver and teacher rating effect sizes, the caregiver effect size was larger.

The high validity for caregiver report is reassuring in many ways. Caregivers are more likely to initiate seeking services for the child, rather than the child or adolescent self-referring, and interviewers tend to perceive caregivers as a more credible source of information about the youth’s functioning, especially for younger children.[44] Caregivers also are a primary source of information about developmental history, family mental health history, and other factors crucial to the diagnostic and evaluation process.[65] The greater validity for caregiver report also makes sense: caregivers are likely to have better reading ability and to be more psychologically minded than young children or adolescents. Caregivers also notice symptoms such as irritable mood at lower levels of mania, whereas mania needs to be markedly more severe before youths endorse the same symptom.[49] Many other symptoms of hypomania and mania also are likely to distress other people before they bother the person expressing the symptom.

Youth report also consistently produced statistically significant differences between bipolar and non-bipolar comparison groups, but the effect size would conventionally be considered “moderate,” and it translates into modest performance in terms of diagnostic efficiency. Some general reasons that youth report might show lower validity include factors undermining the reliability of youth report, such as poorer reading ability or lack of motivation to complete scales thoughtfully.[36] The content of items might be less developmentally appropriate, although that is unlikely to be a factor for instruments designed specifically for use with children and adolescents, such as the Achenbach System for Empirically Based Assessment or newer mania scales such as the CBQ and CMRS. In the context of evaluating potential bipolar disorder, youth report may show diminished validity to the extent that loss of insight into one’s behavior is a feature of hypomania and mania,[46] as has been observed with adults with bipolar disorder or psychosis.[50] A practical consideration is that reading level precludes using most of these scales with youths younger than age 11 years, and normative data for instruments such as the ASEBA and Conners do not extend below age 11, either.

Teacher report produced effect sizes substantially smaller than caregiver report, and somewhat lower than youth report. Though teacher report still produced statistically significant differences between the bipolar and comparison group, the effect size was small to medium. Translated into an area under the curve, teacher report produced AUC values of ~.60 for mania scales in nondistilled samples, often considered the “poor” range for clinical utility.[140] Some of the symptoms more specific to mania, such as a decreased need for sleep, are difficult to observe in a classroom setting. Many behaviors readily seen in the classroom are also symptoms that overlap with ADHD, reducing the diagnostic specificity of teacher report.

Overall, the differences in performance across informants are consistent with the patterns that have emerged in the few head-to-head comparisons within the same sample.[141][47][54][48] Agreement among caregiver, youth, and teacher reports about the youth’s behavior problems tends to be modest, and some studies have found that the degree of problems reported by teachers or youths is actually significantly higher in cases with bipolar disorder than would typically be predicted based on caregiver report.[40] However, the general level of problems endorsed by teachers or youths is much lower than by caregivers across almost all of the samples (with the exception of the data from South Korea). It is possible that a combination of selection bias and regression artifacts is contributing to the results: because most of the effect sizes come from clinical outpatient samples, caregiver concerns were the driving factor for the bulk of referrals. Effectively, outpatient samples select participants on the basis of high levels of caregiver-reported problems. Because caregiver and youth or teacher report usually correlate less than r ~ .3, regression to the mean dictates that the average youth or teacher scores will be much lower. The attenuation will shrink the scores for the bipolar group more, inasmuch as the caregivers reported higher concerns for them, thus reducing the effect sizes comparing the bipolar to other groups.

It also is possible that youth behavior changes between school and home settings. Some argue that mania should be pervasive and observable across multiple settings.[51] Certainly severely disorganized behavior would be easy to note, but other symptoms, such as irritability, difficulty concentrating, or energy changes may not be remarkable in a classroom setting. Additionally, there are likely to be circadian fluctuations in mood and energy.[142] To the extent that bipolar disorder is associated with an evening chronotype or delay of sleep phase, the attendant mood changes are likely to be most pronounced outside of school hours.[143] The emotional significance of relationships also may contribute to differences in behavior. If rejection sensitivity and a sense of intimacy are salient issues for youths with bipolar disorder,[144][145] then the home life is more emotionally significant than school, at least until peers become more important. A third process that could lead to more conflict at home is the testing of developmental boundaries and the youth’s push for greater control and autonomy.[146] At present, there are no published studies using direct observation of interactions at home involving youths with bipolar disorder. Actimetry and other objective methods for measuring youth behavior could be informative about the extent to which behavior changes between settings or dyads.[147]

The consistent finding that caregiver report produces larger effect sizes than youth or teacher report indicates that the revisions to the psychiatric nosology were justified in not requiring elevated teacher report of manic symptoms to confirm a diagnosis of pediatric hypomania or mania.[6][55] The validity of parent-reported mood symptoms also is supported by sensitivity to treatment effects in double-blinded clinical trials (e.g., Findling et al., 2007; Findling, McNamara, et al., 2005),[148] as well as brain imaging studies, where parent report produces larger correlations with patterns of activation in the youth’s brain than diagnostic categories or youth ratings do.[149]

Mania Scale Content[edit | edit source]

Results also supported our hypothesis that scales including symptoms more specific to mania would show larger effect sizes. The ASEBA and Conners were written at a time when mania was considered an “adult only” phenomenon, and so many relevant items were not included in the pool. Both include manic symptoms, but they are ones that are not diagnostically specific and thus often attributable to other conditions, as indicated by the factor structures in the measures’ normative samples. More recent versions of some broad checklists, such as the BASC and the CASI, have added a mania scale to at least some informant versions. Thus they may perform more similarly to instruments such as the MDQ, CMRS, and CASI that inquire directly about DSM symptoms, or the GBI and CBQ, which also cover associated features beyond the core DSM symptoms.

The source articles did not report sufficient statistics for us to examine directly whether the inclusion of mania symptoms improved the specificity more than the sensitivity of the scales. However, the greater effect on specificity is plausible because the more sensitive symptoms of bipolar disorder tend to be nonspecific features such as irritable mood and difficulty concentrating. These symptoms are well represented on both general, broad-coverage instruments as well as mania-oriented scales. In contrast, more specific symptoms such as decreased need for sleep and elated mood tend to only be included on scales purpose-built to investigate manic symptoms.

With the exception of the CASI, the commercially distributed broad-coverage instruments either do not have a mania scale containing diagnostically specific symptoms, or there are not yet published data on the diagnostic performance of any mania scale added in the most recent revisions. The CBCL Externalizing score, or the putative bipolar profiles of subscales, shows good sensitivity but poorer specificity. This positions the broad measures to be good at ruling out cases of bipolar disorder, but makes them prone to high false positive rates if used alone to screen for bipolar disorder.[25] Positive results on tests with modest specificity are ambiguous because there is a high false alarm rate (the other side of the specificity coin). Combined with the low base rate of bipolar disorder in most settings, the result is low positive predictive values. Put bluntly, most cases “testing positive” for bipolar disorder on measures that do not focus on diagnostically specific symptoms will not actually have bipolar disorder, unless the setting is an inpatient unit or similar venue where the base rate of bipolar disorder is high.
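
A quick Bayes’ theorem illustration of this point, using assumed values rather than numbers from the meta-analysis:

<syntaxhighlight lang="r">
# Hypothetical screen: sensitivity .80, specificity .70, base rate 10%.
sens <- 0.80; spec <- 0.70; base_rate <- 0.10

ppv <- (sens * base_rate) /
       (sens * base_rate + (1 - spec) * (1 - base_rate))
ppv  # ~ .23, so roughly three out of four "positive" screens are false positives
</syntaxhighlight>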

Interview Strategy: The Importance of Laying Eyes on the Child[edit | edit source]

Relying solely on the primary caregiver during the semi-structured interview also changed the average effect size significantly. Six of the 27 samples relied solely on the caregiver for the diagnostic interview. Consistent with concerns about shared source variance,[1][150] relying on only one person’s perspective inflated the association between the caregiver rating and the diagnosis, increasing the observed effect size by g ~ .6. The six samples only reported effect sizes for caregiver-reported scales, where the influence of shared source variance would be largest. It is interesting to note that these six studies ran the gamut in terms of youth ages, with the average age extending from 8.7 years[73] to 16.0 years,[74] so results were not driven by youth age. Findings reinforce the value of directly interviewing the youth, even when they may be too young to complete a full semi-structured diagnostic interview. In young children, direct observation during the interview may provide an opportunity to observe behaviors that might indicate the presence of a pervasive developmental disorder or other alternate explanations for child behavior.[1] The credibility and sophistication of the youth’s perspective increase steadily with age and verbal ability,[44] making the older youth’s input more valuable. Integrating multiple sources of information is likely to combat factors that would undermine the validity of any single informant, such as demoralization, malingering, or attempts to minimize problems.[18] Inasmuch as diagnoses that synthesize multiple information sources are likely to be more valid, estimates of diagnostic accuracy that are pegged to such diagnoses are likely also to have better validity, even if the size of the coefficient itself appears more humble.

Distilled Sample Enrollment[edit | edit source]

The final moderator variable of interest was the design of the sample. Whereas diagnostic sensitivity and specificity were once thought to be intrinsic properties of a measure, methodologists now recognize that these parameters can change markedly as a result of design features. The present findings show that this is not an abstract concern. Differences in sampling design changed the observed effect sizes by g ~ .56, even after controlling for other moderators. This ~.5 bias is consistent with prior work that compared the effects of distilled versus more generalizable inclusion criteria in the same data set.[139] Comparing the effect sizes observed in distilled versus nondistilled designs for the eight scales examined in that study (reported in Table 5 of Youngstrom et al., 2006)[139] reveals an average upward bias of g = +.43. The PBD literature includes many studies focused on phenomenology and careful descriptive validation of the syndrome. These studies often included healthy control comparison groups, and they also often had stringent exclusion criteria that reduced the amount and types of clinical heterogeneity observed. Although these designs were internally valid for their intended purpose, they are less generalizable to typical clinical practice. The results of the meta-analysis underscore that the reduced generalizability produces systematic bias that exaggerates the diagnostic efficiency of scales. Distilled samples create a rising tide that lifts all boats, inflating the effect sizes of all scales and potentially boosting less valid tests even more than others.[139] This is especially true for scales with nonspecific items, such as externalizing or attention problems, which also would be elevated in groups with other diagnoses but not in healthy controls.[151]

The positive bias is pernicious for two reasons. First, it obscures differences between scales. The boost from a distilled design is larger than the difference between most of the valid measures. Design effects swamp the relative differences between scales, potentially making weak scales look better than a more valid scale tested under more generalizable conditions. Second, the exaggerated effect sizes in a distilled design will bias the clinical interpretation of the scales. Inflated sensitivity and specificity estimates lead to more extreme diagnostic likelihood ratios, and more extreme predictive values. If clinicians rely on a simplistic “positive test result” interpretation, the false positive rate will be higher than it appears. Bad designs, in terms of validity for studying diagnostic efficiency, will contribute to overdiagnosis of bipolar disorder. More rigorous and generalizable designs produce more humble estimates of effect size, but these are essentially “pre-shrunk” to fit their application across a broader range of clinical settings.

The clinical “take home” messages from this moderator analysis reinforce the dicta from Evidence-Based Medicine to consider both the validity of the study design and whether the participants resemble the patients with whom the clinician is working. As receiver operating characteristic analyses become more popular, test consumers should cultivate healthy skepticism: if a test claims to produce excellent results at a challenging task, check carefully to see whether design flaws are inflating the effect size. It also behooves researchers to consider the potential biasing effects of design features before conducting secondary analyses of data originally gathered for a purpose other than studying diagnostic efficiency. The STARD guidelines and QUADAS-2 checklist were intended in part to provide a convenient checklist for this sort of rapid evaluation of reports.

Exploratory Comparison of Different Scales[edit | edit source]

We also ran exploratory analyses comparing the performance of measures, adjusting for the moderator variables and accounting for correlated effects due to nesting within samples. These analyses should be considered tentative, as the available literature has gaps in coverage. Furthermore, the mixed model meta-regression would have less statistical power for direct comparison of measures than would be possible by applying best methods directly to the raw data. However, even with these caveats, the analyses indicated that the GBI, MDQ, and CMRS all performed significantly better than the ASEBA, reflecting that their item sets include the symptoms more specific to bipolar disorder. Reassuringly, the results of the meta-analysis align with the results of the head-to-head comparisons that have been done using raw data in the past. The CASI is likely to also perform better than the ASEBA or other nonspecific measures, based on item content and observed effect size, but the confidence interval around its estimate is wide because only two effect sizes were available for the CASI.

Reporting Quality and Sensitivity Analyses[edit | edit source]

Reporting Quality[edit | edit source]

Assessed against the criteria developed by Kowatch et al. (2005), the design of the studies tended to be good. The samples included large numbers of cases, used semi-structured diagnostic interviews, and routinely assessed common comorbidities. The quality of reporting of results was less strong. The bulk of the publications predated the dissemination of the STARD and other reporting guidelines, so it is not surprising that there were some omissions. Key places for improvement include adding flow diagrams, documenting the length of time between checklists and diagnoses, and clarifying whether all participants were included in the analyses. The bad news is that there was no significant trend for the reporting quality to improve in more recent studies, though hopefully that will change. The good news is that reporting quality was not related to the effect sizes, nor did any of the tests of moderators change when adjusting for quality. This does not mean that reporting quality is unimportant, merely that variations in quality did not add significant bias to the observed effect sizes.

Consideration of Alternative Explanations[edit | edit source]

A standard concern with meta-analysis is whether the published literature diverges from results that were never published—the “file drawer problem.” The literature reviewed here is less prone to publication bias, for two reasons. First, many of the studies included in the review were descriptive, phenomenological, or epidemiological. For these papers, statistical significance or rejection of a null hypothesis was not a major consideration for “publishability,” making it unlikely that non-significant results would be censored from the published literature. Second, measures used for screening and diagnosis require large effect sizes in order to achieve respectable classification accuracy.[152][123] Consequently, statistical significance is an easy bar to surpass, even with relatively small sample sizes. Consistent with these scenarios, Egger’s test found no evidence of publication bias in the multivariate models. We also ran file drawer analyses separately for caregiver, youth, and teacher report. This is conservative, because it split the overall sample into pieces with fewer effect sizes and constituent cases. Teacher report offered the worst-case scenario, as it included the fewest effect sizes, the fewest participants, and the smallest average effect size. Even for teacher report, the file drawer would need to be loaded with 10 null studies before the mean effect size would decrease to a “small” effect size (g ~ .2), and 70 null studies before teacher report would no longer be significant at p < .05. For caregiver report, the number of unpublished studies would need to exceed 19,087 to push the p value above .05. Publication bias does not seem to be a major threat here.

A conceptually more interesting issue is potential circularity between the information source for the scale and the criterion diagnosis. This is not the same thing as “criterion contamination,” where the diagnostician would have access to the scale scores when formulating the diagnostic impression. The older concept of shared method variance is closer to the mark.[153] If the same informant fills out two rating scales about different constructs, the resulting scores will correlate with each other due to shared method variance.[150] For the studies reviewed in the meta-analysis, one of three different informants completed the rating scales, and a subset of the same informants contributed to the diagnostic interview. In the most extreme scenario, the caregiver might be the only person interviewed about the youth’s diagnosis, and she or he also completed the rating scale. On one hand, the interview still involves additional information beyond that gleaned from a scale—there are opportunities for probing, follow-up questions, interpretation of nonverbal cues, and clinical judgment.[62][69] On the other hand, caregiver reported scales might have an unfair advantage if the caregiver’s perspective heavily influences the diagnostic interviews. The apparent advantage for caregiver report in terms of larger effect size could be due in part to the shared source variance contributing to both predictor and criterion diagnosis.

This concern is allayed somewhat by the fact that caregiver report showed large effects across all studies, regardless of whether the other informants participated in the diagnostic interview. Parent report’s validity also is supported by sensitivity to treatment effects in double-blinded clinical trials (e.g., Findling et al., 2007; Findling, McNamara, et al., 2005):[148] the double-blind design controls for expectancy effects, as they would contribute to perceived improvement in the placebo arm.[154] Furthermore, in brain imaging studies parent report produces larger correlations with patterns of activation in the youth’s brain than diagnostic categories or youth ratings do.[149] It also is noteworthy that the source variance artifact would inflate agreement with caregiver report across all diagnoses. However, there are published examples where youth report shows the same validity estimates as caregiver report for anxiety diagnoses (Van Meter et al., 2014) or significantly higher validity for reports of posttraumatic stress disorder.[155] These are secondary analyses of some of the same data used in the meta-analysis here, which found larger effects for caregiver report predicting bipolar diagnoses.[54][47] This indicates that the validity of caregiver report is higher for bipolar disorder than for some other diagnoses, showing a degree of specificity that argues against a general methodological artifact. Converging evidence also suggests that manic symptoms are associated with a loss of insight that undermines self report, and many symptoms are likely to be noted earlier by collateral informants,[49] although the weaker performance of teacher report still needs to be reconciled with the higher validity of caregiver report in this regard. Situational specificity in behavior, and dyadic patterns of interaction, remain key topics for further exploration.

Generalizability of Conclusions[edit | edit source]

The literature on the assessment of pediatric bipolar disorder has grown rapidly. At this point it spans a range of measures, multiple research groups, seven different countries drawn from four continents, and five languages of administration. Samples came mostly from outpatient settings, with some high risk offspring samples and an epidemiological sample also contributing. Within the limited range represented in the meta-analysis, clinical setting appears less important than diagnostic inclusion and exclusion criteria.

The data from South Korea stand as the notable exception to this overall pattern. It is a single sample, but it is intriguing that the Korean study used two of the most valid rating scales (the MDQ and the GBI) and found that youth report exceeded parent report on both. Cultural factors, including high levels of stigma against mental health concerns, tremendous parental emphasis on educational excellence, perfectionism, and differences in familial patterns of communication, all deserve exploration. Rapid economic and cultural change in Korea also may create age cohort effects, in which adolescents may have more Westernized attitudes toward mood and mental health, and greater awareness and acceptance of mental health issues. These hypotheses would best be addressed by a combination of qualitative studies and replications in other cultures with a high degree of Confucian values, such as Taiwan or Hong Kong, as well as in countries with rapidly developing economies but different cultural traditions, such as Chile or India.[156][157] Present results indicate a good degree of generalizability across groups within the United States and other Westernized countries, with a big asterisk qualifying the validity of parent report in Asian cultures.

Limitations[edit | edit source]

No meta-analysis can be stronger than the literature upon which it is based. Because many of the primary articles predated the publication of recent reporting guidelines,[79][80][109] it is not surprising that they did not include many of the suggested elements, such as flow diagrams. The quality of design (aside from distilled sampling) and the quality of reporting did not significantly moderate the results of these meta-analyses. Still, it will be helpful for the field to make a conscious effort to improve the clarity of reporting about key design features. Several other factors potentially influencing the distribution of scores in cases with bipolar disorder were not reported frequently enough to be included in the meta-analysis. These included the rates of bipolar I, bipolar II, cyclothymic disorder, and NOS cases, and the current mood state of cases. As noted in the introduction, cases with a more severe current presentation will on average have higher scores on the scales, making the diagnostic sensitivity higher. Conversely, the published reports also did not include enough detail about the diagnostic composition of the comparison groups to support investigation of effects on diagnostic specificity. These types of questions could be fruitful topics for a “mega-analysis” pooling raw data from multiple samples.

Although all three of our hypothesized moderators produced significant effects, significant heterogeneity in effect sizes remained both within and between samples. We tested standard candidate moderators, such as publication year, and did not identify other significant moderators. However, the heterogeneity suggests that other factors exist that were not coded or reported in the literature. Rates of comorbidity, or the inclusion of different cognate diagnoses such as ADHD, depression, or conduct disorder, are likely contenders for accounting for some of this variance. Differences in rater training also may be a source of variation in the definition of the criterion measure. Semi-structured interviews give greater latitude to clinical judgment, making the differences due to training potentially sizeable.[158][159] Conversely, more structured approaches to interviewing may increase consistency, moving along a continuum toward the point where all that differs is whether the items are read by the participant or by the interviewer, with no alternative phrasing or clinical judgment allowed. Taken to that extreme, the increased association could be a form of pseudo-validity produced by shared source variance rather than greater content validity.[18][62]

Implications for Theory, Policy, and Practice[edit | edit source]

The results of the meta-analysis confirm several important theoretical points. The first is that caregiver report produces larger effects for assessing bipolar spectrum disorder than self report or teacher report. Research studies investigating the correlates of youth mood symptoms, such as genetic or imaging studies, would do well to include caregiver-reported measures of mood symptoms.

At a policy level, these findings support the DSM-5 decision not to require impairment in multiple settings or across multiple informants as a requirement for diagnosing bipolar disorder in youths.[6][42][160] Practice guidelines should (a) emphasize gathering caregiver report when possible to clarify assessment questions around potential bipolar disorder, and (b) not require elevation on report from multiple raters about manic symptoms, recognizing that cross-informant agreement tends to be low in general. Evaluation pathways for mood disorder also should incorporate caregiver-reported measures. The fact that several free and brief measures produced some of the best effect sizes in the meta-analysis improves the feasibility of adoption.

With regard to clinical practice, several measures are now well established as tools for evaluating bipolar symptoms in youths. In the treatment literature, having at least two published reports from independent research groups offers protection from allegiance effects as well as the quirks of an individual study.[161] Using a similar criterion of two independent replications, five caregiver-report measures have established clinically meaningful effect sizes: the GBI, MDQ, CMRS, ASEBA, and CBQ. The CASI and Conners scales have been evaluated in one sample each, and the YMRS performance is too varied and too poor to endorse when so many better alternatives are available. The MDQ and GBI also appear adequate as self-report measures.

How do the checklists compare to the other alternatives available for assessing potential pediatric bipolar disorder? Benchmarking the effect sizes from this meta-analysis against those reported in other reviews shows that caregiver report on the best measures yields effect sizes larger than those generated by neurocognitive tasks comparing cases with bipolar disorder to ADHD or healthy controls.[162][163] Recent studies have applied machine learning algorithms to fMRI data as a way of discriminating bipolar disorder.[164] These studies involve both overfitting, where the algorithm is optimized for the sample at hand, and the “distilled designs” that produced significantly larger effect sizes in this meta-analysis. Even with both of those upward biases, the algorithm produced an AUC of .78. The fMRI methods will produce smaller effects when used under clinically realistic conditions with greater diagnostic heterogeneity, especially given the transdiagnostic nature of the brain regions involved in affect regulation.[165][166] Even if fMRI or neurocognitive algorithms evolve to produce comparable or larger effect sizes in generalizable designs, there still are issues of greater cost and more limited accessibility.[130] For the foreseeable future, checklists appear to be the frontrunner in an important niche of clinical assessment despite their imperfections.
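Such benchmarking is straightforward because d-type effect sizes and the area under the ROC curve (AUC) are interchangeable under the usual binormal, equal-variance assumptions; a small sketch:

<syntaxhighlight lang="r">
# Under equal-variance normal models, AUC = pnorm(d / sqrt(2)).
d_to_auc <- function(d) pnorm(d / sqrt(2))
auc_to_d <- function(auc) sqrt(2) * qnorm(auc)

d_to_auc(1.05)  # a d slightly above 1 corresponds to an AUC near .77
auc_to_d(0.78)  # an AUC of .78 corresponds to a d of about 1.09
</syntaxhighlight>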

Guidelines for Future Research[edit | edit source]

Present findings limn several guidelines and priorities for future research.

Avoid artifacts – especially criterion contamination and distilled designs[edit | edit source]

Consult the STARD guidelines when designing studies of diagnostic measures.[79] Although the STARD guidelines were published in multiple journals and endorsed by multiple editorial boards (e.g.),[167] it is sobering that the average quality of reporting shows no difference between studies published before versus after the STARD guidelines were promulgated.

Similarly, researchers need to consider design issues when deciding to do ROC as a secondary analysis.[81][119] Using biased designs is not an abstract problem; it was pervasive in the literature that we reviewed, and it produced a large bias in the observed results. Comparisons to healthy controls are clinically trivial and make all measures look good.[139] These designs produce exaggerated effect sizes, which translate into excessive pseudo-accuracy of clinical decisions. Excluding competing diagnoses that are clinically common, such as unipolar depression or conduct disorder, reduces the number of false positive cases in the research report, biasing the apparent diagnostic specificity upwards. Using the same threshold in a clinical setting that does not exclude these cognate diagnoses will lead to much higher rates of false positives. Put simply, applying results from distilled designs will contribute to overdiagnosis in most clinical contexts. Distilled sampling was the single most potent moderator we identified in the meta-regressions. Please do not publish results based on a biased design; these data are not helpful. Findings with criterion contamination, where the same measure is used both to define the proxy diagnosis and then to predict it, are equally problematic. They contribute noise to the literature and risk confusing consumers about choice of measure and interpretation. Peer reviewers should treat these as serious, perhaps fatal, flaws when reviewing manuscripts. On the STARD list of 25 design and reporting considerations, these are 800-pound gorillas, and they have been swinging amok through the mood assessment literature. We excluded a half-dozen studies that used proxy definitions with criterion contamination, and 52% of the samples (44% of the effect sizes) included in the meta-analysis used distilled designs, creating a gargantuan artifact in the literature.
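The size of the distillation artifact is easy to demonstrate by simulation. The sketch below assumes normally distributed scores and holds the bipolar-versus-clinical separation at d = 1; the group means are invented for illustration:

<syntaxhighlight lang="r">
set.seed(2015)
n <- 1000
bipolar  <- rnorm(n, mean = 1.0)   # cases with bipolar disorder
clinical <- rnorm(n, mean = 0.0)   # other treatment-seeking youths
healthy  <- rnorm(n, mean = -0.8)  # non-referred comparison youths

# Nonparametric AUC: probability that a random case outscores a control.
auc <- function(cases, controls) mean(outer(cases, controls, ">"))

auc(bipolar, clinical)  # realistic clinical contrast: AUC near .76
auc(bipolar, healthy)   # distilled design: AUC inflated to roughly .90
</syntaxhighlight>

A cut score chosen from the distilled contrast, then applied to the realistic clinical mixture, is exactly what produces the surge in false positives described above.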

Add objective criterion measures besides diagnosis[edit | edit source]

The next generation of studies examining the validity of mood ratings should use objective, heteromethod measures to explore differences in the validity of teacher, caregiver, and youth report. Benchmarking these against actimetry,[168] gene expression, brain imaging of core constructs (e.g.),[149] or ecological validators such as high-risk behavior (e.g.)[169] will help triangulate the relative weight that clinicians and policy makers should assign to each perspective. These data also will inform future nosological revisions, and they will map the connections between levels of functioning that the NIMH Research Domain Criteria (RDoC) initiative aims to organize conceptually.[170]

Explore culture as a moderator[edit | edit source]

Another priority is understanding the cultural differences that contribute to scale performance. Within the USA, the few studies that have investigated differential item functioning or structural invariance find that bipolar mood scales tend to show little bias. This contrasts sharply with the disparities evident in clinical diagnoses and service utilization data. The burden of depression and bipolar disorder is equally serious in the developed and developing world.[171] If these assessment tools are similarly valid across cultures, they offer an inexpensive method for early identification and intervention that can be implemented on a much larger scale than semi-structured interviewing or biological/neurocognitive assays.[130] Part of the discrepancy between assessment and diagnostic findings could be closed by greater use of evidence-based assessment methods, using the best of the rating scales and interpreting them within an EBM/Bayesian approach.
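The EBM/Bayesian interpretation step is itself simple enough to sketch; the base rate and diagnostic likelihood ratio below are invented for illustration:

<syntaxhighlight lang="r">
# Bayes' theorem in odds form: posterior odds = prior odds * DLR, where
# the DLR is the diagnostic likelihood ratio attached to the score range.
posterior_prob <- function(prior, dlr) {
  post_odds <- (prior / (1 - prior)) * dlr
  post_odds / (1 + post_odds)
}

# A hypothetical 6% clinic base rate and a high-range caregiver score
# carrying a DLR of 7 yield a posterior probability of about .31.
posterior_prob(prior = 0.06, dlr = 7)
</syntaxhighlight>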

Beyond measurement itself, there also are likely to be cultural differences in beliefs about the causes of emotional and behavioral problems, as well as in attitudes toward treatment. The Korean data in the meta-analysis are an intriguing anomaly that suggests a potentially powerful yet nuanced role for culture. More research needs to include groups with traditionally Confucian and other non-Western views,[156][172] not just because it is an interesting academic question, but because most of the human population lives in these cultures, and these factors change the dynamic of medical communication between practitioner and patient.[156]

Study collateral informants in adulthood and late life[edit | edit source]

Whereas including collateral informants is routine with children and adolescents, it is the exception with adults. The difference in perception of behaviors by self versus others is a robust phenomenon, exemplified by the “fundamental attribution error” in social psychology. DSM-5’s new emphasis on energy in the mania A criterion also flowed from the recognition that memory may be more accurate for changes in energy than for changes in mood.[160] The larger effect size for caregiver report of manic symptoms deserves exploration in other age groups, especially given the loss of insight tied to hypomania and mania[50][173] and the tremendous interpersonal consequences of manic episodes.[174][175][176]

Present results also emphasize the importance of prioritizing the dissemination of assessment tools and the teaching of evidence-based assessment methods. The scales that performed best in the meta-analysis are in the public domain, but they are not heavily advertised or widely taught in training programs.[177][178][179] The meta-analysis has established that a set of caregiver report measures delivers clinically meaningful effect sizes even when evaluated under externally valid designs. These could produce large improvements in clinical decisions,[180] potentially even more so with under-served minority groups.[27][181]

References[edit | edit source]

  1. 1.0 1.1 1.2 1.3 1.4 1.5 Carlson, G. A.; Klein, D. N. (2014). "How to understand divergent views on bipolar disorder in youth". Annual Review of Clinical Psychology 10: 529–51. doi:10.1146/annurev-clinpsy-032813-153702. ISSN 1548-5943. 
  2. Klein, Rachel G.; Pine, Daniel S.; Klein, Donald F. (1998). "Resolved: Mania is mistaken for ADHD in prepubertal children". Journal of the American Academy of Child & Adolescent Psychiatry 37 (10): 1093–1096. 
  3. Fristad, M. A.; MacPherson, H. A. (2014). "Evidence-based psychosocial treatments for child and adolescent bipolar spectrum disorders". J Clin Child Adolesc Psychol 43 (3): 339–55. doi:10.1080/15374416.2013.822309. ISSN 1537-4416. 
  4. Geller, B.; Luby, J. (1997). "Child and adolescent bipolar disorder: A review of the past 10 years". Journal of the American Academy of Child & Adolescent Psychiatry 36 (9): 1168–1176. 
  5. 5.0 5.1 Youngstrom, E. A.; Birmaher, B.; Findling, R. L. (2008). "Pediatric bipolar disorder: Validity, phenomenology, and recommendations for diagnosis". Bipolar Disorders 10: 194–214. 
  6. 6.0 6.1 6.2 American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author. 
  7. Angst, J.; Azorin, J. M.; Bowden, C. L.; Perugi, G.; Vieta, E.; Gamma, A.; Young, A. H. (2011-08). "Prevalence and characteristics of undiagnosed bipolar disorders in patients with a major depressive episode: the BRIDGE study". Archives of General Psychiatry 68 (8): 791–8. doi:10.1001/archgenpsychiatry.2011.87. ISSN 0003-990X. 
  8. Angst, J.; Gamma, A.; Bowden, C. L.; Azorin, J. M.; Perugi, G.; Vieta, E.; Young, A. H. (2012-02). "Diagnostic criteria for bipolarity based on an international sample of 5,635 patients with DSM-IV major depressive episodes". European Archives of Psychiatry and Clinical Neuroscience 262 (1): 3–11. doi:10.1007/s00406-011-0228-0. ISSN 0940-1334. 
  9. Merikangas, Kathleen R.; Akiskal, Hagop S.; Angst, Jules; Greenberg, Paul E.; Hirschfeld, Robert M. A.; Petukhova, Maria; Kessler, Ronald C. (2007-05-01). "Lifetime and 12-month prevalence of bipolar spectrum disorder in the National Comorbidity Survey Replication". Archives of General Psychiatry 64 (5): 543–552. doi:10.1001/archpsyc.64.5.543. 
  10. Yatham, L. N.; Kennedy, S. H.; O'Donovan, C.; Parikh, S.; MacQueen, G.; McIntyre, R.; Sharma, V.; Silverstone, P. et al. (2005). "Canadian Network for Mood and Anxiety Treatments (CANMAT) guidelines for the management of patients with bipolar disorder: consensus and controversies". Bipolar Disorders 7 Suppl 3: 5–69. doi:10.1111/j.1399-5618.2005.00219.x. ISSN 1398-5647. 
  11. Axelson, D.; Findling, R. L.; Fristad, M. A.; Kowatch, R. A.; Youngstrom, E. A.; McCue Horwitz, S.; Arnold, L. E.; Frazier, T. W. et al. (2012-10). "Examining the proposed disruptive mood dysregulation disorder diagnosis in children in the Longitudinal Assessment of Manic Symptoms study". Journal of Clinical Psychiatry 73 (10): 1342–50. doi:10.4088/JCP.12m07674. ISSN 0160-6689. 
  12. Eyberg, S. M.; Nelson, M. M.; Boggs, S. R. (2008-01). "Evidence-based psychosocial treatments for children and adolescents with disruptive behavior". J Clin Child Adolesc Psychol 37 (1): 215–37. doi:10.1080/15374410701820117. ISSN 1537-4416. 
  13. Fristad, M. A.; MacPherson, H. A. (2014). "Evidence-based psychosocial treatments for child and adolescent bipolar spectrum disorders". J Clin Child Adolesc Psychol 43 (3): 339–55. doi:10.1080/15374416.2013.822309. ISSN 1537-4416. 
  14. Carlson, G. A. (2003). "The bottom line". Journal of Child and Adolescent Psychopharmacology 13 (2): 115–118. 
  15. Scheffer, Russell E.; Kowatch, Robert A.; Carmody, Thomas; Rush, A. John (2005). "Randomized, placebo-controlled trial of mixed amphetamine salts for symptoms of comorbid ADHD in pediatric bipolar disorder after mood stabilization with divalproex sodium". American Journal of Psychiatry 162 (1): 58–64. ISSN 0002-953X. 
  16. Youngstrom, E. A.; Arnold, L. Eugene; Frazier, T. W. (2010). "Bipolar and ADHD comorbidity: Both artifact and outgrowth of shared mechanisms". Clinical Psychology: Science & Practice 17: 350–359. 
  17. 17.0 17.1 Rettew, D. C.; Lynch, A. D.; Achenbach, T. M.; Dumenci, L.; Ivanova, M. Y. (2009). "Meta-analyses of agreement between diagnoses made from clinical evaluations and standardized diagnostic interviews". International Journal of Methods in Psychiatric Research 18 (3): 169–84. doi:10.1002/mpr.289. ISSN 1049-8931. 
  18. 18.0 18.1 18.2 18.3 Spitzer, Robert L. (1983). "Psychiatric diagnosis: Are clinicians still necessary?". Comprehensive Psychiatry 24 (5): 399–411. ISSN 0010-440X. 
  19. 19.0 19.1 Jensen-Doss, A.; Youngstrom, E. A.; Youngstrom, J. K.; Feeny, N. C.; Findling, R. L. (2014-12). "Predictors and moderators of agreement between clinical and research diagnoses for children and adolescents". Journal of Consulting & Clinical Psychology 82 (6): 1151–62. doi:10.1037/a0036657. ISSN 0022-006X. PMC 4278746. //www.ncbi.nlm.nih.gov/pmc/articles/PMC4278746/. 
  20. DelBello, M. P.; Lopez-Larson, Melissa P.; Soutullo, Cesar A.; Strakowski, Stephen M. (2001). "Effects of race on psychiatric diagnosis of hospitalized adolescents: A retrospective chart review". Journal of Child and Adolescent Psychopharmacology 11 (1): 95–103. 
  21. McClellan, Jon; Kowatch, Robert; Findling, Robert L. (2007). "Practice parameter for the assessment and treatment of children and adolescents with bipolar disorder". Journal of the American Academy of Child & Adolescent Psychiatry 46 (1): 107–125. doi:10.1097/01.chi.0000242240.69678.c4. 
  22. Drotar, Dennis, Ruth E. K. Stein, and Ellen C. Perrin. “Methodological Issues in Using the Child Behavior Checklist and Its Related Instruments in Clinical Child Psychology Research. Special Issue: Methodological Issues in Clinical Child Psychology Research.” Journal of Clinical Child Psychology 24, no. 2 (1995): 184–92.
  23. 23.0 23.1 Jenkins, Melissa M., Eric A. Youngstrom, Jason J. Washburn, and Jennifer Kogos Youngstrom. “Evidence-Based Strategies Improve Assessment of Pediatric Bipolar Disorder by Community Practitioners.” Professional Psychology: Research and Practice 42, no. 2 (2011): 121–29. https://doi.org/10.1037/a0022506.
  24. Achenbach, T. M. “What Are Norms and Why Do We Need Valid Ones?” Clinical Psychology: Science & Practice 8, no. 4 (2001): 446–50. https://doi.org/10.1093/clipsy.8.4.446.
  25. 25.0 25.1 Straus, Sharon E.; Glasziou, Paul; Richardson, W. Scott; Haynes, R. Brian. Evidence-Based Medicine: How to Practice and Teach EBM (5th ed.). Edinburgh. ISBN 978-0-7020-6296-4. OCLC 999475258. https://www.worldcat.org/oclc/999475258. 
  26. Youngstrom, Eric A.; Halverson, Tate F.; Youngstrom, J. K.; Lindhiem, O.; Findling, R L (2017). "Evidence-Based Assessment from simple clinical judgments to statistical learning: Evaluating a range of options using pediatric bipolar disorder as a diagnostic challenge". Clinical Psychological Science. doi:10.1177/2167702617741845. 
  27. 27.0 27.1 Jenkins, M. M., E. A. Youngstrom, J. K. Youngstrom, N. C. Feeny, and R. L. Findling. “Generalizability of Evidence-Based Assessment Recommendations for Pediatric Bipolar Disorder.” Psychological Assessment 24, no. 2 (June 2012): 269–81. https://doi.org/10.1037/a0025775.
  28. 28.0 28.1 28.2 28.3 Achenbach, T. M., and Leslie A. Rescorla. Manual for the ASEBA School-Age Forms & Profiles. Burlington, VT: University of Vermont, 2001.
  29. 29.0 29.1 29.2 29.3 Mick, E., J. Biederman, G. Pandina, and S. V. Faraone. “A Preliminary Meta-Analysis of the Child Behavior Checklist in Pediatric Bipolar Disorder.” Biological Psychiatry 53, no. 11 (June 1, 2003): 1021–27. https://doi.org/10.1016/S0006-3223(03)00234-8.
  30. Danielson, Carla Kmett, E. A. Youngstrom, R. L. Findling, and Joseph R. Calabrese. “Discriminative Validity of the General Behavior Inventory Using Youth Report.” Journal of Abnormal Child Psychology 31 (2003): 29–39.
  31. 31.0 31.1 31.2 31.3 Wagner, K. D., R. Hirschfeld, R. L. Findling, G. J. Emslie, B. Gracious, and M. Reed. “Validation of the Mood Disorder Questionnaire for Bipolar Disorders in Adolescents.” Journal of Clinical Psychiatry 67 (2006): 827–30. https://doi.org/10.4088/JCP.v67n0518.
  32. Youngstrom, E. A.; Findling, R. L.; Danielson, Carla Kmett; Calabrese, Joseph R. (2001). "Discriminative validity of parent report of hypomanic and depressive symptoms on the General Behavior Inventory.". Psychological Assessment 13 (2): 267–276. ISSN 1040-3590. 
  33. 33.0 33.1 33.2 Gracious, B. L., E. A. Youngstrom, R. L. Findling, and Joseph R. Calabrese. “Discriminative Validity of a Parent Version of the Young Mania Rating Scale.” Journal of the American Academy of Child & Adolescent Psychiatry 41 (2002): 1350–59.
  34. 34.0 34.1 34.2 34.3 34.4 Papolos, Demitri F., J. Hennen, Melissa S. Cockerham, H. C. Jr Thode, and E. A. Youngstrom. “The Child Bipolar Questionnaire: A Dimensional Approach to Screening for Pediatric Bipolar Disorder.” Journal of Affective Disorders 95 (2006): 149–58.
  35. 35.0 35.1 35.2 Pavuluri, M. N., D. B. Henry, B. Devineni, J. A. Carbray, and B. Birmaher. “Child Mania Rating Scale: Development, Reliability, and Validity.” Journal of the American Academy of Child and Adolescent Psychiatry 45, no. 5 (May 2006): 550–60.
  36. 36.0 36.1 36.2 Jerome M. Sattler. Assessment of Children: Behavioral and Clinical Applications. 4th ed. La Mesa, CA: Author, 2002.
  37. 37.0 37.1 37.2 37.3 37.4 De Los Reyes, Andres; Kazdin, Alan E. (2005). "Informant Discrepancies in the Assessment of Childhood Psychopathology: A Critical Review, Theoretical Framework, and Recommendations for Further Study.". Psychological Bulletin 131 (4): 483–509. doi:10.1037/0033-2909.131.4.483. ISSN 1939-1455. http://doi.apa.org/getdoi.cfm?doi=10.1037/0033-2909.131.4.483. 
  38. Achenbach, T. M., Stephanie H. McConaughy, and C. T. Howell. “Child/Adolescent Behavioral and Emotional Problems: Implication of Cross-Informant Correlations for Situational Specificity.” Psychological Bulletin 101 (1987): 213–32. https://doi.org/10.1037/0033-2909.101.2.213.
  39. 39.0 39.1 Youngstrom, E. A.; Findling, R. L.; Calabrese, J. R. (2003). "Who are the comorbid adolescents? Agreement between psychiatric diagnosis, parent, teacher, and youth report". Journal of Abnormal Child Psychology 31 (Special section on comorbidity): 231–245. 
  40. 40.0 40.1 40.2 40.3 40.4 40.5 Youngstrom, Eric A.; Meyers, Oren I.; Youngstrom, Jennifer Kogos; Calabrese, Joseph R.; Findling, Robert L. (2006). "Diagnostic and measurement issues in the assessment of pediatric bipolar disorder: Implications for understanding mood disorder across the life cycle". Development and Psychopathology 18: 989–1021. doi:10.1017/S0954579406060494. 
  41. 41.0 41.1 Carlson, G. A.; Youngstrom, E. A. (2003). "Clinical implications of pervasive manic symptoms in children". Biological Psychiatry 53: 1050–1058. 
  42. 42.0 42.1 Youngstrom, Eric A. (2009-06). "Definitional Issues in Bipolar Disorder Across the Life Cycle". Clinical Psychology: Science and Practice 16 (2): 140–160. doi:10.1111/j.1468-2850.2009.01154.x. http://doi.wiley.com/10.1111/j.1468-2850.2009.01154.x. 
  43. 43.0 43.1 43.2 43.3 Leibenluft, Ellen (2011-02). "Severe Mood Dysregulation, Irritability, and the Diagnostic Boundaries of Bipolar Disorder in Youths". American Journal of Psychiatry 168 (2): 129–142. doi:10.1176/appi.ajp.2010.10050766. ISSN 0002-953X. PMID 21123313. PMC 3396206. http://psychiatryonline.org/doi/abs/10.1176/appi.ajp.2010.10050766. 
  44. 44.0 44.1 44.2 44.3 Youngstrom, E. A.; Youngstrom, J. K.; Freeman, A. J.; De Los Reyes, A.; Feeny, N. C.; Findling, R. L. (2011-10). "Informants are not all equal: predictors and correlates of clinician judgments about caregiver and youth credibility". Journal of Child and Adolescent Psychopharmacology 21 (5): 407–15. doi:10.1089/cap.2011.0032. ISSN 1044-5463. 
  45. Loeber, Rolf, Stephanie M. Green, and Benjamin B. Lahey. “Mental Health Professionals’ Perception of the Utility of Children, Mothers, and Teachers as Informants on Childhood Psychopathology.” Journal of Clinical Child Psychology 19, no. 2 (1990): 136–43. https://doi.org/10/bmc44s.
  46. 46.0 46.1 46.2 Youngstrom, E. A.; Findling, R. L.; Calabrese, Joseph R. (2004). "Effects of adolescent manic symptoms on agreement between youth, parent, and teacher ratings of behavior problems". Journal of Affective Disorders 82 (Suppl. 1): S5–S16. 
  47. 47.0 47.1 47.2 47.3 Youngstrom, E. A.; Meyers, Oren I.; Demeter, C.; Kogos Youngstrom, J.; Morello, L.; Piiparinen, R.; Feeny, N. C.; Findling, R. L. et al. (2005). "Comparing diagnostic checklists for pediatric bipolar disorder in academic and community mental health settings". Bipolar Disorders 7: 507–517. doi:10.1111/j.1399-5618.2005.00269.x. 
  48. 48.0 48.1 48.2 Hazell, P. L., T. J. Lewin, and V. J. Carr. “Confirmation That Child Behavior Checklist Clinical Scales Discriminate Juvenile Mania from Attention Deficit Hyperactivity Disorder.” Journal of Paediatrics and Child Health 35, no. Apr (1999): 199–203. https://doi.org/10/bbpb9b.
  49. 49.0 49.1 49.2 49.3 49.4 Freeman, A. J., E. A. Youngstrom, M. J. Freeman, J. K. Youngstrom, and R. L. Findling. “Is Caregiver-Adolescent Disagreement Due to Differences in Thresholds for Reporting Manic Symptoms?” Journal of Child and Adolescent Psychopharmacology 21, no. 5 (October 2011): 425–32. https://doi.org/10/cckx6m.
  50. 50.0 50.1 50.2 Dell’Osso, L., S. Pini, G. B. Cassano, C. Mastrocinque, R. A. Seckinger, M. Saettoni, A. Papasogli, S. A. Yale, and X. F. Amador. “Insight into Illness in Patients with Mania, Mixed Mania, Bipolar Depression and Major Depression with Psychotic Features.” Bipolar Disorders 4, no. 5 (October 2002): 315–22. https://doi.org/10/cxtm2n.
  51. 51.0 51.1 Carlson, G. A. (2011-03). "Will the child with mania please stand up?". Br J Psychiatry 198 (3): 171–2. doi:10.1192/bjp.bp.110.084517. ISSN 0007-1250. 
  52. Youngstrom, Eric A.; Jenkins, Melissa M.; Jensen-Doss, A.; Youngstrom, J. K. (2012). "Evidence-based assessment strategies for pediatric bipolar disorder". Israel Journal of Psychiatry & Related Sciences 49: 15–27. 
  53. 53.0 53.1 Geller, Barbara; Warner, Kathy; Williams, Marlene; Zimerman, Betsy (1998-11-01). "Prepubertal and young adolescent bipolarity versus ADHD: assessment and validity using the WASH-U-KSADS, CBCL and TRF". Journal of Affective Disorders 51 (2): 93–100. doi:10.1016/S0165-0327(98)00176-1. ISSN 0165-0327. http://www.sciencedirect.com/science/article/pii/S0165032798001761. 
  54. 54.0 54.1 54.2 54.3 54.4 54.5 Youngstrom, E. A.; Findling, R. L.; Calabrese, Joseph R.; Gracious, B. L.; Demeter, C.; DelPorto Bedoya, D.; Price, Megan (2004). "Comparing the diagnostic accuracy of six potential screening instruments for bipolar disorder in youths aged 5 to 17 years". Journal of the American Academy of Child & Adolescent Psychiatry 43: 847–858. doi:10.1097/01.chi.0000125091.35109.1e. 
  55. 55.0 55.1 Youngstrom, Eric A.; Joseph, Megan F.; Greene, Jamelle (2008). "Comparing the psychometric properties of multiple teacher report instruments as predictors of bipolar disorder in children and adolescents". Journal of Clinical Psychology 64 (4): 382–401. ISSN 1097-4679. 
  56. Achenbach, T. M., and C. Edelbrock. Manual for the Child Behavior Checklist and Revised Child Behavior Profile. Burlington: University of Vermont, Department of Psychiatry, 1983.
  57. 57.0 57.1 57.2 57.3 Diler, R. S., B. Birmaher, D. Axelson, B. Goldstein, M. Gill, M. Strober, D. J. Kolko, et al. “The Child Behavior Checklist (CBCL) and the CBCL-Bipolar Phenotype Are Not Useful in Diagnosing Pediatric Bipolar Disorder.” Journal of Child and Adolescent Psychopharmacology 19, no. 1 (February 2009): 23–30. https://doi.org/10/ffbc5w.
  58. 58.0 58.1 58.2 Meyer, S. E., G. A. Carlson, E. Youngstrom, D. S. Ronsaville, P. E. Martinez, P. W. Gold, R. Hakak, and M. Radke-Yarrow. “Long-Term Outcomes of Youth Who Manifested the CBCL-Pediatric Bipolar Disorder Phenotype during Childhood and/or Adolescence.” Journal of Affective Disorders 113 (July 14, 2009): 227–35. https://doi.org/10/fjxwjn.
  59. 59.0 59.1 Kahana, Shoshana Y., E. A. Youngstrom, Robert L. Findling, and Joseph R. Calabrese. “Employing Parent, Teacher, and Youth Self-Report Checklists in Identifying Pediatric Bipolar Spectrum Disorders: An Examination of Diagnostic Accuracy and Clinical Utility.” Journal of Child and Adolescent Psychopharmacology 13, no. 4 (2003): 471–88. https://doi.org/10/ftknmb.
  60. 60.0 60.1 60.2 Hirschfeld, R. M., J. B. W. Williams, Robert L. Spitzer, Joseph R. Calabrese, Laurie Flynn, Paul E. Jr Keck, Lydia Lewis, et al. “Development and Validation of a Screening Instrument for Bipolar Spectrum Disorder: The Mood Disorder Questionnaire.” American Journal of Psychiatry 157, no. 11 (November 2000): 1873–75. https://doi.org/10/bnrpft.
  61. 61.0 61.1 61.2 Depue, Richard A., J. F. Slater, H. Wolfstetter-Kausch, Daniel N. Klein, E. Goplerud, and D. A. Farr. “A Behavioral Paradigm for Identifying Persons at Risk for Bipolar Depressive Disorder: A Conceptual Framework and Five Validation Studies.” Journal of Abnormal Psychology 90, no. 5 (1981): 381–437. https://doi.org/10/dcr7nc.
  62. 62.0 62.1 62.2 Garb, H. N. Studying the Clinician: Judgment Research and Psychological Assessment. Washington, DC: American Psychological Association, 1998.
  63. Ablow, Jennifer C., Jeffrey R. Measelle, Helena C. Kraemer, Richard Harrington, J. Luby, Nancy Smider, Lisa Dierker, et al. “The MacArthur Three-City Outcome Study: Evaluating Multi-Informant Measures of Young Children’s Symptomatology.” Journal of the American Academy of Child & Adolescent Psychiatry 38, no. 12 (1999): 1580–90. https://doi.org/10/fgvvdk.
  64. Wakschlag, L. S., S. W. Choi, A. S. Carter, H. Hullsiek, J. Burns, K. McCarthy, E. Leibenluft, and M. J. Briggs-Gowan. “Defining the Developmental Parameters of Temper Loss in Early Childhood: Implications for Developmental Psychopathology.” J Child Psychol Psychiatry 53, no. 11 (November 2012): 1099–1108. https://doi.org/10/f4b7xh.
  65. 65.0 65.1 Richters, J. E. “Depressed Mothers as Informants about Their Children: A Critical Review of the Evidence for Distortion.” Psychological Bulletin 112 (1992): 485–99. https://doi.org/10/fs5vj2.
  66. Carlson, G. A.; Youngstrom, E. A. (2011-10). "Two opinions about one child--What's the clinician to do?". Journal of Child and Adolescent Psychopharmacology 21 (5): 385–7. doi:10.1089/cap.2011.2159. ISSN 1044-5463. 
  67. Morrison, James R. Diagnosis Made Easier: Principles and Techniques for Mental Health Clinicians (2nd ed.). New York. ISBN 978-1-4625-1335-2. OCLC 863100520. https://www.worldcat.org/oclc/863100520. 
  68. Geller, Barbara; Zimerman, Betsy; Williams, Marlene; Bolhofner, Kristine; Craney, James L.; DelBello, Melissa P.; Soutullo, Cesar (2001-04). "Reliability of the Washington University in St. Louis Kiddie Schedule for Affective Disorders and Schizophrenia (WASH-U-KSADS) Mania and Rapid Cycling Sections". Journal of the American Academy of Child & Adolescent Psychiatry 40 (4): 450–455. doi:10.1097/00004583-200104000-00014. ISSN 0890-8567. https://doi.org/10.1097/00004583-200104000-00014. 
  69. 69.0 69.1 Kaufman, Joan, B. Birmaher, David Brent, Uma Rao, Cynthia Flynn, Paula Moreci, Douglas Williamson, and Neal Ryan. “Schedule for Affective Disorders and Schizophrenia for School-Age Children-Present and Lifetime Version (K-SADS-PL): Initial Reliability and Validity Data.” Journal of the American Academy of Child & Adolescent Psychiatry 36, no. 7 (1997): 980–88. https://doi.org/10/dbr6hc.
  70. Orvaschel, Helen. “Schizophrenia and Affective Disorders Schedule for Children - Epidemiological Version (KSADS-E).” Nova Southeastern University, Ft. Lauderdale, FL, 1995.
  71. Nottelmann, Editha. “National Institute of Mental Health Research Roundtable on Prepubertal Bipolar Disorder.” Journal of the American Academy of Child & Adolescent Psychiatry 40, no. 8 (2001): 871–78. https://doi.org/10/csvs79.
  72. 72.0 72.1 Dienes, Kimberly A., Kiki D. Chang, C. M. Blasey, Nancy Adleman, and H. Steiner. “Characterization of Children of Bipolar Parents by Parent Report CBCL.” Journal of Psychiatric Research 36 (2002): 337–45. https://doi.org/10/b54zmf.
  73. 73.0 73.1 73.2 Biederman, Joseph; Wozniak, Janet; Kiely, Kathleen; Ablon, Stuart; Faraone, Stephen; Mick, Eric; Mundy, Elizabeth; Kraus, Ilana (1995-04). "CBCL Clinical Scales Discriminate Prepubertal Children with Structured Interview—Derived Diagnosis of Mania from Those with ADHD". Journal of the American Academy of Child & Adolescent Psychiatry 34 (4): 464–471. doi:10.1097/00004583-199504000-00013. https://linkinghub.elsevier.com/retrieve/pii/S0890856709637321. 
  74. 74.0 74.1 74.2 74.3 Papachristou, E., J. Ormel, A. J. Oldehinkel, M. Kyriakopoulos, M. Reinares, A. Reichenberg, and S. Frangou. “Child Behavior Checklist-Mania Scale (CBCL-MS): Development and Evaluation of a Population-Based Screening Scale for Bipolar Disorder.” PLoS One 8, no. 8 (2013): e69459. https://doi.org/10/gfkff7.
  75. 75.0 75.1 75.2 Carlson, Gabrielle A; Loney, Jan; Salisbury, Helen; Volpe, Robert J (1998-11-01). "Young referred boys with DICA-P manic symptoms vs. two comparison groups". Journal of Affective Disorders 51 (2): 113–121. doi:10.1016/S0165-0327(98)00210-9. ISSN 0165-0327. http://www.sciencedirect.com/science/article/pii/S0165032798002109. 
  76. Geller, Barbara; Luby, Joan (1997-09). "Child and Adolescent Bipolar Disorder: A Review of the Past 10 Years". Journal of the American Academy of Child & Adolescent Psychiatry 36 (9): 1168–1176. doi:10.1097/00004583-199709000-00008. https://linkinghub.elsevier.com/retrieve/pii/S0890856709626411. 
  77. Biederman, J., E. Mick, S. V. Faraone, T. Spencer, T. E. Wilens, and J. Wozniak. “Current Concepts in the Validity, Diagnosis and Treatment of Paediatric Bipolar Disorder.” International Journal of Neuropsychopharmacology 6, no. 3 (September 2003): 293–300. https://doi.org/10/cr7jrg.
  78. Galanter, Cathryn A.; Hundt, Stephanie R.; Goyal, Parag; Le, Jenna; Fisher, Prudence W. (2012-06). "Variability Among Research Diagnostic Interview Instruments in the Application of DSM-IV-TR Criteria for Pediatric Bipolar Disorder". Journal of the American Academy of Child & Adolescent Psychiatry 51 (6): 605–621. doi:10.1016/j.jaac.2012.03.010. https://linkinghub.elsevier.com/retrieve/pii/S089085671200233X. 
  79. 79.0 79.1 79.2 79.3 79.4 79.5 79.6 Bossuyt, Patrick M.; Reitsma, Johannes B.; Bruns, David E.; Gatsonis, Constantine A.; Glasziou, Paul P.; Irwig, Les M.; Lijmer, Jeroen G.; Moher, David et al. (2003-01-04). "Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative". BMJ 326 (7379): 41–44. doi:10.1136/bmj.326.7379.41. ISSN 0959-8138. PMID 12511463. PMC 1124931. https://www.bmj.com/content/326/7379/41.1. 
  80. 80.0 80.1 80.2 80.3 APA Publications and Communications Board Working Group on Journal Article Reporting Standards (2008-12). "Reporting standards for research in psychology: Why do we need them? What might they be?". American Psychologist 63 (9): 839–851. doi:10.1037/0003-066X.63.9.839. ISSN 1935-990X. PMID 19086746. PMC 2957094. http://doi.apa.org/getdoi.cfm?doi=10.1037/0003-066X.63.9.839. 
  81. 81.00 81.01 81.02 81.03 81.04 81.05 81.06 81.07 81.08 81.09 Zhou, Xiao-Hua; McClish, Donna K.; Obuchowski, Nancy A. (2011). Statistical Methods in Diagnostic Medicine (2nd ed.). Hoboken, NJ: Wiley. ISBN 978-0-470-90651-4. OCLC 714797065.
  82. Merikangas, K. R.; He, J. P.; Burstein, M.; Swendsen, J.; Avenevoli, S.; Case, B.; Georgiades, K.; Heaton, L. et al. (2011-01). "Service utilization for lifetime mental disorders in U.S. adolescents: Results of the National Comorbidity Survey-Adolescent Supplement (NCS-A)". Journal of the American Academy of Child and Adolescent Psychiatry 50: 32–45. doi:10.1016/j.jaac.2010.10.006. ISSN 0890-8567. 
  83. Lewinsohn, Peter M; Klein, Daniel N; Seeley, John R (2000-08). "Bipolar disorder during adolescence and young adulthood in a community sample". Bipolar Disorders 2 (3, Pt. 2): 281–293. doi:10.1034/j.1399-5618.2000.20309.x. ISSN 1398-5647. http://doi.wiley.com/10.1034/j.1399-5618.2000.20309.x. 
  84. Leibenluft, Ellen; Charney, Dennis S.; Towbin, Kenneth E.; Bhangoo, Robinder K.; Pine, Daniel S. (2003-03). "Defining Clinical Phenotypes of Juvenile Mania". American Journal of Psychiatry 160 (3): 430–437. doi:10.1176/appi.ajp.160.3.430. ISSN 0002-953X. http://psychiatryonline.org/doi/abs/10.1176/appi.ajp.160.3.430. 
  85. Papolos, Demitri F. “Bipolar Disorder and Comorbid Disorders: The Case for a Dimensional Nosology.” In Bipolar Disorder in Childhood and Early Adolescence, edited by B. Geller and M. P. DelBello, 76–106. New York: Guilford, 2003.
  86. Wozniak, Janet; Biederman, Joseph; Kiely, Kathleen; Ablon, J. Stuart; Faraone, Stephen V.; Mundy, Elizabeth; Mennin, Douglas (1995-07). "Mania-Like Symptoms Suggestive of Childhood-Onset Bipolar Disorder in Clinically Referred Children". Journal of the American Academy of Child & Adolescent Psychiatry 34 (7): 867–876. doi:10.1097/00004583-199507000-00010. https://linkinghub.elsevier.com/retrieve/pii/S0890856709635978. 
  87. 87.0 87.1 Youngstrom, Eric; Youngstrom, Jennifer Kogos; Starr, Maryjean (2005-10). "Bipolar Diagnoses in Community Mental Health: Achenbach Child Behavior Checklist Profiles and Patterns of Comorbidity". Biological Psychiatry 58 (7): 569–575. doi:10.1016/j.biopsych.2005.04.004. https://linkinghub.elsevier.com/retrieve/pii/S0006322305004361. 
  88. Merikangas, Kathleen R.; Pato, Michael (2009). "Recent developments in the epidemiology of bipolar disorder in adults and children: Magnitude, correlates, and future directions". Clinical Psychology: Science and Practice 16 (2): 121–133. doi:10.1111/j.1468-2850.2009.01152.x. ISSN 1468-2850. 
  89. Axelson, David; Birmaher, Boris; Strober, Michael; Gill, Mary Kay; Valeri, Sylvia; Chiappetta, Laurel; Ryan, Neal; Leonard, Henrietta et al. (2006-10-01). "Phenomenology of Children and Adolescents With Bipolar Spectrum Disorders". Archives of General Psychiatry 63 (10): 1139. doi:10.1001/archpsyc.63.10.1139. ISSN 0003-990X. http://archpsyc.jamanetwork.com/article.aspx?doi=10.1001/archpsyc.63.10.1139. 
  90. Vieta, Eduard; Reinares, M.; Rosa, A. R. (2011-02). "Staging Bipolar Disorder". Neurotoxicity Research 19 (2): 279–285. doi:10.1007/s12640-010-9197-8. ISSN 1029-8428. http://link.springer.com/10.1007/s12640-010-9197-8. 
  91. Hauser, Marta; Correll, Christoph U (2013-01). "The Significance of At-Risk or Prodromal Symptoms for Bipolar I Disorder in Children and Adolescents". The Canadian Journal of Psychiatry 58 (1): 22–31. doi:10.1177/070674371305800106. ISSN 0706-7437. PMID 23327753. PMC 4010197. http://journals.sagepub.com/doi/10.1177/070674371305800106. 
  92. Biederman, Joseph; Jellinek, Michael S. (1998-10). "Resolved: Mania Is Mistaken for ADHD in Prepubertal Children". Journal of the American Academy of Child & Adolescent Psychiatry 37 (10): 1091–1099. doi:10.1097/00004583-199810000-00020. https://linkinghub.elsevier.com/retrieve/pii/S0890856709633189. 
  93. Bowring, Margaret Ann; Kovacs, Maria (1992-07). "Difficulties in Diagnosing Manic Disorders among Children and Adolescents". Journal of the American Academy of Child & Adolescent Psychiatry 31 (4): 611–614. doi:10.1097/00004583-199207000-00006. ISSN 0890-8567. https://doi.org/10.1097/00004583-199207000-00006. 
  94. Viechtbauer, Wolfgang (2007-05). "Hypothesis tests for population heterogeneity in meta-analysis". British Journal of Mathematical and Statistical Psychology 60 (1): 29–60. doi:10.1348/000711005X64042. http://doi.wiley.com/10.1348/000711005X64042. 
  95. Venkatraman, E. S. (2000-12). "A Permutation Test to Compare Receiver Operating Characteristic Curves". Biometrics 56 (4): 1134–1138. doi:10.1111/j.0006-341X.2000.01134.x. http://doi.wiley.com/10.1111/j.0006-341X.2000.01134.x. 
  96. Althoff, Robert R.; Ayer, Lynsay A.; Rettew, David C.; Hudziak, James J. (2010). "Assessment of dysregulated children using the Child Behavior Checklist: A receiver operating characteristic curve analysis.". Psychological Assessment 22 (3): 609–617. doi:10.1037/a0019699. ISSN 1939-134X. PMID 20822273. PMC 2936710. http://doi.apa.org/getdoi.cfm?doi=10.1037/a0019699. 
  97. 97.0 97.1 Whiting, Penny F. (2011-10-18). "QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies". Annals of Internal Medicine 155 (8): 529. doi:10.7326/0003-4819-155-8-201110180-00009. ISSN 0003-4819. http://annals.org/article.aspx?doi=10.7326/0003-4819-155-8-201110180-00009. 
  98. Reichart, Catrien G.; van der Ende, Jan; Wals, Marjolein; Hillegers, Manon H.J.; Nolen, Willem A.; Ormel, Johan; Verhulst, Frank C. (2005-12). "The use of the GBI as predictor of bipolar disorder in a population of adolescent offspring of parents with a bipolar disorder". Journal of Affective Disorders 89 (1-3): 147–155. doi:10.1016/j.jad.2005.09.007. https://linkinghub.elsevier.com/retrieve/pii/S0165032705002880. 
  99. Henin, Aude; Mick, Eric; Biederman, Joseph; Fried, Ronna; Wozniak, Janet; Faraone, Stephen V.; Harrington, Kara; Davis, Stephanie et al. (2007). "Can bipolar disorder-specific neuropsychological impairments in children be identified?". Journal of Consulting and Clinical Psychology 75 (2): 210–220. doi:10.1037/0022-006X.75.2.210. ISSN 1939-2117. http://doi.apa.org/getdoi.cfm?doi=10.1037/0022-006X.75.2.210. 
  100. Wilens, Timothy E.; Biederman, Joseph; Forkner, Peter; Ditterline, Jeff; Morris, Mathew; Moore, Hadley; Galdo, Maribel; Spencer, Thomas J. et al. (2003-12). "Patterns of Comorbidity and Dysfunction in Clinically Referred Preschool and School-Age Children with Bipolar Disorder". Journal of Child and Adolescent Psychopharmacology 13 (4): 495–505. doi:10.1089/104454603322724887. ISSN 1044-5463. http://www.liebertpub.com/doi/10.1089/104454603322724887. 
  101. Anthony, James; Scott, Peter (1960). "Manic–Depressive Psychosis in Childhood". Journal of Child Psychology and Psychiatry 1 (1): 53–72. doi:10.1111/j.1469-7610.1960.tb01979.x. ISSN 1469-7610. https://acamh.onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-7610.1960.tb01979.x. 
  102. Meyer, T. D., P. Hammelstein, L. G. Nilsson, P. Skeppar, R. Adolfsson, and J. Angst. “The Hypomania Checklist (HCL-32): Its Factorial Structure and Association to Indices of Impairment in German and Swedish Nonclinical Samples.” Comprehensive Psychiatry 48 (February 2007): 79–87. https://doi.org/10/dkckd5.
  103. de Sousa Gurgel, Wagner; Rebouças, Diego Barreto; Negreiros de Matos, Karla Julianne; Carneiro, Alexandre Henrique Silva; Gomes de Matos e Souza, Fábio (2012-04). "Brazilian Portuguese validation of Mood Disorder Questionnaire". Comprehensive Psychiatry 53 (3): 308–312. doi:10.1016/j.comppsych.2011.04.059. https://linkinghub.elsevier.com/retrieve/pii/S0010440X11001052. 
  104. Miller, Christopher J.; Johnson, Sheri L.; Kwapil, Thomas R.; Carver, Charles S. (2011-02). "Three studies on self-report scales to detect bipolar disorder". Journal of Affective Disorders 128 (3): 199–210. doi:10.1016/j.jad.2010.07.012. PMID 20696479. PMC 2992802. https://linkinghub.elsevier.com/retrieve/pii/S0165032710004933. 
  105. Zaratiegui, Rodolfo M.; Vázquez, Gustavo H.; Lorenzo, Laura S.; Marinelli, Marcia; Aguayo, Silvia; Strejilevich, Sergio A.; Padilla, Eduardo; Goldchluk, Aníbal et al. (2011-08-01). "Sensitivity and specificity of the mood disorder questionnaire and the bipolar spectrum diagnostic scale in Argentinean patients with mood disorders". Journal of Affective Disorders 132 (3): 445–449. doi:10.1016/j.jad.2011.03.014. ISSN 0165-0327. http://www.sciencedirect.com/science/article/pii/S0165032711000978. 
  106. Doerfler, Leonard A.; Connor, Daniel F.; Toscano, Peter F. (2011-06). "Aggression, ADHD symptoms, and dysphoria in children and adolescents diagnosed with bipolar disorder and ADHD". Journal of Affective Disorders 131 (1-3): 312–319. doi:10.1016/j.jad.2010.11.029. https://linkinghub.elsevier.com/retrieve/pii/S0165032710007263. 
  107. Mbekou, Valentin; Gignac, Martin; MacNeil, Sasha; Mackay, Pamela; Renaud, Johanne (2014-02). "The CBCL dysregulated profile: An indicator of pediatric bipolar disorder or of psychopathology severity?". Journal of Affective Disorders 155: 299–302. doi:10.1016/j.jad.2013.10.033. https://linkinghub.elsevier.com/retrieve/pii/S0165032713007751. 
  108. Carlson, G. A.; Kelly, Kevin L. (1998). "Manic symptoms in psychiatrically hospitalized children--what do they mean?". Journal of Affective Disorders 51 (2): 123–135. 
  109. 109.0 109.1 109.2 109.3 Liberati, Alessandro; Altman, Douglas G.; Tetzlaff, Jennifer; Mulrow, Cynthia; Gøtzsche, Peter C.; Ioannidis, John P. A.; Clarke, Mike; Devereaux, P. J. et al. (2009-07-21). "The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration". BMJ 339. doi:10.1136/bmj.b2700. ISSN 0959-8138. PMID 19622552. PMC 2714672. https://www.bmj.com/content/339/bmj.b2700. 
  110. Youngstrom, E. A. (2007). "Pediatric bipolar disorder". In E. J. Mash. Assessment of Childhood Disorders (4th ed.). New York: Guilford Press. pp. 253–304. 
  111. Johnson, Sheri L., Christopher J. Miller, and Lori Eisner. “Bipolar Disorder.” In A Guide to Assessments That Work, edited by John Hunsley and Eric J. Mash, 121–37. New York, NY: Oxford University Press, 2008.
  112. Waugh, Meghan J.; Meyer, Thomas D.; Youngstrom, Eric A.; Scott, Jan (2014-05). "A review of self-rating instruments to identify young people at risk of bipolar spectrum disorders". Journal of Affective Disorders 160: 113–121. doi:10.1016/j.jad.2013.12.019. https://linkinghub.elsevier.com/retrieve/pii/S0165032713008598. 
  113. 113.0 113.1 Biederman, Joseph; Faraone, Stephen; Mick, Eric; Wozniak, Janet; Chen, Lisa; Ouellette, Cheryl; Marrs, Abbe; Moore, Phoebe et al. (1996-08). "Attention-Deficit Hyperactivity Disorder and Juvenile Mania: An Overlooked Comorbidity?". Journal of the American Academy of Child & Adolescent Psychiatry 35 (8): 997–1008. doi:10.1097/00004583-199608000-00010. https://linkinghub.elsevier.com/retrieve/pii/S089085670962493X. 
  114. Lewinsohn, P., J. R. Seeley, and Daniel N. Klein. “Bipolar Disorder in Adolescents: Epidemiology and Suicidal Behavior.” In Bipolar Disorder in Childhood and Early Adolescence, edited by B. Geller and M. P. DelBello, 7–24. New York: Guilford, 2003.
  115. Doerfler, Leonard A.; Connor, Daniel F.; Toscano, Peter F. (2011-10-01). "The CBCL Bipolar Profile and Attention, Mood, and Behavior Dysregulation". Journal of Child and Family Studies 20 (5): 545–553. doi:10.1007/s10826-010-9426-z. ISSN 1573-2843. https://doi.org/10.1007/s10826-010-9426-z. 
  116. 116.0 116.1 Henry, David B.; Pavuluri, Mani N.; Youngstrom, Eric; Birmaher, Boris (2008-04). "Accuracy of brief and full forms of the child mania rating scale". Journal of Clinical Psychology 64 (4): 368–381. doi:10.1002/jclp.20464. http://doi.wiley.com/10.1002/jclp.20464. 
  117. 117.0 117.1 117.2 Lee, Hyun-Jeong; Joo, Yeonho; Youngstrom, Eric A.; Yum, Sun Young; Findling, Robert L.; Kim, Hyo-Won (2014-10). "Diagnostic validity and reliability of a Korean version of the Parent and Adolescent General Behavior Inventories". Comprehensive Psychiatry 55 (7): 1730–1737. doi:10.1016/j.comppsych.2014.05.008. https://linkinghub.elsevier.com/retrieve/pii/S0010440X14001205. 
  118. 118.0 118.1 Miguez, Melissa; Weber, Béatrice; Debbané, Martin; Balanzin, Dario; Gex-Fabry, Marianne; Raiola, Fulvia; Barbe, Rémy P.; Vital Bennour, Marylène et al. (2013-08). "Screening for bipolar disorder in adolescents with the Mood Disorder Questionnaire - Adolescent version (MDQ-A) and the Child Bipolar Questionnaire (CBQ): Screening for BD in adolescents". Early Intervention in Psychiatry 7 (3): 270–277. doi:10.1111/j.1751-7893.2012.00388.x. http://doi.wiley.com/10.1111/j.1751-7893.2012.00388.x. 
  119. 119.0 119.1 Youngstrom, E. A. (2014). "A primer on Receiver Operating Characteristic analysis and diagnostic efficiency statistics for pediatric psychology: We are ready to ROC". Journal of Pediatric Psychology 39: 204–221. doi:10.1093/jpepsy/jst062. ISSN 0146-8693. 
  120. Lipsey, Mark W., and David B. Wilson. Practical Meta-Analysis. Applied Social Research Methods Series, Vol. 49. Thousand Oaks, CA: Sage Publications, 2001.
  121. 121.0 121.1 Hasselblad, Vic; Hedges, Larry V. (1995). "Meta-analysis of screening and diagnostic tests.". Psychological Bulletin 117 (1): 167–178. doi:10.1037/0033-2909.117.1.167. ISSN 1939-1455. http://doi.apa.org/getdoi.cfm?doi=10.1037/0033-2909.117.1.167. 
  122. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2014.
  123. 123.0 123.1 Hummel, T. J. “The Usefulness of Tests in Clinical Decisions.” In Scientist-Practitioner Perspectives on Test Interpretation, edited by James W. Lichtenberg and Rodney K. Goodyear, 59–112. Boston: Allyn and Bacon, 1999.
  124. Hedges, L. V.; Pigott, T. D. (2001-09). "The power of statistical tests in meta-analysis". Psychological Methods 6 (3): 203–217. ISSN 1082-989X. PMID 11570228. https://www.ncbi.nlm.nih.gov/pubmed/11570228. 
  125. 125.0 125.1 Gadow, Kenneth D., and Joyce Sprafkin. Child Symptom Inventories Manual. Stony Brook, NY: Checkmate Plus, 1994.
  126. Gadow, Kenneth D., and Joyce Sprafkin. Adolescent Symptom Inventory: Screening Manual. Stony Brook, NY: Checkmate Plus, 1997.
  127. Konstantopoulos, Spyros (2011). "Fixed effects and variance components estimation in three-level meta-analysis". Research Synthesis Methods 2 (1): 61–76. doi:10.1002/jrsm.35. ISSN 1759-2887. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.35. 
  128. Rettew, David C.; Lynch, Alicia Doyle; Achenbach, Thomas M.; Dumenci, Levent; Ivanova, Masha Y. (2009-09). "Meta-analyses of agreement between diagnoses made from clinical evaluations and standardized diagnostic interviews". International Journal of Methods in Psychiatric Research 18 (3): 169–184. doi:10.1002/mpr.289. PMID 19701924. PMC 6878243. http://doi.wiley.com/10.1002/mpr.289. 
  129. Jensen-Doss, Amanda; Youngstrom, Eric A.; Youngstrom, Jennifer Kogos; Feeny, Norah C.; Findling, Robert L. (2014). "Predictors and moderators of agreement between clinical and research diagnoses for children and adolescents.". Journal of Consulting and Clinical Psychology 82 (6): 1151–1162. doi:10.1037/a0036657. ISSN 1939-2117. PMID 24773574. PMC 4278746. http://doi.apa.org/getdoi.cfm?doi=10.1037/a0036657. 
  130. 130.0 130.1 130.2 130.3 Kraemer, Helena Chmura. Evaluating Medical Tests: Objective and Quantitative Guidelines. Newbury Park, CA: Sage, 1992.
  131. Doerfler, Leonard A.; Connor, Daniel F.; Toscano, Peter F. (2010-10-07). "The CBCL Bipolar Profile and Attention, Mood, and Behavior Dysregulation". Journal of Child and Family Studies 20 (5): 545–553. doi:10.1007/s10826-010-9426-z. ISSN 1062-1024. http://dx.doi.org/10.1007/s10826-010-9426-z. 
  132. Marchand, William R.; Clark, Steven C.; Wirth, Laurel; Simon, Cindy (2005-03). "Validity of the parent young mania rating scale in a community mental health setting". Psychiatry (Edgmont) 2 (3): 31–35. ISSN 1550-5952. PMID 21179627. PMC 3004712. https://www.ncbi.nlm.nih.gov/pubmed/21179627. 
  133. Tillman, Rebecca; Geller, Barbara (2005-06). "A Brief Screening Tool for a Prepubertal and Early Adolescent Bipolar Disorder Phenotype". American Journal of Psychiatry 162 (6): 1214–1216. doi:10.1176/appi.ajp.162.6.1214. ISSN 0002-953X. http://psychiatryonline.org/doi/abs/10.1176/appi.ajp.162.6.1214. 
  134. Rucklidge, Julia J (2008-01-10). "Retrospective parent report of psychiatric histories: do checklists reveal specific prodromal indicators for postpubertal-onset pediatric bipolar disorder?: Prodromal indicators of bipolar disorder". Bipolar Disorders 10 (1): 56–66. doi:10.1111/j.1399-5618.2008.00533.x. http://doi.wiley.com/10.1111/j.1399-5618.2008.00533.x. 
  135. Serrano, Eduardo; Ezpeleta, Lourdes; Alda, José A.; Matalí, José L.; San, Luis (2011). "Psychometric Properties of the Young Mania Rating Scale for the Identification of Mania Symptoms in Spanish Children and Adolescents with Attention Deficit/Hyperactivity Disorder". Psychopathology 44 (2): 125–132. doi:10.1159/000320893. ISSN 0254-4962. PMID 21228617. https://www.karger.com/Article/FullText/320893. 
  136. 136.0 136.1 Faraone, Stephen V; Althoff, Robert R; Hudziak, James J; Monuteaux, Michael; Biederman, Joseph (2005-12). "The CBCL predicts DSM bipolar disorder in children: a receiver operating characteristic curve analysis". Bipolar Disorders 7 (6): 518–524. doi:10.1111/j.1399-5618.2005.00271.x. ISSN 1398-5647. http://doi.wiley.com/10.1111/j.1399-5618.2005.00271.x. 
  137. Uchida, Mai; Faraone, Stephen V.; Martelon, MaryKate; Kenworthy, Tara; Woodworth, K. Yvonne; Spencer, Thomas J.; Wozniak, Janet R.; Biederman, Joseph (2014-08). "Further evidence that severe scores in the aggression/anxiety-depression/attention subscales of child behavior checklist (severe dysregulation profile) can screen for bipolar disorder symptomatology: a conditional probability analysis". Journal of Affective Disorders 165: 81–86. doi:10.1016/j.jad.2014.04.021. PMID 24882182. PMC 4066999. https://linkinghub.elsevier.com/retrieve/pii/S0165032714002055. 
  138. 138.0 138.1 Geller, Barbara; Tillman, Rebecca (2005). "Prepubertal and early adolescent bipolar I disorder: review of diagnostic validation by Robins and Guze criteria". The Journal of Clinical Psychiatry 66 Suppl 7: 21–28. ISSN 0160-6689. PMID 16124838. https://www.ncbi.nlm.nih.gov/pubmed/16124838. 
  139. 139.0 139.1 139.2 139.3 139.4 139.5 139.6 Youngstrom, E. A.; Meyers, Oren I.; Youngstrom, J. Kogos; Calabrese, J. R.; Findling, R. L. (2006). "Comparing the effects of sampling designs on the diagnostic accuracy of eight promising screening algorithms for pediatric bipolar disorder". Biological Psychiatry 60: 1013–1019. doi:10.1016/j.biopsych.2006.06.023. 
  140. Swets, John A.; Dawes, Robyn M.; Monahan, John (2000-05). "Psychological Science Can Improve Diagnostic Decisions". Psychological Science in the Public Interest 1 (1): 1–26. doi:10.1111/1529-1006.001. ISSN 1529-1006. https://doi.org/10.1111/1529-1006.001. 
  141. Geller, Barbara; Warner, Kathy; Williams, Marlene; Zimerman, Betsy (1998-11-01). "Prepubertal and young adolescent bipolarity versus ADHD: assessment and validity using the WASH-U-KSADS, CBCL and TRF". Journal of Affective Disorders 51 (2): 93–100. doi:10.1016/S0165-0327(98)00176-1. ISSN 0165-0327. http://www.sciencedirect.com/science/article/pii/S0165032798001761. 
  142. Murray, Greg; Nicholas, Christian L.; Kleiman, Jan; Dwyer, Robyn; Carrington, Melinda J.; Allen, Nicholas B.; Trinder, John (2009). "Nature’s clocks and human mood: The circadian system modulates reward motivation.". Emotion 9 (5): 705–716. doi:10.1037/a0017080. ISSN 1931-1516. http://doi.apa.org/getdoi.cfm?doi=10.1037/a0017080. 
  143. Harvey, Allison G. (2008-07-01). "Sleep and Circadian Rhythms in Bipolar Disorder: Seeking Synchrony, Harmony, and Regulation". American Journal of Psychiatry 165 (7): 820–829. doi:10.1176/appi.ajp.2008.08010098. ISSN 0002-953X. https://ajp.psychiatryonline.org/doi/full/10.1176/appi.ajp.2008.08010098. 
  144. Ehnvall, Anna; Mitchell, Philip B; Hadzi-Pavlovic, Dusan; Loo, Colleen; Breakspear, Michael; Wright, Adam; Roberts, Gloria; Frankland, Andrew et al. (2011-02). "Pain and rejection sensitivity in bipolar depression: Pain and rejection sensitivity". Bipolar Disorders 13 (1): 59–66. doi:10.1111/j.1399-5618.2011.00892.x. http://doi.wiley.com/10.1111/j.1399-5618.2011.00892.x. 
  145. Robertson, H. A.; Lam, R. W.; Stewart, J. N.; Yatham, L. N.; Tam, E. M.; Zis, A. P. (1996-12). "Atypical depressive symptoms and clusters in unipolar and bipolar depression". Acta Psychiatrica Scandinavica 94 (6): 421–427. doi:10.1111/j.1600-0447.1996.tb09884.x. ISSN 0001-690X. 
  146. Emery, R. “Family Conflicts and Their Developmental Implications: A Conceptual Analysis of Meanings for the Structure of Relationships.” In Family Conflicts, edited by W. Hartup and C. Shantz, 270–98. 0. New York: Cambridge Univ Press, 1992.
  147. Axelson, David A.; Bertocci, Michele A.; Lewin, Daniel S.; Trubnick, Laura S.; Birmaher, B.; Williamson, Douglas E.; Ryan, Neal D.; Dahl, Ronald E. (2003). "Measuring mood and complex behavior in natural environments: use of ecological momentary assessment in pediatric affective disorders.". Journal of Child and Adolescent Psychopharmacology 13 (3): 253–66. ISSN 1044-5463. 
  148. 148.0 148.1 West, Amy E.; Celio, Christine I.; Henry, David B.; Pavuluri, Mani N. (2011-01). "Child Mania Rating Scale-Parent Version: A valid measure of symptom change due to pharmacotherapy". Journal of Affective Disorders 128 (1-2): 112–119. doi:10.1016/j.jad.2010.06.013. https://linkinghub.elsevier.com/retrieve/pii/S016503271000426X. 
  149. 149.0 149.1 149.2 Bebko, Genna; Bertocci, Michele A.; Fournier, Jay C.; Hinze, Amanda K.; Bonar, Lisa; Almeida, Jorge R. C.; Perlman, Susan B.; Versace, Amelia et al. (2014-01-01). "Parsing Dimensional vs Diagnostic Category–Related Patterns of Reward Circuitry Function in Behaviorally and Emotionally Dysregulated Youth in the Longitudinal Assessment of Manic Symptoms Study". JAMA Psychiatry 71 (1): 71. doi:10.1001/jamapsychiatry.2013.2870. ISSN 2168-622X. PMID 24285346. PMC 4238412. http://archpsyc.jamanetwork.com/article.aspx?doi=10.1001/jamapsychiatry.2013.2870. 
  150. 150.0 150.1 Podsakoff, Philip M.; MacKenzie, Scott B.; Podsakoff, Nathan P. (2012-01-10). "Sources of Method Bias in Social Science Research and Recommendations on How to Control It". Annual Review of Psychology 63 (1): 539–569. doi:10.1146/annurev-psych-120710-100452. ISSN 0066-4308. http://www.annualreviews.org/doi/10.1146/annurev-psych-120710-100452. 
  151. Yeh, May; Weisz, John R. (2001). "Why are we here at the clinic? Parent–child (dis)agreement on referral problems at outpatient treatment entry.". Journal of Consulting and Clinical Psychology 69 (6): 1018–1025. doi:10.1037/0022-006x.69.6.1018. ISSN 1939-2117. https://psycnet.apa.org/doi/10.1037/0022-006X.69.6.1018. 
  152. Youngstrom, Eric A.; Reyes, Andres De Los (2015-03-04). "Commentary: Moving Toward Cost-Effectiveness in Using Psychophysiological Measures in Clinical Assessment: Validity, Decision Making, and Adding Value". Journal of Clinical Child & Adolescent Psychology 44 (2): 352–361. doi:10.1080/15374416.2014.913252. ISSN 1537-4416. http://www.tandfonline.com/doi/abs/10.1080/15374416.2014.913252. 
  153. Campbell, D. T.; Fiske, D. W. (1959-03). "Convergent and discriminant validation by the multitrait-multimethod matrix". Psychological Bulletin 56 (2): 81–105. ISSN 0033-2909. PMID 13634291. https://www.ncbi.nlm.nih.gov/pubmed/13634291. 
  154. Keck, Paul E; Welge, Jeffrey A; Strakowski, Stephen M; Arnold, Lesley M; McElroy, Susan L (2000-04). "Placebo effect in randomized, controlled maintenance studies of patients with bipolar disorder". Biological Psychiatry 47 (8): 756–761. doi:10.1016/s0006-3223(99)00309-1. ISSN 0006-3223. https://doi.org/10.1016/S0006-3223(99)00309-1. 
  155. You, Dokyoung S.; Youngstrom, Eric A.; Feeny, Norah C.; Youngstrom, Jennifer Kogos; Findling, Robert L. (2017-07-04). "Comparing the Diagnostic Accuracy of Five Instruments for Detecting Posttraumatic Stress Disorder in Youth". Journal of Clinical Child & Adolescent Psychology 46 (4): 511–522. doi:10.1080/15374416.2015.1030754. ISSN 1537-4416. PMID 25946667. PMC 4703561. https://www.tandfonline.com/doi/full/10.1080/15374416.2015.1030754. 
  156. 156.0 156.1 156.2 Meeuwesen, Ludwien; van den Brink-Muinen, Atie; Hofstede, Geert (2009-04). "Can dimensions of national culture predict cross-national differences in medical communication?". Patient Education and Counseling 75 (1): 58–66. doi:10.1016/j.pec.2008.09.015. https://linkinghub.elsevier.com/retrieve/pii/S0738399108005302. 
  157. Minkov, Michael; Hofstede, Geert (2011-02-08). Chudzikowski, Katharina. ed. "The evolution of Hofstede's doctrine". Cross Cultural Management: An International Journal 18 (1): 10–20. doi:10.1108/13527601111104269. ISSN 1352-7606. https://www.emerald.com/insight/content/doi/10.1108/13527601111104269/full/html. 
  158. Mackin, Paul; Targum, Steven D.; Kalali, Amir; Rom, Dror; Young, Allan H. (2006-10). "Culture and assessment of manic symptoms". The British Journal of Psychiatry 189 (4): 379–380. doi:10.1192/bjp.bp.105.013920. ISSN 0007-1250. https://www.cambridge.org/core/journals/the-british-journal-of-psychiatry/article/culture-and-assessment-of-manic-symptoms/E3C9813BF62ACFBF966EAAF5AE30D016. 
  159. Dubicka, Bernadka; Carlson, Gabrielle A.; Vail, Andy; Harrington, Richard (2008-04). "Prepubertal mania: diagnostic differences between US and UK clinicians". European Child & Adolescent Psychiatry 17 (3): 153–161. doi:10.1007/s00787-007-0649-5. ISSN 1018-8827. http://link.springer.com/10.1007/s00787-007-0649-5. 
  160. Angst, J. (2013). "Bipolar disorders in DSM-5: strengths, problems and perspectives". International Journal of Bipolar Disorders 1: 12. doi:10.1186/2194-7511-1-12. ISSN 2194-7511. 
  161. Chambless, Dianne L.; Hollon, Steven D. (1998). "Defining empirically supported therapies.". Journal of Consulting and Clinical Psychology 66 (1): 7–18. doi:10.1037/0022-006X.66.1.7. ISSN 1939-2117. http://doi.apa.org/getdoi.cfm?doi=10.1037/0022-006X.66.1.7. 
  162. Walshaw, Patricia D.; Alloy, Lauren B.; Sabb, Fred W. (2010-03). "Executive Function in Pediatric Bipolar Disorder and Attention-Deficit Hyperactivity Disorder: In Search of Distinct Phenotypic Profiles". Neuropsychology Review 20 (1): 103–120. doi:10.1007/s11065-009-9126-x. ISSN 1040-7308. PMID 20165924. PMC 2834768. http://link.springer.com/10.1007/s11065-009-9126-x. 
  163. Joseph, Megan F.; Frazier, Thomas W.; Youngstrom, Eric A.; Soares, Jair C. (2008-12). "A quantitative and qualitative review of neurocognitive performance in pediatric bipolar disorder". Journal of Child and Adolescent Psychopharmacology 18 (6): 595–605. doi:10.1089/cap.2008.064. ISSN 1557-8992. PMID 19108664. PMC 2768898. https://www.ncbi.nlm.nih.gov/pubmed/19108664. 
  164. Rocha-Rego, V.; Jogia, J.; Marquand, A. F.; Mourao-Miranda, J.; Simmons, A.; Frangou, S. (2014-02). "Examination of the predictive value of structural magnetic resonance scans in bipolar disorder: a pattern classification approach". Psychological Medicine 44 (3): 519–532. doi:10.1017/S0033291713001013. ISSN 0033-2917. PMID 23734914. PMC 3880067. https://www.cambridge.org/core/journals/psychological-medicine/article/examination-of-the-predictive-value-of-structural-magnetic-resonance-scans-in-bipolar-disorder-a-pattern-classification-approach/5E18BAF404B22874F52F40BA9A4A8FC0. 
  165. Hartley, Catherine A; Phelps, Elizabeth A (2010-01). "Changing Fear: The Neurocircuitry of Emotion Regulation". Neuropsychopharmacology 35 (1): 136–146. doi:10.1038/npp.2009.121. ISSN 0893-133X. PMID 19710632. PMC 3055445. http://www.nature.com/articles/npp2009121. 
  166. Strakowski, Stephen M; Adler, Caleb M; Almeida, Jorge; Altshuler, Lori L; Blumberg, Hilary P; Chang, Kiki D; DelBello, Melissa P; Frangou, Sophia et al. (2012-06). "The functional neuroanatomy of bipolar disorder: a consensus model". Bipolar Disorders 14 (4): 313–325. doi:10.1111/j.1399-5618.2012.01022.x. PMID 22631617. PMC 3874804. http://doi.wiley.com/10.1111/j.1399-5618.2012.01022.x. 
  167. Meyer, Gregory J. (2003). "Guidelines for Reporting Information in Studies of Diagnostic Test Accuracy: The STARD Initiative". Journal of Personality Assessment 81 (3): 191–193. doi:10/fcgxs5. 
  168. Trull, Timothy J.; Ebner-Priemer, Ulrich (2013-03-28). "Ambulatory Assessment". Annual Review of Clinical Psychology 9 (1): 151–176. doi:10.1146/annurev-clinpsy-050212-185510. ISSN 1548-5943. PMID 23157450. PMC 4249763. http://www.annualreviews.org/doi/10.1146/annurev-clinpsy-050212-185510. 
  169. Stewart, Angela J.; Theodore-Oklota, Christina; Hadley, Wendy; Brown, Larry K.; Donenberg, Geri; DiClemente, Ralph; Project STYLE Study Group (2012-11). "Mania Symptoms and HIV-Risk Behavior Among Adolescents in Mental Health Treatment". Journal of Clinical Child & Adolescent Psychology 41 (6): 803–810. doi:10.1080/15374416.2012.675569. ISSN 1537-4416. PMID 22540428. PMC 3470757. http://www.tandfonline.com/doi/abs/10.1080/15374416.2012.675569. 
  170. Cuthbert, B. N.; Insel, T. R. (2010-11-01). "Toward New Approaches to Psychotic Disorders: The NIMH Research Domain Criteria Project". Schizophrenia Bulletin 36 (6): 1061–1062. doi:10.1093/schbul/sbq108. ISSN 0586-7614. PMID 20929969. PMC 2963043. https://academic.oup.com/schizophreniabulletin/article-lookup/doi/10.1093/schbul/sbq108. 
  171. Gore, Fiona M; Bloem, Paul JN; Patton, George C; Ferguson, Jane; Joseph, Véronique; Coffey, Carolyn; Sawyer, Susan M; Mathers, Colin D (2011-06). "Global burden of disease in young people aged 10–24 years: a systematic analysis". The Lancet 377 (9783): 2093–2102. doi:10.1016/s0140-6736(11)60512-6. ISSN 0140-6736. https://doi.org/10.1016/S0140-6736(11)60512-6. 
  172. Cheung, Fanny M.; Leung, Kwok (1998-01). "Indigenous Personality Measures". Journal of Cross-Cultural Psychology 29 (1): 233–248. doi:10.1177/0022022198291012. ISSN 0022-0221. https://doi.org/10.1177/0022022198291012. 
  173. Young, R. C.; Biggs, J. T.; Ziegler, V. E.; Meyer, D. A. (1978-11). "A Rating Scale for Mania: Reliability, Validity and Sensitivity". British Journal of Psychiatry 133 (5): 429–435. doi:10.1192/bjp.133.5.429. ISSN 0007-1250. https://www.cambridge.org/core/product/identifier/S0007125000198551/type/journal_article. 
  174. Algorta, Guillermo Pérez; Youngstrom, Eric A.; Frazier, Thomas W.; Freeman, Andrew J.; Youngstrom, Jennifer Kogos; Findling, Robert L. (2011-02). "Suicidality in pediatric bipolar disorder: predictor or outcome of family processes and mixed mood presentation?". Bipolar Disorders 13 (1): 76–86. doi:10.1111/j.1399-5618.2010.00886.x. ISSN 1399-5618. PMID 21320255. PMC 3076793. https://www.ncbi.nlm.nih.gov/pubmed/21320255. 
  175. Du Rocher Schudlich, Tina D.; Youngstrom, Eric A.; Calabrese, Joseph R.; Findling, Robert L. (2008-08). "The Role of Family Functioning in Bipolar Disorder in Families". Journal of Abnormal Child Psychology 36 (6): 849–863. doi:10.1007/s10802-008-9217-9. ISSN 0091-0627. http://link.springer.com/10.1007/s10802-008-9217-9. 
  176. Miklowitz, D. J. (2002). The Bipolar Disorder Survival Guide: What You and Your Family Need to Know. New York: Guilford. 
  177. Camara, Wayne J.; Nathan, Julie S.; Puente, Anthony E. (2000-04). "Psychological test usage: Implications in professional psychology.". Professional Psychology: Research and Practice 31 (2): 141–154. doi:10.1037/0735-7028.31.2.141. ISSN 1939-1323. http://doi.apa.org/getdoi.cfm?doi=10.1037/0735-7028.31.2.141. 
  179. Stedman, James M.; Hatch, John P.; Schoenfeld, Lawrence S. (2001-12-01). "The Current Status of Psychological Assessment Training in Graduate and Professional Schools". Journal of Personality Assessment 77 (3): 398–407. doi:10.1207/S15327752JPA7703_02. ISSN 0022-3891. PMID 11781028. https://doi.org/10.1207/S15327752JPA7703_02. 
  180. Youngstrom, Eric A.; Choukas-Bradley, Sophia; Calhoun, Casey D.; Jensen-Doss, Amanda (2015-02). "Clinical Guide to the Evidence-Based Assessment Approach to Diagnosis and Treatment". Cognitive and Behavioral Practice 22 (1): 20–35. doi:10.1016/j.cbpra.2013.12.005. https://linkinghub.elsevier.com/retrieve/pii/S1077722913001193. 
  181. Pendergast, Laura L.; Youngstrom, Eric A.; Brown, Christopher; Jensen, Dane; Abramson, Lyn Y.; Alloy, Lauren B. (2015). "Structural invariance of General Behavior Inventory (GBI) scores in Black and White young adults.". Psychological Assessment 27 (1): 21–30. doi:10.1037/pas0000020. ISSN 1939-134X. PMID 25222430. PMC 4355320. http://doi.apa.org/getdoi.cfm?doi=10.1037/pas0000020.