# Estimation/Capture-Recapture

Collar tagged rock hyrax
Jackdaw with a numbered aluminum ring on its left tarsus
Biologist is marking a Chittenango ovate amber snail to monitor the population.
Marked Chittenango ovate amber snail.

This learning resource is about the capture-recapture method, which is commonly used in ecology to estimate the size of an animal population when it is impractical or impossible to count every individual[1]. A portion of the population is captured, marked, and released. Later, another portion is captured and the number of marked individuals within that sample is counted.

• Analyze the requirements and constraints in the epidemiological context of COVID-19. How can the capture-recapture methodology be applied to estimate the total number of infections in the population? Explain the challenges with respect to the available data and the tests performed.

## Example

• 100 individuals of the population are captured, marked, and released. On the next visit only 96 individuals of the population could be captured, and among those 96 individuals 16 were marked.
• (${\displaystyle N=?}$ unknown, estimated total number in the population) ${\displaystyle N}$ is the estimated total number of individuals in the population. The real size of the population ${\displaystyle N_{t}}$ at time ${\displaystyle t}$ is in general unknown.
• (${\displaystyle M=100}$ known number of marked individuals - first visit) because the researcher marks the captured individuals, the number of marked individuals is known. This is the number of individuals in the population that were marked during the first sampling.
• (${\displaystyle n=96}$ known number of individuals - second visit) the number of individuals captured in the second sampling might differ from the number captured on the first visit.
• (${\displaystyle m=16}$ known number of marked individuals - second visit) 16 of the 96 individuals captured on the second visit are marked.
${\displaystyle {\frac {N}{M}}={\frac {N}{100}}={\frac {96}{16}}={\frac {n}{m}}\quad \Rightarrow \quad N={\frac {96}{16}}\cdot 100=600}$
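The arithmetic of this example can be checked with a short Python sketch. The function name `proportional_estimate` and its parameter names are chosen here for illustration only and do not come from the original resource.

```python
def proportional_estimate(marked_first: int, caught_second: int, marked_second: int) -> float:
    """Estimate N from the proportion N / M = n / m, i.e. N = M * n / m."""
    if marked_second <= 0:
        raise ValueError("at least one marked individual must be recaptured")
    return marked_first * caught_second / marked_second


# Values from the example: M = 100, n = 96, m = 16
estimate = proportional_estimate(marked_first=100, caught_second=96, marked_second=16)
print(estimate)  # 600.0
```

The guard against `marked_second <= 0` reflects that the proportion is undefined when no marked individual appears in the second sample.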

## Assumption of Proportionality

Under the assumption that the proportion of marked individuals in the second sample equals the proportion of marked individuals in the whole population, an estimate of the total population size can be obtained by dividing the number of marked individuals by the proportion of marked individuals in the second sample. This assumption may not be justified in all cases. For example, animals that were marked after being caught in a trap may learn from that experience, which reduces the probability that they are caught a second time. Fewer marked animals then appear in the second sample, which leads to an overestimation of the population.
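The effect of trap-shy animals can be illustrated with a small simulation. This is only a sketch under assumed values: the population size of 1000, the capture probability of 0.2, and the `shyness` factor are hypothetical and were picked for illustration.

```python
import random


def simulate_estimate(rng, population=1000, p_capture=0.2, shyness=1.0):
    """Run one two-visit study and return the estimate M * n / m.

    `shyness` scales the recapture probability of marked animals:
    1.0 means no behavioural change, 0.5 means marked animals are
    half as likely to be caught again (hypothetical values).
    """
    # First visit: every animal is caught with the same probability.
    marked = [rng.random() < p_capture for _ in range(population)]
    M = sum(marked)

    n = 0  # animals caught on the second visit
    m = 0  # marked animals among them
    for is_marked in marked:
        p = p_capture * shyness if is_marked else p_capture
        if rng.random() < p:
            n += 1
            m += is_marked
    if m == 0:
        return None  # estimate undefined without recaptures
    return M * n / m


def mean_estimate(shyness, trials=200, seed=1):
    """Average the estimate over many simulated studies."""
    rng = random.Random(seed)
    results = [simulate_estimate(rng, shyness=shyness) for _ in range(trials)]
    results = [r for r in results if r is not None]
    return sum(results) / len(results)


fair_mean = mean_estimate(shyness=1.0)
shy_mean = mean_estimate(shyness=0.5)
print(f"true N = 1000, no trap shyness: {fair_mean:.0f}, trap-shy: {shy_mean:.0f}")
```

When marked animals avoid the traps, the average estimate lies well above the true population size, while it stays close to the true size when all animals are equally catchable.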

## Terminology

The method is most useful when it is not practical to count all the individuals in the population. Other names for this method, or closely related methods, include capture-recapture, capture-mark-recapture, mark-recapture, sight-resight, mark-release-recapture, multiple systems estimation, band recovery, the Petersen method,[2] and the Lincoln method.

## Application in Epidemiology

Another major application for these methods is in epidemiology,[3] where they are used to estimate the completeness of ascertainment of disease registers. Typical applications include estimating the number of people needing particular services (e.g. services for children with learning disabilities, services for medically frail elderly living in the community), or with particular conditions (e.g. illegal drug addicts, people infected with HIV).[4]

• Individuals of the population are tested (COVID-19 virus test). As a simplification, all tested people show COVID-19 symptoms (e.g. cold, fever, ...). Of the 96 individuals tested, 16 were infected.
• (${\displaystyle N=300000}$ estimated total number with specific requirements or properties in the population - population view) ${\displaystyle N}$ is the estimated total number of individuals in the population that have specific requirements or properties. An example could be the number of people showing symptoms of cough and fever. The real number ${\displaystyle N_{t}}$ at time ${\displaystyle t}$ is in general unknown.
• (${\displaystyle I=?}$ unknown number of infected individuals in the population - population view) the researcher wants to know the number of infected people among the ${\displaystyle N}$ individuals. In contrast to the number of marked individuals ${\displaystyle M}$ in the ecological example, this number is unknown and is the quantity to be estimated.
• (${\displaystyle n_{t}=96}$ known number of tested individuals - tested subgroup of the population at time ${\displaystyle t}$) the individuals tested in this sampling are selected in such a way that they share the same specific requirements or properties as the considered group of ${\displaystyle N}$ individuals. Keep in mind that the real number of individuals in that group may differ from the estimate ${\displaystyle N}$.
• (${\displaystyle i_{t}=16}$ known number of infected individuals - in the test at time ${\displaystyle t}$) 16 of the 96 people in the tested group are infected.
${\displaystyle {\frac {N}{I}}={\frac {300000}{I}}={\frac {96}{16}}={\frac {n_{t}}{i_{t}}}\quad \Rightarrow \quad I={\frac {16}{96}}\cdot 300000=50000}$
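The scaling step can be sketched in Python as follows. The point estimate reproduces the calculation above. The interval around it is an addition that is not part of the original resource: it uses the normal approximation for a proportion and assumes that the tested people are a simple random sample of the group of ${\displaystyle N}$ individuals. The function name `scale_up_infections` is hypothetical.

```python
import math


def scale_up_infections(n_tested: int, i_infected: int, group_size: int, z: float = 1.96):
    """Return (estimate, lower, upper) for the number of infected in the group.

    Assumes the tested people are a simple random sample of the group,
    which is an idealisation (assumption of this sketch).
    """
    p = i_infected / n_tested               # infected fraction among the tested
    se = math.sqrt(p * (1 - p) / n_tested)  # normal-approximation standard error
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return p * group_size, lower * group_size, upper * group_size


estimate, lower, upper = scale_up_infections(n_tested=96, i_infected=16, group_size=300_000)
print(round(estimate), round(lower), round(upper))
```

With only 96 tested individuals the interval is wide, which gives a rough impression of how the sample size limits the precision of the estimate of ${\displaystyle I}$.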

The challenge is to estimate the total number of individuals in the population with specific requirements or properties, e.g. the number of people in the population who show both cough and fever. The group of individuals tested for COVID-19 is not homogeneous. For example, asymptomatic patients were not tested and do not appear in the health system. Furthermore, health care staff might be tested without symptoms because of their exposure to vulnerable patients. The total number of tested people must therefore be divided into those who show COVID-19 symptoms and those who do not, and the group of ${\displaystyle N=300000}$ individuals must be the estimated number of citizens who show COVID-19 symptoms. Furthermore, not everyone in the group of ${\displaystyle N=300000}$ individuals may be tested, due to the limited test capacity in the considered region, and only a fraction of the population may appear in the health system. To estimate the total number of COVID-19 infected people in the considered population, people without COVID-19 symptoms must also be tested. This is again a challenge with limited test capacity.

• How can researchers collect data in a study that allows them to estimate the number ${\displaystyle N}$?
• Is it possible to use a citizen science approach for the estimation?
• How do you check whether the sample size is sufficient for your estimation of ${\displaystyle N}$ and ${\displaystyle I}$?
• Do you think that the assumption that the study population is "closed" is appropriate for the COVID-19 disease?

## Field work related to mark-recapture

Typically a researcher visits a study area and uses traps to capture a group of individuals alive. Each of these individuals is marked with a unique identifier (e.g., a numbered tag or band), and then is released unharmed back into the environment. A mark recapture method was first used for ecological study in 1896 by C.G. Johannes Petersen to estimate plaice, Pleuronectes platessa, populations.[5]

Sufficient time is allowed to pass for the marked individuals to redistribute themselves among the unmarked population.[5]

Next, the researcher returns and captures another sample of individuals. Some individuals in this second sample will have been marked during the initial visit and are now known as recaptures[6]. Other animals captured during the second visit will not have been captured during the first visit to the study area. These unmarked animals are usually given a tag or band during the second visit and then are released.[5]

Population size can be estimated from as few as two visits to the study area. Commonly, more than two visits are made, particularly if estimates of survival or movement are desired. Regardless of the total number of visits, the researcher simply records the date of each capture of each individual. The "capture histories" generated are analyzed mathematically to estimate population size, survival, or movement.[5]
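As a minimal sketch of how capture histories can be turned into the counts needed for a two-visit estimate, the following Python example stores, for each animal, the visits on which it was caught. The animal IDs and histories are invented for illustration, and visit numbers are used in place of capture dates to keep the example short.

```python
# Hypothetical capture histories: animal ID -> set of visit numbers on which it was caught.
capture_histories = {
    "A01": {1},
    "A02": {1, 2},
    "A03": {1, 2},
    "A04": {2},
    "A05": {1},
    "A06": {2},
    "A07": {1, 2},
    "A08": {2},
}


def summarize_two_visits(histories):
    """Count animals marked on visit 1, caught on visit 2, and recaptured."""
    marked_first = sum(1 for visits in histories.values() if 1 in visits)
    caught_second = sum(1 for visits in histories.values() if 2 in visits)
    recaptured = sum(1 for visits in histories.values() if {1, 2} <= visits)
    return marked_first, caught_second, recaptured


print(summarize_two_visits(capture_histories))  # (5, 6, 3)
```

Keeping the full history per individual, rather than only the three totals, is what allows the same records to be reused for survival or movement analyses when more visits are made.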

## Notation

Let

N = Number of animals in the population
n = Number of animals marked on the first visit
K = Number of animals captured on the second visit
k = Number of recaptured animals that were marked

A biologist wants to estimate the size of a population of turtles in a lake. She captures 10 turtles on her first visit to the lake, and marks their backs with paint. A week later she returns to the lake and captures 15 turtles. Five of these 15 turtles have paint on their backs, indicating that they are recaptured animals. This example is (n, K, k) = (10, 15, 5). The problem is to estimate N, for example by ${\displaystyle N=nK/k}$.

## Lincoln–Petersen estimator

The Lincoln–Petersen method[7] (also known as the Petersen–Lincoln index[5] or Lincoln index) can be used to estimate population size if only two visits are made to the study area. This method assumes that the study population is "closed"[8]. In other words, the two visits to the study area are close enough in time so that no individuals die, are born, or move into or out of the study area between visits. The model also assumes that no marks fall off animals between visits to the field site by the researcher, and that the researcher correctly records all marks.

Given those conditions, estimated population size is:

${\displaystyle {\hat {N}}={\frac {Kn}{k}},}$

### Derivation

It is assumed[9] that all individuals have the same probability of being captured in the second sample, regardless of whether they were previously captured in the first sample (with only two samples, this assumption cannot be tested directly).

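This implies that the proportion of individuals in the second sample that are marked (${\displaystyle k/K}$) should equal the proportion of the total population that is marked (${\displaystyle n/N}$). Equivalently, the fraction of marked individuals that are recaptured (${\displaystyle k/n}$) should equal the fraction of the total population included in the second sample (${\displaystyle K/N}$). For example, if half of the marked individuals were recaptured, it would be assumed that half of the total population was included in the second sample.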

In symbols,

${\displaystyle {\frac {k}{K}}={\frac {n}{N}}.}$

A rearrangement of this gives

${\displaystyle {\hat {N}}={\frac {Kn}{k}},}$

the formula used for the Lincoln–Petersen method.[9]

### Sample calculation

In the example (n, K, k) = (10, 15, 5) the Lincoln–Petersen method estimates that there are 30 turtles in the lake.

${\displaystyle {\hat {N}}={\frac {Kn}{k}}={\frac {10\times 15}{5}}=30}$

## Chapman estimator

The Lincoln–Petersen estimator is asymptotically unbiased as sample size approaches infinity, but is biased at small sample sizes.[10] An alternative less biased estimator of population size is given by the Chapman estimator:[10]

${\displaystyle {\hat {N}}_{C}={\frac {(K+1)(n+1)}{k+1}}-1}$

### Sample calculation

The example (n, K, k) = (10, 15, 5) gives

${\displaystyle {\hat {N}}_{C}={\frac {(K+1)(n+1)}{k+1}}-1={\frac {16\times 11}{6}}-1\approx 28.3}$

Note that the answer provided by this equation must be truncated, not rounded. Thus, the Chapman method estimates 28 turtles in the lake.
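The two estimators can be compared on the turtle example with a few lines of Python. The function names are chosen for this sketch and are not taken from any library.

```python
import math


def lincoln_petersen(n: int, K: int, k: int) -> float:
    """Lincoln-Petersen estimate K * n / k."""
    return K * n / k


def chapman(n: int, K: int, k: int) -> float:
    """Chapman estimate (K + 1)(n + 1) / (k + 1) - 1; defined even when k = 0."""
    return (K + 1) * (n + 1) / (k + 1) - 1


n, K, k = 10, 15, 5  # turtle example
raw = chapman(n, K, k)
print(lincoln_petersen(n, K, k))  # 30.0
print(round(raw, 2))              # 28.33
print(math.floor(raw))            # 28 (truncated, not rounded)
```

Unlike the Lincoln–Petersen formula, the Chapman formula does not divide by zero when no marked animal is recaptured, because of the added 1 in the denominator.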

Surprisingly, Chapman's estimate was one conjecture from a range of possible estimators: "In practice, the whole number immediately less than (K+1)(n+1)/(k+1) or even Kn/(k+1) will be the estimate. The above form is more convenient for mathematical purposes."[10](see footnote, page 144). Chapman also found the estimator could have considerable negative bias for small Kn/N [10](page 146), but was unconcerned because the estimated standard deviations were large for these cases.

## Confidence interval

An approximate ${\displaystyle 100(1-\alpha )\%}$ confidence interval for the population size N can be obtained as:

${\displaystyle K+n-k-0.5+{\frac {(K-k+0.5)(n-k+0.5)}{(k+0.5)}}\exp(\pm z_{\alpha /2}{\hat {\sigma }}_{0.5})}$,

where ${\textstyle z_{\alpha /2}}$ corresponds to the ${\displaystyle 1-\alpha /2}$ quantile of a standard normal random variable, and

${\displaystyle {\hat {\sigma }}_{0.5}={\sqrt {{\frac {1}{k+0.5}}+{\frac {1}{K-k+0.5}}+{\frac {1}{n-k+0.5}}+{\frac {k+0.5}{(n-k+0.5)(K-k+0.5)}}}}}$.

It has been shown that this confidence interval has actual coverage probabilities that are close to the nominal ${\displaystyle 100(1-\alpha )\%}$ level even for small populations and extreme capture probabilities (near to 0 or 1), in which cases other confidence intervals fail to achieve the nominal coverage levels.[11]
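The interval can be evaluated numerically as in the following sketch, which applies the formulas above to the turtle example (n, K, k) = (10, 15, 5) at the 95% level. The function name is an assumption of this sketch; the normal quantile is taken from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist


def transformed_logit_interval(n: int, K: int, k: int, alpha: float = 0.05):
    """Approximate 100(1 - alpha)% confidence interval for N."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sigma = math.sqrt(
        1 / (k + 0.5)
        + 1 / (K - k + 0.5)
        + 1 / (n - k + 0.5)
        + (k + 0.5) / ((n - k + 0.5) * (K - k + 0.5))
    )
    base = K + n - k - 0.5
    scale = (K - k + 0.5) * (n - k + 0.5) / (k + 0.5)
    return base + scale * math.exp(-z * sigma), base + scale * math.exp(z * sigma)


lower, upper = transformed_logit_interval(n=10, K=15, k=5)
print(f"{lower:.1f} to {upper:.1f}")  # 21.9 to 64.7
```

The interval is asymmetric around the point estimate of 30, and its lower end cannot fall below the number of distinct animals actually seen (K + n - k = 20).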

## Bayesian estimate

The mean value ± standard deviation is

${\displaystyle N\approx \mu \pm {\sqrt {\mu \epsilon }}}$

where

${\displaystyle \mu ={\frac {(K-1)(n-1)}{k-2}}}$ for ${\displaystyle k>2}$
${\displaystyle \epsilon ={\frac {(K-k+1)(n-k+1)}{(k-2)(k-3)}}}$ for ${\displaystyle k>3}$

A derivation is found here: Wikipedia:Talk:Mark and recapture#Statistical treatment.

The example (n, K, k) = (10, 15, 5) gives the estimate N ≈ 42 ± 21.5.
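The following sketch evaluates the two formulas above for the turtle example. The function name `bayesian_estimate` is chosen for illustration.

```python
import math


def bayesian_estimate(n: int, K: int, k: int):
    """Return (mean, standard deviation) using the formulas above; requires k > 3."""
    if k <= 3:
        raise ValueError("the standard deviation is only defined for k > 3")
    mu = (K - 1) * (n - 1) / (k - 2)
    eps = (K - k + 1) * (n - k + 1) / ((k - 2) * (k - 3))
    return mu, math.sqrt(mu * eps)


mean, sd = bayesian_estimate(n=10, K=15, k=5)
print(f"{mean:.0f} ± {sd:.1f}")  # 42 ± 21.5
```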

## Capture probability

Bank vole, Myodes glareolus, in a capture-release small mammal population study for London Wildlife Trust at Gunnersbury Triangle local nature reserve

The capture probability refers to the probability of detecting an individual animal or person of interest,[12] and has been used in both ecology and epidemiology for detecting animal or human diseases,[13] respectively.

The capture probability is often defined as a two-variable model, in which f is defined as the fraction of a finite resource devoted to detecting the animal or person of interest from a high risk sector of an animal or human population, and q is the frequency of time that the problem (e.g., an animal disease) occurs in the high-risk versus the low-risk sector.[14] For example, an application of the model in the 1920s was to detect typhoid carriers in London, who were either arriving from zones with high rates of tuberculosis (probability q that a passenger with the disease came from such an area, where q>0.5), or low rates (probability 1-q).[15] It was posited that only 5 out of every 100 of the travelers could be detected, and 10 out of every 100 were from the high risk area. Then the capture probability P was defined as:

${\displaystyle P={\frac {5}{10}}fq+{\frac {5}{90}}(1-f)(1-q),}$

where the first term refers to the probability of detection (capture probability) in a high risk zone, and the latter term refers to the probability of detection in a low risk zone. Importantly, the formula can be re-written as a linear equation in terms of f:

${\displaystyle P=\left({\frac {5}{10}}q-{\frac {5}{90}}(1-q)\right)f+{\frac {5}{90}}(1-q).}$

Because this is a linear function of f, it follows that for values of q for which the slope of this line (the coefficient of f) is positive, all of the detection resource should be devoted to the high-risk population (f should be set to 1 to maximize the capture probability), whereas for values of q for which the slope is negative, all of the detection resource should be devoted to the low-risk population (f should be set to 0). We can solve for the values of q for which the slope is positive, to determine when f should be set to 1 to maximize the capture probability:

${\displaystyle \left({\frac {5}{10}}q-{\frac {5}{90}}(1-q)\right)>0,}$

which simplifies to:

${\displaystyle q>{\frac {1}{10}}.}$

This is an example of linear optimization.[14] In more complex cases, where more than one resource f is devoted to more than two areas, multivariate optimization is often used, through the simplex algorithm or its derivatives.
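The threshold can be verified with exact fractions. The sketch below encodes the capture probability formula from this section and picks the allocation f from the sign of the slope; the function and constant names are assumptions made for this example.

```python
from fractions import Fraction

HIGH = Fraction(5, 10)  # detection rate in the high-risk sector
LOW = Fraction(5, 90)   # detection rate in the low-risk sector


def capture_probability(f: Fraction, q: Fraction) -> Fraction:
    """P = (5/10) f q + (5/90) (1 - f) (1 - q)."""
    return HIGH * f * q + LOW * (1 - f) * (1 - q)


def best_allocation(q: Fraction) -> int:
    """Return the f in {0, 1} that maximizes P for a given q.

    At q = 1/10 the slope is exactly zero, so every f gives the same P;
    this sketch returns 0 in that case.
    """
    slope = HIGH * q - LOW * (1 - q)
    return 1 if slope > 0 else 0


for q in (Fraction(1, 20), Fraction(1, 10), Fraction(1, 2)):
    f = best_allocation(q)
    print(q, f, capture_probability(Fraction(f), q))
```

Using `Fraction` instead of floating-point numbers keeps the comparison at the boundary q = 1/10 exact.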

## More than two visits

The literature on the analysis of capture-recapture studies has blossomed since the early 1990s. There are very elaborate statistical models available for the analysis of these experiments.[16] A simple model which easily accommodates the three-source, or three-visit, study is a Poisson regression model. Sophisticated mark-recapture models can be fit with several packages for the open-source R programming language. These include "Spatially Explicit Capture-Recapture (secr)",[17] "Loglinear Models for Capture-Recapture Experiments (Rcapture)",[18] and "Mark-Recapture Distance Sampling (mrds)".[19] Such models can also be fit with specialized programs such as MARK[20] or M-SURGE.[21]

Other related methods which are often used include the Jolly–Seber model (used in open populations and for multiple census estimates) and Schnabel estimators[22] (an expansion of the Lincoln–Petersen method for closed populations). These are described in detail by Sutherland.[23]
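This resource does not spell out the Schnabel method. As a rough sketch, one commonly cited form sums over all visits: the estimate is the sum of (catch on visit t) × (marked animals at large before visit t), divided by the total number of recaptures. The Python example below implements that form under the assumption that every unmarked animal caught is marked and released; the three-visit data are hypothetical.

```python
def schnabel_estimate(catches, recaptures):
    """Schnabel-type estimate sum(C_t * M_t) / sum(R_t) for a closed population.

    catches[t]    : animals caught on visit t (C_t)
    recaptures[t] : already-marked animals among them (R_t)
    Every unmarked animal caught is assumed to be marked and released.
    """
    marked_at_large = 0  # M_t: marked animals in the population before visit t
    numerator = 0
    total_recaptures = 0
    for caught, recaptured in zip(catches, recaptures):
        numerator += caught * marked_at_large
        total_recaptures += recaptured
        marked_at_large += caught - recaptured  # newly marked animals
    if total_recaptures == 0:
        raise ValueError("no recaptures: estimate undefined")
    return numerator / total_recaptures


# With two visits the formula reduces to Lincoln-Petersen (turtle example).
print(schnabel_estimate([10, 15], [0, 5]))  # 30.0
# Hypothetical three-visit study.
print(schnabel_estimate([10, 15, 12], [0, 5, 6]))
```

For real analyses, the dedicated software mentioned above is the safer choice, since it also provides uncertainty estimates and model checks.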

## Integrated approaches

Modelling mark-recapture data is trending towards a more integrative approach,[24] which combines mark-recapture data with population dynamics models and other types of data. The integrated approach is more computationally demanding, but extracts more information from the data, improving parameter and uncertainty estimates.[25]

## References

1. Krebs, Charles J. (2009). Ecology (6th ed.). p. 119. ISBN 978-0-321-50743-3.
2. Chao, A.; Tsay, P. K.; Lin, S. H.; Shau, W. Y.; Chao, D. Y. (2001). "The applications of capture-recapture models to epidemiological data". Statistics in Medicine 20 (20): 3123–3157. doi:10.1002/sim.996. PMID 11590637.
3. Allen (2019). "Estimating the Number of People Who Inject Drugs in A Rural County in Appalachia". American Journal of Public Health 109 (3): 445–450. doi:10.2105/AJPH.2018.304873. PMID 30676803. PMC 6366498. //www.ncbi.nlm.nih.gov/pmc/articles/PMC6366498/.
4. Southwood, T. R. E.; Henderson, P. (2000). Ecological Methods (3rd ed.). Oxford: Blackwell Science.
5. https://www.merriam-webster.com/dictionary/recapture
6. Seber, G. A. F. (1982). The Estimation of Animal Abundance and Related Parameters. Caldwell, New Jersey: Blackburn Press. ISBN 1-930665-55-5.
7. https://www.sciencedirect.com/topics/computer-science/closed-population
8. Charles J. Krebs (1999). Ecological Methodology (2nd ed.). ISBN 9780321021731.
9. Chapman, D.G. (1951). Some properties of the hypergeometric distribution with applications to zoological sample censuses.
10. Sadinle, Mauricio (2009-10-01). "Transformed Logit Confidence Intervals for Small Populations in Single Capture–Recapture Estimation". Communications in Statistics - Simulation and Computation 38 (9): 1909–1924. doi:10.1080/03610910903168595. ISSN 0361-0918.
11. Drenner, Ray (1978). "Capture probability: the role of zooplankter escape in the selective feeding of planktivorous fish". Journal of the Fisheries Board of Canada 35 (10): 1370–1373. doi:10.1139/f78-215.
12. MacKenzie, Darryl (2002). "How should detection probability be incorporated into estimates of relative abundance?". Ecology 83 (9): 2387–2393. doi:10.1890/0012-9658(2002)083[2387:hsdpbi]2.0.co;2.
13. Bolker, Benjamin (2008). Ecological Models and Data in R. Princeton University Press. ISBN 9781400840908.
14. Unknown (1921). "The Health of London". Hosp Health Rev 1: 71–2.
15. McCrea, R.S. and Morgan, B.J.T. (2014) "Analysis of capture-recapture data". Chapman and Hall/CRC Press. Retrieved 19 Nov 2014.
16. Efford, Murray (2016-09-02). "Spatially Explicit Capture-Recapture (secr)". Comprehensive R Archive Network (CRAN). Retrieved 2016-09-02.
17. Rivest, Louis-Paul; Baillargeon, Sophie (2014-09-01). "Loglinear Models for Capture-Recapture Experiments (Rcapture)". Comprehensive R Archive Network (CRAN). Retrieved 2016-09-02.
18. Laake, Jeff; Borchers, David; Thomas, Len; Miller, David; Bishop, Jon (2015-08-17). "Mark-Recapture Distance Sampling (mrds)". Comprehensive R Archive Network (CRAN).
19. "Program MARK". Archived from the original on 21 February 2006. Retrieved 29 May 2013.
20. "Logiciels". Archived from the original on 2009-07-24.
21. Schnabel, Z. E. (1938). "The Estimation of the Total Fish Population of a Lake". American Mathematical Monthly 45 (6): 348–352. doi:10.2307/2304025.
22. William J. Sutherland, ed. (1996). Ecological Census Techniques: A Handbook. Cambridge University Press. ISBN 0-521-47815-4.
23. Maunder M.N. (2003) Paradigm shifts in fisheries stock assessment: from integrated analysis to Bayesian analysis and back again. Natural Resource Modeling 16:465–475
24. Maunder, M.N. (2001) Integrated Tagging and Catch-at-Age Analysis (ITCAAN). In Spatial Processes and Management of Fish Populations, edited by G.H. Kruse, N. Bez, A. Booth, M.W. Dorn, S. Hills, R.N. Lipcius, D. Pelletier, C. Roy, S.J. Smith, and D. Witherell, Alaska Sea Grant College Program Report No. AK-SG-01-02, University of Alaska Fairbanks, pp. 123–146.
• Besbeas, P; Freeman, S. N.; Morgan, B. J. T.; Catchpole, E. A. (2002). "Integrating mark-recapture-recovery and census data to estimate animal abundance and demographic parameters.". Biometrics 58 (3): 540–547. doi:10.1111/j.0006-341X.2002.00540.x. PMID 12229988.
• Martin-Löf, P. (1961). "Mortality rate calculations on ringed birds with special reference to the Dunlin Calidris alpina". Arkiv för Zoologi (Zoology Files), Kungliga Svenska Vetenskapsakademien (The Royal Swedish Academy of Sciences) Serie 2 Band 13 (21).
• Maunder, M. N. (2004). "Population viability analysis, based on combining integrated, Bayesian, and hierarchical analyses". Acta Oecologica 26 (2): 85–94. doi:10.1016/j.actao.2003.11.008.
• Phillips, C. A.; M. J. Dreslik; J. R. Johnson; J. E. Petzing (2001). "Application of population estimation to pond breeding salamanders". Transactions of the Illinois Academy of Science 94 (2): 111–118.
• Royle, J. A.; R. M. Dorazio (2008). Hierarchical Modeling and Inference in Ecology. Elsevier. ISBN 978-1-930665-55-2.
• Seber, G.A.F. (2002). The Estimation of Animal Abundance and Related Parameters. Caldwell, New Jersey: Blackburn Press. ISBN 1-930665-55-5.
• Schaub, M; Gimenez, O.; Sierro, A.; Arlettaz, R (2007). "Use of Integrated Modeling to Enhance Estimates of Population Dynamics Obtained from Limited Data". Conservation Biology 21 (4): 945–955. doi:10.1111/j.1523-1739.2007.00743.x. PMID 17650245.
• Williams, B. K.; J. D. Nichols; M. J. Conroy (2002). Analysis and Management of Animal Populations. San Diego, California: Academic Press. ISBN 0-12-754406-2.
• Chao, A; Tsay, P. K.; Lin, S. H.; Shau, W. Y.; Chao, D. Y. (2001). "The applications of capture-recapture models to epidemiological data". Statistics in Medicine 20 (20): 3123–3157. doi:10.1002/sim.996. PMID 11590637.