data("book.tee.data", package = "mrds")
Double Observer Surveys: Post 1
Full Independence and Equal Sightability
Double-observer line-transect surveys estimate the abundance of targets (e.g., targets could be birds, bears, bats, bees, etc.) in an area. Double-observer line-transect surveys involve two observers simultaneously searching for targets as they traverse defined paths within the area. Picture one observer walking right behind another, or one observer in the front seat of an aircraft, with another in the back. Both observers independently record target observations. A single detection distance, from observers to the target, is recorded when targets are spotted.
Why the second observer? Wouldn’t a single observer suffice? A single observer would suffice, if we make an assumption, so there is an analysis cost. A standard assumption of single-observer line-transect surveys is that all targets on the path are seen. This assumption is necessary to estimate abundance and is known as the “g(0) = 1” assumption. In other word, the detection function at 0, “g(0)”, is 100%.
The entire point of including two simultaneous observers searching the same areas for the same targets is that it allows analysts to compare the rates of targets seen and not seen by each observer. These rates in turn estimate the probability of detecting targets on the path. In other words, double-observer surveys relax the “g(0) = 1” assumption and do not assume perfect detection on the path. “g(0)” can be estimated from the collected data.
In this post, I demonstrate estimation of the simplest double-observer method using R. The simplest double-observer analysis makes the following two assumptions:
- Observers are fully independent. Detection by one observer does not influence detection by the other, regardless of distance. At all distances, probability of detection at least one observer the sum of individual observer detection probabilities minus the probability of detection by both observers.
- Distance functions are identical. Detection at all positive distances is the same between observers. Put another way, the decline in detection as distance increases is identical among observers.
I show analysis of this simple case using R package mrds
, and then I show equivalent computations in base R. Showing both will make the computations clear. Among observers, I assume observer 2 is somehow distracted, for example with data collection or by having to fly the aircraft, and is not expected to observe everything. Observer 1 is the primary observer and is expected to be the better observer. Observer 2’s overall detection probability is expected to be less than that of Observer 1.
I use the ‘package’::‘function’ R syntax for function calls when the function resides in a non-base package. This syntax makes package membership explicit and, in my opinion, is more helpful than attaching package because attaching packages obscures package memebership of function.
Nevertheless, to re-run the code in this post, you will need the following R packages:
mrds
,knitr
, anddplyr
Example Data
I analyze the golf tee data included in the mrds
package, primarily because it is convenient. The following code copies the data set from mrds
into the current working directory.
The golf tee data comes packaged as a 4 element list, each pre-formatted for input to mrds
. The primary data frame containing detections is in the book.tee.dataframe
element.
::kable(book.tee.data$book.tee.dataframe[1:6,]) knitr
object | observer | detected | distance | size | sex | exposure | |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 2.68 | 2 | 1 | 1 |
21 | 1 | 2 | 0 | 2.68 | 2 | 1 | 1 |
2 | 2 | 1 | 1 | 3.33 | 2 | 1 | 0 |
22 | 2 | 2 | 0 | 3.33 | 2 | 1 | 0 |
3 | 3 | 1 | 1 | 0.34 | 1 | 0 | 0 |
23 | 3 | 2 | 0 | 0.34 | 1 | 0 | 0 |
The book.tee.dataframe
data frame contains two rows per detected target (object
) and columns indicating which observer detected the object (observer
and detected
). The common detection distance is in column distance
. Column size
is the number of golf tees in the detected group. The remaining columns (sex
and exposure
) are covariates. sex
is 0 for green golf tees, 1 for yellow tees. exposure
reports a categorical variable based on height of the tee above ground. exposure
equals 0 for ‘low’, short, tees, and 1 for ‘high’, tall, tees. I do not use covariates in this analysis, but will in subsequent posts.
Abundance in Covered Area
Using mrds
The following code uses mrds::ddf
to estimate abundance in the surveyed area under assumptions of this case, i.e., fully-independent observers and identical distance functions. An analysis on of this data on Distancesample.org truncates distances at 4 [m]; but, there are no distances greater than 4 [m] in the data set, so truncation is not necessary and the meta.data = list(width = T)
line is not necessary (but I’ll leave it because truncation is usually necessary). The key mrds
parameter is method='trial.fi'
which specifies observer 1 as the primary observer and observer 2 as the secondary. This model views observer 2 as setting up detection “trials” for observer 1. The other parameters specify the model and the data set.
This is the call to ddf
of the mrds
package.
<- mrds::ddf(method='trial.fi'
mrds.fi.mr mrmodel=~glm(~ 1)
, data=book.tee.data$book.tee.dataframe
, meta.data=list(width=4)
,
)<-summary(mrds.fi.mr)
mrds.fi.mr.sum
mrds.fi.mr.sum>
> Summary for trial.fi object
> Number of observations : 162
> Number seen by primary : 124
> Number seen by secondary (trials) : 142
> Number seen by both (detected trials): 104
> AIC : 510.766
>
>
> Conditional detection function parameters:
> estimate se
> (Intercept) 1.006805 0.189555
>
> Estimate SE CV
> Average p 0.7323944 0.03799899 0.05188324
> Average primary p(0) 0.7323944 0.03715144 0.05072600
> N in covered region 169.3076923 11.64565815 0.06878399
Estimated abundance of golf tee groups in the surveyed area appears on the last line of output in the above box, i.e.,
<- c("N in covered region" = mrds.fi.mr.sum$Nhat
mrds_ests "SE" = mrds.fi.mr.sum$Nhat.se
,
)
mrds_ests> N in covered region SE
> 169.30769 11.64566
The Mark-Recapture Estimator
If observers are fully independent, and probability of detection declines with distance at the same rate for both observers, we only need the mr
(mark-recapture) portion of mrds
(mark-recapture distance-sampling) method. In this case, double-observer distance data can be analyzed as if it came from a two-session closed population mark-recapture data.
The first capture session of two-session closed population mark-recapture studies entails capturing individuals, marking them, and releasing them. Later, these studies conduct a second capture session wherein they record the number of marked individuals captured, as well as the total number of captured individuals in the second sample. Assuming no change in the population (no births, deaths, emigration, or immigration) between capture sessions, we estimate population size as the ratio of the number of animals captured in the second session and the probability of capturing an animal, which we estimate as the number of marked individuals in the second sample divided by the total number of marked individuals in the population.
A double-observer study that assumes equal distance functions and full independence views the secondary observer as the first “occasion” while the primary observer is the second “occasion”. The secondary observer “marks” targets during the first “occasion” while the primary observer “captures” a random sample of marked individuals during the second “occasion”. From one point of view, the secondary observer generates detection trials for the primary observer, hence the name trial.fi
in mrds
, which stands for <trial method>.<fully independent>
.
It is interesting to note that reversing observers, i.e., swapping the primary observer, makes no difference to this analysis. Primary and secondary observers will matter when we consider covariates in the distance function (see a subsequent post on that topic).
To estimate abundance in the surveyed area, we exploit the parallel between the distance sampling situation and mark-recapture analyses, and apply a well known mark-recapture estimator. I will apply the Chapman mark-recapture estimator which is a derivative of the famous Lincoln-Peterson estimator. I start here with the Lincoln-Peterson estimator because its form illuminates the logic of the estimator and motivates the Chapman estimator.
The Lincoln-Peterson estimator equates the ratio of marked individuals in the population just after the first occasion, and the same ratio in the sample taken during occasion two, i.e., \[ \frac{n_1}{N} = \frac{m_2}{n_2}, \] where \(n_1\) is the number of targets seen by the secondary observer, \(n_2\) is the number of targets seen by the primary observer, and \(m_2\) is the number of “marked” targets seen by the primary observer, or the number of targets seen by both the primary and secondary observer. Lincoln-Peterson estimates \(N\) by solving the above equation for \(N\), i.e., \[ \hat{N} = \frac{n_1 n_2}{m_2}. \]
Re-expressing the estimator as, \[ \hat{N} = \frac{n_2}{m_2 / n_1} \] shows that it is the ratio of the number of targets seen by the primary observer (\(n_2\)) and an estimate of the probability of detecting a target in the population (\(m_2 / n_1\)). Horvitz-Thompson type estimators have this general form. Horvitz-Thompson type estimators pervade distance sampling.
It is well known that the Lincoln-Peterson estimator is biased for true population size. To reduce bias, analysts usually apply the so-called Chapman estimator in practice. The Chapman estimator is less biased than the Lincoln-Peterson and its variance estimator is better behaved.
The bias correction of the Chapman estimator simply adds 1 to every count of the Lincoln-Peterson estimator, and subtracts 1 from the result, i.e., \[ \hat{N} = \frac{(n_1+1)(n_2+1)}{(m_2+1)} - 1 \] The Chapman estimator is close to the Lincoln-Peterson estimator when \(m_2\) is a substantial fraction of both \(n_1\) and \(n_2\); but, the two diverge as \(m_2\) decreases.
The estimated standard error of the Chapman estimator is, \[ \hat{se}(N) = \frac{(n_1+1)(n_2+1)(n_1-m_2)(n_2-m_2)}{(m_2+1)^2 (m_2 + 2)}. \]
Using Base R
In this section, I demonstrate base R calculations required to repeat the mrds
analysis. The following code computes \(n_1\), \(n_2\), and \(m_2\). Counts \(n_1\) and \(m_2\) must be calculated from a data frame of observer 1’s detections of observer 2’s targets. In the following code, data frame df_1given2
contains only those targets detected by observer 2 and column detected
informs us whether observer 1 detected each of the targets detected by observer 2. Count \(n_2\) requires the original data set.
# Construct the 'trial' data frame.
# Observer 1's observations of observer 2's targets.
<- book.tee.data$book.tee.dataframe |>
objs_seenBy_2 ::filter(observer == 2 & detected == 1) |>
dplyr::select(object)
dplyr
<- book.tee.data$book.tee.dataframe |>
df_1given2 ::filter(observer == 1) |>
dplyr::right_join(objs_seenBy_2, by = "object")
dplyr
# Counts for the Chapman estimator
<- nrow(df_1given2)
n1 <- book.tee.data$book.tee.dataframe |>
n2 ::filter(observer == 1) |>
dplyr::summarise(n = sum(detected)) |>
dplyr::pull(n)
dplyr<- sum(df_1given2$detected)
m2
c(n1 = n1
n2 = n2
, m2 = m2)
, > n1 n2 m2
> 142 124 104
These counts match output from mrds
(i.e., number seen by primary, secondary, and both). The Lincoln-Peterson estimate of target abundance in the surveyed strip matches the estimate from mrds
exactly, i.e.,
# Lincoln-Peterson estimator ----
<- n1 * n2 / m2
N_lp c("N in covered region (Lincoln-Peterson)" = N_lp)
> N in covered region (Lincoln-Peterson)
> 169.3077
The Chapman estimator is slightly different from the Lincoln-Peterson. The Chapman estimator and it’s variance are,
# Chapman estimator ----
<- ((n1+1) * (n2+1) / (m2+1)) - 1
N_ch <- ( ((n1+1)*(n2+1)*(n1-m2)*(n2-m2)) / ((m2+1)^2 * (m2 + 2)))
se_N_ch <- c("N in covered region" = N_ch
ch_ests "SE" = se_N_ch)
,
ch_ests> N in covered region SE
> 169.23810 11.62452
Finally, confidence intervals constructed from either estimator assume normality and take the standard \(\pm 1.96\) form. For comparison, Table 1 contains a summary of both estimator’s numerical values.
<- dplyr::bind_rows(mrds_ests, ch_ests) |>
fi.ests ::mutate(estimator = c("mrds", "Chapman")) |>
dplyr::select(estimator, dplyr::everything()) |>
dplyr::mutate(lo_95ci = `N in covered region` - 1.96*SE
dplyrhi_95ci = `N in covered region` + 1.96*SE)
, ::kable(fi.ests) knitr
mrds
and base R.
estimator | N in covered region | SE | lo_95ci | hi_95ci |
---|---|---|---|---|
mrds | 169.3077 | 11.64566 | 146.4822 | 192.1332 |
Chapman | 169.2381 | 11.62452 | 146.4540 | 192.0222 |
Abundance in Study Area
Using mrds
Abundance in the larger study area is estimated by assuming density is equal everywhere, and simply applying density in the surveyed area to the entire study area. The reader will see that expansion to study area using mrds
is convenient because the golf tee data are pre-formatted for mrds
; but, I have often been unclear whether I have manipulated data into the format required by mrds
. So, I will attempt to explain the format of the additional information required to estimate study area abundance.
In mrds
, expansion to the study area is accomplished by a call to the dht
function. dht
requires three data frames, in addition to the object returned by ddf
that includes the detection data frame.
- Region Size: A data frame containing one line per sub-region in the study area for which estimates are desired. Each line should contain the region’s identifier in a column named
Region.Label
and the region’s size in a column namedArea
. The region’s area should be in square units of the distance measurements and transect lengths. If distances were collected in meters, area of each region should be in square meters. If distances are in feet, area should be square feet, and so on. - Transect Lengths: A data frame containing one line per transect.
mrds
documentation calls this data frame the ‘samples’ data set because in other studies the rows are points, and not transects. For simplicity, I will call ‘samples’ transects. Each line of this data frame should contain the transect’s identifier in a column namedSample.Label
, the study region identifier containing the transect in a column namedRegion.Label
, and the length of each transect in a column namedEffort
. Transect lengths must be reported in the same measurement units as distances, and in the square root of area units. - Observations: A data frame containing one line per detection.
mrds
documentation calls this the ‘obs’ data set. Each line should contain the detection target’s identifier in a column namedobject
, the transect identifier on which the target was detected in a column namedSample.Label
, and the study region identifier in which the transect resides in a column namedRegion.Label
.
mrds
is inflexible in its column names. The columns of each data frame must be named exactly as stated above.
mrds
does not check compatibility of measurement units. In real studies, analysts must carefully convert, check, and input the correct distance units into all mrds
routines.
The golf tee region data frame appears in Table 2. The golf tee transect data frame appears in Table 3. The first six rows of the golf tee observation data frame appear in Table 4.
Region.Label | Area |
---|---|
1 | 1040 |
2 | 640 |
Sample.Label | Region.Label | Effort |
---|---|---|
1 | 1 | 10 |
2 | 1 | 30 |
3 | 1 | 30 |
4 | 1 | 27 |
5 | 1 | 21 |
6 | 1 | 12 |
7 | 2 | 23 |
8 | 2 | 23 |
9 | 2 | 15 |
10 | 2 | 12 |
11 | 2 | 7 |
object | Region.Label | Sample.Label | |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 2 | 1 | 1 |
3 | 3 | 1 | 1 |
12 | 4 | 1 | 2 |
13 | 25 | 1 | 2 |
14 | 26 | 1 | 2 |
In the golf tee example, observers walked 11 transects of varying length that totaled 210 [m]. I and others treat the sampled area as covering the entire study area. In most real studies this will not be true. Here, maximum strip width is 4 [m] and study area size is (210 [m])(2(4 [m])) = 1680 [m^2]. The total study area was subdivided into two arbitrary regions of size 1040 [m^2] and 640 [m^2] each. I will ignore these two sub-regions and report abundance over the entire area. The only information necessary to estimate abundance of golf tees from abundance of golf tee groups (see above) is group size.
Given correctly named columns, correct rows, and correct units, exansion to study area is a straightforward call to mrds::dht
.
<- mrds::dht(
mrds_saEsts model = mrds.fi.mr
region.table = book.tee.data$book.tee.region
, sample.table = book.tee.data$book.tee.samples
, obs.table = book.tee.data$book.tee.obs)
,
mrds_saEsts> Abundance and density estimates from distance sampling
> Variance : R2, N/L
>
> Summary statistics
>
> Region Area CoveredArea Effort n k ER se.ER cv.ER
> 1 1 1040 1040 130 72 6 0.5538462 0.02926903 0.05284685
> 2 2 640 640 80 52 5 0.6500000 0.08292740 0.12758061
> 3 Total 1680 1680 210 124 11 0.5904762 0.03641856 0.06167659
>
> Summary for clusters
>
> Abundance:
> Region Estimate se cv lcl ucl df
> 1 1 98.30769 7.201276 0.07325242 84.29420 114.6509 17.842176
> 2 2 71.00000 9.747951 0.13729508 50.31898 100.1809 5.360291
> 3 Total 169.30769 13.520391 0.07985692 143.18017 200.2030 17.840640
>
> Density:
> Region Estimate se cv lcl ucl df
> 1 1 0.09452663 0.006924304 0.07325242 0.08105211 0.1102412 17.842176
> 2 2 0.11093750 0.015231173 0.13729508 0.07862341 0.1565326 5.360291
> 3 Total 0.10077839 0.008047852 0.07985692 0.08522629 0.1191684 17.840640
>
> Summary for individuals
>
> Abundance:
> Region Estimate se cv lcl ucl df
> 1 1 312.6731 26.07225 0.08338501 260.9894 374.5916 12.426140
> 2 2 207.5385 37.99597 0.18307918 128.9206 334.0989 4.691818
> 3 Total 520.2115 49.57222 0.09529243 423.2323 639.4125 12.457770
>
> Density:
> Region Estimate se cv lcl ucl df
> 1 1 0.3006472 0.02506947 0.08338501 0.2509514 0.3601843 12.426140
> 2 2 0.3242788 0.05936870 0.18307918 0.2014384 0.5220295 4.691818
> 3 Total 0.3096497 0.02950727 0.09529243 0.2519240 0.3806027 12.457770
>
> Expected cluster size
> Region Expected.S se.Expected.S cv.Expected.S
> 1 1 3.180556 0.2114629 0.06648615
> 2 2 2.923077 0.1750319 0.05987935
> 3 Total 3.072581 0.1391365 0.04528327
mrds::dht
estimates the abundance of golf tee clusters to be 169.31 groups (“Total” line, “Abundance” table in “Summary for clusters” section of output). Again, normally this number will be different from ‘N in covered area’ because study area size will be greater than the surveyed area. mrds::dht
estimates the abundance of all golf tees in the 1680 [m^2] study area to be 520.21 (“Total” line, “Abundance” table in “Summary for individuals” section of output), which differs from ‘N in the covered area’ by average group size.
Using Base R
Computation of tee abundance in the study area is straightforward. I compute average group size over targets detected by observer 1. I estimate tee abundance in the study area by multiplying abundance in the covered area by average group size.
<- book.tee.data$book.tee.dataframe |>
avgGroupSize ::filter(observer == 1 & detected == 1) |>
dplyr::summarise(mean = mean(size)
dplyrse = sd(size) / sqrt(dplyr::n()))
,
avgGroupSize> mean se
> 1 3.072581 0.1537082
<- data.frame(t(ch_ests)) |>
ch_saEst ::mutate(est = N.in.covered.region * avgGroupSize$mean ) |>
dplyr::pull(est)
dplyr
ch_saEst> [1] 519.9977
This abundance estimate, 520, differs from abundance estimated by mrds
, 520.21, only because I used the Chapman estimate of abundance in the covered area.
Estimation of the variance is involves an approximation for the variance of the product of two random variables. The approximation I use here for the variance of a product is one of the difference between the mrds
analysis and the base R calculations. The following code computes the standard error of abundance based on the Chapman estimator and this approximation.
<- data.frame(t(ch_ests)) |>
ch_saEst_se ::bind_cols(avgGroupSize) |>
dplyr::mutate(est = sqrt(N.in.covered.region^2 * se^2 + mean^2 * SE^2 + se^2*SE^2) ) |>
dplyr::pull(est)
dplyr
ch_saEst_se> [1] 44.22225
<- data.frame(mrds_saEsts$individuals$N) |>
final ::filter(Label == "Total") |>
dplyr::select(Estimate, se, lcl, ucl) |>
dplyr::bind_rows(data.frame(Estimate = ch_saEst
dplyrse = ch_saEst_se
, lcl = ch_saEst - 1.95*ch_saEst_se
, ucl = ch_saEst + 1.95*ch_saEst_se)) |>
, ::mutate(Estimator = c("mrds", "Chapman")) |>
dplyr::select(Estimator, dplyr::everything())
dplyr
::kable(final) knitr
Estimator | Estimate | se | lcl | ucl |
---|---|---|---|---|
mrds | 520.2115 | 49.57222 | 423.2323 | 639.4125 |
Chapman | 519.9977 | 44.22225 | 433.7643 | 606.2311 |
Final Comments
I have shown analysis of a simple distance sampling situation wherein observers are fully independent, their distance functions are identical, and one observer is primary while the other is secondary. In this case, the estimate of abundance reduces to mark-recapture estimates. I have shown calculations using package mrds
and base R. The differences between mrds
and the base R computations I show are:
- MR estimator: In base R, I used the Chapman mark-recapture estimator of abundance instead of the Lincoln-Peterson estimator used by
mrds
. The Chapman estimator is known to be less biased than the Lincoln-Peterson; but,mrds
uses Lincoln-Peterson because it fits directly into the package’s general Horvitz-Thompson estimation approach. - Variance of MR estimator: I used the closed the closed form variance estimator for the Chapman abundance estimate. The
mrds
estimate of standard error of abundance in the surveyed area was very close to the closed form estimate, but assumabley is based on the Horvitz-Thompson variance estimator. That said, the standard error estimated bymrds::ddf
andmrds::dht
differ, and this is a difference I do not yet understand. Standard error reported bymrds::ddf
for N in the covered area was 11.6457 while standard error for the same quantity reported bymrds::dht
was 13.5204. It is interesting to note that this is not an effect of the two sub-regions. When I eliminate sub-regions,mrds::dht
still increase its estimate of standard error relative tomrds::ddf
. In a subsequent post, I will dive into the H-T variance estimator and hopefully figure this out. - Variance of Study Area Abundance: I used the variance of a product formula to compute standard error of the study area’s total abundance. I assume
mrds::dht
uses one of the H-T variance estimators to arrive at a slightly higher value for standard error. - Confidence Interval on Abundance: I used the standard normal assumption and computed 95% confidence intervals using the \(\pm 1.96(se)\) method.
mrds::dht
apparently uses a variance estimator that yields an asymmetric confidence interval.
In conclusion, the simple mark-recapture point estimates of abundance from double-observer data are clear. Some mystery remains, for me, regarding the standard error estimates and confidence intervals on total study area abundance. I will figure this mystery out and post about it later.