Merrill D, Selvin S and Close ER. 1994.
Use of density equalizing map projections (DEMP) in the analysis of a
reported childhood cancer cluster in McFarland, California.
Presented at the Second Conference on Statistics and Computing in
Disease Clustering, Vancouver, B.C., Canada, July 21-22, 1994.
2nd draft version, 7/19/94
Abstract
All three map figures (1A, 2A, and 3A) have some visible "glitches" due to a
bug in the plotting routine. Those glitches will be fixed in the near future.
The excursion of one community (Taft) outside the boundary is not a serious
problem. Additional points need to be included in the western boundary
of Kern County; then the boundary will "bend" as required. Similar
problems, less visible, also occur elsewhere.
Figure 1A is the original map of the 401 childhood cancer cases (excluding one
with unknown location). Fig. 1A corresponds exactly to page 36 of
Interim Report #2 by Reynolds et al, California Department of Health Services.
(Ref.1)
The solid lines are county boundaries, and the dotted lines are tract
boundaries plus extra subdivisions of the 262 tracts into 1121 triangular
subregions. The population of each tract is known, and all triangular
subregions within a given tract are assumed to have the same population DENSITY.
For this preliminary test we used the 1980 Census count of all persons age 0
through 17. Additional work is required to estimate the person-years for
ages 0 through 15, for the time period 1980-88. Unfortunately the
population denominators calculated in Ref. 1 were lost due to a computer
mishap.
Ten iterations were required to get the density equalized map of Figure 3A.
Five hours were required on a Sun SPARC 10 workstation. The program is
written in Fortran and (for this problem) requires 11MB of memory.
The technical details are described in a draft report by Elon Close
(Ref. 2)
Figure 2A shows the result after five iterations. Figure 3A shows the final
result after ten iterations. With a little practice one can pick out the
corresponding regions in the three figures.
-
Figures 1B, 2B, and 3B correspond to maps in figures 1A, 2A, and 3A
respectively. In Figs 1B-3B, each triangular region in the map has been
plotted with
-
x = target area based on population
-
y = present area from Figs. 1A-3A.
-
In Figure 1B (the initial map) there are many points with y << x. These are
urban areas which need to be greatly magnified. After five iterations (Fig. 2B)
many of these small y values have increased considerably. After 10
iterations (Fig. 3B) most of the areas lie along the 45 degree line with
y approximately equal to x. Although some outliers remain,
Figure 3B indicates that the map in Figure 3A is approximately equalized.
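The visual check in Figures 1B-3B could be complemented by a numerical summary of how far each triangle's present area is from its target. The following is a minimal Python sketch, not part of the Fortran program; the function name, the factor-of-two tolerance, and the sample areas are illustrative assumptions.

```python
import math

def equalization_summary(target_areas, present_areas, tolerance=2.0):
    """Compare present areas (y) with target areas (x), triangle by triangle.

    `tolerance` is an illustrative cutoff: a triangle counts as equalized
    when y/x lies between 1/tolerance and tolerance.
    """
    log_ratios = [math.log10(y / x) for x, y in zip(target_areas, present_areas)]
    limit = math.log10(tolerance)
    n_within = sum(1 for r in log_ratios if abs(r) <= limit)
    return {
        "fraction_within": n_within / len(log_ratios),
        # Largest departure from the 45 degree line, in decades.
        "worst_log10_ratio": max(log_ratios, key=abs),
    }

# Made-up areas for four triangles (not data from this study).
x = [10.0, 5.0, 2.0, 8.0]   # target areas based on population
y = [9.0, 5.5, 0.2, 8.0]    # present areas on the current map
summary = equalization_summary(x, y)
```

A strongly negative worst ratio flags a triangle with y << x, i.e. one that still needs to be magnified.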
In order to confirm the validity of the resulting map in Figure 3A,
we still need to:
-
Estimate the correct population at risk (ages 0 through 14, years
1980 through 1988) and estimate the errors in those estimates.
-
Adjust for geographic variation in risk factors. For example,
if the age distribution or race distribution of the population is not
geographically uniform, the map must be equalized not on total
population at risk, but instead on the expected numbers of cases in each
tract, calculated from age-race-specific rates for the 4-county region
as a whole. This is exactly analogous to indirect age adjustment
in an SMR calculation.
-
Demonstrate that within statistical variation, the SHAPE of the
DEMP map is not significant for the analyses suggested below.
The analysis of randomly generated "pseudo-cases" should convince
unbelievers.
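The adjustment described in the second item above amounts to replacing each tract's population by its expected number of cases. A minimal Python sketch follows, assuming the population and the regional case counts are available by age-race stratum; the stratum labels and all numbers are hypothetical.

```python
def expected_cases(tract_strata_pop, regional_rates):
    """Expected cases in one tract: sum over strata of population x regional rate."""
    return sum(pop * regional_rates[stratum]
               for stratum, pop in tract_strata_pop.items())

# Hypothetical strata (age group, race group) for the region as a whole.
regional_cases = {("0-4", "group1"): 30, ("5-14", "group1"): 20}
regional_pop = {("0-4", "group1"): 100000, ("5-14", "group1"): 200000}
regional_rates = {s: regional_cases[s] / regional_pop[s] for s in regional_cases}

# Hypothetical population of a single tract in the same strata.
tract_pop = {("0-4", "group1"): 1000, ("5-14", "group1"): 3000}

# This value, rather than the tract's total population, would be the
# equalization target for the tract.
target = expected_cases(tract_pop, regional_rates)
```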
Preliminary observations: PROVIDED that the three steps listed above do not
alter the final result, it appears that:
-
The few small clusters observed (e.g. McFarland) are visible but
may not be statistically significant. This agrees with the
conclusions of your report.
-
Rates throughout Fresno (city) are approximately uniform.
-
Rates in the vicinity of the Fresno/Tulare county boundary
(southeast of Fresno city) are somewhat lower than elsewhere.
-
Rates in the western portion of Bakersfield (city) are slightly higher
than those in the eastern part of the city.
Starting with a DEMP map like that of Figure 3A, simple statistical analyses
can be performed that would not be possible otherwise. Three examples are
suggested here.
-
To test for a cluster near a selected location:
-
Count the number of cases within a selected region which includes
the point. The shape of the region is immaterial but it would normally
be a circle in Figure 1A. The area of the corresponding region in
Figure 3A is exactly proportional to the population at risk and
hence to the expected number of cases. Under the null hypothesis of
equal risk, the number of cases has a Poisson distribution with
parameter lambda given by the area in Figure 3A. The p-value of the
observed number of cases is easily calculated.
-
To test for uniformity of rates throughout the region (one method):
-
To avoid edge effects, exclude from the analysis all regions within a
fixed distance (say 20 "kilometers"*) of the boundary in Figure 3A.
For each case in the remaining region, measure the distance to
the nearest neighbor. The theoretical nearest neighbor distribution
is exactly known, along with its mean and variance. Compare the
observed distribution (or just the mean and variance) of nearest
neighbor distances with the theoretical distribution.
-
To test for uniformity of rates throughout the region (another method):
-
To avoid edge effects, again exclude from the analysis all regions
within fixed distance (say 20 "km"*) of the boundary in Figure 3A.
Construct an imaginary circle with radius 20 "km" and move it
continuously over the interior analysis region. Everywhere in the
region, count the number of cases that fall within the circle.
The number of such cases has a Poisson distribution, with parameter
lambda given by the size of the circle. A p-value can be exactly
calculated at every point in the study area. On Figure 3A shade red
the regions that have a p-value <.025 (significantly high) and shade
blue those that have a p-value >.975 (significantly low).
Under the null hypothesis of equal risk, Figure 3A should have about
2.5% of its area shaded red and 2.5% of its area shaded blue. If
rates are not geographically uniform, the shaded areas of Figure 3A
will exceed those values. The circle size can be specified before
the analysis begins, depending on whether one wants to see large-area
or small-area effects.
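The Poisson calculation used in the first and third examples above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are invented here, the p-value is taken as the probability of observing at least the observed count, the expected count is assumed to be positive, and the map area and counts are made up.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam), lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_upper_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

def expected_count(region_area, total_area, total_cases):
    """Under equal risk, expected cases are proportional to area on the DEMP map."""
    return total_cases * region_area / total_area

def shade(observed, lam):
    """Color for one circle position, using the thresholds given in the text."""
    p = poisson_upper_tail(observed, lam)
    if p < 0.025:
        return "red"    # significantly high
    if p > 0.975:
        return "blue"   # significantly low
    return "none"

# Made-up example: 401 cases, a circle of radius 20 on a map of total area 40000.
lam = expected_count(math.pi * 20.0 ** 2, 40000.0, 401)
colors = [shade(n, lam) for n in (3, 12, 25)]
```

For the single-location test, `poisson_upper_tail(observed, lam)` is the p-value itself; for the moving circle, `shade` would be evaluated at each position of the circle centre.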
All three of the methods above can be validated by randomly generating cases
under the null hypothesis as follows. Randomly generate 401 cases; assign
each case to one of the 262 tracts with probability proportional to the tract
population. Within the selected tract, plot the case at random (in Figure 1A).
Transform the randomly generated cases along with the real ones, to Figure 3A.
Analyze the real and random cases with exactly the same techniques.
To roughly assess the statistical significance of any analysis, analyze 20
independent random samples of 401 cases each. If the analysis of the real
sample yields a value more extreme than any of the 20 random samples, the
observed result is significant at about the 5% level (a probability of 1/21
under the null hypothesis). In the DEMP
analysis all the computation effort is in producing the DEMP map. Only
a slight increase in computing is required to transform 21 samples instead of
one.
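The random-case procedure and the comparison against 20 random samples can be sketched as follows in Python. The names are illustrative, the tract populations are invented, and placing a case uniformly within one triangle is an assumption consistent with the uniform density assumed within each tract; the transformation to Figure 3A is not shown.

```python
import random

def sample_tracts(tract_pops, n_cases, rng):
    """Assign each pseudo-case to a tract with probability proportional to population."""
    tract_ids = list(tract_pops)
    weights = [tract_pops[t] for t in tract_ids]
    return rng.choices(tract_ids, weights=weights, k=n_cases)

def random_point_in_triangle(a, b, c, rng):
    """Uniform random point in triangle abc (points outside are reflected back in)."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return (a[0] + u * (b[0] - a[0]) + v * (c[0] - a[0]),
            a[1] + u * (b[1] - a[1]) + v * (c[1] - a[1]))

def monte_carlo_p(real_stat, random_stats):
    """Rank-based p-value: share of all samples at least as extreme as the real one."""
    n_extreme = sum(1 for s in random_stats if s >= real_stat)
    return (n_extreme + 1) / (len(random_stats) + 1)

rng = random.Random(0)                                 # fixed seed for repeatability
tract_pops = {"A": 5000, "B": 15000, "C": 80000}       # invented populations
assigned = sample_tracts(tract_pops, 401, rng)
point = random_point_in_triangle((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), rng)

# If the real statistic exceeds all 20 random ones, the p-value is 1/21.
p = monte_carlo_p(9.0, [float(i) / 4.0 for i in range(20)])
```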
Computer simulation techniques can also be used to correct the theoretical
distributions for edge effects, permitting one to use all the cases and not
only those far from the boundary of the study area.
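For reference, the theoretical nearest neighbor values used in the second method are standard results for a uniform Poisson point pattern with no edge effects: with density rho (cases per unit area), the mean distance is 1/(2 sqrt(rho)) and the variance is (4 - pi)/(4 pi rho). A minimal Python sketch follows; the function names and the three sample points are illustrative, and on the DEMP map rho would be taken as the number of cases divided by the map area.

```python
import math

def nn_theory(density):
    """Mean and variance of nearest neighbor distance for a uniform Poisson pattern."""
    mean = 0.5 / math.sqrt(density)
    variance = (4.0 - math.pi) / (4.0 * math.pi * density)
    return mean, variance

def nn_distances(points):
    """Distance from each point to its nearest neighbor (brute force, O(n^2))."""
    result = []
    for i, p in enumerate(points):
        result.append(min(math.hypot(p[0] - q[0], p[1] - q[1])
                          for j, q in enumerate(points) if j != i))
    return result

# Three made-up points; real use would pass the transformed case locations.
distances = nn_distances([(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)])
mean_theory, var_theory = nn_theory(1.0)
```

Cases near the boundary have fewer potential neighbors, which is why these formulas apply only to the interior region unless a simulation-based correction is made.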
The DEMP technique, and the second and third statistical analyses suggested
above, are completely independent of assumptions about the geographic study
area. Except for the unavoidable granularity of the census population data,
the analysis does not depend on arbitrary groupings of geopolitical entities
like communities.
Given the availability of Census data and detailed TIGER map files, the DEMP
analysis can be readily automated to obtain statistical power calculations for
proposed studies in any part of the United States.
For a given study area,
-
the analysis may be "legally" modified as necessary depending on
results obtained with randomly generated data, but NOT in response to
results obtained with real data.
-
the number and type of analyses to be executed AND REPORTED must not
be altered after the analysis of real data has begun.
If these two rules are scrupulously observed, the DEMP analysis technique can
be used for automatic computerized surveillance of routinely collected health
data, providing a basis for unbiased detection and assessment of geographic
clusters of disease.
Footnote:
* In Figure 3A, area is proportional to population, so distance is proportional
to the square root of population. Distance in Figure 3A is expressed in
effective "kilometers" by requiring that Figure 1A and Figure 3A have the same
total area.
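The conversion to effective "kilometers" is a single scale factor. A short Python sketch, with invented areas and an illustrative function name:

```python
import math

def km_scale(original_area_km2, demp_area_units):
    """Factor that converts DEMP map units of length to effective "kilometers".

    Chosen so that the DEMP map and the original map have the same total area.
    """
    return math.sqrt(original_area_km2 / demp_area_units)

# Invented values: a 40000 km^2 study area drawn on a DEMP map of area 4 (map units^2).
scale = km_scale(original_area_km2=40000.0, demp_area_units=4.0)
radius_in_map_units = 20.0 / scale   # a 20 "km" circle expressed in map units
```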
References:
1. Interim Report #2 by Reynolds et al, California Department of Health
Services.
2. E. Close, draft report.
Deane Merrill
Information and Computing Sciences Division
Building 50B, Room 2239
Lawrence Berkeley Laboratory
One Cyclotron Road
Berkeley CA 94720
tel: 510-486-5063
fax: 510-486-6363
internet: dwmerrill@lbl.gov
$PUB/docs/parep/vancouver/draft2.html 11/21/94