Merrill D, Selvin S and Close ER. 1994.

Use of density equalizing map projections (DEMP) in the analysis of a reported childhood cancer cluster in McFarland, California.

Presented at the Second Conference on Statistics and Computing in Disease Clustering, Vancouver, B.C., Canada, July 21-22, 1994.

2nd draft version, 7/19/94

Abstract

All three maps (Figures 1A, 2A, and 3A) have some visible "glitches" due to a bug in the plotting routine. Those glitches will be fixed in the near future.

The excursion of one community (Taft) outside the boundary is not a serious problem. Additional points need to be included in the western boundary of Kern County; then the boundary will "bend" as required. Similar problems, less visible, also occur elsewhere.

Figure 1A is the original map of the 401 childhood cancer cases (excluding one with unknown location). Fig. 1A corresponds exactly to page 36 of Interim Report #2 by Reynolds et al., California Department of Health Services (Ref. 1).

The solid lines are county boundaries, and the dotted lines are tract boundaries plus extra subdivisions of the 262 tracts into 1121 triangular subregions. The population of each tract is known, and all triangular subregions within a given tract are assumed to have the same population DENSITY.

For this preliminary test we used the 1980 Census count of all persons ages 0 through 17. Additional work is required to estimate the person-years for ages 0 through 15, for the time period 1980-88. Unfortunately, the population denominators calculated in Ref. 1 were lost due to a computer mishap.

Ten iterations were required to get the density equalized map of Figure 3A. Five hours were required on a Sun SPARC 10 workstation. The program is written in Fortran and (for this problem) requires 11 MB of memory. The technical details are described in a draft report by Elon Close (Ref. 2).

Figure 2A shows the result after five iterations. Figure 3A shows the final result after ten iterations. With a little practice one can pick out the corresponding regions in the three figures.

Figures 1B, 2B, and 3B correspond to maps in figures 1A, 2A, and 3A respectively. In Figs 1B-3B, each triangular region in the map has been plotted with
x = target area based on population
y = present area from Figs. 1A-3A.
In Figure 1B (the initial map) there are many points with y << x. These are urban areas which need to be greatly magnified. After five iterations (Fig. 2B) many of these small y values have increased considerably. After 10 iterations (Fig. 3B) most of the areas lie along the 45 degree line with y approximately equal to x. Although some outliers remain, Figure 3B indicates that the map in Figure 3A is approximately equalized.
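
The comparison plotted in Figs. 1B-3B can be sketched as follows. This is only an illustration, not the Fortran program of Ref. 2: the function names are hypothetical, each triangle is assumed to carry its own (nonzero) population, and the target area is assumed to be the triangle's share of the total population times the total map area.

```python
def triangle_area(p, q, r):
    """Area of a triangle from its three (x, y) vertices (shoelace formula)."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0


def target_areas(populations, total_area):
    """x: target area of each triangle, proportional to its population."""
    total_pop = sum(populations)
    return [total_area * pop / total_pop for pop in populations]


def equalization_ratios(triangles, populations):
    """Return y / x for each triangle (present area over target area).

    Assumes every population is > 0. A ratio far below 1 marks a region
    that still needs to be magnified; ratios near 1 mean the map is
    approximately equalized.
    """
    present = [triangle_area(*t) for t in triangles]          # y values
    target = target_areas(populations, sum(present))          # x values
    return [y / x for x, y in zip(target, present)]
```

A fully equalized map would give a ratio of 1 for every triangle, which corresponds to all points lying on the 45 degree line.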

In order to confirm the validity of the resulting map in Figure 3A, we still need to:

  1. Estimate the correct population at risk (ages 0 through 14, years 1980 through 1988) and estimate the errors in those estimates.

  2. Adjust for geographic variation in risk factors. For example, if the age distribution or race distribution of the population is not geographically uniform, the map must be equalized not on total population at risk, but instead on the expected numbers of cases in each tract, calculated from age-race-specific rates for the 4-county region as a whole. This is exactly analogous to indirect age adjustment in an SMR calculation.

  3. Demonstrate that within statistical variation, the SHAPE of the DEMP map is not significant for the analyses suggested below. The analysis of randomly generated "pseudo-cases" should convince unbelievers.
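
The adjustment in item 2 amounts to the following calculation for each tract. This is a minimal sketch under stated assumptions: the function name, the stratum keys, and the dictionary layout are hypothetical, and the rates are taken to be cases per person-year for the 4-county region as a whole.

```python
def expected_cases(tract_person_years, regional_rates):
    """Expected number of cases in one tract by indirect adjustment.

    tract_person_years: dict mapping a stratum, e.g. (age group, race),
        to the person-years at risk in that stratum within the tract.
    regional_rates: dict mapping the same strata to the regional rate
        (cases per person-year).
    """
    return sum(person_years * regional_rates[stratum]
               for stratum, person_years in tract_person_years.items())
```

The map would then be equalized on these expected counts instead of on the raw population of each tract.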

Preliminary observations: PROVIDED that 1-3 do not alter the final result, it appears that:

Starting with a DEMP map like that of Figure 3A, simple statistical analyses can be performed that would not be possible otherwise. Three examples are suggested here.

To test for a cluster near a selected location:

Count the number of cases within a selected region which includes the point. The shape of the region is immaterial but it would normally be a circle in Figure 1A. The area of the corresponding region in Figure 3A is exactly proportional to the population at risk and hence to the expected number of cases. Under the null hypothesis of equal risk, the number of cases has a Poisson distribution with parameter lambda given by the area in Figure 3A. The p-value of the observed number of cases is easily calculated.
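
A minimal sketch of this calculation follows. The function names are hypothetical, and lambda is assumed to be the total number of cases times the fraction of the Figure 3A map area occupied by the selected region.

```python
import math


def expected_count(n_cases_total, region_area, total_area):
    """Poisson parameter lambda: cases expected in the region under equal risk."""
    return n_cases_total * region_area / total_area


def poisson_upper_tail(observed, lam):
    """P(N >= observed) for N ~ Poisson(lam): the one-sided p-value."""
    if observed <= 0:
        return 1.0
    term = math.exp(-lam)        # P(N = 0)
    cdf = term
    for k in range(1, observed):  # accumulate P(N = 1) .. P(N = observed - 1)
        term *= lam / k
        cdf += term
    return max(0.0, 1.0 - cdf)
```

For example, poisson_upper_tail(observed, expected_count(401, region_area, total_area)) would give the p-value for a region of known area in Figure 3A.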

To test for uniformity of rates throughout the region (one method):

To avoid edge effects, exclude from the analysis all regions within a fixed distance (say 20 "kilometers"*) of the boundary in Figure 3A. For each case in the remaining region, measure the distance to the nearest neighbor. The theoretical nearest neighbor distribution is exactly known, along with its mean and variance. Compare the observed distribution (or just the mean and variance) of nearest neighbor distances with the theoretical distribution.
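
The comparison of means can be sketched as below. The theoretical values used are the standard results for a uniform planar Poisson process of density rho (cases per unit area): mean 1/(2 sqrt(rho)) and variance (4 - pi)/(4 pi rho). The function names are hypothetical, and the z-score treats the nearest neighbor distances as independent, which is only an approximation.

```python
import math


def nearest_neighbor_distances(cases, interior):
    """Distance from each interior case to its nearest other case.

    cases: list of (x, y) positions in Figure 3A (all cases).
    interior: indices of the cases far enough from the boundary.
    Neighbors are searched among ALL cases, including those in the
    excluded border strip.
    """
    distances = []
    for i in interior:
        xi, yi = cases[i]
        distances.append(min(math.hypot(xi - xj, yi - yj)
                             for j, (xj, yj) in enumerate(cases) if j != i))
    return distances


def theoretical_mean_and_variance(density):
    """Mean and variance of the nearest neighbor distance under uniformity."""
    mean = 1.0 / (2.0 * math.sqrt(density))
    variance = (4.0 - math.pi) / (4.0 * math.pi * density)
    return mean, variance


def mean_distance_z_score(distances, density):
    """Approximate z-score of the observed mean nearest neighbor distance."""
    mean, variance = theoretical_mean_and_variance(density)
    observed = sum(distances) / len(distances)
    return (observed - mean) / math.sqrt(variance / len(distances))
```

A strongly negative z-score would indicate that cases lie closer together than expected under uniform rates.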

To test for uniformity of rates throughout the region (another method):

To avoid edge effects, again exclude from the analysis all regions within a fixed distance (say 20 "km"*) of the boundary in Figure 3A. Construct an imaginary circle with radius 20 "km" and move it continuously over the interior analysis region. Everywhere in the region, count the number of cases that fall within the circle.

The number of such cases has a Poisson distribution, with parameter lambda given by the size of the circle. A p-value can be exactly calculated at every point in the study area. On Figure 3A shade red the regions that have a p-value <.025 (significantly high) and shade blue those that have a p-value >.975 (significantly low).

Under the null hypothesis of equal risk, Figure 3A should have about 2.5% of its area shaded red and 2.5% of its area shaded blue. If rates are not geographically uniform, the shaded areas of Figure 3A will exceed those values. The circle size can be specified before the analysis begins, depending on whether one wants to see large-area or small-area effects.
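
The moving-circle procedure can be sketched on a discrete set of circle centers as follows. This is an illustration only: the function names are hypothetical, the continuous sweep is replaced by a caller-supplied list of centers, and because counts are discrete the "significantly low" test is taken here to mean that the lower tail P(N <= n) is below .025, which is one reading of a p-value above .975.

```python
import math


def poisson_cdf(n, lam):
    """P(N <= n) for N ~ Poisson(lam)."""
    if n < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for k in range(1, n + 1):
        term *= lam / k
        total += term
    return min(1.0, total)


def circle_lambda(n_cases_total, radius, total_area):
    """Expected cases in a circle: total cases times the circle's area share."""
    return n_cases_total * math.pi * radius ** 2 / total_area


def scan_labels(cases, centers, radius, lam):
    """Label each circle center 'red' (high), 'blue' (low), or 'none'."""
    labels = []
    for cx, cy in centers:
        n = sum(1 for x, y in cases
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2)
        upper = 1.0 - poisson_cdf(n - 1, lam)   # P(N >= n)
        lower = poisson_cdf(n, lam)             # P(N <= n)
        if upper < 0.025:
            labels.append("red")
        elif lower < 0.025:
            labels.append("blue")
        else:
            labels.append("none")
    return labels
```

The fraction of centers labeled red or blue, taken over a fine grid of interior centers, approximates the shaded fraction of the map area.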

All three of the methods above can be validated by randomly generating cases under the null hypothesis as follows. Randomly generate 401 cases; assign each case to one of the 262 tracts with probability proportional to the tract population. Within the selected tract, plot the case at random (in Figure 1A). Transform the randomly generated cases along with the real ones, to Figure 3A. Analyze the real and random cases with exactly the same techniques.
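
The generation of random cases in Figure 1A can be sketched as follows. The data layout (a list of tracts, each with a population and a list of triangular subregions) and the function names are hypothetical. Because all triangles within a tract are assumed to have the same population density, a triangle is chosen within the tract with probability proportional to its area, and the case is then placed uniformly inside it.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a triangle from its three (x, y) vertices."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def random_point_in_triangle(a, b, c, rng):
    """Uniformly distributed point inside triangle a, b, c."""
    s = math.sqrt(rng.random())
    r = rng.random()
    u, v, w = 1.0 - s, s * (1.0 - r), s * r   # barycentric weights, sum to 1
    return (u * a[0] + v * b[0] + w * c[0],
            u * a[1] + v * b[1] + w * c[1])


def generate_pseudo_cases(tracts, n_cases, rng):
    """Draw n_cases random locations under the null hypothesis of equal risk.

    tracts: list of dicts {"population": number, "triangles": [(a, b, c), ...]}
    rng: a random.Random instance (seed it for reproducible samples).
    """
    weights = [t["population"] for t in tracts]
    cases = []
    for tract in rng.choices(tracts, weights=weights, k=n_cases):
        triangles = tract["triangles"]
        areas = [triangle_area(*t) for t in triangles]
        a, b, c = rng.choices(triangles, weights=areas, k=1)[0]
        cases.append(random_point_in_triangle(a, b, c, rng))
    return cases
```

For example, generate_pseudo_cases(tracts, 401, random.Random(1)) would produce one random sample of the same size as the real data, ready to be transformed to Figure 3A.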

To roughly assess the statistical significance of any analysis, analyze 20 independent random samples of 401 cases each. If the analysis of the real sample yields a value more extreme than any of the 20 random samples, the observed result is significant at the 5% level (under the null hypothesis the real sample ranks first among the 21 samples with probability 1/21). In the DEMP analysis all the computation effort is in producing the DEMP map. Only a slight increase in computing is required to transform 21 samples instead of one.
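
The ranking argument can be written out as a short sketch. The function name is hypothetical, and "more extreme" is assumed here to mean a larger value of the test statistic.

```python
def monte_carlo_p_value(real_statistic, random_statistics):
    """Rank-based p-value of the real sample among the random samples.

    Counts the real sample itself, so with 20 random samples the
    smallest attainable value is 1/21.
    """
    as_extreme = sum(1 for s in random_statistics if s >= real_statistic)
    return (1 + as_extreme) / (1 + len(random_statistics))
```

With 20 random samples, a real statistic larger than all of them gives 1/21, or about 0.048.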

Computer simulation techniques can also be used to correct the theoretical distributions for edge effects, permitting one to use all the cases and not only those far from the boundary of the study area.

The DEMP technique, and the second and third statistical analyses suggested above, are completely independent of assumptions about the geographic study area. Except for the unavoidable granularity of the census population data, the analysis does not depend on arbitrary groupings of geopolitical entities like communities.

Given the availability of Census data and detailed TIGER map files, the DEMP analysis can be readily automated to obtain statistical power calculations for proposed studies in any part of the United States.

For a given study area,

  1. the analysis may be "legally" modified as necessary depending on results obtained with randomly generated data, but NOT in response to results obtained with real data.

  2. the number and type of analyses to be executed AND REPORTED must not be altered after the analysis of real data has begun.

If these two rules are scrupulously observed, the DEMP analysis technique can be used for automatic computerized surveillance of routinely collected health data, providing a basis for unbiased detection and assessment of geographic clusters of disease.

Footnote:

* In Figure 3A, area is proportional to population, so distance is proportional to the square root of population. Distance in Figure 3A is expressed in effective "kilometers" by requiring that Figure 1A and Figure 3A have the same total area.
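
The rescaling described in this footnote can be sketched as below. The function name and arguments are hypothetical; the only assumption is that distances scale as the square root of area.

```python
import math


def effective_km_scale(area_fig_1a_km2, area_fig_3a_units):
    """Factor that converts Figure 3A map distances to effective "kilometers".

    Chosen so that the rescaled Figure 3A has the same total area as
    Figure 1A.
    """
    return math.sqrt(area_fig_1a_km2 / area_fig_3a_units)
```

Multiplying every Figure 3A coordinate by this factor makes the two maps equal in total area, so a radius of 20 "km" has a definite meaning in the transformed map.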

References:

1. Interim Report #2 by Reynolds et al., California Department of Health Services.

2. E. Close

Deane Merrill
Information and Computing Sciences Division
Building 50B, Room 2239
Lawrence Berkeley Laboratory
One Cyclotron Road
Berkeley CA 94720
tel: 510-486-5063
fax: 510-486-6363
internet: dwmerrill@lbl.gov


$PUB/docs/parep/vancouver/draft2.html 11/21/94