Use of a Density Equalizing Map Projection
in Analyzing Childhood Cancer
in Four California Counties
Deane W. Merrill, Ph.D., Dr.P.H.
22 Mechanic Street
Shelburne Falls, MA 01370
tel: 413-625-9281
email: merrill@crocker.com
In this study, 401 cases of childhood cancer in four California counties in 1980-88 were analyzed with the innovative methodology of Density Equalizing Map Projections (DEMP). The data were originally collected and analyzed by the California State Department of Health Services (DHS). In addition to the new analytic technique, the present analysis used population data more detailed and more accurate than those in the DHS analysis. The geographic boundaries of the 259 census tracts in the study area were adjusted according to population at risk so as to make population density everywhere constant; then the 401 case locations were plotted on the density equalized map. If risk is everywhere equal, the resulting distribution of cases should be uniform except for statistical variation.
The metric used was a measure of the variability of the density of cases on the density equalized map. The same metric was calculated for independent samples of artificial cases, generated under the null hypothesis of equal risk. The slight geographic non-uniformity observed among the real cases is well within the limits of variation observed in the samples of artificial cases. In agreement with results published by DHS, we conclude that there is no evidence for geographic variation of risk among the cases studied. Subsets of the data, selected by age, sex, race, time period and cancer site, yielded similar negative results.
This research, performed at Lawrence Berkeley National Laboratory (LBNL), was supported by the Office of Environment, Safety and Health, Office of the Deputy Assistant Secretary of Health Studies, Office of Epidemiologic Studies, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.
This is a preprint of an article accepted for publication in Statistics in Medicine. Copyright © 2001 John Wiley & Sons, Ltd.
BACKGROUND
The data analyzed in this report were collected by the California State Department of Health Services (DHS) in response to a reported cluster in the community of McFarland. The question of interest is: "are more cases being observed in McFarland (or any other particular area) than would be expected by chance?"
In an earlier analysis of these data, DHS used area-based methods to search for clusters. The present analysis uses instead a novel point-based method. The DHS analysis aggregated data from 262 census tracts into 101 "communities," then computed age-adjusted rates of childhood cancer for each community, together with 95% and 99% confidence limits for the rates. The spatial relationships between adjacent or nearby areas were not part of that analysis.
In the present analysis the 262 area-based data are disaggregated into pseudo point data. They are "pseudo" because the analysis does not use the precise location of the cancer case within the census tract; instead, pseudo point data are generated on the assumption that the observed cancer cases can be studied as if they were distributed randomly within their tract. The obvious implication of this assumption is that if clusters of cases occur at a geographic scale more local than the typical size of census tracts, this method cannot find them. This is not a limitation of the method, but rather of the population data, which (in this study) are not available below the census tract level.
DHS ANALYSIS
Observations of childhood ‘cancer clusters’ in small communities in central California prompted DHS to examine the distribution of childhood cancer in communities throughout the region to see if the overall cancer rate or the distribution of ‘cancer clusters’ was unusual for agricultural towns where pesticide exposure might be elevated [REYN96]. A total of 402 newly diagnosed cases of cancer (including 401 which were assigned to a census tract) were identified for 1980-1988 among children 0-14 years old, residing in four counties of California (Fresno, Kern, Kings and Tulare). Figure 1 illustrates the geographic distribution of the childhood cancer cases.
Population estimates for 1980-1988 were obtained from data sources provided by the California Department of Finance (DOF). Population projections for all races combined, by age and sex for each year, were available for each county. Race/ethnic- and gender-specific tract-level population data were available from the 1980 U.S. Census (STF1A and STF2) for the age groups 0-4 and 5-14. Race/ethnic- and age-specific rates of population change were available for the state as a whole from DOF. The relative rates of change for race/ethnic groups were used in conjunction with county age-specific rates of change to interpolate annual population estimates for each census tract in the four-county study area.
The DHS estimates were calculated as described above and later compared with preliminary 1990 Census data. According to a published report from DHS [REYN91], the DHS person-years estimates were between 76 and 107 percent of the projected population numbers. This small error is unlikely to have affected the conclusions of the analysis. Unfortunately, the DHS analysis cannot be repeated, because the population data files were not saved.
Next, 101 communities were defined as combinations of the 262 census tracts. Census tract configurations for these communities did not necessarily coincide with the boundaries for incorporated cities. Although more than one-third of the cases came from the cities of Fresno and Bakersfield, each was counted only as a single community. Geographic variation of rates within communities was ignored.
Age-adjusted incidence rates and 95 percent confidence intervals were calculated for each community. With the overall four-county rate as a reference, two communities had upper confidence limits that fell below the overall rate, suggesting they had rates significantly lower than expected. Three communities had lower confidence limits that fell above the overall rate, corresponding to rates significantly higher than expected. If all communities shared the same underlying risk, one would expect about 5 of the 101 communities to have 95 percent confidence intervals that excluded the average rate, which is exactly what was observed (two low and three high).
Repeating the calculation at the 99 percent confidence level, DHS found only one community, McFarland, to be significantly high; none were significantly low. Figure 2 shows the observed versus expected distribution of cases per community, excluding Fresno and Bakersfield. The results are consistent with those expected from random variation alone.
The sensitivity of the DHS analysis can be improved by applying a different methodology, and the population estimates can be refined with the use of 1990 Census data. Both approaches were used in the subsequent LBNL analysis of the four-county data set.
DENSITY EQUALIZING MAP PROJECTIONS
In studying geographic disease distributions, one normally compares rates among arbitrarily defined geographic subareas (e.g. census tracts), thereby sacrificing some of the geographic detail of the original data. The sparser the data, the larger the subareas must be in order to calculate stable rates. This dilemma is avoided with the technique of Density Equalizing Map Projections (DEMP). Boundaries of geographic subregions are adjusted to equalize population density over the entire study area. Case locations plotted on the transformed map should have a uniform distribution if the underlying disease risk is constant. On the transformed map, the statistical analysis of the observed distribution is greatly simplified. Even for sparse distributions, the statistical significance of a supposed disease cluster can be calculated with validity.
Since the case data are the same, how can the DEMP methodology provide better sensitivity than the DHS analysis? First, in the DEMP map, equal weights are assigned to subareas having equal statistical significance, i.e. equal population at risk. In the DHS analysis, the data were inefficiently used in that all communities were weighted equally regardless of size. Second, the DEMP analysis recognizes the locations of subareas, which are likely to have similar risk factors if they are close to each other. In the DHS analysis described above, the relative locations of the 101 communities were ignored. Third, a variable parameter in the DEMP analysis specifies the distance over which spatial averaging will occur. The size of the analytic units in the DHS analysis was arbitrarily fixed by the choice of 101 communities as the study units.
The DEMP methodology addresses the classic dilemma of comparing disease rates among different geographic areas or time periods. Rates are inadequate, because in an area with small population, even one case can produce a rate of epidemic proportions. On the other hand, a level of significance such as ‘two standard deviations above normal’ hides the rate, which is the quantity of underlying epidemiologic significance. Furthermore, such a significance level is correlated with population size.
Another classic dilemma is the problem of representing geographic variability. If the subareas are chosen too small, stable rates cannot be calculated; if too large, geographic detail is lost. Grouping of subareas to achieve stable rates requires arbitrary decisions that can affect the conclusions of the analysis.
A mapping approach for public health data was first used as early as 1798, when Seaman plotted the locations of yellow fever cases in New York [SEAM1798]. Physician John Snow used a mapping technique to investigate a cholera outbreak in London in 1849 [SNOW1849]. Snow observed a cluster of cases in the vicinity of the Broad Street pump and concluded that the well was contaminated, and the pump handle was removed. Implicit in Snow’s interpretation of the map was the underlying assumption that the population density was relatively uniform.
The same approach, but with corrections for varying population density, was first described in the 1920’s [KARS23, WALL26, GILL27]. Prior to plotting the cases, county boundaries were adjusted so as to give to each county an area proportional to its population. Then a visible cluster of cases could be correctly interpreted as an increased rate in the region of interest. The first two authors constructed their maps manually with paper and pencil, but Gill [GILL27] was more imaginative. He weighed out lumps of plasticene with weights proportional to the individual county populations; he then assembled a map of the counties of California, with the lumps of plasticene in their proper relative locations. Then, rolling the lumps to uniform thickness with a rolling pin, he automatically created a density equalized map without recourse to a computer. In the present analysis the same trick is performed with a computer, but the underlying principle is no different from Gill’s. Furthermore, Gill’s method was faster than the computerized method, both in implementation and execution!
DATA SOURCES
1. Case data. In April 1993 DHS provided LBNL with data for the 401 childhood cancer cases, excluding one case for which the census tract was not identified. Each record contained year of diagnosis, county code, (1980) census tract code, city, latitude and longitude of the case, age category (0-4 or 5-14), sex, and race/ethnicity (non-Hispanic white, Hispanic, or other). In February 1995 DHS provided LBNL with additional information, i.e. the cancer site of each case (leukemia, brain, or other). The latitude and longitude provided by DHS are indeed the coordinates of the case, not of the census tract. But the latitude and longitude of the case were not used in the present statistical analysis. In conjunction with tract boundaries they were used merely to verify correctness of the tract and county codes. The rationale for this decision is explained under 'PLOTTING CASE LOCATIONS'.
2. Census tract boundaries. A 1980 Census tract map file was purchased from National Planning Data Corporation, and a 1990 Census tract map file from Geographic Data Technology. The units of analysis were chosen to be 259 modified tracts, which (neglecting minor boundary changes) are the smallest aggregates of either the (262) 1980 tracts or the (306) 1990 tracts. To reduce computing time in the DEMP process, unnecessary geographic detail was removed from the tract boundaries until the smallest remaining details were no smaller than 20 percent of the area of their respective tracts. Figure 3 is the resulting map of 259 modified census tracts. Each tract is a single polygon. To prevent the polygon boundaries from self-intersecting during the DEMP process, the 259 polygons were subdivided into 1193 triangles, which were the subareas to be density-equalized. The unique Delaunay triangulation divided each polygon into triangles that are as nearly equiangular as possible [BOOT87]. It was discovered that a large map composed entirely of triangles cannot be density-equalized; in principle there are enough degrees of freedom, but in practice a solution was never reached. Therefore each triangle boundary segment was bisected, and the positions of the break points were allowed to vary freely during the DEMP process.
3. Population data. The analysis used age/sex/race-specific 1980 population estimates for the 262 tracts of the 1980 Census, and age/sex/race-specific 1990 population estimates for the 306 tracts of the 1990 Census. The 1980 and 1990 data were aggregated to a consistent set of geographic units; namely, the 259 modified Census tracts. Under the assumption that population change was linear in each tract between the two census dates (April 1, 1980 and April 1, 1990), age/sex/race-specific estimates of population at risk were obtained for each of the 259 modified tracts, for each of the two time periods 1980-84 and 1985-88.
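As a minimal sketch of the linear-interpolation step, the person-years for one tract and one age/sex/race stratum can be obtained by integrating a straight line between the two census counts. The function name `person_years`, the example counts, and the convention of measuring time in years from April 1, 1980 (with calendar periods running from January 1 to December 31, so that the first period starts slightly before the 1980 census date) are assumptions made for illustration; the source does not state these details.

```python
def person_years(pop_1980, pop_1990, start, end):
    """Person-years at risk between `start` and `end`.

    Times are in years since 1 April 1980 (the 1980 census date); the
    1990 census date is then t = 10. Population is assumed to change
    linearly in time, including a short extrapolation before t = 0.
    """
    slope = (pop_1990 - pop_1980) / 10.0
    midpoint = 0.5 * (start + end)
    # The integral of a linear function over an interval equals the
    # interval length times the value at the midpoint.
    return (end - start) * (pop_1980 + slope * midpoint)


# Hypothetical stratum: 1000 children in 1980 and 1400 in 1990.
py_1980_84 = person_years(1000, 1400, -0.25, 4.75)  # 1 Jan 1980 to 31 Dec 1984
py_1985_88 = person_years(1000, 1400, 4.75, 8.75)   # 1 Jan 1985 to 31 Dec 1988
```

Summing such values over strata and tracts would give totals comparable to the Mpy figures reported later, under the stated assumptions.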
DENSITY EQUALIZATION
In the literature, density equalized maps have been called population maps, cartograms, contiguous-area cartograms, or anamorphoses. The program used in this analysis is based on a mathematical algorithm by Gusein-Zade and Tikunov [GUSE93]. Technical details are described in LBNL internal documents [CLOS94, MERR95, MERR96, MERR98]. The correction to be applied to each point in the map is calculated explicitly from the required expansion or contraction of each infinitesimal area in the entire map. Convergence is achieved in a specified number of iterations, typically 10. Every polygon in the map generates a radial "push" or "pull" on the rest of the map, depending on whether its present area is smaller or larger than the target area determined by its population. The magnitude of the radial "push" or "pull" decreases with distance, exactly as required to keep constant the areas of polygons that are being passively transformed; i.e. which already have the correct target area.
The Russian algorithm was implemented at LBNL. Additional features [CLOS94, MERR95] were found to be necessary for equalizing highly non-uniform populations like that of the four-county area. In addition, proper map preparation prior to density equalization is essential for reducing the calculation time. The difficulty is that large sparsely settled tracts in rural areas need to stretch and bend around the urban tracts while their area is being reduced to practically zero. There must be sufficient geographic detail to avoid illegal overlapping of polygon boundaries during the iteration process. On the other hand, excessive detail must be avoided, because calculation time increases as the square of the number of points in the map.
Target populations were assigned to each of the 1193 triangles by assuming uniform population density within each tract. The triangle map was density-equalized in ten equal steps. Due to the break point in each triangle boundary segment, the 1193 triangles gradually assumed the form of hexagons as the density equalization progressed.
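The assignment of target populations to triangles can be sketched as follows: with uniform density assumed within a tract, each triangle receives a share of the tract population proportional to its area. The function names and the example tract (a unit square split into two triangles) are hypothetical and are not taken from the LBNL program.

```python
def triangle_area(p, q, r):
    """Area of the triangle with vertices p, q, r given as (x, y) pairs."""
    return abs((q[0] - p[0]) * (r[1] - p[1])
               - (r[0] - p[0]) * (q[1] - p[1])) / 2.0


def triangle_targets(tract_population, triangles):
    """Split a tract's population among its triangles in proportion to area,
    i.e. assuming uniform population density within the tract."""
    areas = [triangle_area(*t) for t in triangles]
    total = sum(areas)
    return [tract_population * a / total for a in areas]


# Hypothetical tract: a unit square cut into two triangles, 100 person-years.
square = [((0, 0), (1, 0), (1, 1)), ((0, 0), (1, 1), (0, 1))]
targets = triangle_targets(100.0, square)
```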
The projected space following use of the density equalizing projection has the following properties:
1. the area of each census tract is represented on the transformed map as an area that is proportional in size to the number of person years at risk in that tract;
2. an attempt is made to preserve the shape of the census tract, though in many cases census tract shapes in the projected map will be unrecognizable;
3. directions and distances between cases are quite poorly and inaccurately represented in the projection space;
4. tracts that touch along a common boundary in real space will also touch in the projection space. However, the converse is not true: tracts that touch in the projection space may be quite distant from each other in real space, if the intervening space has zero population (for example a body of water). Tracts on opposite sides of an empty area will end up adjacent to each other in the transformed map, and exactly which tracts will face each other is unpredictable. Any clusters which involve tracts that are widely separated in real space must be interpreted with caution. For this reason it is unwise to analyze in a single map areas which are separated by water bodies or unpopulated areas.
5. The map transformation which equalizes population density is not unique. However, for a given geographic area the present algorithm gives reproducible results over a range of parameters (map detail, number of iterations, etc.). In some theoretical sense the map is minimally distorted, because all infinitesimal areas of the map are simultaneously expanded or contracted in small incremental steps. For testing the null hypothesis of equal risk, the DEMP technique is valid regardless of the shape of the transformed tracts. On the other hand, if significant clusters are observed, their interpretation would require careful study of the original map.
PLOTTING CASE LOCATIONS
Next, the 401 cases were plotted on the transformed map. Because the projection has no knowledge of population density below the census tract level, the DEMP process cannot equalize population density within a tract. Any clustering of cases within a census tract may or may not be due to unknown variations in population density. Such clusters do exist, and are preserved under the transformation. The purpose of the transformation is to produce a case distribution that is uniform under the null hypothesis, i.e. if risk is everywhere equal. This is accomplished by randomly plotting the cases that occur in each tract. Then the null hypothesis is easily tested by measuring the non-uniformity of the case distribution over the entire map.
Since the density of the population at risk is not known within a tract, it would be unwise to plot the cases at their exact locations, because they might then show up as a cluster in the transformed space. Such a cluster might represent merely a dense population of people at risk within the tract. Because the projection cannot correct for different densities of population within the tract, clusters of cases in the real space would also appear as clusters in the transformed space, but these clusters would be spurious. If one assumes homogeneity of the population at risk within the tract, then to be consistent one must also assume homogeneity of the cases within the tract in order for the methodology to be valid.
The latitude and longitude of each case are not used, as they provide no useful information for testing the null hypothesis. This fact has two pleasant consequences: (1) The DEMP technique can be used to analyze data sets such as SEER (Surveillance, Epidemiology, and End Results) cancer incidence data, where the latitude and longitude of the case are not publicly available; and (2) even if the researcher has access to the exact case locations, presentation of the density equalized map does not compromise the confidentiality of individual respondents.
As stated earlier, the algorithm equalizes population density among tracts. The Russian authors of the algorithm assert that the transformation also preserves population density within individual tracts, provided that the tract boundaries are sufficiently detailed and the number of iteration steps is sufficiently large. We have observed that a uniform distribution of random points within a given tract remains uniform after the transformation, regardless of the deformation of the tract boundary. A practical implication is that the cases can be plotted randomly either before or after the transformation - the result is the same.
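Random plotting of a case within its tract can be sketched as below, assuming the tract is available as a list of triangles (as in the triangulated map described earlier). A triangle is chosen with probability proportional to its area, and a point is then drawn uniformly inside it by the standard reflection method. The function names are hypothetical, and this is an illustration rather than the plotting routine actually used.

```python
import random


def triangle_area(p, q, r):
    """Area of the triangle with vertices p, q, r given as (x, y) pairs."""
    return abs((q[0] - p[0]) * (r[1] - p[1])
               - (r[0] - p[0]) * (q[1] - p[1])) / 2.0


def random_point_in_triangle(p, q, r, rng):
    """Uniform random point in a triangle: draw (u, v) in the unit square
    and reflect the half that falls outside the triangle back inside."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return (p[0] + u * (q[0] - p[0]) + v * (r[0] - p[0]),
            p[1] + u * (q[1] - p[1]) + v * (r[1] - p[1]))


def random_point_in_tract(triangles, rng):
    """Uniform random point in a tract given as a list of triangles."""
    weights = [triangle_area(*t) for t in triangles]
    tri = rng.choices(triangles, weights=weights, k=1)[0]
    return random_point_in_triangle(tri[0], tri[1], tri[2], rng)


# Hypothetical tract: a unit square cut into two triangles.
rng = random.Random(12345)
square = [((0, 0), (1, 0), (1, 1)), ((0, 0), (1, 1), (0, 1))]
points = [random_point_in_tract(square, rng) for _ in range(1000)]
```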
ANALYSIS OF CASE LOCATIONS
Figure 4 is the density equalized map for the complete data set of 401 cases, 3.3 million person-years (Mpy). Relative to Figure 3, tracts in densely populated urban areas have increased in size, and tracts in sparsely populated rural areas have decreased in size. In Figure 4, population density is constant over the entire map; the square in the lower left corner shows the area corresponding to 0.1 Mpy. Similar density-equalized maps, summarized in Table I, were prepared for various subsets of the population at risk. For example, the map of sample 2 (not shown) was used to analyze the distribution of cases among 192 non-Hispanic white children.
In Figure 5, the 401 cases of the complete data set are plotted on the density equalized map of Figure 4. In the two upper insets of Figure 5, each case is plotted at two different random locations in the tract where it occurred. In the two lower insets, 401 artificial cases were similarly plotted, with the tract for each case chosen at random under the assumption that rates are everywhere equal. Any apparent clusters in the lower insets are entirely due to statistical variation. In the two upper maps, any apparent clusters are insignificant unless more extreme than the random fluctuations in the two lower insets. In all four maps, the circles indicate the size of an area within which 20 cases are expected.
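Generating artificial cases under the null hypothesis amounts to drawing a tract for each case with probability proportional to that tract's person-years at risk. The sketch below uses hypothetical tract labels and person-year values; only the total of 401 cases comes from the text.

```python
import random


def allocate_cases(person_years, n_cases, rng):
    """Assign n_cases to tracts with probability proportional to the
    person-years at risk in each tract (equal risk everywhere)."""
    tracts = list(person_years)
    weights = [person_years[t] for t in tracts]
    counts = {t: 0 for t in tracts}
    for t in rng.choices(tracts, weights=weights, k=n_cases):
        counts[t] += 1
    return counts


# Hypothetical person-years for three tracts; 401 artificial cases.
rng = random.Random(2001)
example_py = {"tract_A": 12000.0, "tract_B": 30000.0, "tract_C": 4500.0}
artificial = allocate_cases(example_py, 401, rng)
```

Each artificial case would then be plotted at a random location within its assigned tract, exactly as is done for the real cases.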
In addition to an analysis of the 401 combined cases, 12 separate analyses were performed in which the data were successively partitioned according to race/ethnicity, time period of diagnosis, age group, sex, and cancer site. Similar plots (not shown) were made from each of the 12 remaining samples of Table I. In none of the plots was any significant pattern observed. This observation is consistent with the results of the quantitative analysis which will now be described.
Table I. Number of cases and million person-years (Mpy)

| sample | race/ethnic | years | ages | sex | cancer site | cases | Mpy |
|---|---|---|---|---|---|---|---|
| 1 | all | 1980-88 | 0-14 | both | all | 401 | 3.3 |
| 2 | non-Hisp white | 1980-88 | 0-14 | both | all | 192 | 1.6 |
| 3 | Hispanic | 1980-88 | 0-14 | both | all | 166 | 1.3 |
| 4 | other | 1980-88 | 0-14 | both | all | 43 | 0.4 |
| 5 | all | 1980-84 | 0-14 | both | all | 209 | 1.7 |
| 6 | all | 1985-88 | 0-14 | both | all | 192 | 1.6 |
| 7 | all | 1980-88 | 0-4 | both | all | 211 | 1.2 |
| 8 | all | 1980-88 | 5-14 | both | all | 190 | 2.1 |
| 9 | all | 1980-88 | 0-14 | male | all | 226 | 1.7 |
| 10 | all | 1980-88 | 0-14 | female | all | 175 | 1.6 |
| 11 | all | 1980-88 | 0-14 | both | leukemia | 134 | 3.3 |
| 12 | all | 1980-88 | 0-14 | both | brain | 76 | 3.3 |
| 13 | all | 1980-88 | 0-14 | both | other | 191 | 3.3 |
RELATIVE RISK (RR)
An intuitive estimate of relative risk $RR(x,y,k)$ at a point $(x,y)$ is $RR(x,y,k) = A_{exp}(k)/A_{obs}(x,y,k)$, where $A_{obs}(x,y,k)$ is the area of a circle which contains $k$ cases. $k$ is the bandwidth for smoothing, which determines how large an area around $(x,y)$ will be considered in estimating $RR(x,y,k)$. Here $A_{exp}(k) = k A_{tot}/N$ is the area of a circle in which one expects to find $k$ cases. $A_{tot}$ is the total map area and $N$ is the number of cases.
The preceding expression is cumbersome to evaluate, and discontinuous in $(x,y)$. A less intuitive but more useful formula for $RR(x,y,k)$ is the following Gaussian Kernel (GK) expression: at a given point $(x,y)$, $RR(x,y,k)$ is the sum over all cases $j$ of $C \exp(-d_j^2/d_0^2)$, where $C = 2/k$, $d_j$ is the distance from $(x,y)$ to case $j$, and $d_0^2 = k A_{tot}/(2 N \pi)$. The definitions of $d_0^2$ and $C$ are chosen such that the two expressions for $RR(x,y,k)$ give the same results for sufficiently large values of $k$.
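The two estimators can be sketched in a few lines. In the sketch below the circle for the intuitive estimate is assumed to be centred on (x,y) with a radius reaching the k-th nearest case, which is one reading of the definition above; the function names and the regular lattice of test "cases" are hypothetical and serve only to show that the Gaussian kernel form returns a value close to 1 when the case density is uniform.

```python
import math


def rr_circle(x, y, cases, k, total_area):
    """Intuitive estimate: expected area for k cases divided by the area of
    the circle centred on (x, y) that reaches the k-th nearest case."""
    dists = sorted(math.hypot(cx - x, cy - y) for cx, cy in cases)
    a_obs = math.pi * dists[k - 1] ** 2
    a_exp = k * total_area / len(cases)
    return a_exp / a_obs


def rr_gaussian(x, y, cases, k, total_area):
    """Gaussian kernel estimate: sum over cases of C * exp(-d_j^2 / d_0^2),
    with C = 2/k and d_0^2 = k * A_tot / (2 * N * pi)."""
    c = 2.0 / k
    d0_sq = k * total_area / (2.0 * len(cases) * math.pi)
    return sum(c * math.exp(-((cx - x) ** 2 + (cy - y) ** 2) / d0_sq)
               for cx, cy in cases)


# Hypothetical uniform layout: one case per unit cell of a 40 x 40 map.
lattice = [(i + 0.5, j + 0.5) for i in range(40) for j in range(40)]
rr_centre = rr_gaussian(20.0, 20.0, lattice, k=10, total_area=1600.0)

# Four cases at unit distance from the origin on a map of area pi.
ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
rr_ring = rr_circle(0.0, 0.0, ring, k=4, total_area=math.pi)
```

The circle estimate jumps whenever the k-th nearest case changes, which is the discontinuity mentioned above; the kernel estimate varies smoothly with (x,y).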
The value of RR(x,y,k) was calculated for all points (x,y) in an equally spaced grid over the entire map. In order to avoid a bias near the map boundary, a uniform grid of artificial cases was laid down outside the map contour, with grid spacing such that RR(x,y,k) is exactly 1.0 outside the contour. Figure 6 is a contour plot of RR, corresponding to Figure 5 with k=10. Larger values of k (not shown) produce smoother plots with less random variation, and less geographic detail; smaller values of k produce more random variation, with more geographic detail.
In Figure 6 regions of relatively high and low density are outlined by solid and dotted contours, respectively. The choice of contours is such that if the cases are distributed at random, approximately one-sixth of the area of the plot lies within solid contours, and one-sixth within dotted contours. Clustering, if it occurs, would produce an excess of regions having both high and low density. The area lying within the solid and dotted contours would increase at the expense of the remaining area. In Figure 6, no such clustering is observed.
In order to quantify the degree of clustering observed, RR (relative risk) is sampled at each point (x,y) in a regular grid, over the area of the density equalized map. The number of points in the sampling grid is arbitrary, provided the grid spacing is small compared with the average distance between cases. Figure 7 is the resulting distribution of log RR, from 3145 measurements on a regular grid, of the continuous function RR shown in Figure 6.
RR is observed to have an approximately lognormal distribution, so that log RR has a normal distribution. The mean of RR is one, so that the mean of log RR is zero. If the spatial distribution of cases is random, the variance of the distribution of log RR approaches 1/k as k becomes large, regardless of the arbitrary spacing of the grid points (x,y). The theoretical s.d. error of log RR, equal to sqrt(1/k), corresponds to the interval marked by vertical lines in Figure 7.
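The limiting variance 1/k can be checked directly from the Gaussian kernel definition. The following sketch assumes that, under the null hypothesis, the cases form a homogeneous Poisson process of intensity $\lambda = N/A_{tot}$ on the density equalized map, and it ignores boundary effects; these assumptions are introduced here for illustration and are not stated in the source. With $C = 2/k$ and $\pi d_0^2 = k A_{tot}/(2N)$, Campbell's theorem gives:

```latex
\begin{align*}
E[RR] &= \lambda \int C\, e^{-r^2/d_0^2}\, d^2r
       = \lambda\, C\, \pi d_0^2
       = \frac{N}{A_{tot}} \cdot \frac{2}{k} \cdot \frac{k A_{tot}}{2N} = 1, \\
\mathrm{Var}[RR] &= \lambda \int C^2\, e^{-2r^2/d_0^2}\, d^2r
       = \lambda\, C^2\, \frac{\pi d_0^2}{2}
       = \frac{N}{A_{tot}} \cdot \frac{4}{k^2} \cdot \frac{k A_{tot}}{4N} = \frac{1}{k}, \\
\mathrm{Var}[\log RR] &\approx \mathrm{Var}[RR] = \frac{1}{k}
       \quad \text{for large } k, \text{ where } \log RR \approx RR - 1 .
\end{align*}
```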
These predictions were experimentally verified for various sample sizes N, and for various values of k. Figure 7 displays the results for N=401 and k=10. The theoretical fraction "th tail" of sampled grid points lying outside the vertical lines is .317, with a s.d. error "th sd(tail)" of .073, corresponding to a 95% confidence interval of (0.17, 0.46).
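The "tail" statistic and its confidence interval can be sketched as follows. The tail fraction counts grid values of log RR lying outside plus or minus sqrt(1/k); for a normal distribution the expected fraction outside one standard deviation is erfc(1/sqrt(2)), about 0.317. The standard error 0.073 is taken from the text rather than derived here (grid points are correlated, so a simple binomial formula would not apply), and the function names and sample values are hypothetical.

```python
import math


def tail_fraction(log_rr_values, k):
    """Fraction of sampled log RR values lying outside +/- sqrt(1/k)."""
    limit = math.sqrt(1.0 / k)
    outside = sum(1 for v in log_rr_values if abs(v) > limit)
    return outside / len(log_rr_values)


def confidence_interval(expected, sd, z=1.96):
    """Two-sided interval expected +/- z * sd (z = 1.96 for 95 percent)."""
    return (expected - z * sd, expected + z * sd)


# Theoretical fraction outside one standard deviation of a normal distribution.
theoretical_tail = math.erfc(1.0 / math.sqrt(2.0))

# Interval quoted in the text for N = 401, k = 10 (s.d. taken from the text).
low, high = confidence_interval(0.317, 0.073)

# Hypothetical sample of four log RR values, two of them outside the limit.
example_tail = tail_fraction([0.0, 0.5, -0.5, 0.1], k=10)
```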
For two randomly plotted locations of each real case (upper insets of Fig. 7), observed values of "tail" were .35 and .29, well within the 95% c.i. Corresponding values of "tail" for the artificially generated cases (lower insets) were .35 and .44, also within the 95% c.i. Because the real cases have lower values of "tail" than the randomly generated cases, we conclude that there is no evidence for geographic variation of risk in the full sample of 401 cases, when the variable parameter k is chosen to be 10. Similar results are obtained with k= 20; values of "tail" are .37 and .28 for the real cases, and .37 and .45 for the random cases. All four values lie within the 95% c.i. (0.11, 0.49).
Similar results (for k=10) were obtained for all 13 of the samples defined in Table I. In each case, the average value TAV obtained from two plotted locations of the real cases was compared with TAV from two plotted locations of one independent sample of artificially generated random cases. These 13 tests are not statistically independent. However, the three race/ethnicity subsamples (2-4) are independent of each other, as are the two time periods (5-6), the two age groups (7-8), the two genders (9-10), and the three cancer sites (11-13). None of the values of TAV from the real cases were significantly different from those obtained with the corresponding artificial cases. One of the 13 samples analyzed has a p-value equal to 0.02; however, according to Bonferroni’s multiple testing criterion, such a measurement would have to yield a p-value less than 0.05/13=0.004 in order to be considered statistically significant at the 95% confidence level.
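The Bonferroni arithmetic in the preceding paragraph can be written out explicitly; the function name is hypothetical and the numbers are those quoted in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni's criterion."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 13)   # about 0.004
smallest_p = 0.02                            # smallest p-value among the 13 samples
is_significant = smallest_p < threshold
```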
A final test was empirical and non-parametric, free from any theoretical assumptions. For N=401 and k=10, TAV from two (new) plotted locations of the real cases was compared with TAV from two plotted locations of ten independent samples of artificially generated random cases. The ten independent values of TAV from the artificial cases had a sample mean of 0.307 and a sample variance of $(0.037)^2$. The observed value of TAV from the real cases was 0.34, which is only 0.9 standard deviations above the expected value, corresponding to a p-value about 0.18.
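The comparison in the preceding paragraph can be reproduced approximately as below. The source does not state how the p-value was obtained from the ten artificial samples; the sketch assumes a one-sided normal approximation using the sample mean and standard deviation, which is an assumption made here for illustration only.

```python
import math


def upper_tail_p(observed, mean, sd):
    """Standardized deviation and one-sided upper-tail p-value under a
    normal approximation (an assumption, not stated in the source)."""
    z = (observed - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


# Values quoted in the text: observed TAV 0.34, null mean 0.307, null s.d. 0.037.
z, p = upper_tail_p(0.34, 0.307, 0.037)
```

With the rounded inputs quoted in the text, z comes out close to 0.9 and p slightly under 0.19, in line with the stated "about 0.18" given rounding.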
A preliminary LBNL analysis claimed evidence of significantly non-uniform risk [MERR96]. However, those preliminary conclusions are to be disregarded because (a) the 1980 population was not an accurate representation of the 1980-88 population at risk and, more importantly, (b) the corresponding analysis of simulated cases in Ref. MERR96 was erroneous and led to a false conclusion. The error was subtle but extremely important:
In Ref. MERR96, 20 sets of simulated cases were generated as follows: (a) 401 cases were randomly assigned to the 259 tracts with probability proportional to population at risk, and then (b) the cases in each tract were randomly plotted within their individual tracts. This process, repeated 20 times, yielded a random distribution of 8020 cases over the whole map, as expected if the map is density equalized correctly.
But the 401 true cases were (a) assigned (by God) only once to census tracts, and then (b) randomly plotted 20 times within their individual tracts. Even though the distribution of the 401 true cases among tracts is consistent with equal risk, the resulting distribution of 8020 case locations is not uniform since all 20 plot locations refer to the unique God-given tract assignment of the 401 cases. What is a random statistical fluctuation in a sample of 401 cases, does not produce a random distribution if each case is plotted 20 times in the same tract. Therefore, for comparison with the analysis of the 401 real cases plotted 20 times, it was erroneous to simply use the 8020 random cases described in the previous paragraph.
A correct process, which was followed in the present analysis, is to calculate a summary statistic TAV from the 401 real cases, randomly plotted twice within their respective tracts. (Time constraints dictated the use of two rather than 20 plot locations. Statistical power is slightly reduced, but fortunately the statistical fluctuations resulting from the random plot locations are small compared with those resulting from random allocation of the 401 cases to 259 tracts.) Then the same summary statistic TAV was calculated from 401 artificial cases which were assigned once to the 259 census tracts, and then (like the real cases) randomly plotted twice within their individual tracts. Then the whole process was repeated ten times, with ten independent allocations of the random cases to tracts. The average TAV from the real cases fell within the range of the ten TAV values from the artificial cases, consistent with the null hypothesis of equal risk.
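The structure of the corrected procedure can be sketched as a small harness: one allocation of cases to tracts is held fixed while it is plotted twice, and each artificial sample receives its own fresh allocation before being treated in exactly the same way. The allocation and scoring functions below are toy stand-ins (three equal tracts and a placeholder statistic), hypothetical throughout; only the counts of 401 cases, two plottings, and ten artificial samples come from the text.

```python
import random


def average_statistic(allocation, plot_and_score, rng, n_plots=2):
    """Average a statistic over independent random plottings of ONE fixed
    allocation of cases to tracts; the allocation is never redrawn here."""
    scores = [plot_and_score(allocation, rng) for _ in range(n_plots)]
    return sum(scores) / n_plots


def null_distribution(allocate, plot_and_score, rng, n_samples=10, n_plots=2):
    """One fresh allocation per artificial sample, followed by the same
    plotting procedure that is applied to the real cases."""
    return [average_statistic(allocate(rng), plot_and_score, rng, n_plots)
            for _ in range(n_samples)]


def toy_allocate(rng, n_cases=401, n_tracts=3):
    """Toy stand-in: assign each case to one of three equally populated tracts."""
    return [rng.randrange(n_tracts) for _ in range(n_cases)]


def toy_score(allocation, rng):
    """Toy stand-in for TAV: largest tract share plus small plotting noise."""
    largest = max(allocation.count(t) for t in set(allocation))
    return largest / len(allocation) + rng.uniform(-0.01, 0.01)


rng = random.Random(0)
real_allocation = toy_allocate(rng)   # stands in for the observed tract counts
real_value = average_statistic(real_allocation, toy_score, rng)
null_values = null_distribution(toy_allocate, toy_score, rng)
within_range = min(null_values) <= real_value <= max(null_values)
```

The error described above corresponds to redrawing the allocation for every plotting of the artificial cases while plotting the real cases repeatedly from a single allocation, so that the two sets are no longer treated alike.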
CONCLUSIONS
The conclusion of the statistical analysis of this report is that among the 401 cases of the Four County Childhood Cancer data set, there is no evidence for geographic variation of risk, beyond that expected from random variation alone. This result agrees with that obtained by the California Department of Health Services [REYN96]. One cannot say that there is no variation in risk; merely that neither analysis was sufficiently sensitive to detect any such variation.
One might question the wisdom of analyzing the 12 demographic subsamples. They are so small that their analysis could have detected only an unusually large variation of risk. Nevertheless, even a large nonuniformity might have passed undetected in the DHS analysis, and it seemed prudent to make sure that none existed. The greater statistical power of the full data set (N=401) is offset by the difficulty of interpreting results in a demographically diverse population.
The DEMP analysis described in this paper improves on the DHS analysis in several respects:
(1) The 1990 Census data, not available to the DHS analysis, permitted improved estimates of the geographical distribution of the population at risk.
(2) 259 census tracts were analyzed in the DEMP analysis, vs. 101 communities in the DHS analysis. Statistical power is lost through the choice of the 101 communities, because about one-third of the population lives in the cities of Fresno and Bakersfield. In the density equalized map of Figure 4, those two cities occupy one-third of the map. Geographic variation of risk within those cities, if it exists, cannot be detected in the DHS analysis. Furthermore, many of the 101 communities are so small that robust confidence limits cannot be computed.
(3) In any DEMP analysis, geographic subareas are weighted according to their statistical significance.
(4) The DEMP methodology uses information on the locations of geographic subareas.
(5) The DEMP analysis includes a parameter that allows the scale of geographic aggregation to be varied.
In addition, the DEMP analysis has the significant benefit that individual case locations can be graphically displayed without compromising the confidentiality of individuals, because the plotted cases cannot be associated with particular census tracts by readers who do not have access to their boundaries plotted in the transformed space.
The DEMP technique illustrated here has value for the analysis or re-analysis of other data sets, and for routine screening of routinely collected case data where the potential exists for geographic variation of risk factors. In the present analysis, since no significant variation of risk was observed, the analysis was confined to testing the null hypothesis of equal risk. Because of the difficulties of interpretation, the DEMP methodology is most useful as (a) a hypothesis-free method for detecting departures from the equal-risk hypothesis, and (b) an aid in visually locating any clusters that may exist. The use of the DEMP technique for analyzing and interpreting statistically significant departures from uniformity has not been fully explored.
FIGURES
Figure 1. (Fig. 1 of REYN96). Childhood cancer cases for the four county area, 1980-1988 (N=401).
Figure 2. (Fig. 3 of REYN96). Observed vs. expected distribution of cases per community, excluding Fresno and Bakersfield.
Figure 3. (Fig. 8 of MERR98). Boundaries of 259 modified census tracts, with geographic detail removed.
Figure 4. (Fig. 9 of MERR98). Density-equalized map for the complete data set of 401 cases, 3.3 million person-years (Mpy). The square in the lower left corner shows the area corresponding to 0.1 Mpy.
Figure 5. (Fig. 20 of MERR98). Distribution of 401 cases, plotted on the density equalized map of Figure 4. In the two lower insets of Fig. 5, the distribution of cases is random by definition; any apparent clusters are due to statistical variation. In the two upper maps, any apparent clusters are insignificant unless more extreme than the random fluctuations in the two lower maps. In all four maps, the circles indicate the size of an area within which 20 cases are expected.
Figure 6. (Fig. 37 of MERR98). Contours of relative risk (RR), corresponding to Figure 5. A Gaussian kernel (GK) density estimator with k=10 was used. If geographic variation is due entirely to statistical fluctuations, approximately one-sixth of the area is expected to lie within the solid contours, and one-sixth within the dotted contours. Those proportions would increase in the presence of significant geographic variation.
Figure 7. (Fig. 35 of MERR98). Distribution of measurements of log RR, corresponding to Figure 6. If geographic variation is due entirely to statistical fluctuations, log RR is expected to have a normal distribution. Approximately one-third of the measurements are expected to lie in the tails outside the vertical lines; that fraction would increase in the presence of significant geographic variation.
REFERENCES
BOOT87. Boots, B. N. `Voronoi (Thiessen) Polygons'. In series: Concepts and Techniques in Modern Geography (CATMOG), Geo Books, Norwich UK, 1987.
CLOS94. Close ER, Merrill DW, and Holmes HH. Implementation of a new algorithm for Density Equalizing Map Projections (DEMP), Report LBL-35738, December 1994. The text of the paper (without figures) is at http://merrill.olm.net/pdocs/tr940401/all.html. For a printed copy of the complete paper, please contact the author, merrill@crocker.com.
GILL27. Gill, 'Population maps', AJPH, 17, 316-319 (1927).
GUSE93. Gusein-Zade SM and Tikunov VS. ‘A New Technique for Constructing Continuous Cartograms’, Cartography and Geographic Information Systems, 20:3, 167-173 (1993).
HAAC03. Haack H. and Wiechel H. Kartogramm zur Reichstagswahl. 2 Wahlkarten d. Deutschen Reiches in alter u. neuer Darstellung mit polit.-statist. Begleitworte u. kartog., Gotha, 1903. Cited in [GUSE93].
KARS23. Karsten Karl G. Charts and Graphs, Ch. 52, 1923. Cited in [GILL27].
MERR95. Merrill D, Selvin S, Close ER and Holmes HH. Use of density equalizing map projections (DEMP) in the analysis of childhood cancer in four California counties, Report LBL-36630, January 1995. The text of the paper (without figures) is at http://merrill.olm.net/pdocs/cdc9501/lbl36630.html. For a printed copy of the complete paper, please contact the author, merrill@crocker.com.
MERR96. Merrill DW, Selvin S, Close ER and Holmes HH. 'Use of density equalizing map projections (DEMP) in the analysis of childhood cancer in four California counties', Statistics in Medicine, 15, 1837-1848 (1996).
MERR98. Merrill DW. Density Equalizing Map Projections (Cartograms) in Public Health Applications, Dr.P.H. Dissertation, University of California, Berkeley School of Public Health, May 1998. Lawrence Berkeley National Laboratory Report LBNL-41624. URL: http://library.lbl.gov/docs/LBNL/416/24/HTML/. For a printed copy please contact the author, merrill@crocker.com.
REYN91. Reynolds P, Satariano E, Smith D. The Four County Study of Childhood Cancer Incidence: Interim report II. Environmental Epidemiology and Toxicology Program, California Department of Health Services, October 1991.
REYN96. Reynolds P, Smith DF, Satariano E, Nelson DO, Goldman LR, Neutra RR. 'The Four County Study of Childhood Cancer: Clusters in Context', Statistics in Medicine, 15, 683-697 (1996).
SEAM1798. Seaman V. 'An Inquiry into the Cause of the Prevalence of the Yellow Fever in New York', The Medical Repository, 1, 315-332. Cited in Stevenson L.G. 'Putting Disease on the Map: the Early Use of Spot Maps in the Study of Yellow Fever', J.Hist. Med., 20, 226-261 (1965).
SNOW1849. Snow J. On the mode of communication of cholera. First edition, John Churchill, London, 1849. The first edition had no map. The famous map appeared in the second edition, London, 1855. (The map faces p.45).
WALL26. Wallace, J.W. 'Population map for health officers', AJPH, 16, 1023 (1926).
3/14/01 1250 EST
http://merrill.olm.net/pdocs/cdc9911/