LBL-30954




                 ANALYZING GEOGRAPHIC CLUSTERED RESPONSE


                       Deane W. Merrill, Ph.D.1,2

                         Steve Selvin, Ph.D.1,2

                        Michael S. Mohr, M.S.1,2







                               August 1991








1Information and Computing Sciences Division, Lawrence Berkeley
Laboratory, 1 Cyclotron Road, Berkeley CA 94720.

2Department of Biomedical and Environmental Health Sciences,
University of California, Berkeley CA 94720.

Presented at the 1991 Joint Statistical Meetings of the American
Statistical Association, Atlanta GA, August 18-22, 1991.

This work was supported by the Director, Office of Epidemiology and
Health Surveillance; Office of Health; Office of Environment,
Safety and Health; U.S. Department of Energy under Contract No. DE-
AC03-76SF00098.
                 Analyzing Geographic Clustered Response


ABSTRACT

In the study of geographic disease clusters, an alternative to
traditional methods based on rates is to analyze case locations on
a transformed map in which population density is everywhere equal. 
Although the analyst's task is thereby simplified, the
specification of the density equalizing map projection (DEMP)
itself is not simple and continues to be the subject of
considerable research.  Here a new DEMP algorithm is described,
which avoids some of the difficulties of earlier approaches.  The
new algorithm (a) avoids illegal overlapping of transformed
polygons; (b) finds the unique solution that minimizes map
distortion; (c) provides constant magnification over each map
polygon; (d) defines a continuous transformation over the entire
map domain; (e) defines an inverse transformation; (f) can accept
optional constraints such as fixed boundaries; and (g) can use
commercially supported minimization software.  Work is continuing
to improve computing efficiency and improve the algorithm.

running head: Analyzing Geographic Clustered Response

key words: cartogram; density equalizing map projection; disease
clustersI.   Introduction

     Studies of geographic disease clusters are frequently
     surrounded by controversy.  Often such studies, being
     politically rather than scientifically motivated, are
     conducted with insufficient data and, predictably, yield
     inconclusive results.  There is pressure to publish positive
     results and ignore negative results; as a result, causally
     unrelated results emerge which raise more questions than they
     answer.  Rothman has gone so far as to state:

           1) with very few exceptions, there is little scientific or
           public health purpose to investigate individual disease
           clusters at all; 2) there is likewise very little reason to
           study overall patterns of disease clustering in space-time;
           and 3) as a consequent of points 1 and 2, no new statistical
           methodologies are needed to refine our study of disease
           clusters or clustering in general.

     Rothman's statement, though perhaps true in a limited context,
     is so categorically stated that it demands discussion.  To
     support the opposite view, we present Figures 1 and 2, which
     are examples of maps in which geographic variation provided
     important clues regarding disease etiology.

     In Figure 1 is the well-known map of Dr. John Snow, in which
     he plotted cholera deaths in central London in 1854.  The
     observed clustering of deaths in the vicinity of the Broad
     Street pump (indicated by the circled X near the center of the
     map) led him to the correct conclusion that water from the
     pump was contaminated and responsible for the cholera
     epidemic.

     In Figure 2 is a map of white male stomach cancer mortality
     rates, age-adjusted by county, 1950-1969.2  The elevated rates
     observed in northern Minnesota and Michigan coincide
     geographically with high proportions of Northern European
     immigrants living in those areas.  The observed correlation is
     compatible with high rates of stomach cancer prevailing in
     their countries of origin. 

     The role of statistical analysis is to provide objective
     criteria for the evaluation of alternate hypotheses.  A
     negative result can be as important as a positive one; either
     must be quantitatively stated if it is to contribute usefully
     to epidemiologic knowledge.  For example, quantitative
     statistical techniques are required in order to avoid drawing
     possibly incorrect conclusions from Figures 1 and 2:

     In Figure 1, cases rather than rates were plotted.  Without
     knowledge of the population distribution, one cannot know
     whether the observed cluster of cholera cases demonstrates a
     real health risk, or is simply a reflection of higher
     population density in the vicinity of the Broad Street pump.

     In Figure 2, without knowing the population of individual
     counties, one cannot know whether the high rate in a black-
     shaded county is statistically significant.  Furthermore,
     large sparsely populated counties such as those in the
     Southwest are statistically less important than their size
     suggests.


     I.A.  Density equalizing map projections (DEMPs)

     An early attempt to deal with such problems is illustrated in
     the maps of Figure 3, which were published in 1927.  In (a)
     the number of 1915-1924 smallpox deaths in each California
     county is represented by a number of dots; in (b) each county
     has been given an area proportional to its population, so that
     population density is everywhere equal.  The author actually
     constructed his maps from lumps of plastocine; the weight of
     plastocine put into each county was proportional to the
     population; the density-equalized map in (b) was produced by
     rolling the counties flat with a rolling pin.  The smallpox
     clusters observed in (a) are primarily an artifact of the
     large populations in one or two counties.  In this example,
     the author did not discuss how the transformed map might be
     statistically analyzed.

     Density-equalized maps, frequently called cartograms, have
     been widely used for graphic display, in various applications
     unrelated to epidemiology or public health.  More often than
     not, cartograms have been drawn freehand by a graphic
     artist.,  Others have been constructed with the aid of
     computers, but no specification has been given for linking the
     transformed map to a variable such as population.

     In recent years, a number of computerized algorithms for
     density equalizing map projections (DEMPs) have been described
     and used to display geographic data.,,,,  In each
     case, the algorithm iteratively adjusts polygon boundaries by
     ad hoc procedures until each polygon has the desired area. 
     For example, maps of the United States before and after a
     DEMP, from Schulman et al.,8 are shown in Figure 4.  DEMPs in
     the references cited here, with the exception of Schulman et
     al., have not been used for quantitative statistical analysis
     of spatial data.


     I.B.  Statistical analysis of DEMPs to study disease clusters

     Previous work at Lawrence Berkeley Laboratory has focused on
     the application of DEMPs to the study of disease
     clustering.8,,,  A transformed map is particularly useful
     for epidemiologic analysis of disease because spatial patterns
     of disease cases are generally dominated by the distribution
     of populations at risk.

     Clusters of disease cases can occur on an ordinary
     geopolitical map, even if every individual is equally likely
     to contract the disease.  Examples are shown in Figure 1 and
     Figure 3(a).  The apparent clusters can be the result of (a)
     non-uniform risk of disease, or (b) non-uniform distribution
     of the population at risk, or (c) both.

     Typical procedures for eliminating the confounding effect of
     a non-uniform population distribution include calculation of
     rates (as shown in Figure 2), or Poisson regression analysis. 
     Rates cannot be reliably calculated if the number of cases in
     each geographic subarea is very small, and combining subareas
     in order to create a region with a significant number of cases
     causes geographic information to be lost.

     More importantly, combination of subareas can introduce bias,
     since the choice of subarea groups is arbitrary and can be
     affected by prior knowledge of the data.  The notorious
     practice of "Texas sharpshooting" (for example defining a
     study area which would group together the high-rate counties
     of Figure 2) is guaranteed to produce a biased and meaningless
     result.

     A DEMP in general provides an unbiased method of assessing
     geographic clustering both visually and statistically, free
     from the confounding distribution of the population at risk. 
     Exact case locations can be used, so that no geographic detail
     is lost.  The statistical analysis of a DEMP is simplified,
     since a disease whose risk is spatially uniform (the null
     hypothesis) produces a random distribution of cases over the
     transformed map.  As a result, on a DEMP, the expected moments
     of the distribution of cases (calculated under the assumption
     of uniform risk) depend only on the external boundary of the
     transformed map and can be calculated analytically.8,13,14


     I.C.  Limitations of previous DEMP algorithms

     In the present paper we discuss features of DEMP algorithms
     per se, without repeating previous work8,13,14,15 concerning the
     statistical analysis of density equalized maps.  Although the
     DEMP greatly simplifies the analyst's task, the specification
     of the DEMP itself is not simple and continues to be the
     subject of considerable research.  With minor exceptions, the
     previously published DEMP algorithms8,9,10,11,12 share the
     following drawbacks:

(a)  Sparsely populated areas are necessarily transformed into thin
     snakelike areas, and repeated iterations are frequently unable
     to reduce their area to the desired value.  Without special
     procedures, some transformed polygons overlap each other,
     corresponding to a multivalued mapping function and areas with
     negative population density.  

(b)  Even when correct polygon areas are achieved, there is an
     infinite number of possible solutions, and no objective
     criterion by which to judge their relative merit.

(c)  Only the transformed polygon area is important for most
     display purposes.  A stricter condition -- constant area
     magnification within each polygon -- is required if the
     transformed map is to be used for statistical analysis of
     point locations.  This condition is not guaranteed by most of
     the previous algorithms.

(d)  Only the transformation of boundary points is specified, and
     not the transformation of points within the polygons (for
     example, the locations of disease cases).

(e)  An inverse transformation is not specified.  This could be
     used, for example, to plot on the original map contours which
     include equal numbers of people.

(f)  Imposition of additional constraints (for example, requiring
     a fixed boundary) is usually not possible.

(g)  The previous algorithms incorporate ad hoc iteration
     procedures.  They do not take advantage of important recently
     developed optimization techniques.

     In this paper we introduce a new DEMP algorithm that
     eliminates all of the deficiencies (a)-(g).  

     Section II describes pre-DEMP map processing.  Latitude and
     longitude are converted to Cartesian coordinates, unneeded
     geographic details are removed, and each polygon in the map is
     subdivided into triangles.

     Section III describes the DEMP algorithm itself.  A linear
     transformation is defined for every triangle in the map; then
     the parameters of the transformation are adjusted with
     standard nonlinear optimization techniques, so as yield the
     correct transformed area for each triangle while holding
     overall map distortion to a minimum.

     Section IV describes post-DEMP map processing.  The linear
     transformation within each triangle is used to determine the
     transformed location of arbitrary points in the map.  The
     transformed locations of cases of disease are statistically
     analyzed by methods which have been described
     elsewhere.8,13,14,15


II.  Pre-DEMP map processing

     In this section we describe the methods used to prepare a
     geographic map, with coordinates given in latitude and
     longitude, for DEMP processing.  Because the DEMP requires
     calculation of polygon areas, we first convert latitude and
     longitude to approximate distance units, e.g. kilometers.  In
     order to reduce computation time we remove unnecessary detail
     by filtering out points which contribute little to map
     definition.  Then we subdivide the map polygons into triangles
     so that the DEMP can be described as a continuous piecewise
     linear transformation with a finite number of parameters.

     We assume that the map file to be processed is free of errors,
     and that unwanted features such as small islands and lakes
     have been removed.  One can show that an error-free map of a
     simply connected region (i.e., without islands or lakes) obeys
     the relationship

    (number of segments) - (number of points) = (number of polygons)
- 1


     II.A.      Transformation of map coordinates

     Coordinates in geographic base files (GBF) are customarily
     stored in degrees as (long,lat), where

           long = | east longitude |  or   -| west longitude |

           lat = | north latitude |  or   -| south latitude |

     In order to facilitate area calculations, the geographic
     (long,lat) coordinates are first transformed into Cartesian
     coordinates (x,y), using any desired map projection.  For
     example, for small areas far from the poles (noting that one
     degree of latitude is approximately 111.195 km) one might use
     the equirectangular projection

              x = 111.195 c0 (long - a0) cos ( b0 ã / 180 )

                        y = 111.195 c0 (lat - b0)

     where (a0,b0) are the (long,lat) coordinates of the
     approximate center of the map, and c0 is a scale factor
     approximately equal to the number of map units in one
     kilometer.  If (for example) c0 = 1000, x and y are in meters,
     and areas are in square meters.  We calculate maxkm, the
     maximum linear extent of the map in kilometers; then choose

                       c0 = 1    if 4000 ó maxkm;

                    c0 = 10    if 400 ó maxkm < 4000;

                    c0 = 100    if 40 ó maxkm < 400;

                     c0 = 1000    if 4 ó maxkm < 40;

     etc.    With this choice, x and y can be stored as 16-bit
     integers, and exact area calculations can be performed with
     32-bit integer arithmetic.  (In order to use integer
     arithmetic and avoid division by 2, it is necessary to
     calculate and compare values equal to twice the area, rather
     than area.)

     For larger areas such as the continental United States, a
     different projection such as a conic projection would be more
     suitable.  The projection parameters (for example a0, b0 and
     c0) and the specification of what projection was used should
     be stored along with the (x,y) map coordinates to permit later
     conversion back to (long,lat) coordinates.


     II.B.      Polygon area calculations

     The DEMP algorithm relies on polygon area calculations, which
     are illustrated in Figure 5.  Map files are stored in a Dual
     Independent Map Encoded (DIME) format, similar to that used by
     the U.S. Bureau of the Census for its 1990 Census TIGER
     (Topologically Integrated Geographic Encoding and Referencing)
     files.  Records corresponding to individual line segments,
     stored in arbitrary order, contain:

    code(s) of polygon on left side of segment

    code(s) of polygon on right side of segment

    x and y coordinates of start ("from") point of segment

    x and y coordinates of end ("to") point of segment

     Each segment has an arbitrary "direction" indicated by its
     "from" and "to" point.  The contribution to a given polygon's
     area is positive or negative if the polygon lies immediately
     to the left or right of the segment, respectively, and zero if
     the segment is not part of the polygon's boundary.  The
     contribution of the segment to the polygon's area is

ñ « ( xfrom yto - xto yfrom )

     Polygon areas are calculated by summing over all segments in
     the file, in any order.  "Polygon 00" in Figure 5 is the
     external area not included in polygons 01 and 02.  By the
     convention just described, the areas of polygons 01, 02 and
     00, respectively, are equal to

         A01 = « (x1y2 - x2y1 + x2y3 - x3y2 + x3y1 - x1y3) = +1

         A02 = « (x1y3 - x3y1 + x3y4 - x4y3 + x4y1 - x1y4) = +1

  A00 = « (x2y1 - x1y2 + x1y4 - x4y1 + x4y3 - x3y4 + x3y2 - x2y3) = -2


     Ordinary polygons like 01 and 02 are traced counterclockwise
     and have positive area.  Polygon "holes", like polygon 00, are
     traced clockwise and have negative area.  Including the
     "holes", the net area of the entire map file is zero.     II.C.      Point removal

     Before proceeding with the DEMP, we remove unnecessary
     geographic detail from the map file in order to reduce
     computation time.  The method illustrated in Figure 6 is
     simple, and retains the most important features of the map. 
     The preservation of nodes, i.e. points that cannot be removed
     without altering the map topology, is an important feature of
     the point removal algorithm.

     In (a), nodes A and F are connected by line segments through
     points B, C, D and E, which are candidates for removal.  We
     wish to retain geographic features larger than a certain size,
     for example 10 km2.  The areas of all triangles involving
     three adjacent points (ABC, BCD, CDE, and DEF) are calculated,
     as shown.  The smallest triangle is CDE (with an area of 3
     km2).  We therefore remove point D; the segments CD and DE are
     replaced by the single segment CE.

     In (b), the areas of the remaining affected triangles are
     recalculated and the process is repeated.  The smallest
     triangle is CEF (with an area of 1 km2) so point E is removed.

     In (c), no triangles smaller than 10 km2 remain, so points B
     and C are retained.  Processing of the chain AF is complete.

     The entire map file contains multiple chains similar to
     ABCDEF.  In practice the smallest triangle in the entire map
     is always the next one considered for point removal, until the
     limit of 10 km2 is reached.  If removal of the considered
     point would cause two line segments anywhere in the map to
     intersect, the next smaller triangle in the map is considered
     instead.  No polygon is reduced below three points.

     In Figure 7 and Table I we illustrate point removal in a 1980
     Census tract map of San Francisco.  The map, with a total area
     of 120 km2, has 152 polygons (excluding the external boundary
     polygon) and 415 chains connecting 264 nodes.  All maps in
     Table I and Figure 7 obey the relationship

    (number of segments) - (number of points) = (number of polygons)
- 1

     because each step in the point removal algorithm
     simultaneously removes one point and one segment, without
     altering the number of polygons.

     The 0 km2 map (not shown) differs from the original map in
     that collinear segments have been consolidated.  The 0.01 km2
     map in (b) appears only slightly less detailed than the
     original map in (a) even though the number of points has been
     reduced from 1920 to 656.  Even in the 1 km2 map in (c), with
     only 268 points, the map still recognizably portrays San
     Francisco census tracts.  The minimum map in (d),
     corresponding to an infinite point removal criterion, contains
     only the 264 nodes, which are necessary to preserve the
     original map topology; each of the 415 chains has been reduced
     to a single line segment.

     Because excessive detail exacts a large computation penalty,
     we recommend a 1 km2 point removal criterion for tract level
     DEMPs in dense urban areas like San Francisco.  A point
     removal criterion between 100 km2 and 1000 km2 is appropriate
     for most county level maps.


     II.D.      Triangulation of polygons

     We wish to represent the DEMP transformation by a finite
     number of parameters which are closely related to the
     available data; namely, the populations and polygon boundaries
     of the original map.  Triangulation of each polygon allows a
     piecewise linear description of the DEMP transformation, while
     introducing no arbitrary new parameters.

     It has been argued that, for purposes of interpolation,

           the Delaunay triangulation is most appropriate because,
           besides being unique for a given set of data points, it
           simultaneously maximizes the number of triangles and produces
           triangles that are as equiangular as possible.  This is
           important in the interpolation process since it ensures that
           triangle edge lengths and thus the distances between
           interpolation points are minimized.

     A simple process is used which yields the uniquely defined
     Delaunay triangulation of every polygon.  (In triangulating a
     quadrilateral all of whose points lie on a circle, there are
     two possible choices, both equally valid.)  The algorithm is
     based on the properties of Voronoi (or Thiessen) polygons17
     and a theorem which is proved in Appendix A:

           Given n points in a plane, if the circle drawn
           through three of those points does not contain or
           touch any other of the n points, the triangle
           connecting those three points belongs to the
           Delaunay triangulation.

     Application of the theorem is illustrated in Figure 8 for a
     sample polygon ABCDE.  We want the triangulation of the plane
     to include all the pre-existing polygon boundaries, so we need
     only triangulate each polygon, one polygon at a time.  Three
     adjacent points in the polygon boundary which form an interior
     angle less than 180 degrees, for example ABC, are arbitrarily
     selected.  The (unique) circle is drawn which passes through
     all three of the points ABC.  If the circle contains no other
     points of the ABCDE polygon boundary (which is the case), then
     triangle ABC is part of the Delaunay triangulation.  Triangle
     ABC is separated out, and the same process is applied to the
     remaining polygon ACDE.

     In the remaining polygon ACDE, the circle through triangle ADE
     contains no other points, so triangle ADE (and hence also ACD)
     is valid.  On the other hand, the circle through triangle CDE
     contains points A and B, so triangle CDE is not valid (nor is
     ACE).

     The triangulation process is repeated for every polygon in the
     entire map.  Because the interior angle EAB is greater than
     180 degrees, triangle EAB may be included in the triangulation
     of another polygon, but not of ABCDE.

     In Figure 9 we illustrate the triangulated map of San
     Francisco, obtained by applying the triangulation algorithm to
     the 1 km2 map shown in Figure 7(c).  The triangulated map in
     Figure 9 contains 498 triangles, 268 points and 765 segments. 
     Like all the maps in Figure 7 and Table I, it obeys the
     relationship

    (number of segments) - (number of points) = (number of polygons)
- 1

     since each step in the triangulation algorithm simultaneously
     adds one segment and one polygon, without altering the number
     of points.


III. DEMP algorithm

     The DEMP algorithm itself will now be described.  The DEMP
     transformation is assumed to be continuous and piecewise
     linear; every triangle is mapped into another triangle, and
     adjacency of the triangles is preserved.  The parameters of
     the DEMP are the transformed coordinates of the triangle
     vertices.  The transformed area of each triangle is uniquely
     determined by requiring that (a) total area be unchanged (b)
     transformed areas be proportional to population, and (c)
     magnification be constant within each polygon.  The DEMP
     parameters are optimized to determine the transformed map
     configuration which satisfies the area constraints and is
     least distorted relative to the original map.


III.A.     Piecewise linear transformations

     Once the Delaunay triangulation has been performed on the
     entire map file, our density equalizing map projection (DEMP)
     is performed.  For simplicity, we require that the
     transformation within each Delaunay triangle be constant and
     linear; under such a transformation, each triangle maps into
     another triangle.

     In the example of Figure 10, all points (x,y) within triangle
     01 are mapped into points (u,v) by the linear transformation

                         u = a01 x + b01 y + e01

                         v = c01 x + d01 y + f01

     In matrix notation,     The area of triangle 01 is A01 before the transformation and
     B01 after the transformation.

     Similarly, all points (x,y) within triangle 02 are mapped into
     points (u,v) by the linear transformation

                         u = a02 x + b02 y + e02

                         v = c02 x + d02 y + f02

     In matrix notation,
     The area of triangle 02 is A02 before the transformation and
     B02 after the transformation.

     The two linear transformations in Figure 10 can be defined by
     specifying the twelve coefficients a01 through f01 and a02
     through f02.  Four constraints on the coefficients are
     required in order to preserve the adjacency of triangles 01
     and 02 in the transformation.  Necessary (but not sufficient)
     conditions are: e01 = e02 and f01 = f02.  More economically,
     one can define the transformation by simply specifying the
     eight transformed triangle coordinates (u1,v1), (u2,v2),
     (u3,v3), (u4,v4).  Expressions for a...f in terms of u and v
     (and the original coordinates x and y) are as follows.  For
     example, for triangles 01 and 02:
     By considering the transformation of an infinitesimally small
     map region, one can show that at any point the map
     magnification, i.e., the ratio of transformed to original map
     area, is equal to the Jacobian
     In order that spurious disease clusters not be produced by the
     DEMP, we require not only that transformed polygons have the
     correct area, but also that area magnification be constant
     within each polygon associated with a given population.  This
     is automatically true within each triangle, since a, b, c, and
     d are constant for a linear transformation.  In the next
     section we describe the additional constraints required to
     guarantee constant magnification within each entire polygon.

     If A01 þ 0, the DEMP transformation is defined.  The Jacobian
     is equal to B01/A01 = a01d01-b01c01, the area magnification
     factor of triangle 01.

     If B01 þ 0, the inverse transformation is defined, with an
     area magnification factor of A01/B01 = (a01d01-b01c01)-1.


     III.B.     Area constraints

     In Figure 11 we illustrate the area constraints that are
     applied to polygons in a DEMP.  Two polygons, a triangle 01
     and a quadrilateral 02, are defined by coordinates (x1,y1)
     through (x5,y5), which are transformed into coordinates (u1,v1)
     through (u5,v5).  The Delaunay triangulation divides the
     quadrilateral 02 into triangles 02.1 and 02.2 (and the
     triangle 01 into a single triangle 01.1).  Areas before and
     after the DEMP are denoted by A and B respectively.  We
     require that:

    total map area be unchanged by the transformation; i.e., 

Btot = Atot

    after the transformation, transformed polygon areas be
     proportional to population; i.e., 

                       B01 / pop01  =  B02 / pop02

    areas of triangles within each polygon are in the same
     proportion before and after the transformation; i.e.,

                     B02.1 / A02.1  =  B02.2 / A02.2

     Hence we require the target areas B01.1, B02.1 and B02.2 to be
     equal to

             B01.1 = pop01 þ [Atot / poptot] þ [A01.1 / A01]

             B02.1 = pop02 þ [Atot / poptot] þ [A02.1 / A02]

             B02.2 = pop02 þ [Atot / poptot] þ [A02.2 / A02]

     where
                         poptot = pop01 + pop02

              Atot  =  A01 + A02  =  A01.1 + A02.1 + A02.2

              Btot  =  B01 + B02  =  B01.1 + B02.1 + B02.2

           A01.1 = « (x1y2 - x2y1 + x2y3 - x3y2 + x3y1 - x1y3)

           A02.1 = « (x1y3 - x3y1 + x3y5 - x5y3 + x5y1 - x1y5)

           A02.2 = « (x3y4 - x4y3 + x4y5 - x5y4 + x5y3 - x3y5)

     In the (u,v) space, then, we require the transformed triangle
     areas to be equal to their target values, which sets up the
     following three quadratic constraints on (u1,v1) through
     (u5,v5):

     H01.1 ð « (u1v2 - u2v1 + u2v3 - u3v2 + u3v1 - u1v3) - B01.1 = 0

     H02.1 ð « (u1v3 - u3v1 + u3v5 - u5v3 + u5v1 - u1v5) - B02.1 = 0

     H02.2 ð « (u3v4 - u4v3 + u4v5 - u5v4 + u5v3 - u3v5) - B02.2 = 0

     The imposition of separate area constraints on triangles B02.1
     and B02.2, and not just on the quadrilateral B02, is a critical
     feature of the DEMP algorithm.  This feature prevents the
     occurrence of negative-area triangles, which would produce
     double-valued regions of the mapping function and self-
     intersecting polygon boundaries.


     III.C.     Map distortion

     The area constraints described in the previous section are
     insufficient to completely define the DEMP.  In the example of
     Figure 11, the ten parameters (u1,v1) through (u5,v5) are
     restricted by the three constraints H01.1 = 0, H02.1 = 0, and
     H02.2 = 0.  Three further constraints are imposed by requiring
     that the transformed map not be rotated or translated (in x or
     y) relative to the original map.  Four degrees of freedom
     remain, which means that the number of possible solutions is
     infinite.

     Tobler recommends that a unique optimum solution be defined by
     making the DEMP be as nearly conformal as possible. 
     Conformal transformations are those that locally preserve the
     shape (but not necessarily the size, location, or orientation)
     of each infinitesimal portion of the map.  As illustrated in
     Figure 12, conformal linear transformations include all
     combinations of translations, rotations and magnifications,
     corresponding to the following transformation matrices:
     but not reflections, compressions or shear transformations,
     corresponding to the following transformation matrices:

     Tobler's suggested measure of non-conformality (which we write
     in more customary notation) is the integral over the
     original map of
     This quantity is identically equal to zero for conformal
     transformations, which by definition obey the Cauchy-Riemann
     conditions19
     However Tobler's measure does not properly reflect the non-
     conformality of of reflections, compressions or shear
     transformations; i.e., linear transformations of the form
     where a þ d or b þ -c, since his measure is equal to zero in
     this situation also.

     Instead of using Tobler's measure, we define instead the non-
     conformal distortion of a triangle 01.1 as

             G01.1 = ( a01.1 - d01.1 )2 + ( b01.1 + c01.1 )2

     where a01.1, b01.1, c01.1 and d01.1 are functions of (x,y) and
     (u,v) defined as in section III.A.  The resulting expression
     is quadratic in u1, v1, u2, v2, u3 and v3, the transformed
     coordinates of triangle 01.1:

   G01.1 = { [ u1(y2-y3) +u2(y3-y1) +u3(y1-y2) -v1(x3-x2) -v2(x1-x3) -
v3(x2-x1) ]2

            + [ u1(x3-x2) +u2(x1-x3) +u3(x2-x1) +v1(y2-y3) +v2(y3-y1)
+v3(y1-y2) ]2 } / det2
     
     where

         det = 2 A01.1 = x1y2 - x2y1 + x2y3 - x3y2 + x3y1 - x1y3

     The distortion of the entire map is calculated as a sum of G
     over all triangles, weighted by the area of each triangle.  We
     choose the weighting coefficients to be A+B, the sum of the
     original area A and the target area B.  (Using A instead of
     A+B yields unstable solutions for triangles which are
     initially small and which are magnified by large factors;
     using B alone yields unstable solutions for triangles which
     are magnified by small factors.  In addition, the symmetric
     form A+B has the aesthetic property that the DEMP
     transformation, and the inverse transformation to return to
     the original map, have equal distortion.)

     The assumed transformation functions u(x,y) and v(x,y) are
     continuous and piecewise linear; however, they have
     discontinuous first and second derivatives at triangle
     boundaries. At this time it is not clear how to quantify that
     component of map distortion.


     III.D.     Summary of DEMP algorithm

     In summary, the DEMP algorithm consists of finding the set of
     transformed coordinates u1, v1, u2, v2... which minimize
     overall map distortion subject to the triangle area
     constraints.  That is, for the example of Figure 11, finding
     the values u1, v1, ... u5, v5 which minimize

         G(u,v) ð g0 [ g01.1 G01.1 + g02.1 G02.1 + g02.2 G02.2 ]

     where

       g01.1=A01.1+B01.1;   g02.1=A02.1+B02.1;   g02.2=A02.2+B02.2

     subject to the three quadratic constraints

                    H01.1 = 0;  H02.1 = 0;  H02.2 = 0

     or, equivalently, to the single constraint

    H(u,v) ð h0 [ h01.1(H01.1)2 + h02.1(H02.1)2 + h02.2(H02.2)2] = 0

     which is fourth order in u1, v1, ... u5, v5;  g0 and h0 are
     arbitrary constants.  The constants h01.1, h02.1 and h02.2 must
     be positive and non-zero; good convergence is obtained for the
     choice
     where Bmin = Atot/poptot; namely, the area on the transformed
     map equivalent to a population of one person.  The Bmin term
     is required to avoid infinitely large coefficents h, for
     triangles having population and target area B equal to zero.

     During the DEMP process, three of the ten parameters u1, v1,
     ... u5, v5 should be fixed in order to prevent arbitrary x
     translation and/or y translation and/or rotation of the entire
     transformed map.  As illustrated in Table II,

           (number of parameters) = 2 * (number of points) - 3

     In order to fix three parameters, a convenient choice for a
     map whose x range exceeds its y range is:

              uwest = xwest;  vwest = ywest;  veast = yeast

     where (xwest,ywest) and (uwest,vwest) are respectively the
     original and transformed (x,y) coordinates of the westernmost
     point of the original map; yeast and veast are respectively the
     original and transformed y coordinate of the easternmost point
     of the original map.  A convenient choice (with similar
     definitions) for a map whose y range exceeds its x range is:

           usouth = xsouth;  vsouth = ysouth;  unorth = xnorth

     After the DEMP, the resulting map can then be translated
     and/or rotated as desired.

     Optionally, additional constraints can be imposed during the
     DEMP optimization, provided a feasible solution of the area
     constraints exists.  For example, one could require ui=xi and
     vi=yi for all points i on the external map boundary.


IV.  Post-DEMP map processing

     In analyzing the geographic distribution of disease, the
     primary purpose of the DEMP is to plot cases on a transformed
     map where population density has been equalized.  In addition
     one may wish to transform the locations of various geographic
     features.  For example, one would normally wish to plot on the
     transformed map the location of a supposed environmental
     hazard such as a microwave tower or a toxic waste dump. 
     Contours of equal distance from a point source, although no
     longer circular, could still serve to identify areas of
     supposedly equal risk.  A transformed latitude-longitude grid
     could help the analyst locate additional features on the
     transformed plotted map.

     Case locations or geographic features are transformed as
     follows:

    With the transformation described in Section II.A, the
     (long,lat) location of a disease case or geographic feature k
     is first projected into (xk,yk), in the same coordinate system
     as the pre-DEMP polygon map.

    The DEMP solution, determined as in Section III.D, provides
     values of (u,v), the vertices of all triangles in the DEMP-
     transformed map.  From these, as explained in Section III.A,
     one can calculate ai,bi,ci,di for each triangle i, and a global
     e and f.

    A point-in-triangle routine, along with the pre-DEMP triangle
     coordinates (x,y), is used to determine in which polygon i
     each point k is located.  A point k lies inside or on the
     boundary of a triangle with vertices (x1,y1), (x2,y2), (x3,y3)
     (labeled in counterclockwise order) if and only if all three
     of the following quantities are non-negative:

                     (x2-x1)(yk-y1) - (xk-x1)(y2-y1)

                     (x3-x2)(yk-y2) - (xk-x2)(y3-y2)

                     (x1-x3)(yk-y3) - (xk-x3)(y1-y3)

    Then the appropriate ai,bi,ci,di (and e and f ) are used to
     transform the point (xk,yk) into density-equalized coordinates
     (uk,vk).

     In a density equalized map, analysis of disease distributions
     is relatively straightforward.  As described in Section I.B,
     a number of quantitative techniques have been developed, which
     have certain advantages over traditional rate comparisons or
     relative risk calculations.8,13,14,15


V.   Preliminary results

     V.A.  NAG optimization

     In our preliminary tests of the DEMP algorithm we have used
     the NAG (Numerical Algorithms Group) Fortran library routine
     E04VDF, which minimizes the quadratic objective function
     G(u,v) subject to the fourth order constraint condition H(u,v)
     = 0.  (An attempt to separately apply the quadratic
     constraints H01.1 = 0, H02.1 = 0, H02.2 = 0 led to excessive
     memory usage when the size of the problem was increased.)

     E04VDF uses a sequential quadratic programming method
     described by Gill et al.  Explicitly calculated first
     derivatives of the objective function G(u,v) and constraint
     function H(u,v), with respect to the variables u1, v1, u2, v2,
     etc. are required.  We note that G(u,v) and the individual
     Hi(u,v) are sums of separate terms, each of which is a
     quadratic function of only six parameters; namely, the
     transformed u and v coordinates of the vertices of a single
     triangle. 

     In addition, E04VDF requires an initial estimate of the
     solution, for which we use the original map (u1 = x1, v1 = y1,
     etc.)

     Without affecting the final solution, the objective function
     G(u,v) and its derivatives can be multiplied by an arbitrary
     constant g0.  Similarly, the constraint function H(u,v) and
     its derivatives can be multiplied by an arbitrary constant h0. 
     Larger values of g0 and h0 produce more precise solutions,
     with more iterations.  The ratio h0/g0 determines the relative
     weights of the two functions.  Large values of h0/g0 can
     produce solutions in which distortion may not be minimized;
     small values can produce solutions in which the final triangle
     areas are not exactly correct.  Some trial and error is
     required to obtain satisfactory solutions without an excessive
     number of iterations.


     V.B.  Examples

     To illustrate the results of the preliminary DEMP
     implementation, we have chosen the state of Vermont, which has
     14 counties.  A point removal criterion of 500 km2 was applied
     as described in Section II.C; then the filtered map was
     triangulated as described in Section II.D.  The filtered and
     triangulated Vermont map, which is shown in Figure 13(a), has
     30 points and 43 triangles; i.e., 57 parameters and 43 area
     constraints.

     In Figure 13(b) the same map has been transformed by a DEMP as
     described in Section III.D; in this example the variable
     equalized by the DEMP is the 1980 total population.  Note that
     in the transformed map (b) the areas of triangles within a
     given county are proportional to their areas in the original
     map (a).

     Figure 14 illustrates the optimization path corresponding to
     the total 1980 population DEMP map of Figure 13(b), for three
     different values of h0/g0.  Optimization proceeds from lower
     right to upper left.  The numbers on the curves represent the
     number of major iterations taken by the NAG routine E04VDF.

     For any DEMP, the original map by definition has distortion
     G(u,v) equal to zero; in this example (1980 total population)
     the initial violation of the area criterion was such that
     log10 H(u,v)= -1.62.  During minimization, the area violation
     H(u,v) was forced smaller and smaller, so that log10 H(u,v)
     became more and more negative; at the same time, distortion
     G(u,v) necessarily increased from zero to a value around 0.52.

     Consider first the middle curve, the solid line labeled
     "optimum area constraint."  After about 90 major iterations
     (2.8 minutes of central processor time on a Sun SPARCstation
     1), log10 H(u,v) reached a value around -6, at which point
     further changes in the map were invisible.  Repeated trials
     from different starting points (not shown) consistently
     converged to the same solution.

     In the lower dashed curve, labeled "weak area constraint," 
     major iterations are indicated in parentheses ( ).  Here the
     area constraint was weakened relative to the distortion
     function by decreasing the value of h0/g0.  Distortion was low
     throughout the optimization, but more iterations were
     required.  The final solution was the same as for the middle
     curve.

     In the upper dashed curve, labeled "strong area constraint,"
     major iterations are indicated in square brackets [ ].  Here
     the area constraint was strengthened relative to the
     distortion function by increasing the value of h0/g0.  The
     area constraint was satisfied more quickly but distortion was
     increased.  At the final solution, distortion remained around
     0.53 and could not be reduced by further iterations.  The
     final map configuration from the upper curve (not shown) was
     visibly different from those from the two lower curves.

     In general, the optimum value of h0/g0 (here, the middle
     curve) is that which yields the least distorted solution in
     the fewest iterations.  The user must experiment to determine
     optimum values of g0 and h0, and to determine at what value of
     log10 H(u,v) a stable solution has been reached.  

     Occasionally two or more distinct local minima of the
     distortion function G(u,v) were found, both of which satisfy
     the area constraint H(u,v)=0.  Figure 15 illustrates the
     results of a DEMP (again Vermont with a 500 km2 point removal
     criterion) which equalized the 1980 Native American population.

     The optimization path is illustrated in Figure 15(a). In this
     example (1980 Native American population) the initial
     violation of the area criterion was such that log10 H(u,v)= -
     0.97.  Two distinct solutions were found, with distortion
     G(u,v)=1.14 and 1.01 in (b) and (c) respectively.  The second
     solution, which is less distorted, is preferred.

     The initial value of log10 H(u,v)= -0.97 for the Native
     American population, compared with log10 H(u,v)= -1.62 for the
     total population, indicates that for the Native American
     population the area constraint (in the original map) is more
     severely violated than for the total population.  This is
     consistent with the result that distortion (in the final
     transformed map) for the Native American population
     (G(u,v)=1.14 or 1.01) is larger than for the total population
     (G(u,v)=0.52).

     For the Native American population the less distorted solution
     in Figure 15(c) was consistently obtained for weaker area
     constraints (smaller values of h0/g0).  However, in other
     cases (e.g., Black population, not shown) the reverse
     situation occurred.

     The numbers given in this section for h0/g0, G(u,v), and log10
     H(u,v) are for comparison purposes only; they have no meaning
     in an absolute sense.  A necessary condition for a DEMP
     "solution" is that repeated trials from various starting
     points converge to that solution, and that no solutions with
     lower distortion be found.  However, one cannot prove that
     better solutions do not exist.  When performing repeated
     trials, note that G(u,v) is always the distortion of (u,v)
     relative to the true starting point (x,y), not relative to the
     perturbed starting point.


     V.C.  Computing Requirements

     Approximate computing requirements for the NAG routine E04VDF
     are summarized in Table II.  In order to estimate computing
     time for larger problems, we created a triangulated Vermont
     map (not shown) with a point removal criterion of 50 km2,
     which had more points and more triangles than the 500 km2
     triangulated map in Figure 13(a).  We performed DEMPs on both
     Vermont maps, equalizing both on 1980 total population density
     and using the same g0 and h0 and termination criterion (log10
     H(u,v)= -6).

     Other factors being equal, computation time increased from 2.8
     to 30 CP minutes (on a Sun SPARC station 1) when the number of
     points increased from 30 to 66, and the number of triangles
     increased from 43 to 103.  Consistent results (an increase
     from 1.3 to 14 CP minutes) were obtained on a VAX 6510
     computer.

     For the example studied, computation time increased
     approximately as the third power of map complexity.  The San
     Francisco 1 km2 triangulated map in Figure 9 has about four
     times as many points and five times as many triangles as the
     Vermont 50 km2 map (not shown).  Therefore, if other factors
     remain comparable, we estimate that a San Francisco tract
     level DEMP may take 2 to 3 days on a SPARC station 1.  It is
     not known how the number of required iterations will increase
     with map complexity.

     Runs of more than a few hours are impractical and expensive,
     especially since multiple runs are required to check the
     validity of a solution.  Problems of interest may have
     thousands or even tens of thousands of points.  The NAG
     routine E04VDF is not intended for large sparse problems, of
     which our DEMP algorithm is an example.  Work is continuing to
     improve computation efficiency, and to find minimization tools
     which are better suited to the particular characteristics of
     our DEMP algorithm.


VI.  Conclusions

     As explained in Section I.B, density equalizing map
     projections (DEMPs), correct for the confounding effect of
     varying population density, thereby providing a useful tool
     for analyzing the geographic distribution of disease.  The
     statistical analysis of a density equalized map is
     straightforward, since under the null hypothesis of equal
     risk, the distribution of disease cases on such a map is
     expected to be uniform.

     Although a DEMP simplifies the task of the analyst, the
     specification of the DEMP transformation itself is not simple
     and continues to be the subject of considerable research.  The
     algorithm described here represents a radical new approach to
     DEMP calculations, and has important advantages over previous
     techniques.  In particular (cf. Section I.C.):

(a)  application of the area constraint separately to each triangle
     avoids overlapping of transformed polygons, corresponding to
     a multivalued mapping function and areas of negative
     population density;

(b)  competing solutions can be quantitatively compared, permitting
     a comparative evaluation of various optimization techniques;

(c)  constant magnification within each polygon ensures that the
     DEMP cannot enhance spurious disease clusters;

(d)  the solution determines a continuous transformation over the
     entire map surface;

(e)  unless zero-population areas are present, the inverse
     transformation is uniquely defined;

(f)  optional boundary constraints can be easily imposed, provided
     a solution exists that is compatible with the area
     constraints;

(g)  commercially supported optimization software can be used. 

     Work is continuing to improve computation efficiency. 
     Provided that the difficulties of numeric optimization can be
     resolved, the new DEMP algorithm can be put to practical use
     in routinely analyzing specific geographic disease
     distributions.Appendix A.  Voronoi polygons and Delaunay triangles.

     In this appendix we prove the theorem used in Section II.D:

           Given n points in a plane, if the circle drawn
           through three of those points does not contain or
           touch any other of the n points, the triangle
           connecting those three points belongs to the
           Delaunay triangulation.

     For illustration, a set of Voronoi polygons is shown in Figure
     A-1.  Given n points in the plane (for example A,B,C,D,E), the
     Voronoi polygon associated with one point i is defined as that
     region which is as close or closer to point i than to any
     other of the n points.  For example, the quadrilateral
     indicated by dashed lines in Figure A-1 encloses the region
     which is closer to point A than to point B, C, D or E.

     The Delaunay triangulation is defined by the set of pairwise
     connections between points i whose Voronoi polygons are
     contiguous.  In Figure A-1, the Delaunay triangulation is the
     set of eight line segments AB, AC, AD, AE, BC, CD, DE and EB.

     Consider the circle centered at O, drawn through the three
     points A, D and E.  If the circle does not contain or touch
     any other of the n points in the plane, then points A, D, and
     E are equally close to point O, and no other of the n points
     (e.g. B and C) is as close.  Therefore, O belongs to the
     Voronoi polygons associated with A, D and E, but to no other
     Voronoi polygons.  Exactly three Voronoi polygons (A, D and E)
     coincide at point O, so the three polygons must be mutually
     contiguous.  By definition, AD and DE and EA belong to the
     Delaunay triangulation, so ADE is a Delaunay triangle.REFERENCES

               Table I.  Point removal and triangulation,
                    San Francisco, 1980 census tracts

area
criterionnumber
of
polygon
s*number
of
pointsnumber
of
segmentsfigureoriginal
map152192020717(a)0 km215215471698not shown0.001 km215212681419not shown0.01 km21526568077(b)0.1 km2152321472not shown1 km21522684197(c)minimum
map1522644157(d)1 km2,
triangulat
ed4982687659
* excluding external boundary polygon for "rest of world"Table II.  DEMP Computing requirements

two
tri-
angl
esthre
e
tri-
angl
esVermon
t 500
km2Vermon
t 50
km2San
Fran
-
cisc
o 1
km2figure5,
101113not
shown7(c)
, 9points453066268polygons*221414152triangles2343103498parameter
s5757129533area
constrain
ts2343103498degrees
of
freedom34142635iteration
s90128200-
300CP time
on SPARC
12.8
min30 min2-3
daysCP time
on VAX
65101.3
min14 min1-2
days
* excluding external boundary polygon for "rest of world"