The following information can be used to estimate the size of an on-line genealogy database. Conversely, if the size of the database is known, one can estimate the number of occurrences of a surname of interest.

1. To estimate the size of a database, choose a surname whose relative frequency is known. For the examples below we use the surname Merrill (with no spelling variations), whose relative frequency in the U.S. since 1850 is about 0.012±0.002 percent. See Merrills in the U.S., 1850-1996. For a 1990 Census estimate for any other surname, see Frequency of Names in America's 1990 Census.

2. Count the occurrences of that surname in the database. If the number is large and a total is not provided automatically, count only the entries whose given name begins with one particular letter, then scale up using the following table, which gives the approximate distribution of given names (first names) in the United States:
For example, in FTM Family Finder Index, there are approximately 790 entries for Merrill with a given name beginning with B. From other sources, we know that the frequency of the surname Merrill is 0.012±0.002 percent (a fraction of 0.00012±0.00002), and the frequency of given names beginning with B is 4.0±0.5 percent (0.040±0.005). We estimate the size of the database to be 790/(0.040±0.005)/(0.00012±0.00002) = 125 to 225 million records. As a check on our result, this database is known to contain 153 million records. To get a better estimate, the process can be repeated for other first letters; also, the surname relative frequency can be estimated for different surnames and from a number of different databases.

In a second example, we estimate the size of a database whose size is not known. In Switchboard (residential phone listings), there are approximately 660 entries for Merrill with a given name beginning with A. We estimate the size of the database to be 660/(0.080±0.005)/(0.00012±0.00002) = 55 to 88 million records. For comparison, ProCD's Home Phone (1997 edition 1) has 85 million listings. Switchboard can be used to estimate the relative frequency of other surnames in the United States.

In a third example, we estimate the number of Merrills expected in a genealogy database whose size is known. As of April 1997, GENSERV contained 12.2 million records. In a database this size one would expect to find 0.00012 × 12.2 million ≈ 1500 Merrills if records had been sampled at random from the U.S. population. By actual count there are over 4600, because Merrills have been intensively studied by genealogists and are disproportionately represented in the GENSERV database. For other online genealogy databases, see Comparison of Linked Pedigree Databases.
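The arithmetic in the three examples can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of the original page. The low and high bounds are obtained by dividing by the largest and smallest values of both frequencies at once, which is an assumption about how the quoted ranges were computed, although it does reproduce them.

```python
def estimate_database_size(count, letter_freq, letter_err,
                           surname_freq, surname_err):
    """Estimate total records from a count of one surname restricted to
    given names starting with one letter.

    All frequencies are fractions (0.040, not 4.0 percent).
    Returns (low, central, high). The bounds use the extreme values of
    both frequencies together (an assumed, conservative error treatment).
    """
    central = count / letter_freq / surname_freq
    low = count / (letter_freq + letter_err) / (surname_freq + surname_err)
    high = count / (letter_freq - letter_err) / (surname_freq - surname_err)
    return low, central, high


def expected_surname_count(database_size, surname_freq):
    """Expected occurrences of a surname if records were a random sample."""
    return database_size * surname_freq


# Merrill: 0.012 +/- 0.002 percent, written as a fraction.
MERRILL_FREQ, MERRILL_ERR = 0.00012, 0.00002

# Example 1: FTM Family Finder Index, 790 Merrills with given name in B.
ftm = estimate_database_size(790, 0.040, 0.005, MERRILL_FREQ, MERRILL_ERR)

# Example 2: Switchboard, 660 Merrills with given name in A.
switchboard = estimate_database_size(660, 0.080, 0.005,
                                     MERRILL_FREQ, MERRILL_ERR)

# Example 3: GENSERV, 12.2 million records.
genserv_expected = expected_surname_count(12.2e6, MERRILL_FREQ)

for name, (low, central, high) in [("FTM", ftm),
                                   ("Switchboard", switchboard)]:
    print(f"{name}: {low / 1e6:.1f} to {high / 1e6:.1f} million "
          f"(central {central / 1e6:.1f} million)")
print(f"GENSERV expected Merrills: {genserv_expected:.0f}")
```

Run as written, the sketch gives ranges that agree with the figures quoted in the examples to within rounding. The central value is printed as well, since it is the single best guess when only one letter has been counted.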
updated 4/5/98
Deane Merrill, merrill@crocker.com