Global Mapping International and the JESUS Film Project of Campus Crusade for Christ are developing a database of populated places (cities, towns, and villages) that combines the best features, and minimizes the deficiencies, of the best freely available populated place data sets. The resulting database includes records of approximately 2.3 million distinct places suitable for use in mapping systems and in database systems that record the locations of persons, institutions, and events. The methods used allow the database to be updated as new or revised sources become available. Because the data preserves all alternate names of the places from the original sources, along with phonetic renderings of all names, it is particularly suitable for systems that must look up places that have several names or variant spellings.
Intended Uses
Currently anticipated uses for the data include:
Site and project location lookups for WorldMap.org's partners site, which allows ministries to track the locations of their sites and activities online
Improved map labeling and population-based map symbols for users of the Global Ministry Mapping System and other Geographic Information Systems (GIS)
Built-up area polygons frequently include other places than the one named
Poor name rendering
No population data
Instituto Nacional De Estadistica Geografia E Informatica (INEGI) (download from CIESIN)
Large number of populations
Many alternate names
Mexico only
No diacritics in names
Habitats Project (1994)
Provides populations and alternate names for some places missed by other datasets
Good coverage of places and their relationships in urban areas
Poor coordinate accuracy in some areas
Older populations (mostly early 1990s)
Method
The output data table is initially empty. For each distinct place in each source, all alternate names of nearby places already in the output table are compared phonetically to all alternate names of the place under consideration. If no match is found, the place is added to the output table. If the place already exists in the output table, the entry in the output table is updated to reflect any "better" information (name, coordinate, or population). More detail is given under Detailed Method.
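The add-or-update loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's code: `crude_key` is a deliberately simple stand-in for the phonetic comparison (it is not Double Metaphone), the fixed 5 km radius is arbitrary, the record layout is hypothetical, and "better" is reduced to filling in a missing population.

```python
import math
import re


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))


def crude_key(name):
    """Stand-in phonetic key (NOT Double Metaphone): keep ASCII letters only,
    drop vowels after the first letter, collapse repeated letters."""
    s = re.sub(r"[^a-z]", "", name.lower())
    if not s:
        return ""
    key = s[0] + re.sub(r"[aeiouy]", "", s[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def name_keys(names):
    """Set of non-empty phonetic keys for a list of alternate names."""
    return {k for k in map(crude_key, names) if k}


def merge_place(output, place, radius_km=5.0):
    """Add `place` to `output`, or update the first nearby phonetic match."""
    keys = name_keys(place["names"])
    for existing in output:
        near = haversine_km(existing["lat"], existing["lon"],
                            place["lat"], place["lon"]) <= radius_km
        if near and keys & name_keys(existing["names"]):
            # Preserve every alternate name from every source.
            for n in place["names"]:
                if n not in existing["names"]:
                    existing["names"].append(n)
            # Assumption: "better" here only means "fills a missing value".
            if existing.get("population") is None:
                existing["population"] = place.get("population")
            return existing
    output.append({**place, "names": list(place["names"])})
    return output[-1]


# Illustrative records; coordinates and population are made-up round values.
output = []
merge_place(output, {"names": ["Cologne", "Köln"], "lat": 50.94, "lon": 6.96, "population": None})
merge_place(output, {"names": ["Koeln"], "lat": 50.95, "lon": 6.97, "population": 1000000})
merge_place(output, {"names": ["Bonn"], "lat": 50.73, "lon": 7.10, "population": None})
# `output` now holds two records: the first two inputs were merged.
```

Comparing every alternate name of the new place against every alternate name of each nearby candidate is what lets "Koeln" match a record whose preferred name is "Cologne". A real run over millions of places would need a spatial index rather than the linear scan shown here.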
Status of Work
A first run of all of the data was completed in January 2007, requiring 8 weeks of continuous processing time. The results were close to what we desired, but an excessive number of places were merged together in situations where a person looking at the data would judge the names to be too dissimilar or the distances separating the places too great. As of April 2007 we are:
revising the matching algorithm to improve phonetic discrimination
decreasing the average radius searched for matches, making it dependent on the population of the place and the accuracy of the source
We currently expect that results will be available in the second half of 2007.
Limitations
While the process described here draws each component of the result (name, coordinates, and population) from the source considered most reliable, many places have data from only a single source.
Contacts
Contact
at Global Mapping International or
at JESUS Film Project for additional information on this project.
Donations
This project has been funded to date by contributions to Global Mapping International and the JESUS Film Project. Your contribution will help the continuing work of updating and improving this resource.
Detailed Method
The general method used to merge the various sources is as follows:
Decompose each data source into entries in two tables:
places, containing for each (supposedly) unique place in each source:
A unique ID composed of a source identifier and a place ID within the original source
The preferred name of the place for GIS labeling purposes (from among the names provided in the source)
Place and urban agglomeration populations
names, containing for each alternate name of each entry in places:
Place identifier
Name in Unicode
Name in diacritic-stripped plain ASCII
A "normalized" name with punctuation removed and common words (e.g. various language versions of "a", "the", "saint", "city", etc.) either removed or converted to a standard abbreviation
Primary and secondary phonetic equivalents of the name using a modified version of Philips' Double Metaphone algorithm.
Populate a third table gis with merged, composite records (similar in structure to the places table, but with source identifiers for each component), according to the following general rules for each entry in places:
Search for gis table entries within a stated distance and having at least one Double Metaphone match among all of the corresponding entries in the names table.
If zero matching entries are found, copy the places record to the gis table.
If one entry is found, update the components of the gis record (code, name, coordinates, population) for which the current places record has better data.
If more than one entry is found, merge existing gis records according to the above rules until a single entry is present, then update the record with new data as above.
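The name fields listed for the names table above could be derived roughly as in the following Python sketch. It is only an illustration: the word lists, the abbreviations, the field names, and the "SRC1:42" identifier format are all hypothetical, since the project's actual lists are not given here. The two phonetic fields are left out because they would come from a Double Metaphone implementation, which is not part of the Python standard library.

```python
import re
import unicodedata

# Hypothetical word lists for illustration only.
DROP_WORDS = {"a", "the", "el", "la", "le", "de"}
ABBREVIATIONS = {"saint": "st", "san": "st", "santa": "st",
                 "city": "cty", "ciudad": "cty"}


def strip_diacritics(name):
    """Return the name in plain ASCII with diacritics removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII equivalent are dropped in this sketch.
    return no_marks.encode("ascii", "ignore").decode("ascii")


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, then drop or abbreviate
    common words."""
    ascii_name = strip_diacritics(name).lower()
    words = re.sub(r"[^a-z0-9\s]", " ", ascii_name).split()
    kept = [ABBREVIATIONS.get(w, w) for w in words if w not in DROP_WORDS]
    return " ".join(kept)


def names_row(place_id, name):
    """Build one names-table entry (phonetic fields omitted, see lead-in)."""
    return {
        "place_id": place_id,
        "name_unicode": name,
        "name_ascii": strip_diacritics(name),
        "name_normalized": normalize_name(name),
    }


row = names_row("SRC1:42", "Ciudad Juárez")
```

Normalizing before computing phonetic keys means that variants such as "Saint-Louis" and "St. Louis" reduce to the same string, so the phonetic comparison only has to absorb genuine spelling differences.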