Geographic Information Retrieval and Spatial Browsing

Ray R. Larson

Associate Professor

School of Library and Information Studies

University of California, Berkeley

Table of Contents

  1. Abstract
  2. Introduction
  3. Geographic Information Retrieval and Spatial Querying
    1. Geographic and Spatial Queries
    2. Types of spatial queries
  4. Spatial Browsing
  5. Geographic and Spatial Indexing
    1. GIPSY: Automatic Geo-referencing of Text
  6. Examples of GIR on the InterNet
  7. Geographic Browsing Toolkit
  8. Conclusions
  9. Acknowledgements


Abstract

Digital Library projects are beginning to create very large-scale repositories of digital information on a wide range of topics. As with traditional print libraries, this information can be indexed and retrieved in a variety of ways, ranging from purely descriptive cataloging of items in the database and topical analysis of content, to more specialized methods of classification and description that exploit the characteristics of digital information. This paper will examine the problems and prospects of a class of retrieval and indexing methods that are particularly suited to Digital Library materials with geographic content or associations. The characteristics of spatial queries, and some of the problems of uncertainty and approximation in spatial and geographic information retrieval, are discussed. Requirements and a methodology for automatic indexing and geo-referencing of text documents are then examined. A "state-of-the-art" examination of network access to geo-referenced information is provided, and a specification language and tool for development of graphical interfaces to support geographic information retrieval and spatial browsing is also described. In conclusion, general issues and characteristics of geo-referenced multimedia information systems are discussed.


Introduction

Many users of digital and print-based libraries have needs for information that is best approached from a geographical perspective. These users include scientists with research interests that range from the global changes brought about by the Greenhouse Effect and ozone depletion, global climate modeling and ocean dynamics, to the ecological characteristics of a region. They also include historians who require information on specific areas (at particular times), grammar school students working on class projects about particular cities and countries, and developers and city planners who must develop environmental impact reports for a given site.

Digital libraries have multiplied the types of geo-referenced information sources (i.e., information with associated geographic coordinates) available beyond the traditional print and paper forms (maps, geographically indexed books, etc.) to include remote sensing data and images from satellites and aircraft, databases of measurements (e.g., temperature, windspeed, salinity, snow depth, etc.) from specific geographic locations, and complex vector information such as topographic maps. They also store large amounts of digitized text and photographs from a variety of sources, in a variety of formats, and include other "multimedia" information such as digitized sounds and video in the DL database. Storage of and access to specific items in a DL database are generally fast and efficient. However, access to the contents, and particularly to information relevant to the particular topical or subject needs of the user of such a database, is another matter altogether. Users of Digital Libraries need to be able to search for specific known items in the database, and to retrieve relevant unknown items based on various criteria. This searching and retrieval has to be done efficiently and effectively, even when the scale of the database reaches the multi-terabyte range (as is expected in the not-too-distant future). This implies that Digital Library objects must be indexed, so that users can retrieve them by content. Effective user interfaces also must be designed so that the users can both search for items based on particular characteristics and browse the Digital Library for desired information.

This paper will examine the notion of Geographic Information Retrieval in the context of Digital Libraries. In particular, it will focus on a class of indexing and retrieval methods appropriate to Digital Library materials with geographic content or associations. Geographic indexing and access has long been recognized as problematic in libraries (Holmes, 1990). It has largely relied on verbal designations of places, commonly depending on the Library of Congress Subject Headings and Name Authorities as a source (Brinker, 1962; Larsgaard, 1987). Graphical methods of access (such as the use of index map sheets in cartographic collections) have been much rarer.

This paper will examine some of the characteristics of geo-referenced information and how such information can be incorporated into Digital Libraries. The intent is to raise a number of issues in the design and use of Digital Libraries with regard to search and retrieval of their geographically oriented contents. Not all of the issues raised will have obvious solutions, nor shall I attempt to solve all of the myriad problems with such systems. The next section defines and describes the characteristics of geographic information retrieval and spatial querying. The following section examines the characteristics and advantages of spatial browsing as a method of presenting a variety of geo-referenced information in a coherent framework. Subsequent sections examine indexing and access creation for geo-referenced sources, including a discussion of some automatic geo-referencing methods developed at Berkeley for analysis and geo-referencing of text materials. The section after that will examine graphical interfaces to support geographic information retrieval and spatial browsing, and will describe a prototype toolkit designed to aid in developing user interfaces for geographical applications such as spatial search and browsing. Finally, the conclusion will examine some general issues and characteristics of geo-referenced multimedia information systems.

Geographic Information Retrieval and Spatial Querying

Geographic information retrieval (GIR) is concerned with providing access to geo-referenced information sources. This phrase is intended to convey a specialization of the term Information Retrieval (IR). It includes all of the areas that have traditionally formed the core of IR research, with an emphasis on, or the addition of, spatially and geographically oriented indexing and retrieval.

There has often been a distinction in the literature between IR and "Data Retrieval" of the type associated with database management systems (DBMS). In practice the distinction is one of degree rather than kind. Figure 1 shows a spectrum of various attributes of information retrieval and data retrieval. We will examine each of the attributes depicted in the figure and attempt to see where GIR falls on the continuum.

In Information Retrieval the underlying model of providing access to documents is probabilistic. (I will use the term document to represent any item of potential interest in a collection or database, regardless of the content, whether text, images, maps, video, etc., or the form, whether paper or digital.) This model is concerned with such subjective and indeterminate issues as whether (and to what degree) a document satisfies a user's need for information, i.e., whether it is relevant for that user and request. Data retrieval, on the other hand, is deterministic with regard to retrieval operations. If a document fulfills the conditions specified in the user's query, then it is by definition "relevant". In Geographic Information Retrieval we are concerned with both deterministic retrieval (such as finding all data sets that contain information on a particular coordinate) and probabilistic retrieval (such as finding all towns near a major river).

Indexing is required both for efficient access to large databases and to organize and limit the set of elements of a database that are accessible. Most information retrieval systems derive their index elements from the contents of the items to be indexed. The derivation may be simple extraction (such as extracting keywords from a text), inferential extraction (such as mapping from text words to thesaurus terms), or intellectual analysis and assignment of index items (such as assigning subject headings to a document). In data retrieval the element itself, in its entirety, is the indexing unit. Obviously this spectrum is not really smooth or continuous, since both types of indexing may be present within the same system. In GIR both of these extremes are blended. There may be intellectual indexing (such as assignment of bounding box coordinates to an aerial photograph), and inferential indexing (assignment of coordinates for places mentioned in a text).

In the actual retrieval of items from the database, the algorithms used for matching between the query and the index elements (or database contents) are based on the particular retrieval model. Information retrieval models lead to a class of retrieval algorithms that are probabilistic in nature, and may involve the actual calculation of probabilities and use of statistical inference methods, or may take another approach based on another model of the document space (such as Salton's (1989) vector space model). They attempt to find all of the potential (partial) matches between query and document and to rank them based on some measure of "goodness", so that the "best" matches receive the highest ranks. Data retrieval algorithms are deterministic, and therefore demand an exact match between the query specification and the contents of the database. The Boolean logic used in processing the query languages of virtually all commercial database management systems, as well as in online catalogs and commercial information retrieval systems, is a deterministic algorithm. In GIR both approximate, partial matching and strict deterministic matching are of value in processing geographic and spatial queries (discussed further below).
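The contrast between exact and partial matching can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: the ranking function scores a document by the fraction of query terms it contains, which is a much cruder measure than the probabilistic or vector space models mentioned above, and the documents, terms, and function names are hypothetical.

```python
def boolean_match(query_terms, documents):
    """Deterministic retrieval: keep only documents containing every query term."""
    wanted = set(query_terms)
    return [doc_id for doc_id, terms in documents.items() if wanted <= set(terms)]


def ranked_match(query_terms, documents):
    """Partial matching: score each document by the fraction of query terms it has."""
    wanted = set(query_terms)
    hits = []
    for doc_id, terms in documents.items():
        score = len(wanted & set(terms)) / len(wanted)
        if score > 0:  # drop documents that share nothing with the query
            hits.append((doc_id, score))
    # Best matches first; ties are broken by document id for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical index: document id -> extracted index terms.
docs = {
    "d1": ["river", "flood", "monterey"],
    "d2": ["river", "salinity"],
    "d3": ["highway", "traffic"],
}
query = ["river", "flood"]

exact = boolean_match(query, docs)   # only documents with both terms
ranked = ranked_match(query, docs)   # every partial match, ordered by score
```

The Boolean version returns an unordered set of exact matches, while the ranked version also surfaces the document that matches only one of the two terms, placed below the full match.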

The queries in information retrieval systems are commonly expressed as a natural language statement of the searcher's needs for information. These queries are inherently imprecise and may be ambiguous. In data retrieval the query is usually expressed in some sort of structured query language with precise syntactic and semantic characteristics. When the goal is to retrieve all items from a database that exactly match the specifications of the query, then there must be no ambiguity in the query statement as to what is wanted. The query types thus reflect the underlying models of the retrieval systems. In information retrieval queries are taken as a "clue" as to what the searcher might consider to be a relevant item from the database and retrieval is based on how well an item matches the clue. Typically, the results of a search are presented in a ranked order based on this degree of match between the query and the database item. In data retrieval the query is taken as a precise specification of the desired items from the database and retrieval is based on exact correspondence between the item and the query. Unless explicitly specified by the system or by the user as part of the query, there is no ranking or order imposed on the results of a data retrieval query.

Geographic Information Retrieval, as we define it here, is an applied research area that combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research, and is concerned with indexing, searching, retrieving and browsing of geo-referenced information sources, and the design of systems to accomplish these tasks effectively and efficiently. In the next section we will further examine the characteristics of geographic and spatial queries and where these fit on the continua of Figure 1.

Geographic and Spatial Queries

The terms geographic queries and spatial queries imply querying a spatially indexed database based on relationships between particular items in that database within a particular coordinate system (or compatible coordinate systems). Spatial querying is the more general term. It can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space (De Floriani, Marzano & Puppo, 1993) without regard to the nature of the coordinate system. It could be argued that the Vector Space model of information retrieval (Salton, 1989) is a spatial querying system where the space and coordinates are defined by occurrence and frequency of term usage in a document collection. Geographic querying assumes that the space is delineated by the well-defined coordinate systems of the "real world." In the following discussion, the emphasis will be on geographic querying, although the underlying implementation might be a general purpose spatial database system rather than a geographic information system. As Frank (1991) has pointed out, there are many characteristics of geographic data that require special access methods and data structures. We will not examine access methods here, but will concentrate on a basic classification of types of spatial queries.

In general, geographical relationships in the coordinate systems imposed on the real world are geometric relationships. Within a geometric framework, where distance and direction can be measured on a continuous scale, many types of relationships between objects defined within that space can be determined using geometry. For example, given the geographic coordinates in latitude and longitude of Chicago (41°52'N 87°37'W) and New York (40°40'N 73°58'W), a fairly simple calculation can give the distance between the two cities (using the great circle method, the distance is $(\tfrac{1}{2} \times 7915.6) \times 0.86838 \times (\pi D / 180)$, where 7915.6 is the diameter of the Earth in miles, 0.86838 is the ratio of miles to nautical miles, and $\cos D = \sin(\mathrm{Latitude}_1) \sin(\mathrm{Latitude}_2) + \cos(\mathrm{Latitude}_1) \cos(\mathrm{Latitude}_2) \cos(\mathrm{Longitude}_1 - \mathrm{Longitude}_2)$, or about 651 nautical miles). Using the coordinates alone it is simple to determine other relationships between the cities: e.g., Chicago is west and north of New York.
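The great circle calculation can be sketched in a few lines of Python. This is a minimal illustration, not a geodetic library: it uses the spherical formula and the two constants quoted in the text, the function and variable names are hypothetical, and west longitudes are entered as positive numbers because only the longitude difference matters. The result depends on how the coordinates and constants are rounded, so it may not reproduce the figure quoted above exactly.

```python
import math

EARTH_DIAMETER_MILES = 7915.6         # value quoted in the text
NAUTICAL_PER_STATUTE_MILE = 0.86838   # conversion factor quoted in the text


def dms(degrees, minutes):
    """Convert degrees and minutes to decimal degrees."""
    return degrees + minutes / 60.0


def great_circle_nautical_miles(lat1, lon1, lat2, lon2):
    """Great circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon1 - lon2)
    cos_d = (math.sin(p1) * math.sin(p2)
             + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    # Guard against rounding pushing the value just outside [-1, 1].
    cos_d = max(-1.0, min(1.0, cos_d))
    d = math.acos(cos_d)  # central angle in radians, i.e. D * pi / 180
    return (EARTH_DIAMETER_MILES / 2) * NAUTICAL_PER_STATUTE_MILE * d


# (latitude north, longitude west), both as positive decimal degrees
CHICAGO = (dms(41, 52), dms(87, 37))
NEW_YORK = (dms(40, 40), dms(73, 58))

distance_nm = great_circle_nautical_miles(*CHICAGO, *NEW_YORK)
chicago_is_north = CHICAGO[0] > NEW_YORK[0]   # larger north latitude
chicago_is_west = CHICAGO[1] > NEW_YORK[1]    # larger west longitude
```

The last two lines show the direction relationships mentioned in the text: they follow from comparing coordinates directly, with no distance computation needed.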

Spatial relationships may be both geometric and topological (spatially related but without measurable distance or absolute direction). Examples of topological relations include such properties as adjacency, connectivity, and containment. For example, whether some building is inside or outside of the city limits of Chicago has to do with the building's relationship to an arbitrary boundary, but the distance or direction between the two is not an issue. Topological directions may have no particular relationship to any coordinate system that they might be embedded in. "Left" and "Right" are valid directions only in relation to the observer's frame of reference and have no absolute relationship with "North" or "West".

Spatial and geographic queries combine both geometric and topological elements. Frew, et al. (1995) suggest that there are two primary classes of requests from users, the "What's here" query and the "Where's this" query. The first type of query stems from a desire to discover what information is available about a particular location, while the second stems from a desire to find out where certain phenomena occur. Within this simple classification of spatial and geographic queries, there are a number of different types of queries, distinguished by how the locations of interest are defined. The following discussion is based on the types of spatial queries defined by Laurini and Thompson (1992) and De Floriani, et al. (1993).

Types of spatial queries

The types of spatial queries submitted by users to an information system such as a Digital Library may be arbitrarily complex in the types of information desired, the limitations on the areas, time periods, etc. covered, and many other conditions (spatial or not) that might be specified in such a query. If we concentrate on only the spatial or geographic aspects of the query, there are a number of query types that can be distinguished based on the type of information provided by the user in the query. We will consider five types of spatial queries here:

1. Point-in-polygon queries.

2. Region queries.

3. Distance and Buffer Zone queries.

4. Path queries.

5. Multimedia queries.

The last is actually a combination of multiple geo-referenced sources in a single query. In the following discussion we will examine each of these query types and their characteristics.

The first type of query is probably the most straightforward to process and describe. This is the point-in-polygon query, illustrated in Figure 2, which essentially asks the question "What do we have at this X,Y point in the current coordinate system?" The point-in-polygon query, in a digital library context, might ask which satellite images are available that show a particular spot, or which documents describe the place indicated by the point. The query essentially asks for any georeferenced object or geographic dataset that contains, surrounds or refers to a particular spot on the surface of the earth. This is one of the most precise of the spatial query types discussed here.
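One common way to answer such a query is the ray casting (even-odd) test, sketched below. The paper does not say which algorithm a given system uses, so this is an assumption chosen for brevity; the footprint names and coordinates are hypothetical, and points lying exactly on an edge are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def whats_here(x, y, footprints):
    """Return the names of all footprints that contain the point."""
    return [name for name, poly in footprints.items()
            if point_in_polygon(x, y, poly)]


# Hypothetical footprints of three geo-referenced objects.
footprints = {
    "image_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "report_b": [(3, 3), (6, 3), (6, 6), (3, 6)],
    "map_c": [(10, 10), (12, 10), (11, 12)],
}

hits = whats_here(3.5, 3.5, footprints)  # the point lies in two overlapping footprints
```

In a large collection a spatial index would normally narrow the candidate footprints before this test is applied to each one.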

Figure 2. Point-in-polygon query

Figure 3. Region query

Figure 4. Distance and Buffer Zone query

The next type of query is a Region query, illustrated in Figure 3. A region query asks the question "What do we have in this region?" Instead of referring to a particular point in the coordinate space, a region query defines a polygon in that space and asks for information regarding anything that is contained in, adjacent to, or overlaps the polygonal area so defined. There are a number of potential variants or restrictions that might be applied. For example, a user might be asking "Which point encoded items lie within the region," "What lines (borders, rivers, etc.) lie within or cross the region," "Which areas (or regional datasets) overlap this region," "Which areas (or regional datasets) lie entirely within this region", or "Which areas share a border with this region." Any combination of elements or containment criteria might be specified, given the needs of the particular searcher. In addition, the specified query region can be any polygon, ranging from regular shapes such as rectangles or even circles (which would be the same as a Buffer Zone query on a point as discussed below), to irregular shapes like the boundary of a city, or any arbitrary set of points defining a closed polygonal shape. The containment criteria need not be precise, but may use "fuzzy" or probabilistic interpretations of such things as the maximum or minimum areas of overlap for an object to be considered included in the specified area, or the coverage areas for particular datasets that are candidates for retrieval (Brimicombe, 1993).
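The overlap criteria can be illustrated with a sketch restricted to rectangular regions. It assumes that both the query region and the item footprints are axis-aligned bounding boxes that do not cross the antimeridian, and it uses the fraction of an item's area falling inside the region as a simple stand-in for the "fuzzy" containment measures mentioned above. The `Box` type, the threshold parameter, and the sample data are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def overlap_fraction(item, region):
    """Fraction of the item's area that falls inside the query region."""
    width = min(item.east, region.east) - max(item.west, region.west)
    height = min(item.north, region.north) - max(item.south, region.south)
    if width <= 0 or height <= 0:
        return 0.0  # disjoint, or touching only along an edge
    item_area = (item.east - item.west) * (item.north - item.south)
    return (width * height) / item_area


def region_query(region, items, min_overlap=0.0):
    """Items whose overlap exceeds min_overlap, largest overlap first."""
    hits = []
    for name, box in items.items():
        fraction = overlap_fraction(box, region)
        if fraction > min_overlap:
            hits.append((name, fraction))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


region = Box(0, 0, 10, 10)
items = {
    "inside": Box(2, 2, 4, 4),         # lies entirely within the region
    "straddling": Box(8, 8, 12, 12),   # a quarter of its area overlaps
    "outside": Box(20, 20, 22, 22),    # no overlap at all
}

any_overlap = region_query(region, items)
mostly_inside = region_query(region, items, min_overlap=0.5)
```

A fraction of 1.0 corresponds to the "lie entirely within this region" variant, any positive fraction to the "overlap this region" variant, and raising `min_overlap` gives a cutoff between the two.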

The next type of query is the Distance and Buffer Zone query, illustrated in Figure 4. The distance and buffer zone query asks the question "what do we have within some fixed distance of this object (point, line or polygon)." Obviously there are quite different processing steps involved if the object used as the basis for a buffer zone query is a point, a line, or a polygon. Examples include queries such as "What cities lie within 40 miles of the border of Northern and Southern Ireland?" as shown in Figure 4. Other buffer zone queries include: "What industrial plants lie within 2 miles of this river?", "which streams are within 100 yards of this highway?", "what mines are within 5 miles of this city?", etc. The buffer zone specified need not be exact, e.g., "what datasets describe the area near this point?" and inclusion can be considered a fuzzy or probabilistic function based on the location of the database objects. For such queries, a ranked list of database objects ordered by "nearness" to the point, might be a better response than an arbitrary definition of a distance.
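A buffer zone query on a point, and the ranked "nearness" alternative, can be sketched as follows. The example assumes planar coordinates in a local projected system so that straight-line distance is meaningful, and it handles only point features around a point; buffers around lines or polygons would need a distance-to-segment computation. All names and coordinates are hypothetical.

```python
import math


def buffer_query(center, features, radius):
    """Point features within `radius` of `center`, nearest first."""
    cx, cy = center
    hits = []
    for name, (x, y) in features.items():
        distance = math.hypot(x - cx, y - cy)
        if distance <= radius:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


def nearest(center, features, k):
    """Ranked alternative: the k closest features, with no fixed cutoff."""
    return buffer_query(center, features, math.inf)[:k]


# Hypothetical mine locations, in miles on a local planar grid.
mines = {"mine_a": (3, 4), "mine_b": (6, 8), "mine_c": (0, 1)}
city = (0, 0)

within_5_miles = buffer_query(city, mines, radius=5)
two_nearest = nearest(city, mines, k=2)
```

The first call applies an exact cutoff, whereas the second simply ranks by distance, which matches the suggestion that a ranked list may serve an imprecise "near this point" request better than an arbitrary radius.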

Path queries are a somewhat more specialized form of spatial query that require the presence of a network structure in the spatial or geographic data. Networks are simply sets of interconnected line segments, representing such things as roads, oil or water pipelines, etc. A typical sort of path query involves finding the shortest route from one point in the network to another. For example, a path query might ask "What is the shortest route from San Francisco to Los Angeles?" (as shown in Figure 5). Note that path queries can become more complex (and uncertain) multimedia queries when criteria other than distance or direction are involved in the query. For example, to answer the question "What is the fastest route from San Francisco to Los Angeles?" more information, such as speed limits and traffic conditions on different routes of the network, is required to provide even an approximate answer to the question.
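A shortest-route query over such a network is commonly answered with Dijkstra's algorithm, sketched here on a toy graph. The paper does not prescribe an algorithm, so this choice is an assumption; the node names and edge weights are invented for illustration and are not real road distances or travel times.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; graph maps node -> {neighbor: non-negative weight}."""
    queue = [(0.0, start, [start])]  # (cost so far, node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue  # already settled through a cheaper route
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # goal is not reachable from start


# The same hypothetical network weighted two ways: by distance and by travel time.
by_distance = {
    "A": {"B": 50, "C": 80},
    "B": {"A": 50, "D": 120},
    "C": {"A": 80, "D": 70},
    "D": {"B": 120, "C": 70},
}
by_hours = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"A": 1.0, "D": 1.5},
    "C": {"A": 2.0, "D": 1.5},
    "D": {"B": 1.5, "C": 1.5},
}

shortest = shortest_path(by_distance, "A", "D")
fastest = shortest_path(by_hours, "A", "D")
```

Changing only the edge weights changes the answer, which is why the "fastest route" question needs additional data sources even though the search procedure itself stays the same.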

Figure 5. Path Query

Multimedia queries combine multiple geo-referenced information sources in resolving a query. This may include multiple maps (or map layers, depending on the sort of system used to resolve the query). It may also include non-map geo-referenced information, such as ownership records for particular parcels of land. An example, illustrated in Figure 6, might be the query "What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?" Answering this query involves not only map information, such as county boundaries and river locations, but also cadastral information to show who owns particular parcels of land along the rivers in the areas affected by flooding. In this particular query, complex operations are likely to be required, such as combining aerial or satellite photographs or remote sensing data (showing the extents of the flooding) and map and cadastre information, often from different databases, with different measurements, scales and levels of detail.

Figure 6. Multimedia query

The types of queries discussed above can be combined in a complex search. For example, "What streams and rivers flow through the county in which the town of Richmond (California) is located?" would require a point-in-polygon search of county information to locate the county containing the city, and a region search to identify the streams and rivers that intersect the county area. Obviously, any GIR system should combine the text or concept-based retrieval associated with conventional information and database systems with the sorts of spatial queries discussed above. Any multimedia information system may include a wide variety of spatial and non-spatial information that may have a geographic association, if not a precise location (see, for example, Griffiths (1989)). Walker, Newman, Medyckyj-Scott and Ruggles (1992) provide an interesting design for a system combining spatial, text, and concept-based retrieval.

Spatial Browsing

Searching a geographically indexed database or Digital Library is an activity that assumes the searcher has a good notion of what he or she wants, and is able to specify that need in some form. Most of the queries used as examples in the above discussion reflect this. Another type of "searching" is much less directed, and while it assumes that the users have some notion of the type of information desired, they may not be able to specify that information in a query language. What is needed in such cases is (in effect) the ability to navigate the database geographically, without requiring explicit query formulation. This "spatial browsing" combines ad hoc spatial querying with interactive displays of digital maps to permit the user to explore the geographical dimension of information in a database or Digital Library.

Laurini and Thompson (1992) describe spatial browsing using the "hypermap" concept. In hypertext databases (the current best example being the World Wide Web), each document (or node) may contain many links to other documents in a variety of media (text, images, video, sound clips, etc.) and the user may view any referenced document simply by selecting the representation of the link in the current document. In a hypermap, the links are represented by an icon or footprint (a polygon that outlines the area described by the object linked to the footprint), and selection brings up the document referenced by the link. For example, Figure 7 shows a sequence of maps that might be presented to someone browsing a Digital Library, going from a global view to Europe, then to the UK, then to Ireland, and finally to a particular icon on the Ireland map representing a book about the region.

Figure 7. Spatial Browsing

There are a number of advantages to spatial browsing systems and the hypermap concept as a user interface "metaphor". These systems are often very intuitive and comprehensible (assuming that the user has some notion of geography) and can provide for both searching and browsing by direct interaction, as opposed to specification of names or coordinates. In most cases, for the purposes of browsing and search specification, the digital map displayed to the user need not be highly detailed, nor does it require the accuracy of a full GIS.

There are also a number of potential problems, or requirements, for such systems. One problem is that of clutter in the display. If all of the icons or footprints representing all of the documents in a large database associated with any geographic area visible on the digital map are shown simultaneously, the map may disappear entirely beneath a heap of icons. This sort of clutter can be addressed in several ways, some of which we will discuss further below in describing the geographic browser toolkit. Another obvious requirement for spatial browsing is that there must be coordinate-based geographical indexing of the database. In the following section, methods of automatic indexing and automatic geo-referencing of text documents will be examined. We will return to the notion of spatial browsing and look at some examples in subsequent sections.

Geographic and Spatial Indexing

Not surprisingly, one of the major sources of information in digital libraries is text in a variety of forms and from a variety of sources. These text items might include full-text documents such as journal or encyclopaedia articles, books, technical reports, and more specialized documents such as Environmental Impact Reports (EIRs), laws and legislation. Many of these text documents describe, discuss, or refer to particular places or regions. Geographic location is often an important, or even the primary, criterion when searching for information from the digital library.

In traditional library cataloging practice, geographic references have been a common form of access point assigned to documents (primarily books and maps), but assignment was based on the cataloger's notion of whether geographic identification was deemed important for access to the document. Although it might be possible in principle to have catalogers evaluate each item that is entered into a digital library for geographic references in its content, such detailed cataloging would be prohibitively expensive. One goal of many digital library projects is to automate as much of the indexing and cataloging of documents as possible. An important component of such automatic indexing is to develop methods that can perform automatic geo-referencing of text documents. By automatic geo-referencing, I mean to automatically index and retrieve a document according to the geographic locations discussed, displayed, or otherwise associated with its content.

In most existing full-text and bibliographic information retrieval systems, searches with a geographical component, such as the point-in-polygon, region or multimedia query "locate any documents whose contents are about location XY", are not supported directly by indexing, query, or user interface functions. Instead, these searches rely on indexing and query specification of place names, either supplied by catalogers or extracted from the text itself, essentially as a side-effect of keyword indexing. Even in cases where a document is meticulously manually indexed, geographic index terms consisting of text strings (such as LCSH and LC name authorities) have several well-documented problems with ambiguity, synonymy and with name changes over time (Griffiths, 1989; Holmes, 1990). Specifically, the major problems are:

1. Names are not unique: San Jose is a common city name throughout Central and South America, as well as in California. Without additional qualifications, many place names are ambiguous.

2. The places referred to change size, shape and names over time: Political changes in the world move much faster than geological changes, and borders, country and region names, even the existence of political entities may change at any time.

3. Spelling variations: Local names for a region may differ from common English forms, and there may be variations in the spelling of a name over time (Peking, Beijing).

4. Some place names in texts are simply temporary conventions: In some scientific studies, as well as in some historical contexts, particular names may be created by scholars to describe an area or region (study areas, battlefields, etc.) that are not part of the conventional political names of a region, but which may be very precisely defined for the purposes of the study.

Instead of, or in addition to, using place names to describe locations referred to in documents, digital libraries are using the geographic coordinates of places to provide better access to those documents dealing with those locations. Geographic coordinates have several advantages over names:

1. They are persistent regardless of name, political boundary or other changes. A geographic location specified by coordinates is not dependent on the vagaries of politics, warfare, or synonymy.

2. They can be simply connected to spatial browsing interfaces and GIS data. As discussed in subsequent sections, coordinate based locations, or representation of documents associated with those locations, can be displayed and overlaid on digital maps.

3. They provide a consistent framework for GIR applications and spatial queries. Having geographic coordinates for an object (whether specified as a point or as a polygonal region) as index entries permits precise or approximate spatial querying of the database using all the types of spatial searching discussed above.

The challenge is to provide reliable automatic indexing for geographic locations, based on the names that occur in a text. Towards this end the GIPSY system was developed at U.C. Berkeley by Ph.D. students Allison Woodruff and Christian Plaunt (1994a, 1994b).

GIPSY: Automatic Geo-referencing of Text

GIPSY, the Geo-referenced Information Processing System, was developed as a new model of automatic geographic indexing for text documents. In the GIPSY model, words and phrases containing geographic place names or geographic characteristics are extracted from documents and used to provide evidence for probabilistic functions using elementary spatial reasoning and statistical methods to approximate the coordinates of the location being referenced in the text. The actual "index terms" assigned to a document are a set of coordinate polygons that describe an area on the Earth's surface in a standard geographical projection system. The GIPSY method for automatic geo-referencing is described in detail by Woodruff and Plaunt (1994a) and will only be summarized here.

GIPSY uses a three-step algorithm which relies on a thesaurus or gazetteer containing place names and the names of other geographically significant objects (rivers, lakes, bioregions, animal and plant habitats, land use types, etc.).

Step 1: Identifying geographic place names and phrases.

This step attempts to locate all relevant content-bearing geographic words and phrases in a text. This involves parsing the text using a parser that "understands" how to identify geographic terminology of two types:

1. Terms which exactly or closely match objects or attributes in the geographic thesaurus. This step requires a large gazetteer of geographic names and terms along with their geographic coordinates. Terms added to this thesaurus include generic terms for geological features, climate, land use, animal and plant species, and size.

2. Lexical constructs which contain spatial or topological information, such as "adjacent to the lake", "south of the river", "between the river and the highway", etc.

To implement this, a list of the most commonly occurring constructs must be created and integrated into a thesaurus/gazetteer.
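As a rough illustration of this step, the following Python sketch matches text against a tiny gazetteer and a short list of spatial constructs. The gazetteer entries, the coordinates, the modifier list, and the function name are all hypothetical; the actual GIPSY parser and thesaurus are described by Woodruff and Plaunt (1994a) and are considerably richer.

```python
import re

# Hypothetical miniature gazetteer: name -> (latitude, longitude).
# The coordinates are rough illustrative values, not GNIS data.
GAZETTEER = {
    "lake tahoe": (39.1, -120.0),
    "santa barbara": (34.4, -119.7),
    "san luis obispo": (35.3, -120.7),
}

# A few spatial constructs of the kind mentioned above (assumed, not exhaustive).
SPATIAL_MODIFIERS = ["north of", "south of", "east of", "west of", "adjacent to"]


def extract_geographic_terms(text):
    """Return (modifier, place_name) pairs found in text; modifier may be None."""
    modifier_alt = "|".join(re.escape(m) for m in SPATIAL_MODIFIERS)
    results = []
    for name in GAZETTEER:
        # Optionally capture a spatial construct directly before the place name.
        pattern = re.compile(
            r"(?:\b(" + modifier_alt + r")\s+(?:the\s+)?)?\b" + re.escape(name) + r"\b",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            modifier = match.group(1).lower() if match.group(1) else None
            results.append((modifier, name))
    return results


terms = extract_geographic_terms(
    "The site lies south of Lake Tahoe, far from Santa Barbara."
)
```

Exact string matching is used here only for brevity; terms that "closely match" thesaurus entries would need some form of approximate matching.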

Step 2: Locating pertinent data.

The output of the first step is a set of extracted terms and phrases. In the second step these terms and phrases are processed by a function which retrieves geographic coordinate data related to them. This step uses spatial data sets that provide information such as the names, sizes, and locations of cities, states, etc.; names and locations of endangered species; names, locations, and bioregional characteristics of different climatic regions; etc. The system attempts to identify the spatial locations (a set of one or more geographic coordinates) which most closely match the geographic terms extracted in the first step. In some cases, where geographic modifiers are used, the area of coverage is modified to take into account the usage in the text. For example, the phrase "south of Lake Tahoe" might be mapped to the area south of Lake Tahoe and cover approximately the same volume. Since geopositional data is also available for land use (cities, schools, industrial areas, etc.) and habitats (wetlands, rivers, forests, indigenous species, etc.), extracted keywords and phrases for these types of data are also recognized in step one, and their locational information is extracted in this step. The thesaurus entries for this data should incorporate several other types of information, such as synonymy (e.g. Latin and common names of species) and membership (e.g. wetlands contain cattails, but geopositional data on cattails may not exist, so we must use their mention as weak evidence of a discussion of wetlands, and use that data instead).
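The effect of a geographic modifier can be sketched as follows. This is only one plausible reading of the "south of Lake Tahoe" example, assuming that a feature is approximated by a bounding box and that the modified region is a box of the same size displaced in the named direction. The function name and the coordinates are hypothetical.

```python
def apply_modifier(bbox, modifier):
    """Shift a (west, south, east, north) box by its own extent in the named direction.

    Assumption: the modified region has the same size as the original feature.
    Unknown modifiers leave the box unchanged.
    """
    west, south, east, north = bbox
    width, height = east - west, north - south
    if modifier == "south of":
        return (west, south - height, east, south)
    if modifier == "north of":
        return (west, north, east, north + height)
    if modifier == "west of":
        return (west - width, south, west, north)
    if modifier == "east of":
        return (east, south, east + width, north)
    return bbox


# Very rough illustrative box around Lake Tahoe (not taken from GNIS).
lake_tahoe = (-120.25, 38.75, -119.75, 39.25)
south_of_tahoe = apply_modifier(lake_tahoe, "south of")
```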

In the prototype implementation of GIPSY two primary data sets were adopted to construct the thesaurus and provide geographic locations. The first was a subset of the US Geological Survey's Geographic Names Information System (GNIS) (USGS, 1985). The information extracted from the GNIS database contains latitude/longitude point coordinates associated with over 60,000 geographic place names in California. The second, providing land use and habitat data, was derived from the US Geological Survey's Geographic Information Retrieval and Analysis System (GIRAS) (Anderson, Hardy, Roach & Witmer, 1976).

The names and terms derived from the text may be associated with more than one location, so every identified name, phrase, or region description is associated with all of the coordinate points or polygons that might potentially be the place mentioned in the text. A probabilistic weight is assigned to each of these coordinate sets based on statistical information such as the frequency of use of its associated term or phrase in the text being indexed and in the thesaurus. Many relevant terms do not exactly match place names, geographic features, or land use types in the database. Therefore, to accommodate these inexact associations between the text and the coordinate databases, the thesaurus was extended to include both manually inserted terms and generic term relationships extracted from the WordNet thesaurus (Miller, Beckwith, Fellbaum, Gross & Miller, 1990), including synonyms, hyponyms, hypernyms, meronyms, holonyms, and evidonyms.
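The weighting of ambiguous names can be illustrated with a small sketch. The formula below is an assumption made purely for illustration, not the GIPSY weighting function: each term receives its share of all extracted term occurrences, and that share is divided evenly among the candidate locations listed for the term. The thesaurus entries and location identifiers are hypothetical.

```python
from collections import Counter


def candidate_weights(extracted_terms, thesaurus):
    """Map (term, location_id) -> weight.

    Assumed formula: a term's share of all extracted occurrences,
    divided evenly among the locations the thesaurus lists for it.
    """
    counts = Counter(extracted_terms)
    total = sum(counts.values())
    weights = {}
    for term, count in counts.items():
        locations = thesaurus.get(term, [])
        for location_id in locations:
            weights[(term, location_id)] = (count / total) / len(locations)
    return weights


# Hypothetical thesaurus: one unambiguous name and one ambiguous name.
thesaurus = {
    "santa ynez": ["santa_ynez_valley"],
    "mission hills": ["mission_hills_a", "mission_hills_b"],
}
weights = candidate_weights(
    ["santa ynez", "santa ynez", "santa ynez", "mission hills"], thesaurus
)
```

Under this assumed scheme a frequently mentioned, unambiguous name contributes far more evidence than a name mentioned once that could refer to several places.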

Step 3: Overlaying polygons to estimate approximate locations.

Having identified many places associated with the terms extracted from the text and their variants, the next step is to attempt to infer the most likely geographic location(s) for the areas discussed in the text. Each geographic phrase, the probabilistic weight, and the coordinates derived in the preceding step can be represented as a three-dimensional "extruded" polygon with its base in the plane of the x,z axes and which extends upward on the y axis a distance proportional to its weight (Figure 8a). As new polygons are added, three cases may arise.

1. If the base of a polygon being added does not intersect with the base of any other polygons, it is simply laid on the base map beginning at y = 0 (Figure 8b).

2. If the polygon being added is completely contained within a polygon which already exists on the skyline, it is laid on top of that extruded polygon, i.e., its base is positioned in a higher y plane (Figure 8c).

3. If the polygon being added intersects but is not wholly contained by one or more polygons, the polygon being added is split, and the intersecting portion is laid on top of the existing polygon and the non-intersecting portion is laid at a lower level. To minimize fragmentation in this case, polygons are sorted by size prior to being positioned in the "skyline" created by overlaying the polygons (Figure 8d).

Figure 8a: The "weight" of a polygon, indicated by the vertical arrow, is interpreted as a thickness or "elevation".

Figure 8b: Two adjacent polygons do not affect each other; each is merely assigned its appropriate elevation.

Figure 8c: When one polygon subsumes another, their elevations in the area of overlap are summed.

Figure 8d: When two polygons intersect, their elevations are summed in the area of overlap.

In effect, the polygons are "summed" by weight to form a geopositional "skyline" whose peaks approximate the geographical locations being referenced in the text. The geographic coordinates to assign to the text segment being indexed are determined by choosing a threshold "elevation" y in the skyline, taking the plane parallel to the x,z plane at that elevation, and using the polygons at that "elevation". Raising the threshold "elevation" tends to increase the accuracy of the retrieval while lowering it tends to include other similar regions (or regions described in the same way as a region discussed in a given text).
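The summing and thresholding can be sketched on a raster grid. This is a simplification under stated assumptions: polygons are reduced to axis-aligned rectangles of grid cells, so no polygon splitting is required, and all names and values are hypothetical. It reproduces only the end result (summed elevations per location), not the polygon overlay procedure used by GIPSY.

```python
def build_skyline(rectangles, width, depth):
    """Sum the weights of all rectangles covering each grid cell.

    Each rectangle is (x0, z0, x1, z1, weight) and covers cells
    x0 <= x < x1, z0 <= z < z1 of the base plane.
    """
    skyline = [[0.0] * width for _ in range(depth)]
    for x0, z0, x1, z1, weight in rectangles:
        for z in range(z0, z1):
            for x in range(x0, x1):
                skyline[z][x] += weight
    return skyline


def cells_above(skyline, threshold):
    """Return (x, z) cells whose summed elevation reaches the threshold."""
    return [
        (x, z)
        for z, row in enumerate(skyline)
        for x, elevation in enumerate(row)
        if elevation >= threshold
    ]


rectangles = [
    (0, 0, 3, 3, 1.0),   # large region
    (1, 1, 2, 2, 0.5),   # wholly contained in the first
    (2, 2, 4, 4, 0.25),  # partially overlaps the first
]
skyline = build_skyline(rectangles, width=4, depth=4)
index_cells = cells_above(skyline, threshold=1.25)
```

With a threshold of 1.25 only the cells where regions overlap survive; lowering the threshold to 1.0 would admit the whole of the large region.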

To show the results of this process in the GIPSY prototype, consider the following text from a publication of the California Department of Water Resources:

The proposed project is the construction of a new State Water Project (SWP) facility, the Coastal Branch, Phase II, by the Department of Water Resources (DWR) and a local distribution facility, the Mission Hills Extension, by water purveyors of northern Santa Barbara County. This proposed buried pipeline would deliver 25,000 acre-feet per year (AF/YR) of SWP water to San Luis Obispo County Flood Control and Water Conservation District (SLOCFCWCD) and 27,723 AF/YR to Santa Barbara County Flood Control and Water Conservation District (SBCFCWCD). ... This extension would serve the South Coast and Upper Santa Ynez Valley. DWR and the Santa Barbara Water Purveyors Agency are jointly producing an EIR for the Santa Ynez Extension. The Santa Ynez Extension Draft EIR is scheduled for release in spring 1991.

Figure 9 contains a gridded representation of the state of California, which is elevated to distinguish it from the base of the grid. The northern part of the state is on the left-hand side of the image. The towers rising over the state's shape represent polygons in the skyline generated by GIPSY's interpretation of the text. The largest towers occur in the area referred to by the text, primarily centered on Santa Barbara County, San Luis Obispo, and the Santa Ynez Valley area.

The surface plots generated in this fashion can also be used for browsing and retrieval. For example, the two-dimensional base of a polygon with a thickness above a certain threshold can be assigned as a coordinate index to a document. These two-dimensional polygons might then be displayed as icons or "footprints" on a map browser such as those discussed below. In addition, a natural language query describing an area of interest could be processed by the GIPSY system and candidate coordinate sets could be generated and ranked according to their weights, and then used to retrieve geo-referenced information located in those areas.
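A minimal sketch of such query processing follows, assuming that both the candidate query regions and the document footprints are simplified to bounding boxes and that a document is scored by the highest-weighted query region it overlaps. The scoring rule, document identifiers, and coordinates are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two (west, south, east, north) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def retrieve(query_candidates, document_footprints):
    """Rank documents by the highest-weighted query region their footprint overlaps.

    query_candidates: list of (box, weight) pairs derived from the query text.
    document_footprints: dict mapping document id -> box.
    """
    scores = {}
    for region, weight in query_candidates:
        for doc_id, footprint in document_footprints.items():
            if boxes_overlap(region, footprint):
                scores[doc_id] = max(scores.get(doc_id, 0.0), weight)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


query_candidates = [((0, 0, 2, 2), 0.75), ((5, 5, 6, 6), 0.25)]
document_footprints = {
    "draft_eir": (1, 1, 3, 3),
    "habitat_survey": (5.5, 5.5, 7, 7),
    "unrelated_report": (10, 10, 11, 11),
}
ranked = retrieve(query_candidates, document_footprints)
```

Documents whose footprints overlap none of the candidate regions are simply omitted from the ranking.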

Ongoing research and development of the GIPSY system is being conducted at Berkeley in conjunction with the NSF/NASA/ARPA Digital Library Initiative project (Wilensky et al., 1994). We plan to use GIPSY as part of the automatic indexing mechanism for all texts stored in the Digital Library database.
