
Indexing

The current indexing method runs as a daemon that is invoked whenever a new bibliographic record or full-text document is appended to the database. A number of POSTGRES database relations (or classes, in POSTGRES terminology) are used to support the indexing and retrieval process. These classes and their logical linkages are shown in Figure 2. The wn_index class contains the complete WordNet dictionary. It provides the normalizing basis for terms used in indexing text elements of the database; that is, all terms extracted from data elements in the database are converted to the word form used in this class. All other references to terms in the indexing process are actually references to the unique IDs assigned to words in this class. The wn_index dictionary contains both individual words and common phrases, although in the current implementation only single words are used for indexing purposes.
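The normalization step described above can be illustrated with a minimal Python sketch. It is not the system's code: the dictionary contents, the VARIANTS table, and the function name normalize_terms are all hypothetical, and the choice to skip words that are missing from the dictionary is an assumption made only for this example.

```python
import re

# Hypothetical stand-in for the wn_index class: word form -> unique term ID.
# The phrase entry is present in the dictionary but is never matched, because
# only single words are used for indexing.
WN_INDEX = {"index": 1, "retrieval": 2, "database": 3, "information retrieval": 4}

# Hypothetical table mapping surface variants to the dictionary word form.
VARIANTS = {"indexes": "index", "indexing": "index", "databases": "database"}


def normalize_terms(text):
    """Return the wn_index IDs of the single words found in text, in order."""
    ids = []
    for token in re.findall(r"[a-z]+", text.lower()):
        word = VARIANTS.get(token, token)
        term_id = WN_INDEX.get(word)
        if term_id is not None:  # assumption: unknown words are skipped
            ids.append(term_id)
    return ids
```

Under these assumptions, every later stage works with the integer IDs only and never with the original strings.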

The kw_term_doc_rel class provides a linkage between a particular item or data element (both considered documents) and a particular term from the wn_index class. The raw frequency of occurrence of the term in the document is included in the kw_term_doc_rel tuple. The kw_doc_index class stores information on individual documents in the database, including a unique document ID, where the document is located (what class, attribute and tuple contain it), and whether it is a simple attribute or a large object (with effectively unlimited size). Additional statistical information (such as the number of unique terms found in the document) is also maintained in the kw_doc_index class. The kw_sources class contains information on the classes and attributes indexed at the class level, as well as statistics such as the number of items indexed from any given class. The other classes shown in Figure 2 are involved in the mechanics of indexing and retrieval.
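As a rough illustration of how these three classes relate, the following Python sketch models one tuple of each as a dataclass. The class names follow the text, but every field name is hypothetical; the actual attribute names in the POSTGRES schema are not given here.

```python
from dataclasses import dataclass


@dataclass
class KwTermDocRel:
    term_id: int   # unique ID of the word in wn_index
    doc_id: int    # unique ID of the document in kw_doc_index
    raw_freq: int  # raw frequency of the term in the document


@dataclass
class KwDocIndex:
    doc_id: int
    source_class: str      # class that contains the document
    source_attribute: str  # attribute that contains the document
    source_tuple: int      # hypothetical identifier of the containing tuple
    is_large_object: bool  # False for a simple attribute
    unique_terms: int = 0  # number of unique terms found in the document


@dataclass
class KwSources:
    class_name: str
    attribute_name: str
    items_indexed: int = 0  # number of items indexed from this class


def terms_for_document(postings, doc_id):
    """Return {term_id: raw_freq} for one document from a list of KwTermDocRel."""
    return {p.term_id: p.raw_freq for p in postings if p.doc_id == doc_id}
```

A document's term list can then be recovered by filtering the term-document tuples on the document ID, which is the linkage described above.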

The POSTGRES rules system is used both to ensure that the elements of the bibliographic records are stored in their appropriate normalized form and to trigger the indexing daemon. Whenever an attribute in the database is defined as indexable for IR purposes (by appending a new tuple to kw_sources), a rule is created that appends the class name and attribute name to the kw_index_flags class whenever a new tuple is appended to the class containing that attribute. Another rule then starts the indexing process for the newly appended data. This trigger process is shown in Figure 3.
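The two-rule trigger chain can be imitated in plain Python to make the control flow concrete. This is a simulation only and does not use POSTGRES rule syntax: MiniDatabase, define_indexable, run_indexer and the sample class name bib_record are hypothetical, and the behavior of removing a flag once it has been processed is an assumption.

```python
class MiniDatabase:
    """Toy model: classes are lists of tuples, rules fire on append."""

    def __init__(self):
        self.classes = {"kw_sources": [], "kw_index_flags": []}
        self.rules = {}    # class name -> callables run after each append
        self.indexed = []  # what the stand-in indexing daemon has processed
        # Second rule: any append to kw_index_flags starts the indexer.
        self.rules["kw_index_flags"] = [lambda row: self.run_indexer()]

    def append(self, class_name, row):
        self.classes.setdefault(class_name, []).append(row)
        for rule in list(self.rules.get(class_name, [])):
            rule(row)

    def define_indexable(self, class_name, attribute):
        # Declaring an attribute indexable means appending to kw_sources ...
        self.append("kw_sources", (class_name, attribute))

        # ... which creates the first rule: flag every new tuple of the class.
        def flag_rule(_row):
            self.append("kw_index_flags", (class_name, attribute))

        self.rules.setdefault(class_name, []).append(flag_rule)

    def run_indexer(self):
        # Stand-in for the daemon: consume pending flags (assumed behavior).
        flags = self.classes["kw_index_flags"]
        while flags:
            self.indexed.append(flags.pop(0))
```

For example, after `db.define_indexable("bib_record", "title")`, a call to `db.append("bib_record", {"title": "..."})` flags the class and attribute and immediately hands them to the stand-in indexer, while appends to undeclared classes trigger nothing.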

The indexing process extracts each unique keyword from the indexed attributes of the database and stores it in kw_term_doc_rel, along with a pointer to the record it came from and the frequency of occurrence of that term in that field or document. This process is shown in Figures 4 and 5. Other global frequency information is also maintained by the indexing daemon and the rules system, so that, for example, the overall frequency of occurrence of terms in the database and the total number of indexed items are available for retrieval processing. The indexing daemon attempts to perform any outstanding indexing tasks before it shuts down. It also updates the kw_doc_index tuple for a given indexable class and attribute with a timestamp for the last item indexed. This permits ongoing incremental indexing without having to re-index older tuples.
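The counting and incremental steps described above might be sketched as follows. The function name index_new_items, the tuple layouts, and the use of integer timestamps are assumptions made for the example; the sketch only shows how a stored timestamp lets a later pass skip items that were already indexed.

```python
from collections import Counter


def index_new_items(items, last_indexed, postings, global_freq):
    """Index every item newer than last_indexed and return the new timestamp.

    items:       list of (doc_id, timestamp, term_ids) tuples
    postings:    list that receives (term_id, doc_id, raw_freq) tuples,
                 standing in for kw_term_doc_rel
    global_freq: Counter of term_id -> total occurrences in the database
    """
    newest = last_indexed
    for doc_id, stamp, term_ids in items:
        if stamp <= last_indexed:
            continue  # already covered by an earlier indexing pass
        counts = Counter(term_ids)
        for term_id, freq in sorted(counts.items()):
            postings.append((term_id, doc_id, freq))
        global_freq.update(counts)
        newest = max(newest, stamp)
    return newest
```

Calling the function again with the returned timestamp and a longer item list processes only the newly appended items, so older tuples are never re-indexed.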






Wed Mar 2 13:42:59 PST 1994