
The Cheshire II Search Engine

A prototype version of the Cheshire system was developed as an experimental laboratory for examining advanced retrieval methods for online catalogs and was tested against Boolean and alternative advanced retrieval methods. The results of information retrieval experiments indicated that the classification clustering method developed for the Cheshire system overcame one of the major problems of using MARC records with advanced retrieval methods: the limited topical information available in the record (generally only a title and a small number of subject headings). It does so by automatically grouping terms derived from the same classification area[15][14].

The Cheshire II Project team re-designed and completely re-wrote the Cheshire prototype software to incorporate the features needed to provide a fully functional ``next-generation'' online catalog and information retrieval service. Towards this end, the search engine supports a variety of search and browsing capabilities, including support for authority-controlled name searching and other conventional online catalog search features, such as a complete Boolean search capability and ``result set'' storage and retrieval. Of particular interest is that the search engine also allows users to enter queries as free text (that is, normal English prose) statements of their interest or need, and can perform probabilistic matching of the query with any indexed elements in the database.

The Cheshire II search engine supports several methods for translating the user's query terms into the vocabulary used in the database. These include field-specific stopword lists; field-specific query-to-key conversion functions; stemming algorithms that reduce significant words to their roots by converting suffix variations, such as plural forms of a word, to a single form; and mapping of database and query text words to a standardized form based on the WordNet dictionary and thesaurus.
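As a rough illustration of the first and third of these methods, the sketch below applies a field-specific stopword list and a crude plural-stripping step to a free-text query. It is not the Cheshire II implementation: the stopword lists, the field names, and the functions strip_plural and normalize_query are all hypothetical, and the suffix rules stand in for a real stemming algorithm.

```python
import re

# Hypothetical field-specific stopword lists; the lists actually used by
# Cheshire II are not given in the text.
STOPWORDS = {
    "title": {"a", "an", "the", "of", "and"},
    "author": {"by", "ed", "and"},
}


def strip_plural(word):
    """Crude suffix conflation: map simple plural forms to a singular form.

    This is a stand-in for a real stemmer, not the algorithm used by the system.
    """
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # libraries -> library
    if len(word) > 3 and word.endswith("es") and word[-3] in "sxz":
        return word[:-2]                # indexes -> index
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # catalogs -> catalog
    return word


def normalize_query(text, field):
    """Lowercase, tokenize, drop the field's stopwords, and conflate plurals."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stop = STOPWORDS.get(field, set())
    return [strip_plural(t) for t in tokens if t not in stop]


if __name__ == "__main__":
    print(normalize_query("The Histories of Libraries and Catalogs", "title"))
```

Because the stopword list is looked up per field, the same query text can yield different keys for a title index than for an author index, which is the point of making the lists field-specific.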

In a two-stage search method developed in the Cheshire prototype, the system uses probabilistic ``best match'' techniques to match a user's initial topical query with a set of classification clusters for the database, so that the clusters are retrieved in decreasing order of probable relevance to the user's search statement. This aids the user in subject focusing and topic/treatment discrimination[14]. The search engine also supports direct probabilistic searching of any indexed field in the SGML records. The probabilistic ranking method used in the Cheshire II search engine is based on the staged logistic regression algorithms developed by Berkeley researchers and shown to provide excellent full-text retrieval performance in the TREC evaluation of full-text IR systems[7][6].
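The general shape of logistic-regression ranking can be sketched as follows: a few match statistics are combined linearly into a log-odds estimate, converted to a probability of relevance, and the candidates are sorted in decreasing order of that probability. This is only a minimal sketch under stated assumptions. The coefficients, the two features, the cluster labels, and the function names are invented for illustration and are not the published Berkeley formula or values.

```python
import math

# Made-up coefficients for illustration only; they are NOT the published values.
INTERCEPT = -3.0
W_MATCH = 4.0     # weight on the fraction of query terms found in the item
W_LENGTH = -0.2   # weight on the log of the item's length in terms


def relevance_probability(query_terms, item_terms):
    """Estimate P(relevant) with a logistic function of simple match features."""
    unique_query = set(query_terms)
    matched = sum(1 for t in unique_query if t in item_terms)
    if matched == 0:
        return 0.0  # assumption: items sharing no terms are not retrieved
    frac = matched / len(unique_query)
    log_odds = INTERCEPT + W_MATCH * frac + W_LENGTH * math.log(len(item_terms))
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query_terms, collection):
    """Return item names in decreasing order of estimated relevance."""
    scored = [(relevance_probability(query_terms, terms), name)
              for name, terms in collection.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for prob, name in scored if prob > 0]


if __name__ == "__main__":
    # Hypothetical classification clusters, each a bag of terms.
    clusters = {
        "QA76": ["computer", "programming", "software", "algorithm"],
        "Z699": ["information", "retrieval", "catalog", "index", "search"],
        "QH": ["biology", "cell", "evolution"],
    }
    print(rank(["information", "retrieval", "software"], clusters))
```

The same scoring function is applied here to any mapping of names to term lists, so it could be pointed at classification clusters in a first stage or at the terms of an individual indexed field, mirroring the two uses described above.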

Although no formal query language or Boolean logic is imposed on the user for this sort of ``best match'' probabilistic searching, Boolean logic is available for those who desire it, and is supported in the graphical user interface for Z39.50 access to other search engines. The system also introduces a probabilistic method of combining Boolean and ranked results within the same query, which is discussed in the next section.


Contact: Ray R. Larson