
Merging Ranked Results

Each search engine produces an ordered set of documents. When a user chooses only one type of search strategy, the result set of that search engine is presented to the user as ranked output. When the user queries the database using the parallel search strategies, the two result sets are merged and presented to the user as a single set. (Presenting two separate sets of documents would add unwanted complexity to the user interface and to the user's search process.) This merging process raises some theoretical and practical issues.

Many researchers have discussed two distinct interpretations of the ordering of relevance-ranked retrieved sets[8][3]. One, often associated with fuzzy-logic and other extended Boolean search algorithms, treats the document rankings as representing degrees of relevance: the weight or rank assigned to any document in the retrieved set reflects a relative measure of that document's relevance. In the other interpretation, associated with probabilistic systems, the rank of each document represents the probability that the document is or is not relevant in absolute terms. This distinction opens up the question of the complexity of relevance judgments, which we need not explore here. Suffice it to say that from the user's perspective this distinction is not critical. Bookstein, among others, has pointed out that users of IR systems interpret probability rankings as relative relevance rankings, or as measures of potential usefulness in satisfying the need expressed in the query[2]. On that justification, we merge the result sets of our extended Boolean and probabilistic search engines.

The practical issues are more complex. Each branch of the parallel search path produces an internally ranked set. Even though we treat the rankings in each set as representing the same measure of usefulness to the user, each document is ranked only relative to the other documents in its own set. Neither ranking represents an absolute measure of usefulness or of satisfaction of the user's query, and the ranking in one set is effectively independent of that in the other. The two sets may have widely different scales or relative ranks, and that difference may or may not represent a significant difference in the assumed measure of usefulness to the user. In order to merge the two sets into one ranked set we need to anticipate some relationship between the sets, and by extension between the Boolean and probabilistic portions of the query. This is done by assigning relative values to the sets, in the form of coefficients in the merging algorithm, based on a simple analysis of the user's query.
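A coefficient-based merge of this kind might be sketched as follows. This is an illustration under stated assumptions, not the Cheshire II merging algorithm, whose formula is not given in this section. It assumes each engine returns a mapping from document ID to score, rescales each set to [0, 1] by min-max normalization so that differing scales become comparable, and combines the two with a weighted sum. The names `normalize` and `merge_ranked`, the normalization step, and the linear combination are all hypothetical choices.

```python
def normalize(scores):
    """Rescale one result set's scores to [0, 1] (min-max; an assumed choice)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tie within this set; treat them as equally useful.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_ranked(boolean_scores, prob_scores, w_bool, w_prob):
    """Merge two result sets into one list ranked by a weighted sum of scores.

    w_bool and w_prob play the role of the merging coefficients; a document
    missing from one set contributes 0 from that set (an assumption).
    """
    b = normalize(boolean_scores)
    p = normalize(prob_scores)
    merged = {
        doc: w_bool * b.get(doc, 0.0) + w_prob * p.get(doc, 0.0)
        for doc in set(b) | set(p)
    }
    # Sort by descending merged score; break ties by document ID.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

Raising `w_bool` relative to `w_prob` moves documents favored by the Boolean engine toward the top of the merged list, and lowering it does the reverse.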

The practical applications of combined Boolean and probabilistic searching, and the merging coefficients for different types of searches, will need to be tested and refined in the evaluation process. We have some empirical starting points for assigning relative values to certain types of Boolean queries and to the result sets produced by those queries. These are based on the relative advantages and disadvantages of a Boolean search engine compared to a probabilistic search engine for particular types of queries.

Boolean known item searches, such as author or exact title searches, should be given the highest weighting in the merging process, thus using the Boolean queries where they work best. In this case the Boolean retrieved set would be given a greater weight than the probabilistic retrieved set. (It is unknown whether users will want to combine this type of search with a probabilistic subject search.) Title or abstract keyword searches, which are often used as a type of subject search, should be assigned a lower value than known item searches. These terms are effectively weighted heavily in the document record, because of the presumed importance of title words and abstract words. As a starting point for evaluation, results from these queries will be given close to equal value with the results of probabilistic queries. Finally, result sets from keyword searches in the full text of a document (the weakest type of search in a Boolean system) should be assigned less weight than the result set of a probabilistic query. In this case, the merged result set is heavily weighted in favor of the more useful probabilistic search engine, and augmented somewhat by the occurrence of a keyword in the text of the document. We expect this method of merging the results of Boolean and probabilistic queries to be especially useful in improving the results of keyword Boolean retrieval strategies.
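The three tiers described above could be recorded as a small lookup from query type to a pair of merging coefficients. The sketch below is illustrative only: the query-type labels, the function name `coefficients_for`, and the numeric values are hypothetical placeholders. Only their ordering (Boolean favored for known item searches, roughly equal for title or abstract keywords, probabilistic favored for full-text keywords) comes from the text, and the actual values would be set during evaluation.

```python
# Hypothetical starting coefficients: (Boolean weight, probabilistic weight).
# The numbers are placeholders; only their relative ordering follows the text.
MERGE_COEFFICIENTS = {
    "known_item": (0.8, 0.2),              # author or exact title search
    "title_abstract_keyword": (0.5, 0.5),  # close to equal value
    "full_text_keyword": (0.2, 0.8),       # weakest Boolean search type
}


def coefficients_for(query_type):
    """Return (Boolean weight, probabilistic weight) for a Boolean query type."""
    try:
        return MERGE_COEFFICIENTS[query_type]
    except KeyError:
        raise ValueError(f"unknown query type: {query_type!r}") from None
```

Keeping the coefficients in a single table makes them easy to adjust as evaluation results come in, without touching the merging code itself.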


Contact: Ray R. Larson