The current version of the CSTR system uses a probabilistic retrieval algorithm that ranks retrieved documents in order of their estimated probability of relevance for a user's natural language query. The ranking algorithm that we are currently using is based on the staged logistic regression method described in [\protect\citeauthoryearCooper et al.1992]. From the user's point of view (or the interface programmer's point of view), the retrieval method is simply invoked as a function call (kwsearch)in a query language statement. The classes built by the indexing daemon are exploited to calculate the estimated probability of relevance for each indexed item in the database. Figures 6 and 7 shows the process involved in the probabilistic retrieval from the CSTR database.
The actual query to the Lassen retrieval process consists simply of a natural language statement of the searcher's interests. The query is treated much the same as the documents were treated in the indexing process. The individual words are extracted and matched to the wn_index dictionary (after removing stopwords). The unique termids for matching words from wn_index are then used to retrieve all the tuples in kw_term_doc_rel that contain the term. For each unique document ID in this list of tuples, the matching kw_doc_index tuple is retrieved. With the frequency information contained in kw_term_doc_rel and kw_doc_index, the estimated probability of relevance is calculated for each document containing at least one term in common with the query. The formulae used in the calculation and the sequence of operations performed to calculate the probability of relevance are shown in Figure 7.
Once the probability of relevance is calculated for each document, it is stored along with a unique query ID, the document ID, location information, in the kw_retrieval class. The query itself, along with its unique id are stored in the kw_query class. To see the results of the retrieval operation, the query ID is used to retrieve the appropriate kw_retrieval tuples, ranked in order according to the estimated probability of relevance.
In the current design for indexing and retrieval operations, all of the information used is visible in user-accessible classes in the database. There is a fairly high overhead, in terms of storage and in processing time for maintaining the indexing and retrieval information in this way. We are investigating a class of new access methods to support indexing and retrieval in a more efficient fashion. This will involve declaring some POSTGRES functions that can extract sub-elements of a given type of attribute (such as words in a text string) and generate indexes for each of the sub-elements so extracted. Other types of data might also benefit from this class of access methods, for example functions that extract sub-elements like geometric shapes from images might be used to generate sub-element indexes of image collections.