Beyond the ongoing development of a working electronic library, the Berkeley group is focusing on a number of research issues that will influence the future development of our own, and others', implementations of electronic libraries. The following briefly touches on the research agenda for the CSTR electronic library project at Berkeley. Most of these research efforts are currently underway, and their results have not yet been published. Our primary research areas for the electronic library project are:
The Berkeley CS department has a long history of work in natural language processing under Prof. Robert Wilensky, who is turning this expertise towards the problems of information retrieval. The Berkeley group is constructing a prototype Automated Librarian that uses natural language processing techniques to improve access to information in the repository. As part of this, work is underway on automatic classification of documents into categories using a combination of statistical and NLP methods.
As already seen, we are supporting a variety of user interfaces and retrieval methods for the electronic library project. We plan to evaluate these using methods ranging from transaction log analysis to user interviews and traditional experimental IR evaluation databases and methods.
By supporting standards in both the DBMS and networking worlds, we are attempting to provide a more comprehensive model for electronic libraries. We believe that many current network-based views for browsing and retrieval (such as WWW) will not scale well as the libraries grow in size and complexity. Database management systems were designed to scale to very large databases, but seldom possess the ease of use found in network-based systems. We are attempting to define an integrated view of electronic libraries that will both scale and be easy to use.
Researchers in the Berkeley group have developed an indexing tool that can segment a document into its significant subtopic sections [Hearst & Plaunt 1993]. We plan to implement a new information access paradigm that uses this section information to provide enhanced information retrieval capabilities.
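To make the idea of subtopic segmentation concrete, the following is a minimal sketch of one generic approach: compare the vocabulary of adjacent blocks of sentences and place a boundary where their similarity dips. It is not the cited indexing tool. The function names, the tiny stopword list, the `window` size, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Return indices i such that a subtopic boundary falls before sentences[i].

    Each gap between sentences is scored by the similarity of the `window`
    sentences on either side; a boundary is placed where that score is a
    local minimum and below `threshold`.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    sims = {}
    for gap in range(1, len(sentences)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims[gap] = cosine(left, right)
    boundaries = []
    for gap, sim in sims.items():
        prev_sim = sims.get(gap - 1, 1.0)
        next_sim = sims.get(gap + 1, 1.0)
        if sim < threshold and sim <= prev_sim and sim <= next_sim:
            boundaries.append(gap)
    return boundaries
```

Given the boundary indices, a retrieval system could index each resulting section separately, which is the kind of section-level access the paragraph above has in mind.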
We have also been developing an indexing method that can ``read'' the text of a document and automatically georeference any elements of the text that refer to places, providing the geographic coordinates of the point or polygon for the area discussed in the text [Woodruff & Plaunt 1994].
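As a rough illustration of what georeferencing text involves, the sketch below looks up place names in a small gazetteer and returns the point or polygon attached to each match. It is a simplified stand-in, not the cited method. The `GAZETTEER` table, its coordinates (rough values for illustration only), and the `georeference` function are all hypothetical.

```python
import re

# Hypothetical toy gazetteer: place name -> list of (latitude, longitude) vertices.
# One vertex denotes a point; several denote a polygon. The coordinates are
# rough, illustrative values, not survey data.
GAZETTEER = {
    "berkeley": [(37.87, -122.27)],
    "sacramento": [(38.58, -121.49)],
    "lake tahoe": [(38.90, -120.17), (38.90, -119.92), (39.25, -119.92), (39.25, -120.17)],
}


def georeference(text, gazetteer=GAZETTEER):
    """Return (place name, vertices) pairs for gazetteer names found in the text,
    ordered by where each name first appears."""
    lowered = text.lower()
    hits = []
    for name, vertices in gazetteer.items():
        # Word boundaries keep "berkeley" from matching inside a longer word.
        match = re.search(r"\b" + re.escape(name) + r"\b", lowered)
        if match:
            hits.append((match.start(), name, vertices))
    return [(name, vertices) for _, name, vertices in sorted(hits)]
```

A practical system would also need to disambiguate names shared by several places and to recognize indirect references, neither of which a plain lookup handles.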
In addition, we have been developing some new methods for automatically categorizing or identifying the major topics of a full-length document. This work [Hearst 1994] is still preliminary, but very promising results have been achieved for a large sample of the CSTR database. We hope to integrate the work on these indexing methods with the work on sub-element access methods for the DBMS described above. This will provide a very powerful environment for experimentation in information retrieval, as well as support for the production CSTR electronic library.
The Tioga interface is being developed to support a ``joystick'' interface that is intended to allow the user to pan and zoom through any conceivable combination of data elements as axes in a complex space.
As electronic libraries proliferate on the global network, the problem of distributed search becomes very important. How to decide which servers to search, whether to search in parallel or sequentially, and how to merge the results of distributed search are all research issues with no clear resolution. We are investigating the problems of distributed search, resource discovery, and how to merge result sets.
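The result-merging problem can be illustrated with a small sketch. It assumes each server returns (document id, score) pairs on its own scoring scale, rescales each list to the range [0, 1], and then interleaves the lists by rescaled score. This is only one simple baseline, not a method proposed by the project. The names `normalize` and `merge_results` and the server labels in the usage example are hypothetical.

```python
def normalize(results):
    """Min-max scale one server's (doc_id, score) list into [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat them all as top-ranked.
    return [(doc_id, (score - lo) / span if span else 1.0) for doc_id, score in results]


def merge_results(per_server, limit=10):
    """Merge per-server result lists into one ranking.

    `per_server` maps a server name to its (doc_id, score) list. A document
    returned by several servers keeps its best rescaled score. Ties are
    broken by document id so the output is deterministic.
    """
    best = {}
    for results in per_server.values():
        for doc_id, score in normalize(results):
            if score > best.get(doc_id, -1.0):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    merged = merge_results({
        "server_a": [("r1", 10.0), ("r2", 5.0), ("r3", 0.0)],
        "server_b": [("r4", 0.9), ("r2", 0.8), ("r5", 0.1)],
    })
    for doc_id, score in merged:
        print(doc_id, round(score, 3))
```

Rescaling treats the top document from every server as equally relevant, which is a strong assumption when servers differ in quality. That weakness is part of why merging remains an open research question.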
We are developing several methods that address the problem of searching large numbers of distributed repositories. These methods include adaptive search algorithms, database-oriented architectural algorithms, and inductive learning algorithms. We will compare the methods and integrate the promising ones into a comprehensive distributed search strategy.