
Defining the Electronic Library

The design and development of the Berkeley CSTR electronic library has been based on a particular vision of electronic libraries and how they will develop and proliferate in the future. This vision assumes that there will be many sites that will want to contribute material to and access material in ``The Library''. This library will be a global virtual library; the traditional collection model of gathering all possible information into one place is no longer tenable or desirable. We envision a vast population of users scattered around the globe who are able to access, easily and conveniently, the complete contents of thousands of large and small repositories containing texts, images, sound recordings, videos, maps, scientific and business data, as well as hypermedia combinations of these elements. The library must, therefore, be a network-based distributed system with local servers responsible for maintaining individual collections of digital documents ranging from sets of electronic texts to video-on-demand services. In effect, this virtual library will consist of a set of publishers of electronic information and a set of consumers distributed across the network.

The ``glue'' that holds this distributed library together will be conformance to a set of standards for document description and representation, and a set of communication protocols. We believe that the use of multiple standards, both for document description and representation and for communication, is inevitable in the near term. For eventual standardization of document description and representation, we believe that SGML will provide the basis for text and compound document architectures (supported by additional standards for image, video, and compound documents). For communications, we believe that the standard protocol for low-level query/response and document delivery will be some extended version of the ANSI Z39.50 information retrieval protocol. Currently we support multiple communications protocols for access to the contents of the CSTR database (including the POSTGRES libpq interface, the World-Wide-Web's HTTP, Gopher, and FTP). Eventually a higher-level protocol, such as CNRI's KIS (Knowbot information server), may be used as well.

This standards-based model allows individual information providers (or publishers) to experiment with different implementation schemes while allowing the system to scale and preserve interoperability. That is, the use of open standards and the distributed client/server model will permit production servers (even commercial servers) to operate in parallel with experimental research-oriented servers, and will permit both to be accessed from any client. Materials in the Electronic Library may be accessed via any protocol-compatible client, and provided by any protocol-compatible server.
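
As a minimal sketch of this interoperability claim, the following Python fragment treats the shared protocol as nothing more than a common query interface. The names `LibraryServer`, `ProductionServer`, `ExperimentalServer`, and `search_all` are hypothetical illustrations; they are not part of the CSTR system, Z39.50, or any of the protocols named above.

```python
from typing import Protocol


class LibraryServer(Protocol):
    """Hypothetical stand-in for the shared query/response protocol."""

    name: str

    def query(self, terms: str) -> list[str]:
        ...


class ProductionServer:
    """A stable server backed by a fixed collection of titles."""

    def __init__(self, name: str, titles: list[str]) -> None:
        self.name = name
        self._titles = titles

    def query(self, terms: str) -> list[str]:
        # Plain case-insensitive substring match on the whole query.
        needle = terms.lower()
        return [t for t in self._titles if needle in t.lower()]


class ExperimentalServer:
    """A research server that is free to match differently internally."""

    def __init__(self, name: str, titles: list[str]) -> None:
        self.name = name
        self._titles = titles

    def query(self, terms: str) -> list[str]:
        # Matches when every query word appears as a word in the title.
        words = terms.lower().split()
        return [
            t for t in self._titles
            if all(w in t.lower().split() for w in words)
        ]


def search_all(servers: list[LibraryServer], terms: str) -> dict[str, list[str]]:
    """A client relies only on the shared interface, not on server internals."""
    return {s.name: s.query(terms) for s in servers}
```

Under these assumptions, a client can send the same query to both kinds of server without knowing how either one matches documents, which is the property the standards-based model is meant to preserve.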

In general, we agree with the 9 ``principles'' for electronic libraries discussed by Fox [Fox et al., 1993]:

  1. Declarative representations of documents should be used. Although most of our collection is now in page image and PostScript form, OCRed text is available, and we hope to develop SGML markup versions through automatic parsing of image and text. However, at least for the present and near future, multiple representations of documents should be supported and available.

  2. Document components should be represented using natural forms, namely objects that can be manipulated by users familiar with those objects. Currently, familiar objects (lists of authors and titles, page images, etc.) are maintained in the database and presented to the user.

  3. Links should be recorded, preserved, organized and generalized. The database schema developed for the POSTGRES database supporting the CSTR library provides for any object or item in the database to be linked to any other. Each object also has a unique object ID that can be referenced to provide such linkages. For linkages to external objects and databases, we plan support in the CSTR server for URLs and URNs (Uniform Resource Locators and Uniform Resource Names; the former have become better known as the hypertext addresses used in the World-Wide-Web protocol).

  4. There should be separation between the digital library and the user interfaces to it. We strongly support this principle: the entire model for the electronic library is oriented towards client/server operation, with interfaces left entirely to the client-side implementation. At present we have a number of separate interfaces that provide access to the same underlying database.

  5. Searching should make use of advanced retrieval methods. We have been extending the POSTGRES post-relational DBMS to incorporate advanced indexing and retrieval techniques into the system. The current version uses a probabilistic retrieval algorithm that ranks retrieved documents in order of their estimated probability of relevance with respect to the user's query. We are planning to include additional database support for access methods that will provide both effective and efficient indexing and retrieval in support of advanced IR methods.

  6. Open systems that include the user, and in which (some of) the functions of librarians are carried out by the computer, must be developed. Work is underway to develop an Automated Librarian using natural language processing (NLP) techniques that will locate information relevant to a user's request by understanding a potentially relevant text and the user's request at as deep a level as is necessary and possible, and by applying its knowledge about where information of various sorts is likely to be located.

  7. Task-oriented access to electronic archives must be supported. Since the electronic library is being developed in conjunction with the Sequoia 2000 project, the database includes much more information than just the contents of the CS technical reports. The ``Sequoia side'' of the project seeks to integrate the functions of the electronic library with the task-oriented data storage and retrieval needs of the scientists using the database.

  8. A user-centered development approach should be adopted. The project has been largely driven by the needs of the population that will be using the electronic library. We plan to incorporate full-scale user studies and analysis as the project continues.

  9. Users should work with objects at the right level of generality. User interactions and needs with regard to the electronic library, and to the objects in it and in the database, are not yet well understood, but the library should be flexible enough to permit each user to deal with the information in the forms most appropriate to his or her needs.
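
Principles 3 and 5 above lend themselves to a small illustration. The sketch below is a toy in-memory model, not the POSTGRES schema or the retrieval algorithm used in the CSTR system. Every name in it (`Library`, `add`, `link`, `rank`) is hypothetical, and the ranking swaps in a simple stand-in: a Robertson/Sparck Jones-style term weight computed without relevance information, summed over matching query terms and passed through a logistic function to give a score between 0 and 1.

```python
import math
from collections import defaultdict


class Library:
    """Toy store: each object has a unique ID and may link to any other."""

    def __init__(self) -> None:
        self._next_id = 1
        self.objects: dict[int, str] = {}
        self.links: dict[int, set[int]] = defaultdict(set)

    def add(self, text: str) -> int:
        """Store an object and return its newly assigned unique ID."""
        oid = self._next_id
        self._next_id += 1
        self.objects[oid] = text
        return oid

    def link(self, source: int, target: int) -> None:
        """Record a directed link between two existing objects."""
        if source not in self.objects or target not in self.objects:
            raise KeyError("both ends of a link must be existing object IDs")
        self.links[source].add(target)

    def rank(self, query: str) -> list[tuple[int, float]]:
        """Return (object ID, score) pairs for matches, highest score first."""
        n = len(self.objects)
        docs = {oid: set(t.lower().split()) for oid, t in self.objects.items()}
        scored = []
        for oid, words in docs.items():
            log_odds = 0.0
            matched = False
            for term in set(query.lower().split()):
                if term in words:
                    matched = True
                    df = sum(1 for w in docs.values() if term in w)
                    # Smoothed weight: rarer terms contribute more.
                    log_odds += math.log((n - df + 0.5) / (df + 0.5))
            if matched:
                # Logistic transform maps the summed log-odds onto (0, 1).
                scored.append((oid, 1.0 / (1.0 + math.exp(-log_odds))))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The scores here are meaningful only relative to one another; a production system would estimate the probabilities from data rather than from this fixed formula.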


Wed Mar 2 13:42:59 PST 1994