The design and development of the Berkeley CSTR electronic library has
been based on a particular vision of electronic libraries and how they
will develop and proliferate in the future. This vision assumes that
there will be many sites that will want to contribute material to and
access material in ``The Library''. This library will be a global virtual library;
the traditional collection model of gathering all
information possible into one place is no longer tenable or desirable.
We envision a vast population of users scattered around the globe who
are able to access, easily and conveniently, the complete contents of
thousands of large and small repositories containing texts, images, sound
recordings, videos, maps, scientific and business data, as well as
hypermedia combinations of these elements. The library must, therefore,
be a network-based distributed system with local servers responsible for
maintaining individual collections of digital documents ranging
from sets of electronic texts to video-on-demand services. In effect,
this virtual library will actually consist of a set of publishers
of electronic information and a set of consumers distributed across
The ``glue'' that holds together this distributed library will be
conformance to a set of standards for document description and
representation, and a set of communication protocols. We believe that
the use of multiple standards for both document description and
representation and for communication is inevitable in the near term.
For eventual standardization of document description and
representation, we believe that SGML will provide the basis for text
and compound document architectures (supported by additional standards
for image, video, and compound documents). For communications we
believe that the standard protocol, for low-level query/response and
document delivery, will be some extended version of the ANSI Z39.50
information retrieval protocol. Currently we are supporting multiple
communications protocols for access to the contents of the CSTR
database (including the POSTGRES libpq interface, the World-Wide-Web's
HTTP, Gopher, and FTP). Eventually a higher-level protocol, such as
CNRI's KIS (Knowbot Information Server), may be used as well.
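To illustrate how several access protocols can sit in front of one database, here is a minimal Python sketch. Every name in it (`Repository`, `FrontEnd`, `HtmlFrontEnd`, `MenuFrontEnd`) is hypothetical, the in-memory store merely stands in for the database, and the output formats only loosely imitate Web and Gopher responses; it is not the CSTR implementation.

```python
import html
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A stored item; the fields are illustrative, not the CSTR schema."""
    object_id: int
    title: str
    body: str


class Repository:
    """Hypothetical in-memory stand-in for the shared document database."""

    def __init__(self, documents):
        self._by_id = {doc.object_id: doc for doc in documents}

    def search(self, term):
        # Naive case-insensitive substring match on titles, ordered by ID.
        term = term.lower()
        hits = [d for d in self._by_id.values() if term in d.title.lower()]
        return sorted(hits, key=lambda d: d.object_id)


class FrontEnd(ABC):
    """One access protocol; every front end shares the same repository."""

    def __init__(self, repository):
        self.repository = repository

    @abstractmethod
    def render(self, documents):
        """Format a result list in this protocol's own conventions."""

    def handle_query(self, term):
        return self.render(self.repository.search(term))


class HtmlFrontEnd(FrontEnd):
    # Loosely imitates a Web-style response: an HTML list of escaped titles.
    def render(self, documents):
        items = "".join(f"<li>{html.escape(d.title)}</li>" for d in documents)
        return f"<ul>{items}</ul>"


class MenuFrontEnd(FrontEnd):
    # Loosely imitates a numbered menu (not the actual Gopher wire format).
    def render(self, documents):
        return "\n".join(f"{i}. {d.title}" for i, d in enumerate(documents, 1))
```

Because both front ends call the same `Repository.search`, adding a further protocol in this sketch means adding one more `FrontEnd` subclass, with no change to the stored data.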
This standards-based model allows individual information providers (or
publishers) to experiment with different implementation schemes
while allowing the system to scale and preserve interoperability. That
is, the use of open standards and the distributed client/server model
will permit production servers (even commercial servers) to operate in
parallel with experimental research-oriented servers, and will permit
both to be accessed from any client. Materials in the Electronic Library
may be accessed via any protocol-compatible client, and provided by any
In general, we agree with the 9 ``principles'' for electronic libraries
discussed by Fox [Fox et al. 1993]:
- Declarative representations of documents should be used.
Although most of our collection is now in page-image and PostScript form,
OCRed text is available, and we hope to develop SGML markup versions
through automatic parsing of images and text. However, at
least for the present and near future, multiple representations of
documents should be supported and available.
- Document components should be represented using natural forms,
namely objects that can be manipulated by users familiar with those
objects. Currently familiar objects (lists of authors and titles, page
images, etc.) are maintained in the database and presented to the user.
- Links should be recorded, preserved, organized and
generalized. The database schema developed for the POSTGRES database
supporting the CSTR library provides for any object or item in the
database to be linked to any other. Each object also has a unique object
ID that can be referenced to provide such linkages. For linkages to
external objects and databases, we plan support in the CSTR server for
URLs and URNs (Uniform Resource Locators and Uniform Resource
Names; the former have become better known as the hypertext addresses
used in the World-Wide-Web protocol).
- There should be separation between the digital library and the
user interfaces to it. We strongly support this principle; the entire
model for the electronic library is oriented towards client/server operation,
with interfaces left entirely to the client-side implementation. At the present
time we have a number of separate interfaces that provide access to
the same underlying database.
- Searching should make use of advanced retrieval methods. We
have been extending the POSTGRES post-relational DBMS to incorporate
advanced indexing and retrieval techniques into the system. The current
version uses a probabilistic retrieval algorithm that ranks retrieved
documents in order of their estimated probability of relevance to
the user's query. We are planning to include additional database
support for access methods that will provide both effective and
efficient indexing and retrieval in support of advanced IR methods.
- Open systems that include the user, and where (some of) the
functions of librarians are carried out by the computer, must be
developed. Work is underway to develop an Automated Librarian
using natural language processing (NLP) techniques that will locate
information relevant to a user's request by understanding a potentially
relevant text and the user's request at as deep a level as is necessary
and possible, and by applying its knowledge about where information of
various sorts is likely to be located.
- Task-oriented access to electronic archives must be supported. Since the
electronic library is being developed in conjunction with the Sequoia 2000 project,
the database includes much more information than just the contents of the CS
technical reports. The ``Sequoia side'' of the project seeks to integrate the functions
of the electronic library with the task-oriented data storage and retrieval needs of
the scientists using the database.
- A user-centered development approach should be adopted. The project has
been largely driven by the needs of the population that will be using the electronic
library. We plan to incorporate full-scale user studies and analysis as the project
- Users should work with objects at the right level of generality.
User interactions and needs with regard to the electronic library and the objects in it
and in the database are not yet well understood, but the library should be flexible
enough to permit each user to deal with the information in the forms most appropriate
to his or her needs.
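As an illustration of the probabilistic ranking mentioned under advanced retrieval methods, the following minimal Python sketch orders documents by an estimated probability of relevance. The text does not specify the model, so the logistic form, the single term-overlap feature, and the `intercept` and `weight` values are all assumptions made for this example and should not be read as the algorithm used in the CSTR system.

```python
import math


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def relevance_probability(query, document, intercept=-3.0, weight=4.0):
    """Estimate P(relevant | query, document) with a logistic model.

    The single feature (fraction of query terms found in the document) and
    the intercept/weight defaults are illustrative assumptions, not fitted
    coefficients.
    """
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    doc_terms = set(tokenize(document))
    overlap = len(query_terms & doc_terms) / len(query_terms)
    log_odds = intercept + weight * overlap
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query, documents):
    """Return (doc_id, probability) pairs, most probably relevant first."""
    scored = [(doc_id, relevance_probability(query, text))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank("database query", reports)` on a dictionary mapping report IDs to text returns every report with its estimated probability, so a client can present the full ranked list or cut it off at a threshold of its choosing.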