
Data Collection

In this study, the core set of WWW sites was derived from a set of initial searches using the DEC Alta Vista WWW search engine [Digital Equipment Corp. 1995]. The searches focused on geographic information systems, earth sciences, and satellite remote sensing. This area was chosen because of my familiarity with the topic, and because of the interesting observation from the Inktomi analysis that the most frequently referenced WWW location was the Xerox PARC map browser (http://pubweb.parc.xerox.com/map/). To limit the initial set of items, the query submitted to Alta Vista (using the advanced search mode) was ``link:pubweb.parc.xerox.com/map AND link:xtreme.gsfc.nasa.gov'', that is, a request for the set of WWW documents containing links to both the Xerox map browser and the home page for NASA's AVHRR (Advanced Very High Resolution Radiometer) remote sensing projects.

This initial search resulted in a set of 115 WWW pages containing all or most of the search elements. These were scanned, and the apparently relevant pages were retrieved and stored for further analysis, yielding 43 pages in the areas of geography, GIS, earth sciences, and remote sensing. These included many ``bibliography'' pages from services like Yahoo, or those maintained by individuals interested in one or more of these topics. All of the links to other pages were extracted from these 43 pages and combined in a single file, which resulted in 7209 individual URLs. The URLs were sorted into alphabetical order and edited to eliminate links that occurred in fewer than 3 of the citing documents. Citations that appeared to be outside the topical boundaries set for the study were also eliminated. The editing resulted in a set of 332 potential candidates for the ``core'' set. These were then retrieved and examined using the Netscape WWW browser, and appropriate sites were collected in a ``hotlist'', reducing the size of the core set to 125 WWW documents. This set was still considered too large, so the ``best'' sites of the set were selected (based on my own judgement, with frequent corroboration from various ``best of the Web'' awards given to some pages), reducing the final core set to the 34 sites listed in Table 1.
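
The link-frequency step described above (extract every link from the retrieved pages, then keep only URLs cited by at least 3 of them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the names `LinkExtractor` and `candidate_urls` are hypothetical, each page is assumed to count a given URL at most once, and no URL normalization is attempted. The topical screening and the later hand selection were manual and are not modeled here.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        # A set, so a URL repeated within one page is counted once (assumption).
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value.strip())


def candidate_urls(pages, min_citing=3):
    """Return, in sorted order, the URLs linked from at least `min_citing` pages.

    `pages` is an iterable of HTML strings, one per citing document.
    """
    citing_counts = Counter()
    for html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        citing_counts.update(parser.links)
    return sorted(url for url, n in citing_counts.items() if n >= min_citing)
```

Counting distinct citing pages rather than raw link occurrences is a design choice made for this sketch; it keeps a single page with many repeated links from pushing a URL over the threshold.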

Having obtained a core set of WWW sites in the area of Earth Sciences, Remote Sensing and Geographic Information Systems, the next step was to produce a raw cocitation matrix. This stage requires the ability to search for ``citing documents'', that is, those with links to the items in the core set, and also the ability to conduct the many searches required (for any core set of size $N$, there are $N(N-1)/2$ searches required, one for each pair of items in the core set).
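
To make the pairwise search requirement concrete, the sketch below enumerates every unordered pair of core URLs and builds one query string per pair, in the same ``link:... AND link:...'' form as the query quoted earlier. The helper names `link_term` and `cocitation_queries` are hypothetical, and stripping the scheme and trailing slash is an assumption based only on how that one quoted query relates to its URL.

```python
from itertools import combinations


def link_term(url):
    """Turn a URL into a link: term, dropping the scheme and any trailing slash."""
    return "link:" + url.split("://", 1)[-1].rstrip("/")


def cocitation_queries(core_urls):
    """Yield (url_a, url_b, query) for every unordered pair of core URLs."""
    for a, b in combinations(core_urls, 2):
        yield a, b, f"{link_term(a)} AND {link_term(b)}"
```

For a list of N URLs the generator yields one query per unordered pair, so a list of 4 URLs produces 6 queries.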

In author and journal cocitation analysis, researchers must use the online versions of Science Citation Index, Social Science Citation Index, or Arts and Humanities Citation Index for this stage, because those databases are the only place where citation information can be found and where cocitation searching is possible (see White WHITE86A). For this study, the DEC Alta Vista search engine, with its ability to search for documents containing particular URL ``links'' to a given document, was used for the same purpose.

To carry out the many searches needed for the raw cocitation matrix, a ``web robot'' was programmed to automatically submit the searches based on an input set of URLs (from the core set) and to capture the resulting frequency information for further analysis. The robot was designed to be ``polite'' and to pause after each search, to avoid monopolizing the search service (although Alta Vista handles several million requests per day, a persistent robot might be a nuisance). It was also designed to be persistent and to retry searches that failed to complete (after another pause). The robot carried out all 544 searches, one for each unique URL pair from the list of 34 core set items. The searching required about 5 hours to run.
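
The pause-and-retry behaviour described above can be sketched roughly as follows. This is not the robot used in the study, whose implementation is not given here. The function name `run_searches`, the pause length, and the retry cap are all assumptions, and `search` stands for any caller-supplied function that returns a hit count for a query or raises `OSError` when a request fails to complete.

```python
import time


def run_searches(queries, search, pause_seconds=30.0, max_retries=5,
                 sleep=time.sleep):
    """Run each query in turn, pausing after every request and retrying failures.

    Returns a dict mapping each query to its hit count, or to None if the
    query still failed after `max_retries` retries (an assumed cap).
    """
    counts = {}
    for query in queries:
        result = None
        for _ in range(1 + max_retries):
            try:
                result = search(query)
            except OSError:
                result = None  # treat network errors as "did not complete"
            sleep(pause_seconds)  # be polite: pause after every request
            if result is not None:
                break
        counts[query] = result
    return counts
```

Passing the sleep function as a parameter is a convenience of this sketch: it lets the loop be exercised without real delays, while the default keeps the pause between live requests.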



Ray Larson
Mon Jul 29 09:00:12 PDT 1996