|
About the NWA Toolset
The NWA Toolset is a freely available solution for searching and navigating archived web document collections. A web archive may contain a large number of web documents, as well as several versions of the same web document (i.e. documents that were downloaded from the same URL). Potential users of the NWA Toolset include anyone who has a web archive. Examples of such users may be:
Note that in the following text, an object or an archive object is one file in the web archive. What the user experiences as one web document may consist of several archive objects (e.g. this web page, which comprises the html file and the inline images).
Overview
The NWA Toolset currently consists of three main parts: the Document Retriever, the Exporter and the Access Module.
The Document Retriever included in the NWA Toolset may need some adaptation in order to work with your web archive.
Exporter
The Exporter fetches archive objects and associated metadata from the web archive and prepares them for indexing. The input to the Exporter is a list of archive ids defining which archive objects the Exporter should process. The list has to be generated on the archive side as a preparation for export. Automated tasks for creating the list have to be tailored to suit the specific web archive architecture. For each id in the list the following happens:
If the object is not html (e.g. a gif image), the only data exported for the object is the available metadata, such as its archive id, the original URL of the object, its mime type and its timestamp (e.g. time of harvesting). The Exporter may interface with a converter that transforms non-html text objects such as pdf, msword etc. into html, thus enabling extraction of data from these objects as well. It may also interface with language detection software, enabling the user to narrow a search to text-content objects written in a specific language. Tools for conversion and language detection are not part of the NWA Toolset and must therefore be obtained separately. In the NWA project, third party products licensed by FAST Search & Transfer ASA were used for both the html conversion and the language detection.
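The distinction between html and non-html objects described above can be illustrated with a minimal Python sketch. This is not the Exporter's actual code or output format: the `ArchiveObject` class, the `export_record` function, the field names and the optional `convert_to_html` callback are all hypothetical names chosen for illustration, and the record is a plain dictionary rather than the NWA document format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ArchiveObject:
    """Hypothetical in-memory view of one archive object (names are illustrative)."""
    archive_id: str
    url: str
    mime_type: str
    timestamp: str
    content: bytes = b""


def export_record(
    obj: ArchiveObject,
    convert_to_html: Optional[Callable[[ArchiveObject], Optional[str]]] = None,
) -> dict:
    """Build an export record; only html (or converted) objects carry text."""
    # Metadata is exported for every object, whatever its type.
    record = {
        "archive_id": obj.archive_id,
        "url": obj.url,
        "mime_type": obj.mime_type,
        "timestamp": obj.timestamp,
    }
    # Ignore mime type parameters such as "; charset=utf-8".
    base_type = obj.mime_type.split(";")[0].strip().lower()
    if base_type == "text/html":
        record["html"] = obj.content.decode("utf-8", errors="replace")
    elif convert_to_html is not None:
        # Assumed external converter: returns html text, or None if unsupported.
        html = convert_to_html(obj)
        if html is not None:
            record["html"] = html
    return record
```

In this sketch a gif image yields a record with the four metadata fields only, while a pdf gains an `html` field only when a converter is supplied and succeeds.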
Figure 2 - Exporter
The NWA document format: XML Schema, Example. If your browser does not support viewing xml formatted documents use these links instead: XML Schema, Example
Access Module
The Access Module provides the user with interfaces for searching, browsing and navigating the archived web pages. When the user submits a query, the Access Module uses the search engine to find the objects whose text satisfies the query. When the user asks for a specific web document, the Access Module returns the top-level object of the document from the archive (e.g. the archived object with the original url http://www.venstre.no/). Before the object is delivered to the user's browser, it is parsed and all the inline links and references are altered to point into the archive rather than out to the www. When the browser encounters references to inline objects, it asks the Access Module to return these objects in order to present them as part of the document. The resulting web page contains a timeline at the top and the archive object(s) below it. The timeline queries the index for all archived versions of the web page and displays the timestamps graphically along the line.
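The link rewriting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Access Module's implementation: the `/archive/<timestamp>/<encoded-url>` address scheme, the `ARCHIVE_PREFIX` constant and the `rewrite_links` function are hypothetical, and a regular expression stands in for the proper html parsing that a real implementation would need.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical address prefix under which archived objects are served.
ARCHIVE_PREFIX = "/archive"

# Matches quoted href/src attribute values; a simplification of real html parsing.
_LINK_ATTR = re.compile(
    r"""(?P<attr>\b(?:href|src))\s*=\s*(?P<q>["'])(?P<url>.*?)(?P=q)""",
    re.IGNORECASE | re.DOTALL,
)


def rewrite_links(html: str, base_url: str, timestamp: str) -> str:
    """Point href/src attributes into the archive instead of the live web."""

    def replace(match: re.Match) -> str:
        target = match.group("url")
        # Leave in-page anchors and non-fetchable schemes untouched.
        if target.startswith(("#", "mailto:", "javascript:")):
            return match.group(0)
        # Resolve relative links against the original URL of the page.
        absolute = urljoin(base_url, target)
        archived = f"{ARCHIVE_PREFIX}/{timestamp}/{quote(absolute, safe='')}"
        q = match.group("q")
        return f"{match.group('attr')}={q}{archived}{q}"

    return _LINK_ATTR.sub(replace, html)
```

With this scheme, a relative link such as `om/` on the page http://www.venstre.no/ would be resolved to its absolute form first and then encoded into an archive address, so that the browser requests inline objects and linked pages from the archive.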
Searching a web archive through the Access Module resembles using Internet search engines like Google or Alltheweb. An example of NWA's search interface is shown below.
The third result tells us that 4 of the 19 versions of the web page with URL http://www.venstre.no/ satisfy the query string. There is a link to each of the four versions, with the highest ranked version in boldface. Clicking the link with the text 2001-09-12 will result in the web page shown below.
The resulting web page displays a timeline at the top and the archived web page at the bottom. The timeline enables easy navigation between different versions of the web page. The links in the html file have been altered to point within the archive. Clicking the link "det er dette vi gjør" (in the lower part of the picture) will produce the result shown below.
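Navigation between versions along the timeline can be sketched as a lookup in a sorted list of timestamps. The sketch below assumes the index has already returned the timestamps of all archived versions of a URL, and that they are ISO-style date strings (such as 2001-09-12) that sort chronologically as text. The function name `timeline_neighbours` is hypothetical and not part of the NWA Toolset.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def timeline_neighbours(
    timestamps: List[str], current: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return the versions immediately before and after `current`, if any."""
    # Assumes timestamps sort chronologically as strings (e.g. YYYY-MM-DD).
    ordered = sorted(set(timestamps))
    i = bisect_left(ordered, current)
    previous = ordered[i - 1] if i > 0 else None
    # Step over the current version itself when it is present in the list.
    j = i + 1 if i < len(ordered) and ordered[i] == current else i
    following = ordered[j] if j < len(ordered) else None
    return previous, following
```

A timeline display could use the two returned values as targets for "previous version" and "next version" links, with `None` indicating that the current version is the earliest or latest one archived.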
Of course, links in the web page may point to an object that is not archived (or indexed). When the user clicks a link that points to such an object, the interface explains this.
Updated 14.02.2004 |