

README

This README describes the files found in this distribution. Together
the files provide a means to download/mirror content from the Internet
Archive and incorporate it into your library "catalog". The example
here uses VuFind, but other "discovery" systems could be used as well.

  * README - This file
  
  * LICENSE - This distribution is provided under the GNU General Public
    License
  
  * getkeys.sh - Tweak the definition of the URL and output a set of
    Internet Archive keys
  
  * keys2urls.pl - Convert the set of keys into specific URLs to download
  
  * mirror.sh - Download/mirror Internet Archive content locally
  
  * updatemarc.pl - Enhance the mirrored MARC records with 856$u values
    pointing to your local copy of the content as well as the remote
    (canonical) version of the content
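
The key-to-URL step can be sketched in a few lines of shell. This is
not the code in keys2urls.pl, only an illustration of the idea. The
base URL, the function name keys_to_urls, and the file name suffixes
(_meta.mrc for the MARC record, _djvu.txt for the plain text) are
assumptions, and not every Internet Archive item necessarily offers
both files.

```shell
#!/bin/sh
# Sketch only: turn Internet Archive keys (one per line on standard
# input) into download URLs (one per line on standard output).
# ASSUMPTION: files live under $ROOT/<key>/ and are named
# <key>_meta.mrc (MARC record) and <key>_djvu.txt (plain text).

ROOT='https://archive.org/download'   # assumed base URL

keys_to_urls() {
  while IFS= read -r key; do
    # skip blank lines
    [ -n "$key" ] || continue
    printf '%s/%s/%s_meta.mrc\n' "$ROOT" "$key" "$key"
    printf '%s/%s/%s_djvu.txt\n' "$ROOT" "$key" "$key"
  done
}
```

With the function loaded, something like
"keys_to_urls < catholic.keys > catholic.urls" would produce a list
that a mirroring step could read.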

Sample usage:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' \
    -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart
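
Because the find command above appends to archive.marc, running it
twice would load every record twice. A minimal sketch of one way around
that is to empty the output file before concatenating. The function
name collect_marc is hypothetical and not part of this distribution,
and the output file is assumed to live outside the directory being
searched.

```shell
#!/bin/sh
# Sketch only: gather every *.marc file under a directory into one file.
# Usage: collect_marc SOURCE_DIR OUTPUT_FILE
# ASSUMPTION: OUTPUT_FILE is not located under SOURCE_DIR.

collect_marc() {
  src=$1
  out=$2
  # empty the output first so repeated runs do not duplicate records
  : > "$out"
  # concatenate all matching files, in whatever order find reports them
  find "$src" -type f -name '*.marc' -exec cat {} + >> "$out"
}
```

For example: collect_marc /usr/var/html/etexts
/usr/local/vufind/marc/archive.marc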
  
A cool next step would be to use text mining techniques against the
downloaded plain text versions of the documents to create summaries,
extract named entities, and identify possible subjects. These items
could then be inserted into the MARC records to enhance retrieval.
Ideally the full text would be indexed, but alas, MARC does not
accommodate that. "MARC must die."
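
As a very rough starting point for identifying possible subjects, the
most frequent longer words in a plain text file can be counted with
standard Unix tools. This is a crude stand-in for real text mining (no
stop words, no stemming, no entity extraction), and the function name
top_terms and the four-letter cutoff are arbitrary choices made for
illustration.

```shell
#!/bin/sh
# Sketch only: list the N most frequent words of four or more letters.
# Usage: top_terms FILE [N]     (N defaults to 10)

top_terms() {
  file=$1
  n=${2:-10}
  # one word per line, lower-cased, short words dropped, then counted
  tr -cs 'A-Za-z' '\n' < "$file" |
    tr 'A-Z' 'a-z' |
    awk 'length($0) >= 4' |
    sort | uniq -c | sort -rn |
    head -n "$n"
}
```

Each output line holds a count followed by a word. The words at the top
of the list could be reviewed by hand as candidate subject terms.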
  
-- 
Eric Lease Morgan <eric_morgan@infomotions.com>
June 2, 2009
