Interface to the Common Crawl dataset on Amazon S3
Started in September 2013



An instance of the corpus is obtained as:

cc = CrawlCorpus(cache_location::String, debug::Bool=false)

Since the crawl corpus files are large, they are cached locally by default at cache_location. The first time a file is accessed, it is downloaded in full into the cache location. Subsequent reads are served from the local copy.
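The download-once behaviour described above can be pictured with a small standalone sketch. It is not the package's implementation: `cached_path` and `fake_fetch` are hypothetical names, and the "download" is a stand-in function that only counts how often it is called.

```julia
# Hypothetical helper, not part of the package: fetch a file into the
# cache directory on first access and reuse the local copy afterwards.
function cached_path(fetch::Function, cache_location::AbstractString, name::AbstractString)
    path = joinpath(cache_location, name)
    if !isfile(path)
        mkpath(cache_location)
        # First access: retrieve the whole file and store it in the cache.
        write(path, fetch(name))
    end
    return path
end

# Stand-in for a remote download; counts how many times it runs.
downloads = Ref(0)
fake_fetch(name) = (downloads[] += 1; "contents of $name")

cache_dir = mktempdir()
p1 = cached_path(fake_fetch, cache_dir, "archive-0001")  # triggers the fetch
p2 = cached_path(fake_fetch, cache_dir, "archive-0001")  # served from the cache
```

In this sketch the second call returns the same path without invoking the fetch function again, which is the property the cache is meant to provide.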

All cached files, or a particular cached archive file, can be deleted:

clear_cache(cc::CrawlCorpus, archive::URI)

Segments and archive files in a segment can be listed as:

segment_names = segments(cc::CrawlCorpus)
archive_uris = archives(cc::CrawlCorpus, segment::String)

Archive files across all segments can be listed as:

archive_uris = archives(cc::CrawlCorpus, count::Int=0)

Passing count as 0 lists all available archive files; the resulting list can be large.

A particular archive file can be opened as:

open(cc::CrawlCorpus, archive::URI)
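Putting the calls so far together, a minimal usage sketch might look like the following. It uses only the signatures listed above, but several things are assumptions: the module name `CommonCrawl` is not stated in this text, `with_first_archive` is a hypothetical helper, the value returned by `open` is assumed to be an `IO` that can be closed with `close`, and the package may not load on a current Julia version.

```julia
using CommonCrawl  # assumed module name; adjust to the actual package

# Hypothetical helper: open the first listed archive, pass it to `work`,
# and make sure the handle is closed afterwards.
function with_first_archive(work::Function, cache_location)
    cc = CrawlCorpus(cache_location)   # debug defaults to false
    uris = archives(cc, 1)             # request a single archive URI
    isempty(uris) && error("no archive files listed")
    f = open(cc, uris[1])              # first access downloads into the cache
    try
        return work(cc, f)
    finally
        close(f)                       # assumes `open` returned an IO
    end
end
```

A caller would pass a function of `(cc, f)`, for example with a `do` block, and keep the requested count small, since listing everything can take a while.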

And crawl entries can be read from an opened archive as:

entry = read_entry(cc::CrawlCorpus, f::IO, mime_part::String="", metadata_only::Bool=false)
entries = read_entries(cc::CrawlCorpus, f::IO, mime_part::String="", num_entries::Int=0, metadata_only::Bool=false)

read_entry returns an ArchiveEntry instance for the next entry in the file whose MIME type begins with mime_part. read_entries returns an array of ArchiveEntry objects. If num_entries is 0, all matching entries in the archive file are returned. If metadata_only is true, only the metadata (URL and MIME type) is populated in the entries.
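The selection rules for mime_part and num_entries can be illustrated without touching the network. The sketch below is an assumption-laden stand-in, not the package's code: `SketchEntry` and `select_entries` are hypothetical names, and the field names `url` and `mime` are chosen for illustration only.

```julia
# Hypothetical stand-in for ArchiveEntry; the real field names are not given above.
struct SketchEntry
    url::String
    mime::String
end

# Keep entries whose MIME type begins with `mime_part`.
# A `num_entries` of 0 means "return every match".
function select_entries(entries, mime_part::AbstractString="", num_entries::Integer=0)
    matches = filter(e -> startswith(e.mime, mime_part), entries)
    num_entries == 0 && return matches
    return matches[1:min(num_entries, length(matches))]
end

entries = [SketchEntry("http://a.example/", "text/html"),
           SketchEntry("http://b.example/img.png", "image/png"),
           SketchEntry("http://c.example/", "text/plain")]

text_entries = select_entries(entries, "text")  # prefix match on the MIME type
one_entry = select_entries(entries, "", 1)      # empty prefix, limited to one entry
```

Here the prefix "text" selects both the text/html and text/plain entries, while an empty prefix with a limit of 1 returns just the first entry.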