Data sets tagged with "crawl"
Offsite
Crunchbase database crawl
From: http://github.com/petewarden/crunchcrawl This module lets you index and download the company information held in Crunchbase. A full scrape of the data is also included. Before using, double-check http://www.crunchbase.com/robots.txt and the API conditions to ensure you’re obeying the terms of service. It contains various scripts to index and pull down the latest ...
Free
The ClueWeb09 Dataset
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. Dataset specifications: 1,040,809,705 web pages in 10 languages; 5 TB, ...
Offsite
