Data sets tagged with "massive"

Enron Email Dataset

From the CALO Project at Carnegie-Mellon University a massive dataset of emails recovered from discovery documents in the Enron trials About From distribution page: > This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into ...
Offsite

Freebase Wikipedia Extraction (WEX)

The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted intabular form. Freebase WEX is provided as a set of database tables in TSV format ...
Offsite

Westbury Lab Usenet Corpus: 28M postings from 47000+ newsgroups 2005-2009

A USENET corpus (2005-2009) This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2010, and covers 47860 English language, non-binary-file news groups. Despite our best effots, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be ...
Offsite

Freebase Data Dump

Freebase data dumps provide all of the current facts and assertions within the Freebase system. The data dumps are complete, general-purpose extracts of the Freebase data in a variety of formats. Freebase releases a fresh data dump every three months. Freebase is an open database of the world’s information, covering millions of topics across hundreds of categories. ...
Offsite

All Tags