Data sets tagged with "documents"
Corpus of Erotica Stories
An excellent resource for natural language processing and machine learning work. This corpus consists of 4,771 raw-text erotica stories collected from www.textfiles.com/sex/EROTICA. As a logical outgrowth of the encouragement of writing on BBSes, people have been writing some form of erotica or sexual narrative for others for quite some time. With the advent of Fidonet ...
Free
EU - Susta Info
Overview from the [front page](http://magenta.collexis.net/susta-info/en/index.aspx): “Susta-Info is a global database of case studies and publications, validated by research institutes, ‘associations of cities’ and expert groups. Susta-Info is an EU DG Research supported project in the context of the Sixth Framework Program, with priority 1.1.6.3: Global ...”
Offsite
EUROPA - Register of Commission documents
Overview: “The register contains references both of documents which have already been published and of internal (unpublished) Commission documents, from the 1st January 2001. Information in register includes: the identifier or reference number, the title of the document in the languages in which it is available, the date of the document, the languages in ...”
Offsite
Wikisource
“Wikisource is an online library of free content publications collected and maintained by the community (see our inclusion policy).”
Offsite
eu-council-register-of-documents
Register of documents of the Council of Ministers of the European Union, with various search functions.
Offsite
20 Newsgroups Dataset (De-Duped Version)
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is thought to have been originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
Free
Stack Overflow Data Dump - Posts, Comments, Users, Votes & Badges
Stack Overflow Creative Commons Data Dump: We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license. All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki. You are free to Share (to copy, distribute, and transmit the work) and to Remix ...
Offsite
Digging into Data - Various Repositories
A list of digital libraries, data archives, and data repositories that are inviting Digging into Data researchers to use their collections. For each repository, you’ll find a description of their contents, contact information, and other details.
Offsite
Kabul War Diary - Over 90,000 Military documents from the War in Afghanistan (from Wikileaks.org)
The Afghan War Diary is an extraordinary secret compendium of over 91,000 reports covering the war in Afghanistan from 2004 to 2010. The reports describe the majority of lethal military actions involving the United States military. They include the ...
Offsite
