Category: Science » Linguistics
From the CALO Project at Carnegie Mellon University, a massive dataset of emails recovered from discovery documents in the Enron trials. From the distribution page: "This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into ..."
Offsite
A word list with over 100,000 entries that are officially permitted in crossword games like Scrabble™. This word list is available in a simple, alphabetically ordered Excel format, making it convenient for reference, spell-checking, or, in more sophisticated applications, for developers looking to build a custom spelling dictionary. The entries include variants of ...
Free
74,550 common dictionary words: a list of words common to two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.
$4.00
Excellent resource for working with natural language processing and machine learning. This corpus consists of 4771 raw text erotica stories collected from www.textfiles.com/sex/EROTICA. A logical flow from the encouragement of writing on BBSes, people have been writing some form of erotica or sexual narrative for others for quite some time. With the advent of Fidonet ...
Free
A list of 113,809 words officially permitted in crossword games like Scrabble™, with their definitions. The words are compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes variants of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spelling dictionary. It is a reference to have handy for ...
$4.00
113,809 official crossword words: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spelling dictionary.
Free
WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts ...
Offsite
List of summonable objects from the Nintendo DS game Scribblenauts, from AARDVARK, ABOMINABLE SNOWMAN and ABSCONDER to ZOMBIE, ZUNICERATOPS and ZYGOTE. Via the Scribblenauts Wikipedia entry: Scribblenauts is an emergent puzzle action video game with the tagline “Write Anything, Solve Everything”. Its objective is to complete puzzles by summoning any object (from a ...
Free
Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
Free
This data is derived from the MySpace real-time stream API. The word count is from the free-form text fields MySpace moods, forum topic titles, replies to forum topics, text from sharing a link or item, and status mood updates. For the last three months the words from these fields have been extracted and this dataset contains their totals binned by day.
$25.00
This data is derived from the MySpace real-time stream API. The word count is from the free-form text fields MySpace moods, forum topic titles, replies to forum topics, text from sharing a link or item, and status mood updates. For the last three months the words from these fields have been extracted and this dataset contains their totals binned by hour.
$50.00
The mission of the site is to make available to the public the highest quality and most reliable historical data on important economic aggregates, with particular emphasis on nominal measures. The data have been created using the highest standards of the fields of economics and history and are rigorously refereed by the most distinguished researchers in the fields. ...
Offsite
A banned word list compiled from many lists around the web of words considered socially unacceptable for one reason or another. What to do with a banned word list? Use this dirty word list to screen for spammers and griefers; to censor dissidents; to better understand the semiotic role of taboo signifiers in an online modality; to monitor user ...
Free
Topic Detection and Tracking research was pursued under the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program, of which it is an integral part. The goal of the TIDES program is to enable English-speaking users to access, ...
Offsite
A common word list with over 250,000 entries of hyphenated, capitalized and compound English words. The download consists of entries containing more than one word, as well as capitalized words and acronyms. Phrases are considered “common” if they or variations of them occur in a standard dictionary or thesaurus. This word list is available in a simple, ...
Free
Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
$4.00