Data sets tagged with "classification"
Taxobox - Wikipedia Infoboxes with Taxonomic information on Animal Species
This dataset is a collection of Taxobox Infoboxes from Wikipedia. Snippet:
Antilles_pinktoe:
  name: Antilles Pink Toed Tarantula
  regnum: "[[Animal]]"
  classis: "[[Arachnid]]"
  phylum: "[[Arthropod]]"
  ordo: "[[Spider]]"
  imageWidth: 250px
  imageCaption: Female Avicularia versicolor
  binomial: Avicularia versicolor
  familia: ...
Free
Corpus of Erotica Stories
A useful resource for natural language processing and machine learning. This corpus consists of 4771 raw text erotica stories collected from www.textfiles.com/sex/EROTICA. As a natural outgrowth of the writing culture encouraged on BBSes, people have been writing some form of erotica or sexual narrative for others for quite some time. With the advent of Fidonet ...
Free
Document Metadata Based on a Sample of Web Documents from the Open Directory
DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research.
Michael G. Noll
Offsite
ICONCLASS - Multilingual Thematic Classification
From the website: This is an experimental service that makes the ICONCLASS Iconographic Classification system available as linked data using the SKOS vocabulary. This service is inspired by the excellent Library of Congress Subject Headings linked data service. It is intentionally copied in spirit and conventions used. The idea is to enable others to make ...
Offsite
Mushroom Data Set
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981), pp. 500-525. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no ...
Offsite
20 Newsgroups Dataset (De-Duped Version)
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is speculated that it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
Free
TechTC - Technion Repository of Text Categorization Data Sets
The Technion Repository of Text Categorization Datasets provides a large number of diverse test collections for use in text categorization research.
Offsite
Standard Occupational Classification
The Standard Occupational Classification (SOC) system is used by Federal statistical agencies to classify workers into occupational categories for the purpose of collecting, calculating, or disseminating data. All workers are classified into one of over 820 occupations according to their occupational definition.
Offsite
National Accounts Sector Classification
Provides information on the classification of organisations and institutions in the National Accounts. Source agency: Office for National Statistics. Designation: National Statistics. Language: English. Alternative title: Sector Classification. Statistical classification of Northern Rock plc; Financial support for the banking industry: classification issues; Classification of ...
Offsite
Mobile User Short Message Data of One Mobile Operator in China
Short message (SMS) data from the users of one mobile operator in China. The data consists mainly of formal (non-spam) short messages and spam messages: there are 170229 spam records and 33588 formal message records.
Free
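The record counts quoted for the short message data set above imply a heavily imbalanced classification problem. As a rough sketch (the variable names are illustrative and not part of the data set), the class proportions can be worked out like this:

```python
# Record counts quoted in the data set description.
spam_records = 170229
formal_records = 33588

total_records = spam_records + formal_records
spam_fraction = spam_records / total_records

# A classifier that always answers "spam" is right exactly as often as
# spam occurs, so this fraction is also the majority-class baseline accuracy.
print(f"total records: {total_records}")
print(f"spam fraction: {spam_fraction:.1%}")
```

Any classifier trained on this data is probably best judged against that majority-class baseline (roughly 83.5%) rather than against 50%.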
Movie Reviews Naive Bayes Sentiment Classifier for Python NLTK
A 98.7% accurate Naive Bayes sentiment analysis classifier trained on movie reviews. Given a feature dict of words and bigrams, it will classify the text as “pos” or “neg”. It requires Python & NLTK 2.0 and is licensed for commercial usage.
$25.00
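The sentiment classifier listing above mentions a "feature dict of words and bigrams" without showing one. The following is a minimal sketch of how such a dict might be built. It assumes the common NLTK convention of mapping each feature to True, and it uses naive whitespace tokenization. The function name is hypothetical, and the exact feature format the sold classifier expects is not stated in the listing.

```python
def words_and_bigrams_features(text):
    """Build a feature dict with one entry per word and per adjacent word pair.

    Assumption: features map to True, as in typical NLTK bag-of-words
    featuresets; the listing does not specify the exact format.
    """
    words = text.lower().split()  # naive whitespace tokenization
    features = {word: True for word in words}
    # zip pairs each word with its successor, yielding adjacent bigrams.
    for bigram in zip(words, words[1:]):
        features[bigram] = True
    return features


features = words_and_bigrams_features("a truly great movie")
print(sorted(features, key=str))
```

An NLTK classifier object exposes a `classify(featureset)` method, so a dict like this would be what gets passed to it once the classifier is loaded.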
Movie Reviews Naive Bayes Subjectivity Classifier for Python NLTK
A 93.57% accurate Naive Bayes subjectivity classifier trained on IMDb plots and RottenTomatoes quotes. This classifier can be used for hierarchical sentiment analysis to determine whether text is objective or subjective before using a sentiment classifier to determine polarity. Given a feature dict of words and bigrams, it will classify the text as “quote” or ...
$25.00
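The hierarchical approach described for the subjectivity classifier above can be sketched as a two-stage pipeline. This is only an illustration: the two classifier callables below are toy stand-ins, not the sold classifiers, and treating the “quote” label as the subjective class is an assumption based on the RottenTomatoes quotes used for training.

```python
def hierarchical_sentiment(features, classify_subjectivity, classify_polarity,
                           subjective_label="quote"):
    """Return a polarity label for subjective text, or None for objective text.

    Assumption: `subjective_label` marks subjective text; polarity is only
    computed when the first-stage classifier returns that label.
    """
    if classify_subjectivity(features) != subjective_label:
        return None  # objective text carries no polarity in this scheme
    return classify_polarity(features)


# Toy stand-ins so the sketch runs on its own (hypothetical rules).
def toy_subjectivity(features):
    return "quote" if "great" in features or "awful" in features else "other"


def toy_polarity(features):
    return "pos" if "great" in features else "neg"


print(hierarchical_sentiment({"great": True}, toy_subjectivity, toy_polarity))
print(hierarchical_sentiment({"plot": True}, toy_subjectivity, toy_polarity))
```

With real NLTK classifiers, the two callables would be the `classify` methods of the subjectivity and sentiment classifier objects.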
Victorian Dryland Salinity Assessment 2000 – Best Case Trends (VIC_MIN_TREND) from data.gov.au
A 685.00KB dataset from data.gov.au. This information shows those areas with rising, flat or falling watertable trends based on the minimum (or best case) trend derived from the bore hydrograph analysis. No information has been compiled in defined ...
Free
Victorian Dryland Salinity Assessment 2000 – Worst Case Trends (VIC_MAX_TREND) from data.gov.au
A 1.05MB dataset from data.gov.au. This information shows those areas with rising, flat or falling watertable trends based on the maximum (or worst case) trend derived from the bore hydrograph analysis. No information has been compiled in defined ...
Free
