Data sets tagged with "classification"

Taxobox - Wikipedia Infoboxes with Taxonomic information on Animal Species

This dataset consists of a collection of Infoboxes from Wikipedia on the topic of Taxobox. Snippet: Antilles_pinktoe: name: Antilles Pink Toed Tarantula regnum: "[[Animal]]" classis: "[[Arachnid]]" phylum: "[[Arthropod]]" ordo: "[[Spider]]" imageWidth: 250px imageCaption: Female Avicularia versicolor binomial: Avicularia versicolor familia: ...
Free

Corpus of Erotica Stories

An excellent resource for natural language processing and machine learning work. This corpus consists of 4771 raw-text erotica stories collected from www.textfiles.com/sex/EROTICA. As a natural outgrowth of the writing that BBSes encouraged, people have been writing some form of erotica or sexual narrative for others for quite some time. With the advent of Fidonet ...
Free

Document Metadata Based on a Sample of Web Documents from the Open Directory

DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from the social bookmarking service delicious.com, the content rating system ICRA, and the search engine Google. The data set is freely available for other research. Michael G. Noll
Offsite

ICONCLASS - Multilingual Thematic Classification

From the website: This is an experimental service that makes the ICONCLASS Iconographic Classification system available as linked data using the SKOS vocabulary. This service is inspired by the excellent Library of Congress Subject Headings linked data service. It is intentionally copied in spirit and conventions used. The idea is to enable others to make ...
Offsite

Mushroom Data Set

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no ...
Offsite

20 Newsgroups Dataset (De-Duped Version)

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is thought to have been originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
Free

Standard Occupational Classification

The Standard Occupational Classification (SOC) system is used by Federal statistical agencies to classify workers into occupational categories for the purpose of collecting, calculating, or disseminating data. All workers are classified into one of over 820 occupations according to their occupational definition. Additional facts from data.gov Dataset Summary Date ...
Offsite

National Accounts Sector Classification

Provides information on the classification of organisations and institutions in the National Accounts. Source agency: Office for National Statistics. Designation: National Statistics. Language: English. Alternative title: Sector Classification. Statistical classification of Northern Rock plc. Financial support for the banking industry: classification issues. Classification of ...
Offsite
