Data sets tagged with "record_linkage"
Email Data Sets
Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find a few email data sets, as well as a dataset of newsgroup text, annotated with personal name spans. The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. As a second type of informal text, ...
ZoomInfo - Welcome to the ZoomInfo Developer API
The ZoomInfo Public API provides free access to ZoomInfo’s people database and company database that contain over 40 million people and nearly 4 million companies, respectively. The ZoomInfo people search API gives you the ability to search for any person in the database by name. The ZoomInfo company search API gives you the ability to search for any company in the ...
Given Name Frequency Project: Analysis of Given Name Popularity
The Given Name Frequency Project provides analysis, tools, and data to spur further work on given names. Data provided includes popular given names in the US from 1801 to 1999, samples of names from England before 1800 from a diverse set of sources, the popularity of the name Mary over the past 800 years, and a sample of cotton workers in Manchester, England from ...
1990 Census Name Files
Three separate datasets obtained from the 1990 census. One set contains last names, one contains male first names, and one contains female first names. Each entry gives the name, its frequency in percent, the cumulative frequency in percent, and the rank.
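The four fields listed above can be read into a small record type. This is a minimal sketch that assumes each line holds the four fields separated by whitespace. The `NameRecord` type, the `parse_name_line` helper, and the sample rows are hypothetical and are not real census values.

```python
from typing import NamedTuple


class NameRecord(NamedTuple):
    name: str
    freq_pct: float      # frequency in percent
    cum_freq_pct: float  # cumulative frequency in percent
    rank: int


def parse_name_line(line: str) -> NameRecord:
    """Parse one line, assumed to be: name, frequency %, cumulative %, rank."""
    name, freq, cum, rank = line.split()
    return NameRecord(name, float(freq), float(cum), int(rank))


# Made-up rows for illustration only (not taken from the census files).
sample_lines = [
    "ALPHA 1.50 1.50 1",
    "BETA 0.75 2.25 2",
]

records = [parse_name_line(line) for line in sample_lines]
by_name = {record.name: record for record in records}
```

A lookup table such as `by_name` is one way to use name frequency as a matching weight in record linkage, since agreement on a rare name is stronger evidence than agreement on a common one.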
New SwetoDblp RDF dataset released with 11M triples
The LSDIS (Large Scale Distributed Information Systems) lab at the University of Georgia has released a new version of the SwetoDblp dataset. SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). The dataset has about 11M ...
LSDIS : SwetoDblp
SwetoDblp is a large-size ontology (spin-off of SWETO ontology) focused on bibliography data of Computer Science publications where the main data source is DBLP (Digital Bibliography & Library Project). SwetoDblp was created from a large XML document available at DBLP’s website and other datasets that are used to add relationships to other entities such as Publishers, ...
Vaccines: IIS/Tech/Deduplication Test Cases
NIP (now called the National Center for Immunization and Respiratory Diseases) developed a toolkit to assist immunization information systems (IIS) in the evaluation of their deduplication algorithms. This toolkit helps registries assess their system’s ability to prevent/remove duplicate records. The data and procedures in this toolkit can help identify strengths and ...
Ted Pedersen - Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Ent
Contains data where ambiguous entity names in text have been disambiguated. The data has either been manually disambiguated, or created by conflating multiple names into a single ambiguous pseudo-name.
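The pseudo-name approach mentioned above can be illustrated as follows. This is a rough sketch of the general idea, not the procedure used to build these datasets: every occurrence of several distinct names is replaced by one shared pseudo-name, and the original name is kept as the gold label for evaluation. The `conflate` function, the example names, and the pseudo-name string are all hypothetical.

```python
import re


def conflate(text, names, pseudo_name):
    """Replace each occurrence of any name with pseudo_name.

    Returns the rewritten text and the list of original names,
    in order of occurrence, to serve as gold labels.
    """
    # Try longer names first so a shorter name cannot shadow a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    gold_labels = []

    def replace(match):
        gold_labels.append(match.group(0))
        return pseudo_name

    return pattern.sub(replace, text), gold_labels


text = "Alice Smith met Bob Jones. Later Bob Jones left."
ambiguous_text, gold = conflate(
    text, ["Alice Smith", "Bob Jones"], "Alice_Smith_Bob_Jones"
)
```

A disambiguation system is then run on `ambiguous_text`, and its output is scored against `gold`.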
Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
The following datasets have been provided for evaluating duplicate detection, record linkage, and identity uncertainty systems. Several of these are not yet available for downloading; please contact the authors. The datasets include a segmented citation dataset based on the Cora research paper search engine, a collection of 864 restaurant records from the Fodor’s and ...
