Wolfram Research

World Atlas of Language Structures

Dataset of structural properties of languages

Swadesh Lists

Word lists for common concepts in nearly 1200 languages

Japanese-English Subtitle Corpus

A parallel corpus for machine translation systems, information extraction and other language processing techniques

1911 Encyclopedia Britannica

Plain text of the complete Encyclopedia Britannica, Eleventh Edition (1910-11)

Japanese-English Legal Parallel Corpus

A parallel corpus for machine translation systems, information extraction and other language processing techniques

SQuAD v1.1 Tokens Generated with WL

A list of isolated words and symbols from the SQuAD dataset, which consists of a set of Wikipedia articles labeled for question answering and reading comprehension

SQuAD v1.1

A dataset for question answering and reading comprehension from a set of Wikipedia articles

Europarl English-Spanish Machine Translation Dataset V7

A parallel corpus for machine translation from the proceedings of the European Parliament

Europarl English-German Machine Translation Dataset V7

A parallel corpus for machine translation from the proceedings of the European Parliament

Europarl English-Italian Machine Translation Dataset V7

A parallel corpus for machine translation from the proceedings of the European Parliament

Europarl English-French Machine Translation Dataset V7

A parallel corpus for machine translation from the proceedings of the European Parliament

FDIC Institution EntityStore

A Wolfram Language EntityStore with selected data on FDIC-insured institutions

Scripps National Spelling Bee Champions

Spelling Bee winners, final words, and sponsoring organizations

Atlantic Hurricane Data 1851-2017

A modification of the NOAA "Hurdat2" Dataset on Atlantic Hurricanes to facilitate use with the Wolfram Language

Minecraft Block Types

A Wolfram Language EntityStore with IDs and sample images for 150+ types of Minecraft blocks

Geotagged Public Tweets (Europe, April 6-8 2016)

Geotagged public Twitter statuses posted in Europe, April 6-8, 2016

Spoken Digit Commands

A dataset consisting of recordings of spoken digits

Kyoto Free Translation Task Data

A parallel corpus for the evaluation and development of Japanese-English machine translation systems

Irish-Viking Networks in 'Cogadh Gaedhel re Gallaibh'

Graph datasets for Irish and Viking character relationships in the medieval Irish text 'Cogadh Gaedhel re Gallaibh' ('The War of the Gaedhil with the Gaill')

SQuAD v2.0 Tokens Generated with WL

A list of isolated words and symbols from the SQuAD dataset, which consists of a set of Wikipedia articles labeled for question answering and reading comprehension

SQuAD v2.0

A dataset for question answering and reading comprehension from a set of Wikipedia articles

Clinical Concepts from Massive Sources of Medical Data

A dataset of medical concepts

United States Supreme Court Decisions 1946-present

Datasets relating to United States Supreme Court cases from 1946 to the present