Europarl English-Spanish Machine Translation Dataset V7

A parallel corpus for machine translation, drawn from the proceedings of the European Parliament

The Europarl dataset contains text corpora in 21 languages, drawn from the proceedings of the European Parliament between 1996 and 2011. The English-Spanish corpus contains 1.9 million training sentences and 44,000 test sentences. Training examples are taken from the originally distributed parallel corpus, while test examples are extracted from the Q4/2000 portion of the data, following the publisher's suggestion. The "ContentElements" field offers four options: "TrainingData", "TestData", "TrainingDataset" and "TestDataset". "TrainingData" and "TestData" are structured as associations, while "TrainingDataset" and "TestDataset" are structured as datasets. Duplicate examples have been removed from the test set.
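
As a minimal sketch of how the four content elements might be inspected, the following lists the elements advertised by the resource and checks the head of two of them. The variable name is introduced here for illustration only, and the sketch assumes the resource can be reached from the current session.

```wolfram
(* Illustrative variable, not part of the resource itself *)
name = "Europarl English-Spanish Machine Translation Dataset V7";

(* List the content elements the resource advertises *)
ResourceObject[name]["ContentElements"]

(* Inspect how an element is structured; per the description above,
   "TestData" should be an association and "TestDataset" a dataset *)
Head[ResourceData[name, "TestData"]]
Head[ResourceData[name, "TestDataset"]]
```

The test elements are used here because they are much smaller than the training elements and so are quicker to retrieve.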

Examples

Basic Examples

Retrieve the resource:

In[1]:=

ResourceObject["Europarl English-Spanish Machine Translation Dataset V7"]

Out[1]=

Obtain the first three training examples:

In[2]:=

Out[2]=

Obtain the first three test examples:

In[3]:=

Out[3]=
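
One way to obtain the first three examples might look like the sketch below. It assumes that "TrainingData" and "TestData" are associations whose values are lists of sentences, which is consistent with the level-2 Map used in the Analysis section but is not confirmed here. The variable name is illustrative.

```wolfram
(* Illustrative variable, not part of the resource itself *)
name = "Europarl English-Spanish Machine Translation Dataset V7";

(* Assumption: each value of the association is a list of sentences,
   so Take keeps the first three entries under every key *)
Map[Take[#, 3] &, ResourceData[name, "TrainingData"]]
Map[Take[#, 3] &, ResourceData[name, "TestData"]]
```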

Dataset Form

Obtain ten random pairs from the training set in Dataset form:

In[4]:=

Out[4]=

Obtain ten random pairs from the test set in Dataset form:

In[5]:=

Out[5]=
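
A possible way to draw ten random pairs in Dataset form is sketched below. It assumes that each dataset is, at its top level, a list of rows, so that a function applied to the whole dataset receives that list. The variable name is illustrative.

```wolfram
(* Illustrative variable, not part of the resource itself *)
name = "Europarl English-Spanish Machine Translation Dataset V7";

(* Assumption: the dataset's top level is a list of rows.
   Applying a function to a Dataset wraps the result in a Dataset again. *)
ResourceData[name, "TrainingDataset"][RandomSample[#, 10] &]
ResourceData[name, "TestDataset"][RandomSample[#, 10] &]
```

Because the sample is random, the rows returned differ from one evaluation to the next unless the random generator is seeded, for example with SeedRandom.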

Analysis

Obtain a character-level histogram of test example lengths:

In[6]:=

(* Apply StringLength to each sentence (level 2) and plot the resulting length distributions *)
Histogram[
 Map[StringLength,
  ResourceData["Europarl English-Spanish Machine Translation Dataset V7", "TestData"], {2}],
 ChartLegends -> Automatic, LegendAppearance -> "Column"]

Out[6]=

Obtain a word-level histogram of test example lengths:

In[7]:=

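(* Count the words in each sentence (level 2) in parallel and plot the resulting length distributions *)
Histogram[
 ParallelMap[WordCount,
  ResourceData["Europarl English-Spanish Machine Translation Dataset V7", "TestData"], {2}],
 ChartLegends -> Automatic, LegendAppearance -> "Column"]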

Out[7]=

Bibliographic Citation

Wolfram Research, "Europarl English-Spanish Machine Translation Dataset V7" from the Wolfram Data Repository (2018)

License Information

Reproduction is authorized, provided that the source is acknowledged

Data Resource History

Date Created: 16 April 2018

Source Metadata

Title: Europarl: A Parallel Corpus for Statistical Machine Translation
Creator: Philipp Koehn
Publisher: Conference Proceedings: The Tenth Machine Translation Summit, pages 79-86. Phuket, Thailand, AAMT
Date: 2005
Language: English, Spanish
Source: http://www.statmt.org/europarl

Publisher Information

Publisher of Record: Wolfram Research