Europarl English-Spanish Machine Translation Dataset V7

A parallel corpus for machine translation, drawn from the proceedings of the European Parliament

The Europarl dataset contains parallel text corpora in 21 languages, drawn from the proceedings of the European Parliament between 1996 and 2011. The English-Spanish corpus contains 1.9 million training sentences and 44,000 test sentences. Training examples are taken from the originally distributed parallel corpus, while test examples are extracted from the Q4/2000 portion of the data, following the publisher's suggestion. Duplicate examples have been removed from the test set. The "ContentElements" field contains four options: "TrainingData", "TestData", "TrainingDataset" and "TestDataset". "TrainingData" and "TestData" are structured as associations, while "TrainingDataset" and "TestDataset" are structured as datasets.

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["Europarl English-Spanish Machine Translation Dataset V7"]
Out[1]=

Obtain the first three training examples:

In[2]:=
ResourceData[
  "Europarl English-Spanish Machine Translation Dataset V7"][[All, ;; 3]]
Out[2]=

Obtain the first three test examples:

In[3]:=
ResourceData[
  "Europarl English-Spanish Machine Translation Dataset V7", "TestData"][[All, ;; 3]]
Out[3]=
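
As a quick check of the structure described in the summary, the following sketch lists the available content elements and counts the entries under each key of the test association. It assumes only that "TestData" is an association of lists, as the examples above suggest; outputs are not shown here.

In[ ]:=
(* list the content elements offered by the resource *)
ResourceObject[
  "Europarl English-Spanish Machine Translation Dataset V7"]["ContentElements"]

In[ ]:=
(* number of entries under each key of the test association *)
Length /@ ResourceData[
   "Europarl English-Spanish Machine Translation Dataset V7", "TestData"]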

Dataset Form

Obtain ten random pairs from the training set in Dataset form:

In[4]:=
RandomSample[
 ResourceData[
  "Europarl English-Spanish Machine Translation Dataset V7", "TrainingDataset"], 10]
Out[4]=

Obtain ten random pairs from the test set in Dataset form:

In[5]:=
RandomSample[
 ResourceData[
  "Europarl English-Spanish Machine Translation Dataset V7", "TestDataset"], 10]
Out[5]=
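
RandomSample returns different rows on each evaluation. When a repeatable view is preferable, a fixed slice can be taken instead. A minimal sketch, assuming the "TestDataset" element supports the usual Dataset part syntax:

In[ ]:=
(* first three rows of the test set in Dataset form *)
ResourceData[
  "Europarl English-Spanish Machine Translation Dataset V7", "TestDataset"][;; 3]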

Analysis

Obtain a character-level histogram of test example lengths:

In[6]:=
Histogram[
 Map[StringLength,
  ResourceData[
   "Europarl English-Spanish Machine Translation Dataset V7", "TestData"], {2}],
 ChartLegends -> Automatic, LegendAppearance -> "Column"]
Out[6]=

Obtain a word-level histogram of test example lengths:

In[7]:=
Histogram[
 ParallelMap[WordCount,
  ResourceData[
   "Europarl English-Spanish Machine Translation Dataset V7", "TestData"], {2}],
 ChartLegends -> Automatic, LegendAppearance -> "Column"]
Out[7]=
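
The two histograms above show each language separately. A related view is the per-pair ratio of character lengths. The following is only a sketch: it assumes the "TestData" association holds exactly two equal-length lists of strings, one per language, and it does not assume which key comes first, so the ratio is simply "second entry over first entry". The variable names are illustrative.

In[ ]:=
testData = ResourceData[
   "Europarl English-Spanish Machine Translation Dataset V7", "TestData"];
(* assumption: two equal-length lists of strings, one per language *)
lengthPairs = Transpose[Values[Map[StringLength, testData, {2}]]];
(* drop pairs with an empty side to avoid division by zero *)
validPairs = Select[lengthPairs, Min[#] > 0 &];
(* length of the second entry relative to the first, per sentence pair *)
ratios = N[validPairs[[All, 2]]/validPairs[[All, 1]]];
Histogram[ratios]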

Wolfram Research, "Europarl English-Spanish Machine Translation Dataset V7" from the Wolfram Data Repository (2018)  

License Information

Reproduction is authorized, provided that the source is acknowledged
