Wolfram Data Repository
Immediate Computable Access to Curated Contributed Data
A parallel corpus for machine translation from the proceedings of the European Parliament
The Europarl dataset contains text corpora in 21 languages, drawn from the proceedings of the European Parliament between 1996 and 2011. The English-French corpus contains 2 million training sentences and 45,000 test sentences. Training examples are taken from the originally distributed parallel corpus, while test examples are extracted from the Q4/2000 portion of the data, following the publisher's suggestion. Duplicate examples have been removed from the test set. The "ContentElements" field lists four elements: "TrainingData", "TestData", "TrainingDataset" and "TestDataset". "TrainingData" and "TestData" are structured as associations; "TrainingDataset" and "TestDataset" are structured as datasets.
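The description says duplicates were removed from the test set but does not say how they were detected. As a rough illustration only, the Python sketch below assumes exact matching on the full (English, French) pair and keeps the first occurrence. The function name `dedupe_pairs` and the sample sentences are hypothetical and are not taken from the dataset.

```python
def dedupe_pairs(pairs):
    """Drop exact duplicate (source, target) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = (source, target)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical English-French pairs, invented for illustration.
test_pairs = [
    ("Resumption of the session", "Reprise de la session"),
    ("The debate is closed.", "Le débat est clos."),
    ("Resumption of the session", "Reprise de la session"),
]

unique_pairs = dedupe_pairs(test_pairs)
print(len(test_pairs), "->", len(unique_pairs))
```

A stricter or looser notion of "duplicate" (for example, comparing after case folding or whitespace normalization) would change only how `key` is built.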
Retrieve the resource:
[In[1], Out[1]: notebook cells, shown as images in the original]
Obtain the first three training examples:
[In[2], Out[2]: notebook cells, shown as images in the original]
Obtain the first three test examples:
[In[3], Out[3]: notebook cells, shown as images in the original]
Obtain a character-level histogram of test example lengths:
[In[6], Out[6]: notebook cells, shown as images in the original; the output is a histogram of test example lengths in characters]
Obtain a word-level histogram of test example lengths:
[In[7], Out[7]: notebook cells, shown as images in the original; the output is a histogram of test example lengths in words]
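For readers working outside the Wolfram Language, the two length histograms can be approximated as follows. This is a minimal Python sketch, not the code used on this page: the helper `length_histogram`, the bin widths and the sample sentences are all hypothetical, and words are counted by splitting on whitespace, which may differ from the tokenization used for the original plot.

```python
from collections import Counter


def length_histogram(sentences, level="char", bin_width=10):
    """Count sentences per length bin; level is 'char' or 'word'."""
    if level == "char":
        lengths = [len(s) for s in sentences]
    elif level == "word":
        # Assumption: whitespace splitting stands in for word tokenization.
        lengths = [len(s.split()) for s in sentences]
    else:
        raise ValueError("level must be 'char' or 'word'")
    # Map each length to the lower edge of its bin, e.g. 27 -> 20 for bin_width=10.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))


# Hypothetical test sentences, invented for illustration.
sentences = [
    "Resumption of the session",
    "The debate is closed.",
    "Thank you very much, Mr President.",
]

char_hist = length_histogram(sentences, level="char", bin_width=10)
word_hist = length_histogram(sentences, level="word", bin_width=2)
print(char_hist)
print(word_hist)
```

Each returned dictionary maps the lower edge of a bin to the number of sentences whose length falls in that bin, so it can be passed directly to a bar-chart routine.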
Wolfram Research, "Europarl English-French Machine Translation Dataset V7" from the Wolfram Data Repository (2018)
Reproduction is authorized, provided that the source is acknowledged