Kyoto Free Translation Task Data

A parallel corpus for the evaluation and development of Japanese-English machine translation systems

The data was originally prepared by the National Institute for Information and Communications Technology (NICT) and released as the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles. The data was processed to form the Kyoto Free Translation Task dataset. Data was cleaned to remove sentences with fewer than 1 or more than 40 words, and separated into training, tuning, development and test sets. The training data should be used for training statistical models, tuning data used for tuning weights, development data used for testing the system in development and testing data used for reporting final results. The validation sets presented here correspond to the development set.

The "ContentElements" field contains eight options: "TrainingData", "TestData", "ValidationData", "TuningData", "TrainingDataset", "TestDataset", "ValidationDataset" and "TuningDataset". "TrainingData", "TestData", "ValidationData" and "TuningData" are structured as associations. "TrainingDataset", "TestDataset", "ValidationDataset" and "TuningDataset" are structured as datasets.

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["Kyoto Free Translation Task Data"]
Out[1]=

Obtain the first three training examples:

In[2]:=
ResourceData["Kyoto Free Translation Task Data"][[All, ;; 3]]
Out[2]=

Obtain the last three test examples:

In[3]:=
ResourceData["Kyoto Free Translation Task Data", "TestData"][[All, -3 ;;]]
Out[3]=

Obtain the one random validation example:

In[4]:=
ResourceData["Kyoto Free Translation Task Data", "ValidationData"][[All, RandomInteger[{1, Length@ResourceData["Kyoto Free Translation Task Data", "ValidationData"]}]]]
Out[4]=

Obtain the one random tuning example:

In[5]:=
ResourceData["Kyoto Free Translation Task Data", "TuningData"][[All, RandomInteger[{1, Length@ResourceData["Kyoto Free Translation Task Data", "TuningData"]}]]]
Out[5]=

Dataset Form

Obtain five random pairs from the training set in Dataset form:

In[6]:=
RandomSample[
 ResourceData["Kyoto Free Translation Task Data", "TrainingDataset"],
  5]
Out[6]=

Obtain five random pairs from the test set in Dataset form:

In[7]:=
RandomSample[
 ResourceData["Kyoto Free Translation Task Data", "TestDataset"], 5]
Out[7]=

Obtain five random pairs from the validation set in Dataset form:

In[8]:=
RandomSample[
 ResourceData["Kyoto Free Translation Task Data", "ValidationDataset"], 5]
Out[8]=

Obtain five random pairs from the tuning set in Dataset form:

In[9]:=
RandomSample[
 ResourceData["Kyoto Free Translation Task Data", "TuningDataset"], 5]
Out[9]=

Analysis

Obtain a character-level histogram of test example lengths:

In[10]:=
Histogram[
 Map[StringLength, ResourceData["Kyoto Free Translation Task Data", "TestData"], {2}], ChartLegends -> Automatic, LegendAppearance -> "Column"]
Out[10]=

Wolfram Research, "Kyoto Free Translation Task Data" from the Wolfram Data Repository (2018)  

License Information

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

Data Resource History

Source Metadata

Publisher Information