Kyoto Free Translation Task Data

A parallel corpus for the evaluation and development of Japanese-English machine translation systems

The data was originally prepared by the National Institute of Information and Communications Technology (NICT) and released as the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles. It was then processed to form the Kyoto Free Translation Task dataset: sentences with fewer than 1 or more than 40 words were removed, and the remaining sentences were separated into training, tuning, development and test sets. The training set is intended for training statistical models, the tuning set for tuning model weights, the development set for testing a system during development, and the test set for reporting final results. The validation set presented here corresponds to the development set.
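The length-based cleaning step can be sketched as follows. This is a minimal illustration in Python, not the script used to build the dataset: the function names `keep_pair` and `clean_corpus` are hypothetical, and it assumes that both sides of each sentence pair are already tokenized with spaces and that the 1-to-40-word limit is applied to both languages.

```python
def keep_pair(ja_tokens, en_tokens, min_len=1, max_len=40):
    """Return True if both sides have between min_len and max_len words."""
    return all(min_len <= len(tokens) <= max_len
               for tokens in (ja_tokens, en_tokens))


def clean_corpus(pairs, min_len=1, max_len=40):
    """Keep only the sentence pairs whose two sides pass the length check.

    Assumption: each sentence is a string of space-separated tokens, so
    str.split() gives the word count. Raw Japanese text would need a
    tokenizer first.
    """
    kept = []
    for ja, en in pairs:
        if keep_pair(ja.split(), en.split(), min_len, max_len):
            kept.append((ja, en))
    return kept


# Illustrative (made-up) sentence pairs, not taken from the corpus.
example_pairs = [
    ("京都 は 日本 の 都市 です", "Kyoto is a city in Japan"),
    ("", "empty source side"),
    ("短い 文", " ".join(["word"] * 41)),
]
cleaned = clean_corpus(example_pairs)
```

Here only the first pair survives: the second has an empty Japanese side, and the third has 41 words on the English side.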

The "ContentElements" field contains eight options, in two groups of four. "TrainingData", "TestData", "ValidationData" and "TuningData" are structured as associations. "TrainingDataset", "TestDataset", "ValidationDataset" and "TuningDataset" contain the same four sets structured as datasets.