A parallel corpus for machine translation systems, information extraction and other language processing techniques
The Japanese-English Subtitle Corpus (JESC) is the product of a collaboration among Stanford University, Google Brain and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available English-Japanese corpora (3.2M parallel sentences), and covers the poorly represented domain of colloquial language.
The "ContentElements" field contains six options: "TrainingData", "TestData", "ValidationData", "TrainingDataset", "TestDataset" and "ValidationDataset". "TrainingData", "TestData" and "ValidationData" are structured as associations. "TrainingDataset", "TestDataset" and "ValidationDataset" are structured as datasets.
Examples
Basic Examples
Retrieve the resource:
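One way to do this is with the built-in ResourceObject function (a sketch; the resource name follows the citation below):

In[1]:= ResourceObject["Japanese-English Subtitle Corpus"]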
Obtain the first three training examples:
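A sketch of such an input: ResourceData fetches the "TrainingData" element by name, and Take extracts its leading entries:

In[2]:= Take[ResourceData["Japanese-English Subtitle Corpus", "TrainingData"], 3]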
Obtain the last three test examples:
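Similarly, a negative count to Take yields the trailing entries (a sketch):

In[3]:= Take[ResourceData["Japanese-English Subtitle Corpus", "TestData"], -3]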
Obtain one random validation example:
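A sketch using RandomSample, which works on both lists and associations:

In[4]:= RandomSample[ResourceData["Japanese-English Subtitle Corpus", "ValidationData"], 1]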
Dataset Form
Obtain five random pairs from the training set in Dataset form:
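A sketch using RandomSample on the "TrainingDataset" element, which is returned as a Dataset object:

In[5]:= RandomSample[ResourceData["Japanese-English Subtitle Corpus", "TrainingDataset"], 5]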
Obtain five random pairs from the test set in Dataset form:
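The same pattern applies to the test split (a sketch):

In[6]:= RandomSample[ResourceData["Japanese-English Subtitle Corpus", "TestDataset"], 5]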
Obtain five random pairs from the validation set in Dataset form:
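And likewise for the validation split (a sketch):

In[7]:= RandomSample[ResourceData["Japanese-English Subtitle Corpus", "ValidationDataset"], 5]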
Analysis
Obtain a character-level histogram of test example lengths:
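One way to produce such a histogram, assuming the "TestData" element is an association whose values are caption strings (an assumption about the data layout), is to map StringLength over the values:

In[8]:= Histogram[StringLength[Values[ResourceData["Japanese-English Subtitle Corpus", "TestData"]]]]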
Bibliographic Citation
Wolfram Research, "Japanese-English Subtitle Corpus" from the Wolfram Data Repository (2018)
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)