Japanese-English Subtitle Corpus

A parallel corpus for machine translation systems, information extraction and other language processing techniques

The Japanese-English Subtitle Corpus (JESC) is the product of a collaboration among Stanford University, Google Brain and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available English-Japanese corpora (3.2M parallel sentences), and covers the poorly represented domain of colloquial language.

The "ContentElements" field offers six options: "TrainingData", "TestData", "ValidationData", "TrainingDataset", "TestDataset" and "ValidationDataset". The first three are structured as associations; the three whose names end in "Dataset" are structured as datasets.

Examples

Basic Examples

Retrieve the resource:

In[1]:=

Out[1]=
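As a minimal sketch, the resource can be retrieved by name and its available elements listed through the "ContentElements" property of ResourceObject. The variable name ro is only illustrative, and the call assumes access to the Wolfram Data Repository:

```wolfram
(* Retrieve the resource object by its repository name *)
ro = ResourceObject["Japanese-English Subtitle Corpus"];

(* List the content elements available for this resource *)
ro["ContentElements"]
```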

Obtain the first three training examples:

In[2]:=

Out[2]=

Obtain the last three test examples:

In[3]:=

Out[3]=
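A sketch of how the first and last examples might be extracted, assuming (as the Part indexing in the validation example suggests) that each association maps a key to a list of sentences. The variable names train and test are illustrative:

```wolfram
(* Assumption: each element is an association of the form key -> list of sentences *)
train = ResourceData["Japanese-English Subtitle Corpus", "TrainingData"];
test = ResourceData["Japanese-English Subtitle Corpus", "TestData"];

(* First three training examples: positions 1 through 3 of every list *)
train[[All, ;; 3]]

(* Last three test examples: the final three positions of every list *)
test[[All, -3 ;;]]
```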

Obtain one random validation example:

In[4]:=

With[{data = ResourceData["Japanese-English Subtitle Corpus", "ValidationData"]},
 (* pick the index from the number of examples, not the number of keys in the association *)
 data[[All, RandomInteger[{1, Length@First@data}]]]]

Out[4]=

Dataset Form

Obtain five random pairs from the training set in Dataset form:

In[5]:=

Out[6]=

Obtain five random pairs from the test set in Dataset form:

In[7]:=

Out[8]=

Obtain five random pairs from the validation set in Dataset form:

In[9]:=

Out[10]=
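One way to draw such samples, sketched under the assumption that each row of the Dataset is a single sentence pair. The helper sampleFive is hypothetical and not part of the resource:

```wolfram
name = "Japanese-English Subtitle Corpus";

(* Hypothetical helper: apply RandomSample to the rows of the chosen Dataset element *)
sampleFive[element_String] := ResourceData[name, element][RandomSample[#, 5] &]

(* Five random pairs from each split, in Dataset form *)
sampleFive /@ {"TrainingDataset", "TestDataset", "ValidationDataset"}
```

Applying RandomSample inside the Dataset query keeps the result a Dataset, so it is displayed in the same tabular form as the full element.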

Analysis

Obtain a character-level histogram of test example lengths:

In[11]:=

Out[11]=
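A histogram of this kind could be produced roughly as follows, assuming the "TestData" association maps each key to a list of strings. The variable names are illustrative:

```wolfram
(* Assumption: "TestData" is an association of the form key -> list of strings *)
test = ResourceData["Japanese-English Subtitle Corpus", "TestData"];

(* StringLength is Listable, so mapping it over the association gives key -> list of character counts *)
lengths = StringLength /@ test;

(* One histogram series per key, labeled with the key *)
Histogram[Values[lengths], ChartLegends -> Keys[lengths]]
```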

Bibliographic Citation

Wolfram Research, "Japanese-English Subtitle Corpus" from the Wolfram Data Repository (2018)

License Information

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Data Resource History

Date Created: 1 June 2018

Source Metadata

Title: JESC: Japanese-English Subtitle Corpus
Creator: Reid Pryzant, Yongjoo Chung, Dan Jurafsky, Denny Britz
Publisher: arXiv eprints 1710.10639
Date: 2017
Source: https://nlp.stanford.edu/projects/jesc

Publisher Information

Publisher of Record: Wolfram Research