
Japanese-English Subtitle Corpus

A parallel corpus for machine translation systems, information extraction and other language processing techniques

The Japanese-English Subtitle Corpus (JESC) is the product of a collaboration among Stanford University, Google Brain and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available English-Japanese corpora (3.2M parallel sentences), and covers the poorly represented domain of colloquial language.

The "ContentElements" field lists six elements: "TrainingData", "TestData", "ValidationData", "TrainingDataset", "TestDataset" and "ValidationDataset". "TrainingData", "TestData" and "ValidationData" are structured as associations; "TrainingDataset", "TestDataset" and "ValidationDataset" are structured as Dataset objects.
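
As a quick check of the element names described above, the following sketch lists what the resource reports and inspects the head of one element of each kind. It assumes the "ContentElements" property can be queried on the resource object; the variable name ro is illustrative and not part of the original examples.

```wolfram
(* ro is an illustrative name for the resource object *)
ro = ResourceObject["Japanese-English Subtitle Corpus"];

(* List the content elements the resource reports *)
ro["ContentElements"]

(* Inspect how one element of each kind is structured *)
Head[ResourceData[ro, "TestData"]]
Head[ResourceData[ro, "TestDataset"]]
```

The test split is used here only because it is presumably much smaller than the training split; any of the six element names could be substituted.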

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["Japanese-English Subtitle Corpus"]
Out[1]=

Obtain the first three training examples:

In[2]:=
ResourceData["Japanese-English Subtitle Corpus"][[All, ;; 3]]
Out[2]=

Obtain the last three test examples:

In[3]:=
ResourceData["Japanese-English Subtitle Corpus", 
  "TestData"][[All, -3 ;;]]
Out[3]=

Obtain one random validation example:

In[4]:=
With[{data = 
   ResourceData["Japanese-English Subtitle Corpus", "ValidationData"]},
 (* Length[data] would count the association's keys, so use the length of a column *)
 data[[All, RandomInteger[{1, Length[First[data]]}]]]]
Out[4]=

Dataset Form

Obtain five random pairs from the training set in Dataset form:

In[5]:=
RandomSample[
 ResourceData["Japanese-English Subtitle Corpus", "TrainingDataset"],
  5]
Out[5]=

Obtain five random pairs from the test set in Dataset form:

In[6]:=
RandomSample[
 ResourceData["Japanese-English Subtitle Corpus", "TestDataset"], 5]
Out[6]=

Obtain five random pairs from the validation set in Dataset form:

In[7]:=
RandomSample[
 ResourceData["Japanese-English Subtitle Corpus", 
  "ValidationDataset"], 5]
Out[7]=
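
Before sampling, it can help to know how many pairs a split holds. A minimal sketch, assuming Length applies to these Dataset objects the way it does to an ordinary list of rows, and restricted to the two splits that are presumably the smaller ones:

```wolfram
(* Count the rows in the test and validation splits, keyed by element name *)
AssociationMap[
 Length[ResourceData["Japanese-English Subtitle Corpus", #]] &,
 {"TestDataset", "ValidationDataset"}]
```

Adding "TrainingDataset" to the list should work the same way, at the cost of loading the full training split.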

Analysis

Obtain a character-level histogram of test example lengths:

In[8]:=
Histogram[
 Map[StringLength, 
  ResourceData["Japanese-English Subtitle Corpus", "TestData"], {2}], 
 ChartLegends -> Automatic, LegendAppearance -> "Column"]
Out[8]=
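
The same per-sentence character counts can be summarized numerically. This is a sketch that assumes, as the histogram example implies, that "TestData" is an association whose values are lists of strings; the variable name lengths is illustrative.

```wolfram
(* Character counts per sentence, keyed the same way as the test data *)
lengths = Map[StringLength, 
   ResourceData["Japanese-English Subtitle Corpus", "TestData"], {2}];

(* Mean and median sentence length for each key of the association *)
Map[<|"Mean" -> N[Mean[#]], "Median" -> Median[#]|> &, lengths]
```

Reporting the median next to the mean guards against a few very long captions skewing the summary.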

Wolfram Research, "Japanese-English Subtitle Corpus" from the Wolfram Data Repository (2018) 

License Information

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
