SQuAD v2.0 Tokens Generated with WL

A list of isolated words and symbols from the SQuAD dataset, which consists of a set of Wikipedia articles labeled for question answering and reading comprehension.

The Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. Unanswerable questions were added to the dataset in v2.0.

The "ContentElements" field contains eight options: "Dataset", "TrainingData", "ValidationData", "TrainingMetadata", "ValidationMetadata", "Data", "ColumnNames" and "ColumnDescriptions". "Dataset" contains the full dataset. Please note that data marked "Validation" in the ValidationRole field can have multiple possible answers for each question. "TrainingData" and "ValidationData" are formatted for standard question answering usage; for every question, only the first answer of the full dataset is selected. "TrainingMetadata" and "ValidationMetadata" contain the title of the Wikipedia article to which each question ID corresponds. "Data" contains the full dataset structured as an association. "ColumnNames" and "ColumnDescriptions" provide more information about the columns of the dataset.

Modifications from the original dataset:

- Data marked "Training" in the ValidationRole field corresponds to the Training Set v2.0 subset of the original dataset; data marked "Validation" corresponds to the Dev Set v2.0 subset.
- The original dataset is 0-indexed; since the Wolfram Language is 1-indexed, 1 was added to the value of "AnswerPosition".
- This is a pre-processed form of the original dataset: the pieces of text (Context, Questions and Answers) have been converted to lists of tokens with the following simple tokenization done in the Wolfram Language (see the sketch after this list):

SimpleTokenize[text_String] := DeleteCases[StringTrim @ StringSplit[text, {WordBoundary, x : PunctuationCharacter :> x}], ""];

- The answers are then labeled with the field "AnswerSpan", which gives the first and last indices of the answer tokens in Context. Most of the labeling errors in the original dataset could be detected and corrected while computing these spans, given the character positions and the expected answer content.
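
As an illustrative sketch of this tokenization (the sample sentence is made up, not taken from the dataset), applying the definition above shows how punctuation characters become separate tokens, counted with the Wolfram Language's 1-based indexing:

SimpleTokenize[text_String] := DeleteCases[StringTrim @ StringSplit[text, {WordBoundary, x : PunctuationCharacter :> x}], ""];

(* punctuation is split off into tokens of its own *)
SimpleTokenize["The answer, a span of text, starts at token 9."]
(* expected output: {"The", "answer", ",", "a", "span", "of", "text", ",", "starts", "at", "token", "9", "."} *)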

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["SQuAD v2.0 Tokens Generated with WL"]
Out[1]=

Retrieve a sample of the dataset:

In[2]:=
RandomSample[ResourceData["SQuAD v2.0 Tokens Generated with WL"], 4]
Out[2]=

Working with Training and Validation Data

Obtain the first example from the training dataset:

In[3]:=
sample = ResourceData["SQuAD v2.0 Tokens Generated with WL", 
   "TrainingData"][[All, 1]]
Out[3]=

View the question and answer associated with the passage:

In[4]:=
sample["Question"]
Out[4]=
In[5]:=
sample["Answer"]
Out[5]=

"AnswerSpan" shows where the first word and the last word of the answer are located in the passage:

In[6]:=
sample["AnswerSpan"]
Out[6]=

Test that the tokens between the starting and ending positions are identical to the answer:

In[7]:=
answer = sample["Context"][[
  sample["AnswerSpan"][[1]] ;; sample["AnswerSpan"][[2]]]]
Out[7]=
In[8]:=
answer == sample["Answer"]
Out[8]=

Add the training metadata to the training data:

In[9]:=
trainingSet = 
  Join[ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "TrainingData"], 
   ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "TrainingMetadata"]];

There is a one-to-one correspondence between the data and the metadata; therefore, all lists will have the same length:

In[10]:=
Length /@ trainingSet
Out[10]=

View the joined data:

In[11]:=
trainingSet[[All, 1]]
Out[11]=

Dataset Size

Display the number of Wikipedia pages in the training and validation datasets:

In[12]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "TrainingMetadata"]["Title"]]]
Out[12]=
In[13]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "ValidationMetadata"]["Title"]]]
Out[13]=

Display the number of Wikipedia paragraphs in the training and validation datasets:

In[14]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", "TrainingData"][
   "Context"]]]
Out[14]=
In[15]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "ValidationData"]["Context"]]]
Out[15]=

Display the number of unique questions in the training and validation datasets:

In[16]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", "TrainingData"][
   "Question"]]]
Out[16]=
In[17]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v2.0 Tokens Generated with WL", 
    "ValidationData"]["Question"]]]
Out[17]=

Wolfram Research, "SQuAD v2.0 Tokens Generated with WL" from the Wolfram Data Repository (2019), https://doi.org/10.24097/wolfram.27825.data

License Information

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
