SQuAD v1.1

A dataset for question answering and reading comprehension from a set of Wikipedia articles

The Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

The "ContentElements" field contains eight options: "Dataset", "TrainingData", "ValidationData", "TrainingMetadata", "ValidationMetadata", "Data", "ColumnNames" and "ColumnDescriptions". "Dataset" contains the full dataset. Note that data marked "Validation" in the ValidationRole field can have several possible answers for each question. "TrainingData" and "ValidationData" are formatted for standard question answering usage: for every question, only the first answer in the full dataset is kept. "TrainingMetadata" and "ValidationMetadata" contain the title of the Wikipedia article to which each question ID corresponds. "Data" contains the full dataset structured as an association. "ColumnNames" and "ColumnDescriptions" provide more information about the columns of the dataset.

Modifications from the original dataset: Data marked "Training" in the ValidationRole field corresponds to the Training Set v1.1 subset of the original dataset. Data marked "Validation" in the ValidationRole field corresponds to the Dev Set v1.1 subset of the original dataset. The original dataset uses 0-based character offsets. Because the Wolfram Language is 1-indexed, 1 was added to each "AnswerPosition" value so that positions are accurate in the Wolfram Language.
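The index shift can be illustrated with a minimal sketch. The passage, the answer and the 0-based offset below are made-up values, not taken from the dataset, and zeroBasedStart is a hypothetical name for an offset as the original files would store it:

```wolfram
(* Toy values, not taken from the dataset *)
context = "The Eiffel Tower is in Paris.";
answer = "Paris";

(* Hypothetical 0-based character offset of the answer *)
zeroBasedStart = 23;

(* Shift by one to obtain a 1-based position usable with StringTake *)
answerPosition = zeroBasedStart + 1;

StringTake[context, {answerPosition, answerPosition + StringLength[answer] - 1}]
(* "Paris" *)
```

Without the shift, the same call would return " Pari", one character too early.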

Examples

Basic Examples

Retrieve the resource:

In[1]:=
ResourceObject["SQuAD v1.1"]
Out[1]=

Retrieve a sample of the dataset:

In[2]:=
RandomSample[ResourceData["SQuAD v1.1"], 4]
Out[2]=

Working with training and validation data

Obtain the first example from the training dataset:

In[3]:=
sample = ResourceData["SQuAD v1.1", "TrainingData"][[All, 1]]
Out[3]=

View the question and answer associated with the passage:

In[4]:=
sample["Question"]
Out[4]=
In[5]:=
sample["Answer"]
Out[5]=

"AnswerPosition" shows where the first character of the answer is located in the passage. The position of the last character can be computed:

In[6]:=
answerStart = sample["AnswerPosition"]
Out[6]=
In[7]:=
answerEnd = sample["AnswerPosition"] + StringLength[sample["Answer"]] - 1
Out[7]=

Check that the span from the starting position to the ending position is identical to the answer:

In[8]:=
answer = StringTake[sample["Context"], answerStart ;; answerEnd]
Out[8]=
In[9]:=
answer == sample["Answer"]
Out[9]=
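The same check can be wrapped in a reusable predicate. This is a sketch only: spanMatchesQ is a hypothetical helper, not part of the resource, and the record below uses made-up values. The bounds test runs first so that StringTake is never called with an out-of-range position:

```wolfram
(* Hypothetical helper: True if answer occupies the span starting at the 1-based position pos *)
spanMatchesQ[context_String, answer_String, pos_Integer] :=
  pos >= 1 && pos + StringLength[answer] - 1 <= StringLength[context] &&
   StringTake[context, {pos, pos + StringLength[answer] - 1}] === answer

(* Toy record with made-up values *)
toy = <|"Context" -> "Marie Curie won the Nobel Prize in 1903.",
   "Answer" -> "1903", "AnswerPosition" -> 36|>;

spanMatchesQ[toy["Context"], toy["Answer"], toy["AnswerPosition"]]
(* True *)

(* An off-by-one position fails the check *)
spanMatchesQ[toy["Context"], toy["Answer"], 35]
(* False *)
```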

Add the training metadata to the training data:

In[10]:=
trainingSet = Join[ResourceData["SQuAD v1.1", "TrainingData"], ResourceData["SQuAD v1.1", "TrainingMetadata"]];

There is a one-to-one correspondence between the data and the metadata, so all columns have the same length:

In[11]:=
Length /@ trainingSet
Out[11]=

View an example of the joined data:

In[12]:=
trainingSet[[All, 1]]
Out[12]=
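The joined data is column-oriented: each key holds one list. When one association per example is more convenient, the columns can be transposed. A minimal sketch on toy data follows; the values are made up, and the key names, "Title" in particular, are assumptions rather than a guaranteed description of the metadata:

```wolfram
(* Toy column-oriented data with made-up values *)
columns = <|
   "Question" -> {"Where is the Eiffel Tower?", "Who discovered radium?"},
   "Answer" -> {"Paris", "Marie Curie"},
   "Title" -> {"Eiffel_Tower", "Radium"}|>;

(* Build one association per example by transposing the column values *)
rows = Map[AssociationThread[Keys[columns], #] &, Transpose[Values[columns]]];

rows[[2]]["Answer"]
(* "Marie Curie" *)
```

This only works when every column has the same length, which is the property verified above.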

Aggregation

Define the question types:

In[13]:=
questionTypes = {"What", "How many", "How", "Whom", "Whose", "Who", "When", "Which", "Where", "Why", "Be/Do/etc."};

Define the patterns corresponding to these questions:

In[14]:=
patternToQuestionType = Append[StartOfString | (___ ~~ " ") ~~ ToLowerCase[#] ~~ ((" " | "," | "s " | "'" | "\"" | ":") ~~ __) | (PunctuationCharacter ~~ ("" | " " ...)) ~~ EndOfString -> # & /@ Most[questionTypes], StartOfString ~~ __ ~~ EndOfString -> Last[questionTypes]];

Classify the questions from the SQuAD dataset:

In[15]:=
classifiedQuestions = Map[# -> StringReplace[ToLowerCase[#], patternToQuestionType] &, Flatten[Normal[
     ResourceData["SQuAD v1.1"][[All, "QuestionAnswerSets"]][[All, All, "Question"]]]]];
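The rule list used above is dense. As a rough illustration of the idea, here is a simplified classifier that is not equivalent to patternToQuestionType: it returns the first type, in priority order, that occurs as a whole word in the question. The names simpleTypes and classifyQuestion are hypothetical, and the sample questions are made up:

```wolfram
(* Lowercase question words in priority order; "how many" precedes "how" so the more specific type wins *)
simpleTypes = {"how many", "how", "whom", "whose", "who", "what", "when", "which", "where", "why"};

(* Return the first type found as a whole word; fall back to "be/do/etc." *)
classifyQuestion[q_String] := SelectFirst[simpleTypes,
   StringContainsQ[ToLowerCase[q], WordBoundary ~~ # ~~ WordBoundary] &, "be/do/etc."]

classifyQuestion /@ {"How many moons does Mars have?", "Whose theory was it?", "Is the sky blue?"}
(* {"how many", "whose", "be/do/etc."} *)
```

Unlike the original rules, this sketch ignores where in the question the word appears, so the two approaches can disagree on questions that contain several question words.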

Tabulate the number of questions of each type and plot the distribution:

In[16]:=
counts = Dataset[CountsBy[classifiedQuestions, Last]][SortBy[-# &]]
Out[16]=
In[17]:=
PieChart[counts, ChartLegends -> Keys[Normal[counts]], ChartStyle -> "Rainbow"]
Out[17]=
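Raw counts are often easier to compare as fractions of the total. A minimal sketch on a made-up list of labels follows; types, toyCounts and fractions are hypothetical names, and the values are not dataset statistics:

```wolfram
(* Toy list of question types with made-up values *)
types = {"What", "Who", "What", "When", "What", "Who"};

(* Tally each type, then sort by decreasing count *)
toyCounts = SortBy[Counts[types], -# &];

(* Divide each count by the total to get exact fractions *)
fractions = Map[#/Total[toyCounts] &, toyCounts]
(* <|"What" -> 1/2, "Who" -> 1/3, "When" -> 1/6|> *)
```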

Display some examples for each type of question:

In[18]:=
questionsOfType[type_] := First /@ Normal[
   Select[Association[classifiedQuestions], # == type &]]
In[19]:=
RandomSample[questionsOfType["What"], 5] // Column
Out[19]=
In[20]:=
RandomSample[questionsOfType["Who"], 5] // Column
Out[20]=
In[21]:=
RandomSample[questionsOfType["When"], 5] // Column
Out[21]=
In[22]:=
RandomSample[questionsOfType["Be/Do/etc."], 5] // Column
Out[22]=

Dataset size

Display the number of distinct Wikipedia articles:

In[23]:=
Length[DeleteDuplicates[
  ResourceData["SQuAD v1.1", "Dataset"][All, "Title"]]]
Out[23]=

Display the number of Wikipedia paragraphs:

In[24]:=
Length[ResourceData["SQuAD v1.1", "Dataset"]]
Out[24]=

Display the number of question/answer pairs in the training and validation datasets:

In[25]:=
Length[ResourceData["SQuAD v1.1", "TrainingData"]["Context"]]
Out[25]=
In[26]:=
Length[ResourceData["SQuAD v1.1", "ValidationData"]["Context"]]
Out[26]=

Wolfram Research, "SQuAD v1.1" from the Wolfram Data Repository (2018)  

License Information

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
