Brief Description (from website):

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.
Example Datapoints:

| Text | Judgments | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping |
| An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor. |
| A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road. |
| A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport. |
| A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella. |
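The Judgments column pairs each gold label with the individual annotator votes. As a rough illustration of how the two relate, here is a minimal sketch that assumes the gold label is the label chosen by a strict majority of the annotators, with `-` when no label reaches a majority. The helper names (`expand`, `gold_label`) are hypothetical and are not part of the dataset or of the script in this issue.

```python
from collections import Counter

# Single-letter codes used in the Judgments column of the table above.
ABBREVIATIONS = {"C": "contradiction", "E": "entailment", "N": "neutral"}


def expand(judgments):
    """Turn a string such as 'N N E C N' into a list of full label names."""
    return [ABBREVIATIONS[code] for code in judgments.split()]


def gold_label(labels):
    """Return the label with a strict majority of votes, or '-' if none has one.

    Assumption: a strict majority (e.g. at least 3 of 5 votes) decides the gold label.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else "-"


if __name__ == "__main__":
    print(gold_label(expand("N N E C N")))  # votes from the last row of the table
    print(gold_label(expand("N E C N E")))  # made-up votes with no majority
```

Pairs whose gold label is `-` are the ones the reading script skips.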
My Script for reading:
```python
import json
import os
import re

from keras.utils.np_utils import to_categorical


class SnliReader:
    """Reader for the SNLI dataset

    More details can be found here : https://nlp.stanford.edu/projects/snli/

    Each data point contains 2 sentences and their label ('contradiction', 'entailment', 'neutral').
    Additionally, it also provides annotator labels, which hold the range of labels given by the annotators.
    We will mostly ignore this.

    Example datapoint:
    gold_label sentence1_binary_parse sentence2_binary_parse sentence1_parse sentence2_parse sentence1 sentence2 captionID pairID label1 label2 label3 label4 label5
    neutral ( ( Two women ) ( ( are ( embracing ( while ( holding ( to ( go packages ) ) ) ) ) ) . ) ) ( ( The sisters ) ( ( are ( ( hugging goodbye ) ( while ( holding ( to ( ( go packages ) ( after ( just ( eating lunch ) ) ) ) ) ) ) ) ) . ) ) (ROOT (S (NP (CD Two) (NNS women)) (VP (VBP are) (VP (VBG embracing) (SBAR (IN while) (S (NP (VBG holding)) (VP (TO to) (VP (VB go) (NP (NNS packages)))))))) (. .))) (ROOT (S (NP (DT The) (NNS sisters)) (VP (VBP are) (VP (VBG hugging) (NP (UH goodbye)) (PP (IN while) (S (VP (VBG holding) (S (VP (TO to) (VP (VB go) (NP (NNS packages)) (PP (IN after) (S (ADVP (RB just)) (VP (VBG eating) (NP (NN lunch))))))))))))) (. .))) Two women are embracing while holding to go packages. The sisters are hugging goodbye while holding to go packages after just eating lunch. 4705552913.jpg#2 4705552913.jpg#2r1n neutral entailment neutral neutral neutral

    Parameters
    ----------
    filepath : str
        path to the folder with the snli data
    """

    def __init__(self, filepath):
        self.filepath = filepath
        self.filename = {}
        self.filename['train'] = 'snli_1.0_train.jsonl'
        self.filename['dev'] = 'snli_1.0_dev.jsonl'
        self.filename['test'] = 'snli_1.0_test.jsonl'
        self.label2index = {'contradiction': 0, 'entailment': 1, 'neutral': 2}

    def get_data(self, split):
        """Returns the data for the given split

        Parameters
        ----------
        split : {'train', 'test', 'dev'}
            The split of the data

        Returns
        -------
        sentA_datalist, sentB_datalist, labels, annotator_labels
        """
        x1, x2, labels, annotator_labels = [], [], [], []
        with open(os.path.join(self.filepath, self.filename[split]), 'r') as f:
            for line in f:
                line = json.loads(line)
                if line['gold_label'] == '-':
                    # In the case of this unknown label, we will skip the whole datapoint
                    continue
                x1.append(self._preprocess(line['sentence1']))
                x2.append(self._preprocess(line['sentence2']))
                labels.append(self.label2index[line['gold_label']])
                annotator_labels.append(line['annotator_labels'])
        return x1, x2, labels, annotator_labels

    def _preprocess(self, sent):
        """Lower, strip and split the string and remove unnecessary characters

        Parameters
        ----------
        sent : str
            The sentence to be preprocessed
        """
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

    def get_label2index(self):
        """Returns the label2index dict"""
        return self.label2index
```
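To see what the reading logic produces without downloading the corpus, the following is a minimal standalone sketch that applies the same steps (lowercase, replace non-alphanumeric characters with spaces, split, skip `-` labels) to a tiny JSONL file written to a temporary folder. The function names `preprocess` and `read_split` and the file name `tiny.jsonl` are hypothetical. The first record reuses a sentence pair from the examples above, and the second record is made up purely to show the skip.

```python
import json
import os
import re
import tempfile

LABEL2INDEX = {"contradiction": 0, "entailment": 1, "neutral": 2}


def preprocess(sent):
    """Lowercase, replace non-alphanumeric characters with spaces, then split."""
    return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()


def read_split(path):
    """Read one JSONL file, skipping records whose gold label is '-'."""
    x1, x2, labels = [], [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["gold_label"] == "-":
                continue  # no agreed label, so the whole datapoint is dropped
            x1.append(preprocess(record["sentence1"]))
            x2.append(preprocess(record["sentence2"]))
            labels.append(LABEL2INDEX[record["gold_label"]])
    return x1, x2, labels


# Two illustrative records; the second one is invented to exercise the skip.
records = [
    {"gold_label": "entailment",
     "sentence1": "A soccer game with multiple males playing.",
     "sentence2": "Some men are playing a sport."},
    {"gold_label": "-",
     "sentence1": "A made-up premise.",
     "sentence2": "A made-up hypothesis."},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tiny.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    x1, x2, labels = read_split(path)

print(x1)      # tokenized premises
print(x2)      # tokenized hypotheses
print(labels)  # integer label indices
```

Note that the regular expression also drops apostrophes and hyphens, so a word like "don't" becomes the two tokens "don" and "t".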
aneesh-joshi changed the title from "SNLI corpus" to "SNLI(Stanford Natural Language Inference) corpus" on Aug 13, 2018
Dataset download link : https://nlp.stanford.edu/projects/snli/snli_1.0.zip
Dataset website : https://nlp.stanford.edu/projects/snli/
Paper : https://nlp.stanford.edu/pubs/snli_paper.pdf