
SNLI(Stanford Natural Language Inference) corpus #32

Open
aneesh-joshi opened this issue Aug 13, 2018 · 3 comments

@aneesh-joshi

Dataset download link : https://nlp.stanford.edu/projects/snli/snli_1.0.zip

Dataset website : https://nlp.stanford.edu/projects/snli/

Paper : https://nlp.stanford.edu/pubs/snli_paper.pdf
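For convenience, a minimal sketch of downloading and unpacking the archive (the `snli_data` directory name is just an example, not part of the dataset):

```python
import os
import urllib.request
import zipfile

SNLI_URL = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'

def download_snli(target_dir='snli_data'):
    """Download the SNLI 1.0 archive and extract it into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    archive_path = os.path.join(target_dir, 'snli_1.0.zip')
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(SNLI_URL, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        # extraction creates target_dir/snli_1.0/ containing the .jsonl splits
        zf.extractall(target_dir)
```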

Brief Description (from website):

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.

Example Datapoints:

| Text | Judgments | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping |
| An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor. |
| A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road. |
| A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport. |
| A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella. |
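The Judgments column shows the five annotator labels for the validated examples; the gold label is their majority vote, and '-' when no label wins a majority (the reader script below skips those pairs). A purely illustrative sketch of that rule:

```python
from collections import Counter

def majority_gold_label(annotator_labels):
    """Return the majority label, or '-' if no label has a strict majority."""
    label, count = Counter(annotator_labels).most_common(1)[0]
    return label if count > len(annotator_labels) / 2 else '-'

print(majority_gold_label(['neutral', 'neutral', 'entailment', 'contradiction', 'neutral']))    # neutral
print(majority_gold_label(['neutral', 'neutral', 'entailment', 'entailment', 'contradiction'])) # -
```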

My Script for reading:

import json
import os
import re

class SnliReader:
	"""Reader for the SNLI dataset
	More details can be found here : https://nlp.stanford.edu/projects/snli/

	Each data point contains two sentences and their gold label ('contradiction', 'entailment' or 'neutral').
	Additionally, each data point carries the individual annotator labels; the reader returns these as well, but they can usually be ignored.

	Example datapoint:
	gold_label	sentence1_binary_parse	sentence2_binary_parse	sentence1_parse	sentence2_parse	sentence1	sentence2	captionID	pairID	label1	label2	label3	label4	label5
	neutral	( ( Two women ) ( ( are ( embracing ( while ( holding ( to ( go packages ) ) ) ) ) ) . ) )	( ( The sisters ) ( ( are ( ( hugging goodbye ) ( while ( holding ( to ( ( go packages ) ( after ( just ( eating lunch ) ) ) ) ) ) ) ) ) . ) )	(ROOT (S (NP (CD Two) (NNS women)) (VP (VBP are) (VP (VBG embracing) (SBAR (IN while) (S (NP (VBG holding)) (VP (TO to) (VP (VB go) (NP (NNS packages)))))))) (. .)))	(ROOT (S (NP (DT The) (NNS sisters)) (VP (VBP are) (VP (VBG hugging) (NP (UH goodbye)) (PP (IN while) (S (VP (VBG holding) (S (VP (TO to) (VP (VB go) (NP (NNS packages)) (PP (IN after) (S (ADVP (RB just)) (VP (VBG eating) (NP (NN lunch))))))))))))) (. .)))	Two women are embracing while holding to go packages.	The sisters are hugging goodbye while holding to go packages after just eating lunch.	4705552913.jpg#2	4705552913.jpg#2r1n	neutral	entailment	neutral	neutral	neutral

	Parameters
	----------
	filepath : str
		Path to the folder containing the unzipped SNLI .jsonl files

	"""
	
	def __init__(self, filepath):
		self.filepath = filepath
		self.filename = {}
		self.filename['train'] = 'snli_1.0_train.jsonl'
		self.filename['dev'] = 'snli_1.0_dev.jsonl'
		self.filename['test'] = 'snli_1.0_test.jsonl'
		self.label2index = {'contradiction': 0, 'entailment': 1, 'neutral': 2}

	def get_data(self, split):
		"""Returnd the data for the given split

		Parameters
		----------
		split : {'train', 'test', 'dev'}
			The split of the data

		Returns
		-------
		sentA_datalist, sentB_datalist, labels, annotator_labels
		"""
		x1, x2, labels, annotator_labels = [], [], [], []
		with open(os.path.join(self.filepath, self.filename[split]), 'r') as f:
			for line in f:
				line = json.loads(line)
				if line['gold_label'] == '-':
					# A '-' gold label means the annotators reached no majority; skip the whole datapoint
					continue
				x1.append(self._preprocess(line['sentence1']))
				x2.append(self._preprocess(line['sentence2']))
				labels.append(self.label2index[line['gold_label']])
				
				annotator_labels.append(line['annotator_labels'])
		return x1, x2, labels, annotator_labels

	def _preprocess(self, sent):
		"""lower, strip and split the string and remove unnecessaey characters

		Parameters
		----------
		sent : str
			The sentence to be preprocessed
		"""
		return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

	def get_label2index(self):
		"""Returns the label2index dict"""
		return self.label2index
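A quick usage sketch (assuming the unzipped snli_1.0 folder, i.e. the directory holding the .jsonl files, is passed as filepath):

```python
reader = SnliReader('snli_1.0')
x1, x2, labels, annotator_labels = reader.get_data('train')

print(len(x1))        # number of training pairs
print(x1[0], x2[0])   # first tokenized premise / hypothesis
print(labels[0])      # integer label; mapping available via reader.get_label2index()
# If one-hot labels are needed for a Keras model, to_categorical(labels)
# from keras.utils.np_utils can be applied at this point.
```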
@aneesh-joshi aneesh-joshi changed the title SNLI corpus SNLI(Stanford Natural Language Inference) corpus Aug 13, 2018
@adikolsur

@aneesh-joshi Can you please explain a bit about how to use this script and its functioning?

@aneesh-joshi
Author

Hi @adikolsur,
Could you mention which parts are unclear?
Have you taken a look at the comments and the links (to the paper and website)?

@FarhatAbdullah

Hi, can someone guide me on how I can find an Urdu corpus in this dataset?
