
SNLI(Stanford Natural Language Inference) corpus #32

Open
aneesh-joshi opened this issue Aug 13, 2018 · 3 comments

@aneesh-joshi

Dataset download link : https://nlp.stanford.edu/projects/snli/snli_1.0.zip

Dataset website : https://nlp.stanford.edu/projects/snli/

Paper : https://nlp.stanford.edu/pubs/snli_paper.pdf
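For convenience, a minimal sketch of downloading and unpacking the archive (the `snli_data` directory name is just an example, not part of the dataset):

```python
import os
import urllib.request
import zipfile

SNLI_URL = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'

def download_snli(target_dir='snli_data'):
    """Download the SNLI 1.0 archive and extract it into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    archive_path = os.path.join(target_dir, 'snli_1.0.zip')
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(SNLI_URL, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        # extraction creates target_dir/snli_1.0/ containing the .jsonl splits
        zf.extractall(target_dir)
```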

Brief Description (from website):

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.

Example Datapoints:

| Text | Judgments | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping |
| An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor. |
| A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road. |
| A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport. |
| A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella. |
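The Judgments column shows the five annotator labels for the validated examples; the gold label is their majority vote, and '-' when no label wins a majority (the reader script below skips those pairs). A purely illustrative sketch of that rule:

```python
from collections import Counter

def majority_gold_label(annotator_labels):
    """Return the majority label, or '-' if no label has a strict majority."""
    label, count = Counter(annotator_labels).most_common(1)[0]
    return label if count > len(annotator_labels) / 2 else '-'

print(majority_gold_label(['neutral', 'neutral', 'entailment', 'contradiction', 'neutral']))    # neutral
print(majority_gold_label(['neutral', 'neutral', 'entailment', 'entailment', 'contradiction'])) # -
```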

My Script for reading:

import json
import os
import re

class SnliReader:
	"""Reader for the SNLI dataset
	More details can be found here : https://nlp.stanford.edu/projects/snli/

	Each data point contains two sentences and their gold label ('contradiction', 'entailment' or 'neutral').
	Additionally, each data point carries the individual annotator labels; the reader returns these as well, but they can usually be ignored.

	Example datapoint:
	gold_label	sentence1_binary_parse	sentence2_binary_parse	sentence1_parse	sentence2_parse	sentence1	sentence2	captionID	pairID	label1	label2	label3	label4	label5
	neutral	( ( Two women ) ( ( are ( embracing ( while ( holding ( to ( go packages ) ) ) ) ) ) . ) )	( ( The sisters ) ( ( are ( ( hugging goodbye ) ( while ( holding ( to ( ( go packages ) ( after ( just ( eating lunch ) ) ) ) ) ) ) ) ) . ) )	(ROOT (S (NP (CD Two) (NNS women)) (VP (VBP are) (VP (VBG embracing) (SBAR (IN while) (S (NP (VBG holding)) (VP (TO to) (VP (VB go) (NP (NNS packages)))))))) (. .)))	(ROOT (S (NP (DT The) (NNS sisters)) (VP (VBP are) (VP (VBG hugging) (NP (UH goodbye)) (PP (IN while) (S (VP (VBG holding) (S (VP (TO to) (VP (VB go) (NP (NNS packages)) (PP (IN after) (S (ADVP (RB just)) (VP (VBG eating) (NP (NN lunch))))))))))))) (. .)))	Two women are embracing while holding to go packages.	The sisters are hugging goodbye while holding to go packages after just eating lunch.	4705552913.jpg#2	4705552913.jpg#2r1n	neutral	entailment	neutral	neutral	neutral

	Parameters
	----------
	filepath : str
		Path to the folder containing the unzipped SNLI .jsonl files

	"""
	
	def __init__(self, filepath):
		self.filepath = filepath
		self.filename = {}
		self.filename['train'] = 'snli_1.0_train.jsonl'
		self.filename['dev'] = 'snli_1.0_dev.jsonl'
		self.filename['test'] = 'snli_1.0_test.jsonl'
		self.label2index = {'contradiction': 0, 'entailment': 1, 'neutral': 2}

	def get_data(self, split):
		"""Returnd the data for the given split

		Parameters
		----------
		split : {'train', 'test', 'dev'}
			The split of the data

		Returns
		-------
		sentA_datalist, sentB_datalist, labels, annotator_labels
		"""
		x1, x2, labels, annotator_labels = [], [], [], []
		with open(os.path.join(self.filepath, self.filename[split]), 'r') as f:
			for line in f:
				line = json.loads(line)
				if line['gold_label'] == '-':
					# A '-' gold label means the annotators reached no majority; skip the whole datapoint
					continue
				x1.append(self._preprocess(line['sentence1']))
				x2.append(self._preprocess(line['sentence2']))
				labels.append(self.label2index[line['gold_label']])
				
				annotator_labels.append(line['annotator_labels'])
		return x1, x2, labels, annotator_labels

	def _preprocess(self, sent):
		"""lower, strip and split the string and remove unnecessaey characters

		Parameters
		----------
		sent : str
			The sentence to be preprocessed
		"""
		return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

	def get_label2index(self):
		"""Returns the label2index dict"""
		return self.label2index
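A quick usage sketch (assuming the unzipped snli_1.0 folder, i.e. the directory holding the .jsonl files, is passed as filepath):

```python
reader = SnliReader('snli_1.0')
x1, x2, labels, annotator_labels = reader.get_data('train')

print(len(x1))        # number of training pairs
print(x1[0], x2[0])   # first tokenized premise / hypothesis
print(labels[0])      # integer label; mapping available via reader.get_label2index()
# If one-hot labels are needed for a Keras model, to_categorical(labels)
# from keras.utils.np_utils can be applied at this point.
```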
@aneesh-joshi aneesh-joshi changed the title SNLI corpus SNLI(Stanford Natural Language Inference) corpus Aug 13, 2018
@adikolsur

@aneesh-joshi Can you please explain a bit about how to use this script and its functioning?

@aneesh-joshi
Author

Hi @adikolsur,
Could you mention which parts are unclear?
Have you taken a look at the comments and the links (to the paper and website)?

@FarhatAbdullah

Hi, can someone guide me on how I can find an Urdu corpus in this dataset?
