DistilBERTurk training for question answering failed #25
Comments
Hi @ekandemir, thanks for your interest and for using the distilled version 🤗 Could you specify the exact Transformers version that you use for fine-tuning 🤔 I'm currently using:

python run_qa.py \
--model_name_or_path dbmdz/distilbert-base-turkish-cased \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/

with Transformers 4.3.0.dev0 (latest development version). (Yes, it is not a Turkish QA dataset, but fine-tuning is running.) Could you also paste the exact training command that you use?
Oh, I just saw that you're using the legacy script. Is there any chance that you could use the new run_qa.py script? I would be very interested in the Turkish QA dataset that you're using. If it's not available in the awesome Hugging Face datasets library, then we could maybe integrate it 🤗
Thanks for the quick answer. I've been trying to run the new script, but due to my Windows machine and network restrictions I couldn't get "datasets" to run well. It also didn't work with the Turkish SQuAD and my customized local dataset files.
But I get the same error. The Turkish QA dataset is available on TQuad, and there are some example BERT models on Hugging Face fine-tuned with this dataset. Thanks again.
Hi @ekandemir, after some debugging I can confirm that there's something strange with the configuration of my distilbert model. The root cause is an option in the model configuration; I will remove this option from the config, and then evaluation should be fine (I checked it locally), so you should be able to use the old QA script. I've also written a new Hugging Face datasets recipe for TSQuAD, which I will integrate into the datasets library soon. I will report back here once I have changed the model configuration, @ekandemir! (Thanks also to @sgugger for providing more information about that issue 🤗)
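For anyone hitting this before the hub config is updated, here is a minimal sketch of a possible local workaround, assuming the offending flag is known. The option name output_hidden_states below is only an illustrative guess and is not confirmed by this thread:

# Hedged sketch of a local workaround (not the confirmed fix from this thread):
# override a suspect configuration flag when loading the model instead of waiting
# for the hub config to be updated. `output_hidden_states` is an assumed example;
# the thread does not name the offending option.
from transformers import AutoConfig, AutoModelForQuestionAnswering

config = AutoConfig.from_pretrained("dbmdz/distilbert-base-turkish-cased")
config.output_hidden_states = False  # assumed option name, replace with the real one
model = AutoModelForQuestionAnswering.from_pretrained(
    "dbmdz/distilbert-base-turkish-cased",
    config=config,
)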
On our side, we'll work on fixing the scripts so the error does not appear if that option is set.
Changing the config file solved the problem when training from the main model. Thanks for your help.
Hi @ekandemir, great to hear that it works with the old script. Here's a first draft of a Hugging Face Datasets recipe. Just create a folder named squad_tr with the following script:

from __future__ import absolute_import, division, print_function
import json

import datasets


# BibTeX citation
_CITATION = """
"""

_DESCRIPTION = """\
TSQuAD
"""

_URL = "https://raw.githubusercontent.com/TQuad/turkish-nlp-qa-dataset/master/"
_URLS = {
    "train": _URL + "train-v0.1.json",
    "dev": _URL + "dev-v0.1.json",
}


class SquadTrConfig(datasets.BuilderConfig):
    """BuilderConfig for TSQuAD."""

    def __init__(self, **kwargs):
        """BuilderConfig for TSQuAD.

        Args:
            **kwargs: keyword arguments forwarded to super.
        """
        super(SquadTrConfig, self).__init__(**kwargs)


class SquadTr(datasets.GeneratorBasedBuilder):
    """TSQuAD dataset."""

    VERSION = datasets.Version("0.1.0")

    BUILDER_CONFIGS = [
        SquadTrConfig(
            name="v1.1.0",
            version=datasets.Version("1.0.0", ""),
            description="Plain text Turkish squad version 1",
        ),
    ]

    def _info(self):
        # Specifies the datasets.DatasetInfo object
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # datasets.features.FeatureConnectors
            features=datasets.Features(
                {
                    # These are the features of your dataset like images, labels ...
                    "id": datasets.Value("string"),
                    "title": datasets.Value("string"),
                    "context": datasets.Value("string"),
                    "question": datasets.Value("string"),
                    "answers": datasets.features.Sequence(
                        {
                            "text": datasets.Value("string"),
                            "answer_start": datasets.Value("int32"),
                        }
                    ),
                }
            ),
            # If there's a common (input, target) tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=None,
            # Homepage of the dataset for documentation
            homepage="https://github.com/TQuad/turkish-nlp-qa-dataset",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # Downloads the data and defines the splits.
        # dl_manager is a datasets.download.DownloadManager that can be used to
        # download and extract URLs.
        dl_dir = dl_manager.download_and_extract(_URLS)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": dl_dir["dev"]},
            ),
        ]

    def _generate_examples(self, filepath):
        """Yields examples."""
        # Yields (key, example) tuples from the dataset
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for example in data["data"]:
                title = example.get("title", "").strip()
                for paragraph in example["paragraphs"]:
                    context = paragraph["context"].strip()
                    for qa in paragraph["qas"]:
                        question = qa["question"].strip()
                        id_ = str(qa["id"])

                        answer_starts = [answer["answer_start"] for answer in qa["answers"]]
                        answers = [answer["text"].strip() for answer in qa["answers"]]

                        yield id_, {
                            "title": title,
                            "context": context,
                            "question": question,
                            "id": id_,
                            "answers": {
                                "answer_start": answer_starts,
                                "text": answers,
                            },
                        }

Then you can use the shiny new run_qa.py script:

$ python3 run_qa.py \
--model_name_or_path dbmdz/distilbert-base-turkish-cased \
--dataset_name ./squad_tr \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./output-squad-tr

You may ask for good baseline comparisons. I recently found a great paper from @xplip and @JoPfeiff: "How Good is Your Tokenizer?" that also uses this QA dataset with the "normal" BERTurk model.
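As a quick sanity check (my addition, assuming the recipe above is saved as squad_tr/squad_tr.py, which is the layout the datasets library expects for a local loading script), the recipe can be loaded directly before launching training:

# Optional sanity check: load the local recipe and inspect one example.
from datasets import load_dataset

tsquad = load_dataset("./squad_tr")
print(tsquad)                          # should list the train and validation splits
print(tsquad["train"][0]["question"])  # look at a single question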
cheaters gonna cheat :D typical :D
Hey, I tried to train the DistilBERTurk model for question answering using the run_squad.py script. After training, I got the following error during the evaluation stage:
When I tried to discard the last value with "start_logits, end_logits, _ = output", the error became:
I checked the model with samples from the dataset and the confidence levels were really low, mostly below 0.001, so I assume the training didn't go right either.
I tried to train the original DistilBERT with the same script and the same dataset, and it trained without errors and the confidence levels were high.
I compared the layers, but both models looked the same. I also tried to load the model as a QA model and save it again, but the error occurred again.
Thank you so much.
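For illustration, here is a minimal sketch (my own assumption, not taken from the original report) of the kind of fixed-size unpacking that breaks when a model configuration switches on extra outputs such as hidden states, and a more robust alternative:

# Illustrative sketch only: if the config enables extra outputs, the model returns
# more values than `start_logits, end_logits, _ = output` expects. The question and
# context below are toy examples.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "dbmdz/distilbert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

inputs = tokenizer(
    "Türkiye'nin başkenti neresidir?",
    "Ankara, Türkiye'nin başkentidir.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# Taking only the first two entries is robust no matter how many extra outputs
# (hidden states, attentions, ...) the configuration adds.
start_logits, end_logits = outputs[:2]
print(start_logits.shape, end_logits.shape)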