UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte #1

Open
hackhaye opened this issue Jun 7, 2024 · 1 comment

Comments

@hackhaye

hackhaye commented Jun 7, 2024

How did you resolve the "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte" raised by the code below?

def build_vocabulary(spacy_de, spacy_en):
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    def tokenize_en(text):
        return tokenize(text, spacy_en)

    print("Building German Vocabulary ...")

    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

    train = datasets.Multi30k(root='.data', split='train', language_pair=('de', 'en'))
    val = datasets.Multi30k(root='.data', split='valid', language_pair=('de', 'en'))
    test = datasets.Multi30k(root='.data', split='test', language_pair=('de', 'en'))
    vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    print("Building English Vocabulary ...")

    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

    train = datasets.Multi30k(root='.data', split='train', language_pair=('de', 'en'))
    val = datasets.Multi30k(root='.data', split='valid', language_pair=('de', 'en'))
    test = datasets.Multi30k(root='.data', split='test', language_pair=('de', 'en'))
    vocab_tgt = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_en, index=1),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    vocab_src.set_default_index(vocab_src["<unk>"])
    vocab_tgt.set_default_index(vocab_tgt["<unk>"])

    return vocab_src, vocab_tgt

def load_vocab(spacy_de, spacy_en):
    if not exists("vocab.pt"):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), "vocab.pt")
    else:
        vocab_src, vocab_tgt = torch.load("vocab.pt")
    print("Finished.\nVocabulary sizes:")
    print(len(vocab_src))
    print(len(vocab_tgt))
    return vocab_src, vocab_tgt

if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])
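
For anyone trying to narrow this down, here is a diagnostic sketch, not a confirmed fix. The error means that something being read as UTF-8 text has the byte 0x80 at offset 37, and 0x80 can never start a UTF-8 character, so the data is probably not plain text. The helper below only reports which files under a directory fail to decode and where. The name find_non_utf8_files is hypothetical, and it assumes the downloaded dataset files sit under the `.data` root used in the code above.

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return (path, offset, byte) for each file under root that is not valid UTF-8.

    Hypothetical helper for diagnosis only; it does not repair anything.
    """
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that could not be decoded
            problems.append((path, err.start, data[err.start]))
    return problems


if __name__ == "__main__":
    # ".data" is an assumption: the root directory passed to datasets.Multi30k above
    for path, offset, byte in find_non_utf8_files(".data"):
        print(f"{path}: byte 0x{byte:02x} at position {offset}")
```

If a file is reported with byte 0x80 at position 37, that file is the one being opened as text; checking whether it is a compressed archive or an incomplete download would be a reasonable next step.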

@mcxiaoxiao
Owner

Ah, I'm very sorry, I'm stuck at this step too 😨. If anyone with more experience manages to solve it, a PR would be very welcome.
