Crash in Bootleg during request #174

Open
gcampax opened this issue Jul 14, 2021 · 6 comments
Labels
bug (Something isn't working) · P1 (We're working on it right now) · server (Issues with serving and dynamic inference-time)

Comments

@gcampax
Contributor

gcampax commented Jul 14, 2021

The following command reliably triggers a crash in Bootleg with the currently deployed model on staging: "find me a movie with chris pratt".

[E 210714 17:56:44 web:1789] Uncaught exception POST /v1/models/x40org-thingpedia-models-defaultx2fen:predict (127.0.0.1)
    HTTPServerRequest(protocol='http', host='x40org-thingpedia-models-defaultx2fen-predictor-default.staging.svc.cluster.local', method='POST', uri='/v1/models/x40org-thingpedia-models-defaultx2fen:predict', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/usr/local/lib64/python3.8/site-packages/tornado/web.py", line 1704, in _execute
        result = await result
      File "/usr/local/lib/python3.8/site-packages/kfserving/handlers/http.py", line 79, in post
        response = (await model.predict(request)) if inspect.iscoroutinefunction(model.predict) else model.predict(request)
      File "/opt/genienlp/genienlp/kfserver.py", line 55, in predict
        results = self.server.handle_request(request)
      File "/opt/genienlp/genienlp/server.py", line 142, in handle_request
        output = generate_with_model(
      File "/opt/genienlp/genienlp/validate.py", line 60, in generate_with_model
        return generate_with_seq2seq_model(
      File "/opt/genienlp/genienlp/validate.py", line 124, in generate_with_seq2seq_model
        generated = model.generate(
      File "/opt/genienlp/genienlp/models/transformer_seq2seq.py", line 175, in generate
        generated = self.model.generate(
      File "/usr/local/lib64/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/transformers/generation_utils.py", line 970, in generate
        return self.greedy_search(
      File "/usr/local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1327, in greedy_search
        if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
    RuntimeError: CUDA error: device-side assert triggered
[E 210714 17:56:44 web:2239] 500 POST /v1/models/x40org-thingpedia-models-defaultx2fen:predict (127.0.0.1) 395.41ms
[E 210714 17:57:02 web:1789] Uncaught exception POST /v1/models/x40org-thingpedia-models-defaultx2fen:predict (127.0.0.1)
    HTTPServerRequest(protocol='http', host='x40org-thingpedia-models-defaultx2fen-predictor-default.staging.svc.cluster.local', method='POST', uri='/v1/models/x40org-thingpedia-models-defaultx2fen:predict', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/usr/local/lib64/python3.8/site-packages/tornado/web.py", line 1704, in _execute
        result = await result
      File "/usr/local/lib/python3.8/site-packages/kfserving/handlers/http.py", line 79, in post
        response = (await model.predict(request)) if inspect.iscoroutinefunction(model.predict) else model.predict(request)
      File "/opt/genienlp/genienlp/kfserver.py", line 55, in predict
        results = self.server.handle_request(request)
      File "/opt/genienlp/genienlp/server.py", line 109, in handle_request
        extract_features_with_annotator(examples, self.bootleg_annotator, self.args, task)
      File "/opt/genienlp/genienlp/data_utils/bootleg.py", line 96, in extract_features_with_annotator
        bootleg_labels = bootleg_annotator.label_mentions(bootleg_inputs)
      File "/usr/local/lib/python3.8/site-packages/bootleg/end2end/bootleg_annotator.py", line 551, in label_mentions
        batch_example_aliases_locs_start = torch.tensor(
    RuntimeError: CUDA error: device-side assert triggered
@gcampax gcampax added the bug (Something isn't working) label Jul 14, 2021
@gcampax
Contributor Author

gcampax commented Jul 14, 2021

Actually, after the first crash, any command now causes a crash. I assume this is because the CUDA error was not recovered from correctly.

@gcampax
Contributor Author

gcampax commented Jul 14, 2021

Yeah, the error doesn't seem to be Bootleg-related. There is this warning, though:

Token indices sequence length is longer than the specified maximum sequence length for this model (1461 > 1024). Running this sequence through the model will result in indexing errors

What's going on here?

@gcampax
Contributor Author

gcampax commented Jul 14, 2021

This is quite interesting: we pass truncation=True when we call Tokenizer.batch_encode_plus, which should truncate the sequence to the model's maximum length (1024). Why doesn't that happen?
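
For reference, a minimal sketch of the kind of call in question (the model name and exact arguments are assumptions for illustration, not the actual genienlp call site):

    # Sketch only: assumes a Hugging Face tokenizer whose model_max_length is 1024
    # (e.g. BART); this is not the actual genienlp code path.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

    long_input = "find me a movie with chris pratt " * 300  # well over 1024 tokens

    encoded = tokenizer.batch_encode_plus(
        [long_input],
        truncation=True,                        # expected to cap the sequence
        max_length=tokenizer.model_max_length,  # at the model maximum (1024)
    )
    print(len(encoded["input_ids"][0]))  # expected to print at most 1024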

@Mehrad0711
Member

Mehrad0711 commented Jul 14, 2021

Truncation is used only for the token classification task (where input words and labels need to be aligned), but not for the general encoding, which happens in the encode_batch method.
I think we should raise an error if any input exceeds the model's maximum length instead of truncating. This forces the user to inspect their input and make sure it's not a dataset bug (a missing end of line, etc.). If their task truly needs to handle long sequences, e.g. document classification or QA with long history, they can add a new task with task-specific preprocessing (similar to what I did for the ambigqa task).
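
A minimal sketch of the kind of check I have in mind (the helper name and message are illustrative, not actual genienlp code):

    # Illustrative only: raise instead of silently truncating over-length inputs.
    def check_input_length(token_ids, model_max_length):
        if len(token_ids) > model_max_length:
            raise ValueError(
                f"Input is {len(token_ids)} tokens, exceeding the model maximum of "
                f"{model_max_length}. Inspect the input for dataset bugs (e.g. a missing "
                f"end of line), or add a task with preprocessing for long inputs."
            )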

@Mehrad0711
Member

Mehrad0711 commented Jul 14, 2021

Alternatively, we can make truncation optional and add a flag for it, so the user can decide what to do. That said, I prefer the first approach, to avoid silent bugs.
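
For concreteness, such a flag could look roughly like this (the flag name is hypothetical; genienlp's actual argument parsing may differ):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--truncate_long_inputs",
        action="store_true",
        help="Truncate inputs longer than the model maximum length instead of raising an error.",
    )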

@gcampax
Contributor Author

gcampax commented Jul 14, 2021

I agree that needing truncation indicates a bug, but the current failure mode takes down the whole server until it is manually restarted. Raising an error is fine if we catch it in the server code and report it to the API caller correctly (not as a 500 error). Otherwise, logging a warning and truncating is better than nothing.
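
Roughly what I have in mind, as a sketch (the class, method, and error handling are illustrative, not the actual genienlp server code):

    # Illustrative sketch: catch the over-length error in the request handler and
    # return a structured client error instead of letting an uncaught exception
    # surface as a 500 and take down the worker.
    class RequestHandler:
        def __init__(self, generate_fn):
            self.generate_fn = generate_fn

        def handle_request(self, request):
            try:
                return {"predictions": self.generate_fn(request)}
            except ValueError as err:  # e.g. the over-length input error above
                return {"error": str(err), "code": 400}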

@gcampax gcampax added the server (Issues with serving and dynamic inference-time) label Jul 15, 2021
@nrser nrser added the P1 (We're working on it right now) label Aug 4, 2021