CUDA out of memory when computing bleurt on GEM submission #87

Open
lewtun opened this issue Mar 16, 2022 · 0 comments
lewtun commented Mar 16, 2022

Hello, I'm trying to compute the bleurt metric on a sample submission for the GEM benchmark (attached). However, running the following command throws a Blas GEMM launch failed error:

gem_metrics sample-submission.json --metric-list bleurt -o metrics.heavy.json
Stack trace
[W 220316 15:40:50 texts:191] Model parameter count not present in the submission file.
[I 220316 15:40:50 texts:32] Loading predictions for SeqPlan/mlsum_de_validation
[I 220316 15:40:50 texts:32] Loading predictions for SeqPlan/mlsum_de_test
[I 220316 15:40:50 texts:32] Loading predictions for SeqPlan/mlsum_de_challenge_test_covid
[W 220316 15:40:50 data:54] /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_validation.json not found -- downloading https://huggingface.co/datasets/GEM/references/resolve/main/mlsum_de_validation.json. This may take a few minutes.
[W 220316 15:40:50 __init__:258] Could not format references for mlsum_de_validation: HTTP Error 404: Not Found
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/gem_metrics/__init__.py", line 251, in load_references
    dataset_file = ensure_download(
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/gem_metrics/data.py", line 76, in ensure_download
    urllib.request.urlretrieve(
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
[W 220316 15:40:50 data:54] /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_validation.json not found -- downloading https://huggingface.co/datasets/GEM/references/resolve/main/mlsum_de_validation.json. This may take a few minutes.
[I 220316 15:40:50 __init__:275] mlsum_de_validation does not have source associated.
[I 220316 15:40:50 texts:32] Loading references for /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_test.json
[I 220316 15:40:50 texts:32] Loading sources for /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_test.json
[I 220316 15:40:50 __init__:275] mlsum_de_test does not have source associated.
[I 220316 15:40:50 texts:32] Loading references for /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_challenge_test_covid.json
[I 220316 15:40:51 texts:32] Loading sources for /home/lewis/miniconda3/envs/gem-metrics/lib/python3.8/site-packages/data/references/mlsum_de_challenge_test_covid.json
[I 220316 15:40:51 __init__:275] mlsum_de_challenge_test_covid does not have source associated.
[I 220316 15:40:51 __init__:385] Found parent ID in mlsum_de_challenge_test_covid but no corresponding parent dataset
[I 220316 15:40:51 __init__:219] Computing metrics for mlsum_de_validation...
[I 220316 15:40:51 __init__:219] Computing metrics for mlsum_de_test...
[I 220316 15:40:51 __init__:219] Computing metrics for mlsum_de_challenge_test_covid...
[I 220316 15:40:51 __init__:152] Computing BLEURT for SeqPlan/mlsum_de_test...
[I 220316 15:40:51 __init__:152] Computing BLEURT for SeqPlan/mlsum_de_challenge_test_covid...
INFO:tensorflow:Reading checkpoint ../bleurt-base-128.
I0316 15:40:58.413195 140619271960384 score.py:161] Reading checkpoint ../bleurt-base-128.
INFO:tensorflow:Config file found, reading.
I0316 15:40:58.413323 140619271960384 checkpoint.py:92] Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
I0316 15:40:58.413443 140619271960384 checkpoint.py:96] Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
I0316 15:40:58.413485 140619271960384 checkpoint.py:98] Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
I0316 15:40:58.413520 140619271960384 checkpoint.py:102] ... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
I0316 15:40:58.413564 140619271960384 checkpoint.py:102] ... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
I0316 15:40:58.413612 140619271960384 checkpoint.py:102] ... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
I0316 15:40:58.413659 140619271960384 checkpoint.py:102] ... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
I0316 15:40:58.413696 140619271960384 checkpoint.py:102] ... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
I0316 15:40:58.413734 140619271960384 score.py:168] Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
I0316 15:40:58.413768 140619271960384 tokenizers.py:40] Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
I0316 15:40:58.478093 140619271960384 tokenizers.py:45] WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
I0316 15:40:58.478170 140619271960384 score.py:57] Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
I0316 15:40:58.478209 140619271960384 score.py:62] Loading model.
2022-03-16 15:40:58.843356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-03-16 15:40:58.882447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:58.882741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-16 15:40:58.882956: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:40:58.884625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:40:58.886231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-16 15:40:58.886503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-16 15:40:58.887874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-16 15:40:58.888352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-16 15:40:58.890600: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-16 15:40:58.890693: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:58.890939: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:58.891113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-03-16 15:40:58.891322: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-03-16 15:40:58.896174: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3792765000 Hz
2022-03-16 15:40:58.897041: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe2d0000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-16 15:40:58.897061: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-03-16 15:40:59.000449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.000935: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6f9fd30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-03-16 15:40:59.000954: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA TITAN RTX, Compute Capability 7.5
2022-03-16 15:40:59.001150: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.001417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-16 15:40:59.001450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:40:59.001463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:40:59.001478: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-16 15:40:59.001490: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-16 15:40:59.001500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-16 15:40:59.001513: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-16 15:40:59.001524: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-16 15:40:59.001609: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.001901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.002133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-03-16 15:40:59.002165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:40:59.002925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-16 15:40:59.002938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2022-03-16 15:40:59.002944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2022-03-16 15:40:59.003061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.003356: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:40:59.003612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22611 MB memory) -> physical GPU (device: 0, name: NVIDIA TITAN RTX, pci bus id: 0000:08:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0316 15:40:59.450889 140619271960384 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:BLEURT initialized.
I0316 15:41:00.563959 140619271960384 score.py:174] BLEURT initialized.
INFO:tensorflow:Computing BLEURT scores...
I0316 15:41:00.564104 140619271960384 score_files.py:133] Computing BLEURT scores...
INFO:tensorflow:Reading checkpoint ../bleurt-base-128.
I0316 15:41:00.649482 139858446710592 score.py:161] Reading checkpoint ../bleurt-base-128.
INFO:tensorflow:Config file found, reading.
I0316 15:41:00.649625 139858446710592 checkpoint.py:92] Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
I0316 15:41:00.649743 139858446710592 checkpoint.py:96] Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
I0316 15:41:00.649785 139858446710592 checkpoint.py:98] Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
I0316 15:41:00.649821 139858446710592 checkpoint.py:102] ... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
I0316 15:41:00.649855 139858446710592 checkpoint.py:102] ... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
I0316 15:41:00.649900 139858446710592 checkpoint.py:102] ... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
I0316 15:41:00.649946 139858446710592 checkpoint.py:102] ... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
I0316 15:41:00.649982 139858446710592 checkpoint.py:102] ... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
I0316 15:41:00.650019 139858446710592 score.py:168] Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
I0316 15:41:00.650053 139858446710592 tokenizers.py:40] Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
I0316 15:41:00.714641 139858446710592 tokenizers.py:45] WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
I0316 15:41:00.714712 139858446710592 score.py:57] Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
I0316 15:41:00.714751 139858446710592 score.py:62] Loading model.
2022-03-16 15:41:01.075217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-03-16 15:41:01.099614: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.099885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-16 15:41:01.100065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:41:01.101612: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:41:01.103159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-16 15:41:01.103421: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-16 15:41:01.104980: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-16 15:41:01.105824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-16 15:41:01.107973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-16 15:41:01.108065: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.108303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.108475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-03-16 15:41:01.108682: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-03-16 15:41:01.113914: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3792765000 Hz
2022-03-16 15:41:01.114642: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f31a4000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-16 15:41:01.114660: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-03-16 15:41:01.170013: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.170263: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x73c46a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-03-16 15:41:01.170284: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA TITAN RTX, Compute Capability 7.5
2022-03-16 15:41:01.170470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.170703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-16 15:41:01.170738: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:41:01.170756: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:41:01.170774: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-16 15:41:01.170790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-16 15:41:01.170805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-16 15:41:01.170816: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-16 15:41:01.170828: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-16 15:41:01.170897: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.171147: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.171339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-03-16 15:41:01.171368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-03-16 15:41:01.172099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-16 15:41:01.172110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2022-03-16 15:41:01.172116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2022-03-16 15:41:01.172219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.172492: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-16 15:41:01.172725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 923 MB memory) -> physical GPU (device: 0, name: NVIDIA TITAN RTX, pci bus id: 0000:08:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0316 15:41:01.458086 139858446710592 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:BLEURT initialized.
I0316 15:41:02.578015 139858446710592 score.py:174] BLEURT initialized.
INFO:tensorflow:Computing BLEURT scores...
I0316 15:41:02.578154 139858446710592 score_files.py:133] Computing BLEURT scores...
2022-03-16 15:41:10.040301: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:41:13.784464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-16 15:41:13.999250: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2022-03-16 15:41:14.004738: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2022-03-16 15:41:14.006644: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2022-03-16 15:41:14.011551: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2022-03-16 15:41:14.011578: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/bleurt/bleurt/score_files.py", line 168, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/app/bleurt/bleurt/score_files.py", line 164, in main
    score_files(sentence_pairs_generator, FLAGS.bleurt_checkpoint)
  File "/app/bleurt/bleurt/score_files.py", line 138, in score_files
    _consume_buffer()
  File "/app/bleurt/bleurt/score_files.py", line 128, in _consume_buffer
    batch_size=FLAGS.bleurt_batch_size)
  File "/app/bleurt/bleurt/score.py", line 215, in score
    predict_out = self._predictor.predict(tf_input)
  File "/app/bleurt/bleurt/score.py", line 71, in predict
    input_dict["segment_ids"]))["predictions"].numpy()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1605, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InternalError:  Blas GEMM launch failed : a.shape=(8192, 2), b.shape=(2, 768), m=8192, n=768, k=2
         [[node bert/embeddings/MatMul (defined at app/bleurt/bleurt/score.py:63) ]] [Op:__inference_pruned_6660]

Function call stack:
pruned

As far as I can tell, this error stems from CUDA running out of memory. I'm running on an NVIDIA TITAN RTX with 23.65 GiB of memory, so this is quite surprising. One possibility is that the submission file has very long inputs, but these predictions come from one of the baseline models and would presumably be similar for other GEM participants.
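
For what it's worth, the log above shows two BLEURT scorers being initialized: the first TensorFlow process reserves almost the whole card ("Created TensorFlow device ... with 22611 MB memory"), while the second only gets 923 MB, which could explain the cuBLAS failures even on a 24 GiB GPU. A possible workaround (untested; assumes the BLEURT scoring entry point can be patched to configure TensorFlow before any GPU op runs) would be to enable on-demand GPU memory growth so the first scorer does not pre-allocate everything:

# Hypothetical snippet, not part of gem_metrics: run at the very top of the
# BLEURT scoring entry point, before the model is loaded. Assumes TF 2.x.
import tensorflow as tf

for gpu in tf.config.experimental.list_physical_devices("GPU"):
    # Allocate GPU memory on demand instead of reserving nearly all of it
    # up front, leaving room for a second scorer process on the same card.
    tf.config.experimental.set_memory_growth(gpu, True)

If I understand the TensorFlow docs correctly, setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true in the container should have the same effect without code changes.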

For context, I installed the library following the README instructions for "heavy" metrics, plus some additional Docker configuration (logging in and installing the NVIDIA Container Toolkit).

cc @sebastianGehrmann @danieldeutsch

sample-submission.json.zip
