[QST] Dealing with executors that are stuck in the barrier #453
Comments
I am sorry your application isn't making progress. I'll chime in from the spark-rapids side, specifically around the shuffle (for now). I see that you have configured the UCX shuffle. Do you have RDMA-capable networking? Or do you have several GPUs in one box? If you don't have either of those, then we recommend the MULTITHREADED shuffle mode (the default). Note also that those configs only apply to the MULTITHREADED shuffle.
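For reference, a minimal sketch of what switching to the MULTITHREADED shuffle might look like from a PySpark session builder. The shim class name in spark.shuffle.manager depends on the Spark version you run (Spark 3.5.0 is assumed here), the thread counts are illustrative rather than recommendations, and shuffle-manager settings generally need to be set at application launch:

```python
from pyspark.sql import SparkSession

# Illustrative values only; spark.shuffle.manager must match the Spark build
# in use, and these settings must be in place before executors start.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.shuffle.manager",
            "com.nvidia.spark.rapids.spark350.RapidsShuffleManager")
    .config("spark.rapids.shuffle.mode", "MULTITHREADED")
    # The reader/writer thread settings only take effect in MULTITHREADED mode.
    .config("spark.rapids.shuffle.multiThreaded.writer.threads", "20")
    .config("spark.rapids.shuffle.multiThreaded.reader.threads", "20")
    .getOrCreate()
)
```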
I would start by removing the UCX shuffle configs. Also, if you do get another hang, a jstack of the executors is useful. You can access a stack trace in the Spark UI under the Executors tab (click on "Thread Dump").
Also, it looks like you may have pasted the successful executor log twice, as the logs look identical. It would be great to see the bad one when you get a chance.
Thanks for your reply. I have fixed the executor logs. I don't remember setting up RDMA, but the RDMA packages are installed and I can run ucx_perftest with the example from the docs:
I tried running the LinReg and KMeans examples again with an updated configuration in which I commented out most of the shuffle-related settings. Here is the new configuration, without the configs for the history server, Python/Java, and timeout values. New Config
The KMeans example does not have any killed executors, but 4 executors, all on the same node, hang. Also, there are two pending stages: stage 7 with 192 tasks and stage 8 with 1 task, both with the same description (javaToPython at NativeMethodAccessorImpl.java:0).
The status of stage 9 (the barrier stage) is "(4 + 4) / 8" on the CLI, but "3/8 (5 running)" on the Spark UI. It is the same for linear regression, with stage 7 and stage 8 as pending stages and stage 9 as an active stage showing "(6 + 2) / 8". The executors were not killed, so there is no thread dump. Interestingly enough, the executors that hang always have the same indices on the Spark UI (indices 2 and 3). Executor with SUCCESS state
Executor with RUNNING state
I thought the executors were not being killed because UCX was not in use, but when I switched back to UCX, the 2 executors were still not killed. I am not sure if that is because I updated the RAPIDS packages on both servers before attempting to run again. For linear regression, the two executors now fail on the other node, which did not happen before, and that node has its log4j level set to TRACE, so there is more information. Executor with SUCCESS state
Executor with RUNNING state
For some reason, the cuML context is only initialized on the running/killed executors.
Hmm, so the examples worked after increasing the size of the dataset, even with UCX. I'm not sure why my application, which uses a large dataset, did not work then. It might be a different issue.
Thanks for the additional updates, and glad there are some signs of it working. Looking back at your previous executor logs, it is actually the executors without the cuML initialization in their logs that look suspicious. When you get back to this, please also share the worker(s) and master startup configs. It looks like you are running Spark standalone mode.
Sorry for the late reply. Yes, I am using Spark Standalone mode. I have the following "spark-env.sh" on each node: spark-env.sh
The initial configuration for each node is almost identical to my app's spark.conf: spark-defaults.sh
I am not sure if it is intended behaviour to hang when the dataset is too small to reach every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, this happens if I use the LinReg example as-is from the Python README.md with df as

For my applications, I noticed that I can get them to work now when using "multi-threaded mode" for the shuffle manager instead of UCX. Also, I am not sure if this is related, but I am getting

Anyway, I noticed that when I run my PCA application with a small dataset (8.1KB on
Currently the expected/intended behavior with empty partitions during training is for the tasks receiving no data to raise exceptions which should fail the whole barrier stage doing the training. It is strange, and would be a bug, that this doesn't seem to be happening in your case. I'm not able to reproduce this for some reason. I'm also not able to reproduce barrier tasks not logging anything spark-rapids-ml related before exiting. This would mean that spark-rapids-ml udf is not being invoked at all for the partition for that task. I've attempted to test if Spark might have an optimization that avoids invoking mapInPandas on partitions with no data, but so far I'm not able to trigger this on some toy examples, if it is even the case.
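For what it's worth, the kind of toy probe I would use to check whether mapInPandas is invoked on empty partitions looks roughly like the following (my own sketch, not the project's test; the printed lines land in the executor logs rather than on the driver):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def tag_partition(batches):
    # Print as soon as the UDF is entered, so the executor log records the
    # call even when no Arrow batches arrive for this partition.
    print("mapInPandas UDF entered")
    rows_seen = 0
    for batch in batches:
        rows_seen += len(batch)
        yield batch
    print(f"mapInPandas UDF exiting, rows seen: {rows_seen}")

# 4 rows spread round-robin over 8 partitions leaves 4 partitions empty.
df = spark.range(4).repartition(8)
df.mapInPandas(tag_partition, schema=df.schema).count()
```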
The spark-rapids plugin recommends JDK 8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, it looks like you might be running into different JDK versions being invoked in different places within the application run.
This is actually the normal behavior. A status like "(0 + 8) / 8" means that all 8 barrier tasks are running, and it is during this time that the GPUs are carrying out the distributed computation and communicating directly with each other. If it is stuck at, say, "(6 + 2) / 8", then 6 barrier tasks exited for some reason without syncing and two are hanging. That is problematic and would be a bug, even with empty data partitions, as mentioned above.
The spark-rapids plugin is tested with JDK 8, 11, and 17. I think the issue here is likely that the Java the executors are seeing is different from the one the driver is seeing. Also make sure that the spark-rapids jar is exactly the same in all places.
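As a rough sanity check for mismatched Java installs (my own sketch, not an official tool): the driver JVM's version can be read through py4j, and the `java` binary each worker node resolves on its PATH can be probed from the Python workers. This is only indicative, since the executor JVM itself is launched via JAVA_HOME from spark-env.sh rather than from the worker's PATH:

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Java version of the JVM the driver is actually running in.
print("driver JVM:", sc._jvm.java.lang.System.getProperty("java.version"))

def path_java_version(_):
    # `java -version` prints to stderr; this reports whichever `java` the
    # worker node's PATH resolves to, as seen from the Python worker.
    out = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return [out.stderr.splitlines()[0]]

versions = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .mapPartitions(path_java_version)
      .distinct()
      .collect()
)
print("worker-node PATH java:", versions)
```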
Thanks for your replies. I downgraded my JDK to 17, since I compiled Spark RAPIDS with JDK 17 (it uses the Security Manager) and I was also facing an issue where I got "Found an unexpected context classloader" when using Scala Spark. After downgrading to JDK 17 and updating RAPIDS to 23.12.00a, I am no longer getting "java.io.InvalidClassException" on PySpark. For the original issue, my guess is that the problem comes from building Spark RAPIDS incorrectly? I tried using the JARs from

Full Error Log from launching spark-shell
@an-ys thanks for the report, we are looking into it (NVIDIA/spark-rapids#9498). It seems to be an issue with our Spark 3.5.0 shim, specific to spark-shell (the pyspark shell and spark-submit don't exhibit this behavior). Please note we don't recommend using 23.12 unless you are testing a cutting-edge feature, as it's not released yet; 23.10 isn't entirely released yet either. An option, if you want to try building on your own, is to set
@an-ys The original hang issue is due to a combination of empty partitions and an optimization in how the spark-rapids ETL plugin handles mapInPandas vs. baseline Spark (which is what we had used to test empty partitions). See NVIDIA/spark-rapids#9480. The spark-rapids version of mapInPandas does not execute the udf on empty partitions, so the spark-rapids-ml barrier is never entered for those tasks, leaving the other tasks (those with data) hanging. Note that even after that issue is resolved so that spark-rapids mapInPandas matches baseline Spark behavior on empty partitions, and the hang is thereby avoided, spark-rapids-ml would currently still raise an exception in the case of empty partitions.
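Until that is resolved, a possible user-side workaround (a hedged sketch, not an official recommendation) is to keep the number of barrier tasks and partitions no larger than the number of rows, so no task ever sees an empty partition. The snippet below assumes spark_rapids_ml exposes a num_workers parameter and uses a deliberately tiny, made-up dataset:

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Tiny toy dataset: fewer rows (4) than GPUs in the 2x4-GPU cluster discussed here.
train_df = spark.createDataFrame(
    [(0.0, [1.0, 2.0]), (1.0, [2.0, 1.0]), (2.0, [3.0, 5.0]), (3.0, [4.0, 3.0])],
    ["label", "features"],
)

workers = min(8, train_df.count())   # never ask for more barrier tasks than rows
lr = LinearRegression(num_workers=workers)   # num_workers assumed available

# Round-robin repartitioning keeps partitions non-empty as long as
# rows >= partitions, so every barrier task receives data.
model = lr.setFeaturesCol("features").setLabelCol("label").fit(
    train_df.repartition(workers)
)
```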
I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors in the examples always get stuck in the barrier when the cuML function is called (i.e., I get "(6 + 2) / 8", "(4 + 4) / 8", and "(5 + 3) / 8", where 2, 4, and 3 executors are stuck for LinReg, KMeans, and PCA respectively). I also tried running a KMeans application that deals with a large amount of data, so I do not think the problem is related to the small dataset.
I checked the logs for the executor that successfully ran the task and the executor that got stuck. The executor that got stuck initialized cuML. These logs are from running the LinReg example in the Python directory of this repo. The executors that are stuck have status RUNNING with locality NODE_LOCAL, while the successful executors have status SUCCESS with locality PROCESS_LOCAL.

I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan V GPUs.
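For context, the LinReg example being run boils down to roughly the following PySpark snippet (a paraphrased sketch from memory, not the exact script in the repo; the toy values are illustrative):

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Small toy dataset with an array-typed features column, as in the Python README.
df = spark.createDataFrame(
    [(6.5, [1.0, 2.0]), (3.5, [2.0, 1.0]), (7.5, [3.0, 2.0]), (4.5, [2.0, 3.0])],
    ["label", "features"],
)

lr = LinearRegression(regParam=0.001)
model = lr.setFeaturesCol("features").setLabelCol("label").fit(df)
model.transform(df).show()
```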
Successful Executor
Killed Executor
Here is the spark.conf containing the related options. I tried to disable the options related to UDFs (Scala UDF, UDF compiler, etc.), but it did not do much.

spark.conf