
[VL] HashShuffleWriter OOM when the schema contains a column of complex type in large scale job #8088

Open
kecookier opened this issue Nov 28, 2024 · 1 comment · May be fixed by #8089
Labels: bug, triage

Comments

@kecookier (Contributor)

Backend

VL (Velox)

Bug description

The ShuffleWriter.default_leaf pool (a velox::memory::MemoryPool) allocates too much memory in VeloxHashShuffleWriter, causing an off-heap OOM.

24/11/26 21:31:42 ERROR Executor task launch worker for task 1559 ManagedReservationListener: Error reserving memory from target
org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 

Current config settings: 
	spark.gluten.memory.offHeap.size.in.bytes=13690208256
	spark.gluten.memory.task.offHeap.size.in.bytes=6845104128
	spark.gluten.memory.conservative.task.offHeap.size.in.bytes=3422552064
	spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats: 
	Task.1559:                                             Current used bytes:   8.4 GiB, peak bytes:        N/A
	\- Gluten.Tree.0:                                      Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
	   \- root.0:                                          Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
	      +- ShuffleWriter.0:                              Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
	      |  \- single:                                    Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
	      |     +- root:                                   Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
	      |     |  \- default_leaf:                        Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
	      |     \- gluten::MemoryAllocator:                Current used bytes:  62.9 MiB, peak bytes: 1436.4 MiB
	      +- VeloxBatchAppender.0:                         Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
	      |  \- single:                                    Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
	      |     +- root:                                   Current used bytes: 100.2 MiB, peak bytes:  224.0 MiB
	      |     |  \- default_leaf:                        Current used bytes: 100.2 MiB, peak bytes:  216.8 MiB
	      |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- NativePlanEvaluator-1.0:                      Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
	      |  \- single:                                    Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
	      |     +- root:                                   Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
	      |     |  +- task.Gluten_Stage_2_TID_1559_VTID_0: Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
	      |     |  |  +- node.0:                           Current used bytes:  22.1 MiB, peak bytes:  168.0 MiB
	      |     |  |  |  +- op.0.0.0.TableScan:            Current used bytes:  22.1 MiB, peak bytes:  162.8 MiB
	      |     |  |  |  \- op.0.0.0.TableScan.test-hive:  Current used bytes:     0.0 B, peak bytes:      0.0 B
	      |     |  |  \- node.1:                           Current used bytes: 528.2 KiB, peak bytes: 1024.0 KiB
	      |     |  |     \- op.1.0.0.FilterProject:        Current used bytes: 528.2 KiB, peak bytes:  849.5 KiB
	      |     |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
	      |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- ArrowContextInstance.0:                       Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- VeloxBatchAppender.0.OverAcquire.0:           Current used bytes:     0.0 B, peak bytes:   67.2 MiB
	      +- IndicatorVectorBase#init.0.OverAcquire.0:     Current used bytes:     0.0 B, peak bytes:    2.4 MiB
	      +- NativePlanEvaluator-1.0.OverAcquire.0:        Current used bytes:     0.0 B, peak bytes:   52.8 MiB
	      +- ShuffleWriter.0.OverAcquire.0:                Current used bytes:     0.0 B, peak bytes:    2.6 GiB
	      \- IndicatorVectorBase#init.0:                   Current used bytes:     0.0 B, peak bytes:    8.0 MiB
	         \- single:                                    Current used bytes:     0.0 B, peak bytes:    8.0 MiB
	            +- root:                                   Current used bytes:     0.0 B, peak bytes:      0.0 B
	            |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
	            \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B


	at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:66)
	at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:49)
	at org.apache.gluten.vectorized.ShuffleWriterJniWrapper.write(Native Method)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:177)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:231)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.scheduler.Task.run(Task.scala:134)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:479)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:482)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Where is VeloxMemoryPool used in VeloxHashShuffleWriter?

When splitComplexType() is called, the vector is first serialized by PrestoVectorSerde and later flushed to the cache by evictPartitionBuffers(). The memory held by arenas_ is freed only after this flush.
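A minimal standalone sketch of this accumulation pattern (my own simplified types, not the actual Gluten/Velox API; splitComplexType, evictPartitionBuffers, and arenas_ are only modeled here):

```cpp
// Simplified model of how complex-type columns accumulate memory in the
// shuffle writer: each batch's complex columns are serialized into an
// arena-backed buffer, and that memory only becomes reclaimable when the
// partition buffers are evicted (flushed) to the payload cache.
#include <cstddef>
#include <iostream>

struct Arena {
  // Stands in for arenas_: serialized complex-type data lives here.
  size_t bytesHeld = 0;
  void append(size_t bytes) { bytesHeld += bytes; }
  void clear() { bytesHeld = 0; }  // only freed on eviction
};

struct ShuffleWriterModel {
  Arena arena;                    // models arenas_ for one partition
  size_t fixedWidthBuffered = 0;  // partition buffers for simple columns

  void splitComplexType(size_t serializedBytes) {
    // PrestoVectorSerde-style serialization appends into the arena;
    // nothing is released here.
    arena.append(serializedBytes);
  }

  void evictPartitionBuffers() {
    // Flushing to the payload cache is the only point where the arena
    // memory is released.
    std::cout << "evicting " << (arena.bytesHeld + fixedWidthBuffered)
              << " bytes\n";
    arena.clear();
    fixedWidthBuffered = 0;
  }
};

int main() {
  ShuffleWriterModel writer;
  // Hypothetical workload: 200 batches, each adding ~40 MiB of serialized
  // map<string,string> data, and no eviction in between.
  for (int batch = 0; batch < 200; ++batch) {
    writer.splitComplexType(40UL << 20);
  }
  std::cout << "arena holds " << (writer.arena.bytesHeld >> 20)
            << " MiB before eviction\n";  // grows toward the ~8 GiB seen above
  writer.evictPartitionBuffers();
  return 0;
}
```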

Why is so much memory used?

When doSplit is called, we estimate how many rows fit within the current task's available memory and then resize the last partition buffers accordingly. The estimate considers only simple columns and ignores complex-type columns, so the memory used by complex types is not accounted for. As we iterate batch by batch, we check whether the current estimated row count is much larger than the existing partition buffers; if so, we evict those buffers to the payloadCache, the cached payload is spilled later, and the memory is freed. If the complex-type vectors are large, this eviction is typically not triggered until the process has already run out of memory (OOM).
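A minimal sketch of why the trigger never fires (illustrative numbers and names, not Gluten's actual code): the bytes-per-row used for the estimate comes from simple columns only, so the serialized complex-type bytes sitting in arenas_ never shrink the estimate enough to request an eviction.

```cpp
// The estimate divides the available memory by the per-row cost of the
// simple columns only; the complex-type cost is invisible to it.
#include <cstdint>
#include <iostream>

int64_t estimateRowsToBuffer(
    int64_t availableMemory,       // task's remaining off-heap budget
    int64_t simpleBytesPerRow,     // int + string columns
    int64_t complexBytesPerRow) {  // map<string,string> columns -- ignored today
  (void)complexBytesPerRow;        // current behaviour: not part of the divisor
  return availableMemory / simpleBytesPerRow;
}

int main() {
  const int64_t available = 6845104128;  // task off-heap size from the log above
  // Hypothetical per-row sizes for the {int, string, map, map} schema:
  const int64_t simpleBytes = 64, complexBytes = 4096;
  int64_t estimated = estimateRowsToBuffer(available, simpleBytes, complexBytes);
  const int64_t existingBufferRows = 4096;  // default partition buffer size
  // The estimate stays far above the existing buffers, so no eviction is
  // requested even though arenas_ keeps growing with every batch.
  std::cout << "estimated rows: " << estimated << ", eviction triggered: "
            << (estimated < existingBufferRows ? "yes" : "no") << "\n";
  return 0;
}
```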

Possible Solutions

  1. The default partition buffer size is 4096. In our case, the schema is {int, string, map<string, string>, map<string, string>}. After iterating roughly 200+ batches, the process runs out of memory. Lowering this option to 200 lets the job succeed, but it is not a general solution.
  2. When estimating how many rows fit within the current task's available memory, also consider complex-type columns; the bytes already held by arenas_ can be used for this (see the sketch after this list).
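A hedged sketch of option 2 (all names are illustrative, not the actual Gluten symbols): fold the bytes currently held by arenas_ into the per-row cost, so the estimate drops as complex data accumulates and eviction fires before the task OOMs.

```cpp
// Derive an observed complex-type cost per row from the arena usage and add
// it to the simple-column cost before dividing the available memory.
#include <algorithm>
#include <cstdint>
#include <iostream>

int64_t estimateRowsWithComplex(
    int64_t availableMemory,
    int64_t simpleBytesPerRow,
    int64_t arenaBytes,          // what arenas_ currently holds
    int64_t rowsAlreadySplit) {  // rows that produced those arena bytes
  int64_t complexBytesPerRow =
      rowsAlreadySplit > 0 ? arenaBytes / rowsAlreadySplit : 0;
  int64_t bytesPerRow =
      std::max<int64_t>(1, simpleBytesPerRow + complexBytesPerRow);
  return availableMemory / bytesPerRow;
}

int main() {
  const int64_t available = 6845104128;  // task off-heap size from the log above
  // Illustrative numbers: ~8 GiB of arena data produced by ~800k rows.
  int64_t estimated =
      estimateRowsWithComplex(available, 64, 8LL << 30, 800'000);
  std::cout << "estimated rows with complex columns counted: " << estimated
            << "\n";  // orders of magnitude smaller, so eviction fires earlier
  return 0;
}
```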

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@kecookier added the bug, triage, oom, and velox backend labels and removed the oom label on Nov 28, 2024
@kecookier removed the velox backend label on Nov 29, 2024
@kecookier (Contributor, Author)

@FelixYBW @marin-ma #8089 may solve this problem, any suggestions?
