
[VL] HashShuffleWriter OOM when the schema contains a column of complex type in large scale job #8088

Open
kecookier opened this issue Nov 28, 2024 · 1 comment · May be fixed by #8089
Labels: bug, triage

Comments

@kecookier (Contributor)

Backend

VL (Velox)

Bug description

The ShuffleWriter.default_leaf pool (a velox::memory::MemoryPool) allocates too much memory in VeloxHashShuffleWriter, causing an off-heap OOM.

24/11/26 21:31:42 ERROR Executor task launch worker for task 1559 ManagedReservationListener: Error reserving memory from target
org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 

Current config settings: 
	spark.gluten.memory.offHeap.size.in.bytes=13690208256
	spark.gluten.memory.task.offHeap.size.in.bytes=6845104128
	spark.gluten.memory.conservative.task.offHeap.size.in.bytes=3422552064
	spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats: 
	Task.1559:                                             Current used bytes:   8.4 GiB, peak bytes:        N/A
	\- Gluten.Tree.0:                                      Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
	   \- root.0:                                          Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
	      +- ShuffleWriter.0:                              Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
	      |  \- single:                                    Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
	      |     +- root:                                   Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
	      |     |  \- default_leaf:                        Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
	      |     \- gluten::MemoryAllocator:                Current used bytes:  62.9 MiB, peak bytes: 1436.4 MiB
	      +- VeloxBatchAppender.0:                         Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
	      |  \- single:                                    Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
	      |     +- root:                                   Current used bytes: 100.2 MiB, peak bytes:  224.0 MiB
	      |     |  \- default_leaf:                        Current used bytes: 100.2 MiB, peak bytes:  216.8 MiB
	      |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- NativePlanEvaluator-1.0:                      Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
	      |  \- single:                                    Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
	      |     +- root:                                   Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
	      |     |  +- task.Gluten_Stage_2_TID_1559_VTID_0: Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
	      |     |  |  +- node.0:                           Current used bytes:  22.1 MiB, peak bytes:  168.0 MiB
	      |     |  |  |  +- op.0.0.0.TableScan:            Current used bytes:  22.1 MiB, peak bytes:  162.8 MiB
	      |     |  |  |  \- op.0.0.0.TableScan.test-hive:  Current used bytes:     0.0 B, peak bytes:      0.0 B
	      |     |  |  \- node.1:                           Current used bytes: 528.2 KiB, peak bytes: 1024.0 KiB
	      |     |  |     \- op.1.0.0.FilterProject:        Current used bytes: 528.2 KiB, peak bytes:  849.5 KiB
	      |     |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
	      |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- ArrowContextInstance.0:                       Current used bytes:     0.0 B, peak bytes:      0.0 B
	      +- VeloxBatchAppender.0.OverAcquire.0:           Current used bytes:     0.0 B, peak bytes:   67.2 MiB
	      +- IndicatorVectorBase#init.0.OverAcquire.0:     Current used bytes:     0.0 B, peak bytes:    2.4 MiB
	      +- NativePlanEvaluator-1.0.OverAcquire.0:        Current used bytes:     0.0 B, peak bytes:   52.8 MiB
	      +- ShuffleWriter.0.OverAcquire.0:                Current used bytes:     0.0 B, peak bytes:    2.6 GiB
	      \- IndicatorVectorBase#init.0:                   Current used bytes:     0.0 B, peak bytes:    8.0 MiB
	         \- single:                                    Current used bytes:     0.0 B, peak bytes:    8.0 MiB
	            +- root:                                   Current used bytes:     0.0 B, peak bytes:      0.0 B
	            |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
	            \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B


	at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:66)
	at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:49)
	at org.apache.gluten.vectorized.ShuffleWriterJniWrapper.write(Native Method)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:177)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:231)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.scheduler.Task.run(Task.scala:134)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:479)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:482)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Where is VeloxMemoryPool used in VeloxHashShuffleWriter?

When splitComplexType() is called, the vector is first serialized by PrestoVectorSerde and later flushed to the cache by evictPartitionBuffers(). The memory held by arenas_ is freed only after this flush.
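A minimal standalone sketch of this accumulation pattern (my own simplified types, not the actual Gluten/Velox API; splitComplexType, evictPartitionBuffers, and arenas_ are only modeled here):

```cpp
// Simplified model of how complex-type columns accumulate memory in the
// shuffle writer: each batch's complex columns are serialized into an
// arena-backed buffer, and that memory only becomes reclaimable when the
// partition buffers are evicted (flushed) to the payload cache.
#include <cstddef>
#include <iostream>

struct Arena {
  // Stands in for arenas_: serialized complex-type data lives here.
  size_t bytesHeld = 0;
  void append(size_t bytes) { bytesHeld += bytes; }
  void clear() { bytesHeld = 0; }  // only freed on eviction
};

struct ShuffleWriterModel {
  Arena arena;                    // models arenas_ for one partition
  size_t fixedWidthBuffered = 0;  // partition buffers for simple columns

  void splitComplexType(size_t serializedBytes) {
    // PrestoVectorSerde-style serialization appends into the arena;
    // nothing is released here.
    arena.append(serializedBytes);
  }

  void evictPartitionBuffers() {
    // Flushing to the payload cache is the only point where the arena
    // memory is released.
    std::cout << "evicting " << (arena.bytesHeld + fixedWidthBuffered)
              << " bytes\n";
    arena.clear();
    fixedWidthBuffered = 0;
  }
};

int main() {
  ShuffleWriterModel writer;
  // Hypothetical workload: 200 batches, each adding ~40 MiB of serialized
  // map<string,string> data, and no eviction in between.
  for (int batch = 0; batch < 200; ++batch) {
    writer.splitComplexType(40UL << 20);
  }
  std::cout << "arena holds " << (writer.arena.bytesHeld >> 20)
            << " MiB before eviction\n";  // grows toward the ~8 GiB seen above
  writer.evictPartitionBuffers();
  return 0;
}
```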

Why is so much memory used?

When doSplit is called, we estimate how many rows fit within the current task's available memory and then resize the last partition buffers accordingly. The estimate considers only simple columns and ignores complex-type columns, so the memory used by complex types is not accounted for. As we iterate batch by batch, we check whether the current estimated row count is much larger than the existing partition buffers; if so, we evict those buffers to the payloadCache, the cached payload is spilled later, and the memory is freed. If the complex-type vectors are large, this eviction is typically not triggered until the process has already run out of memory (OOM).
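A minimal sketch of why the trigger never fires (illustrative numbers and names, not Gluten's actual code): the bytes-per-row used for the estimate comes from simple columns only, so the serialized complex-type bytes sitting in arenas_ never shrink the estimate enough to request an eviction.

```cpp
// The estimate divides the available memory by the per-row cost of the
// simple columns only; the complex-type cost is invisible to it.
#include <cstdint>
#include <iostream>

int64_t estimateRowsToBuffer(
    int64_t availableMemory,       // task's remaining off-heap budget
    int64_t simpleBytesPerRow,     // int + string columns
    int64_t complexBytesPerRow) {  // map<string,string> columns -- ignored today
  (void)complexBytesPerRow;        // current behaviour: not part of the divisor
  return availableMemory / simpleBytesPerRow;
}

int main() {
  const int64_t available = 6845104128;  // task off-heap size from the log above
  // Hypothetical per-row sizes for the {int, string, map, map} schema:
  const int64_t simpleBytes = 64, complexBytes = 4096;
  int64_t estimated = estimateRowsToBuffer(available, simpleBytes, complexBytes);
  const int64_t existingBufferRows = 4096;  // default partition buffer size
  // The estimate stays far above the existing buffers, so no eviction is
  // requested even though arenas_ keeps growing with every batch.
  std::cout << "estimated rows: " << estimated << ", eviction triggered: "
            << (estimated < existingBufferRows ? "yes" : "no") << "\n";
  return 0;
}
```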

Possible Solutions

  1. The default partition buffer size is 4096. In our case, the schema is {int, string, map<string, string>, map<string, string>}. After iterating roughly 200+ batches, the process runs out of memory. Lowering this option to 200 lets the job succeed, but it is not a general solution.
  2. When estimating how many rows fit within the current task's available memory, also consider complex-type columns; the bytes already held by arenas_ can be used for this (see the sketch after this list).
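A hedged sketch of option 2 (all names are illustrative, not the actual Gluten symbols): fold the bytes currently held by arenas_ into the per-row cost, so the estimate drops as complex data accumulates and eviction fires before the task OOMs.

```cpp
// Derive an observed complex-type cost per row from the arena usage and add
// it to the simple-column cost before dividing the available memory.
#include <algorithm>
#include <cstdint>
#include <iostream>

int64_t estimateRowsWithComplex(
    int64_t availableMemory,
    int64_t simpleBytesPerRow,
    int64_t arenaBytes,          // what arenas_ currently holds
    int64_t rowsAlreadySplit) {  // rows that produced those arena bytes
  int64_t complexBytesPerRow =
      rowsAlreadySplit > 0 ? arenaBytes / rowsAlreadySplit : 0;
  int64_t bytesPerRow =
      std::max<int64_t>(1, simpleBytesPerRow + complexBytesPerRow);
  return availableMemory / bytesPerRow;
}

int main() {
  const int64_t available = 6845104128;  // task off-heap size from the log above
  // Illustrative numbers: ~8 GiB of arena data produced by ~800k rows.
  int64_t estimated =
      estimateRowsWithComplex(available, 64, 8LL << 30, 800'000);
  std::cout << "estimated rows with complex columns counted: " << estimated
            << "\n";  // orders of magnitude smaller, so eviction fires earlier
  return 0;
}
```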

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@kecookier added the bug, triage, oom, and velox backend labels and removed the oom label on Nov 28, 2024
@kecookier removed the velox backend label on Nov 29, 2024
@kecookier (Contributor, Author)

@FelixYBW @marin-ma #8089 may solve this problem, any suggestions?
