
Inexplicable OOM in pika #2947

Open
banlilin opened this issue Nov 11, 2024 · 6 comments

Comments

@banlilin

Version: 3.3.6
Server configuration: 8 cores, 16 GB RAM, 1 TB SSD
Monitoring shows that memory usage is not very high, and the instance holds only about 100 GB of data.
[screenshot]
There are also about 100 clients.
[screenshot]
Server memory usage got close to 100% and pika was OOM-killed.
[screenshot]
[screenshot]
Monitoring shows the memory usage is not high, but the process's memory kept growing, which eventually led to the OOM.
I haven't found any clue so far.

@Mixficsol
Collaborator

Please confirm whether you are using tcmalloc; if so, periodic memory-release work is needed. Also check whether the Table Cache configuration item is set.
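
For reference, a minimal sketch of what such periodic cleanup can look like for a process linked against gperftools' tcmalloc (the interval and the loop wrapper here are illustrative assumptions, not Pika's actual code path):

```cpp
// Periodically return free pages held in tcmalloc's page heap to the OS.
// Without such a call, RSS can stay high even after objects are freed,
// which looks like a leak in process-level memory monitoring.
#include <gperftools/malloc_extension.h>

#include <chrono>
#include <thread>

void ReleaseFreeMemoryLoop() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));  // interval is illustrative
    MallocExtension::instance()->ReleaseFreeMemory();
  }
}
```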

@chenbt-hz
Collaborator

> Please confirm whether you are using tcmalloc; if so, periodic memory-release work is needed. Also check whether the Table Cache configuration item is set.

You can check whether it is the same problem as #2537 (comment).

@banlilin
Author

banlilin commented Nov 18, 2024

At the time of the OOM, the table reader memory usage was not high, only about 8 GB, QPS was under 100, and there were about 100 connections. The replica OOMed about 20 minutes earlier than the master, and the replica had no client connections in use.
max-cache-files uses the default value of 5000, and the whole instance has about 4000 SST files in total.
Most settings in the configuration file are the defaults:
[screenshot]

What I really want to understand is: apart from the 8+ GB used by the table readers, what else could be occupying memory and pushing the process above 15 GB until it OOMs? (A sketch of how to break that memory down follows the quoted replies below.)

> Please confirm whether you are using tcmalloc; if so, periodic memory-release work is needed. Also check whether the Table Cache configuration item is set.

> You can check whether it is the same problem as #2537 (comment).
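
As a rough way to attribute the remaining memory, here is a sketch that queries standard RocksDB memory properties directly from a DB handle (the property names are standard RocksDB ones; how Pika 3.3.6 surfaces them, if at all, is not assumed here):

```cpp
// Print the main RocksDB memory consumers: memtables, table readers
// (index/filter metadata), and block cache contents.
#include <rocksdb/db.h>

#include <iostream>
#include <string>

void DumpMemoryBreakdown(rocksdb::DB* db) {
  const char* props[] = {
      "rocksdb.cur-size-all-mem-tables",     // active + immutable memtables
      "rocksdb.estimate-table-readers-mem",  // index/filter blocks held by table readers
      "rocksdb.block-cache-usage",           // blocks currently in the block cache
      "rocksdb.block-cache-pinned-usage",    // entries pinned in the block cache
  };
  for (const char* p : props) {
    std::string value;
    if (db->GetProperty(p, &value)) {
      std::cout << p << " = " << value << " bytes\n";
    }
  }
}
```

Comparing the sum of these against the process RSS shows how much is left for allocator overhead, fragmentation, or other buffers.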

@cheniujh
Collaborator

cheniujh commented Nov 22, 2024

Hi!
Please run through the following checks to narrow this down:

  1. Consider whether the OOM is caused by the table cache; we have seen OOMs caused by the table cache before:
    1. Check whether the business keys themselves are large (the key size, not the value) and whether the keys share long common prefixes. If so, the index blocks are usually large, and the table cache (total table reader overhead) will also be large.
    2. Try adjusting the configuration: set cache-index-and-filter-blocks to yes so the table cache contents go into the block cache. The table cache overhead then shares the block cache quota and becomes more controllable; of course, the block cache here needs to be enlarged.
  2. In version 3.3.6 the default upper bound on total memtable size is 10 GB; please try lowering it.

In short: since both the master and the replica OOMed, I suspect background compaction keeps opening more files (stuffing more index and bloom filter blocks into the table cache). I suggest first putting a hard cap on the table cache (per point 1.2 above, by moving it into the block cache); a sketch of the corresponding RocksDB options is shown below.
In fact, with your 16 GB of memory, an 8 GB table cache overhead is already quite significant, and by default the table cache has no hard upper bound (only the max open files limit), so it can easily become the cause of an OOM.
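
To make the two suggestions concrete, here is a sketch of the equivalent settings expressed through the plain RocksDB C++ API (the cache size, write buffer size, and buffer count are illustrative numbers; the mapping onto Pika's pika.conf keys such as cache-index-and-filter-blocks should be checked against the 3.3.6 configuration):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options BuildBoundedMemoryOptions() {
  rocksdb::Options options;

  // Point 1.2: put index and filter blocks into the block cache so the
  // table-reader overhead is bounded by the block cache capacity instead
  // of growing with the number of open SST files.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(4LL << 30);  // 4 GB, illustrative
  table_options.cache_index_and_filter_blocks = true;
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;  // optional: keep L0 metadata resident
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  // Point 2: bound total memtable memory
  // (per-memtable size * number of memtables per column family).
  options.write_buffer_size = 256 << 20;   // 256 MB per memtable, illustrative
  options.max_write_buffer_number = 4;

  return options;
}
```

Capping index and filter blocks inside the block cache trades some extra block-cache churn for a predictable memory ceiling, which matters on a 16 GB host.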
