Depending on the workload, some customers are hitting problems with the OOM killer being activated and the system locking up for a rather long period of time. For example:

1) BZ #1853862 - [mm] System locked up for over an hour during memory reclaim
2) BZ #1857446 - ARO/Azure: excessive pod memory allocation causes node lockup

There are two different problems here:

1) The workload is consuming memory faster than the system can recover it through memory reclaim.
2) The system seems to lock up when the OOM killer is invoked to kill some tasks.

The first problem can be addressed if cgroup v2 is used by the customer, as the following upstream commit and its derivatives can help prevent processes from consuming memory too fast:

commit 0e4b01df865935007bd712cbc8e7299005b28894
Author: Chris Down <chris>
Date:   Mon, 23 Sep 2019 15:34:55 -0700

    mm, memcg: throttle allocators when failing reclaim over memory.high

However, cgroup v1 is still the default. Even though we are going to switch to cgroup v2 as the default soon, it will take time for customers to do the migration. For existing cgroup v1 customers, we need to bring the memory reclaim and OOM handling code up to a more recent code base (v5.6). Hopefully that will help to alleviate the problem.

For the second (lockup) problem, it is not clear what causes it. Dong Hai is investigating the core dump reported in BZ #1853862 to find some clues. If a reproducer can be found, it will help us locate the problem area more quickly.

More work still needs to be done.

-Longman
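For readers unfamiliar with the memory.high knob mentioned in that commit, here is a minimal sketch of how a cgroup v2 memory.high limit could be set up so allocations are throttled before the OOM killer is ever needed. The group name "demo" and the 512 MiB / 1 GiB values are illustrative assumptions, and the sketch assumes the unified (cgroup v2) hierarchy is mounted at /sys/fs/cgroup; it is not taken from this BZ.

    # Illustrative sketch only: group name and limits are assumptions.
    import os

    CGROUP_ROOT = "/sys/fs/cgroup"            # cgroup v2 unified hierarchy mount
    GROUP = os.path.join(CGROUP_ROOT, "demo") # hypothetical group for the workload

    os.makedirs(GROUP, exist_ok=True)

    # memory.high: usage above this is throttled (reclaim pressure), not killed.
    with open(os.path.join(GROUP, "memory.high"), "w") as f:
        f.write(str(512 * 1024 * 1024))

    # memory.max: hard ceiling; only past this does the OOM killer get involved.
    with open(os.path.join(GROUP, "memory.max"), "w") as f:
        f.write(str(1024 * 1024 * 1024))

    # Move the current process into the group so its allocations are accounted.
    with open(os.path.join(GROUP, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

The point of the upstream commit is the gap between the two limits: a runaway allocator gets slowed down at memory.high instead of racing straight to an OOM kill. cgroup v1 has no equivalent throttling, which is why v1 customers depend on the reclaim/OOM code improvements instead.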
Patch(es) available on kernel-4.18.0-240.6.el8.dt2
Patch(es) available on kernel-4.18.0-248.el8
*** Bug 1929122 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578
Can this bug please be made public, so that we can reference comment 0?
(In reply to Scott Dodson from comment #31)
> Can this bug please be made public, so that we can reference comment 0?

As there isn't anything confidential in this BZ, I don't think it is a problem to make it public. However, I don't know the proper steps to make it public.

-Longman
Longman,

Thanks. Mechanically, it should just be a matter of removing redhat from the "Groups:" field, which is just above comment 0 and to the far right. All public comments will become viewable when that changes. I don't see anything sensitive in any of the public comments.
I see. Thanks for the tip.