Bug 1873759

Summary: RHEL8.4: Update memory reclaim and OOM code base to v5.6
Product: Red Hat Enterprise Linux 8 Reporter: Waiman Long <llong>
Component: kernel Assignee: Waiman Long <llong>
kernel sub component: Memory Management QA Contact: Li Wang <liwan>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: crecklin, ddutile, jaeshin, mm-maint, travi, wking
Version: 8.4 Keywords: ZStream
Target Milestone: rc   
Target Release: 8.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-4.18.0-248.el8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1929738 1929739 Environment:
Last Closed: 2021-05-18 14:04:21 UTC Type: Component Upgrade
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1861799, 1882177, 1894575, 1929738, 1929739    

Description Waiman Long 2020-08-30 01:37:39 UTC
Depending on the workload, some customers are hitting problems where the OOM killer is activated and the system locks up for a rather long period of time. For example:

 1) BZ #1853862 - [mm] System locked up for over an hour during memory reclaim
 2) BZ #1857446 - ARO/Azure: excessive pod memory allocation causes node lockup 

There are two different problems here:

 1) The workload is consuming memory faster than the system can recover through memory reclaim.
 2) The system seems to lock up when the OOM killer is invoked to kill some tasks.

The first problem can be addressed if the customer uses cgroup v2, as the following upstream commit and its derivatives help to prevent processes from consuming memory too fast.

commit 0e4b01df865935007bd712cbc8e7299005b28894
Author: Chris Down <chris>
Date:   Mon, 23 Sep 2019 15:34:55 -0700

    mm, memcg: throttle allocators when failing reclaim over memory.high

However, cgroup v1 is still the default. Even though we are going to switch to cgroup v2 as the default soon, it takes time for customers to do the migration.
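
As a rough illustration of the cgroup v2 mitigation mentioned above, the sketch below checks whether a mount point looks like a cgroup v2 (unified) hierarchy and, if so, sets a memory.high throttle limit on one group. The cgroup.controllers and memory.high file names are part of the cgroup v2 interface; the helper names (is_cgroup_v2, set_memory_high), the group path in the usage note, and the 512 MiB value are made up for illustration and are not taken from this BZ.

```python
import os


def is_cgroup_v2(root="/sys/fs/cgroup"):
    """Return True if `root` looks like a cgroup v2 (unified) mount.

    The root of a unified hierarchy exposes a cgroup.controllers file;
    a cgroup v1 mount point does not.
    """
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


def set_memory_high(cgroup_dir, limit_bytes=None):
    """Write a memory.high value for one cgroup v2 group (hypothetical helper).

    limit_bytes=None writes "max", which removes the throttle limit.
    Returns the string that was written.
    """
    value = "max" if limit_bytes is None else str(int(limit_bytes))
    with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
        f.write(value + "\n")
    return value
```

On a real system this would be called as root with something like set_memory_high("/sys/fs/cgroup/mygroup", 512 * 1024 * 1024), where "mygroup" is a placeholder group name. On a cgroup v1 system is_cgroup_v2() returns False and memory.high is not available, which is why the reclaim and OOM code update is still needed there.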

For existing cgroup v1 customers, we need to update the memory reclaim and OOM handling code to the more recent v5.6 code base. Hopefully that will help to alleviate the problem.

For the second lockup problem, it is not clear what causes it. Dong Hai is investigating the core dump report in BZ #1853862 to find some clues. If a reproducer can be found, it will help us to locate the problem area more quickly. More work still needs to be done.

-Longman

Comment 14 Jan Stancek 2020-10-22 07:01:22 UTC
Patch(es) available on kernel-4.18.0-240.6.el8.dt2

Comment 17 Jan Stancek 2020-11-12 08:05:32 UTC
Patch(es) available on kernel-4.18.0-248.el8

Comment 23 Waiman Long 2021-02-16 14:02:44 UTC
*** Bug 1929122 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2021-05-18 14:04:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578

Comment 31 Scott Dodson 2021-07-28 15:33:52 UTC
Can this bug please be made public, so that we can reference comment 0?

Comment 32 Waiman Long 2021-07-28 16:32:57 UTC
(In reply to Scott Dodson from comment #31)
> Can this bug please be made public, so that we can reference comment 0?

As there isn't anything confidential in this BZ, I don't think it is a problem to make it public. However, I don't know the proper steps to make it public.

-Longman

Comment 33 Scott Dodson 2021-07-28 17:43:44 UTC
Longman,

Thanks. Mechanically, it should just be a matter of removing redhat from the "Groups:" field, which is just above comment 0 and to the far right. All public comments will become viewable when that changes. I don't see anything sensitive in any of the public comments.

Comment 34 Waiman Long 2021-07-28 18:08:13 UTC
I see. Thanks for the tip.