Depending on the workload, some customers are hitting problems with the OOM killer being activated and the system locking up for a rather long period of time. For example:

1) BZ #1853862 - [mm] System locked up for over an hour during memory reclaim
2) BZ #1857446 - ARO/Azure: excessive pod memory allocation causes node lockup

There are two different problems here:

1) The workload is consuming memory faster than the system can recover it through memory reclaim.
2) The system seems to lock up when the OOM killer is invoked to kill some tasks.

The first problem can be addressed if cgroup v2 is used by the customer, as the following upstream commit and its derivatives can help prevent processes from consuming memory too fast:

commit 0e4b01df865935007bd712cbc8e7299005b28894
Author: Chris Down <chris>
Date:   Mon, 23 Sep 2019 15:34:55 -0700

    mm, memcg: throttle allocators when failing reclaim over memory.high

However, cgroup v1 is still the default. Even though we are going to switch to cgroup v2 as the default soon, it will take time for customers to do the migration. For existing cgroup v1 customers, we need to bring the memory reclaim and OOM handling code up to a more recent code base (v5.6). Hopefully that will help to alleviate the problem.

For the second (lockup) problem, it is not clear what causes it. Dong Hai is investigating the core dump reported in BZ #1853862 to find some clues. If a reproducer can be found, it will help us locate the problem area more quickly.

More work still needs to be done.

-Longman
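For readers unfamiliar with the memory.high knob mentioned in that commit, here is a minimal sketch of how a cgroup v2 memory.high limit could be set up so allocations are throttled before the OOM killer is ever needed. The group name "demo" and the 512 MiB / 1 GiB values are illustrative assumptions, and the sketch assumes the unified (cgroup v2) hierarchy is mounted at /sys/fs/cgroup; it is not taken from this BZ.

    # Illustrative sketch only: group name and limits are assumptions.
    import os

    CGROUP_ROOT = "/sys/fs/cgroup"            # cgroup v2 unified hierarchy mount
    GROUP = os.path.join(CGROUP_ROOT, "demo") # hypothetical group for the workload

    os.makedirs(GROUP, exist_ok=True)

    # memory.high: usage above this is throttled (reclaim pressure), not killed.
    with open(os.path.join(GROUP, "memory.high"), "w") as f:
        f.write(str(512 * 1024 * 1024))

    # memory.max: hard ceiling; only past this does the OOM killer get involved.
    with open(os.path.join(GROUP, "memory.max"), "w") as f:
        f.write(str(1024 * 1024 * 1024))

    # Move the current process into the group so its allocations are accounted.
    with open(os.path.join(GROUP, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

The point of the upstream commit is the gap between the two limits: a runaway allocator gets slowed down at memory.high instead of racing straight to an OOM kill. cgroup v1 has no equivalent throttling, which is why v1 customers depend on the reclaim/OOM code improvements instead.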
Patch(es) available on kernel-4.18.0-240.6.el8.dt2
Patch(es) available on kernel-4.18.0-248.el8
*** Bug 1929122 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578
Can this bug please be made public, so that we can reference comment 0?
(In reply to Scott Dodson from comment #31)
> Can this bug please be made public, so that we can reference comment 0?

As there isn't anything confidential in this BZ, I don't think it is a problem to make it public. However, I don't know the proper steps to make it public.

-Longman
Longman,

Thanks. Mechanically, it should just be a matter of removing redhat from the "Groups:" field, which is just above comment 0 and to the far right. All public comments will become viewable when that changes. I don't see anything sensitive in any of the public comments.
I see. Thanks for the tip.