Depending on the workload, some customers are hitting a problem where the OOM killer is activated and the system locks up for a rather long period of time. For example,
1) BZ #1853862 - [mm] System locked up for over an hour during memory reclaim
2) BZ #1857446 - ARO/Azure: excessive pod memory allocation causes node lockup
There are two different problems here:
1) The workload is consuming memory faster than the system can recover through memory reclaim.
2) The system seems to lock up when the OOM killer is invoked to kill some tasks.
The first problem can be addressed if cgroup v2 is used by the customer, as the following upstream commit and its derivatives can help prevent processes from consuming memory too fast.
commit 0e4b01df865935007bd712cbc8e7299005b28894
Author: Chris Down <chris>
Date: Mon, 23 Sep 2019 15:34:55 -0700
mm, memcg: throttle allocators when failing reclaim over memory.high
However, cgroup v1 is still the default. Even though we are going to switch to cgroup v2 as the default soon, it will take time for customers to do the migration.
For existing cgroup v1 customers, we need to bring the memory reclaim and OOM handling code up to the more recent v5.6 code base. Hopefully that can help alleviate the problem.
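For reference, the memory.high throttling that the above commit provides is configured per cgroup on a v2 hierarchy. A minimal sketch (the "myapp" cgroup name and 1 GiB / 2 GiB limits are illustrative; assumes cgroup v2 is mounted at /sys/fs/cgroup, the systemd default):

```shell
# Create a cgroup for the workload on the v2 hierarchy.
mkdir /sys/fs/cgroup/myapp

# memory.high: once usage exceeds this, allocating tasks are throttled
# and pushed into direct reclaim instead of being OOM-killed outright.
echo $((1024 * 1024 * 1024)) > /sys/fs/cgroup/myapp/memory.high

# memory.max: the hard limit beyond which the OOM killer still applies.
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/myapp/memory.max

# Move the current shell (and its future children) into the cgroup.
echo $$ > /sys/fs/cgroup/myapp/cgroup.procs
```

The gap between memory.high and memory.max is what gives reclaim time to catch up with the allocator, which is exactly the race described in problem 1 above.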
For the second (lockup) problem, it is not clear what causes it. Dong Hai is investigating the core dump report in BZ #1853862 to find some clues. If a reproducer can be found, it will help us locate the problem area more quickly. More work still needs to be done.
-Longman
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:1578
(In reply to Scott Dodson from comment #31)
> Can this bug please be made public, so that we can reference comment 0?
As there isn't anything confidential in this BZ, I don't think it is a problem to make it public. However, I don't know the proper steps to make it public.
-Longman
Longman,
Thanks, mechanically it should just be a matter of removing redhat from the "Groups:" field, which is just above comment 0 and to the far right. All public comments will become viewable when that changes. I don't see anything sensitive in any of the public comments.