Bug 2008604 - Backport: Long reboot recovery time for DU node with RT kernel with large number of pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Ian Miller
QA Contact: yliu1
URL:
Whiteboard:
Depends On: 2006953
Blocks:
Reported: 2021-09-28 16:33 UTC by Ian Miller
Modified: 2022-01-18 06:33 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2006953
Environment:
Last Closed: 2022-01-18 06:33:08 UTC
Target Upstream Version:
Embargoed:



Links
System: Red Hat Product Errata
ID: RHBA-2022:0113
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2022-01-18 06:33:17 UTC

Description Ian Miller 2021-09-28 16:33:37 UTC
+++ This bug was initially created as a clone of Bug #2006953 +++

The reboot recovery time of a DU node with the RT kernel is much longer than without the RT kernel. The root cause is BZ 1975356. The DU node profile needs to enable the fix by adding rcupdate.rcu_normal_after_boot=0 to the kernel command line.

--- Additional comment from  on 2021-09-22 17:59:59 UTC ---

A soft reboot test on a 4.9 load with the additional kernel arg took about 14 minutes to recover with 43 test pods.
Without this kernel arg, recovery could take more than half an hour.
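
To confirm that a node actually booted with the argument described above, one option is to inspect its kernel command line. The following is a minimal sketch, not part of the fix itself: the helper names (parse_cmdline, has_rcu_fix, read_cmdline) are hypothetical, and it assumes the command line is available as a plain string such as the contents of /proc/cmdline.

```python
import shlex

# Kernel parameter named in this bug; the value "0" enables the fix.
REQUIRED_ARG = "rcupdate.rcu_normal_after_boot"


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Bare flags without '=' map to None. If a name appears more than
    once, the later occurrence overwrites the earlier one in this sketch.
    """
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def has_rcu_fix(cmdline: str) -> bool:
    """Return True if rcupdate.rcu_normal_after_boot=0 is present."""
    return parse_cmdline(cmdline).get(REQUIRED_ARG) == "0"


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On a node, calling has_rcu_fix(read_cmdline()) would report whether the argument took effect after the reboot.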

Comment 1 Ian Miller 2021-09-28 16:36:17 UTC
This bug tracks the backport of the fix to release-4.8. The automatic clone failed for some reason. The PR is:
https://github.com/openshift-kni/cnf-features-deploy/pull/722

Comment 2 Ian Miller 2021-09-29 15:42:17 UTC
Wrong PR in the previous message. The PR for this backport is:
https://github.com/openshift-kni/cnf-features-deploy/pull/713

Comment 3 yliu1 2021-09-29 18:22:36 UTC
Verification for this bz is partially blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=2009033

Comment 4 yliu1 2021-10-15 19:01:00 UTC
Verified on 4.8.15 with 43 test pods. The cluster does become more stable with the additional kernel arg rcupdate.rcu_normal_after_boot=0.
However, the cluster sometimes still recovers very slowly. A separate bz was opened to track that: https://bugzilla.redhat.com/show_bug.cgi?id=2014542

Comment 7 errata-xmlrpc 2022-01-18 06:33:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0113

