Bug 2008604

Summary: Backport: Long reboot recovery time for DU node with RT kernel with large number of pods
Product: OpenShift Container Platform
Reporter: Ian Miller <imiller>
Component: Telco Edge
Assignee: Ian Miller <imiller>
Telco Edge sub component: RAN
QA Contact: yliu1
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: yliu1
Version: 4.8
Target Milestone: ---
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2006953
Environment:
Last Closed: 2022-01-18 06:33:08 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2006953
Bug Blocks:

Description Ian Miller 2021-09-28 16:33:37 UTC
+++ This bug was initially created as a clone of Bug #2006953 +++

Reboot recovery time of a DU node with the RT kernel is much longer than without the RT kernel. The root cause is BZ 1975356. The DU node profile needs to enable the fix by adding rcupdate.rcu_normal_after_boot=0 to the kernel command line.
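
For anyone checking whether a node actually booted with the argument, here is a minimal sketch in Python. It is not part of the fix or of the linked PR; the function names are hypothetical, and it assumes the kernel command line can be split on whitespace (quoted values containing spaces are not handled).

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Flags without '=' map to None. If a name appears more than once,
    the last occurrence is the one kept in the returned dict.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_rcu_fix(cmdline: str) -> bool:
    """Return True if rcupdate.rcu_normal_after_boot is explicitly set to 0."""
    return parse_cmdline(cmdline).get("rcupdate.rcu_normal_after_boot") == "0"


def running_kernel_has_fix(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the running kernel (Linux only)."""
    return has_rcu_fix(Path(path).read_text())
```

On a node, calling running_kernel_has_fix() reads /proc/cmdline and reports whether the argument is set; has_rcu_fix() can be used on a command line string captured elsewhere.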

--- Additional comment from  on 2021-09-22 17:59:59 UTC ---

A soft reboot test on a 4.9 load with the additional kernel arg took about 14 minutes to recover with 43 test pods.
Without this kernel arg, recovery could take more than half an hour.

Comment 1 Ian Miller 2021-09-28 16:36:17 UTC
This bug tracks the backport of the fix to release-4.8. The automatic clone failed for some reason. The PR is
https://github.com/openshift-kni/cnf-features-deploy/pull/722

Comment 2 Ian Miller 2021-09-29 15:42:17 UTC
Wrong PR in the previous message. The PR for this backport is:
https://github.com/openshift-kni/cnf-features-deploy/pull/713

Comment 3 yliu1 2021-09-29 18:22:36 UTC
Verification for this bz is partially blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=2009033

Comment 4 yliu1 2021-10-15 19:01:00 UTC
Verified on 4.8.15 with 43 test pods. The cluster does become more stable with the additional kernel arg rcupdate.rcu_normal_after_boot=0.
However, the cluster sometimes still recovers very slowly. A separate bz was opened to track that: https://bugzilla.redhat.com/show_bug.cgi?id=2014542

Comment 7 errata-xmlrpc 2022-01-18 06:33:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0113