Bug 2008604

Summary:	Backport: Long reboot recovery time for DU node with RT kernel with large number of pods
Product:	OpenShift Container Platform	Reporter:	Ian Miller <imiller>
Component:	Telco Edge	Assignee:	Ian Miller <imiller>
Telco Edge sub component:	RAN	QA Contact:	yliu1
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	yliu1
Version:	4.8
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:	2006953	Environment:
Last Closed:	2022-01-18 06:33:08 UTC	Type:	---
Bug Depends On:	2006953
Bug Blocks:

Description Ian Miller 2021-09-28 16:33:37 UTC

+++ This bug was initially created as a clone of Bug #2006953 +++

Reboot recovery time of a DU node with the RT kernel is much longer than without the RT kernel. The root cause is BZ 1975356. The DU node profile needs to enable the fix for that bug by adding rcupdate.rcu_normal_after_boot=0 to the kernel command line.
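
To confirm on a node that the argument took effect after the profile change, the booted kernel command line can be inspected. The following is a minimal sketch and not part of the fix itself; the function names are hypothetical, and it assumes the command line splits cleanly on whitespace (no quoted values containing spaces).

```python
from typing import Dict, Optional

# Kernel argument named in this bug report.
RCU_ARG = "rcupdate.rcu_normal_after_boot"


def parse_kernel_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags (tokens without '=') map to None. Assumption: values do not
    contain quoted whitespace, so a plain split() is sufficient.
    """
    args: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # If a name repeats, the later token overwrites the earlier one here.
        args[name] = value if sep else None
    return args


def has_rcu_fix(cmdline: str) -> bool:
    """Return True if rcupdate.rcu_normal_after_boot=0 is on the command line."""
    return parse_kernel_cmdline(cmdline).get(RCU_ARG) == "0"
```

On a running node, the string to pass in would be the contents of /proc/cmdline, for example has_rcu_fix(open("/proc/cmdline").read()).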

--- Additional comment from  on 2021-09-22 17:59:59 UTC ---

A soft reboot test on a 4.9 load with the additional kernel arg took about 14 minutes to recover with 43 test pods.
Without this kernel arg, recovery could take more than half an hour.

Comment 1 Ian Miller 2021-09-28 16:36:17 UTC

This bug tracks the backport of the fix to release-4.8. The automatic clone failed for some reason. The PR is
https://github.com/openshift-kni/cnf-features-deploy/pull/722

Comment 2 Ian Miller 2021-09-29 15:42:17 UTC

Wrong PR in the previous message. The PR for this backport is:
https://github.com/openshift-kni/cnf-features-deploy/pull/713

Comment 3 yliu1 2021-09-29 18:22:36 UTC

Verification for this bz is partially blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=2009033

Comment 4 yliu1 2021-10-15 19:01:00 UTC

Verified on 4.8.15 with 43 test pods. The cluster does become more stable with the additional kernel arg rcupdate.rcu_normal_after_boot=0.
However, the cluster sometimes still recovers very slowly. A separate bz was opened to track that: https://bugzilla.redhat.com/show_bug.cgi?id=2014542

Comment 7 errata-xmlrpc 2022-01-18 06:33:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0113