+++ This bug was initially created as a clone of Bug #2006953 +++

Reboot recovery time of a DU node with the RT kernel is much longer than without the RT kernel. The root cause is BZ 1975356. The DU node profile needs to enable the fix by adding rcupdate.rcu_normal_after_boot=0 to the kernel command line.

--- Additional comment from on 2021-09-22 17:59:59 UTC ---

A soft reboot test on a 4.9 load with the additional kernel arg took about 14 minutes with 43 test pods. Without this kernel arg, recovery could take more than half an hour.
This bug tracks the backport of the fix to release-4.8. The automatic clone failed for some reason. The PR is https://github.com/openshift-kni/cnf-features-deploy/pull/722
Wrong PR in the previous message. The PR for this backport is: https://github.com/openshift-kni/cnf-features-deploy/pull/713
Verification for this bz is partially blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=2009033
Verified on 4.8.15 with 43 test pods. The cluster does become more stable with the additional kernel arg rcupdate.rcu_normal_after_boot=0. However, the cluster sometimes still recovers very slowly. A separate bz was opened to track that: https://bugzilla.redhat.com/show_bug.cgi?id=2014542
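For anyone repeating the verification, one way to confirm that the kernel arg actually took effect on a node is to inspect /proc/cmdline. The following is only a minimal sketch and not part of the fix: the helper names kernel_arg_value and rcu_normal_after_boot_disabled are hypothetical, and it assumes that when an argument is repeated, the last occurrence is the one that counts.

```python
from pathlib import Path
from typing import Optional


def kernel_arg_value(cmdline: str, name: str) -> Optional[str]:
    """Return the value of the last `name=value` token in a kernel
    command line, or None if no such token is present.

    Hypothetical helper; assumes the last occurrence of a repeated
    argument is the effective one.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val
    return value


def rcu_normal_after_boot_disabled(cmdline: str) -> bool:
    """True if rcupdate.rcu_normal_after_boot=0 is set on the command line."""
    return kernel_arg_value(cmdline, "rcupdate.rcu_normal_after_boot") == "0"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    # /proc/cmdline only exists on Linux; skip quietly elsewhere.
    if proc_cmdline.exists():
        print(rcu_normal_after_boot_disabled(proc_cmdline.read_text()))
```

Run directly on the node (for example from a debug shell), this prints True when the argument is present with value 0. It only checks the booted command line and says nothing about recovery time itself.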
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.27 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0113