+++ This bug was initially created as a clone of Bug #2049055 +++

Description of the problem:
Request to include the fix for the issue described in Bugzilla #2020764 [OCP node XFS metadata corruption after numerous reboots] in OpenShift RHCOS.
In order to get this into RHCOS 4.9, we need the fix backported into RHEL 8.4.z EUS. I've requested the backport here: https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c25

If the z-stream request is accepted, I'll reset the DependsOn field to point to the 8.4.z BZ.
@Rio thanks for testing the hotfix, but this tracker BZ cannot be VERIFIED until the fixed kernel build lands in a version of RHCOS that will be shipped to all customers. Moving this back to ASSIGNED.
The fixed kernel (kernel-4.18.0-305.40.1.el8_4) for 8.4.z was shipped as part of https://access.redhat.com/errata/RHSA-2022:0777

It was included in RHCOS 410.84.202203081640-0 and will be included in a future OCP 4.10.z release payload.
verified kernel version with 4.9.24

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.24    True        False         4m50s   Cluster version is 4.9.24

$ oc get node -o wide
NAME                                         STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-151-81.us-east-2.compute.internal    Ready    master   30m   v1.22.5+5c84e52   10.0.151.81    <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8
ip-10-0-152-76.us-east-2.compute.internal    Ready    worker   22m   v1.22.5+5c84e52   10.0.152.76    <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8
ip-10-0-170-224.us-east-2.compute.internal   Ready    master   32m   v1.22.5+5c84e52   10.0.170.224   <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8
ip-10-0-175-29.us-east-2.compute.internal    Ready    worker   22m   v1.22.5+5c84e52   10.0.175.29    <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8
ip-10-0-203-38.us-east-2.compute.internal    Ready    master   31m   v1.22.5+5c84e52   10.0.203.38    <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8
ip-10-0-220-112.us-east-2.compute.internal   Ready    worker   22m   v1.22.5+5c84e52   10.0.220.112   <none>        Red Hat Enterprise Linux CoreOS 49.84.202203081945-0 (Ootpa)   4.18.0-305.40.1.el8_4.x86_64   cri-o://1.22.2-2.rhaos4.9.gitb030be8.el8

$ for node in $(oc get node -o name);do echo;oc debug $node -- chroot /host uname -r;done

Starting pod/ip-10-0-151-81us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

Starting pod/ip-10-0-152-76us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

Starting pod/ip-10-0-170-224us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

Starting pod/ip-10-0-175-29us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

Starting pod/ip-10-0-203-38us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

Starting pod/ip-10-0-220-112us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
4.18.0-305.40.1.el8_4.x86_64

Removing debug pod ...

@miabbott is there any other info need to check?
> @miabbott is there any other info need to check?

As this is a tracker BZ, we are only verifying that the fixed package was included in RHCOS/OCP. Verification of the actual problem being fixed is handled by the respective RHEL QE team.

Thanks for the verification; moving to VERIFIED
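
For anyone repeating this kind of tracker check, the package-inclusion comparison could be scripted along these lines. This is only a rough sketch, not something that was run for this BZ: `kernel_at_least` is a hypothetical helper name, it assumes GNU `sort -V` is available, and it only strips the `x86_64` and `aarch64` architecture suffixes.

```shell
#!/usr/bin/env bash
# Sketch only: kernel_at_least is a hypothetical helper, not part of oc or RHCOS.
# Succeeds (exit 0) when the running kernel release ($1) is the same as or
# newer than the fixed release ($2), using GNU sort's version ordering.
kernel_at_least() {
    local running="$1" fixed="$2"
    # Strip a trailing architecture suffix so both strings have the same shape.
    running="${running%.x86_64}"
    running="${running%.aarch64}"
    local lowest
    # The lower of the two versions sorts first; the check passes when that
    # lowest entry is the fixed release itself.
    lowest="$(printf '%s\n%s\n' "$fixed" "$running" | sort -V | head -n 1)"
    [ "$lowest" = "$fixed" ]
}

# Example (needs a logged-in oc session, so it is left commented out):
# oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kernelVersion}{"\n"}{end}' |
# while read -r node kernel; do
#     if kernel_at_least "$kernel" "4.18.0-305.40.1.el8_4"; then
#         echo "OK  $node $kernel"
#     else
#         echo "OLD $node $kernel"
#     fi
# done
```

The commented-out loop reads the kernel version from each node's status instead of starting a debug pod per node, which gives the same information as the KERNEL-VERSION column of `oc get node -o wide`.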
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.24 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0798
According to the original bug's comment https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c48, this is covered by an automated test, so setting the qe_test_coverage flag to '+'.