+++ This bug was initially created as a clone of Bug #2049055 +++

Description of the problem:
Request to include the fix for the issue described in Bugzilla #2020764 [OCP node XFS metadata corruption after numerous reboots] in OpenShift RHCOS.
In order to get this into RHCOS 4.8, we need the fix backported into RHEL 8.4.z EUS. I've requested the backport here:
https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c25
If the z-stream request is accepted, I'll reset the DependsOn field to point to the 8.4.z BZ.
Does anyone know when the backport for RHCOS 4.8 will appear? We've seen exactly this on a bare-metal test system with OpenShift 4.8.x and RHCOS 4.18.14. After a few reboots by the MCO, the affected nodes run into XFS corruption:

XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Starting recovery (logdev: internal)
XFS (sda4): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
XFS (sda4): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down.
XFS (sda4): Please unmount the filesystem and rectify the problem(s)
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

The kernel shipped in the 4.8.32 release payload:

# oc image info $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.8.32-x86_64) -o json | jq -r '.config.config.Labels."com.coreos.rpm.kernel"'
4.18.0-305.34.2.el8_4.x86_64

As far as I can see, the fixed kernel for 8.4.z EUS has already been released, according to
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/hotfixes/art3786/kernel-4.18.0-305.39.1.el8_4.x86_64/
According to the kernel changelog, the fix is included:
xfs: check sb_meta_uuid for dabuf buffer recovery (Bill O'Donnell) [2049291 2020764]
(In reply to daniel.hagen from comment #6)
> As far as I can see, the fixed kernel for 8.4.z EUS has already been released,
> according to
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/hotfixes/art3786/kernel-4.18.0-305.39.1.el8_4.x86_64/
> According to the kernel changelog, the fix is included:
> xfs: check sb_meta_uuid for dabuf buffer recovery (Bill O'Donnell) [2049291 2020764]

That is a hotfix release and should not be used by customers unless they have an approved Support Exception.

This bug is tracking the inclusion of the fix from the RHEL BZ https://bugzilla.redhat.com/show_bug.cgi?id=2049291

The fix for that BZ has not been released to RHEL (or OCP) customers yet, but it looks like it will be included as part of the next RHEL 8.4.z EUS batch release (scheduled for early March).

The fixed kernel will automatically be included in RHCOS as part of the usual build process and will make its way into an OCP z-stream release 1-2 weeks after the release of the kernel to RHEL customers.
The fixed kernel (kernel-4.18.0-305.40.1.el8_4) for 8.4.z was shipped as part of
https://access.redhat.com/errata/RHSA-2022:0777

It was included in RHCOS 48.84.202203081757-0 and will be included in a future OCP 4.8.z release payload.
Verified that the updated kernel is in RHCOS 48.84.202203101844-0, which is part of the OCP 4.8.0-0.nightly-2022-03-13-111317 payload.

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-g585zjt-72292-mkhft-master-0         Ready    master   24m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-master-1         Ready    master   24m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-master-2         Ready    master   23m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-a-mmrj6   Ready    worker   17m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-b-7xcvp   Ready    worker   17m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq   Ready    worker   17m   v1.21.8+ee73ea2

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-03-13-111317   True        False         5m51s   Cluster version is 4.8.0-0.nightly-2022-03-13-111317

$ oc debug node/ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq
Starting pod/ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -q kernel
kernel-4.18.0-305.40.2.el8_4.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad5f2045705fd39b6f7e27fe056085816108eb62fd8e09ea108be4263b235e15
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202203101844-0 (2022-03-10T18:47:20Z)

  ostree://13c18da5e6fee09fade484c3903209730cbb73e9ebcab806b9e9000cf97fd719
                   Version: 48.84.202109241901-0 (2021-09-24T19:04:29Z)
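
For anyone who wants to script the same check across many nodes, here is a minimal sketch of comparing the `rpm -q kernel` output against the fixed build named in this bug (kernel-4.18.0-305.40.1.el8_4). The helper names `release_key` and `has_xfs_fix` are made up for illustration. The comparison is a plain numeric compare of the version fields, not rpm's own version-comparison algorithm, and it assumes every input is an el8_4 kernel from the same stream. A hotfix build with a lower number (such as 305.39.1) would be reported as not fixed even though it carries the patch.

```python
import re
from typing import Tuple

# Fixed build as stated in this bug (shipped via RHSA-2022:0777).
FIXED_KERNEL = "kernel-4.18.0-305.40.1.el8_4"


def release_key(nvr: str) -> Tuple[int, ...]:
    """Return the numeric version/release fields that precede the dist tag.

    Accepts strings with or without the "kernel-" prefix and with or
    without a trailing architecture, e.g.
    "kernel-4.18.0-305.40.2.el8_4.x86_64" -> (4, 18, 0, 305, 40, 2).
    """
    if nvr.startswith("kernel-"):
        nvr = nvr[len("kernel-"):]
    # Drop the dist tag and everything after it (".el8_4.x86_64").
    body = nvr.split(".el", 1)[0]
    return tuple(int(field) for field in re.split(r"[.-]", body))


def has_xfs_fix(nvr: str) -> bool:
    """True if the kernel is at or above the fixed build (same stream assumed)."""
    return release_key(nvr) >= release_key(FIXED_KERNEL)


if __name__ == "__main__":
    for kernel in (
        "4.18.0-305.34.2.el8_4.x86_64",         # kernel in the 4.8.32 payload
        "kernel-4.18.0-305.40.2.el8_4.x86_64",  # kernel verified above
    ):
        print(kernel, "fixed" if has_xfs_fix(kernel) else "NOT fixed")
```

The two sample inputs are the kernel strings already quoted in this bug: the one from the 4.8.32 payload and the one found on the verified nightly node.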
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.35 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0872
According to comment 5 and the original bug (https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c48), this fix is covered by an automated test, so I am setting the qe_test_coverage flag to '+'.