Bug 2049057 - [tracker] [OCP 4.8] Include fix for xfs metadata corruption in RHCOS
Summary: [tracker] [OCP 4.8] Include fix for xfs metadata corruption in RHCOS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.z
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 2049056 2049291
Blocks:
Reported: 2022-02-01 12:48 UTC by Mario Abajo
Modified: 2022-10-14 11:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2049055
Environment:
Last Closed: 2022-03-22 17:29:11 UTC
Target Upstream Version:
Embargoed:



Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2022:0872  0        None      None    None     2022-03-22 17:29:31 UTC

Description Mario Abajo 2022-02-01 12:48:43 UTC
+++ This bug was initially created as a clone of Bug #2049055 +++

Description of the problem:

Request to include the fix for the issue described in Bugzilla #2020764 [OCP node XFS metadata corruption after numerous reboots] in OpenShift RHCOS.

Comment 2 Micah Abbott 2022-02-01 13:55:09 UTC
In order to get this into RHCOS 4.8, we need the fix backported into RHEL 8.4.z EUS.

I've requested the backport here - https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c25

If the z-stream request is accepted, I'll reset the DependsOn field to point to the 8.4.z BZ.

Comment 6 daniel.hagen 2022-02-17 06:55:22 UTC
Does anyone know when the backport for RHCOS 4.8 will appear?
We've seen exactly this on a bare-metal test system with OpenShift 4.8.x and RHCOS 4.18.14. After a few reboots by the MCO, the affected nodes run into XFS corruption.

 XFS (sda4): Mounting V5 Filesystem
 XFS (sda4): Starting recovery (logdev: internal)
 XFS (sda4): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
 XFS (sda4): Unmount and run xfs_repair
 XFS (sda4): First 128 bytes of corrupted metadata buffer:
 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
 XFS (sda4): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
 XFS (sda4): Please unmount the filesystem and rectify the problem(s)
 XFS (sda4): log mount/recovery failed: error -117
 XFS (sda4): log mount failed
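
For triaging many nodes, the signature above can be picked out of a saved console or journal log mechanically. The following is only a sketch: the two patterns are copied from the message wording in this excerpt, other kernel versions may phrase them differently, and `scan_xfs_log` is a hypothetical helper name.

```python
import re

# Patterns copied from the log excerpt in this comment; other kernels may
# word these messages differently, so treat them as assumptions.
CORRUPTION_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Metadata corruption detected at (?P<where>\S+), "
    r"(?P<buf>\S+) block (?P<block>0x[0-9a-fA-F]+)"
)
MOUNT_FAILED_RE = re.compile(r"XFS \((?P<dev>[^)]+)\): log mount failed")


def scan_xfs_log(lines):
    """Return one dict per 'Metadata corruption detected' line.

    Each dict also records whether a later or earlier 'log mount failed'
    message was seen for the same device.
    """
    findings = []
    failed_devices = set()
    for line in lines:
        hit = CORRUPTION_RE.search(line)
        if hit:
            findings.append(
                {
                    "device": hit["dev"],
                    "function": hit["where"],
                    "buffer_type": hit["buf"],
                    "block": int(hit["block"], 16),
                }
            )
            continue
        failed = MOUNT_FAILED_RE.search(line)
        if failed:
            failed_devices.add(failed["dev"])
    for finding in findings:
        finding["mount_failed"] = finding["device"] in failed_devices
    return findings
```

Feeding it the lines above should yield a single finding for sda4 with buffer type xfs_dir3_leaf1 and mount_failed set.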


# oc image info $(oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.8.32-x86_64) -o json|jq -r '.config.config.Labels."com.coreos.rpm.kernel"'
4.18.0-305.34.2.el8_4.x86_64

As far as I can see, the fixed kernel for 8.4.z EUS has already been released, according to https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/hotfixes/art3786/kernel-4.18.0-305.39.1.el8_4.x86_64/ .
According to the kernel changelog, the fix is included --> xfs: check sb_meta_uuid for dabuf buffer recovery (Bill O'Donnell) [2049291 2020764]

Comment 7 Micah Abbott 2022-02-17 14:15:08 UTC
(In reply to daniel.hagen from comment #6)

> As far is i can see , the fixed kernel for 8.4.z EUS is allready released
> according to
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/hotfixes/
> art3786/kernel-4.18.0-305.39.1.el8_4.x86_64/ .
> According to the kernel changelog the fix ins included --> xfs: check
> sb_meta_uuid for dabuf buffer recovery (Bill O'Donnell) [2049291 2020764])

That is a hotfix release and should not be used by customers unless they have an approved Support Exception.


This bug is tracking the inclusion of the fix from the RHEL BZ https://bugzilla.redhat.com/show_bug.cgi?id=2049291

The fix for that BZ has not been released to RHEL (or OCP) customers yet, but it looks like it will be included as part of the next RHEL 8.4.z EUS batch release (scheduled for early March).

The fixed kernel will automatically be included in RHCOS as part of the usual build process and will make its way into an OCP z-stream release 1-2 weeks after the release of the kernel to RHEL customers.

Comment 10 Micah Abbott 2022-03-09 16:13:07 UTC
The fixed kernel (kernel-4.18.0-305.40.1.el8_4) for 8.4.z was shipped as part of https://access.redhat.com/errata/RHSA-2022:0777.

It was included in RHCOS 48.84.202203081757-0 and will be included in a future OCP 4.8.z release payload.
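
To check whether a given node or release payload kernel is at or past that build, a simple numeric comparison is enough within the el8_4 stream. This is a sketch under assumptions: `has_xfs_fix` and `_key` are hypothetical helper names, the comparison looks only at the numeric fields before the dist tag rather than implementing full RPM version ordering, and it does not recognize out-of-stream hotfix builds such as the 305.39.1 kernel mentioned in comment 6.

```python
import re

# First regular 8.4.z kernel build stated in this bug to carry the fix
# (shipped in RHSA-2022:0777).
FIXED_KERNEL = "4.18.0-305.40.1.el8_4"


def _key(version):
    """Turn a kernel version-release string into a tuple of integers.

    Simplification: only the numeric fields before the '.el8_4' dist tag
    are compared. This is not full RPM version comparison.
    """
    numeric_part = version.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def has_xfs_fix(kernel):
    """Return True if `kernel` is at or past FIXED_KERNEL in the el8_4 stream."""
    # Accept both "kernel-4.18.0-...x86_64" (rpm -q output) and bare versions.
    if kernel.startswith("kernel-"):
        kernel = kernel[len("kernel-"):]
    return _key(kernel) >= _key(FIXED_KERNEL)
```

For example, the 4.8.32 payload kernel from comment 6 (4.18.0-305.34.2.el8_4.x86_64) compares as older than the fixed build.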

Comment 12 Michael Nguyen 2022-03-14 18:39:37 UTC
Verified that the updated kernel is in RHCOS 48.84.202203101844-0, which is part of the OCP 4.8.0-0.nightly-2022-03-13-111317 payload.

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-g585zjt-72292-mkhft-master-0         Ready    master   24m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-master-1         Ready    master   24m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-master-2         Ready    master   23m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-a-mmrj6   Ready    worker   17m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-b-7xcvp   Ready    worker   17m   v1.21.8+ee73ea2
ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq   Ready    worker   17m   v1.21.8+ee73ea2
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-03-13-111317   True        False         5m51s   Cluster version is 4.8.0-0.nightly-2022-03-13-111317
$ oc debug node/ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq
Starting pod/ci-ln-g585zjt-72292-mkhft-worker-c-pb4fq-debug ...
To use host binaries, run `chroot /host`

If you don't see a command prompt, try pressing enter.

sh-4.2# chroot /host
sh-4.4# rpm -q kernel
kernel-4.18.0-305.40.2.el8_4.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad5f2045705fd39b6f7e27fe056085816108eb62fd8e09ea108be4263b235e15
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202203101844-0 (2022-03-10T18:47:20Z)

  ostree://13c18da5e6fee09fade484c3903209730cbb73e9ebcab806b9e9000cf97fd719
                   Version: 48.84.202109241901-0 (2021-09-24T19:04:29Z)
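
The same verification can be scripted against captured `rpm-ostree status` text. The sketch below assumes the human-readable layout shown in this transcript, where the booted deployment is the one marked with `*` and each deployment is followed by a `Version:` line; `booted_version` is a hypothetical helper name.

```python
def booted_version(status_text):
    """Return the Version string of the deployment marked with '*', or None.

    Assumes the plain-text layout shown in the transcript above.
    """
    in_booted = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("* "):
            # Start of the booted deployment entry.
            in_booted = True
        elif stripped.startswith(("pivot://", "ostree://")):
            # Start of a different, non-booted deployment entry.
            in_booted = False
        elif in_booted and stripped.startswith("Version:"):
            # Line looks like "Version: 48.84.202203101844-0 (2022-03-10T18:47:20Z)".
            return stripped.split()[1]
    return None
```

Applied to the output above, it picks the version of the pivot:// deployment rather than the older rollback deployment.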

Comment 15 errata-xmlrpc 2022-03-22 17:29:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.35 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0872

Comment 16 HuijingHei 2022-04-12 04:20:44 UTC
According to comment 5, the original bug (https://bugzilla.redhat.com/show_bug.cgi?id=2020764#c48) is covered by an automated test, so I am setting the qe_test_coverage flag to '+'.

