Bug 2207682

Summary: OSD pod fails MCP update
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: mperetz <mperetz>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED COMPLETED
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.13
CC: dollierp, fdeutsch, jhopper, jpeimer, muagarwa, ocs-bugs, odf-bz-bot, ryasharz, tnielsen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-23 21:26:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:

Description mperetz 2023-05-16 13:44:47 UTC
Created attachment 1964906 [details]
ODF mustgather

Description of problem (please be as detailed as possible and provide log snippets):

In CNV we have been using the following image for ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13.

Lately, with OpenShift build 4.13.0-rc.8-x86_64, MCP updates fail randomly with the following error:

2023-05-12 12:05:13.634826 I | clusterdisruption-controller: osd is down in failure domain "rack0" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:unknown Count:3}]"

Eventually the MCP update fails with this error in the machine-config-controller pod:
I0516 13:00:19.190838       1 drain_controller.go:171] node infd-vrf-414t0-r82cz-master-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"rook-ceph-osd-1-6c577579f5-z6sw9" -n "openshift-storage": global timeout reached: 1m30s
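
For reference, a rough way to observe this state while it is happening (a sketch assuming the default openshift-storage and openshift-machine-config-operator namespaces, and that the rook-ceph-tools toolbox deployment is enabled):

# Watch the MCP rollout and the node being drained.
oc get mcp
oc get nodes

# The machine-config-controller logs show the drain retries.
oc -n openshift-machine-config-operator logs deployment/machine-config-controller | grep -i drain

# The rook-ceph-operator logs show the clusterdisruption-controller / OSD PDB messages.
oc -n openshift-storage logs deployment/rook-ceph-operator | grep clusterdisruption

# If the toolbox is deployed, check PG/OSD health directly.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status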

This issue does not occur with RC 5 (4.13.0-rc.5-x86_64) and quay.io/rhceph-dev/ocs-registry:latest-stable-4.13.


Version of all relevant components (if applicable):
OpenShift: 4.13.0-rc.8-x86_64
ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Currently it breaks CNV builds after we make a machine config change, such as enabling huge pages, adding a new ICSP, and so on.


Is there any workaround available to the best of your knowledge? No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible? Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Yes. With the following builds the issue does not occur:
OpenShift: 4.13.0-rc.5-x86_64 (RC 5)
ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13

With the same ODF image and RC 8 (4.13.0-rc.8-x86_64), the issue does reproduce.


Steps to Reproduce:
1. Install an OpenShift cluster with version 4.13.0-rc.8-x86_64
2. Install ODF from catalog source image quay.io/rhceph-dev/ocs-registry:latest-stable-4.13
3. Perform multiple MCP updates manually. You can use these scripts for that:

pause_mcp()
{
    oc patch --type=merge --patch='{"spec":{"paused": true}}' $(oc get mcp -o name)
}

# Resume master and worker MCP.
resume_mcp()
{
    oc patch --type=merge --patch='{"spec":{"paused": false}}' $(oc get mcp -o name)
}


wait_mcp_for_updated()
{
    local attempts=${1:-60} i
    local mcp_updated="false"
    local mcp_stat_file="$(mktemp "${TMPDIR:-/tmp}"/mcp-stat.XXXXX)"

    sleep 30

    for ((i=1; i<=attempts; i++)); do
      echo_debug "Attempt ${i}/${attempts}"
      sleep 30
      if oc wait mcp --all --for condition=Updated --timeout=1m; then
        echo "MCP is Updated"
        mcp_updated="true"
        break
      fi
    done

    rm -f "${mcp_stat_file}"

    if [[ "${mcp_updated}" == "false" ]]; then
      ech "Error: MCP didn't get Updated!!"
      exit 1
    fi
}

pause_mcp
resume_mcp
wait_mcp_for_updated
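
While the scripts above run, the OSD pods and disruption budgets can be watched with something like this (namespace and label assume a default ODF install):

# Watch OSD pods and the Ceph PDBs while the MCP update is in progress.
oc -n openshift-storage get pods -l app=rook-ceph-osd -w
oc -n openshift-storage get pdb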



Actual results:


Expected results:


Additional info:

Comment 2 Jenia Peimer 2023-05-18 06:51:21 UTC
Does this look related?
https://bugzilla.redhat.com/show_bug.cgi?id=2182820

Comment 3 Travis Nielsen 2023-05-22 21:17:15 UTC
The must gather shows that the "activate" container for the rook-ceph-osd-2 pod is not finding the device at pod startup:

2023-05-16T12:54:27.781636226Z failed to read label for /var/lib/ceph/osd/ceph-2/block: (2) No such file or directory

This will prevent the OSD from starting, and the PDBs will prevent other OSDs from being taken down.

The disk must have been renamed during node restart, thus causing the failure in the OSD. 

osd-2 is consuming PVC: ocs-deviceset-0-data-1vjj9j, 
  which is bound to PV: local-pv-ae1524ba

That PV is using the path:
  local:
    path: /mnt/local-storage/local-block-ocs/google-

That path must have been lost during the node update. LSO volumes should use a by-id path (/dev/disk/by-id) to be more reliable.
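
For reference, a rough way to confirm which local path the OSD's PV is using (names taken from the must-gather above; the jsonpath assumes a local-volume PV):

# PVC -> PV binding for the affected OSD, then the PV's local path.
oc -n openshift-storage get pvc ocs-deviceset-0-data-1vjj9j -o jsonpath='{.spec.volumeName}{"\n"}'
oc get pv local-pv-ae1524ba -o jsonpath='{.spec.local.path}{"\n"}'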

Moving out of 4.13 while discussing...

Comment 4 Fabian Deutsch 2023-05-30 12:16:06 UTC
Is this bug affecting
a) only new installations?
b) upgrades from 4.12 to 4.13?

#a is still a bug, but less severe
#b would be pretty bad, as we break existing customer deployments

Comment 5 mperetz 2023-05-30 12:21:41 UTC
@

Comment 6 mperetz 2023-05-30 12:41:31 UTC
Fabian, this bug can happen with any MCP update when ODF is installed.
So whenever we do an operation that updates the machine config, such as adding an ICSP, adding a new machine config, or even during compliance tests, we may hit this issue.
Usually we hit it during every tier0 build.
So it's not related to whether it is a fresh install, but to any operation that requires an MCP update.

I guess an upgrade from 4.12 to 4.13 also requires an MCP update, so yes - in case ODF is installed, we may hit it there too.

Comment 7 Santosh Pillai 2023-05-30 13:09:18 UTC
I've a few questions based on Travis' comment #3

1. https://bugzilla.redhat.com/show_bug.cgi?id=2182820#c10, which looks similar, is not reproducible now. Is the current issue still happening?
2. For both RC 5 and RC 8, are the PVs using disk names rather than disk IDs?
   This can be confirmed with the PV that is bound to the OSD PVC. For example:
    local:
    path: /mnt/local-storage/local-block-ocs/google-
3. If the PV is using a disk name rather than a disk ID, do you see the disk names change when the node restarts? (One way to check is sketched below.)
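
A possible way to check question 3 (the node name below is a placeholder):

# Record the by-id symlinks and block device serials, reboot the node, then compare.
oc debug node/<node-name> -- chroot /host ls -l /dev/disk/by-id/
oc debug node/<node-name> -- chroot /host lsblk -o NAME,SERIAL,SIZE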

Comment 8 mperetz 2023-05-30 15:41:28 UTC
1. Yes, we still see it constantly with 4.13.1, and also 4.14 releases. We moved
2. This is the path I see for 4.13.1 (where the issue also happens):

    uid: 6985e8cc-0dbe-4aff-b356-f8f6ec6a1fd0
  local:
    path: /mnt/local-storage/local-block-ocs/google-
  nodeAffinity:
    required:

And this is the path on 4.13.0-RC5:
    uid: a3768836-d7bb-41a6-bff0-a7023f295e7e
  local:
    path: /mnt/local-storage/local-block-ocs/virtio-a92d4e77-0183-40f9-9
  nodeAffinity:
    required:
      nodeSelectorTerms:

So yes, both are using PVs on disks rather than disk-ids.

3. I tried a manual restart of a node, but after that reboot I didn't see the disk name change. As I mentioned, though, it takes several reboots before we hit the problem.

Comment 9 Fabian Deutsch 2023-05-30 19:02:00 UTC
@mperetz does this mean that if ODF got installed in 4.12 or lower, and /dev/sdN is used to identify a device, then after upgrading to 4.13, this bug can appear?

If this is the case, isn't this a severe regression?

Comment 10 mperetz 2023-05-31 07:17:54 UTC
@Fabian, this is what I know:
Starting from OCP 4.13.0-rcX, where X > 5 (I can't tell exactly which X; I know GA (RC8) and I guess also RC7), we randomly hit this OSD pod eviction issue during MCP update.

So I would *carefully* say that the answer to this question depends on whether MCP updates happen as part of (or after) the upgrade process.

Comment 11 mperetz 2023-05-31 08:21:06 UTC
We noticed that the discovered deviceIDs are a bit weird (4.13.1):

[mperetz@fedora cnv-qe-automation]$ oc get localvolumediscoveryresults -n openshift-local-storage -o yaml
apiVersion: v1
items:
- apiVersion: local.storage.openshift.io/v1alpha1
  kind: LocalVolumeDiscoveryResult
  metadata:
    creationTimestamp: "2023-05-24T05:04:25Z"
    generation: 1
    labels:
      discovery-result-node: infd-vrf-413t0-v26fv-master-2
    name: discovery-result-infd-vrf-413t0-v26fv-master-2
    namespace: openshift-local-storage
    ownerReferences:
    - apiVersion: local.storage.openshift.io/v1alpha1
      kind: LocalVolumeDiscovery
      name: auto-discover-devices
      uid: d5991d54-5265-4ec0-8342-b4e65a6642b9
    resourceVersion: "3583408"
    uid: 359cfbd6-073f-4b05-b938-35f136102b47
  spec:
    nodeName: infd-vrf-413t0-v26fv-master-2
  status:
    discoveredDevices:
    - deviceID: /dev/disk/by-id/google--part1
      fstype: ""
      model: ""
      path: /dev/vda1
      property: Rotational
      serial: ""
      size: 1048576
      status:
        state: NotAvailable
      type: part
      vendor: ""
    - deviceID: /dev/disk/by-id/google--part2
      fstype: vfat
      model: ""
      path: /dev/vda2
      property: Rotational
      serial: ""
      size: 133169152
      status:
        state: NotAvailable
      type: part
      vendor: ""
    - deviceID: /dev/disk/by-id/google--part3
      fstype: ext4
      model: ""
      path: /dev/vda3
      property: Rotational
      serial: ""
      size: 402653184
      status:
        state: NotAvailable
      type: part
      vendor: ""
    - deviceID: /dev/disk/by-id/google--part4
      fstype: xfs
      model: ""
      path: /dev/vda4
      property: Rotational
      serial: ""
      size: 139048500736
      status:
        state: NotAvailable
      type: part
      vendor: ""
    - deviceID: /dev/disk/by-id/google-
      fstype: xfs
      model: ""
      path: /dev/vdb
      property: Rotational
      serial: 81a49af9-5f82-48d4-a
      size: 53687091200
      status:
        state: NotAvailable
      type: disk
      vendor: "0x1af4"
    - deviceID: /dev/disk/by-id/virtio-346ea4d8-af43-435f-8
      fstype: ""
      model: ""
      path: /dev/vdc
      property: Rotational
      serial: 346ea4d8-af43-435f-8
      size: 53687091200
      status:
        state: NotAvailable
      type: disk
      vendor: "0x1af4"
    discoveredTimeStamp: "2023-05-26T23:57:24Z"
kind: List
metadata:
  resourceVersion: ""

Comment 12 mperetz 2023-05-31 08:33:34 UTC
Another observation: it looks like LSO just takes the device IDs as they are from the node itself:

To use host binaries, run `chroot /host`
Pod IP: 192.168.0.195
If you don't see a command prompt, try pressing enter.
sh-4.4# 
sh-4.4# chroot /host
sh-5.1# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root  9 May 29 11:57 google- -> ../../vdb
lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part1 -> ../../vda1
lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part2 -> ../../vda2
lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part3 -> ../../vda3
lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part4 -> ../../vda4
lrwxrwxrwx. 1 root root  9 May 29 11:57 virtio-1bac3431-0365-4387-a -> ../../vdb
lrwxrwxrwx. 1 root root  9 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b -> ../../vda
lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part1 -> ../../vda1
lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part2 -> ../../vda2
lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part3 -> ../../vda3
lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part4 -> ../../vda4
lrwxrwxrwx. 1 root root  9 May 24 09:42 virtio-9e1f496b-7a19-4c05-8 -> ../../vdc
sh-5.1# 

This is how it looks in 4.13.0-RC5:

sh-5.1# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-83c35374-16ae-47b8-a -> ../../vdc
lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-863862bb-ee95-4872-a -> ../../vda
lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part1 -> ../../vda1
lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part2 -> ../../vda2
lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part3 -> ../../vda3
lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part4 -> ../../vda4
lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-c3f77761-1744-4a01-b -> ../../vdb
sh-5.1# 

So I suspect something is creating those google-* links, which makes the LSO discover the wrong links...

Comment 13 mperetz 2023-05-31 10:13:21 UTC
Another observation:

Looks like what's adding those google-* links is this udev rule:

/usr/lib/udev/rules.d/65-gce-disk-naming.rules

Which is NOT present in 4.13.0-rc5:

sh-5.1# ls -l  /usr/lib/udev/rules.d/*gce*
ls: cannot access '/usr/lib/udev/rules.d/*gce*': No such file or directory
sh-5.1#
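
To see what that rule actually does on an affected node (node name is a placeholder):

oc debug node/<node-name> -- chroot /host cat /usr/lib/udev/rules.d/65-gce-disk-naming.rules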

Comment 14 Denis Ollier 2023-05-31 13:41:33 UTC
IIUC, it should be fixed by https://github.com/GoogleCloudPlatform/guest-configs/pull/52.

We need to wait for an update of RHCOS which includes this fix to land in OpenShift 4.13+.

Comment 16 RY 2023-06-01 09:26:52 UTC
I opened a bug against systemd:

https://bugzilla.redhat.com/show_bug.cgi?id=2211632

Comment 17 Santosh Pillai 2023-06-01 10:52:49 UTC
This bug needs to be moved out of rook. Not sure which component it should be moved to.

Comment 18 Denis Ollier 2023-06-02 11:52:04 UTC
I created this Jira issue for the RHCOS team: https://issues.redhat.com/browse/COS-2245.

Comment 19 Jenifer Abrams 2023-06-08 15:12:57 UTC
FYI the "google" udev naming bug is tracked in https://issues.redhat.com/browse/OCPBUGS-13754
it is starting to sound like it is causing other functional issues, hope to see a fix soon.

Comment 20 Denis Ollier 2023-06-08 17:31:18 UTC
(In reply to Jenifer Abrams from comment #19)
> FYI the "google" udev naming bug is tracked in
> https://issues.redhat.com/browse/OCPBUGS-13754
> it is starting to sound like it is causing other functional issues, hope to
> see a fix soon.

Thanks for the information! I closed COS-2245 as a duplicate of OCPBUGS-13754.

Comment 21 Travis Nielsen 2023-06-23 21:26:48 UTC
Closing as it appears https://issues.redhat.com/browse/OCPBUGS-13754 has been resolved and verified.