Bug 2207682
| Summary: | OSD pod fails MCP update | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | mperetz <mperetz> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED COMPLETED | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | dollierp, fdeutsch, jhopper, jpeimer, muagarwa, ocs-bugs, odf-bz-bot, ryasharz, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-23 21:26:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Does this look related? https://bugzilla.redhat.com/show_bug.cgi?id=2182820

The must-gather shows that the "activate" container of the rook-ceph-osd-2 pod does not find the device at pod startup:
2023-05-16T12:54:27.781636226Z failed to read label for /var/lib/ceph/osd/ceph-2/block: (2) No such file or directory
This will prevent the OSD from starting, and the PDBs will prevent other OSDs from being taken down.
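A quick way to confirm this state on a cluster hitting the same symptom is sketched below; the deployment name rook-ceph-osd-2 and the "activate" container are the ones from this report, and the commands are standard oc usage, so adjust names to your environment.

```shell
# Logs of the "activate" container of the stuck OSD (deployment name from this report)
oc -n openshift-storage logs deploy/rook-ceph-osd-2 -c activate

# PodDisruptionBudgets managed by Rook's clusterdisruption-controller;
# while PGs are not clean these block draining further OSD nodes
oc -n openshift-storage get pdb
```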
The disk must have been renamed during the node restart, which caused the failure in the OSD.

osd-2 is consuming PVC ocs-deviceset-0-data-1vjj9j,
which is bound to PV local-pv-ae1524ba.
That PV is using the path:

    local:
      path: /mnt/local-storage/local-block-ocs/google-

That path must have been lost during the node update. To be more reliable, LSO volumes should use a by-id path under /dev/disk/by-id.
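To trace the same chain on another cluster, something like the following should work (a sketch; the PVC and PV names are the ones quoted above, and the jsonpath expressions are standard):

```shell
# PV bound to the OSD's PVC
oc -n openshift-storage get pvc ocs-deviceset-0-data-1vjj9j -o jsonpath='{.spec.volumeName}{"\n"}'

# Host path the local PV points at
oc get pv local-pv-ae1524ba -o jsonpath='{.spec.local.path}{"\n"}'
```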
Moving out of 4.13 while discussing...
Is this bug affecting a) only new installations, or b) upgrades from 4.12 to 4.13? (a) is still a bug, but less severe; (b) would be pretty bad, as we would break existing customer deployments.

@Fabian, this bug can happen with any MCP update when ODF is installed. So whenever we do an operation that updates the machine config, such as adding an ICSP, applying a new machine config, or even running compliance tests, we may hit this issue. We usually hit it during every tier0 build. So it is not related to whether the install is fresh, but to any operation that requires an MCP update. I guess an upgrade from 4.12 to 4.13 also requires an MCP update, so yes, if ODF is installed we may hit it there too.

I have a few questions based on Travis' comment #3:
1. https://bugzilla.redhat.com/show_bug.cgi?id=2182820#c10, which looks similar, is not reproducible now. Is the current issue still happening?
2. For both RC 5 and RC 8, are the PVs using disk names rather than disk IDs? This can be confirmed with the PV that is bound to the OSD PVC. For example:

       local:
         path: /mnt/local-storage/local-block-ocs/google-

3. If the PV is using a disk name rather than a disk ID, do you see the disk names change when the node restarts?

1. Yes, we still see it constantly with 4.13.1, and also 4.14 releases. We moved
2. This is the path I see for 4.13.1 (where the issue also happens):
    uid: 6985e8cc-0dbe-4aff-b356-f8f6ec6a1fd0
    local:
      path: /mnt/local-storage/local-block-ocs/google-
    nodeAffinity:
      required:
And this is the path on 4.13.0-RC5:
    uid: a3768836-d7bb-41a6-bff0-a7023f295e7e
    local:
      path: /mnt/local-storage/local-block-ocs/virtio-a92d4e77-0183-40f9-9
    nodeAffinity:
      required:
        nodeSelectorTerms:
So yes, both are using PVs on disks rather than disk-ids.
3. I tried a manual restart of a node and did not see the disk name change after that reboot, but as I mentioned, it takes several reboots before we hit the problem.
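A quick way to compare the local paths of all PVs at once (a sketch using oc's custom-columns output; nothing here is specific to this cluster):

```shell
# One row per PV with the host path its local volume points at
oc get pv -o custom-columns=NAME:.metadata.name,LOCAL_PATH:.spec.local.path
```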
@mperetz, does this mean that if ODF was installed on 4.12 or lower, and /dev/sdN is used to identify a device, then this bug can appear after upgrading to 4.13? If so, isn't this a severe regression?

@Fabian, this is what I know: starting from OCP 4.13.0-rcX, where X > 5 (I can't tell exactly which X; I know GA (RC8) and I guess also RC7), we randomly hit this OSD pod eviction during MCP updates. So I would *carefully* say that the answer depends on whether there are MCP updates as part of, or after, the upgrade process.

We noticed that the discovered deviceIDs are a bit weird (4.13.1):
[mperetz@fedora cnv-qe-automation]$ oc get localvolumediscoveryresults -n openshift-local-storage -o yaml
    apiVersion: v1
    items:
    - apiVersion: local.storage.openshift.io/v1alpha1
      kind: LocalVolumeDiscoveryResult
      metadata:
        creationTimestamp: "2023-05-24T05:04:25Z"
        generation: 1
        labels:
          discovery-result-node: infd-vrf-413t0-v26fv-master-2
        name: discovery-result-infd-vrf-413t0-v26fv-master-2
        namespace: openshift-local-storage
        ownerReferences:
        - apiVersion: local.storage.openshift.io/v1alpha1
          kind: LocalVolumeDiscovery
          name: auto-discover-devices
          uid: d5991d54-5265-4ec0-8342-b4e65a6642b9
        resourceVersion: "3583408"
        uid: 359cfbd6-073f-4b05-b938-35f136102b47
      spec:
        nodeName: infd-vrf-413t0-v26fv-master-2
      status:
        discoveredDevices:
        - deviceID: /dev/disk/by-id/google--part1
          fstype: ""
          model: ""
          path: /dev/vda1
          property: Rotational
          serial: ""
          size: 1048576
          status:
            state: NotAvailable
          type: part
          vendor: ""
        - deviceID: /dev/disk/by-id/google--part2
          fstype: vfat
          model: ""
          path: /dev/vda2
          property: Rotational
          serial: ""
          size: 133169152
          status:
            state: NotAvailable
          type: part
          vendor: ""
        - deviceID: /dev/disk/by-id/google--part3
          fstype: ext4
          model: ""
          path: /dev/vda3
          property: Rotational
          serial: ""
          size: 402653184
          status:
            state: NotAvailable
          type: part
          vendor: ""
        - deviceID: /dev/disk/by-id/google--part4
          fstype: xfs
          model: ""
          path: /dev/vda4
          property: Rotational
          serial: ""
          size: 139048500736
          status:
            state: NotAvailable
          type: part
          vendor: ""
        - deviceID: /dev/disk/by-id/google-
          fstype: xfs
          model: ""
          path: /dev/vdb
          property: Rotational
          serial: 81a49af9-5f82-48d4-a
          size: 53687091200
          status:
            state: NotAvailable
          type: disk
          vendor: "0x1af4"
        - deviceID: /dev/disk/by-id/virtio-346ea4d8-af43-435f-8
          fstype: ""
          model: ""
          path: /dev/vdc
          property: Rotational
          serial: 346ea4d8-af43-435f-8
          size: 53687091200
          status:
            state: NotAvailable
          type: disk
          vendor: "0x1af4"
        discoveredTimeStamp: "2023-05-26T23:57:24Z"
    kind: List
    metadata:
      resourceVersion: ""
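A compact way to pull just the discovered deviceIDs out of the same resource (a sketch; standard jsonpath against the LocalVolumeDiscoveryResult objects shown above):

```shell
# One line per node listing the deviceIDs LSO discovered
oc -n openshift-local-storage get localvolumediscoveryresults \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{": "}{.status.discoveredDevices[*].deviceID}{"\n"}{end}'
```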
Another observation: it looks like LSO just takes the links as they are from the node itself:

    To use host binaries, run `chroot /host`
    Pod IP: 192.168.0.195
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-5.1# ls -l /dev/disk/by-id/
    total 0
    lrwxrwxrwx. 1 root root  9 May 29 11:57 google- -> ../../vdb
    lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part1 -> ../../vda1
    lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part2 -> ../../vda2
    lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part3 -> ../../vda3
    lrwxrwxrwx. 1 root root 10 May 24 09:42 google--part4 -> ../../vda4
    lrwxrwxrwx. 1 root root  9 May 29 11:57 virtio-1bac3431-0365-4387-a -> ../../vdb
    lrwxrwxrwx. 1 root root  9 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b -> ../../vda
    lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part1 -> ../../vda1
    lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part2 -> ../../vda2
    lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part3 -> ../../vda3
    lrwxrwxrwx. 1 root root 10 May 24 09:42 virtio-2d1fd2a0-5a28-4e77-b-part4 -> ../../vda4
    lrwxrwxrwx. 1 root root  9 May 24 09:42 virtio-9e1f496b-7a19-4c05-8 -> ../../vdc
    sh-5.1#

This is how it looks in 4.13.0-RC5:

    sh-5.1# ls -l /dev/disk/by-id/
    total 0
    lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-83c35374-16ae-47b8-a -> ../../vdc
    lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-863862bb-ee95-4872-a -> ../../vda
    lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part1 -> ../../vda1
    lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part2 -> ../../vda2
    lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part3 -> ../../vda3
    lrwxrwxrwx. 1 root root 10 May 24 11:12 virtio-863862bb-ee95-4872-a-part4 -> ../../vda4
    lrwxrwxrwx. 1 root root  9 May 24 11:12 virtio-c3f77761-1744-4a01-b -> ../../vdb
    sh-5.1#

So I suspect something is creating those google-* links, which makes LSO discover the wrong links...

Another observation: it looks like what is adding those google-* links is this udev rule: /usr/lib/udev/rules.d/65-gce-disk-naming.rules, which is NOT present in 4.13.0-rc5:

    sh-5.1# ls -l /usr/lib/udev/rules.d/*gce*
    ls: cannot access '/usr/lib/udev/rules.d/*gce*': No such file or directory
    sh-5.1#

IIUC, it should be fixed by https://github.com/GoogleCloudPlatform/guest-configs/pull/52. We need to wait for an RHCOS update that includes this fix to land in OpenShift 4.13+.

I opened a bug with systemd: https://bugzilla.redhat.com/show_bug.cgi?id=2211632

This bug needs to be moved out of rook. Not sure which component it should be moved to.

I created this Jira issue for the RHCOS team: https://issues.redhat.com/browse/COS-2245.

FYI, the "google" udev naming bug is tracked in https://issues.redhat.com/browse/OCPBUGS-13754. It is starting to sound like it is causing other functional issues; hope to see a fix soon.

(In reply to Jenifer Abrams from comment #19)
> FYI the "google" udev naming bug is tracked in
> https://issues.redhat.com/browse/OCPBUGS-13754
> it is starting to sound like it is causing other functional issues, hope to
> see a fix soon.

Thanks for the information! I closed COS-2245 as a duplicate of OCPBUGS-13754.

Closing as it appears https://issues.redhat.com/browse/OCPBUGS-13754 has been resolved and verified.
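For readers who want to check their own nodes for the udev naming issue discussed above, the same inspection can be done non-interactively (a sketch; `<node-name>` is a placeholder, and `oc debug node/...` with `chroot /host` is the standard pattern):

```shell
NODE=<node-name>  # placeholder: one of your cluster nodes

# Is the GCE disk-naming udev rule present on the node?
oc debug node/"$NODE" -- chroot /host ls -l /usr/lib/udev/rules.d/65-gce-disk-naming.rules

# What do the by-id symlinks currently look like?
oc debug node/"$NODE" -- chroot /host ls -l /dev/disk/by-id/
```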
Created attachment 1964906 [details]
ODF mustgather

Description of problem (please be detailed as possible and provide log snippets):

In CNV we have been using the following image for ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13. Lately, with OpenShift build 4.13.0-rc.8-x86_64, MCP updates fail randomly with the following errors in the machine-config-controller pod:

    2023-05-12 12:05:13.634826 I | clusterdisruption-controller: osd is down in failure domain "rack0" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:unknown Count:3}]"

And eventually the MCP update fails with:

    I0516 13:00:19.190838 1 drain_controller.go:171] node infd-vrf-414t0-r82cz-master-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"rook-ceph-osd-1-6c577579f5-z6sw9" -n "openshift-storage": global timeout reached: 1m30s

This issue does not occur with RC 5 (4.13.0-rc.5-x86_64) and quay.io/rhceph-dev/ocs-registry:latest-stable-4.13.

Version of all relevant components (if applicable):
OpenShift: 4.13.0-rc.8-x86_64
ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Currently it breaks CNV builds after we make a machine config change, like enabling huge pages, applying a new ICSP, and so on.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Yes. With the following builds:
OpenShift: 4.13.0-rc.5-x86_64 (RC 5)
ODF: quay.io/rhceph-dev/ocs-registry:latest-stable-4.13
the issue does not occur, but with the same ODF image and RC 8 (4.13.0-rc.8-x86_64) the issue does reproduce.

Steps to Reproduce:
1. Install an OpenShift cluster with version 4.13.0-rc.8-x86_64
2. Install ODF from catalog source image quay.io/rhceph-dev/ocs-registry:latest-stable-4.13
3. Perform multiple MCP updates manually. You can use these scripts for that:

    # Pause master and worker MCP.
    pause_mcp() {
        oc patch --type=merge --patch='{"spec":{"paused": true}}' $(oc get mcp -o name)
    }

    # Resume master and worker MCP.
    resume_mcp() {
        oc patch --type=merge --patch='{"spec":{"paused": false}}' $(oc get mcp -o name)
    }

    wait_mcp_for_updated() {
        local attempts=${1:-60} i
        local mcp_updated="false"
        local mcp_stat_file="$(mktemp "${TMDIR:-/tmp}"/mcp-stat.XXXXX)"
        sleep 30
        for ((i=1; i<=attempts; i++)); do
            echo_debug "Attempt ${i}/${attempts}"
            sleep 30
            if oc wait mcp --all --for condition=updated --timeout=1m; then
                echo "MCP is Updated"
                mcp_updated="true"
                break
            fi
        done
        rm -f "${mcp_stat_file}"
        if [[ "${mcp_updated}" == "false" ]]; then
            echo "Error: MCP didn't get Updated!!"
            exit 1
        fi
    }

    pause_mcp
    resume_mcp
    wait_mcp_for_updated

Actual results:

Expected results:

Additional info:
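While the MCP update from step 3 is running, the stuck state described in this report can be observed with standard commands (a sketch; `app=rook-ceph-osd` is Rook's usual OSD pod label and the machine-config-controller container name may differ across OCP versions):

```shell
# In one terminal: watch the MachineConfigPools roll through the update
oc get mcp -w

# In another: watch the OSD pods (Rook labels them app=rook-ceph-osd)
oc -n openshift-storage get pods -l app=rook-ceph-osd -w

# If a drain hangs, check the blocking PDBs and the drain errors
oc -n openshift-storage get pdb
oc -n openshift-machine-config-operator logs deploy/machine-config-controller \
  -c machine-config-controller --tail=50
```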