Created attachment 1755017 [details]
describe of osd 0 pod

Description of problem (please be detailed as possible and provide log snippets):
======================================================================
On an OCS 4.7.0-250.ci + OCP 4.7 (4.7.0-0.nightly-2021-01-31-031653) cluster, an OCP upgrade to build 4.7.0-0.nightly-2021-02-03-113456 was initiated. During the MCO upgrade, when compute-0 was drained for maintenance and brought back in, the OSD pod scheduled on it got stuck in Init:CrashLoopBackOff, and as a result the OCP upgrade has still not succeeded (more than 16 hrs).

Pod status
===================
rook-ceph-mon-b-674c49c7d-zwbps           2/2   Running                 1     21h   10.128.2.32   compute-1   <none>   <none>
rook-ceph-mon-c-68698666bb-6ffft          2/2   Running                 0     16h   10.130.2.9    compute-0   <none>   <none>
rook-ceph-mon-d-canary-5bb487445f-d4s8k   0/2   Pending                 0     83s   <none>        <none>      <none>   <none>   --> unable to recover as compute-2 is still cordoned
rook-ceph-operator-7f8d4bfdb6-2566r       1/1   Running                 0     21h   10.129.3.7    compute-3   <none>   <none>
rook-ceph-osd-0-7f6c4f5b4-gdp9t           0/2   Init:CrashLoopBackOff   201   16h   10.130.2.8    compute-0   <none>   <none>   --> OSD failed to recover after drain of compute-0
rook-ceph-osd-1-6f7688db8b-dt5mg          2/2   Running                 0     21h   10.128.5.24   compute-2   <none>   <none>   --> this OSD was not drained due to blocking PDBs
rook-ceph-osd-2-56ff8576c-52qtw           2/2   Running                 0     21h   10.128.2.36   compute-1   <none>   <none>

$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     21h
rook-ceph-mon-pdb                                 2               N/A               0                     21h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     16h   --> blocking PDBs since 16h
rook-ceph-osd-host-compute-2                      N/A             0                 0                     16h

>> Flow of events:
1. Initiated the OCP upgrade at Wed Feb 3 16:36:46 UTC 2021; the MCO (machine-config) upgrade started at Wed Feb 3 17:12:02 UTC 2021.
2. During the machine-config upgrade, the first OCS node to be drained was compute-0. mon-c and osd-0 running on it were drained, and the node recovered within 2 minutes.
3. rook-ceph-mon-c-68698666bb-6ffft came up fine on compute-0, but osd-0 is still stuck in Init:CrashLoopBackOff, which blocks the drain of all other OCS nodes.
4. Next in line, OCS node compute-2 was cordoned, but pod rook-ceph-osd-1-6f7688db8b-dt5mg cannot be drained due to blocking PDBs (osd-0 is still DOWN, so the PGs are not clean). mon-a was drained from compute-2 and has since been stuck in Pending state, as compute-2 is still cordoned and waiting for a successful drain.
5. Overall, the OCP upgrade is affected because of the OSD pod. Current state of the upgrade:

>> $ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-03-113456   True        False         16h     Error while reconciling 4.7.0-0.nightly-2021-02-03-113456: the cluster operator machine-config is degraded

Final observation
=======================
1. mon-a/d is down: it was drained, the node never recovered, and so the mon did not recover either (expected in such situations).
2. osd-0 never came back to Running state, and rook-ceph-osd-1-6f7688db8b-dt5mg on compute-2 was never drained due to blocking PDBs (the PGs stayed unclean because osd-0 never recovered on compute-0); hence the OCP upgrade failed.
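For reference, a minimal sketch of the checks that confirm this blocked state (assuming the rook-ceph toolbox is enabled in openshift-storage; "rook-ceph-tools" is the usual deployment name, adjust as needed):

$ oc get pdb -n openshift-storage                                          # OSD PDBs show 0 allowed disruptions while PGs are degraded
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status       # PGs must be active+clean before the next node drain can proceed
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd tree     # osd.0 is reported down while its pod is stuck in init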
Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
=========================
Platform: VMware
N+1 scaling was enabled even for dynamic mode, as the UI fix was not yet in.
Number of OCS nodes = 3 (compute-0, 1, 2)
Extra worker nodes = 3 (compute-3, 4, 5)

1. Initiate the OCP upgrade, e.g.
   date --utc; time oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-03-113456 --force --allow-explicit-upgrade; date --utc
2. Keep checking the progress of the OCP upgrade, especially the machine-config upgrade, while the compute nodes are drained one after the other.
3. For some reason, the first OSD pod to be drained failed to come back to Running state, blocking further OCS node drains.

Actual results:
==================
The OCP upgrade failed, and one OSD pod has been in Init:CrashLoopBackOff since it tried to come up on compute-0 after the drain.

Expected results:
====================
The OSD pod should have recovered successfully and the OCP upgrade should complete without any error.

Additional info:
=======================
Timestamps:

>> 1. OCP upgrade phase when compute-0 was drained = Wed Feb 3 17:12:10 UTC 2021

Wed Feb 3 17:12:10 UTC 2021
===oc get nodes==
NAME        STATUS                     ROLES    AGE     VERSION
compute-0   Ready,SchedulingDisabled   worker   2d11h   v1.20.0+9b492ff

>> 2. Timestamp when the node came back to Ready state = Wed Feb 3 17:14:33 UTC 2021

Wed Feb 3 17:14:33 UTC 2021
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-31-031653   True        True          37m     Working towards 4.7.0-0.nightly-2021-02-03-113456: 522 of 668 done (78% complete)
===oc get nodes==
NAME        STATUS   ROLES    AGE     VERSION
compute-0   Ready    worker   2d11h   v1.20.0+9b492ff

>> 3. Next OCS node to be drained was compute-2, which is still stuck in the same "SchedulingDisabled" state.

Wed Feb 3 17:22:52 UTC 2021
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-31-031653   True        True          46m     Working towards 4.7.0-0.nightly-2021-02-03-113456: 175 of 668 done (26% complete)
===oc get nodes==
NAME        STATUS                     ROLES    AGE     VERSION
compute-0   Ready                      worker   2d11h   v1.20.0+9b492ff
compute-1   Ready                      worker   2d11h   v1.20.0+3b90e69
compute-2   Ready,SchedulingDisabled   worker   2d11h   v1.20.0+3b90e69

>> 4. Current status of nodes:
$ oc get nodes
NAME              STATUS                     ROLES    AGE    VERSION
compute-0         Ready                      worker   3d3h   v1.20.0+9b492ff
compute-1         Ready                      worker   3d3h   v1.20.0+3b90e69
compute-2         Ready,SchedulingDisabled   worker   3d3h   v1.20.0+3b90e69
compute-3         Ready                      worker   3d3h   v1.20.0+3b90e69
compute-4         Ready                      worker   3d3h   v1.20.0+9b492ff
compute-5         Ready                      worker   3d3h   v1.20.0+9b492ff
control-plane-0   Ready                      master   3d3h   v1.20.0+9b492ff
control-plane-1   Ready                      master   3d3h   v1.20.0+9b492ff
control-plane-2   Ready                      master   3d3h   v1.20.0+9b492ff

>> $ oc get nodes --show-labels | grep ocs
compute-0   Ready                      worker   3d3h   v1.20.0+9b492ff   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
compute-1   Ready                      worker   3d3h   v1.20.0+3b90e69   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
compute-2   Ready,SchedulingDisabled   worker   3d3h   v1.20.0+3b90e69   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

>> # oc describe pod rook-ceph-osd-0-7f6c4f5b4-gdp9t
    State:          Waiting
      Reason:       PodInitializing
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   126m (x181 over 17h)   kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Warning  BackOff  93s (x4695 over 17h)   kubelet  Back-off restarting failed container
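The describe output above does not name the init container that is crash looping. A minimal sketch for pinpointing it and pulling its previous logs (the container name placeholder is hypothetical; field names follow the standard pod status schema):

$ oc -n openshift-storage get pod rook-ceph-osd-0-7f6c4f5b4-gdp9t \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
$ oc -n openshift-storage logs rook-ceph-osd-0-7f6c4f5b4-gdp9t -c <failing-init-container> --previous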
We are missing some logs in Index of /OCS/ocs-qe-bugs/bz-1925055-ocp-upgrade-issue/ocs-must-gather/must-gather.local.4463789386658630260/quay-io-rhceph-dev-ocs-must-gather-sha256-5645b7f307f99df13e43efe2fd2adc78b747d9b383bac517b3a63b81de314fe6/namespaces/openshift-storage/pods/rook-ceph-osd-0-7f6c4f5b4-gdp9t

Why are the logs for some init containers missing, like "blkdevmapper"? All of the init containers are listed in the pod YAML here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1925055-ocp-upgrade-issue/ocs-must-gather/must-gather.local.4463789386658630260/quay-io-rhceph-dev-ocs-must-gather-sha256-5645b7f307f99df13e43efe2fd2adc78b747d9b383bac517b3a63b81de314fe6/namespaces/openshift-storage/pods/rook-ceph-osd-0-7f6c4f5b4-gdp9t/rook-ceph-osd-0-7f6c4f5b4-gdp9t.yaml
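In case it helps triage, a rough sketch of how to compare the init containers defined in the pod spec with what must-gather actually collected (the directory layout shown assumes the usual must-gather structure and may differ):

$ oc -n openshift-storage get pod rook-ceph-osd-0-7f6c4f5b4-gdp9t -o jsonpath='{.spec.initContainers[*].name}{"\n"}'
$ ls namespaces/openshift-storage/pods/rook-ceph-osd-0-7f6c4f5b4-gdp9t/     # run from the must-gather root; one directory per collected container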
*** Bug 1925062 has been marked as a duplicate of this bug. ***
Found the bug, patch in progress.
Merged downstream: https://github.com/openshift/rook/pull/167
This was built into https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/OCS%20Build%20Pipeline%204.7/138/ . Not sure why the BZ was not moved to ON_QA. Doing it manually.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
This will be covered in tests/ecosystem/upgrade/test_upgrade_ocp.py.