Bug 1925055 - OSD pod stuck in Init:CrashLoopBackOff following Node maintenance in OCP upgrade from OCP 4.7 to 4.7 nightly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Sébastien Han
QA Contact: Neha Berry
URL:
Whiteboard:
Duplicates: 1925062
Depends On:
Blocks: 1788492 1878638
 
Reported: 2021-02-04 10:18 UTC by Neha Berry
Modified: 2023-12-01 10:18 UTC
CC List: 11 users

Fixed In Version: 4.7.0-722.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:18:58 UTC
Embargoed:


Attachments
describe of osd 0 pod (12.92 KB, text/plain), 2021-02-04 10:18 UTC, Neha Berry


Links
Github rook/rook pull 7172 (closed): ceph: always override existing block for blkdevmapper - Last Updated 2021-02-15 05:45:21 UTC
Red Hat Product Errata RHSA-2021:2041 - Last Updated 2021-05-19 09:19:51 UTC

Description Neha Berry 2021-02-04 10:18:55 UTC
Created attachment 1755017 [details]
describe of osd 0 pod

Description of problem (please be as detailed as possible and provide log
snippets):
======================================================================
On an OCS 4.7.0-250.ci + OCP 4.7 (4.7.0-0.nightly-2021-01-31-031653) cluster, an OCP upgrade was initiated to build 4.7.0-0.nightly-2021-02-03-113456.

During the MCO upgrade, when compute-0 was drained for maintenance and brought back in, the OSD pod running on this node got stuck in Init:CrashLoopBackOff, and hence the OCP upgrade has not yet succeeded (more than 16 hrs).
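
For reference, a quick way to see which init container is failing and why (pod name and namespace taken from this report):

$ oc -n openshift-storage get pod rook-ceph-osd-0-7f6c4f5b4-gdp9t -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
$ oc -n openshift-storage describe pod rook-ceph-osd-0-7f6c4f5b4-gdp9t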


POD status
===================

rook-ceph-mon-b-674c49c7d-zwbps                                   2/2     Running                 1          21h     10.128.2.32    compute-1   <none>           <none>

rook-ceph-mon-c-68698666bb-6ffft                                  2/2     Running                 0          16h     10.130.2.9     compute-0   <none>           <none>

rook-ceph-mon-d-canary-5bb487445f-d4s8k                           0/2     Pending                 0          83s     <none>         <none>      <none>           <none>  --> unable to recover as compute-2 is still cordoned

rook-ceph-operator-7f8d4bfdb6-2566r                               1/1     Running                 0          21h     10.129.3.7     compute-3   <none>           <none>
rook-ceph-osd-0-7f6c4f5b4-gdp9t                                   0/2     Init:CrashLoopBackOff   201        16h     10.130.2.8     compute-0   <none>           <none>  --> OSD failed to recover after drain of compute-0

rook-ceph-osd-1-6f7688db8b-dt5mg                                  2/2     Running                 0          21h     10.128.5.24    compute-2   <none>           <none> --> this OSD was not drained due to blocking PDBs

rook-ceph-osd-2-56ff8576c-52qtw                                   2/2     Running                 0          21h     10.128.2.36    compute-1   <none>           <none>

$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     21h
rook-ceph-mon-pdb                                 2               N/A               0                     21h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     16h --> blocking PDBs for 16h
rook-ceph-osd-host-compute-2                      N/A             0                 0                     16h
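
(For reference, the blocking PDB can be inspected directly; ALLOWED DISRUPTIONS of 0 is what prevents the drain from evicting the OSD pod. A possible check, using the PDB names shown above:)

$ oc -n openshift-storage describe pdb rook-ceph-osd-host-compute-2
$ oc -n openshift-storage get pdb rook-ceph-osd-host-compute-2 -o yaml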



>> Flow of events:
1. Initiated OCP upgrade @ Wed Feb  3 16:36:46 UTC 2021; MCO (machine-config) upgrade started @ Wed Feb  3 17:12:02 UTC 2021.

2. During the machine-config upgrade, the first OCS node to be drained was compute-0. mon-c and osd-0 running on it were drained, and the node recovered within 2 minutes.

3. rook-ceph-mon-c-68698666bb-6ffft came up fine on compute-0, but osd-0 is still stuck in the Init:CrashLoopBackOff state, blocking the drain of all other OCS nodes.

4. Next in line, OCS node compute-2 was cordoned, but pod rook-ceph-osd-1-6f7688db8b-dt5mg cannot be drained due to blocking PDBs (as OSD-0 is still DOWN and PGs are not clean).
mon-a was drained from compute-2 and has been stuck in a Pending state since then, as compute-2 is still cordoned and waiting for a successful drain.

5. Overall, the OCP upgrade is affected because of the OSD pod. Current state of the upgrade:

>> $ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-03-113456   True        False         16h     Error while reconciling 4.7.0-0.nightly-2021-02-03-113456: the cluster operator machine-config is degraded
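
For reference, the degraded machine-config operator can be inspected with standard commands like these (output not captured here):

$ oc get co machine-config
$ oc describe co machine-config
$ oc get mcp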


Final Observation
=======================

1. mon-a/mon-d is down since it was drained and the node (compute-2) did not recover, so the MON also did not recover (expected in such situations).
2. The osd-0 pod never came back to the Running state, and rook-ceph-osd-1-6f7688db8b-dt5mg on compute-2 was never drained due to blocking PDBs (as PGs were unclean since OSD-0 never recovered on compute-0); hence the OCP upgrade failed. A possible way to confirm the Ceph state is sketched below.
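
A possible way to confirm the unclean PG / down OSD state from the rook-ceph toolbox (assuming the toolbox pod is deployed in openshift-storage):

$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage rsh $TOOLS_POD ceph status
$ oc -n openshift-storage rsh $TOOLS_POD ceph osd tree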





Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
=========================
Platform: VMware
N+1 scaling was enabled even for dynamic mode, as the UI fix was not yet in.
Number of OCS nodes = 3 (compute-0, 1, 2)
Extra worker nodes = 3 (compute-3, 4, 5)


1. Initiate OCP upgrade, e.g.
 date --utc; time oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-03-113456 --force --allow-explicit-upgrade; date --utc


2. Keep checking the progress of the OCP upgrade, especially the machine-config upgrade, as the compute nodes are drained one after the other (see the commands sketched after this list).

3. For some reason, the first OSD pod to be drained failed to come back to the Running state, blocking the drain of the remaining OCS nodes.
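
For step 2, the progress can be watched with commands along these lines (exact resources to watch may vary):

$ oc get clusterversion
$ oc get mcp
$ oc get nodes -w
$ oc -n openshift-storage get pods -o wide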


Actual results:
==================
The OCP upgrade failed, and one OSD pod has been in the Init:CrashLoopBackOff state since it tried to come back up on compute-0 after the drain.


Expected results:
====================
The OSD pod should have recovered successfully and the OCP upgrade should have completed without any errors.


Additional info:
=======================
Timestamps:

>> 1. OCP upgrade phase when compute-0 was drained = Wed Feb  3 17:12:10 UTC 2021

Wed Feb  3 17:12:10 UTC 2021
===oc get nodes==
NAME              STATUS                     ROLES    AGE     VERSION
compute-0         Ready,SchedulingDisabled   worker   2d11h   v1.20.0+9b492ff


>> 2. Timestamp when the node came back to Ready state = Wed Feb  3 17:14:33 UTC 2021


Wed Feb  3 17:14:33 UTC 2021
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-31-031653   True        True          37m     Working towards 4.7.0-0.nightly-2021-02-03-113456: 522 of 668 done (78% complete)
===oc get nodes==
NAME              STATUS                        ROLES    AGE     VERSION
compute-0         Ready                         worker   2d11h   v1.20.0+9b492ff

>> 3. The next OCS node to be drained was compute-2, which is still stuck in the "SchedulingDisabled" state.

 Wed Feb  3 17:22:52 UTC 2021
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-31-031653   True        True          46m     Working towards 4.7.0-0.nightly-2021-02-03-113456: 175 of 668 done (26% complete)
===oc get nodes==
NAME              STATUS                        ROLES    AGE     VERSION
compute-0         Ready                         worker   2d11h   v1.20.0+9b492ff
compute-1         Ready                         worker   2d11h   v1.20.0+3b90e69
compute-2         Ready,SchedulingDisabled      worker   2d11h   v1.20.0+3b90e69



>> 4. Current status of nodes

$ oc get nodes 
NAME              STATUS                     ROLES    AGE    VERSION
compute-0         Ready                      worker   3d3h   v1.20.0+9b492ff
compute-1         Ready                      worker   3d3h   v1.20.0+3b90e69
compute-2         Ready,SchedulingDisabled   worker   3d3h   v1.20.0+3b90e69
compute-3         Ready                      worker   3d3h   v1.20.0+3b90e69
compute-4         Ready                      worker   3d3h   v1.20.0+9b492ff
compute-5         Ready                      worker   3d3h   v1.20.0+9b492ff
control-plane-0   Ready                      master   3d3h   v1.20.0+9b492ff
control-plane-1   Ready                      master   3d3h   v1.20.0+9b492ff
control-plane-2   Ready                      master   3d3h   v1.20.0+9b492ff


>> $ oc get nodes --show-labels|grep ocs
compute-0         Ready                      worker   3d3h   v1.20.0+9b492ff   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

compute-1         Ready                      worker   3d3h   v1.20.0+3b90e69   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

compute-2         Ready,SchedulingDisabled   worker   3d3h   v1.20.0+3b90e69   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos



>> $ oc describe pod rook-ceph-osd-0-7f6c4f5b4-gdp9t
   State:          Waiting
      Reason:       PodInitializing

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulled   126m (x181 over 17h)  kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Warning  BackOff  93s (x4695 over 17h)  kubelet  Back-off restarting failed container
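
For completeness, a sketch of how the failing init container's logs could be pulled; the container name "blkdevmapper" is an assumption here and should be confirmed from the pod spec:

$ oc -n openshift-storage logs rook-ceph-osd-0-7f6c4f5b4-gdp9t -c blkdevmapper
$ oc -n openshift-storage logs rook-ceph-osd-0-7f6c4f5b4-gdp9t -c blkdevmapper --previous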

Comment 6 Sébastien Han 2021-02-04 16:58:38 UTC
We are missing some logs in /OCS/ocs-qe-bugs/bz-1925055-ocp-upgrade-issue/ocs-must-gather/must-gather.local.4463789386658630260/quay-io-rhceph-dev-ocs-must-gather-sha256-5645b7f307f99df13e43efe2fd2adc78b747d9b383bac517b3a63b81de314fe6/namespaces/openshift-storage/pods/rook-ceph-osd-0-7f6c4f5b4-gdp9t

Why are some init containers missing? Like the "blkdevmapper", see all of them here http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1925055-ocp-upgrade-issue/ocs-must-gather/must-gather.local.4463789386658630260/quay-io-rhceph-dev-ocs-must-gather-sha256-5645b7f307f99df13e43efe2fd2adc78b747d9b383bac517b3a63b81de314fe6/namespaces/openshift-storage/pods/rook-ceph-osd-0-7f6c4f5b4-gdp9t/rook-ceph-osd-0-7f6c4f5b4-gdp9t.yaml
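
For reference, one way to list the init containers actually defined in the pod spec (pod name from this report), to compare against what the must-gather collected:

$ oc -n openshift-storage get pod rook-ceph-osd-0-7f6c4f5b4-gdp9t -o jsonpath='{.spec.initContainers[*].name}'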

Comment 7 Sébastien Han 2021-02-04 17:04:02 UTC
*** Bug 1925062 has been marked as a duplicate of this bug. ***

Comment 11 Sébastien Han 2021-02-05 11:48:04 UTC
Found the bug, patch in progress.

Comment 15 Travis Nielsen 2021-02-09 18:05:42 UTC
Merged downstream: https://github.com/openshift/rook/pull/167

Comment 16 Michael Adam 2021-02-10 08:27:36 UTC
This was built into https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/OCS%20Build%20Pipeline%204.7/138/ .

Not sure why the BZ was not moved to ON_QA.
Doing it manually.

Comment 23 errata-xmlrpc 2021-05-19 09:18:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 24 Jilju Joy 2021-09-21 06:19:22 UTC
This will be covered in tests/ecosystem/upgrade/test_upgrade_ocp.py.

