Bug 2112021 - After shutting down 2 worker nodes on the MS provider cluster 2 mons are down and ceph health is not recovered
Summary: After shutting down 2 worker nodes on the MS provider cluster 2 mons are down...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact: Itzhak
URL:
Whiteboard:
Duplicates: 2072612
Depends On: 2133683
Blocks:
 
Reported: 2022-07-28 16:06 UTC by Itzhak
Modified: 2023-08-09 17:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2133683
Environment:
Last Closed: 2023-03-14 15:37:27 UTC
Embargoed:


Attachments: None


Links
System ID: Red Hat Bugzilla 2133683 | Private: 0 | Priority: unspecified | Status: CLOSED | Summary: After shutting down 2 worker nodes on the MS provider cluster 2 mons are down and ceph health is not recovered | Last Updated: 2023-08-09 17:03:01 UTC

Description Itzhak 2022-07-28 16:06:34 UTC
Description of problem:

After shutting down 2 worker nodes on the MS provider cluster, 2 new worker nodes came up as expected, but two mon pods were stuck in a Pending state and several other pods were stuck in a CrashLoopBackOff state.

Version-Release number of selected component (if applicable):
ROSA cluster with OCP 4.10 and ODF 4.10.

How reproducible:

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. In the case of two node failures, the cluster will not recover.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
I am not sure whether we have tested this scenario in the past; to my knowledge, we have not.

Steps to Reproduce:
Shut down two worker nodes from the AWS platform side (for example, via the AWS CLI as sketched below).
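For reference, a minimal sketch of stopping the two workers from the AWS side with the AWS CLI; the instance IDs below are placeholders and have to be looked up for the actual worker nodes:

# Find the EC2 instance ID behind a worker node (the node's providerID ends with it)
$ oc get node ip-10-0-135-143.ec2.internal -o jsonpath='{.spec.providerID}{"\n"}'

# Stop two worker instances at once (placeholder instance IDs)
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210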

Actual results:
Ceph health is not OK: 2 mon pods are stuck in a Pending state and several other pods are in a CrashLoopBackOff state.
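One way to confirm the reported state is from the rook-ceph-tools pod (shown Running in the pod listing in comment 1); a rough sketch:

$ oc rsh -n openshift-storage deploy/rook-ceph-tools
# inside the toolbox pod: overall health, detailed warnings, and mon quorum
$ ceph status
$ ceph health detail
$ ceph mon stat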

Expected results:
Ceph health should be OK, and all the pods should be running.

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/15103/

Comment 1 Itzhak 2022-07-28 16:09:53 UTC
Additional info:

$ oc get nodes
NAME                           STATUS   ROLES          AGE     VERSION
ip-10-0-135-143.ec2.internal   Ready    worker         104m    v1.23.5+012e945
ip-10-0-137-202.ec2.internal   Ready    master         6h59m   v1.23.5+012e945
ip-10-0-139-102.ec2.internal   Ready    infra,worker   6h36m   v1.23.5+012e945
ip-10-0-147-116.ec2.internal   Ready    infra,worker   6h37m   v1.23.5+012e945
ip-10-0-154-186.ec2.internal   Ready    master         7h      v1.23.5+012e945
ip-10-0-158-49.ec2.internal    Ready    worker         46m     v1.23.5+012e945
ip-10-0-163-159.ec2.internal   Ready    master         7h      v1.23.5+012e945
ip-10-0-172-207.ec2.internal   Ready    worker         46m     v1.23.5+012e945
ip-10-0-174-144.ec2.internal   Ready    infra,worker   6h37m   v1.23.5+012e945

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS             RESTARTS         AGE
addon-ocs-provider-qe-catalog-nfqm5                               1/1     Running            0                49m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running            0                49m
alertmanager-managed-ocs-alertmanager-1                           2/2     Running            0                49m
alertmanager-managed-ocs-alertmanager-2                           2/2     Running            0                49m
csi-addons-controller-manager-b4495976c-l9xxz                     2/2     Running            0                53m
ocs-metrics-exporter-97cdff48f-zdsq4                              1/1     Running            0                53m
ocs-operator-5bf7c58cc9-gghmj                                     1/1     Running            0                53m
ocs-osd-controller-manager-67658f4d75-hj6p2                       2/3     Running            0                53m
ocs-provider-server-67fd6b6885-kx95k                              1/1     Running            0                53m
odf-console-5f4494795-mdpmr                                       1/1     Running            0                53m
odf-operator-controller-manager-7ff6cc9d4-8w662                   2/2     Running            0                53m
prometheus-managed-ocs-prometheus-0                               3/3     Running            0                49m
prometheus-operator-8547cc9f89-xp6wm                              1/1     Running            0                53m
rook-ceph-crashcollector-ip-10-0-135-143.ec2.internal-7b88cc69v   1/1     Running            0                104m
rook-ceph-crashcollector-ip-10-0-158-49.ec2.internal-66cb6hwkgz   1/1     Running            0                47m
rook-ceph-crashcollector-ip-10-0-172-207.ec2.internal-7577s66zw   1/1     Running            0                46m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5975758bqx7lf   1/2     CrashLoopBackOff   19 (2m14s ago)   58m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f9fd6d9bdr6zb   1/2     Running            16 (5m38s ago)   58m
rook-ceph-mgr-a-6dc6b5bf94-lxjct                                  1/2     CrashLoopBackOff   19 (3m15s ago)   58m
rook-ceph-mon-a-8d9b6979b-d6wq2                                   0/2     Pending            0                58m
rook-ceph-mon-e-bbbc799b6-5dswt                                   0/2     Pending            0                58m
rook-ceph-mon-f-6c8c6c979-bm5pb                                   2/2     Running            0                89m
rook-ceph-operator-848fbd9dd7-wf9ph                               1/1     Running            0                53m
rook-ceph-osd-0-b47bcf64-nvcd7                                    1/2     Running            14 (5m48s ago)   58m
rook-ceph-osd-1-6cdb75979c-kj5gt                                  1/2     Running            14 (6m28s ago)   58m
rook-ceph-osd-10-84b6676bbb-tffdz                                 2/2     Running            0                77m
rook-ceph-osd-11-6ff74fc9f4-xk2rr                                 2/2     Running            0                77m
rook-ceph-osd-12-77f96b4dfd-564tm                                 2/2     Running            0                77m
rook-ceph-osd-13-bd5dbc5f-sp8bx                                   2/2     Running            0                77m
rook-ceph-osd-14-78c457f467-wf546                                 2/2     Running            0                77m
rook-ceph-osd-2-689458fc4c-ntvxj                                  1/2     Running            14 (6m18s ago)   52m
rook-ceph-osd-3-b9657b758-5f2nc                                   1/2     Running            14 (6m18s ago)   58m
rook-ceph-osd-4-8499df47d7-sbr66                                  1/2     Running            14 (6m28s ago)   58m
rook-ceph-osd-5-7bf556b477-z2t9z                                  1/2     Running            14 (6m48s ago)   58m
rook-ceph-osd-6-676dcbc4f8-tflxl                                  1/2     Running            14 (5m48s ago)   58m
rook-ceph-osd-7-7f5fdd757d-rxpvn                                  1/2     Running            14 (5m48s ago)   58m
rook-ceph-osd-8-5754cc984b-wkrbn                                  1/2     Running            14 (5m48s ago)   58m
rook-ceph-osd-9-fcbf77c67-hwskw                                   1/2     Running            14 (5m48s ago)   58m
rook-ceph-tools-74fb4f5d9c-6pfvv                                  1/1     Running            0                53m
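A quick way to see why the two mon pods stay Pending is to look at their scheduling events (typically a FailedScheduling event names the unsatisfied constraint); for example, using the mon-a pod name from the listing above:

$ oc describe pod rook-ceph-mon-a-8d9b6979b-d6wq2 -n openshift-storage | grep -A 15 Events
$ oc get events -n openshift-storage --field-selector involvedObject.name=rook-ceph-mon-a-8d9b6979b-d6wq2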

Comment 6 Filip Balák 2022-10-31 11:48:05 UTC
*** Bug 2072612 has been marked as a duplicate of this bug. ***

Comment 19 Filip Balák 2023-02-06 10:00:51 UTC
This will be verified by a rolling shutdown test, as shutting down 2 nodes at the same time is not a supported case.
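For reference, a rough outline of a controlled, one-node-at-a-time shutdown of the kind the rolling test exercises; the node name, instance ID, and wait condition are illustrative placeholders:

# Drain and shut down one worker, wait for Ceph to recover, then repeat for the next node
$ oc adm cordon ip-10-0-135-143.ec2.internal
$ oc adm drain ip-10-0-135-143.ec2.internal --ignore-daemonsets --delete-emptydir-data --force
$ aws ec2 stop-instances --instance-ids <instance-id>
# ...wait until "ceph status" reports HEALTH_OK and all openshift-storage pods are Running...
$ aws ec2 start-instances --instance-ids <instance-id>
$ oc adm uncordon ip-10-0-135-143.ec2.internal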

Comment 20 Itzhak 2023-02-14 13:32:50 UTC
I ran the test "test_rolling_shutdown_and_recovery_in_controlled_fashion": https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console, and it passed successfully.
So, I am moving the bug to Verified.

Provider cluster versions:


OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.5                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV version:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.5                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.5           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

Comment 21 Ritesh Chikatwar 2023-03-14 15:37:27 UTC
Closing this bug as fixed in ocs-osd-deployer v2.0.11 and verified by QE.

