Description of problem:
After shutting down 2 worker nodes on the MS provider cluster, 2 new worker nodes came up as expected, but two mon pods were stuck in a Pending state and some other pods were stuck in a CrashLoopBackOff state.

Version-Release number of selected component (if applicable):
ROSA cluster, OCP 4.10, ODF 4.10.

How reproducible:

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, in the case of two node failures the cluster does not recover.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
I am not sure whether we have tested this scenario in the past; as far as I know, we have not.

Steps to Reproduce:
1. Shut down two worker nodes from the AWS platform side (see the command sketch below).

Actual results:
Ceph health is not OK, 2 mon pods are in a Pending state, and some other pods are stuck in a CrashLoopBackOff state.

Expected results:
Ceph health should be OK and all pods should be running.

Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/15103/
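For reference, the node shutdown in the Steps to Reproduce was performed from the AWS side. A rough sketch of equivalent AWS CLI commands (the tag filter and instance IDs below are placeholders, not the ones used in this run):

$ # list worker node instances and pick the two to stop (example filter)
$ aws ec2 describe-instances --filters "Name=tag:Name,Values=*worker*" --query "Reservations[].Instances[].[InstanceId,PrivateDnsName]" --output text
$ # stop the two chosen worker instances
$ aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb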
Additional info:

$ oc get nodes
NAME                           STATUS   ROLES          AGE     VERSION
ip-10-0-135-143.ec2.internal   Ready    worker         104m    v1.23.5+012e945
ip-10-0-137-202.ec2.internal   Ready    master         6h59m   v1.23.5+012e945
ip-10-0-139-102.ec2.internal   Ready    infra,worker   6h36m   v1.23.5+012e945
ip-10-0-147-116.ec2.internal   Ready    infra,worker   6h37m   v1.23.5+012e945
ip-10-0-154-186.ec2.internal   Ready    master         7h      v1.23.5+012e945
ip-10-0-158-49.ec2.internal    Ready    worker         46m     v1.23.5+012e945
ip-10-0-163-159.ec2.internal   Ready    master         7h      v1.23.5+012e945
ip-10-0-172-207.ec2.internal   Ready    worker         46m     v1.23.5+012e945
ip-10-0-174-144.ec2.internal   Ready    infra,worker   6h37m   v1.23.5+012e945

$ oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
addon-ocs-provider-qe-catalog-nfqm5   1/1   Running   0   49m
alertmanager-managed-ocs-alertmanager-0   2/2   Running   0   49m
alertmanager-managed-ocs-alertmanager-1   2/2   Running   0   49m
alertmanager-managed-ocs-alertmanager-2   2/2   Running   0   49m
csi-addons-controller-manager-b4495976c-l9xxz   2/2   Running   0   53m
ocs-metrics-exporter-97cdff48f-zdsq4   1/1   Running   0   53m
ocs-operator-5bf7c58cc9-gghmj   1/1   Running   0   53m
ocs-osd-controller-manager-67658f4d75-hj6p2   2/3   Running   0   53m
ocs-provider-server-67fd6b6885-kx95k   1/1   Running   0   53m
odf-console-5f4494795-mdpmr   1/1   Running   0   53m
odf-operator-controller-manager-7ff6cc9d4-8w662   2/2   Running   0   53m
prometheus-managed-ocs-prometheus-0   3/3   Running   0   49m
prometheus-operator-8547cc9f89-xp6wm   1/1   Running   0   53m
rook-ceph-crashcollector-ip-10-0-135-143.ec2.internal-7b88cc69v   1/1   Running   0   104m
rook-ceph-crashcollector-ip-10-0-158-49.ec2.internal-66cb6hwkgz   1/1   Running   0   47m
rook-ceph-crashcollector-ip-10-0-172-207.ec2.internal-7577s66zw   1/1   Running   0   46m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5975758bqx7lf   1/2   CrashLoopBackOff   19 (2m14s ago)   58m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f9fd6d9bdr6zb   1/2   Running   16 (5m38s ago)   58m
rook-ceph-mgr-a-6dc6b5bf94-lxjct   1/2   CrashLoopBackOff   19 (3m15s ago)   58m
rook-ceph-mon-a-8d9b6979b-d6wq2   0/2   Pending   0   58m
rook-ceph-mon-e-bbbc799b6-5dswt   0/2   Pending   0   58m
rook-ceph-mon-f-6c8c6c979-bm5pb   2/2   Running   0   89m
rook-ceph-operator-848fbd9dd7-wf9ph   1/1   Running   0   53m
rook-ceph-osd-0-b47bcf64-nvcd7   1/2   Running   14 (5m48s ago)   58m
rook-ceph-osd-1-6cdb75979c-kj5gt   1/2   Running   14 (6m28s ago)   58m
rook-ceph-osd-10-84b6676bbb-tffdz   2/2   Running   0   77m
rook-ceph-osd-11-6ff74fc9f4-xk2rr   2/2   Running   0   77m
rook-ceph-osd-12-77f96b4dfd-564tm   2/2   Running   0   77m
rook-ceph-osd-13-bd5dbc5f-sp8bx   2/2   Running   0   77m
rook-ceph-osd-14-78c457f467-wf546   2/2   Running   0   77m
rook-ceph-osd-2-689458fc4c-ntvxj   1/2   Running   14 (6m18s ago)   52m
rook-ceph-osd-3-b9657b758-5f2nc   1/2   Running   14 (6m18s ago)   58m
rook-ceph-osd-4-8499df47d7-sbr66   1/2   Running   14 (6m28s ago)   58m
rook-ceph-osd-5-7bf556b477-z2t9z   1/2   Running   14 (6m48s ago)   58m
rook-ceph-osd-6-676dcbc4f8-tflxl   1/2   Running   14 (5m48s ago)   58m
rook-ceph-osd-7-7f5fdd757d-rxpvn   1/2   Running   14 (5m48s ago)   58m
rook-ceph-osd-8-5754cc984b-wkrbn   1/2   Running   14 (5m48s ago)   58m
rook-ceph-osd-9-fcbf77c67-hwskw   1/2   Running   14 (5m48s ago)   58m
rook-ceph-tools-74fb4f5d9c-6pfvv   1/1   Running   0   53m
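The two Pending mon pods and the reason for the repeated restarts can be inspected with something like the following (pod names are taken from the listing above; output is not included here):

$ oc -n openshift-storage describe pod rook-ceph-mon-a-8d9b6979b-d6wq2
$ oc -n openshift-storage describe pod rook-ceph-mon-e-bbbc799b6-5dswt
$ # check overall Ceph health from the toolbox pod
$ oc -n openshift-storage rsh rook-ceph-tools-74fb4f5d9c-6pfvv ceph status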
*** Bug 2072612 has been marked as a duplicate of this bug. ***
This will be verified by the rolling shutdown test, since shutting down 2 nodes at the same time is not a supported case.
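For reference, a controlled rolling shutdown drains and shuts down one worker node at a time and waits for Ceph to become healthy again before moving on to the next node. A rough outline of the manual equivalent (node and toolbox pod names are placeholders; the test test_rolling_shutdown_and_recovery_in_controlled_fashion automates this and may differ in details):

$ oc adm cordon <node>
$ oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
$ # shut down the instance from the AWS side and wait for the node (or its replacement) to rejoin
$ oc adm uncordon <node>
$ # confirm Ceph health before repeating on the next node
$ oc -n openshift-storage rsh <rook-ceph-tools-pod> ceph health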
I ran the test "test_rolling_shutdown_and_recovery_in_controlled_fashion": https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console, and it passed successfully. So, I am moving the bug to Verified.

Provider cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.5   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV version:
NAME   DISPLAY   VERSION   REPLACES   PHASE
mcg-operator.v4.10.9   NooBaa Operator   4.10.9   mcg-operator.v4.10.8   Succeeded
observability-operator.v0.0.20   Observability Operator   0.0.20   observability-operator.v0.0.19   Succeeded
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.5   Succeeded
ocs-osd-deployer.v2.0.11   OCS OSD Deployer   2.0.11-11   ocs-osd-deployer.v2.0.10   Succeeded
odf-csi-addons-operator.v4.10.9   CSI Addons   4.10.9   odf-csi-addons-operator.v4.10.5   Succeeded
odf-operator.v4.10.9   OpenShift Data Foundation   4.10.9   odf-operator.v4.10.5   Succeeded
ose-prometheus-operator.4.10.0   Prometheus Operator   4.10.0   ose-prometheus-operator.4.8.0   Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator   0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded
Closing this bug as fixed in v2.0.11 and tested by QE.