Description of problem:

If all worker nodes except the node hosting the rook-ceph-mgr pod stay down for an extended period, the provider cluster does not recover after the nodes come back up.

Version-Release number of selected component (if applicable):

ocs-operator.v4.10.0
ocs-osd-deployer.v2.0.0
odf-operator.v4.10.0
ose-prometheus-operator.4.8.0
OCP 4.10.6

How reproducible:

2/2

Steps to Reproduce:
1. Stop the EC2 instances of all worker nodes except the node with the mgr pod.
2. Wait 5 minutes.
3. Start the EC2 instances of all worker nodes again.
4. Check Ceph health (a scripted outline of these steps is sketched after Additional info below).

Actual results:

Ceph health after these operations is:

HEALTH_WARN Slow OSD heartbeats on back (longest 33089.231ms); Slow OSD heartbeats on front (longest 33089.230ms); Reduced data availability: 124 pgs inactive, 118 pgs peering; 47 slow ops, oldest one blocked for 402 sec, daemons [osd.0,osd.1,osd.2] have slow ops.

After a few minutes this changed to:

HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 23/69 objects degraded (33.333%), 14 pgs degraded, 193 pgs undersized

All OSD pods are up.

Expected results:

Ceph should return to a healthy state and the cluster should survive the outage.

Additional info:

$ oc rsh -n openshift-storage rook-ceph-tools-65bcddc589-fxjww ceph health
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 23/69 objects degraded (33.333%), 14 pgs degraded, 193 pgs undersized

$ oc get nodes -o wide
NAME                                         STATUS   ROLES          AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-133-188.us-east-2.compute.internal   Ready    infra,worker   4h16m   v1.23.5+b0357ed   10.0.133.188   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-144-59.us-east-2.compute.internal    Ready    infra,worker   4h17m   v1.23.5+b0357ed   10.0.144.59    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-182-252.us-east-2.compute.internal   Ready    worker         4h30m   v1.23.5+b0357ed   10.0.182.252   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-191-76.us-east-2.compute.internal    Ready    master         4h34m   v1.23.5+b0357ed   10.0.191.76    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-197-230.us-east-2.compute.internal   Ready    master         4h34m   v1.23.5+b0357ed   10.0.197.230   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-215-121.us-east-2.compute.internal   Ready    worker         4h30m   v1.23.5+b0357ed   10.0.215.121   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-224-74.us-east-2.compute.internal    Ready    worker         4h30m   v1.23.5+b0357ed   10.0.224.74    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-229-231.us-east-2.compute.internal   Ready    master         4h34m   v1.23.5+b0357ed   10.0.229.231   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
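Scripted sketch of the reproduction steps (not a verified procedure, just a minimal outline assuming AWS CLI access to the cluster's EC2 instances). MGR_NODE and WORKER_INSTANCE_IDS are hypothetical placeholders, and the rook-ceph-tools pod name is the one from this report; substitute values from your own provider cluster:

# 1. Find the node hosting the rook-ceph-mgr pod so it can be excluded.
MGR_NODE=$(oc get pods -n openshift-storage -l app=rook-ceph-mgr \
  -o jsonpath='{.items[0].spec.nodeName}')

# 2. Stop the EC2 instances of all other worker nodes.
#    WORKER_INSTANCE_IDS is an assumed, manually collected list mapping
#    every worker node except $MGR_NODE to its EC2 instance ID.
aws ec2 stop-instances --instance-ids $WORKER_INSTANCE_IDS

# 3. Wait 5 minutes.
sleep 300

# 4. Start the instances again and wait for the nodes to report Ready.
aws ec2 start-instances --instance-ids $WORKER_INSTANCE_IDS
oc wait node --all --for=condition=Ready --timeout=15m

# 5. Check Ceph health from the toolbox pod (pod name taken from this report).
oc rsh -n openshift-storage rook-ceph-tools-65bcddc589-fxjww ceph health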
*** This bug has been marked as a duplicate of bug 2112021 ***