Description of problem:

If all worker nodes except the node hosting the rook-ceph-mgr pod stay down for an extended period, the provider cluster does not recover after the nodes come back up.

Version-Release number of selected component (if applicable):

ocs-operator.v4.10.0
ocs-osd-deployer.v2.0.0
odf-operator.v4.10.0
ose-prometheus-operator.4.8.0
OCP 4.10.6

How reproducible:

2/2

Steps to Reproduce:
1. Stop the EC2 instances of all worker nodes except the node with the mgr pod.
2. Wait 5 minutes.
3. Start the EC2 instances of all worker nodes again.
4. Check Ceph health (a scripted outline of these steps is sketched after Additional info below).

Actual results:

Ceph health after these operations is:

HEALTH_WARN Slow OSD heartbeats on back (longest 33089.231ms); Slow OSD heartbeats on front (longest 33089.230ms); Reduced data availability: 124 pgs inactive, 118 pgs peering; 47 slow ops, oldest one blocked for 402 sec, daemons [osd.0,osd.1,osd.2] have slow ops.

After a few minutes this changed to:

HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 23/69 objects degraded (33.333%), 14 pgs degraded, 193 pgs undersized

All OSD pods are up.

Expected results:

Ceph should return to a healthy state and the cluster should survive the outage.

Additional info:

$ oc rsh -n openshift-storage rook-ceph-tools-65bcddc589-fxjww ceph health
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 23/69 objects degraded (33.333%), 14 pgs degraded, 193 pgs undersized

$ oc get nodes -o wide
NAME                                         STATUS   ROLES          AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-133-188.us-east-2.compute.internal   Ready    infra,worker   4h16m   v1.23.5+b0357ed   10.0.133.188   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-144-59.us-east-2.compute.internal    Ready    infra,worker   4h17m   v1.23.5+b0357ed   10.0.144.59    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-182-252.us-east-2.compute.internal   Ready    worker         4h30m   v1.23.5+b0357ed   10.0.182.252   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-191-76.us-east-2.compute.internal    Ready    master         4h34m   v1.23.5+b0357ed   10.0.191.76    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-197-230.us-east-2.compute.internal   Ready    master         4h34m   v1.23.5+b0357ed   10.0.197.230   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-215-121.us-east-2.compute.internal   Ready    worker         4h30m   v1.23.5+b0357ed   10.0.215.121   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-224-74.us-east-2.compute.internal    Ready    worker         4h30m   v1.23.5+b0357ed   10.0.224.74    <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
ip-10-0-229-231.us-east-2.compute.internal   Ready    master         4h34m   v1.23.5+b0357ed   10.0.229.231   <none>        Red Hat Enterprise Linux CoreOS 410.84.202203221702-0 (Ootpa)   4.18.0-305.40.2.el8_4.x86_64   cri-o://1.23.2-2.rhaos4.10.git071ae78.el8
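Scripted sketch of the reproduction steps (not a verified procedure, just a minimal outline assuming AWS CLI access to the cluster's EC2 instances). MGR_NODE and WORKER_INSTANCE_IDS are hypothetical placeholders, and the rook-ceph-tools pod name is the one from this report; substitute values from your own provider cluster:

# 1. Find the node hosting the rook-ceph-mgr pod so it can be excluded.
MGR_NODE=$(oc get pods -n openshift-storage -l app=rook-ceph-mgr \
  -o jsonpath='{.items[0].spec.nodeName}')

# 2. Stop the EC2 instances of all other worker nodes.
#    WORKER_INSTANCE_IDS is an assumed, manually collected list mapping
#    every worker node except $MGR_NODE to its EC2 instance ID.
aws ec2 stop-instances --instance-ids $WORKER_INSTANCE_IDS

# 3. Wait 5 minutes.
sleep 300

# 4. Start the instances again and wait for the nodes to report Ready.
aws ec2 start-instances --instance-ids $WORKER_INSTANCE_IDS
oc wait node --all --for=condition=Ready --timeout=15m

# 5. Check Ceph health from the toolbox pod (pod name taken from this report).
oc rsh -n openshift-storage rook-ceph-tools-65bcddc589-fxjww ceph health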
*** This bug has been marked as a duplicate of bug 2112021 ***