Bug 2136378

Summary: OSDs are marked as down when OSD pods are running
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Jilju Joy <jijoy>
Component: odf-managed-service
Assignee: Leela Venkaiah Gangavarapu <lgangava>
Status: CLOSED NOTABUG
QA Contact: Jilju Joy <jijoy>
Severity: high
Priority: unspecified
Version: 4.10
CC: aeyal, ebenahar, lgangava, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2022-11-21 09:10:34 UTC
Type: Bug

Description Jilju Joy 2022-10-20 06:25:30 UTC
Description of problem:
Two OSDs are marked as down after a test case that restarts all the worker nodes. This state did not recover even after 15 hours, during which all OSD pods were running. The cluster was installed with a dev add-on that contains the changes for the epic https://issues.redhat.com/browse/ODFMS-55.
We have also observed this condition on a different cluster without running any disruption tests.

$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph status
  cluster:
    id:     c4076b98-b38e-4692-9302-8dd22535a932
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 417 pgs inactive, 417 pgs peering, 417 pgs stale
 
  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 14h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 1 up (since 12h), 3 in (since 24h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 417 pgs
    objects: 65 objects, 44 MiB
    usage:   398 MiB used, 12 TiB / 12 TiB avail
    pgs:     100.000% pgs not active
             417 stale+peering

All 3 OSD pods are running:

$ oc get pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-85cfdf7f6f-f5pvv   2/2     Running   0          15h   10.0.148.249   ip-10-0-148-249.ec2.internal   <none>           <none>
rook-ceph-osd-1-6798855f-xzs8r     2/2     Running   0          15h   10.0.170.49    ip-10-0-170-49.ec2.internal    <none>           <none>
rook-ceph-osd-2-67dd9dd654-zvr5q   2/2     Running   0          12h   10.0.128.227   ip-10-0-128-227.ec2.internal   <none>           <none>
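
A diagnostic sketch to identify which OSD IDs, hosts and zones Ceph considers down (the ceph status above only reports counts), assuming the same rook-ceph-tools pod as above; ceph osd tree lists each OSD with its up/down state and CRUSH placement, so the two down OSDs can be matched against the Running pods listed here:

$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph osd tree
$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph osd stat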

$ oc get nodes
NAME                           STATUS   ROLES          AGE   VERSION
ip-10-0-128-227.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-129-248.ec2.internal   Ready    master         25h   v1.23.5+8471591
ip-10-0-135-211.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-138-240.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-148-249.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-153-179.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-156-88.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-156-93.ec2.internal    Ready    master         25h   v1.23.5+8471591
ip-10-0-162-182.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-162-4.ec2.internal     Ready    master         25h   v1.23.5+8471591
ip-10-0-164-32.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-170-49.ec2.internal    Ready    worker         25h   v1.23.5+8471591

must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o19-c3/jijoy-o19-c3_20221019T082340/logs/testcases_1666245376/
--------------------------------------------------------

Failed test case:
tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]


Relevant logs from the test case:

Rebooting nodes

2022-10-19 19:41:46  14:11:46 - MainThread - /home/jenkins/workspace/qe-odf-multicluster/ocs-ci/ocs_ci/utility/aws.py - INFO - C[jijoy-o19-pr] - Rebooting instances ('ip-10-0-128-227.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal')
2022-10-19 19:41:47  14:11:47 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - Wait for 6 of the nodes to reach the expected status Ready


The nodes reached the Ready state after some time:

2022-10-19 19:42:10  14:12:10 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - The following nodes reached status Ready: ['ip-10-0-128-227.ec2.internal', 'ip-10-0-129-248.ec2.internal', 'ip-10-0-135-211.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-153-179.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-156-93.ec2.internal', 'ip-10-0-162-182.ec2.internal', 'ip-10-0-162-4.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal']

===================================================================================

Version-Release number of selected component (if applicable):
OCP 4.10.35
ODF 4.10.5-4
ocs-osd-deployer.v2.0.8

==================================================================================
How reproducible:
2/3
=================================================================================
Steps to Reproduce:
The issue reported here was seen after rebooting the worker nodes in the test case tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker] (an example invocation is sketched below).
The same condition was also observed on a different cluster without running any disruption tests.
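
A sketch of how the failing test case can be invoked from an ocs-ci checkout; the run-ci flags and the config/cluster-path values are assumptions that depend on the local setup:

$ run-ci --ocsci-conf <ocsci-conf.yaml> --cluster-path <cluster-dir> \
    "tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]"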

===============================================================================
Actual results:
Two OSDs are marked as down in the ceph status output even though all OSD pods are Running.

Expected results:
All OSDs should be marked as up.

Additional info:

Comment 1 Leela Venkaiah Gangavarapu 2022-10-27 13:25:53 UTC
> Two OSDs are marked as down after a test case that restarts all the worker nodes
- As per the lengthy discussion in https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
- The gist: the test case should restart only a single node at a time and must make sure Ceph health is OK before restarting the next node (a sketch of this sequence follows below).
- Please recheck @jijoy
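
A minimal sketch of the recommended sequence: restart one worker at a time and wait for Ceph health to recover before touching the next node. The node names and tools pod name are taken from this cluster, and the loop is illustrative rather than the exact ocs-ci logic:

for node in ip-10-0-148-249.ec2.internal ip-10-0-170-49.ec2.internal; do
  # trigger a reboot from a debug pod; the connection drops when the node goes down
  oc debug node/$node -- chroot /host systemctl reboot || true
  # wait for the node to drop out and come back
  oc wait --for=condition=Ready=false node/$node --timeout=10m
  oc wait --for=condition=Ready node/$node --timeout=15m
  # do not restart the next node until Ceph reports HEALTH_OK again
  until oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph health | grep -q HEALTH_OK; do
    sleep 30
  done
done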

Comment 2 Jilju Joy 2022-11-21 09:10:34 UTC
(In reply to Leela Venkaiah Gangavarapu from comment #1)
> > Two OSDs are marked as down after a test case that restarts all the worker nodes
> - As per the lengthy discussion in
> https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
> - The gist: the test case should restart only a single node at a time and
> must make sure Ceph health is OK before restarting the next node.
> - Please recheck @jijoy

Hi Leela,

Closing the bug based on the discussions.