Bug 2136378 - OSDs are marked as down when OSD pods are running
Summary: OSDs are marked as down when OSD pods are running
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-10-20 06:25 UTC by Jilju Joy
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-21 09:10:34 UTC
Embargoed:



Description Jilju Joy 2022-10-20 06:25:30 UTC
Description of problem:
2 OSDs are marked as down after a test case that restarts all the worker nodes. This state did not recover even after 15 hours, although all the OSD pods were running. The cluster was installed using a dev addon which contains the changes for the epic https://issues.redhat.com/browse/ODFMS-55
We have observed this condition in a different cluster without even running any disruption tests.

$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph status
  cluster:
    id:     c4076b98-b38e-4692-9302-8dd22535a932
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 417 pgs inactive, 417 pgs peering, 417 pgs stale
 
  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 14h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 1 up (since 12h), 3 in (since 24h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 417 pgs
    objects: 65 objects, 44 MiB
    usage:   398 MiB used, 12 TiB / 12 TiB avail
    pgs:     100.000% pgs not active
             417 stale+peering

All 3 OSD pods are running:

$ oc get pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-85cfdf7f6f-f5pvv   2/2     Running   0          15h   10.0.148.249   ip-10-0-148-249.ec2.internal   <none>           <none>
rook-ceph-osd-1-6798855f-xzs8r     2/2     Running   0          15h   10.0.170.49    ip-10-0-170-49.ec2.internal    <none>           <none>
rook-ceph-osd-2-67dd9dd654-zvr5q   2/2     Running   0          12h   10.0.128.227   ip-10-0-128-227.ec2.internal   <none>           <none>

$ oc get nodes
NAME                           STATUS   ROLES          AGE   VERSION
ip-10-0-128-227.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-129-248.ec2.internal   Ready    master         25h   v1.23.5+8471591
ip-10-0-135-211.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-138-240.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-148-249.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-153-179.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-156-88.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-156-93.ec2.internal    Ready    master         25h   v1.23.5+8471591
ip-10-0-162-182.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-162-4.ec2.internal     Ready    master         25h   v1.23.5+8471591
ip-10-0-164-32.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-170-49.ec2.internal    Ready    worker         25h   v1.23.5+8471591

must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o19-c3/jijoy-o19-c3_20221019T082340/logs/testcases_1666245376/
--------------------------------------------------------

Failed test case:
tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]


Relevant logs from the test case:

Rebooting nodes

2022-10-19 19:41:46  14:11:46 - MainThread - /home/jenkins/workspace/qe-odf-multicluster/ocs-ci/ocs_ci/utility/aws.py - INFO - C[jijoy-o19-pr] - Rebooting instances ('ip-10-0-128-227.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal')
2022-10-19 19:41:47  14:11:47 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - Wait for 6 of the nodes to reach the expected status Ready


The nodes reached status Ready after some time

2022-10-19 19:42:10  14:12:10 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - The following nodes reached status Ready: ['ip-10-0-128-227.ec2.internal', 'ip-10-0-129-248.ec2.internal', 'ip-10-0-135-211.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-153-179.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-156-93.ec2.internal', 'ip-10-0-162-182.ec2.internal', 'ip-10-0-162-4.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal']

===================================================================================

Version-Release number of selected component (if applicable):
OCP 4.10.35
ODF 4.10.5-4
ocs-osd-deployer.v2.0.8

==================================================================================
How reproducible:
2/3
=================================================================================
Steps to Reproduce:
The issue reported here was seen after rebooting worker nodes in the test case tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]. 
This condition was also observed in a different cluster without even running any disruption tests.

===============================================================================
Actual results:
2 OSDs are marked as down when checking the Ceph status.

Expected results:
All OSDs should be marked as up.

Additional info:

Comment 1 Leela Venkaiah Gangavarapu 2022-10-27 13:25:53 UTC
> 2 OSDs are marked as down after a test case that restarts all the worker nodes
- As per the lengthy discussion that happened in https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
- As a gist, the test case should restart only a single node at a time and make sure Ceph health is OK before restarting the next node.
- Please recheck @jijoy
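
A minimal sketch of that one-node-at-a-time flow (hypothetical, not taken from the test code; assumes the default openshift-storage namespace, oc debug access to the nodes, and that rebooting via systemctl stands in for the AWS instance reboot the test performs):

TOOLBOX=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
# Note: in this cluster the infra nodes also carry the worker role, so this selector includes them.
for NODE in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    # Reboot one worker; oc debug exits abruptly when the host goes down, hence "|| true".
    oc debug "$NODE" -- chroot /host systemctl reboot || true
    # Wait for the node to rejoin, then wait for Ceph to report HEALTH_OK before the next reboot.
    oc wait --for=condition=Ready "$NODE" --timeout=15m
    until oc -n openshift-storage rsh "$TOOLBOX" ceph health | grep -q HEALTH_OK; do
        sleep 30
    done
done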

Comment 2 Jilju Joy 2022-11-21 09:10:34 UTC
(In reply to Leela Venkaiah Gangavarapu from comment #1)
> > 2 OSDs are marked as down after a test case that restarts all the worker nodes
> - As per the lengthy discussion that happened in
> https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
> - As a gist, the test case should restart only a single node at a time
> and make sure Ceph health is OK before restarting the next node.
> - Please recheck @jijoy

Hi Leela,

Closing the bug based on the discussions.

