Description of problem:

2 OSDs are marked as down after a test case that restarts all the worker nodes. This state did not recover even after 15 hours, although all the OSD pods were running. The cluster was installed using the dev addon, which contains the changes for the epic https://issues.redhat.com/browse/ODFMS-55. We have also observed this condition in a different cluster without running any disruption tests.

$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph status
  cluster:
    id:     c4076b98-b38e-4692-9302-8dd22535a932
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 zones (2 osds) down
            Reduced data availability: 417 pgs inactive, 417 pgs peering, 417 pgs stale

  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 14h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 1 up (since 12h), 3 in (since 24h)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 417 pgs
    objects: 65 objects, 44 MiB
    usage:   398 MiB used, 12 TiB / 12 TiB avail
    pgs:     100.000% pgs not active
             417 stale+peering

All 3 OSD pods are running:

$ oc get pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-85cfdf7f6f-f5pvv   2/2     Running   0          15h   10.0.148.249   ip-10-0-148-249.ec2.internal   <none>           <none>
rook-ceph-osd-1-6798855f-xzs8r     2/2     Running   0          15h   10.0.170.49    ip-10-0-170-49.ec2.internal    <none>           <none>
rook-ceph-osd-2-67dd9dd654-zvr5q   2/2     Running   0          12h   10.0.128.227   ip-10-0-128-227.ec2.internal   <none>           <none>

$ oc get nodes
NAME                           STATUS   ROLES          AGE   VERSION
ip-10-0-128-227.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-129-248.ec2.internal   Ready    master         25h   v1.23.5+8471591
ip-10-0-135-211.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-138-240.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-148-249.ec2.internal   Ready    worker         25h   v1.23.5+8471591
ip-10-0-153-179.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-156-88.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-156-93.ec2.internal    Ready    master         25h   v1.23.5+8471591
ip-10-0-162-182.ec2.internal   Ready    infra,worker   25h   v1.23.5+8471591
ip-10-0-162-4.ec2.internal     Ready    master         25h   v1.23.5+8471591
ip-10-0-164-32.ec2.internal    Ready    worker         25h   v1.23.5+8471591
ip-10-0-170-49.ec2.internal    Ready    worker         25h   v1.23.5+8471591

Must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o19-c3/jijoy-o19-c3_20221019T082340/logs/testcases_1666245376/

--------------------------------------------------------
Failed test case:
tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]

Relevant logs from the test case:

Rebooting nodes:
2022-10-19 19:41:46  14:11:46 - MainThread - /home/jenkins/workspace/qe-odf-multicluster/ocs-ci/ocs_ci/utility/aws.py - INFO - C[jijoy-o19-pr] - Rebooting instances ('ip-10-0-128-227.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal')
2022-10-19 19:41:47  14:11:47 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - Wait for 6 of the nodes to reach the expected status Ready

Nodes reached the state Ready after some time:
2022-10-19 19:42:10  14:12:10 - MainThread - ocs_ci.ocs.node - INFO - C[jijoy-o19-pr] - The following nodes reached status Ready: ['ip-10-0-128-227.ec2.internal', 'ip-10-0-129-248.ec2.internal', 'ip-10-0-135-211.ec2.internal', 'ip-10-0-138-240.ec2.internal', 'ip-10-0-148-249.ec2.internal', 'ip-10-0-153-179.ec2.internal', 'ip-10-0-156-88.ec2.internal', 'ip-10-0-156-93.ec2.internal', 'ip-10-0-162-182.ec2.internal', 'ip-10-0-162-4.ec2.internal', 'ip-10-0-164-32.ec2.internal', 'ip-10-0-170-49.ec2.internal']

===================================================================================
Version-Release number of selected component (if applicable):

OCP 4.10.35
ODF 4.10.5-4
ocs-osd-deployer.v2.0.8

==================================================================================
How reproducible:

2/3

=================================================================================
Steps to Reproduce:

The issue reported here was seen after rebooting worker nodes in the test case tests/manage/z_cluster/nodes/test_nodes_restart_ms.py::TestNodesRestartMS::test_nodes_restart[worker]. This condition was also observed in a different cluster without running any disruption tests.

===============================================================================
Actual results:

2 OSDs are marked as down when checking the Ceph status, even though all OSD pods are running.

Expected results:

All OSDs should be marked as up.

Additional info:
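For anyone triaging a similar state, the OSD daemon status that Ceph reports can be compared against the Running pods directly from the toolbox. This is only a suggested check, assuming the same toolbox pod name as above:

$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph osd tree
$ oc rsh rook-ceph-tools-787676bdbd-k4bdn ceph osd dump | grep '^osd'

The osd tree / osd dump output shows the up/down and in/out state per OSD daemon, independent of the pod status shown by oc get pods.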
> 2 OSDs are marked as down after a test case that restarts all the worker nodes

- As per the lengthy discussion in https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
- As a gist, the test case should restart only a single node at a time and make sure Ceph health is OK before restarting the next node.
- Please recheck, @jijoy.
(In reply to Leela Venkaiah Gangavarapu from comment #1)
> > 2 OSDs are marked as down after a test case that restarts all the worker nodes
> - As per the lengthy discussion in https://chat.google.com/room/AAAASHA9vWs/w61gO12VQIc, this bug is invalid.
> - As a gist, the test case should restart only a single node at a time and make sure Ceph health is OK before restarting the next node.
> - Please recheck, @jijoy.

Hi Leela,

Closing the bug based on the discussion.
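For illustration only, a rough shell sketch of the sequence suggested in comment #1. This is not the ocs-ci implementation: the toolbox pod name is the one from this cluster, the reboot here uses oc debug rather than the AWS API used by the test, and the timeouts are arbitrary.

TOOLBOX=rook-ceph-tools-787676bdbd-k4bdn

# Select plain worker nodes (exclude the infra,worker nodes, as the test does).
for node in $(oc get nodes -l 'node-role.kubernetes.io/worker,!node-role.kubernetes.io/infra' \
                -o jsonpath='{.items[*].metadata.name}'); do
    # Reboot the node; the debug pod losing its connection here is expected.
    oc debug node/"$node" -- chroot /host systemctl reboot || true

    # Wait for the node to go NotReady and then come back Ready.
    oc wait --for=condition=Ready=false node/"$node" --timeout=10m
    oc wait --for=condition=Ready node/"$node" --timeout=20m

    # Do not touch the next node until Ceph reports HEALTH_OK again.
    until oc rsh "$TOOLBOX" ceph health | grep -q '^HEALTH_OK'; do
        sleep 30
    done
done

Restarting the nodes one at a time and gating on HEALTH_OK keeps Ceph with enough OSDs up to serve I/O throughout the test, which is the behaviour the test case was expected to exercise.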