Description of problem (please be as detailed as possible and provide log snippets):

When the nodes are powered off and powered back on, Ceph reports the OSDs as down even though the OSD pods are in a Running state.

Version of all relevant components (if applicable):
OCP version: 4.12.0-0.nightly-2022-11-10-033725
ODF version: 4.12.0-111
Ceph version: ceph version 16.2.10-72.el8cp (3311949c2d1edf5cabcc20ba0f35b4bfccbf021e) pacific (stable)
ACM version: v2.7.0
Submariner version: 0.14.0-rc3
VolSync version: volsync-product.v0.6.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a 4.12 ODF cluster
2. Keep IO running for a week
3. After a week, power off all the nodes and power them back on after 2 hours

Actual results:

ceph status:

  cluster:
    id:     25807948-d645-4091-8d5c-df19c0d5389e
    health: HEALTH_WARN
            1 filesystem is degraded
            2 large omap objects
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 353 pgs inactive
            Degraded data redundancy: 441786/662679 objects degraded (66.667%), 198 pgs degraded, 353 pgs undersized

  services:
    mon:        3 daemons, quorum a,b,c (age 48m)
    mgr:        a(active, since 48m)
    mds:        1/1 daemons up, 1 standby
    osd:        3 osds: 1 up (since 18m), 3 in (since 8d)
    rbd-mirror: 1 daemon active (1 hosts)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 353 pgs
    objects: 220.89k objects, 641 GiB
    usage:   1.9 TiB used, 4.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             441786/662679 objects degraded (66.667%)
             198 undersized+degraded+peered
             155 undersized+peered

oc get pods:

NAME   READY   STATUS   RESTARTS   AGE
csi-addons-controller-manager-f8897699f-txd29   2/2   Running   3   7d18h
csi-cephfsplugin-bvqtn   2/2   Running   10   8d
csi-cephfsplugin-hjz6f   2/2   Running   8   8d
csi-cephfsplugin-provisioner-6874c89994-bft9z   5/5   Running   5   7d18h
csi-cephfsplugin-provisioner-6874c89994-f69mh   5/5   Running   5   7d18h
csi-cephfsplugin-xdnt8   2/2   Running   8   8d
csi-rbdplugin-k95rp   3/3   Running   12   8d
csi-rbdplugin-provisioner-99c7d59dc-95scf   7/7   Running   8   7d18h
csi-rbdplugin-provisioner-99c7d59dc-w2gzc   7/7   Running   7   7d18h
csi-rbdplugin-pxvtv   3/3   Running   12   8d
csi-rbdplugin-pzl2j   3/3   Running   15   8d
noobaa-core-0   1/1   Running   1   7d18h
noobaa-db-pg-0   0/1   PodInitializing   1   7d18h
noobaa-endpoint-57ddcfb75b-p8svt   1/1   Running   1   7d18h
noobaa-operator-96fdbf6cd-qmdrg   1/1   Running   1   7d18h
ocs-metrics-exporter-55b5679864-lfkqd   1/1   Running   1   7d18h
ocs-operator-7fb58c9575-6vdwb   1/1   Running   1   7d18h
odf-console-885bb57b5-x8kpj   1/1   Running   1   7d18h
odf-operator-controller-manager-c5b894c4d-6dgh5   2/2   Running   8   7d18h
rook-ceph-crashcollector-compute-0-6d677d8fd7-4xsmw   1/1   Running   1   7d18h
rook-ceph-crashcollector-compute-1-5b55565ffd-9zxq4   1/1   Running   1   7d18h
rook-ceph-crashcollector-compute-2-65949bfb77-gd5c5   1/1   Running   1   4d19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-88f4dc7dhks8r   2/2   Running   3   7d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-667bc964b4q6g   2/2   Running   2   7d18h
rook-ceph-mgr-a-7cf69447c6-f5cjn   2/2   Running   2   7d18h
rook-ceph-mon-a-7ff8cb55d5-ccxdk   2/2   Running   2   4d19h
rook-ceph-mon-b-7bcb54b689-4b4wj   2/2   Running   2   7d18h
rook-ceph-mon-c-58559f8f99-ppjpr   2/2   Running   2   7d18h
rook-ceph-operator-5d6f4c699d-26mvr   1/1   Running   1   7d18h
rook-ceph-osd-0-5cdb4f6f74-hs25j   2/2   Running   2   4d19h
rook-ceph-osd-1-68cfc85c8-mxztl   2/2   Running   2   7d18h
rook-ceph-osd-2-6485c4cbf9-k2zv9   2/2   Running   2   7d18h
rook-ceph-rbd-mirror-a-57fbf58549-4dh8s   2/2   Running   2   7d18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-b9c547f8svx7   1/2   Running   14 (7m24s ago)   7d18h
rook-ceph-tools-7864db9566-zt4j7   1/1   Running   1   7d18h
token-exchange-agent-955764cc7-jcjwq   1/1   Running   1   7d18h

Expected results:

Additional info:
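For reference, the state above can be re-gathered with commands along these lines. This is a minimal sketch that assumes the default openshift-storage namespace and the toolbox pod name shown in the listing above; adjust the names to the actual environment.

  # Cluster health and OSD up/down state, queried via the rook-ceph toolbox pod
  oc -n openshift-storage rsh rook-ceph-tools-7864db9566-zt4j7 ceph status
  oc -n openshift-storage rsh rook-ceph-tools-7864db9566-zt4j7 ceph osd tree

  # The OSD pods themselves report Running even though Ceph marks the OSDs down
  oc -n openshift-storage get pods -l app=rook-ceph-osd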
RADOS team, could the OSD compaction cause the OSD to not respond to heartbeats? And why would this cause the OSDs to not come back up automatically? From Rook's perspective, the OSD pods are brought back online as expected, but Ceph isn't marking them as up.

Here are some observations about the running OSD pods; sketches of commands that could be used to investigate further are included after the log links at the end of this comment.

1. osd.2 is up and attempts to perform health checks on osd.1. Its log [1] shows the heartbeats failing:

2022-11-23T15:56:51.603549904Z debug 2022-11-23T15:56:51.603+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:52.558615785Z debug 2022-11-23T15:56:52.558+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:53.558988446Z debug 2022-11-23T15:56:53.558+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:54.520875336Z debug 2022-11-23T15:56:54.520+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:55.547030702Z debug 2022-11-23T15:56:55.546+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:56.535311948Z debug 2022-11-23T15:56:56.534+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:57.543125115Z debug 2022-11-23T15:56:57.542+0000 7ffadbc82700 -1 osd.2 21751 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:58.571094807Z debug 2022-11-23T15:56:58.570+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:59.603971839Z debug 2022-11-23T15:56:59.603+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:57:00.568301032Z debug 2022-11-23T15:57:00.567+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2. The osd.1 log [2] at the time of osd.2's heartbeats shows that a RocksDB compaction is in progress:

2022-11-23T15:56:43.942789735Z debug 2022-11-23T15:56:43.942+0000 7f02e195c700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1669219003942758, "job": 6, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [1515, 1501, 1477, 1461], "files_L1": [1446, 1447], "score": 1, "input_data_size": 235293433}
2022-11-23T15:56:52.636230743Z debug 2022-11-23T15:56:52.634+0000 7f02e195c700 4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 6] Generated table #1527: 567403 keys, 69580265 bytes

3. The osd.0 log [3] at the same time shows the following messages, but nothing stands out to me as a real issue that would keep the OSD from being marked up:

2022-11-23T15:56:52.424962793Z debug 2022-11-23T15:56:52.423+0000 7fed948a0700 -1 osd.0 21750 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.834126.0:3 11.7 11.3c96ede7 (undecoded) ondisk+read+known_if_redirected e21743)
2022-11-23T15:56:52.424962793Z debug 2022-11-23T15:56:52.423+0000 7fed948a0700 0 log_channel(cluster) log [WRN] : 6 slow requests (by type [ 'delayed' : 6 ] most affected pool [ 'ocs-storagecluster-cephfilesystem-metadata' : 5 ])

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-2-6485c4cbf9-k2zv9/osd/osd/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-1-68cfc85c8-mxztl/osd/osd/logs/current.log
[3] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-0-5cdb4f6f74-hs25j/osd/osd/logs/current.log
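To dig into why osd.1 never answers the heartbeats, the OSD admin socket can be queried from inside the running OSD pods. A minimal sketch, assuming the pod and container names from the listing above and that the dump_osd_network admin socket command is available on this Pacific build:

  # From inside the osd.2 pod: dump heartbeat ping times to its peer OSDs
  # (a threshold of 0 prints all entries, not only the slow ones)
  oc -n openshift-storage exec rook-ceph-osd-2-6485c4cbf9-k2zv9 -c osd -- ceph daemon osd.2 dump_osd_network 0

  # From the toolbox: list the OSDs the monitors currently consider down
  oc -n openshift-storage rsh rook-ceph-tools-7864db9566-zt4j7 ceph osd tree down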
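Similarly, if the RocksDB compaction on osd.1 is suspected of blocking its heartbeat replies, the compaction could be driven to completion manually and the slow ops on osd.0 inspected through the same admin socket. Again a sketch with the pod names from above, assuming the compact and dump_ops_in_flight admin socket commands are available:

  # Trigger an explicit RocksDB compaction on osd.1 (returns once the compaction completes)
  oc -n openshift-storage exec rook-ceph-osd-1-68cfc85c8-mxztl -c osd -- ceph daemon osd.1 compact

  # Inspect the slow ops osd.0 is reporting
  oc -n openshift-storage exec rook-ceph-osd-0-5cdb4f6f74-hs25j -c osd -- ceph daemon osd.0 dump_ops_in_flight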
This appears to be the same issue as 2139103. Please reopen if there is some difference I missed. *** This bug has been marked as a duplicate of bug 2139103 ***