Bug 2145268

Summary: When nodes are powered off and powered on, Ceph reports the OSDs as down but the OSD pods are in a Running state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
ceph sub component: RADOS
Reporter: Pratik Surve <prsurve>
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
Docs Contact:
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
CC: bniver, madam, muagarwa, nmordech, ocs-bugs, odf-bz-bot, pdhiran, tnielsen
Version: 4.12
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-12-01 22:22:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pratik Surve 2022-11-23 16:48:58 UTC
Description of problem (please be as detailed as possible and provide log snippets):

When nodes are powered off and powered on, Ceph reports the OSDs as down, but the OSD pods are in a Running state.

Version of all relevant components (if applicable):

OCP version:- 4.12.0-0.nightly-2022-11-10-033725
ODF version:- 4.12.0-111
CEPH version:- ceph version 16.2.10-72.el8cp (3311949c2d1edf5cabcc20ba0f35b4bfccbf021e) pacific (stable)
ACM version:- v2.7.0
SUBMARINER version:- 0.14.0-rc3
VOLSYNC version:- volsync-product.v0.6.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a 4.12 ODF cluster.
2. Keep I/O running for a week.
3. After a week, power off all the nodes, and after 2 hours power them back on.
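
For reference, the cluster state shown under "Actual results" below can be collected after the nodes come back with commands along these lines (a hedged sketch of standard ODF/Rook checks; the toolbox pod selector is assumed, not taken from this report):

# Kubernetes' view of the openshift-storage pods
oc -n openshift-storage get pods

# Ceph's view, via the rook-ceph-tools pod
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec $TOOLS -- ceph status
oc -n openshift-storage exec $TOOLS -- ceph osd tree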



Actual results:
ceph status
  cluster:
    id:     25807948-d645-4091-8d5c-df19c0d5389e
    health: HEALTH_WARN
            1 filesystem is degraded
            2 large omap objects
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 353 pgs inactive
            Degraded data redundancy: 441786/662679 objects degraded (66.667%), 198 pgs degraded, 353 pgs undersized
 
  services:
    mon:        3 daemons, quorum a,b,c (age 48m)
    mgr:        a(active, since 48m)
    mds:        1/1 daemons up, 1 standby
    osd:        3 osds: 1 up (since 18m), 3 in (since 8d)
    rbd-mirror: 1 daemon active (1 hosts)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 353 pgs
    objects: 220.89k objects, 641 GiB
    usage:   1.9 TiB used, 4.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             441786/662679 objects degraded (66.667%)
             198 undersized+degraded+peered
             155 undersized+peered


oc get pods
NAME                                                              READY   STATUS            RESTARTS         AGE
csi-addons-controller-manager-f8897699f-txd29                     2/2     Running           3                7d18h
csi-cephfsplugin-bvqtn                                            2/2     Running           10               8d
csi-cephfsplugin-hjz6f                                            2/2     Running           8                8d
csi-cephfsplugin-provisioner-6874c89994-bft9z                     5/5     Running           5                7d18h
csi-cephfsplugin-provisioner-6874c89994-f69mh                     5/5     Running           5                7d18h
csi-cephfsplugin-xdnt8                                            2/2     Running           8                8d
csi-rbdplugin-k95rp                                               3/3     Running           12               8d
csi-rbdplugin-provisioner-99c7d59dc-95scf                         7/7     Running           8                7d18h
csi-rbdplugin-provisioner-99c7d59dc-w2gzc                         7/7     Running           7                7d18h
csi-rbdplugin-pxvtv                                               3/3     Running           12               8d
csi-rbdplugin-pzl2j                                               3/3     Running           15               8d
noobaa-core-0                                                     1/1     Running           1                7d18h
noobaa-db-pg-0                                                    0/1     PodInitializing   1                7d18h
noobaa-endpoint-57ddcfb75b-p8svt                                  1/1     Running           1                7d18h
noobaa-operator-96fdbf6cd-qmdrg                                   1/1     Running           1                7d18h
ocs-metrics-exporter-55b5679864-lfkqd                             1/1     Running           1                7d18h
ocs-operator-7fb58c9575-6vdwb                                     1/1     Running           1                7d18h
odf-console-885bb57b5-x8kpj                                       1/1     Running           1                7d18h
odf-operator-controller-manager-c5b894c4d-6dgh5                   2/2     Running           8                7d18h
rook-ceph-crashcollector-compute-0-6d677d8fd7-4xsmw               1/1     Running           1                7d18h
rook-ceph-crashcollector-compute-1-5b55565ffd-9zxq4               1/1     Running           1                7d18h
rook-ceph-crashcollector-compute-2-65949bfb77-gd5c5               1/1     Running           1                4d19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-88f4dc7dhks8r   2/2     Running           3                7d18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-667bc964b4q6g   2/2     Running           2                7d18h
rook-ceph-mgr-a-7cf69447c6-f5cjn                                  2/2     Running           2                7d18h
rook-ceph-mon-a-7ff8cb55d5-ccxdk                                  2/2     Running           2                4d19h
rook-ceph-mon-b-7bcb54b689-4b4wj                                  2/2     Running           2                7d18h
rook-ceph-mon-c-58559f8f99-ppjpr                                  2/2     Running           2                7d18h
rook-ceph-operator-5d6f4c699d-26mvr                               1/1     Running           1                7d18h
rook-ceph-osd-0-5cdb4f6f74-hs25j                                  2/2     Running           2                4d19h
rook-ceph-osd-1-68cfc85c8-mxztl                                   2/2     Running           2                7d18h
rook-ceph-osd-2-6485c4cbf9-k2zv9                                  2/2     Running           2                7d18h
rook-ceph-rbd-mirror-a-57fbf58549-4dh8s                           2/2     Running           2                7d18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-b9c547f8svx7   1/2     Running           14 (7m24s ago)   7d18h
rook-ceph-tools-7864db9566-zt4j7                                  1/1     Running           1                7d18h
token-exchange-agent-955764cc7-jcjwq                              1/1     Running           1                7d18h



Expected results:



Additional info:

Comment 4 Travis Nielsen 2022-11-28 21:46:01 UTC
RADOS team, could the OSD compaction cause the OSD to not respond to heartbeats? And why would this cause the OSDs to not come back up automatically?
From Rook's perspective, the OSD pods are brought back online as expected, but ceph isn't marking them as up.
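
One way to see the mismatch described here is to put Ceph's view of the OSDs next to Kubernetes' view of the OSD pods (a hedged sketch using the usual Rook labels; run the ceph commands from the toolbox pod):

# Ceph's view: per-OSD up/down status
ceph osd tree
ceph osd stat

# Kubernetes' view: the OSD pods themselves
oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

In this report the OSD pods are all Running while the ceph status above shows only 1 of 3 OSDs up.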

Here are some observations about the running OSD pods.


1. osd.2 is up and attempts to perform health checks on osd.1. The log [1] shows the heartbeats are failing:

2022-11-23T15:56:51.603549904Z debug 2022-11-23T15:56:51.603+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:52.558615785Z debug 2022-11-23T15:56:52.558+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:53.558988446Z debug 2022-11-23T15:56:53.558+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:54.520875336Z debug 2022-11-23T15:56:54.520+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:55.547030702Z debug 2022-11-23T15:56:55.546+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:56.535311948Z debug 2022-11-23T15:56:56.534+0000 7ffadbc82700 -1 osd.2 21750 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:57.543125115Z debug 2022-11-23T15:56:57.542+0000 7ffadbc82700 -1 osd.2 21751 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:58.571094807Z debug 2022-11-23T15:56:58.570+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:56:59.603971839Z debug 2022-11-23T15:56:59.603+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
2022-11-23T15:57:00.568301032Z debug 2022-11-23T15:57:00.567+0000 7ffadbc82700 -1 osd.2 21752 heartbeat_check: no reply from 10.132.2.7:6804 osd.1 ever on either front or back, first ping sent 2022-11-23T15:56:30.744596+0000 (oldest deadline 2022-11-23T15:56:50.744596+0000)
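
The 20-second gap between the first ping (15:56:30) and the oldest deadline (15:56:50) matches the default osd_heartbeat_grace of 20 seconds. If heartbeat tuning is suspected, the effective values can be checked from the toolbox (a hedged sketch, assuming the Pacific-level ceph config interface):

ceph config get osd osd_heartbeat_grace      # default 20 seconds
ceph config get osd osd_heartbeat_interval   # default 6 seconds
ceph config dump | grep -i heartbeat         # any per-daemon overrides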


2. The osd.1 log [2] at the time of osd.2's heartbeats shows that a compaction is in progress:

2022-11-23T15:56:43.942789735Z debug 2022-11-23T15:56:43.942+0000 7f02e195c700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1669219003942758, "job": 6, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [1515, 1501, 1477, 1461], "files_L1": [1446, 1447], "score": 1, "input_data_size": 235293433}
2022-11-23T15:56:52.636230743Z debug 2022-11-23T15:56:52.634+0000 7f02e195c700  4 rocksdb: [compaction/compaction_job.cc:1349] [P] [JOB 6] Generated table #1527: 567403 keys, 69580265 bytes
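
If the RocksDB compaction on osd.1 is what is delaying heartbeat replies, one thing to try is compacting the OSD explicitly and watching the compaction events finish (a hedged sketch; whether this helps here is an assumption, not a confirmed workaround):

# From the toolbox: trigger an explicit compaction of osd.1's RocksDB
ceph tell osd.1 compact

# Follow the compaction events in the osd.1 log
oc -n openshift-storage logs deploy/rook-ceph-osd-1 -c osd | grep -i compaction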


3. The osd.0 log [3] at the same time shows the following messages, but nothing stands out to me as a real issue that would keep the OSD from being marked up:

2022-11-23T15:56:52.424962793Z debug 2022-11-23T15:56:52.423+0000 7fed948a0700 -1 osd.0 21750 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.834126.0:3 11.7 11.3c96ede7 (undecoded) ondisk+read+known_if_redirected e21743)
2022-11-23T15:56:52.424962793Z debug 2022-11-23T15:56:52.423+0000 7fed948a0700  0 log_channel(cluster) log [WRN] : 6 slow requests (by type [ 'delayed' : 6 ] most affected pool [ 'ocs-storagecluster-cephfilesystem-metadata' : 5 ])
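
The slow ops on osd.0 can be broken down further to see what they are blocked on (a hedged sketch; these are standard OSD admin commands, reachable via ceph tell from the toolbox or via the admin socket inside the osd container):

# In-flight and recently recorded slow ops on osd.0
ceph tell osd.0 dump_ops_in_flight
ceph tell osd.0 dump_historic_slow_ops

# Equivalent via the admin socket, from inside the rook-ceph-osd-0 pod
ceph daemon osd.0 dump_ops_in_flight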


[1] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-2-6485c4cbf9-k2zv9/osd/osd/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-1-68cfc85c8-mxztl/osd/osd/logs/current.log
[3] http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2145268/nov23/23-11-2022_22-19-44/23-11-2022_22-19-44-api-prsurve-vm-d-qe-rh-ocs-com/ocs-mg/quay-io-rhceph-dev-ocs-must-gather-sha256-562b69969b9212a0f3cf1a588c101c5e6cf6f345fa05c462ca22d2448786aac4/namespaces/openshift-storage/pods/rook-ceph-osd-0-5cdb4f6f74-hs25j/osd/osd/logs/current.log

Comment 8 Travis Nielsen 2022-12-01 22:22:26 UTC
This appears to be the same issue as bug 2139103. Please reopen if there is some difference I missed.

*** This bug has been marked as a duplicate of bug 2139103 ***