Bug 2240819

Summary: post upgrade to 4.10, osd daemons flapping due to snaptrimming
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: George Law <glaw>
Component: ceph Assignee: Prashant Dhange <pdhange>
ceph sub component: RADOS QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG Docs Contact:
Severity: high    
Priority: high CC: akupczyk, assingh, bhubbard, bniver, ksachdev, linuxkidd, mbreizma, muagarwa, nojha, pdhange, sbaldwin, smykhail, sostapov, vadeshpa
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-10-20 19:09:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description George Law 2023-09-26 18:39:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Sorry, I was pulled into this case in the middle of it; our APAC folks are offline - they can add additional details if needed.


An IBM customer recently upgraded from OCS 4.8 to ODF 4.10.
OCP is currently on 4.12.

This is IBM Cloud VPC 


Version of all relevant components (if applicable):

$ omc get csv
NAME                               DISPLAY                       VERSION   REPLACES                           PHASE
mcg-operator.v4.10.14              NooBaa Operator               4.10.14   mcg-operator.v4.9.15               Succeeded
ocs-operator.v4.10.14              OpenShift Container Storage   4.10.14   ocs-operator.v4.9.15               Succeeded
odf-csi-addons-operator.v4.10.14   CSI Addons                    4.10.14   odf-csi-addons-operator.v4.10.13   Succeeded
odf-operator.v4.10.14              OpenShift Data Foundation     4.10.14   odf-operator.v4.9.15               Succeeded

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.26   True        False         41h     Cluster version is 4.12.26


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Pre-upgrade (4.8), the customer had a large number of snapshots pushing their cluster up to high usage.

Ashish Singh had them set the following options to help speed up the space recovery. 
# ceph config set osd osd_max_trimming_pgs 5
# ceph config set osd osd_pg_max_concurrent_snap_trims 15 
# ceph config set osd osd_snap_trim_sleep_hdd 0
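
For reference, snaptrim progress can be watched while these options are in effect (illustrative commands, not from the original case notes; run from the rook-ceph toolbox):
# ceph pg ls snaptrim snaptrim_wait     # PGs currently trimming or queued for trim
# ceph df                               # pool usage should drop as trims complete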

The customer was able to recover 690 GB of the expected 1010 GB used by the snapshots.

After the upgrade to 4.10, the customer noticed OSD pods restarting; upon further investigation, the OSDs were flapping.
A must-gather was uploaded - see must-gather-post-odf4.10upgrade.tar.gz


Ashish S. reverted the above settings back to defaults, scaled down the rook-ceph-operator and ocs-operator, and removed the liveness probe from the OSDs to try to combat the flapping.
He also increased the HB timeouts osd_op_thread_suicide_timeout and osd_op_thread_timeout (these are currently the only non-default settings).
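
For the record, the revert and timeout changes described above would look roughly like this (a sketch only; the exact timeout values used on this cluster are not recorded here):
# ceph config rm osd osd_max_trimming_pgs
# ceph config rm osd osd_pg_max_concurrent_snap_trims
# ceph config rm osd osd_snap_trim_sleep_hdd
# ceph config set osd osd_op_thread_timeout 90              # example value only
# ceph config set osd osd_op_thread_suicide_timeout 900     # example value only
# ceph config dump | grep -E 'osd_op_thread|snap_trim'      # verify which settings remain non-default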

The cluster is still reporting slow requests.

Logs were captured with full debug enabled across all 3 OSDs.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 34 Adam Kupczyk 2023-10-09 10:35:37 UTC
> Also, what could be a possible reason for large omap for snap objects?

The DB could grow very significantly if one does snapshots & overwrites on objects that have large OMAP data.
BlueStore COPIES the OMAP data when BlueStore::clone() is done on an object; there is no COW here.
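
As a quick check (not part of the original comment), per-OSD omap and DB usage can be inspected with standard commands, for example:
# ceph osd df          # OMAP and META columns show the per-OSD omap/DB footprint
# ceph health detail   # reports any "large omap objects" warnings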


I think the best we can do to help SnapMapper trimming is to periodically inject a "compact" command.
The problem with RocksDB, deletions, and slow iterators is that a deletion is a very compact operation,
and the default triggers for auto-compaction rely on the size of the L0 sst tables.
It is possible to accumulate a significant number of key-remove operations without triggering compaction,
yet still be significantly burdened by iterating over the deleted keys.

Compaction can be triggered either online, with the admin command "compact":
# ceph tell osd.0 compact

If the OSD suicides before finishing compaction, one can take it offline and run ceph-kvstore-tool instead:
# ceph-kvstore-tool bluestore-kv path-to-data compact
or just compact the omap:
# ceph-kvstore-tool bluestore-kv path-to-data compact-prefix p

I think that as the SnapMapper deletion progresses, it might be necessary to retrigger compaction multiple times.
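
A minimal sketch for retriggering online compaction across every OSD from the toolbox (my own illustration of the suggestion above, not a command from this comment; assumes all OSDs are up and reachable):
# for id in $(ceph osd ls); do ceph tell osd.$id compact; done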