
Bug 2264798

Summary: [6.1z backport] mon_memory_target is ignored at startup when set without mon_memory_autotune in the config database
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manny <mcaldeir>
Component: RADOS
Assignee: Brad Hubbard <bhubbard>
Status: POST
QA Contact: skanta
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.3
CC: bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, mcaldeir, nojha, rzarzyns, tserlin, vumrao
Target Milestone: ---
Target Release: 6.1z10
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2250568, 2380400    
Bug Blocks:    

Description Manny 2024-02-19 03:53:40 UTC
This bug was initially created as a copy of Bug #2250568

I am copying this bug because: 



Description of problem:  MON Slow Requests and extremely large MON store

To give the whole picture: this customer started a project to add ~100 OSD nodes and to upgrade from RHCS 4.3.x to an RHCS 6.1z release. The current situation:

1.)  The OSD nodes were added while the system was still RHCS 4.3.x
2.)  The system was upgraded to RHCS 5.3z3
3.)  The system was adopted into Cephadm

While the system was on RHCS 4.3.x and the nodes were being added, the customer had the first occurrence of the issue. At that time they compacted the MON Store and set the parameter "mon_memory_target" to 64 GiB (64 x 2^30). The issue did not occur again while on RHCS 4.3.x, and the customer noted that the RSS for the MON services was ~36 GiB. The customer also monitors vital stats on the services; if we need an artifact for this, they will provide it.

The RSS is called out because in RHCS 5.3z3 we cannot get the RSS to exceed 2 GiB. More about this later.
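For reference, the 4.3.x-era remediation and the RSS check amount to roughly the following; these are illustrative commands, not the customer's exact invocations:
~~~
# Compact the MON store on one monitor (repeat per MON as needed):
ceph tell mon.lpcephmon15001 compact

# Set the 64 GiB target for all MONs in the config database:
ceph config set mon mon_memory_target 68719476736   # 64 * 2^30

# Eyeball the resulting RSS (in KiB) on each MON host:
ps -C ceph-mon -o pid,rss,cmd
~~~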

During the upgrade from 4.3.x to 5.3.5, the issue returned. The customer merely kept compacting the MON Store until the upgrade was done, assuming that would be the end of it once the cluster was upgraded.

Then the customer attempted to adopt the cluster and the Slow Ops + large MON Store DB returned yet again. It was disruptive enough that the adoption failed at least once. That is where I jumped on a remote session with the customer.

Here is what I observed:

1.)  Even with "mon_memory_target" = 64 GiB, the RSS of the MON services was between 1.7 GB and 1.9 GB. The customer verified the parameter via the Admin Socket, but the service was acting as if the parameter were at its default value of 2 GiB. We did assert the same parameter in the ceph.conf file and restart all 3 MONs, but that made no difference. Also, the customer left the MON store on node 3 in its "large" state, and the MON PID on node 3 was still just below 2 GiB in RSS.

Again, in RHCS 4.3.x, changing the parameter did have the desired effect on the RSS and seemed to affect the MON in a positive way.
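For the record, comparing the running value against the configured value was done roughly like this (illustrative; <mon-container> stands in for the full container name):
~~~
# What the daemon is actually using right now (admin socket, inside the MON container):
podman exec <mon-container> ceph daemon mon.$(hostname) config get mon_memory_target
podman exec <mon-container> ceph daemon mon.$(hostname) config get mon_memory_autotune

# What the config database says it should be using:
ceph config get mon mon_memory_target
~~~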

The customer also updated the case with this note:
~~~
Hi Manny,

Some more feedback on the mon_memory_target. With mon_memory_target = 68719476736 in both ceph.conf under [mon] and with ceph config ("mon basic mon_memory_target 68719476736" from config dump), we see this at mon startup:
Nov 16 22:03:50 lpcephmon15003 conmon[3264187]: debug 2023-11-16T22:03:50.623+0000 7f62b5c9f880  0 ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable), process ceph-mon, pid 8

2 seconds in:
Nov 16 22:03:52 lpcephmon15003 conmon[3264187]: debug 2023-11-16T22:03:52.523+0000 7f62b5c9f880 -1 mon.lpcephmon15003@-1(???).osd e1035161 _update_mon_cache_settings not using pcm and rocksdb
Nov 16 22:03:52 lpcephmon15003 conmon[3264187]: debug 2023-11-16T22:03:52.523+0000 7f62b5c9f880 -1 mon.lpcephmon15003@-1(???).osd e1035161 handle_conf_change mon_memory_target:68719476736 rocksdb_cache_size:536870912. Unable to update cache size.

Then from that point onwards, a few samples:
Nov 16 22:12:20 lpcephmon15003 conmon[3264187]: debug 2023-11-16T22:12:20.573+0000 7f62a1ab4700  1 mon.lpcephmon15003@2(peon).osd e1035162 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 79691776 kv_alloc: 864026624
...
Nov 16 23:52:28 lpcephmon15003 conmon[3264187]: debug 2023-11-16T23:52:28.100+0000 7f62a1ab4700  1 mon.lpcephmon15003@2(peon).osd e1036932 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 213909504 full_alloc: 465567744 kv_alloc: 335544320
...
Nov 17 08:54:04 lpcephmon15003 conmon[3264187]: debug 2023-11-17T08:54:04.327+0000 7f62a1ab4700  1 mon.lpcephmon15003@2(peon).osd e1038892 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 301989888 full_alloc: 411041792 kv_alloc: 297795584

Now we change the mon_memory_target on the daemon socket to about 6GB:
podman exec ceph-b631754c-6650-4342-a364-1241a7ca5e79-mon-$(hostname) ceph daemon mon.$(hostname) config set mon_memory_target 6719476736
Result:
Nov 17 09:03:14 lpcephmon15003 conmon[3264187]: debug 2023-11-17T09:03:14.668+0000 7f62a1ab4700  1 mon.lpcephmon15003@2(peon).osd e1038892 _set_new_cache_sizes cache_size:4906248856 inc_alloc: 1979711488 full_alloc: 2147483648 kv_alloc: 738197504
The value here is 6GB - 15% (from mon_memory_fragmentation, which takes it from osd_memory_expected_fragmentation), plus or minus some change. 


Next we set it to 96GB:
ceph tell mon.* config set mon_memory_target 103079215104

Result:
Nov 17 10:14:13 lpcephmon15003 conmon[3264187]: debug 2023-11-17T10:14:13.082+0000 7f62a1ab4700  1 mon.lpcephmon15003@2(peon).osd e1038926 _set_new_cache_sizes cache_size:86812026469 inc_alloc: 42144366592 full_alloc: 42479910912 kv_alloc: 2080374784

Success. Can consume tons of memory again. So, something is definitely bugged in RHCS 5, we are on ceph-mon-16.2.10-208.el8cp.x86_64. Will look at the mon debug output next, see if I can spot anything.
~~~
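For what it's worth, both "_set_new_cache_sizes" samples above are consistent with cache_size = mon_memory_target * (1 - 0.15) minus a fixed ~768 MiB. The constant offset matches the 768 MiB default of osd_memory_base, though I have not confirmed in the code that this is where the monitor takes it from:
~~~
# Quick arithmetic check against the two samples (the off-by-one is float rounding):
echo $(( 6719476736   * 85 / 100 - 805306368 ))   # 4906248857   (logged cache_size: 4906248856)
echo $(( 103079215104 * 85 / 100 - 805306368 ))   # 86812026470  (logged cache_size: 86812026469)
~~~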

2.)  Here are the Artifacts from Support Shell:
~~~
Grouping case attachments ...

                     ------------------  03660823  ------------------
drwxrwxrwx+ 3 yank yank        65 Nov  9 08:37 0010-sosreport-lpcephmon15002-2023-11-09-ostsqrf.tar.xz
-rw-rw-rw-+ 1 yank yank 582860180 Nov  9 09:05 0020-ceph-w-out.txt
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:19 0030-sosreport-lpcephmon15001-03660823-2023-11-14-oleuvwo.tar.xz
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:19 0040-sosreport-lpcephmon15002-03660823-2023-11-14-hxcsppj.tar.xz
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:19 0050-sosreport-lpcephmon15003-03660823-2023-11-14-egnqyuw.tar.xz
drwxrwxrwx+ 3 yank yank        25 Nov 14 19:28 0060-varlog-lpcephmon15001.tar.gz
drwxrwxrwx+ 3 yank yank        25 Nov 14 19:28 0070-varlog-lpcephmon15002.tar.gz
drwxrwxrwx+ 3 yank yank        25 Nov 14 19:28 0080-varlog-lpcephmon15003.tar.gz
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:49 0090-sosreport-lpcephmon15001-03660823-2023-11-14-zmbixjl.tar.xz
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:49 0100-sosreport-lpcephmon15003-03660823-2023-11-14-oivihyf.tar.xz
drwxrwxrwx+ 3 yank yank        74 Nov 14 19:49 0110-sosreport-lpcephmon15002-03660823-2023-11-14-wgahsku.tar.xz
-rw-rw-rw-+ 1 yank yank    208213 Nov 14 19:56 0120-r15-cephmon-disk-space.png
-rw-rw-rw-+ 1 yank yank  54116171 Nov 15 00:03 0130-lpcephmon15002-mon-ops-202311081610.json
-rw-rw-rw-+ 1 yank yank  61921521 Nov 16 17:54 0140-mon-lpcephmon15003-journal
-rw-rw-rw-+ 1 yank yank  64574990 Nov 16 17:54 0150-mon-lpcephmon15002-journal
-rw-rw-rw-+ 1 yank yank 118148079 Nov 16 17:54 0160-mon-lpcephmon15001-journal
-rw-rw-rw-+ 1 yank yank     12752 Nov 16 18:09 0170-top-lpcephmon15002
-rw-rw-rw-+ 1 yank yank      7041 Nov 16 18:09 0180-top-lpcephmon15003
-rw-rw-rw-+ 1 yank yank     12729 Nov 16 18:09 0190-top-lpcephmon15001
drwxrwxrwx+ 3 yank yank        35 Nov 16 19:18 0200-ceph_commands.tar.gz
-rw-rw-rw-+ 1 yank yank  23729450 Nov 16 19:24 0210-mon-messages-lpcephmon15003
-rw-rw-rw-+ 1 yank yank  23829010 Nov 16 19:24 0220-mon-messages-lpcephmon15002
-rw-rw-rw-+ 1 yank yank  47687828 Nov 16 19:24 0230-mon-messages-lpcephmon15001
-rw-rw-rw-+ 1 yank yank 156965467 Nov 16 20:25 0240-ceph-lpcephmon15002.log
-rw-rw-rw-+ 1 yank yank 156786512 Nov 16 20:25 0250-ceph-lpcephmon15003.log
-rw-rw-rw-+ 1 yank yank 157735142 Nov 16 20:25 0260-ceph-lpcephmon15001.log
-rw-rw-rw-+ 1 yank yank  49262340 Nov 16 22:40 0270-lpcephmon15003_mon_debug_stderr
drwxrwxrwx+ 2 yank yank      4096 Nov 17 16:37 0280-cephconfigs.zip
drwxrwxrwx+ 2 yank zeek        36 Nov  9 09:41 sosreport-20231109-082355
drwxrwxrwx+ 2 yank zeek        88 Nov 14 20:49 sosreport-20231114-183326
~~~

The customer is logging to syslog; attachments 21, 22, and 23 are the messages files from just after we tried to figure out the RSS. At that point the MON store had recently been compacted on nodes 1 and 2, but not on node 3.

3.)  The customer uses htop to monitor things. The MON on node 1 was the leader at this time and all of its threads were consuming some CPU. The MON on node 3 was consuming one CPU at ~90 percent, and the thread "ms_dispatch" was the only active thread. Why is this so?

For ~120 seconds we set debug_mon, debug_monc, and debug_ms to 20, and then the customer uploaded artifact 27 (lpcephmon15003_mon_debug_stderr) to Support Shell.
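The exact invocation is not in my notes; it was something along these lines, run against the affected MON:
~~~
# Raise the debug levels for ~120 seconds, then restore the defaults:
ceph tell mon.lpcephmon15003 config set debug_mon 20
ceph tell mon.lpcephmon15003 config set debug_monc 20
ceph tell mon.lpcephmon15003 config set debug_ms 20
sleep 120
ceph tell mon.lpcephmon15003 config set debug_mon 1/5
ceph tell mon.lpcephmon15003 config set debug_monc 0/10
ceph tell mon.lpcephmon15003 config set debug_ms 0/5
~~~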

Again, why was "ms_dispatch" spinning more or less as fast as one CPU could carry it?
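We did not capture a thread-level breakdown on the call. On the next occurrence, something like the following on the MON host should show what "ms_dispatch" is burning CPU on (illustrative commands, not ones we ran; eu-stack assumes elfutils is installed on the host):
~~~
MON_PID=$(pidof ceph-mon)
top -b -H -n 1 -p "$MON_PID" | head -40   # per-thread CPU usage; look for ms_dispatch
eu-stack -p "$MON_PID"                    # per-thread stack traces
~~~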

4.)  To consolidate the parameters, please see artifact 28 in Support Shell:
~~~
-bash 5.1 $ find 0280-cephconfigs.zip/ -type f
0280-cephconfigs.zip/config-dump.txt
0280-cephconfigs.zip/lpcephmon15001-ceph.conf
0280-cephconfigs.zip/lpcephmon15002-ceph.conf
0280-cephconfigs.zip/lpcephmon15003-ceph.conf
~~~


Summary:
Request to Eng/Dev: The customer feels very strongly that there is some sort of code bug here, and they want it investigated. I know we went down some specific "rabbit trails", but the goal should be to stop the MON store growth, which leads to Slow Ops and can only be worked around by compaction. There is also the matter of the MONs not using all the RAM they have been given.
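Given the summary of this backport ("mon_memory_target is ignored at startup when set without mon_memory_autotune in the config database"), one interim step worth trying, assuming that condition really is the trigger, is to set mon_memory_autotune explicitly alongside the target in the config database and restart the MONs. This is untested on this cluster and based only on the condition named in the summary:
~~~
ceph config set mon mon_memory_autotune true
ceph config set mon mon_memory_target 68719476736   # 64 GiB
ceph orch restart mon                                # cephadm-managed MON daemons
~~~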

This system is critical to BMW's automotive development because of the work it does for 48+ hours over the weekend; the system is idle during the week. Starting Monday, the customer will be doing RHEL 8.x to 9.x upgrades in preparation for the RHCS 5.3z5 to RHCS 6.1z upgrade a week from Monday. They fully expect the issue with the MONs to happen again, but they are available for remote sessions Monday through Friday with an eye toward investigating this issue.

In case you were wondering: once my shift was over and the customer felt that someone had "heard them", they put a MON store compaction in cron every 15 minutes and the adoption finished.
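Their exact crontab entry was not shared; the stop-gap amounts to something like this on an admin node:
~~~
# /etc/cron.d/ceph-mon-compact -- illustrative only
*/15 * * * * root cephadm shell -- ceph tell 'mon.*' compact >/dev/null 2>&1
~~~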

Version-Release number of selected component (if applicable):  RHCS 5.3z5


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Best regards,
Manny

Comment 1 RHEL Program Management 2024-02-19 03:53:49 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.