.Reduce OSD memory usage for Ceph Object Gateway workloads
The OSD memory usage was tuned to reduce unnecessary memory consumption, especially for Ceph Object Gateway workloads.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2261
Description of problem:

Continuous OSD memory usage growth in a HEALTH_OK cluster.

     health HEALTH_OK
     monmap e1: 3 mons at {mon032-node=192.168.1.124:6789/0,mon033-node=192.168.1.189:6789/0,mon034-node=192.168.1.252:6789/0}
            election epoch 74, quorum 0,1,2 mon032-node,mon033-node,mon034-node
     osdmap e109308: 266 osds: 266 up, 266 in
            flags require_jewel_osds
      pgmap v34020501: 11208 pgs, 19 pools, 63034 GB data, 28983 kobjects
            185 TB used, 1415 TB / 1601 TB avail
               11206 active+clean
                   2 active+clean+scrubbing

The only thing I see is that the sortbitwise and recovery_deletes flags are not set (a sketch for checking and setting them appears below).

From the configuration side:
- The PG count looks good: 130-150 PGs per OSD is not too high and is within the recommended average.
- These are data OSDs.

$ cat sos_commands/process/ps_auxwww | grep ceph-osd
USER        PID  %CPU %MEM      VSZ     RSS TTY STAT START     TIME COMMAND
ceph     577907   0.7  1.0  2788452 1390624  ?  Ssl  Jun20   101:57 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph     628569   0.8  1.0  2793660 1383116  ?  Ssl  Jun20   105:38 /usr/bin/ceph-osd -f --cluster ceph --id 32 --setuser ceph --setgroup ceph
ceph    1437216  11.7  4.0  7959848 5282616  ?  Ssl  May25  5884:42 /usr/bin/ceph-osd -f --cluster ceph --id 69 --setuser ceph --setgroup ceph
ceph    1449074  13.3  2.9  6766348 3902308  ?  Ssl  May25  6666:02 /usr/bin/ceph-osd -f --cluster ceph --id 99 --setuser ceph --setgroup ceph
ceph    2423656  12.1  4.9  9440528 6511880  ?  Ssl  May20  6932:06 /usr/bin/ceph-osd -f --cluster ceph --id 216 --setuser ceph --setgroup ceph
ceph    2423667  10.4  1.6  4959508 2140976  ?  Ssl  May20  5969:04 /usr/bin/ceph-osd -f --cluster ceph --id 275 --setuser ceph --setgroup ceph
ceph    2423710  14.1  2.7  6722064 3636048  ?  Ssl  May20  8082:32 /usr/bin/ceph-osd -f --cluster ceph --id 129 --setuser ceph --setgroup ceph
ceph    2423713  12.3  2.9  6622516 3837800  ?  Ssl  May20  7047:32 /usr/bin/ceph-osd -f --cluster ceph --id 190 --setuser ceph --setgroup ceph
ceph    2423714  17.9  7.1 12669624 9321828  ?  Ssl  May20 10277:10 /usr/bin/ceph-osd -f --cluster ceph --id 248 --setuser ceph --setgroup ceph
ceph    2423715  16.2  2.2  6013768 2988384  ?  Ssl  May20  9299:55 /usr/bin/ceph-osd -f --cluster ceph --id 160 --setuser ceph --setgroup ceph

/dev/sdd1  779890608    2735620  777154988   1% /var/lib/ceph/osd/ceph-5
/dev/sdk1 7811388396 1124562080 6686826316  15% /var/lib/ceph/osd/ceph-216
/dev/sdf1 7811388396  785395816 7025992580  11% /var/lib/ceph/osd/ceph-69
/dev/sdg1 7811388396  891768084 6919620312  12% /var/lib/ceph/osd/ceph-99
/dev/sdh1 7811388396  990338308 6821050088  13% /var/lib/ceph/osd/ceph-129
/dev/sdm1 7811388396  867845728 6943542668  12% /var/lib/ceph/osd/ceph-275
/dev/sdj1 7811388396  875274412 6936113984  12% /var/lib/ceph/osd/ceph-190
/dev/sde1  779890608    3192556  776698052   1% /var/lib/ceph/osd/ceph-32
/dev/sdl1 7811388396  906215476 6905172920  12% /var/lib/ceph/osd/ceph-248
/dev/sdi1 7811388396  919347392 6892041004  12% /var/lib/ceph/osd/ceph-160

OSD.32 and OSD.5 are SSD OSDs. The OSDs with the highest %MEM are 248, 216, and 69.
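To rank consumers at a glance, a small sketch over the sosreport capture above; the field positions assume the exact command lines shown here, where RSS is the 6th whitespace-separated field (in KiB) and the --id value is the 16th:

$ awk '/ceph-osd/ { printf "osd.%-4s %6.1f GiB\n", $16, $6/1048576; total += $6 }
       END        { printf "total    %6.1f GiB\n", total/1048576 }' sos_commands/process/ps_auxwww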
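On the flags: a minimal sketch, assuming access to a node with the admin keyring. sortbitwise is safe to set once every OSD runs Jewel or later (require_jewel_osds is already set here, so that should hold); recovery_deletes is a Luminous-era flag, so its absence is expected on a 10.2.x cluster and it is not settable here:

$ ceph osd dump | grep flags     # confirm which osdmap flags are set cluster-wide
$ ceph tell osd.\* version       # optional sanity check that all OSDs report Jewel
$ ceph osd set sortbitwise       # enable the bitwise sort order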
Perf top is failing on these three OSDs with:

[22620.236007] perf: interrupt took too long (5014 > 4975), lowering kernel.perf_event_max_sample_rate to 39000

OMAP sizes:
===============
du -sh /var/lib/ceph/osd/ceph-*/current/omap
323M /var/lib/ceph/osd/ceph-129/current/omap
336M /var/lib/ceph/osd/ceph-160/current/omap
193M /var/lib/ceph/osd/ceph-190/current/omap
172M /var/lib/ceph/osd/ceph-216/current/omap
360M /var/lib/ceph/osd/ceph-248/current/omap
307M /var/lib/ceph/osd/ceph-69/current/omap
280M /var/lib/ceph/osd/ceph-275/current/omap
1.9G /var/lib/ceph/osd/ceph-32/current/omap
1.4G /var/lib/ceph/osd/ceph-5/current/omap
304M /var/lib/ceph/osd/ceph-99/current/omap

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.4 async
ceph-osd-10.2.7-48.el7cp.x86_64

How reproducible:
Always in the customer environment.
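For the next capture on the worst offenders (osd.248, osd.216, osd.69), the tcmalloc heap introspection exposed through the OSD admin interface may help separate a genuine leak from allocator retention; this is a hedged suggestion assuming the default tcmalloc build of these packages:

$ ceph tell osd.248 heap stats   # tcmalloc breakdown: bytes in use vs. freelists
$ ceph tell osd.216 heap stats
$ ceph tell osd.69 heap stats

# If most of the RSS sits in the freelists, it can be handed back to the kernel;
# if RSS barely moves after this, the growth is live allocation, not caching:
$ ceph tell osd.248 heap release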