Description of problem:

Continuous OSD memory usage growth in a HEALTH_OK cluster.

     health HEALTH_OK
     monmap e1: 3 mons at {mon032-node=192.168.1.124:6789/0,mon033-node=192.168.1.189:6789/0,mon034-node=192.168.1.252:6789/0}
            election epoch 74, quorum 0,1,2 mon032-node,mon033-node,mon034-node
     osdmap e109308: 266 osds: 266 up, 266 in
            flags require_jewel_osds
      pgmap v34020501: 11208 pgs, 19 pools, 63034 GB data, 28983 kobjects
            185 TB used, 1415 TB / 1601 TB avail
               11206 active+clean
                   2 active+clean+scrubbing

The only thing I see is that the sortbitwise and recovery_deletes flags are not set.

From the configuration side:
- PG count looks good: 130-150 PGs per OSD, which is not too high and is around the recommended average.
- These are data OSDs.

$ cat sos_commands/process/ps_auxwww | grep ceph-osd
USER        PID %CPU %MEM      VSZ     RSS TTY STAT START     TIME COMMAND
ceph     577907  0.7  1.0  2788452 1390624 ?   Ssl  Jun20   101:57 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph     628569  0.8  1.0  2793660 1383116 ?   Ssl  Jun20   105:38 /usr/bin/ceph-osd -f --cluster ceph --id 32 --setuser ceph --setgroup ceph
ceph    1437216 11.7  4.0  7959848 5282616 ?   Ssl  May25  5884:42 /usr/bin/ceph-osd -f --cluster ceph --id 69 --setuser ceph --setgroup ceph
ceph    1449074 13.3  2.9  6766348 3902308 ?   Ssl  May25  6666:02 /usr/bin/ceph-osd -f --cluster ceph --id 99 --setuser ceph --setgroup ceph
ceph    2423656 12.1  4.9  9440528 6511880 ?   Ssl  May20  6932:06 /usr/bin/ceph-osd -f --cluster ceph --id 216 --setuser ceph --setgroup ceph
ceph    2423667 10.4  1.6  4959508 2140976 ?   Ssl  May20  5969:04 /usr/bin/ceph-osd -f --cluster ceph --id 275 --setuser ceph --setgroup ceph
ceph    2423710 14.1  2.7  6722064 3636048 ?   Ssl  May20  8082:32 /usr/bin/ceph-osd -f --cluster ceph --id 129 --setuser ceph --setgroup ceph
ceph    2423713 12.3  2.9  6622516 3837800 ?   Ssl  May20  7047:32 /usr/bin/ceph-osd -f --cluster ceph --id 190 --setuser ceph --setgroup ceph
ceph    2423714 17.9  7.1 12669624 9321828 ?   Ssl  May20 10277:10 /usr/bin/ceph-osd -f --cluster ceph --id 248 --setuser ceph --setgroup ceph
ceph    2423715 16.2  2.2  6013768 2988384 ?   Ssl  May20  9299:55 /usr/bin/ceph-osd -f --cluster ceph --id 160 --setuser ceph --setgroup ceph

Filesystem usage on the OSD data partitions:

/dev/sdd1   779890608    2735620  777154988   1% /var/lib/ceph/osd/ceph-5
/dev/sdk1  7811388396 1124562080 6686826316  15% /var/lib/ceph/osd/ceph-216
/dev/sdf1  7811388396  785395816 7025992580  11% /var/lib/ceph/osd/ceph-69
/dev/sdg1  7811388396  891768084 6919620312  12% /var/lib/ceph/osd/ceph-99
/dev/sdh1  7811388396  990338308 6821050088  13% /var/lib/ceph/osd/ceph-129
/dev/sdm1  7811388396  867845728 6943542668  12% /var/lib/ceph/osd/ceph-275
/dev/sdj1  7811388396  875274412 6936113984  12% /var/lib/ceph/osd/ceph-190
/dev/sde1   779890608    3192556  776698052   1% /var/lib/ceph/osd/ceph-32
/dev/sdl1  7811388396  906215476 6905172920  12% /var/lib/ceph/osd/ceph-248
/dev/sdi1  7811388396  919347392 6892041004  12% /var/lib/ceph/osd/ceph-160

OSD.32 and OSD.5 are SSD OSDs. The OSDs with the highest %MEM are 248, 216 and 69.
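To confirm which cluster flags are set and to see where the memory in the heaviest OSDs is going, something like the following could be run. This is a rough sketch, not part of the original report: the OSD IDs 248, 216 and 69 come from the ps output above, and the heap commands assume the OSDs are linked against tcmalloc (the default for these builds).

$ ceph osd dump | grep flags
$ ceph osd set sortbitwise    # only after confirming every OSD in the cluster supports it

# tcmalloc heap statistics for the three largest memory consumers
$ for id in 248 216 69; do ceph tell osd.$id heap stats; done

# if the stats show a large amount of freed-but-unreturned memory,
# it can be handed back to the OS with:
$ ceph tell osd.248 heap release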
perf top is failing on these three OSDs with:

[22620.236007] perf: interrupt took too long (5014 > 4975), lowering kernel.perf_event_max_sample_rate to 39000

OMAP sizes:
===============
du -sh /var/lib/ceph/osd/ceph-*/current/omap
323M /var/lib/ceph/osd/ceph-129/current/omap
336M /var/lib/ceph/osd/ceph-160/current/omap
193M /var/lib/ceph/osd/ceph-190/current/omap
172M /var/lib/ceph/osd/ceph-216/current/omap
360M /var/lib/ceph/osd/ceph-248/current/omap
307M /var/lib/ceph/osd/ceph-69/current/omap
280M /var/lib/ceph/osd/ceph-275/current/omap
1.9G /var/lib/ceph/osd/ceph-32/current/omap
1.4G /var/lib/ceph/osd/ceph-5/current/omap
304M /var/lib/ceph/osd/ceph-99/current/omap

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.4 async
ceph-osd-10.2.7-48.el7cp.x86_64

How reproducible:
Always in the customer environment.
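As an additional data point while perf top is unusable on these nodes, the resident set size of the suspect OSDs could be sampled over time to confirm that the growth is continuous rather than a one-off spike. A minimal sketch, not from the original report, with the PIDs taken from the ps output above and an assumed log path:

# log RSS (kB) of the three largest OSDs once a minute
$ while true; do
      for pid in 2423714 2423656 1437216; do
          echo "$(date +%s) $pid $(awk '/VmRSS/ {print $2}' /proc/$pid/status)"
      done >> /var/tmp/osd-rss.log
      sleep 60
  done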
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2261