Created attachment 1182208 [details]
screenshot 1: graphite-generated chart of memory consumption of all machines of the cluster

Description of problem
======================

Memory consumption on the monitor machine hosting the calamari server
*gradually grows in a linear way*. The severity of this bug and the
associated risks depend on the long-term behavior under load, which is yet
to be tested.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-0.0.34-1.el7scon.x86_64
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-28.el7scon.noarch

On Ceph MON machines:

rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-agent-0.0.15-1.el7scon.noarch
ceph-selinux-10.2.2-22.el7cp.x86_64
ceph-common-10.2.2-22.el7cp.x86_64
ceph-base-10.2.2-22.el7cp.x86_64
ceph-mon-10.2.2-22.el7cp.x86_64
calamari-server-1.4.6-1.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. Wait for at least 10 hours.
5. Go to the graphite web interface and select memory consumption for every
   machine of the cluster.

Actual results
==============

Even without actual load, the memory consumption on the machine hosting both
the ceph monitor and calamari grows in a linear fashion, as can be seen in
attached screenshot 1 (graph generated by the graphite web interface).
Compare this with the memory consumption trends of the other machines, which
are basically constant (the expected behavior here).

Expected results
================

While memory consumption on the machine hosting both the ceph monitor and
calamari is expected to be higher than on monitor-only machines, it should
not grow in a linear way.
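For reference, a chart like screenshot 1 can also be pulled directly from the
graphite render API instead of the web UI. This is only a sketch: the host
name, URL path and metric path below are assumptions and need to be adjusted
to whatever the metric tree on the RHSC server actually contains.

~~~
# Sketch only: host, URL path and metric path are assumptions; check the
# metric tree in the graphite web UI first and adjust accordingly.
curl -o memory-used.png \
  "http://rhsc-server.example.com/render?target=*.memory.memory-used&from=-10hours&format=png"
~~~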
What is the memory consumption of the calamari process? Does it grow linearly?
Created attachment 1184541 [details]
screenshot 2: new memory consumption chart (calamari-server-1.4.7-1.el7cp.x86_64)

Update
======

Testing with the new build from Monday (calamari-server-1.4.7-1.el7cp.x86_64),
and so far I see the same behavior as shown in the original report.

As decided on the bug triage meeting, I'm going to attach logs of rss and size
of the calamari-lite process later, which is crucial here: we need to know how
the consumption behaves over longer periods of time. When I have the data
ready, I will answer the needinfo flag.
(In reply to Nishanth Thomas from comment #1)
> What is the memory consumption of the calamari process? Does it grow linearly?

Based on the evidence I already have, I can state that yes, RSS of the
calamari-lite process grows in a linear way. Yesterday, RSS of the
calamari-lite process was 113756 kB, and now I see it's already at 262500 kB.
I will provide full logs and plot them into charts when I have long-term data,
as we decided on the bug triage meeting.
Could the calamari dev team check this? Have you seen this in your environment?
I'm not seeing it grow quite as rapidly. I will investigate and advise. I had an instance up for 7 days that only made it to 700 MB RSS.
It looks like linear growth according to QE
Created attachment 1185165 [details]
figure 1: calamari-lite rss during 2 days

Attaching a plot of RSS (physical memory consumption) of the calamari-lite
process after 2 days of logging. As you can see, the trend is quite clear.
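To put a rough number on that trend, the average growth can be estimated from
the first and last samples of the log (source data is attached in the
following comment). This is only a sketch and assumes the one-sample-per-minute
layout shown there, with RSS in kB in the second column.

~~~
# Sketch: estimate average RSS growth from the attached log
# (assumes one sample per minute, RSS in kB in column 2).
awk 'NR==1 {first=$2} {last=$2; n=NR}
     END {printf "samples: %d, RSS growth: %d kB (~%.0f kB/hour)\n",
          n, last - first, (last - first) / (n / 60)}' calamari-watch.2days.log
~~~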
Created attachment 1185166 [details]
source data for figure 1 (rss and size of calamari-lite process)

Attaching source data for figure 1 (calamari-lite process RSS during 2 days).
The file has the following format:

~~~
<timestamp> <rss-in-kilobytes> <size-in-kilobytes>
~~~

The quick summary is:

~~~
$ head calamari-watch.2days.log
2016-07-26T16:33 113756 1314692
2016-07-26T16:34 113792 1314692
2016-07-26T16:35 114036 1314884
2016-07-26T16:36 114648 1315384
2016-07-26T16:37 114724 1315528
2016-07-26T16:38 115104 1315976
2016-07-26T16:39 115124 1315976
2016-07-26T16:40 115252 1316268
2016-07-26T16:41 115840 1316432
2016-07-26T16:42 115920 1316880
$ tail calamari-watch.2days.log
2016-07-28T17:11 496996 1697944
2016-07-28T17:12 497380 1698336
2016-07-28T17:13 497384 1698336
2016-07-28T17:14 497436 1698336
2016-07-28T17:15 497512 1698664
2016-07-28T17:16 497644 1698664
2016-07-28T17:17 498188 1699120
2016-07-28T17:18 498284 1699272
2016-07-28T17:19 498600 1699440
2016-07-28T17:20 498976 1699908
~~~
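For completeness, data in this format can be collected with a trivial loop
reading /proc. The following is only a sketch of how such a log might be
gathered (file name and the one-minute interval are assumptions matching the
attached log), not necessarily the exact script that was used.

~~~
#!/bin/bash
# Sketch of a logging loop producing "<timestamp> <rss-kB> <size-kB>" lines;
# /proc/<pid>/status reports VmRSS and VmSize in kB.
while true; do
    pid=$(pgrep -f calamari-lite | head -n 1)
    if [ -n "$pid" ]; then
        rss=$(awk '/^VmRSS:/  {print $2}' "/proc/$pid/status")
        size=$(awk '/^VmSize:/ {print $2}' "/proc/$pid/status")
        echo "$(date +%Y-%m-%dT%H:%M) $rss $size" >> calamari-watch.log
    fi
    sleep 60
done
~~~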
I'm in agreement that this needs a fix; I have a few ready. While I don't
expect to eliminate all growth in this time frame, I can reduce its slope and
set a hard limit in systemd so that if the process grows too much, it will be
restarted. I should be able to make such restarts infrequent.
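For illustration, a systemd hard limit of that kind could look like the
drop-in below. This is only a hedged sketch, not the actual change shipped in
calamari: the unit name (calamari.service) and the 256M value are assumptions,
and on RHEL 7 (cgroup v1) the relevant directive is MemoryLimit= rather than
the newer MemoryMax=.

~~~
# Sketch only: unit name and limit value are assumptions, not the shipped fix.
mkdir -p /etc/systemd/system/calamari.service.d
cat > /etc/systemd/system/calamari.service.d/memory.conf <<'EOF'
[Service]
# Kill the service when it exceeds the limit and restart it automatically
MemoryLimit=256M
Restart=on-failure
EOF
systemctl daemon-reload
systemctl restart calamari.service
~~~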
https://github.com/ceph/calamari/releases/tag/v1.4.8
Created attachment 1186856 [details]
figure 2: calamari-lite rss during one week

Just for the record, I'm attaching a chart of calamari-lite memory consumption
for a whole week.
Created attachment 1186858 [details]
source data for figure 2 (rss and size of calamari-lite process)

Attaching source data for figure 2 (from the previous comment).
Checking with
=============

On Monitor/Calamari machine:

calamari-server-1.4.8-1.el7cp.x86_64

On RHSC 2.0 server machine:

rhscon-ui-0.0.51-1.el7scon.noarch
rhscon-core-0.0.38-1.el7scon.x86_64
rhscon-ceph-0.0.38-1.el7scon.x86_64
rhscon-core-selinux-0.0.38-1.el7scon.noarch

On Ceph 2.0 machines:

rhscon-core-selinux-0.0.38-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

Verification
============

I observed memory consumption of the calamari-lite process for about 3 days
and noticed a significant change:

* at first, the memory consumption was growing (though at a somewhat
  different rate compared to the previous behavior),
* when the RSS reached about 190 MB, the memory consumption stopped growing
  and stayed there.

So memory consumption of the calamari-lite process no longer grows linearly
without bound, but levels off at about 190 MB RSS.

>> VERIFIED
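As a side note, the plateau can be spot-checked at any time without the full
logging loop; a minimal sketch, assuming calamari-lite is still running under
that name:

~~~
# Sketch: print PID, RSS and VSZ (in kB) plus elapsed time of calamari-lite
$ ps -o pid,rss,vsz,etime,args -p "$(pgrep -f calamari-lite | head -n 1)"
~~~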
Created attachment 1188580 [details]
figure 3: calamari-lite rss during 3 days (QE verification)

Attaching evidence for verification: figure 3.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html