Created attachment 1182208 [details]
screenshot 1: graphite generated chart of memory consumption of all machines of the cluster
Description of problem
Memory consumption on monitor machine hosting calamari server *gradually
grows in a linear way*.
The severity of this bug and the associated risks depend on the long term
behavior under load, which is yet to be tested.
On RHSC 2.0 server machine:
On Ceph MON machines:
Steps to Reproduce
1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. Wait for at least 10 hours.
5. Go to the graphite web interface and select memory consumption for every
machine of the cluster.
Even without actual load, the memory consumption on the machine hosting both
ceph monitor and calamari grows in a linear fashion, as can be seen on
attached screenshot #1 (graph generated by graphite web interface).
Compare this with memory consumption trends from other machines, which are
basically constant (the expected behavior here).
While memory consumption on the machine hosting both ceph monitor and calamari
is higher compared to monitor only machines, it doesn't grow in a linear way.
What is memory consumption of the calamari process? Does it grow linearly?
Created attachment 1184541 [details]
screenshot 2: new memory consumption chart (calamari-server-1.4.7-1.el7cp.x86_64)
Testing with new build from Monday with calamari-server-1.4.7-1.el7cp.x86_64,
and so far I see the same behavior as shown in the original report.
As decided at the bug triage meeting, I will later attach logs of RSS and size
for the calamari-lite process. This data is crucial here: we need to know how
the consumption behaves over longer periods of time. When I have the data
ready, I will answer the needinfo flag.
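
A log of RSS and size for a single process could be collected with a small
sampler like the sketch below. This is only an illustration under stated
assumptions, not the tool used for this report: it reads VmRSS and VmSize from
/proc/<pid>/status on Linux, and the helper names (parse_status, format_sample,
watch) and the one line per minute layout are hypothetical choices. VmSize may
not match the "size" value logged elsewhere in this report.

```python
import time
from datetime import datetime


def parse_status(text):
    """Extract (rss_kb, size_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("VmRSS", "VmSize"):
            # The value looks like "  113756 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    # Kernel threads have no VmRSS/VmSize lines and raise KeyError here.
    return fields["VmRSS"], fields["VmSize"]


def format_sample(ts, rss_kb, size_kb):
    """Format one log line: <timestamp> <rss-in-kB> <size-in-kB>."""
    return f"{ts.strftime('%Y-%m-%dT%H:%M')} {rss_kb} {size_kb}"


def watch(pid, path, interval=60, count=None):
    """Append one sample per `interval` seconds to `path`.

    Runs forever when `count` is None, otherwise takes `count` samples.
    """
    taken = 0
    while count is None or taken < count:
        with open(f"/proc/{pid}/status") as status:
            rss_kb, size_kb = parse_status(status.read())
        with open(path, "a") as log:
            log.write(format_sample(datetime.now(), rss_kb, size_kb) + "\n")
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

For example, watch(1234, "calamari-watch.log") would sample a process with the
hypothetical PID 1234 once a minute until interrupted.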
(In reply to Nishanth Thomas from comment #1)
> What is memory consumption of the calamari process? Does it grow linearly?
Based on the evidence I already have, I can state that yes, RSS of the
calamari-lite process grows in a linear way. Yesterday, RSS of the calamari-lite
process was 113756 kB, and now I see it's 262500 kB already.
I will provide full logs and plot them into charts when I have long term data,
as we decided at the bug triage meeting.
Could calamari dev team check this? Have you seen this in your environment?
I'm not seeing it grow quite as rapidly. I will investigate and advise. I had an instance up for 7 days that only made it to 700M RSS.
It looks like linear growth according to QE
Created attachment 1185165 [details]
figure 1: calamari-lite rss during 2 days
Attaching plot of RSS (physical memory consumption) of calamari-lite process
after 2 days of logging.
As you can see, the trend is quite clear.
Created attachment 1185166 [details]
source data for figure 1 (rss and size of calamari-lite process)
Attaching source data for figure 1 (calamari-lite process RSS during 2 days).
The file has the following format:
<timestamp> <rss-in-kiloBytes> <size-in-kiloBytes>
The quick summary is:
$ head calamari-watch.2days.log
2016-07-26T16:33 113756 1314692
2016-07-26T16:34 113792 1314692
2016-07-26T16:35 114036 1314884
2016-07-26T16:36 114648 1315384
2016-07-26T16:37 114724 1315528
2016-07-26T16:38 115104 1315976
2016-07-26T16:39 115124 1315976
2016-07-26T16:40 115252 1316268
2016-07-26T16:41 115840 1316432
2016-07-26T16:42 115920 1316880
$ tail calamari-watch.2days.log
2016-07-28T17:11 496996 1697944
2016-07-28T17:12 497380 1698336
2016-07-28T17:13 497384 1698336
2016-07-28T17:14 497436 1698336
2016-07-28T17:15 497512 1698664
2016-07-28T17:16 497644 1698664
2016-07-28T17:17 498188 1699120
2016-07-28T17:18 498284 1699272
2016-07-28T17:19 498600 1699440
2016-07-28T17:20 498976 1699908
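
The growth rate can be estimated from a log in this format with an ordinary
least squares fit of RSS against elapsed time. The sketch below is a minimal
illustration, not part of the original analysis; the function names are
hypothetical, and the sample uses only four lines copied from the head and tail
output above rather than the full attachment.

```python
from datetime import datetime


def parse_line(line):
    """Parse '<timestamp> <rss-kB> <size-kB>' into (datetime, int, int)."""
    ts, rss, size = line.split()
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M"), int(rss), int(size)


def rss_slope_kb_per_hour(lines):
    """Least squares slope of RSS (kB) over elapsed time (hours)."""
    samples = [parse_line(line) for line in lines if line.strip()]
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() / 3600.0 for ts, _, _ in samples]
    ys = [float(rss) for _, rss, _ in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need samples taken at two or more distinct times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Four lines taken from the head/tail summary above.
SAMPLE = [
    "2016-07-26T16:33 113756 1314692",
    "2016-07-26T16:42 115920 1316880",
    "2016-07-28T17:11 496996 1697944",
    "2016-07-28T17:20 498976 1699908",
]

if __name__ == "__main__":
    print(f"{rss_slope_kb_per_hour(SAMPLE):.0f} kB/hour")
```

On these four samples the fit comes out a little under 8000 kB per hour, which
agrees with the simple end to end estimate: 498976 - 113756 = 385220 kB over
about 48.8 hours.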
I'm in agreement this needs a fix. I have a few ready. While I don't expect to eliminate all growth in this time frame, I can reduce its slope and set a hard limit in systemd so that if we grow too much the process will restart. I should be able to get that to be infrequent.
Created attachment 1186856 [details]
1185165: figure 2: calamari-lite rss during one week
Just for the record, I'm attaching a chart of calamari-lite memory consumption
for a whole week.
Created attachment 1186858 [details]
source data for figure 2 (rss and size of calamari-lite process)
Attaching source data for figure 2 (from previous comment).
On Monitor/Calamari machine:
On RHSC 2.0 server machine:
On Ceph 2.0 machines:
I observed memory consumption of calamari-lite process for about 3 days, and
noticed a significant change:
* at first, the memory consumption was growing (but the rate was a bit
different compared to the previous behaviour),
* when the RSS reached about 190 MB, the memory consumption stopped growing and
So memory consumption of calamari-lite process no longer grows in a linear way
indefinitely, but stops at about 190 MB RSS.
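
A plateau like this can also be checked mechanically on the logged RSS values,
for instance by requiring that the most recent samples stay within a small
band. The sketch below is one possible check, not the method used for this
verification; the function name, window and tolerance are hypothetical, and
the flat values in the usage note are made up for illustration.

```python
def has_plateaued(rss_kb, window, tolerance_kb):
    """Return True if the last `window` samples span at most `tolerance_kb`.

    `rss_kb` is a list of RSS samples in kB, oldest first.
    """
    if window < 2 or len(rss_kb) < window:
        return False  # not enough data to judge
    tail = rss_kb[-window:]
    return max(tail) - min(tail) <= tolerance_kb
```

With the first five RSS values from the original two day log (113756, 113792,
114036, 114648, 114724) and a tolerance of 100 kB the check fails, since the
values span 968 kB. A made-up flat series such as 190000, 190040, 189990,
190010 passes.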
Created attachment 1188580 [details]
figure 3: calamari-lite rss during 3 days (QE verification)
Attaching evidence for verification: figure 3.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.