Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1367735

Summary:

qpidd consuming huge memory during scale

Product:

Red Hat Satellite

Reporter:

Pradeep Kumar Surisetty <psuriset>

Component:

Qpid

Assignee:

Mike Cressman <mcressma>

Status:

CLOSED ERRATA

QA Contact:

Pradeep Kumar Surisetty <psuriset>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

6.2.9

CC:

alexandre.chanu, andrew.schofield, aperotti, arahaman, asanders, bbuckingham, bill.scherer, bkearney, brcoca, cduryee, chrobert, codonell, cwelton, dcaplan, dkaylor, dmoessne, egolov, ehelms, gmurthy, hyu, jcallaha, jentrena, jhutar, jortel, jross, jswensso, kabbott, kim.vdriet, mcressma, mhrivnak, mlinden, mmccune, oshtaier, pdwyer, pmoravec, psuriset, sauchter, schamilt, sokeeffe, tross, wpinheir, xdmoon

Target Milestone:

Unspecified

Keywords:

FieldEngineering, Performance, PrioBumpField, PrioBumpPM, Reopened, Triaged

Target Release:

Unused

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

scale_lab

Fixed In Version:

qpid-dispatch-0.4-21, qpid-cpp-0.34-25

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1463800 (view as bug list)

Environment:

Last Closed:

2017-08-17 19:41:10 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1450948, 1468450

Bug Blocks:

1385841

Attachments:

Description	Flags
qpid mem usage	none
qpid mem usage	none
mem usage status after last hf	none
qpid-stat output after last hf	none
qpid with 3k content hosts	none
tcpdump from reproducer (all AMQP traffic unencrypted)	none

Description Pradeep Kumar Surisetty 2016-08-17 10:47:42 UTC

Created attachment 1191572 [details]
qpid mem usage

Description of problem:


Started scaling of content hosts registration to satellite 6.2/capsule.
Have registered 2k hosts per capsule.  Registered total 10k content hosts. 

qpid started consuming more memory.  Its close to 16G 



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start scaling of content hosts registrations
2.
3.

Actual results:

more qpid memory usage

Expected results:

lesser qpid usage

Additional info:

Comment 1 Pradeep Kumar Surisetty 2016-08-18 09:39:11 UTC

Created attachment 1191831 [details]
qpid mem usage

Comment 12 Jan Hutař 2016-09-25 19:44:32 UTC

On a clients there was gofer-2.7.6-1.el7sat.noarch and katello-agent-2.5.0-5.el7sat.noarch

Comment 15 Pavel Moravec 2016-10-02 19:47:27 UTC

FYI from the numbers:

  uptime      cluster       connections  sessions  exchanges  queues
  ====================================================================
  21h 4m 40s  <standalone>  58           18,251    14         15,684

# connections include AMQP 1.0 i.e. qdrouterd. While # sessions does _not_ include AMQP 1.0 sessions - that number comes from pulp+katello+candlepin connections only.

I suspect pulp worker/beat doing something nasty/unexpected, causing # of sessions growing over the time - see #c6 how the number grows. This imho causes or at least contributes to the leak.

Definitely worth taking (on reproduced system) list of connections + list of sessions. qpid-stat cant print list of sessions until this patch applied:

# diff -rup /usr/bin/qpid-stat.orig /usr/bin/qpid-stat
--- /usr/bin/qpid-stat.orig	2016-10-02 17:57:13.434000000 +0200
+++ /usr/bin/qpid-stat	2016-10-02 21:44:19.100000000 +0200
@@ -78,6 +78,7 @@ def OptionsAndArguments(argv):
     group2 = OptionGroup(parser, "Command Options")
     group2.add_option("-g", "--general", help="Show General Broker Stats",  action="store_const", const="g",   dest="show")
     group2.add_option("-c", "--connections", help="Show Connections",       action="store_const", const="c",   dest="show")
+    group2.add_option("-s", "--sessions", help="Show Sessions",             action="store_const", const="s",   dest="show")
     group2.add_option("-e", "--exchanges", help="Show Exchanges",           action="store_const", const="e",   dest="show")
     group2.add_option("-q", "--queues", help="Show Queues",                 action="store_const", const="q",   dest="show")
     group2.add_option("-u", "--subscriptions", help="Show Subscriptions",   action="store_const", const="u",   dest="show")
@@ -94,7 +95,7 @@ def OptionsAndArguments(argv):
     opts, args = parser.parse_args(args=argv)
 
     if not opts.show:
-        parser.error("You must specify one of these options: -g, -c, -e, -q, -m, or -u. For details, try $ qpid-stat --help")
+        parser.error("You must specify one of these options: -g, -c, -s, -e, -q, -m, or -u. For details, try $ qpid-stat --help")
 
     config._types = opts.show
     config._sortcol = opts.sort_by
@@ -234,7 +235,52 @@ class BrokerManager:
         disp.formattedTable(title, heads, dispRows)
 
     def displaySession(self):
+        #print self.broker.getAllSessions()
+        #print self.getSessionMap()
         disp = Display(prefix="  ")
+        heads = []
+# {TxnCommits:0, TxnCount:0, TxnRejects:0, TxnStarts:0, attached:True, channelId:1, clientCredit:0, connectionRef:{_agent_epoch:6, _object_name:org.apache.qpid.broker:connection:qpid.127.0.0.1:5672-127.0.0.1:50911}, detachedLifespan:0, framesOutstanding:0, fullName:anonymous.bb5508c9-1986-45f4-8d60-e8ccbb82051a, name:anonymous.bb5508c9-1986-45f4-8d60-e8ccbb82051a, unackedMessages:0, vhostRef:{_object_name:org.apache.qpid.broker:vhost:org.apache.qpid.broker:broker:amqp-broker,/}}
+        heads.append(Header("session"))
+        heads.append(Header("conn"))
+        heads.append(Header("channelId"))
+        heads.append(Header("clientCredit"))
+        heads.append(Header("attached", Header.Y))
+        heads.append(Header("TxnCommits", Header.COMMAS))
+        heads.append(Header("TxnCount", Header.COMMAS))
+        heads.append(Header("TxnRejects", Header.COMMAS))
+        heads.append(Header("TxnStarts", Header.COMMAS))
+        heads.append(Header("detachedLifespan", Header.COMMAS))
+        heads.append(Header("framesOutstanding", Header.COMMAS))
+        heads.append(Header("unacked", Header.COMMAS))
+        rows = []
+        sessions = self.broker.getAllSessions()
+        connections = self.getConnectionMap()
+        for s in sessions:
+            row = []
+            try:
+                row.append(s.name)
+                connection = connections[s.connectionRef]
+                row.append(connection.address)
+                row.append(s.channelId)
+                row.append(s.clientCredit)
+                row.append(s.attached)
+                row.append(s.TxnCommits)
+                row.append(s.TxnCount)
+                row.append(s.TxnRejects)
+                row.append(s.TxnStarts)
+                row.append(s.detachedLifespan)
+                row.append(s.framesOutstanding)
+                row.append(s.unackedMessages)
+                rows.append(row)
+            except:
+                pass
+        title = "Sessions"
+        if config._sortcol:
+            sorter = Sorter(heads, rows, config._sortcol, config._limit, config._increasing)
+            dispRows = sorter.getSorted()
+        else:
+            dispRows = rows
+        disp.formattedTable(title, heads, dispRows)
 
     def displayExchange(self):
         disp = Display(prefix="  ")
#

(worth patching the system and then calling:

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -s
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -c
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -u

ps aux

)

Comment 17 Brian Bouterse 2016-10-03 20:05:09 UTC

Responding to Comment 14, there are several connections formed by each celery component. Each process connecting to Qpid uses at least 2 independent connections, one for message send/receive and the other for management commands. Furthermore each celery worker is actually 2 processes so that is 4 connections to Qpid per worker. Usually pulp systems have many workers with 1 resource manager and a lot of workers. So in other words it's not surprising to have a lot of open connections.

Comment 18 Michael Hrivnak 2016-10-03 20:31:23 UTC

Pavel, can you help us understand the assertion that the session count excludes amqp 1.0? Is that documented somewhere? Is there more context that explains why that would be the case?

Comment 19 Pavel Moravec 2016-10-04 10:45:23 UTC

(In reply to Michael Hrivnak from comment #18)
> Pavel, can you help us understand the assertion that the session count
> excludes amqp 1.0? Is that documented somewhere? Is there more context that
> explains why that would be the case?

I have to take this back - it takes some time to qdrouterd to re-establish the session and I run qpid-stat -g too early.

(I know that list of subscriptions (qpid-stat -u) does not contain AMQP 1.0 ones - see https://bugzilla.redhat.com/show_bug.cgi?id=1145141 for the Improvement BZ with some explanation why so same problem for sessions would not be a surprise for me - but qpidd counts and lists also AMQP 1.0 sessions properly, both on qpid-stat -g and in patched qpid-stat -s).

Digging more here, I found a reproducer triggering memory consumption of qpidd grows - but the sinner is qdrouterd.

Reproducer 1:
- "service goferd restart" on a content host - do it in a loop

Reproducer 2: (independent):
- install any package to a content host (even already installed) - do it in a loop
  - here I think this causes mem.growth _only_ when [1] is seen (but you wont see it in logs due to [2], sigh)

Observation:
- memory growth of qpidd grows over time
- # of sessions grows over time (qpid-stat -g)
- # of sessions from qdrouterd grows over time - use above qpid-stat patch and run:
  qpid-stat -s | grep <tcp-connection-to-qdrouterd> -c

to see this.

So, qdrouterd is not dropping AMQP sessions when goferd drops a connection&session&link (link-routed via qdrouterd to qpidd).

So, this is - at least partially - qdrouterd bug (and maybe this causes the mem.leak in qdrouterd?).

I definitely dont know if something else cant accumulate memory in qpidd, or if there is some mem.leak also in qpidd. But per a chat with jhutar, their preparation to the test & test itself basically did just above steps (but in big/scaled manner).

Ted, does the "qdrouterd not detaching session when link-routed link closes" sounds familiar? I can try to prepare Satellite-free reproducer but it takes time I dont have (at least this week).


[1] https://access.redhat.com/solutions/1614693
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1329497

Comment 20 Pavel Moravec 2016-10-04 17:33:44 UTC

(playing more on goferd restart scenario:

- (just) stopping tomcat / candlepin: no mem.growth, no session growth
- (just) stopping pulp: memory&session growth still present
- (just) stopping httpd: no mem.growth, no session growth
- (just) stopping foreman-tasks: memory&session growth still present
)

Comment 21 Pavel Moravec 2016-10-19 12:41:55 UTC

Reproducer for the qpidd memory growth and qdrouterd memory leak - same that found leak in goferd/proton in bz1272758 :)

1) add qpid broker queue pulp.bz1272758

2) Have qdrouterd link-routing pulp.* to qpidd

3) Run this script using proton reactor API:


#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException

from proton.utils import BlockingConnection

import traceback
import random

while True:
  sleep(1)
  try:
    conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648", ssl_domain=None, heartbeat=1)
    print "conn created"
    snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
    snd.close()
    sleep(2)
    snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
    snd.close()
    if (conn):
      conn.close()
    snd = None
    conn = None
  except Exception, e:
    print e
    conn.close()
    pass
print "done"
sleep(100)


4) monitor RSS usage and # sessions in qpid:

ps aux | grep -v grep | grep -e qpidd -e qdrouterd

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -g | grep standalone -B2

Comment 31 Pavel Moravec 2016-11-08 08:22:00 UTC

qdrouterd 0.4-19 passed my torturing well - neither python standalone script, or goferd restarts, or package installs cause mem.leak or segfault.

Just.. qpidd shows a tiny memory growth, permanent, in either reproducer. Number of sessions is stable, though. It seems like there remains a mem.leak in qpidd.

I will file a separate BZ for that.

Comment 35 Pavel Moravec 2016-11-08 14:20:56 UTC

(In reply to Pavel Moravec from comment #31)
> qdrouterd 0.4-19 passed my torturing well - neither python standalone
> script, or goferd restarts, or package installs cause mem.leak or segfault.
> 
> Just.. qpidd shows a tiny memory growth, permanent, in either reproducer.
> Number of sessions is stable, though. It seems like there remains a mem.leak
> in qpidd.
> 
> I will file a separate BZ for that.

Based on this comment, I mark the BZ as verified.

The BZ for qpidd leak is https://bugzilla.redhat.com/show_bug.cgi?id=1392768 .

Comment 37 jcallaha 2016-11-08 14:35:10 UTC

Bringing qpid-dispatch-0.4-19 into 6.2.4.

Comment 38 Pavel Moravec 2016-11-08 22:33:56 UTC

I managed to segfault even -19 but under quite rare circumstances (that imho does not justify dropping it from 6.2.4): if qpidd broker is down for few minutes (quite improbable and itself is a bigger problem) AND if goferd is frequently restarted or frequently sends some data to qdrouterd, (only) then the segfault is seen.

See https://bugzilla.redhat.com/show_bug.cgi?id=1393128

Comment 40 errata-xmlrpc 2016-11-10 08:13:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2699

Comment 41 Waldirio M Pinheiro 2016-11-10 14:54:59 UTC

Hi all

Issue was fixed with packages version 0.4-19.

Thank you Pavel!!!


Best Regards
-- 
Waldirio M Pinheiro | Senior Software Maintenance Engineer

Comment 42 Waldirio M Pinheiro 2016-11-10 14:55:51 UTC

Created attachment 1219418 [details]
mem usage status after last hf

Comment 43 Waldirio M Pinheiro 2016-11-10 14:56:19 UTC

Created attachment 1219419 [details]
qpid-stat output after last hf

Comment 44 Pradeep Kumar Surisetty 2016-12-04 00:29:52 UTC

Havent tested with huge scale this time due to resource limitation. 
Tried till 3k content hosts. Looks like its going in same direction. 

qpidd constantly grown till 6G as shown in attachment 

qpid stat:

[root@pman05 ~]#  qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt   -b amqps://localhost:5671 -g
Broker Summary:
  uptime          cluster       connections  sessions  exchanges  queues
  ========================================================================
  1d 10h 44m 23s  <standalone>  54           3,230     14         3,231

Aggregate Broker Statistics:
  Statistic                   Messages  Bytes
  ===================================================
  queue-depth                 6         5,436
  total-enqueues              220,531   319,494,432
  total-dequeues              220,525   319,488,996
  persistent-enqueues         20,659    130,652,178
  persistent-dequeues         20,653    130,646,742
  transactional-enqueues      0         0
  transactional-dequeues      0         0
  flow-to-disk-depth          0         0
  flow-to-disk-enqueues       0         0
  flow-to-disk-dequeues       0         0
  acquires                    220,525   
  releases                    0         
  discards-no-route           168,641   
  discards-ttl-expired        0         
  discards-limit-overflow     0         
  discards-ring-overflow      0         
  discards-lvq-replace        0         
  discards-subscriber-reject  0         
  discards-purged             0         
  reroutes                    0         
  abandoned                   0         
  abandoned-via-alt           0

Comment 45 Pradeep Kumar Surisetty 2016-12-04 00:30:33 UTC

Created attachment 1227804 [details]
qpid with 3k content hosts

Comment 47 Pavel Moravec 2016-12-05 11:46:11 UTC

Reproduced on 0.4-20: Standalone reproducer outside Satellite:


#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException

from proton.utils import BlockingConnection

import traceback
import random

while True:
  sleep(1)
  try:
    conn = BlockingConnection("proton+amqp://0.0.0.0:5647", ssl_domain=None, heartbeat=1)
    print "conn created"
    snd = conn.create_sender("pulp.bz1367735", name=str(uuid4()))
    snd.close()
    sleep(2)
    snd = conn.create_sender("pulp.bz1367735", name=str(uuid4()))
    snd.close()
    if (conn):
      conn.close()
    snd = None
    conn = None
  except Exception, e:
    print 'Exception occurred'
    print e
    conn.close()
    pass
print "done"
sleep(100)


Run the script 10 times in parallel against qdrouterd that link routes pulp.* to qpidd - and add the queue pulp.bz1367735 to the broker.

Monitor:

while true; do qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt   -b amqps://localhost:5671 -g | grep standalone -B2; ps aux | grep -v grep | grep -e qpidd -e qdrouterd; sleep 10; done

Reproduced in versions: -17, -19, -20, -21.

Comment 48 Pavel Moravec 2016-12-05 11:59:18 UTC

Created attachment 1228031 [details]
tcpdump from reproducer (all AMQP traffic unencrypted)

TL;DR: due to wrong AMQP 1.0 performatives sent from qdrouterd, qpidd keeps open AMQP links forever. And number of such links grow in time, causing qpidd memory consumption grows as well.

The key problem is visible when applying filter:

(tcp.stream eq 1) && (amqp.channel == 131)

showing qpidd<->qdrouterd communication on AMQP channel nr. 131.

Frames 14416 - 14426 are correct - AMQP session attached and detached on that channel as usual.

BUT frames 14445 - 14446 show a problematic behaviour of qdrouter. It sends begin+attach performatives, immediatelly followed by detach - so far so good, but "end" performative is missing.

Instead of "end" performative, there is "begin" one - why? Isn't this the culprit? Isn't proton (or qdrouterd) sending wrong performative here?

Comment 49 Pavel Moravec 2016-12-05 12:11:12 UTC

Hi Ganesh, can you please have a look? qdrouterd is sending wrong performative in some scenario, see latest 2 updates.

Comment 50 Pavel Moravec 2016-12-05 21:19:35 UTC

Some correction from tcpdump analysis (me--): the proper problematic sequence is:

qdrouterd sending "begin+attach+detach", qpidd responding with "begin" (not qdrouterd again).

Here qpidd misbehaves. It should attach+detach the link as well. qdrouterd waits for that, thus it can't close the session by sending "end".

When I replaced qpid-cpp 0.30 -> 0.34-17 and proton 0.9 -> 0.10, neither the session or memory leak was present.

So upgrade of qpid-cpp to 0.34-17 OR proton to 0.10 fixes this bug.

(and it starts to smell like bz1392768 is a duplicate)

Comment 51 Pavel Moravec 2016-12-06 14:13:37 UTC

Testing the reproducer with qpid-cpp-server-0.34-20.el7sat.x86_64 (scratch-)built against qpid-proton-c-0.9-16.el7.x86_64, the leak is not present. Neither # sessions or RSS grows, when applying the reproducer script 40times in parallel.

So, pure upgrade of qpid-cpp-server (of version 0.34-20 built against proton 0.9 that uses also qdrouterd) is a resolution.

Anyway, doing so for Satellite would require change in (dependencies of) tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64 that requires libqpidclient.so.10

Comment 52 Pavel Moravec 2016-12-06 14:43:39 UTC

*** Bug 1392768 has been marked as a duplicate of this bug. ***

Comment 53 Pavel Moravec 2016-12-11 13:58:04 UTC

Another reproducer identified (maybe of different underneath bug but also fixed in qpid-cpp 0.34-20 built against proton 0.9):

bounce many goferd clients concurrently. I.e. restart qdrouterd on a Capsule with tens or more clients registered there (and running goferd). qpidd on Satellite then boosts its memory footprint.

Standalone reproducer outside Satellite:

# cat just_consume.py 
#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message

from proton.utils import BlockingConnection

import threading
import traceback
import os, sys
import random

SSL = True
ROUTER_ADDRESS_NOSSL = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648"
ROUTER_ADDRESS_SSL = "amqps://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5647"
ROUTER_ADDRESS = ROUTER_ADDRESS_SSL if SSL else ROUTER_ADDRESS_NOSSL
#ROUTER_ADDRESS = "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
#ADDRESS = "queue"
HEARTBEAT = 5
SLEEP = float(sys.argv[1])
THREADS = int(sys.argv[2])
START_PERIOD = 10.0


class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.domain = domain
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(random.uniform(SLEEP/2,3*SLEEP/2))
            try:
                print "%s: reconnecting.." % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

    def stop(self):
        self.running = False

ca_certificate='/etc/rhsm/ca/katello-server-ca.pem'
client_certificate='/etc/pki/consumer/bundle.pem'
client_key=None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

random.seed()
threads = []
for i in range(THREADS):
  threads.append(ReceiverThread(i, "%s.%s" % (ADDRESS, i), domain if SSL else None))
  threads[i].start()
  sleep(START_PERIOD/THREADS)

while True:
  sleep(10)
_in = raw_input("Press Enter to exit:")

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
#

# for i in $(seq 1 10); do python just_consume.py 10.0 10 & sleep 0.95; done


Here, # of sessions is stable, just memory growths. As written, the memory growth is not visible in 0.34-20.

Comment 56 Pavel Moravec 2017-01-24 14:26:04 UTC

The RHEL7 behaviour on durable queues is even more peculiar. Apart of the "RHEL7 consumes more memory than RHEL6" problem.

What I did: created 1k durable queues and send a 400bytes message to 2/3 of them. Memory grew up significantly. I let monitoring RSS further, and after a say 10-15 minutes, memory grew up from 80MB to 500MB. When the broker was idle!!!

How can an idle broker trigger consuming extra 400MB memory???

Curiously, the percentage of non-empty queues does play a role. I.e. having all queues empty and monitoring mem.usage every minute:

qpidd    29184 99.9  1.9 2113452 648452 ?      Ssl  06:57  15:51 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 94.0  1.9 2113452 649092 ?      Ssl  06:57  15:53 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 88.7  1.9 2113452 652052 ?      Ssl  06:57  15:54 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 84.0  2.0 2113452 666112 ?      Ssl  06:57  15:56 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 79.8  2.0 2113452 666544 ?      Ssl  06:57  15:58 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 76.0  2.0 2113452 666492 ?      Ssl  06:57  15:59 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 72.6  2.0 2113452 666740 ?      Ssl  06:57  16:01 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 69.4  2.0 2113324 667332 ?      Ssl  06:57  16:02 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 66.6  2.0 2113324 667624 ?      Ssl  06:57  16:04 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 63.9  2.0 2113324 669708 ?      Ssl  06:57  16:06 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 61.5  2.0 2113324 669940 ?      Ssl  06:57  16:07 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 59.3  2.0 2113324 669992 ?      Ssl  06:57  16:09 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 57.2  2.0 2113324 673596 ?      Ssl  06:57  16:10 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 55.3  2.0 2113324 675732 ?      Ssl  06:57  16:12 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 53.5  2.0 2113324 676004 ?      Ssl  06:57  16:13 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 51.8  2.0 2113324 675924 ?      Ssl  06:57  16:15 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 50.2  2.0 2113324 664052 ?      Ssl  06:57  16:17 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 48.7  2.0 2113324 662164 ?      Ssl  06:57  16:18 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 47.4  2.0 2113324 662300 ?      Ssl  06:57  16:20 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 46.0  2.0 2113324 661068 ?      Ssl  06:57  16:21 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 44.8  2.0 2113324 660968 ?      Ssl  06:57  16:23 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 43.7  2.0 2113324 660856 ?      Ssl  06:57  16:25 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 42.5  2.0 2113324 663880 ?      Ssl  06:57  16:26 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 41.5  2.0 2113324 666308 ?      Ssl  06:57  16:28 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 40.5  2.0 2113324 665856 ?      Ssl  06:57  16:29 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 39.6  2.0 2113324 665008 ?      Ssl  06:57  16:31 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 38.7  2.0 2113324 664752 ?      Ssl  06:57  16:32 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 37.8  2.0 2113324 665008 ?      Ssl  06:57  16:34 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 37.0  2.0 2113324 664956 ?      Ssl  06:57  16:36 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    29184 36.2  2.0 2113324 663724 ?      Ssl  06:57  16:37 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf


Having 25% of queues with 1 short message in each and same monitoring:
qpidd    43251  100  1.8 2113552 597736 ?      Ssl  07:44  15:59 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 94.2  1.8 2113552 597736 ?      Ssl  07:44  16:01 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 89.0  1.8 2113552 597736 ?      Ssl  07:44  16:03 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 84.4  1.8 2113552 598832 ?      Ssl  07:44  16:05 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 80.2  1.8 2113552 601068 ?      Ssl  07:44  16:07 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 76.4  1.8 2113552 604152 ?      Ssl  07:44  16:09 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 73.0  1.8 2113552 604112 ?      Ssl  07:44  16:12 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 69.9  1.8 2113628 605044 ?      Ssl  07:44  16:14 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 67.1  1.8 2113628 605684 ?      Ssl  07:44  16:16 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 64.5  1.8 2113628 606464 ?      Ssl  07:44  16:18 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 62.1  1.8 2113628 606680 ?      Ssl  07:44  16:20 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 59.8  1.8 2113628 607328 ?      Ssl  07:44  16:22 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 57.8  1.8 2113628 607424 ?      Ssl  07:44  16:24 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 55.9  1.8 2113628 620684 ?      Ssl  07:44  16:26 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 54.1  1.9 2113628 623412 ?      Ssl  07:44  16:28 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 52.4  1.9 2113628 636212 ?      Ssl  07:44  16:30 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 50.9  1.9 2113628 636356 ?      Ssl  07:44  16:32 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 49.4  1.9 2113628 636356 ?      Ssl  07:44  16:34 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 48.0  1.9 2113628 636228 ?      Ssl  07:44  16:36 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 46.7  1.9 2113628 634224 ?      Ssl  07:44  16:38 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 45.5  1.9 2113628 634224 ?      Ssl  07:44  16:40 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 44.3  1.9 2113628 632916 ?      Ssl  07:44  16:43 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 43.2  1.9 2113628 634772 ?      Ssl  07:44  16:45 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 42.2  1.9 2113628 638088 ?      Ssl  07:44  16:47 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 41.2  1.9 2113628 636744 ?      Ssl  07:44  16:49 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 40.3  1.8 2113628 615056 ?      Ssl  07:44  16:51 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 39.4  1.8 2113628 613008 ?      Ssl  07:44  16:53 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 38.5  1.8 2113628 613008 ?      Ssl  07:44  16:55 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 37.7  1.8 2113628 612996 ?      Ssl  07:44  16:57 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    43251 36.9  1.8 2113628 612868 ?      Ssl  07:44  16:59 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf

Having 50% of queues with 1 short message in each:
qpidd     8552  100  1.9 2114516 633624 ?      Ssl  08:31  15:45 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 94.0  2.2 2114516 731360 ?      Ssl  08:31  15:46 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 88.6  2.5 2114516 829484 ?      Ssl  08:31  15:48 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 83.9  2.8 2114516 928028 ?      Ssl  08:31  15:49 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 79.7  3.1 2114516 1018340 ?     Ssl  08:31  15:51 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 75.9  3.1 2114516 1019028 ?     Ssl  08:31  15:52 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 72.4  3.1 2114516 1019572 ?     Ssl  08:31  15:54 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 69.3  3.1 2114516 1019572 ?     Ssl  08:31  15:55 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 66.4  3.1 2114580 1021716 ?     Ssl  08:31  15:57 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 63.7  3.1 2114580 1025144 ?     Ssl  08:31  15:58 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 61.3  3.1 2114580 1025436 ?     Ssl  08:31  16:00 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 59.1  3.1 2114580 1025428 ?     Ssl  08:31  16:01 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 57.0  3.1 2114580 1025428 ?     Ssl  08:31  16:03 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 55.0  3.1 2114580 1025428 ?     Ssl  08:31  16:04 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 53.2  3.1 2114580 1025060 ?     Ssl  08:31  16:06 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 51.6  3.1 2114580 1025260 ?     Ssl  08:31  16:07 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 50.0  3.1 2114580 1025132 ?     Ssl  08:31  16:09 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 48.5  3.1 2114580 1025092 ?     Ssl  08:31  16:10 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 47.1  3.1 2114580 1024844 ?     Ssl  08:31  16:12 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 45.8  3.1 2114580 1025440 ?     Ssl  08:31  16:13 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 44.6  3.1 2114580 1025964 ?     Ssl  08:31  16:15 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 43.4  3.1 2114580 1027864 ?     Ssl  08:31  16:16 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 42.3  3.1 2114580 1031076 ?     Ssl  08:31  16:18 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 41.3  3.1 2114580 1031092 ?     Ssl  08:31  16:19 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 40.3  3.1 2114580 1029348 ?     Ssl  08:31  16:21 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 39.3  3.1 2114580 1029388 ?     Ssl  08:31  16:22 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 38.4  3.1 2114580 1029332 ?     Ssl  08:31  16:24 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 37.6  3.1 2114580 1027464 ?     Ssl  08:31  16:25 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd     8552 36.8  3.1 2114580 1027368 ?     Ssl  08:31  16:27 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf


Having 75% of queus with 1 message:
qpidd    56482  100  1.9 2999192 643476 ?      Ssl  12:35  18:09 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 94.9  2.2 2999192 741272 ?      Ssl  12:35  18:10 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 90.1  2.5 2999192 838996 ?      Ssl  12:35  18:12 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 85.9  2.8 2999192 936640 ?      Ssl  12:35  18:13 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 82.0  3.1 2999192 1034528 ?     Ssl  12:35  18:14 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 78.4  3.2 2999192 1076152 ?     Ssl  12:35  18:16 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 75.1  3.2 2999192 1063356 ?     Ssl  12:35  18:17 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 72.2  3.2 2999192 1076572 ?     Ssl  12:35  18:18 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 69.4  3.3 2999192 1086604 ?     Ssl  12:35  18:20 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 66.9  3.3 2999192 1080300 ?     Ssl  12:35  18:21 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 64.5  3.3 2999192 1081976 ?     Ssl  12:35  18:23 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 62.3  3.3 2999192 1084736 ?     Ssl  12:35  18:24 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 60.3  3.3 2999064 1085664 ?     Ssl  12:35  18:25 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 58.4  3.3 3000804 1087488 ?     Ssl  12:35  18:27 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 56.6  3.4 2999064 1118568 ?     Ssl  12:35  18:28 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 54.9  3.4 2999064 1135824 ?     Ssl  12:35  18:29 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 53.3  3.4 2999064 1122948 ?     Ssl  12:35  18:31 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 51.8  3.4 2999064 1123052 ?     Ssl  12:35  18:32 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 50.4  3.4 2999064 1123076 ?     Ssl  12:35  18:33 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 49.1  3.4 2999064 1123076 ?     Ssl  12:35  18:35 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 47.9  3.4 2999064 1132740 ?     Ssl  12:35  18:36 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 46.7  3.4 2999064 1132776 ?     Ssl  12:35  18:37 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 45.5  3.4 2999064 1132776 ?     Ssl  12:35  18:39 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 44.5  3.4 2999064 1132776 ?     Ssl  12:35  18:40 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 43.4  3.4 2999064 1136780 ?     Ssl  12:35  18:41 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 42.5  3.4 2999064 1138500 ?     Ssl  12:35  18:43 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 41.5  3.4 2999064 1140456 ?     Ssl  12:35  18:44 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 40.7  3.4 2999064 1139456 ?     Ssl  12:35  18:45 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 39.8  3.4 2999064 1128160 ?     Ssl  12:35  18:47 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    56482 39.0  3.4 2999064 1128312 ?     Ssl  12:35  18:48 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf


But having all queues with 1 message:
qpidd    42390  100  1.8 2998988 610692 ?      Ssl  11:46  17:56 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 94.9  2.1 2998988 706604 ?      Ssl  11:46  17:58 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 90.1  2.1 2998988 706812 ?      Ssl  11:46  17:59 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 85.8  2.1 2998988 706808 ?      Ssl  11:46  18:01 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 81.9  2.1 2998988 707136 ?      Ssl  11:46  18:02 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 78.3  2.2 2998988 720428 ?      Ssl  11:46  18:03 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 75.1  2.2 2998988 722240 ?      Ssl  11:46  18:05 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 72.0  2.2 2998988 724784 ?      Ssl  11:46  18:06 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 69.3  2.2 2998988 727308 ?      Ssl  11:46  18:08 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 66.7  2.2 2998988 727592 ?      Ssl  11:46  18:09 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 64.4  2.2 2998988 727728 ?      Ssl  11:46  18:11 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 62.2  2.2 2998988 735480 ?      Ssl  11:46  18:12 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 60.1  2.2 2998988 727904 ?      Ssl  11:46  18:13 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 58.2  2.2 2998988 727904 ?      Ssl  11:46  18:15 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 56.4  2.2 2998988 728028 ?      Ssl  11:46  18:16 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 54.7  2.2 2998988 728220 ?      Ssl  11:46  18:18 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 53.1  2.2 2998988 728216 ?      Ssl  11:46  18:19 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 51.7  2.2 2998988 728360 ?      Ssl  11:46  18:20 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 50.2  2.2 2998988 732656 ?      Ssl  11:46  18:22 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 48.9  2.2 2998988 734432 ?      Ssl  11:46  18:23 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 47.7  2.2 2998988 732424 ?      Ssl  11:46  18:25 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 46.5  2.2 2998988 732424 ?      Ssl  11:46  18:26 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 45.4  2.2 2998988 732452 ?      Ssl  11:46  18:28 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 44.3  2.2 2998988 731040 ?      Ssl  11:46  18:29 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 43.3  2.2 2998988 731044 ?      Ssl  11:46  18:30 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 42.3  2.2 2998988 731040 ?      Ssl  11:46  18:32 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 41.4  2.2 2998988 731044 ?      Ssl  11:46  18:33 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 40.5  2.2 2998988 731040 ?      Ssl  11:46  18:35 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 39.6  2.2 2998988 734500 ?      Ssl  11:46  18:36 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf
qpidd    42390 38.8  2.2 2998988 734592 ?      Ssl  11:46  18:37 /usr/sbin/qpidd --config /etc/qpid/qpidd.conf


So of *some* queues get a message AND *some* queues are empty, qpidd starts to consume yet more memory - after a random idle time.

That is so silly. Or I did a silly mistake in my tests.

Comment 57 Kim van der Riet 2017-01-26 20:22:19 UTC

The memory consumption "picture" of qpidd depends a lot on how the memory consumption is measured.

Using Pavel's method in comment #55 of using qpid-config to add queues (no messages present), and using the command "ps -aux | grep qpidd" to quickly find and isolate the RSS value, I have found the following:

(All values in kB as returned by the ps command)
      Num. Queues: |      0 |    100 |    300 |   1000 |
-------------------+--------+--------+--------+--------+
RHEL6 Transient:   |  10368 |  13736 |  15308 |  19784 |
RHEL7 Transient:   |   9460 |  13188 |  15564 |  19740 |
RHEL6 Linearstore: |  10572 |  22000 |  31576 |  66164 |
RHEL7 Linearstore: |   9720 |  94840 | 127592 | 238904 |

These results show there is a distinct difference between RHEL6 and RHEL7 for the same code, and (as I understand) is the basis for some concern by some of our customers.

I discovered that by using gdb, one can track the memory usage statistics from inside the process. This is done by running the broker under gdb, then stopping the process at key moments and running the command "call malloc_stats()" before continuing execution. This will show the memory allocated under each of the memory arenas as well as the mmap-allocated memory. I performed the same test as above using gdb, and found the following:

(All values in bytes, as returned by malloc_stats())

RHEL6:
        Num Queues: |         0 |        100 |        300 |         1000 |
--------------------+-----------+------------+--_---------+--------------+
    Arena 0     sys |   528,384 |  1,728,512 |  3,137,536 |   10,510,336 |
             in-use |   407,872 |  1,375,952 |  2,859,712 |   10,243,120 |
    Arena 1     sys |   143,360 |    323,584 |  1,032,192 |    1,064,960 |
             in-use |     5,648 |      8,768 |     13,376 |       28,176 |
    Arena 2     sys |   135,168 |    602,112 |  1,474,560 |    7,274,496 |
             in-use |     3,744 |    395,616 |  1,229,744 |    7,040,752 |
    Arena 3     sys |   135,168 |  1,007,616 |  3,874,816 |   10,252,288 |
             in-use |     2,304 |    747,680 |  3,633,536 |    9,968,672 |
    Arena 4     sys |   135,168 |    999,424 |  2,863,104 |    7,856,128 |
             in-use |     2,304 |    797,504 |  2,723,648 |    7,576,352 |
    Arena 5     sys |   135,168 |  1,847,296 |  3,354,624 |    9,592,832 |
             in-use |     2,960 |  1,489,472 |  3,108,464 |    9,323,536 |
Total Arenas*   sys | 1,212,416 |  6,508,544 | 15,736,832 |   46,551,040 |
             in-use |   424,832 |  4,814,992 | 13,568,480 |   44,180,608 |
   Max mmap regions |         0 |        100 |        300 |        1,000 |
     Max mmap bytes |         0 |105,676,800 |317,030,400 |1,056,768,000 |
Total +mmap     sys | 1,212,416 |112,185,344 |332,767,232 |1,103,319,040 |
             in-use |   424,832 |110,491,792 |330,598,880 |1,100,948,608 |


RHEL7:
        Num Queues: |         0 |        100 |        300 |         1000 |
--------------------+-----------+------------+------------+--------------+
    Arena 0     sys |   536,576 |  1,757,184 |  1,953,792 |    9,687,040 | 
             in-use |   418,944 |  1,474,176 |  1,600,720 |    9,206,944 |
    Arena 1     sys |   135,168 |  1,339,392 |  1,339,392 |    1,339,392 |
             in-use |     2,304 |      7,440 |      9,776 |       15,760 |
    Arena 2     sys |   135,168 |    135,168 |    516,096 |    5,201,920 |
             in-use |     2,384 |      2,384 |    309,120 |    4,966,512 |
    Arena 3     sys |   135,168 |    311,296 |  3,129,344 |    8,204,288 |
             in-use |     2,304 |     91,440 |  2,798,384 |    7,905,456 |
    Arena 4     sys |   135,168 |  2,043,904 |  5,210,112 |   14,131,200 |
             in-use |     2,304 |  1,836,240 |  4,947,952 |   13,895,136 |
    Arena 5     sys |   135,168 |  1,789,952 |  4,218,880 |    8,577,024 |
             in-use |     2,960 |  1,411,280 |  3,903,728 |    8,225,872 |
Total Arenas*   sys | 1,212,416 |  7,376,896 | 16,367,616 |   47,140,864 |
             in-use |   431,200 |  4,822,960 | 13,569,680 |   44,215,680 |
   Max mmap regions |         0 |        100 |        300 |        1,000 |
     Max mmap bytes |         0 |105,676,800 |317,030,400 |1,056,768,000 |
Total +mmap     sys | 1,212,416 |113,053,696 |333,398,016 |1,103,908,864 |
             in-use |   431,200 |110,499,760 |330,600,080 |1,100,983,680 |

* This value is not returned by malloc_stats(), but is calculated by adding the individual values in Arenas 0 - 5.

Looking at the "Total +mmap" values, one can see the that the total memory in play for both RHEL6 and RHEL7 are almost identical. In addition, the amounts of memory in use correlate with the expected use, given that linearstore will allocate approx. 1MB** for buffering per queue created, and that this is done through mmap (as the mmap regions match the number of queues exactly, as does the mmap bytes).

The above would seem to indicate that from an internal perspective, at least, there is no difference between what is going on in RHEL6 and RHEL7. However, the following questions arise:

1. Why is the RSS value so different between RHEL6/7 for apparently the same allocations as seen by gdb?

2. Why are the RSS values so different from those seen by gdb, and in particular, why are they so low? I can't see any correlation between the RSS values and the gdb values. What exactly is the RSS value showing?

3. Is using RSS a reliable method of checking on memory allocated within a process?

** The amount of memory being allocated by the linearstore for buffering can be adjusted, and is a separate issue from this one.

Comment 60 Carlos O'Donell 2017-01-27 03:48:32 UTC

(In reply to Kim van der Riet from comment #57)
> (All values in kB as returned by the ps command)
>       Num. Queues: |      0 |    100 |    300 |   1000 |
> -------------------+--------+--------+--------+--------+
> RHEL6 Transient:   |  10368 |  13736 |  15308 |  19784 |
> RHEL7 Transient:   |   9460 |  13188 |  15564 |  19740 |
> RHEL6 Linearstore: |  10572 |  22000 |  31576 |  66164 |
> RHEL7 Linearstore: |   9720 |  94840 | 127592 | 238904 |

You have 238MB of RSS in RHEL7...

> (All values in bytes, as returned by malloc_stats())
> RHEL6:
>         Num Queues: |         0 |        100 |        300 |         1000 |
> Total +mmap     sys | 1,212,416 |112,185,344 |332,767,232 |1,103,319,040 |
>              in-use |   424,832 |110,491,792 |330,598,880 |1,100,948,608 |
> 
> RHEL7:
>         Num Queues: |         0 |        100 |        300 |         1000 |
> --------------------+-----------+------------+------------+--------------+
> Total +mmap     sys | 1,212,416 |113,053,696 |333,398,016 |1,103,908,864 |
>              in-use |   431,200 |110,499,760 |330,600,080 |1,100,983,680 |

... and 1.1GB in use.

So the answer is: RHEL7 does a better job than RHEL6 at optimizing page placement and keeping more pages in RSS for the application to run faster.

As total system memory pressure increases the number of pages in RSS for the process may go down as the pages are moved to SWAP (if you have any).
 
> The above would seem to indicate that from an internal perspective, at
> least, there is no difference between what is going on in RHEL6 and RHEL7.
> However, the following questions arise:
> 
> 1. Why is the RSS value so different between RHEL6/7 for apparently the same
> allocations as seen by gdb?

For optimal performance RSS _should_ have been 1.1GB.

In RHEL6 the memory pressure might be different and you have less RSS pages for the process (suboptimal performance).

In RHEL7 the system is closer to optimal for this single process and more pages are in RSS, meaning the process runs faster (no waiting for pages to move from SWAP to RSS).

> 2. Why are the RSS values so different from those seen by gdb, and in
> particular, why are they so low? I can't see any correlation between the RSS
> values and the gdb values. What exactly is the RSS value showing?

RSS is resident set size.

It is the number of pages of physical memory the process is using and does not include those pages in SWAP.

Therefore while the process is using 1.1GB on both RHEL6 and RHEL7, the OS is in charge of optimizing the number of pages in RSS (physical ram) and the number of pages in SWAP (some kind of secondary storage).

In RHEL7 the virtual memory management subsystem has moved more pages into RSS for the process to help with performance.

As you increase the number of connected clients, or test threads, the kernel notices this and may move more pages from SWAP to RSS to improve performance, all at the cost to other processes on the system.

> 3. Is using RSS a reliable method of checking on memory allocated within a
> process?

Yes, but you have to understand what RSS is measuring.

RSS measures how many physical pages of ram the process is using, not the total number of pages allocated.

Between RSS, VSZ, and swap, you can get an idea of:
RSS - How many pages of physical ram the process is using.
VSZ - How many total pages of address space the process has requested (including reservations).
Swap - How many pages are currently paged out to swap by the kernel.

I hope this answers your questions.

Comment 61 Brian Bouterse 2017-01-30 12:20:04 UTC

Thanks for the input. That answers the differences in RSS numbers between EL6 and EL7, but why does qpidd use so much total memory (not RSS) in situations with a lot of empty queues? I think measuring using VSZ as you mention is a good way to be insulated from OS paging optimizations/differences.

There are several comments that contain testing data, but let's look at Comment 44. It holds 5,436 bytes worth of data total, 54 connections, and 3,230 sessions, yet qpidd is reporting to consume 6 GB. That would be I haven't done this testing myself, but in reading this BZ that is something that I'm wondering. Other comments show similar numbers (if I'm reading this right). Any idea on the cause of this memory growth rate?

Thanks for helping sat6 investigate. Since it all runs on one server, memory issues affect all sat6 services.

Comment 62 Pavel Moravec 2017-01-31 13:52:12 UTC

    (In reply to Kim van der Riet from comment #57)
    > The memory consumption "picture" of qpidd depends a lot on how the memory
    > consumption is measured.

    Indeed. I dont know what is the right number. I run qpidd under gdb that had pre-created 5000 durable empty queues, on RHEL7. Once up, I run "ps" and "qpid-stat -q" every minute and got this:

    # while true; do date; ps aux | grep -v -e grep -e gdb | grep qpidd; sleep 25; qpid-stat -q | wc; sleep 25; done
    Tue Jan 31 08:23:25 EST 2017
    root     43925 10.9  0.7 6466404 249488 pts/1  Sl+  08:22   0:05 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:24:21 EST 2017
    root     43925  7.3  0.9 6466404 306952 pts/1  Sl+  08:22   0:07 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:25:18 EST 2017
    root     43925  6.3  1.0 6466404 338680 pts/1  Sl+  08:22   0:10 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:26:15 EST 2017
    root     43925  5.9  1.1 6466404 362976 pts/1  Sl+  08:22   0:13 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:27:12 EST 2017
    root     43925  5.6  1.0 6466404 339952 pts/1  Sl+  08:22   0:15 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:28:08 EST 2017
    root     43925  5.4  1.0 6466404 340564 pts/1  Sl+  08:22   0:18 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:29:05 EST 2017
    root     43925  5.2  1.1 6466576 391588 pts/1  Sl+  08:22   0:20 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:30:02 EST 2017
    root     43925  5.1  1.1 6466576 392300 pts/1  Sl+  08:22   0:22 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:30:58 EST 2017
    root     43925  5.0  1.3 6466576 427316 pts/1  Sl+  08:22   0:25 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:31:55 EST 2017
    root     43925  4.9  1.7 6466576 562240 pts/1  Sl+  08:22   0:27 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:32:52 EST 2017
    root     43925  4.8  2.0 6466576 658624 pts/1  Sl+  08:22   0:30 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:33:48 EST 2017
    root     43925  4.8  2.2 6466576 728556 pts/1  Sl+  08:22   0:32 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:34:45 EST 2017
    root     43925  4.7  2.5 6466576 850292 pts/1  Sl+  08:22   0:34 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:35:41 EST 2017
    root     43925  4.7  2.6 6466576 852708 pts/1  Sl+  08:22   0:37 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:36:38 EST 2017
    root     43925  4.6  2.6 6466576 853256 pts/1  Sl+  08:22   0:39 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:37:34 EST 2017
    root     43925  4.6  2.6 6466576 853468 pts/1  Sl+  08:22   0:41 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:38:31 EST 2017
    root     43925  4.6  2.6 6466576 852284 pts/1  Sl+  08:22   0:44 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:39:27 EST 2017
    root     43925  4.6  2.6 6466576 852092 pts/1  Sl+  08:22   0:46 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:40:24 EST 2017
    root     43925  4.5  2.6 6466576 854108 pts/1  Sl+  08:22   0:49 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:41:20 EST 2017
    root     43925  4.5  2.6 6466576 854108 pts/1  Sl+  08:22   0:51 /usr/sbin/qpidd
       5003   50015  610253
    Tue Jan 31 08:42:17 EST 2017
    root     43925  4.5  2.6 6466576 853976 pts/1  Sl+  08:22   0:53 /usr/sbin/qpidd
       5003   50015  610253
    ..

    Meantime, I called malloc_stats() from gdb several times (at the beggining, after 1st qpid-stat and close to end - compare timestamps):

    (gdb) call malloc_stats()
    Arena 0:
    system bytes     =  218726400
    in use bytes     =  218501568
    Arena 1:
    system bytes     =     135168
    in use bytes     =       4048
    Arena 2:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 3:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 4:
    system bytes     =     135168
    in use bytes     =       2448
    Arena 5:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 6:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 7:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 8:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 9:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 10:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 11:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 12:
    system bytes     =     135168
    in use bytes     =       2960
    Arena 13:
    system bytes     =    3076096
    in use bytes     =    1685744
    Total (incl. mmap):
    system bytes     = 1211506688
    in use bytes     = 1208299680
    max mmap regions =       5003
    max mmap bytes   = 5283627008
    $1 = -158871104
    (gdb) continue 
    Continuing.
    2017-01-31 08:23:50 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:23:50 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:23:50 [Security] info qpid.[::1]:5672-[::1]:37184 SASL: Authentication succeeded for: anonymous@QPID
    ^C
    Program received signal SIGINT, Interrupt.
    pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
    238	62:	movq	%rax, %r14
    (gdb) call malloc_stats()
    Arena 0:
    system bytes     =  218726400
    in use bytes     =  218501264
    Arena 1:
    system bytes     =     135168
    in use bytes     =       2640
    Arena 2:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 3:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 4:
    system bytes     =     135168
    in use bytes     =       2448
    Arena 5:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 6:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 7:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 8:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 9:
    system bytes     =     135168
    in use bytes     =       3392
    Arena 10:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 11:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 12:
    system bytes     =   58818560
    in use bytes     =      20368
    Arena 13:
    system bytes     =    2412544
    in use bytes     =    1684208
    Total (incl. mmap):
    system bytes     = 1269526528
    in use bytes     = 1208314928
    max mmap regions =       5003
    max mmap bytes   = 5283627008
    $2 = -158871104
    (gdb) continue 
    Continuing.
    2017-01-31 08:24:46 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:24:46 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:24:46 [Security] info qpid.[::1]:5672-[::1]:37186 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:25:43 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:25:43 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:25:43 [Security] info qpid.[::1]:5672-[::1]:37188 SASL: Authentication succeeded for: anonymous@QPID
    ^C
    Program received signal SIGINT, Interrupt.
    pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
    238	62:	movq	%rax, %r14
    (gdb) call malloc_stats()
    Arena 0:
    system bytes     =  218726400
    in use bytes     =  218501264
    Arena 1:
    system bytes     =     135168
    in use bytes     =       2640
    Arena 2:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 3:
    system bytes     =   57176064
    in use bytes     =      10624
    Arena 4:
    system bytes     =     135168
    in use bytes     =       2448
    Arena 5:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 6:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 7:
    system bytes     =     135168
    in use bytes     =       2640
    Arena 8:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 9:
    system bytes     =     135168
    in use bytes     =       2928
    Arena 10:
    system bytes     =     135168
    in use bytes     =       2304
    Arena 11:
    system bytes     =     135168
    in use bytes     =       3088
    Arena 12:
    system bytes     =   58818560
    in use bytes     =      12336
    Arena 13:
    system bytes     =    2412544
    in use bytes     =    1684432
    Total (incl. mmap):
    system bytes     = 1326567424
    in use bytes     = 1208316096
    max mmap regions =       5003
    max mmap bytes   = 5283627008
    $3 = -158871104
    (gdb) continue 
    Continuing.
    2017-01-31 08:26:40 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:26:40 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:26:40 [Security] info qpid.[::1]:5672-[::1]:37190 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:27:37 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:27:37 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:27:37 [Security] info qpid.[::1]:5672-[::1]:37192 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:28:33 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:28:33 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:28:33 [Security] info qpid.[::1]:5672-[::1]:37194 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:29:30 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:29:30 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:29:30 [Security] info qpid.[::1]:5672-[::1]:37196 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:30:27 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:30:27 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:30:27 [Security] info qpid.[::1]:5672-[::1]:37198 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:31:23 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:31:23 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:31:23 [Security] info qpid.[::1]:5672-[::1]:37200 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:32:20 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:32:20 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:32:20 [Security] info qpid.[::1]:5672-[::1]:37202 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:33:17 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:33:17 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:33:17 [Security] info qpid.[::1]:5672-[::1]:37204 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:34:14 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:34:14 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:34:14 [Security] info qpid.[::1]:5672-[::1]:37206 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:35:10 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:35:10 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:35:10 [Security] info qpid.[::1]:5672-[::1]:37208 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:36:06 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:36:06 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:36:06 [Security] info qpid.[::1]:5672-[::1]:37210 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:37:03 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:37:03 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:37:03 [Security] info qpid.[::1]:5672-[::1]:37212 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:37:59 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:37:59 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:37:59 [Security] info qpid.[::1]:5672-[::1]:37214 SASL: Authentication succeeded for: anonymous@QPID
    2017-01-31 08:38:56 [Security] info SASL: Mechanism list: ANONYMOUS
    2017-01-31 08:38:56 [Security] info SASL: Starting authentication with mechanism: ANONYMOUS
    2017-01-31 08:38:56 [Security] info qpid.[::1]:5672-[::1]:37216 SASL: Authentication succeeded for: anonymous@QPID
    ^C
    Program received signal SIGINT, Interrupt.
    pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
    238	62:	movq	%rax, %r14
    (gdb) call malloc_stats()
    Arena 0:
    system bytes     =  218902528
    in use bytes     =  218501168
    Arena 1:
    system bytes     =   58679296
    in use bytes     =       2736
    Arena 2:
    system bytes     =     151552
    in use bytes     =       2640
    Arena 3:
    system bytes     =     307200
    in use bytes     =       3056
    Arena 4:
    system bytes     =     286720
    in use bytes     =       2832
    Arena 5:
    system bytes     =   31297536
    in use bytes     =       3920
    Arena 6:
    system bytes     =   50122752
    in use bytes     =       2640
    Arena 7:
    system bytes     =     135168
    in use bytes     =       2640
    Arena 8:
    system bytes     =     520192
    in use bytes     =       2832
    Arena 9:
    system bytes     =     520192
    in use bytes     =       2928
    Arena 10:
    system bytes     =   51261440
    in use bytes     =       2640
    Arena 11:
    system bytes     =     520192
    in use bytes     =       2640
    Arena 12:
    system bytes     =   58818560
    in use bytes     =      12176
    Arena 13:
    system bytes     =    2412544
    in use bytes     =    1687760
    Total (incl. mmap):
    system bytes     = 1462018048
    in use bytes     = 1208314784
    max mmap regions =       5003
    max mmap bytes   = 5283627008
    $4 = -158871104
    (gdb) continue 


    I see in "Total (incl. mmap):" values that:
    - "system bytes" are growing (the same way like RSS grows in ps)
    - "in use bytes" are quite stable
    - "max mmap regions/bytes" are stable

    "system bytes" growth during the test:
    1211506688
    1269526528
    1326567424
    1462018048


    So what is the proper value describing how much memory the process consumes? If "system bytes" and RSS grow, will it (at later state) affect other applications? Or will it OOM killer react?

Comment 63 Kim van der Riet 2017-01-31 15:50:12 UTC

Let me comment on the qpid linearstore memory allocations. Because linarstore (and legacystore before it) uses libaio for DMA writes to disk (using O_DIRECT), it is necessary for these stores to manage their own memory to some degree. The bulk of the per-queue memory allocated by the store is for buffering disk writes. The default settings for linearstore is to allocate 1MB of buffer space per queue, a space that is broken into 32 pages of 32kB each.

Currently, the linearstore allows for the size of the buffer pages to be set (using the --wcache-page-size parameter to the broker), which then determines internally the number of pages and hence the total allocated memory according to the following table:

wcache-page-size | Num pages | Total allocation
(kiB) | | (kiB)
-----------------+-----------+------------------
4 | 64 | 256
8 | 64 | 512
16 | 32 | 512
32* | 32* | 1024*
64 | 16 | 1024
128 | 8 | 1024

* Default

To reduce the size of the allocation to 256kiB, set the wcache-page-size parameter to 4.

Notes:

1. The value of wcache-page-size must be an integer power of 2 between 4 and 128, so the values listed in the table are all the legal values allowed. See the qpidd help for more information.

2. While investigating this issue, a bug was discovered in linearstore which allows the value of wcache-page-size to include 1 and 2 kiB values also, but using these will cause a floating point exception on startup. The help also indicates that 1 and 2 are allowable values. See QPID-7651 (https://issues.apache.org/jira/browse/QPID-7651) upstream Jira for more details. Avoid the use of 1 and 2 for this parameter.

3. Using a smaller page cache size will reduce the ability of the store to manage large spikes in message volume without increasing latency during the high volume period.

Comment 64 Carlos O'Donell 2017-01-31 16:53:59 UTC

(In reply to Pavel Moravec from comment #62)
>     # while true; do date; ps aux | grep -v -e grep -e gdb | grep qpidd;
> sleep 25; qpid-stat -q | wc; sleep 25; done
>     Tue Jan 31 08:23:25 EST 2017
>     root     43925 10.9  0.7 6466404 249488 pts/1  Sl+  08:22   0:05
> /usr/sbin/qpidd
>        5003   50015  610253

The visible page-related values for 'ps aux' are VSZ and RSS.

VSZ is the total address space being used.

RSS is the total number of pages of physical ram being used at the moment.

You do not know how much swap is in use, and you do not know how much VSZ is actually just VM reservations that remain after a release via madvise MADV_DONTNEED.

Therefore at this point you don't actually know how many pages the process has requested from the OS kernel. You are missing the swap information to answer this.

Only /proc/$PID/status has enough information: VmRSS + VmSWAP.

>     Tue Jan 31 08:42:17 EST 2017
>     root     43925  4.5  2.6 6466576 853976 pts/1  Sl+  08:22   0:53
> /usr/sbin/qpidd
>        5003   50015  610253
>     ..

At this poing qpidd is using ~6GiB of virtual address space (doesn't really matter, this could be a combination of reservations MADV_DONTNEED and actual allocations), and is consuming ~854MiB of physical ram (what matters to memory load).

>     Meantime, I called malloc_stats() from gdb several times (at the
> beggining, after 1st qpid-stat and close to end - compare timestamps):
> 
>     (gdb) call malloc_stats()
...
>     Total (incl. mmap):
>     system bytes     = 1211506688
>     in use bytes     = 1208299680
...
>     Total (incl. mmap):
>     system bytes     = 1462018048
>     in use bytes     = 1208314784
>     max mmap regions =       5003
>     max mmap bytes   = 5283627008
>     $4 = -158871104
>     (gdb) continue 

So you have allocated memory through the malloc API family of functions up to a total of an internally allocated ~1.4GiB (there is rounding up, internal book keeping, and userspace caches included here to accelerate future requests).

>     I see in "Total (incl. mmap):" values that:
>     - "system bytes" are growing (the same way like RSS grows in ps)
>     - "in use bytes" are quite stable
>     - "max mmap regions/bytes" are stable
> 
>     "system bytes" growth during the test:
>     1211506688
>     1269526528
>     1326567424
>     1462018048

This looks like your test is allocating more and more pages of memory via the malloc API family of functions.
 
>     So what is the proper value describing how much memory the process
> consumes? If "system bytes" and RSS grow, will it (at later state) affect
> other applications? Or will it OOM killer react?

The proper way to determine how much memory the process consumes is to look at /proc/$PID/status and extract VmRSS + VmSWAP.

Yes, if "system bytes" (total memory being held by the glibc malloc API) and RSS grow (includes glibc malloc API allocations and any other per-application mmaps), this will increase memory load on the system, and the consequences that come with that.

The OOM killer will follow the deterministic rules it follows to determine which application to kill. Please consult your OS kernel documentation to determine what is being applied here and consult your configuration to determine if you are in heuristic overcommit mode.

Note:
Since some pages of RSS may be shared between multiple processes e.g. read-only library pages loaded from disk, there is some double counting that means that the sum of all RSS is more than the total system number of pages * page size.

Comment 65 Mike Cressman 2017-02-02 18:16:26 UTC

For now, we're going to assume the OS is not doing anything wrong, but just that RHEL-7 keeps more in memory.

qpid broker is allocating 1 MB per durable queue.  So for a large installation with, say, 50,000 queues, qpid broker would allocate 50 GB of memory to accommodate.  Not all of this is swapped in, but as we've seen, RHEL-7 does keep much of it in RSS.

Assuming you really need durable (persistent) queues, this memory allocation can be reduced via a config option.  In the qpidd.conf file, set

  wcache-page-size=4

This is currently the lowest valid setting, and will result in memory allocation of 256 kb per queue (one quarter of the 1 MB default).
Anything smaller would require a code change in qpid.

This should result in lower qpid broker memory size (possibly by a lot).  We are testing now to see how much (since not all the alocated memory ends up in RSS).

We will update here with results soon, but feel free to try some Satellite scale testing in parallel.

This will have equal impact on qpid 0.30 or 0.34, but we believe the 0.34 version fixes leak(s) as documented in comments #51 and #53.

So while changing the config on the current 6.2 qpid broker 0.30 should show positive results, we still recommend continuing with the plan to test and upgrade to the 0.34 builds  (qpid-cpp-0.34.22.el7sat, qpid-proton-0.9-17, and all other versions that are pulled in via dependency: ruby, etc).

Comment 66 Kim van der Riet 2017-02-03 20:19:29 UTC

I have measured the memory consumption for wcache-page-size=4 on a RHEL7 box:

           |       malloc_stats()      |          ps           |
Num queues | mmap bytes | in use bytes | VSS (kiB) | RSS (kiB) |
-----------+------------+--------------+-----------+-----------+
         0 |          0 |       429712 |    476284 |     10260 |
       100 |   27033600 |     34593104 |    504904 |     37836 |
       300 |   81100800 |    102877232 |    561076 |     62700 |
      1000 |  270336000 |    341862048 |    760372 |    114368 |
      3000 |  811008000 |   1024723872 |   1323204 |    261756 |

which confirms that the mmap allocations are increasing at a rate of 270,336 bytes per queue (264kiB). This is only slightly higher than the expected 256kiB.

I have also taken a look at the message latency for different values of wcache-page-size for a fixed spike in load of 1,000,000 messages of 100 bytes in a single-machine test (ie both broker and client is using localhost on the same machine). For a steady-state message load of 10 messages per second, the average round-trip latency is approx 1.5ms. Then a spike in load of 1,000,000 messages is applied. The following table shows the results:

Round-trip latency:

wcache-page-size | mmap size |      latency (ms)
      (kiB)      |   (kiB)   |  min |    max |    ave
-----------------+-----------+------+--------+-------
             128 |      1024 |  0.9 | 1034.0 |  739.3
              64 |      1024 |  1.9 |  956.1 |  762.2
              32 |      1024 |  0.8 | 1039.1 |  781.7
              16 |       512 |  1.4 | 1191.6 |  888.6
               8 |       512 |  0.6 | 1177.7 |  882.8
               4 |       256 |  1.0 | 1385.4 | 1043.0

These results indicate that for small messages, the average latency seems to depend more on the total memory allocated rather than on the underlying number of pages. The 1024k allocations showed approx 750ms latencies, the 512k allocations approx 880ms and the 256k allocation 1040ms.

This picture may change somewhat for larger messages.

Comment 67 Kim van der Riet 2017-02-03 20:22:01 UTC

(In reply to Kim van der Riet from comment #66)
>            |       malloc_stats()      |          ps           |
> Num queues | mmap bytes | in use bytes | VSS (kiB) | RSS (kiB) |
> -----------+------------+--------------+-----------+-----------+
                                           ^^^
Sorry, a typo, this *should* be VSZ.

Comment 68 Pavel Moravec 2017-02-03 21:08:02 UTC

Thanks for the performance testing.

Satellite sends to (durable) qpid queues durable messages of approximate/average sizes:

- 1.1kB for celery (very often used, latency matters to some extent)
- 6-8kB for katello_event_queue (frequently used, latency no big deal)
- usually 0.4-1kB (but arbitrary high) for pulp.agent.* (arbitrary high - when installing list of packages/errata in one request, can be e.g. 20k) - infrequently used, but 1 message can end up in thousands of queues in parallel, latency no big deal
- 1-1.5kB for pulp_task (often used, latency matters to some extent)
- 1-1.5kB for reserved_resource_worker-* (frequently used, latency matters to some extent)
- 2kB for resource_worker (often used, latency matters to some extent)

Since the queues have Java or Python or Ruby producers and consumers, while qpidd is written in C++, I think the biggest latency is on the client's side - until the comparison for say 2kB messages wont be much worse when using wcache-page-size=4.

I am also concerned a bit about 6-8kB messages for katello_event_queue where a message is bigger than the cache size - this *will* slow down the writing. Anyway, consumer will fetch messages with frequency at most 100msgs per second (in fact consume 100msgs, then sleep 1 - see https://github.com/Katello/katello/commit/87ef553cfe840299389abbb4123a640c2227e0bb). I expect the "sleep 1" makes qpidd enough time to process anything to the disk.

Comment 69 Kim van der Riet 2017-02-07 15:18:37 UTC

(In reply to Pavel Moravec from comment #68)
> Satellite sends to (durable) qpid queues durable messages of
> approximate/average sizes:

Based on the sized described above, I repeated the above test for different message sizes. Also, I am making a change to the linearstore options which allow the number of write pages and the size of the pages to be set independently of each other. The only restrictions are that the minimum number of pages is 4, and that the page sizes are a power of 2 with a minimum size of 4 (kiB).

Round-trip latency (ms):

 pg |num |alloc ||   message size (bytes)   |
size|pgs | kiB  || 100 | 1000 | 6000 |10000 |
----+----+------++-----+------+------+------+
128 |  8 | 1024 || 726 |  267 |  169 |  341 |
 64 | 16 | 1024 || 768 |  283 |  169 |  263 |
 32 | 32 | 1024 || 782 |  280 |  156 |  262 |
 16 | 32 |  512 || 833 |  303 |  172 |  263 |
  8 | 64 |  512 || 827 |  320 |  180 |  231 |
  4 | 64 |  256 || 925 |  356 |  339 |  234 |
  4 | 32 |  128 || 955 |  346 |  360 |  253 |
  4 | 16 |   64 || 932 |  342 |  387 |  252 |
  4 |  8 |   32 || 974 |  348 |  337 |  254 |
  4 |  4 |   16 || 941 |  346 |  468 |  246 |

where:
pg size = wcache-page-size value (kiB)
num pgs = wcache-num-pages
alloc = total mmap allocation (= wcache-page-size * wcache-num-pages)

This would seem to indicate that for larger messages (within the range mentioned in comment #68) the page buffers work more efficiently and latencies actually decrease relative to the small 100 byte message case.

Bear in mind the limitations of this test, however:

1. Single machine, 4 cores make this test somewhat CPU bound. More pressure can be brought to bear on the broker by using more than one client sending messages to a queue simultaneously in a multi-machine test.

2. This machine uses traditional non-raid hard disks. The results of a test will vary depending on the type, speed and configuration (RAID, etc.) of the disks.

3. Realistic performance numbers should be obtained on the intended target hardware using the intended network topology/configuration. These results do give a _relative_ feel for what may be expected, however.

Comment 70 Pavel Moravec 2017-02-16 13:27:33 UTC

Bad news: the wcache-size seems to help for low number of durable queues only.

Scenario: qpidd 0.34 in Sat6.2 where I mimicked 0, 100 and 200 goferd clients (connecting via qdrouterd to qpidd) via below python script (it's not ideal, but works). Results are:

#queues	--wcache-size	VMS	RSS
0	32	355924	26112
0	4	356744	26904
100	32	481968	96688
100	4	376892	104472
200	32	593784	173992
200	4	610756	172272

See that for 100 queues, wcache-size helps, but has no effect for 200 queues.

I will test more later on, so far preliminary results confirming what Pradeep found in scaled test (5k clients require 5GB RAM on 0.34 with wcache-size=4).


Reproducer script:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 1
THREADS = 80 # was 30

class CreateQueueAndSubscribe(threading.Thread):
  def __init__(self):
    super(CreateQueueAndSubscribe, self).__init__()

  def run(self):
    qmf_conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648", ssl_domain=None, heartbeat=None)
    qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True, options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
    qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
    qName = 'pulp.agent.FAKE.%s' % uuid4()
    print qName

    content = {
                "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
                "_method_name": "create",
                "_arguments": {"type":"queue", "name":qName,
                               "properties":{"exclusive":False, "auto-delete":False, "durable":True}}
              }
    properties = {
                   'qmf.opcode': '_method_request',
                   'x-amqp-0-10.app-id': 'qmf2',
                   'method': 'request'
                 }
    request = Message(
                    body=content,
                    reply_to=qmf_rec.remote_source.address,
                    properties=properties,
                    correlation_id=unicode(uuid4()).encode('utf-8'),
                    subject='broker')
    qmf_snd.send(request)
    qmf_rec.receive()  # ignore what we have received as QMF response
    qmf_snd.close()
    qmf_rec.close()
    qmf_conn.close()

    while True:
      connected=False
      while not(connected):
        try:
          conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648", ssl_domain=None, heartbeat=10)
          rec = conn.create_receiver(qName, name=str(uuid4()))
          connected=True
        except:
          pass

      while connected:
        try:
          impl = rec.receive(3 or NO_DELAY)
        except Timeout:
          pass
        except Exception:
          connected=False
          pass


threads = []
for i in range(THREADS):
  threads.append(CreateQueueAndSubscribe())
  sleep(SLEEP)
  threads[i].start()

while True:
  sleep(10)

Comment 71 Pavel Moravec 2017-02-16 20:19:48 UTC

TL;DR version: lot of qpidd memory is allocated for a consumer.

longer version: I compared VSZ+RSS sizes with:
- wcache-page-size set to 4 or 32 (default)
- either 1000 mocked goferd clients or just 1000 durable queues without a consumer

and counted average VSZ+RSS increase triggered by one extra client / queue. 

Results (in kB per 1 client/queue):

- wcache-page-size=4, goferd clients:
  - VSZ 947
  - RSS 972

- wcache-page-size=32, goferd clients:
  - VSZ 1681
  - RSS 1550

- wcache-page-size=4, just durable queues:
  - VSZ 288
  - RSS 109

- wcache-page-size=32, just durable queues:
  - VSZ 1046
  - RSS 171


Conclusions:
- wcache-page-size does help a lot, decreasing by approx. 750kB. Great
- still not sufficient - 1 goferd client still requires 1MB, impossible to have 50k clients
- huge memory "sinner": the queue subscriber / consumer that requires approx. 650kB (!!!) each

Recall: all such consumers are subscribed via qdrouterd, via one shared TCP+AMQP connection. The 650kB is really all about (AMQP 1.0) consumer..

Comment 73 Pavel Moravec 2017-02-17 08:45:52 UTC

(In reply to Pavel Moravec from comment #71)
> TL;DR version: lot of qpidd memory is allocated for a consumer.
> 
> longer version: I compared VSZ+RSS sizes with:
> - wcache-page-size set to 4 or 32 (default)
> - either 1000 mocked goferd clients or just 1000 durable queues without a
> consumer
> 
> and counted average VSZ+RSS increase triggered by one extra client / queue. 
> 
> Results (in kB per 1 client/queue):
> 
> - wcache-page-size=4, goferd clients:
>   - VSZ 947
>   - RSS 972
> 
> - wcache-page-size=32, goferd clients:
>   - VSZ 1681
>   - RSS 1550
> 
> - wcache-page-size=4, just durable queues:
>   - VSZ 288
>   - RSS 109
> 
> - wcache-page-size=32, just durable queues:
>   - VSZ 1046
>   - RSS 171
> 
> 
> Conclusions:
> - wcache-page-size does help a lot, decreasing by approx. 750kB. Great
> - still not sufficient - 1 goferd client still requires 1MB, impossible to
> have 50k clients
> - huge memory "sinner": the queue subscriber / consumer that requires
> approx. 650kB (!!!) each
> 
> Recall: all such consumers are subscribed via qdrouterd, via one shared
> TCP+AMQP connection. The 650kB is really all about (AMQP 1.0) consumer..

Continuation of the test:

creating durable queues (without QMF, via qpid-receive -a "..") and _then_ subscribing consumers via qdrouterd, very similar memory usage is seen at the end. And the activity "subscribe to a queue" really triggers approx. 650kB memory consumption.

So it does not matter if the queue is created via QMF (like goferd does). The majority of currently(*) consumed extra memory per connected goferd is due to AMQP 1.0 consumer.

(*) currently = qpid-cpp 0.34-22, wcache-page-size=4

Comment 76 Pavel Moravec 2017-02-19 19:48:59 UTC

And here is the root cause for the latest memory inefficiency (#c75) / AMQP 1.0 consumer allocating lots of memory:

https://issues.apache.org/jira/browse/QPID-7677

or:

https://github.com/apache/qpid-cpp/blob/master/src/qpid/broker/amqp/Outgoing.cpp#L68

If I decreased the 5000 to 50 on my upstream qpid broker, memory decreased dramatically \o/ .

(but that might not be the proper way of fix, though)

Comment 80 Pavel Moravec 2017-02-24 09:58:37 UTC

FYI the upstream patch for https://issues.apache.org/jira/browse/QPID-7677 adds new configurable option to manage size of outgoing buffer of unsettled/unacknowledged messages. The bigger the buffer is, the more messages can be pending to the consumer but the bigger memory requirements are there.

We should decide what will be the value for Satellite.

Satellite consumers are:
- katello-agent/goferd 
  - very few messages pending at once (usually just 1)
- pulp celery workers/manager (on 0.10 now, though unaffected until upgraded to 1.0)
  - I *think* limiting capacity to say 10 or 20 wont be problem even for resource_manager
- celery monitoring its event queue
  - here the bandwidth might matter the most / this is the most frequent AMQP subscription - so far on 0.10 so unaffected
- pulp.task queue (where http wsgi script writes requests to pulp)
  - also quite busy, less than "celery" queue, shall be ok
- various auxiliary queues, usually no deal wrt throughput

So my suggestion would be to add

session-max-unacked=10

to /etc/qpid/qpidd.conf in final solution/deployment/installer.

Comment 81 Pradeep Kumar Surisetty 2017-03-15 07:06:09 UTC

Have 6000 nodes in setup.  
Noticed good improvements with new qpid + session-max-unacked=10  in scale setup

Few observations.

1) Had around 6000 hosts.  After every qpid restart (irrespective of number of content hosts) qpid mem goes to 350MB

2) With new qpid & session-max-unacked=100  & qpid restart: Qpid memory was around 300MB
After registering another 3000 hosts, qpid went upto 1.2G.  

Its growing after new registrations. worth investigarting. 
We can ship new qpid after QE runs

Comment 91 Pavel Moravec 2017-04-27 13:04:20 UTC

Hi Bryan,
is this BZ worth candidate to 6.2.10 ?

Pros:
- many customers hit it
- big service impact

Negs:
- requires packages rebase so potential regressions
  - note that my "personal" Sat 6.2.7 upgraded to .8 runs qpid-cpp 0.34 for few moths, under various stress tests, and without a problem, so I guess a regression (if there is) would have already occurred
- might require additional testing (esp. due to above Negs), i.e. extra work
- 6.2.10 might be close to 6.3 GA where these rebased packages will appear either way?

Comment 92 Pradeep Kumar Surisetty 2017-04-27 13:12:19 UTC

Bryan, Mike

Yeh. I believe we should push it in 6.2.10 considering many customers hitting this. Potential risk would be regression. 

But i did my 2nd round scale testing with 30K hosts with new qpid packages.  I ran for about 3-4 weeks and havent noticed any major issues with new qpid. 

It needs QE validation though.

Comment 93 Pavel Moravec 2017-04-28 13:55:44 UTC

Yet another reproducer exists, without installing a single errata/package, without sending a single message to qpidd. Just have bouncing goferd clients that cause frequent session attach and detach on qpidd<->qdrouterd connection, such that qdrouterd (or the proton library beneath) wrongly recycles some sessions ("too fast": qdrouterd starts to use some session that has not been fully closed by qpidd yet). These sessions are then improperly closed in qpidd, though they are not visible in "qpid-stat -g".

Reproducer (python script mocking goferd and connecting to qdrouterd without SSL):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading
import sys

from proton import ConnectionException, Message, Timeout
from proton.utils import BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.01
START_UUID = int(sys.argv[1])
END_UUID = int(sys.argv[2])
THREADS = END_UUID - START_UUID + 1

class CreateQueueAndSubscribe(threading.Thread):
  def __init__(self, uuid):
    super(CreateQueueAndSubscribe, self).__init__()
    self.uuid = uuid

  def run(self):
    self.qName = 'pulp.agent.FAKE.%s' % self.uuid
    print self.qName

    while True:
      connected=False
      while not(connected):
        try:
          conn = BlockingConnection("proton+amqp://your.satellite.fqdn:5648", ssl_domain=None, heartbeat=10)
          rec = conn.create_receiver(self.qName, name=str(uuid4()))
          connected=True
        except:
          pass

      while connected:
        try:
          impl = rec.receive(3)
        except Timeout:
          pass
        except LinkDetached:
          connected=False
          pass


threads = []
for i in range(THREADS):
  threads.append(CreateQueueAndSubscribe(START_UUID+i))
  sleep(SLEEP)
  threads[i].start()

while True:
  sleep(10)


Then run this script 10 times in parallel:

while true; do
	i=$((RANDOM%10))
	python connect_to_queues.py $((i*10+1)) $((i*10+10)) &
	pid=$!
	sleep $((RANDOM%5))
	sleep 1
	kill $pid
	sleep 1
done

Comment 96 Mike Cressman 2017-05-10 17:18:36 UTC

Pavel,
Which versions of proton, qpid-cpp, and router are you using with the latest reproducer (comment 93)?  Are you saying that there's still a leak with all the packages targeted for 6.3? (0.34-24 broker, 0.4-22 qpid-dispatch, 0.9-16 proton)

Comment 97 Pavel Moravec 2017-05-10 20:34:31 UTC

(In reply to Mike Cressman from comment #96)
> Pavel,
> Which versions of proton, qpid-cpp, and router are you using with the latest
> reproducer (comment 93)?  Are you saying that there's still a leak with all
> the packages targeted for 6.3? (0.34-24 broker, 0.4-22 qpid-dispatch, 0.9-16
> proton)

Apologize for not specifying this info before.

The reproducer is valid against 0.30 broker _only_, in particular against these packages:
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64


While _no_ memory leak is present against 0.34 broker / against these packages:
python-qpid-proton-0.9-16.el7.x86_64
qpid-cpp-client-0.34-23.el7sat.x86_64
qpid-cpp-client-devel-0.34-23.el7sat.x86_64
qpid-cpp-server-0.34-23.el7sat.x86_64
qpid-cpp-server-linearstore-0.34-23.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64


So 0.34 broker (plus some dependant packages) upgrade will prevent the memory leak even for the reproducer in #c93.


(my primary aim for the new reproducer was to show there is a leak when not updating a single package via katello-agent, when just goferd is restarting frequently and nothing else, as this scenario has already happened at a customer)


So to have this BZ fully resolved, we need:
- upgraded qpid-cpp packages to version 0.34-24 (plus some dependant updated packages)
- session-max-unacked=10 in /etc/qpid/qpidd.conf (i.e. satellite installer change)
- upgrade regression bz1443470 to be fixed by qpid team OR a workaround to clear orphaned journal files like suggested by Kim in the BZ (I would prefer a fix)

Comment 98 Sean O'Keeffe 2017-05-10 21:45:39 UTC

Looking at https://issues.apache.org/jira/browse/QPID-7677 setting session-max-unacked=10 in /etc/qpid/qpidd.conf would be pointless without that commit which is going to be in qpid-cpp-1.37.0 pkgs when released? Version numbers look like they are wrong somewhere.. or maybe I've missed something?

Comment 99 Pavel Moravec 2017-05-11 07:03:14 UTC

(In reply to Sean O'Keeffe from comment #98)
> Looking at https://issues.apache.org/jira/browse/QPID-7677 setting
> session-max-unacked=10 in /etc/qpid/qpidd.conf would be pointless without
> that commit which is going to be in qpid-cpp-1.37.0 pkgs when released?
> Version numbers look like they are wrong somewhere.. or maybe I've missed
> something?

Good point. Gladly resolved already by backport to 0.34-24.el[6-7]sat :

# rpm -q qpid-cpp-server
qpid-cpp-server-0.34-24.el7sat.x86_64
# qpidd --help | grep session-max-unacked
  --session-max-unacked DELIVERIES (5000) 
#

See the builds and Changelog there:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=540919
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=540918

Comment 101 Pavel Moravec 2017-05-11 13:07:05 UTC

(In reply to Pavel Moravec from comment #97)
> So to have this BZ fully resolved, we need:
> - upgraded qpid-cpp packages to version 0.34-24 (plus some dependant updated
> packages)
> - session-max-unacked=10 in /etc/qpid/qpidd.conf (i.e. satellite installer
> change)
> - upgrade regression bz1443470 to be fixed by qpid team OR a workaround to
> clear orphaned journal files like suggested by Kim in the BZ (I would prefer
> a fix)

Forgot one config option, so once again:

to have this BZ fully resolved, we need:
- upgraded qpid-cpp packages to version 0.34-24 (plus some dependant updated
packages)
- session-max-unacked=10 AND wcache-page-size=4 in /etc/qpid/qpidd.conf (i.e. satellite installer change)
- upgrade regression bz1443470 to be fixed by qpid team OR a workaround to
clear orphaned journal files like suggested by Kim in the BZ (I would prefer
a fix)

Comment 102 Pavel Moravec 2017-05-15 10:45:02 UTC

(In reply to Pavel Moravec from comment #101)
> Forgot one config option, so once again:
> 
> to have this BZ fully resolved, we need:
> - upgraded qpid-cpp packages to version 0.34-24 (plus some dependant updated
> packages)
> - session-max-unacked=10 AND wcache-page-size=4 in /etc/qpid/qpidd.conf
> (i.e. satellite installer change)
> - upgrade regression bz1443470 to be fixed by qpid team OR a workaround to
> clear orphaned journal files like suggested by Kim in the BZ (I would prefer
> a fix)


IMPORTANT: we realized the 0.30 -> 0.34 upgrade contains linearstore upgrade where (so far) _manual_ steps must be performed - see [1]

We need an automated script for that.. ([2] has some draft for inspiration, havent checked how precise/correct)


So once again, to have this BZ fully resolved, we need:
- upgraded qpid-cpp packages to version 0.34-24 (plus some dependant updated packages)
- session-max-unacked=10 AND wcache-page-size=4 in /etc/qpid/qpidd.conf (i.e. satellite installer change)
- upgrade regression bz1443470 to be fixed by qpid team OR a workaround to clear orphaned journal files like suggested by Kim in the BZ (I would prefer a fix)
- an automated script (working both on RHEL6 and RHEL7 with _different_ path to linear store) for the linearstore dir.structure update per [1]/[2]



[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/3/html-single/Messaging_Installation_and_Configuration_Guide/index.html#Upgrade_a_MRG_Messaging_3.1_Server_to_MRG_Messaging_3.2
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1263141

Comment 105 Mike McCune 2017-06-15 19:59:32 UTC

// INTERNAL FOR NOW - WILL MAKE PUBLIC AFTER CONFIRMATION HOTFIX CLEAR

*** Satellite 6.2.9 HOTFIX INSTRUCTIONS ***

This hotfix upgrades your qpid libraries from 0.31 to 0.34 and requires a migration of your queue structure. This is done via a script provided within the archive outlined below.


1) Create a directory on your Satellite server:

/var/tmp/hotfix-1367735-1426242/

2) Copy hotfix-1367735-1426242.tar to /var/tmp/hotfix-1367735-1426242/

3) Expand archive into above directory:

cd /var/tmp/hotfix-1367735-1426242/
tar xvf hotfix-1367735-1426242.tar

4) Run update script or examine script and run commands contained within by hand if so desired:

./hotfix-1367735-1426242.sh

5) proceed with normal operations.

You are not required to upgrade all of your Capsules, they will continue to function as-is.

Comment 106 Mike McCune 2017-06-15 20:00:51 UTC

Created attachment 1288175 [details]
Hotfix repository

Comment 107 Mike McCune 2017-06-15 20:11:10 UTC

Note, the above script will stop and then restart all Satellite services. Expect downtime of ~5 minutes for a restart.

Comment 110 Mike McCune 2017-07-19 15:44:35 UTC

The above HOTFIX is also applicable to 6.2.10, feel free to apply to any 6.2.10 installations with the same instructions above.

Comment 114 Mike McCune 2017-08-17 19:41:10 UTC

This bug was incorrectly missed from the Errata advisory for 6.2.11 but we did ship the fix for this bug with that update. Manually moving this to CLOSED:ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2466