Bug 591650

Summary: qpidd appears to leak connections during scale tests
Product: Red Hat Enterprise MRG
Reporter: Ken Giusti <kgiusti>
Component: qpid-cpp
Assignee: Gordon Sim <gsim>
Status: CLOSED CURRENTRELEASE
QA Contact: Jeff Needle <jneedle>
Severity: medium
Priority: high
Version: beta
CC: gsim, jross
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Attachments:
  log msg issued by broker on kill -9 of remote consoles. (flags: none)
  output of qpid-stat -c showing connections that should've been cleaned up. (flags: none)
  Verification script (flags: none)

Description Ken Giusti 2010-05-12 19:23:46 UTC
Description of problem:


While running the wallaby-agent/condor_configd scale tests as described in 

https://bugzilla.redhat.com/show_bug.cgi?id=591322

the qpidd daemon uses a great deal of memory, which is not released when all wallaby-agent/condor_configd clients are stopped.

After running the scale test plus 1000 mace clients, we stopped all consoles and agents. We used netstat to verify that all connections had been closed (all consoles were on remote hosts).

The memory footprint of qpidd was large, and did not decrease over time:

[root@pman08 ~]# ps v -C qpidd 
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
16726 ?        Ssl  113:14      0    59 5415376 4796688 19.4 /usr/sbin/qpidd --daemon --pid-dir /var/run/
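
One way to back up the "did not decrease over time" observation is to sample the resident set size at intervals. The following is only a sketch, not something that was run for this report: the helper names rss_kb and sample_rss are made up here, and it assumes a ps that accepts -o rss= and -p (as procps on Linux does).

```shell
#!/bin/sh
# rss_kb PID: print the resident set size of PID in KiB
# (prints nothing if the process does not exist).
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# sample_rss PID COUNT INTERVAL: print COUNT lines of
# "<epoch-seconds> <rss-in-KiB>", sleeping INTERVAL seconds between samples.
sample_rss() {
  pid=$1 count=$2 interval=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "$(date +%s) $(rss_kb "$pid")"
    i=$((i+1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}

# Usage (hypothetical; 16726 is the qpidd PID from the ps output above):
#   sample_rss 16726 60 60 > qpidd-rss.txt
```

A flat or rising second column after all clients have gone away would match what was seen here.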


While running the tests, qpid-stat -q reported 3000+ queues.  We verified that the queues appear to have been cleaned up:

[root@pman08 ~]# qpid-stat -q
Queues
  queue                                      dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==========================================================================================================================
  reply-pman08.lab.bos.redhat.com.5713.1          Y        Y        0    74     74       0   31.4k    31.4k        1     2
  reply-pman08.lab.bos.redhat.com.16751.1         Y        Y        0    70     70       0   52.6k    52.6k        1     2
  topic-pman08.lab.bos.redhat.com.5713.1          Y        Y        0     0      0       0      0        0         1     3
  qmfc-v2-pman08.lab.bos.redhat.com.5713.1        Y        Y        1   132    131    1.19k  87.4k    86.2k        1     3
  topic-pman08.lab.bos.redhat.com.16751.1         Y        Y        0   306    306       0   46.2k    46.2k        1     3
  qmfc-v2-pman08.lab.bos.redhat.com.16751.1       Y        Y        0   247k   247k      0    251m     251m        1     3



However - and this is troubling - qpid-stat -c still reports over 100 connections active to remote condor_configd consoles:


[root@pman08 ~]# qpid-stat -c
Connections
  client-addr       cproc           cpid   auth       connected  idle  msgIn  msgOut
  ====================================================================================
  172.17.1.7:35677  condor_configd  25615  anonymous  0s         0s     401      0
  172.17.1.7:35671  condor_configd  25548  anonymous  0s         0s     429      0
  172.17.1.4:34167  condor_configd  5081   anonymous  0s         0s     185      0
  172.17.1.7:35462  condor_configd  22365  anonymous  0s         0s     418      0
  172.17.1.4:34288  condor_configd  7159   anonymous  0s         0s     185      0
  172.17.1.7:35581  condor_configd  23683  anonymous  0s         0s    1.27k     0
  172.17.1.7:41900  condor_configd  3417   anonymous  0s         0s     478      0
  172.17.1.4:34348  condor_configd  8090   anonymous  0s         0s     185      0
  172.17.1.7:36022  condor_configd  30810  anonymous  0s         0s     422      0
  172.17.1.7:35503  condor_configd  22868  anonymous  0s         0s     389      0
  172.17.1.7:35502  condor_configd  22786  anonymous  0s         0s     792      0
  172.17.1.7:36028  condor_configd  30878  anonymous  0s         0s     716      0

...


[root@pman08 ~]# qpid-stat -c | grep condor | wc
    107     856    8881
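
To see how the leftover connections are spread across the remote hosts, the qpid-stat -c listing can be tallied per client address. This is a minimal sketch: the function name conns_per_host is made up for illustration, and it assumes the first column is always an IPv4 host:port pair, as in the output above.

```shell
#!/bin/sh
# Tally broker-side connections per client host from `qpid-stat -c` output.
# Reads the listing on stdin; only rows whose first column looks like
# IPv4:port are counted, so the title, header and ruler lines are skipped.
conns_per_host() {
  awk '$1 ~ /^[0-9.]+:[0-9]+$/ { split($1, a, ":"); n[a[1]]++ }
       END { for (h in n) print h, n[h] }' | sort
}

# Usage (on the broker host):
#   qpid-stat -c | conns_per_host
```

Each output line is a host address followed by the number of connections the broker still lists for it.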


Notes:

1) we "pkill -9 condor_configd" on the remote host, so these consoles did not cleanly shut down.

2) netstat shows no connections to the remote host that had the condor_configd consoles

3) qpidd issued a number of log errors when the consoles were pkilled:

May 12 14:56:15 pman08 qpidd[16726]: 2010-05-12 14:56:15 warning CLOSING [172.17.1.7:41973] unsent data (probably due to client disconnect) 
May 12 14:56:15 pman08 qpidd[16726]: 2010-05-12 14:56:15 warning CLOSING [172.17.1.7:36036] unsent data (probably due to client disconnect) 
May 12 14:56:15 pman08 qpidd[16726]: 2010-05-12 14:56:15 warning CLOSING [172.17.1.7:41847] unsent data (probably due to client disconnect) 
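
It is not established above whether the connections that linger in qpid-stat -c are the same ones the broker logged as CLOSING. A sketch of one way to check, assuming the log and the qpid-stat -c output have each been saved to a file (the function name and the file names in the usage line are hypothetical):

```shell
#!/bin/sh
# closing_still_listed LOGFILE STATFILE
# Print every address that appears in a "warning CLOSING [addr]" log line
# and is also still present in the first column of saved qpid-stat -c output.
closing_still_listed() {
  log=$1 stat=$2
  # Pull the bracketed address out of each CLOSING warning, de-duplicated.
  sed -n 's/.*warning CLOSING \[\([^]]*\)].*/\1/p' "$log" | sort -u |
  while read -r addr; do
    # Exact match on the client-addr column avoids prefix false positives.
    awk -v a="$addr" '$1 == a { found = 1 } END { exit !found }' "$stat" &&
      echo "$addr"
  done
}

# Usage (hypothetical file names):
#   closing_still_listed qpidd.log qpid-stat-c.txt
```

Any address printed was logged as closing yet is still listed by the broker.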


See attachments for log and qpid-stat -c output.



Comment 1 Ken Giusti 2010-05-12 19:25:05 UTC
Created attachment 413530
log msg issued by broker on kill -9 of remote consoles.

Comment 2 Ken Giusti 2010-05-12 19:25:45 UTC
Created attachment 413531
output of qpid-stat -c showing connections that should've been cleaned up.

Comment 3 Gordon Sim 2010-06-01 19:26:16 UTC
Fixed on trunk (r950201) and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commit;h=c9a6a956b126ccc27e03cb32cea269cc3a0b495f).

A simple reproducer is to run e.g. qpid-perftest --size 10 --count 10000 --nsubs 500 --npubs 500, kill it while in progress, then check that qpid-stat -c does not report any of the perftest connections as still active. (Even without this fix this passed once or twice for me, but after a few iterations at most the issue should show up.)

Comment 4 Justin Ross 2011-07-15 19:21:41 UTC
Proposing this for verification.

Comment 7 Jan Sarenik 2011-09-09 12:13:45 UTC
Verified on both RHELs, both architectures, with latest packages and 590 runs
of the reproducer mentioned in Comment #3 (see the script below)
without a single failure.

----------------------------------- %< -------------------------------------
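#!/bin/sh
# Requires: qpid-cpp-server qpid-cpp-client-devel qpid-tools

RUN=1

# Keep looping for as long as qpid-stat -c prints exactly 4 lines after the
# kill; the loop stops on the first run where the line count differs.
while
  echo "=== Run $RUN ==="
  # Start the perftest in the background and kill it after 10 seconds.
  qpid-perftest --size 10 --count 10000 --nsubs 500 --npubs 500 & sleep 10 && kill $!
  test $(qpid-stat -c | wc -l) -eq 4
do
  # Wait until the netstat line count changes before starting the next run.
  NUM=$(netstat -n | wc -l)
  while
    test $(netstat -n | wc -l) -eq $NUM
  do
    printf .
    sleep 1
  done
  echo
  RUN=$((RUN+1))
done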

Comment 8 Jan Sarenik 2011-09-09 12:15:28 UTC
Created attachment 522319
Verification script

Once again, the same script, now as the attachment.