Bug 615441

Summary: "sluggish" QMF py console can crash c++ broker
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: beta
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Target Milestone: 1.3
Reporter: Pete MacKinnon <pmackinn>
Assignee: Ken Giusti <kgiusti>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: gsim, jross, kgiusti, matt
Doc Type: Bug Fix
Last Closed: 2010-08-12 17:15:04 UTC
Bug Depends On: 622865
Attachments:
- The problem as exhibited in the mrg31 environment
- Output from the valgrind/massif heap profiler
- Thorough massif heap profile

Description Pete MacKinnon 2010-07-16 18:12:42 UTC
For some time now, we have noticed crashes when the mrg31 broker continuously reports QMF-related queue backup problems for an extended period of time (> 4 hours).

Justin has a py test driver that appears to be able to reliably trigger this:
https://svn.devel.redhat.com/repos/mrg-team/trunk/testing/slappy/sluggy.py

He runs this with -oet args.

The qpidd log will report:
2010-07-15 18:49:11 error Execution exception: resource-limit-exceeded: Policy exceeded on qmfc-v2-hi-redhat.19022.1, policy: size: max=104857600, current=104857399; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
2010-07-15 18:49:11 error Execution exception: resource-limit-exceeded: Policy exceeded on qmfc-v2-hi-redhat.19022.1, policy: size: max=104857600, current=104857399; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
2010-07-15 18:49:11 error Execution exception: resource-limit-exceeded: Policy exceeded on qmfc-v2-hi-redhat.19022.1, policy: size: max=104857600, current=104857399; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)

We have been able to capture two cores so far from this condition. Core files on mrg31 in /root:

core.10270 
core.20094

It's not exactly clear where the exhaustion occurs, but in the first core it appears to be at:

#0  0x00000038af072b98 in _int_malloc () from /lib64/libc.so.6
#1  0x00000038af074bee in malloc () from /lib64/libc.so.6
#2  0x00000038c1abd17d in operator new () from /usr/lib64/libstdc++.so.6
#3  0x00000035c9b22ac4 in std::vector<boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> >, std::allocator<boost::io::detail::format_item<char, std::char_traits<char>, std::allocator<char> > > >::_M_fill_assign (this=<value optimized out>, __n=<value optimized out>, 
    __val=<value optimized out>)
    at /usr/include/c++/4.1.2/ext/new_allocator.h:88
#4  0x00000035c9b23aec in boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::make_or_reuse_data (this=<value optimized out>, 
    nbitems=<value optimized out>)
    at /usr/include/c++/4.1.2/bits/stl_vector.h:299
#5  0x00000035c9b23ca4 in boost::basic_format<char, std::char_traits<char>, std::allocator<char> >::parse (this=<value optimized out>, 
    buf=<value optimized out>) at /usr/include/boost/format/parsing.hpp:415
#6  0x00000035c9b243d3 in basic_format (this=<value optimized out>, 
    str=<value optimized out>)
    at /usr/include/boost/format/format_implementation.hpp:33
#7  0x00000035c9bb2ecf in qpid::broker::NullAuthenticator::start (
    this=<value optimized out>, mechanism=<value optimized out>, 
    response=<value optimized out>) at qpid/broker/SaslAuthenticator.cpp:186
#8  0x00000035c9b2c60b in qpid::broker::ConnectionHandler::Handler::startOk (
    this=<value optimized out>, clientProperties=<value optimized out>, 
    mechanism=<value optimized out>, response=<value optimized out>)
    at qpid/broker/ConnectionHandler.cpp:122

Comment 1 Gordon Sim 2010-07-19 06:54:39 UTC
What's the memory use at this point? Is it just that all available memory has been consumed?

Comment 2 Pete MacKinnon 2010-07-19 13:36:29 UTC
Anecdotally, with top I've seen VIRT at 1.5 GB and over 3 GB. I'm assuming memory exhaustion, but we don't have the raw data to confirm that.

Perhaps Justin can tee up a sluggy.py run today?

Comment 3 Gordon Sim 2010-07-19 13:53:05 UTC
Justin, could you verify whether this is just memory exhaustion when you get a chance? If it is, we won't fix it; there is as yet no feature that reliably prevents qpidd from exceeding memory limits (other than the OOM killer). This is not a blocker unless it can't be attributed to running out of allocatable memory.

Comment 4 Justin Ross 2010-07-19 14:58:28 UTC
Pete, Gordon, gotcha, we'll run it again today.

There is a question, however, about resource limits.  Sluggy grows just one queue, and the default queue limits ought to prevent one filled-to-the-brim queue from causing so much memory usage, true?

Pete suggested that there may be some other type of resource use that sluggy is triggering, and which may also need to be limited.

At any rate, I'll do some memory math this time and see if queue sizes can account for what we see.
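
Quick math on the defaults, for reference: the policy in the log caps a queue at 104857600 bytes (100 MB) of content, so one full queue explains at most ~100 MB. The 1.5-3 GB VIRT figures from comment 2 would take 15-30 full queues' worth of content to account for, so if the numbers don't add up, something else must be holding memory.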

Comment 5 Ken Giusti 2010-08-04 13:47:40 UTC
I believe I'm able to reproduce this behavior using a RHEL5 VM running the latest broker candidate (-11).

I'm running several instances of a qmf traffic generator I've written:

https://svn.devel.redhat.com/repos/mrg-team/trunk/people/kgiusti/Tools/qmf/nozzle

against the broker, plus three instances of sluggy (from the link above).

qpidd memory (VIRT + RES) slowly increases, and I get many of the following exceptions:

2010-08-03 17:35:53 error Exception caught in sendMessage: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104831652; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
2010-08-03 17:35:56 warning Exception received from broker: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104832912; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85) [caused by 10 \x00:\x00]
2010-08-03 17:35:56 error guest.qmfagent-4d73198c-de85-4b0d-a218-517884abaae4 error: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104832912; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)


I've run qpidd under valgrind looking for leaks. My VM has 512 MB of RAM; once qpidd's memory grew to approximately that size, I stopped all traffic generators and sluggy instances and shut down the broker (^C in the foreground).

Valgrind indicates that all memory is returned:

==19115== Memcheck, a memory error detector
==19115== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==19115== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==19115== Command: /home/kgiusti/qpid/cpp/src/.libs/lt-qpidd --auth no
==19115== Parent PID: 2938
==19115== 
==19115== 
==19115== HEAP SUMMARY:
==19115==     in use at exit: 0 bytes in 0 blocks
==19115==   total heap usage: 264,177 allocs, 264,177 frees, 471,915,519 bytes allocated
==19115== 
==19115== All heap blocks were freed -- no leaks are possible
==19115== 
==19115== For counts of detected and suppressed errors, rerun with: -v
==19115== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)

Going to use valgrind + massif to get a profile of heap usage.

Comment 6 Ken Giusti 2010-08-04 17:37:50 UTC
Massif results - it appears as if the majority of heap memory is due to std::string manipulation:

98.27% (151,810,905B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->81.74% (126,275,561B) 0x386369B7FF: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
| ->80.61% (124,518,967B) 0x386369D0AF: std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
| | ->80.61% (124,518,967B) 0x386369D22A: std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
| | | ->80.45% (124,282,074B) 0x525C3DD: qpid::framing::Buffer::getRawData(std::string&, unsigned int) (Buffer.cpp:289)
| | | | ->72.31% (111,702,759B) 0x52557D4: qpid::framing::AMQFrame::decode(qpid::framing::Buffer&) (AMQFrame.cpp:137)
| | | | | ->72.31% (111,702,759B) 0x4D0F169: qpid::amqp_0_10::Connection::decode(char const*, unsigned long) (Connection.cpp:56)
| | | | |   ->72.31% (111,702,759B) 0x528256F: qpid::sys::AsynchIOHandler::readbuff(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) (AsynchIOHandler.cpp:135)
| | | | |     ->72.31% (111,702,759B) 0x51B533D: boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator()(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) const (function_template.hpp:576)
| | | | |       ->72.31% (111,702,759B) 0x51B3527: qpid::sys::posix::AsynchIO::readable(qpid::sys::DispatchHandle&) (AsynchIO.cpp:428)
| | | | |         ->72.31% (111,702,759B) 0x5287E2D: boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator()(qpid::sys::DispatchHandle&) const (function_template.hpp:576)
| | | | |           ->72.31% (111,702,759B) 0x5286CA5: qpid::sys::DispatchHandle::processEvent(qpid::sys::Poller::EventType) (DispatchHandle.cpp:285)
| | | | |           | ->72.31% (111,702,759B) 0x51BF196: qpid::sys::Poller::run() (Poller.h:131)
| | | | |           |   ->51.06% (78,871,952B) 0x51B74C8: qpid::sys::(anonymous namespace)::runRunnable(void*) (Thread.cpp:35)
| | | | |           |   | ->51.06% (78,871,952B) 0x385500673B: start_thread (in /lib64/libpthread-2.5.so)
| | | | |           |   |   ->51.06% (78,871,952B) 0x38544D3D1B: clone (in /lib64/libc-2.5.so)

The Connection::decode() routine in cpp/src/qpid/amqp_0_10/Connection.cpp decodes the frame contents into an AMQContentBody, which holds the data as a string built with std::string::append(). That append is the source of the heap allocation.

The AMQFrame does go out of scope locally, but it holds an intrusive_ptr to the AMQContentBody, which probably means the body is still referenced (somewhere).

Comment 7 Gordon Sim 2010-08-05 08:28:54 UTC
What does the connection count look like at the point the errors start occurring? Is the number of connections growing?

Comment 8 Justin Ross 2010-08-05 15:45:23 UTC
Created attachment 436889 [details]
The problem as exhibited in the mrg31 environment

The resident size of the broker is more than 2 GB, and queue sizes (as reported by qpid-stat -q) account for less than 400 MB.

Comment 9 Gordon Sim 2010-08-05 16:01:13 UTC
qpid-stat -q grossly under-reports the memory required; it reports only the aggregate content size, not headers or any of the internal structures used to manage the queue and its messages.

To get a comparison, you could fill a single queue with 40,000 messages of, say, 1k content each and check memory use, then restart the broker, fill a queue with 40,000 messages with no content, and check memory use in that case.
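
If it helps, a rough sketch of that fill-the-queue run using the qpid.messaging python client (the queue name, count, and payload size are just placeholders):

# Sketch of the comparison described above. Run once with 1k payloads,
# then restart the broker and run again with PAYLOAD = "" to see how
# much the per-message overhead (headers, bookkeeping) accounts for.
from qpid.messaging import Connection, Message

PAYLOAD = "x" * 1024    # 1k of content; set to "" for the second run
COUNT = 40000

connection = Connection("localhost:5672")
connection.open()
try:
    session = connection.session()
    sender = session.sender("memtest; {create: always}")
    for i in range(COUNT):
        sender.send(Message(content=PAYLOAD))
    print "sent %d messages; check qpidd memory use now" % COUNT
finally:
    connection.close()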

Comment 10 Ken Giusti 2010-08-05 17:23:01 UTC
Re: comment 7:

It would appear that the connection counts (as well as queues, etc.) don't grow. Granted, when I push the broker to failure I cannot use qpid-tool in real time, but I have dumped the connection counts et al. *prior* to running a test, then again after experiencing the error conditions:

Pre test (after attaching sluggy(s), but before starting traffic generators):

Summary of Objects by Type:
    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       30      0
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.broker  subscription  15      0
    org.apache.qpid.broker  connection    3       0
    org.apache.qpid.broker  session       3       0
    org.apache.qpid.broker  queue         15      0
    org.apache.qpid.broker  exchange      8       0
    org.apache.qpid.broker  vhost         1       0


Post test (after stopping traffic generators and re-invoking qpid-tool):
Summary of Objects by Type:
    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       30      0
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.broker  subscription  15      0
    org.apache.qpid.broker  connection    3       0
    org.apache.qpid.broker  session       3       0
    org.apache.qpid.broker  queue         15      0
    org.apache.qpid.broker  exchange      8       0
    org.apache.qpid.broker  vhost         1       0

Comment 11 Ken Giusti 2010-08-05 18:00:35 UTC
Created attachment 436928 [details]
Output from valgrind/massif heap profiler.

Ran the broker using the following command line:

 valgrind --tool=massif --trace-children=yes --stacks=no ./qpidd --auth no --queue-purge-interval 60 --default-queue-limit 8000000
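
(The resulting massif.out.<pid> file can be viewed with ms_print, which renders the allocation histogram referenced below.)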

I immediately started the traffic generator and hit the queue resource-limit errors less than two minutes in:

2010-08-05 13:35:54 notice SASL disabled: No Authentication Performed
2010-08-05 13:35:54 notice Listening on TCP port 5672
2010-08-05 13:35:54 notice Broker running
2010-08-05 13:37:13 error Execution exception: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.10968.1, policy: size: max=8000000, current=7971122; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
...

Ran with traffic generation for 5 minutes.  Stopped the traffic generators (and the limit errors stopped as expected), then ran for an additional 5 minutes at idle.

Note the histogram in the attached massif output - it indicates that, even after 5 minutes at idle, heap usage did NOT decrease over time. The broker does not free this memory until it is shut down.

Comment 12 Ken Giusti 2010-08-10 15:21:54 UTC
Created attachment 437920 [details]
Thorough massif heap profile.

The previous massif heap profile was incomplete and therefore inaccurate.  It turns out that the slow consumer (sluggy) used for this test was *pathologically slow* - it would take on the order of minutes to process a single large message.

I've altered the slow consumer to make it fast enough to drain a full queue on an idle system within a few hours, and re-ran the test. Attached is the massif heap profile, which shows that the message-content memory is in fact returned as expected once the queues are finally drained.

Note that, during this test run, the RSS of the broker peaked at 822 MB as reported by top. Once the queues were drained, top still reported an RSS of 769 MB. According to massif, the heap peaked at over 500 MB and eventually dropped below ~100 MB. We really can't use top/RSS to accurately estimate heap utilization.

In conclusion, there does not appear to be a bug with regard to heap/memory usage. However, in lieu of producer flow control, the QMF agent needs to implement a backoff mechanism when it detects a queue-full condition.
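
Roughly something like this on the agent's send path - a sketch only; the TargetCapacityExceeded exception name (qpid.messaging's mapping of resource-limit-exceeded) and the delay values are assumptions, not the eventual fix:

# Illustrative backoff-on-queue-full sketch, not the actual QMF agent code.
# Assumes qpid.messaging raises TargetCapacityExceeded when a publish is
# rejected by a full queue; the delay bounds are arbitrary.
import time
from qpid.messaging.exceptions import TargetCapacityExceeded

def send_with_backoff(sender, message, initial_delay=0.25, max_delay=30.0):
    delay = initial_delay
    while True:
        try:
            sender.send(message)
            return
        except TargetCapacityExceeded:
            # Queue full: back off exponentially instead of hammering
            # the broker with publishes that will just be rejected.
            time.sleep(delay)
            delay = min(delay * 2, max_delay)

Note that under AMQP 0-10 a rejected publish comes back as an execution exception that may invalidate the session, so a real implementation would probably also need to re-establish the session before retrying.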

I will open a new bug to track QMF agent handling of queue full conditions.