Bug 615441
Summary: "sluggish" QMF py console can crash c++ broker

Product: Red Hat Enterprise MRG
Component: qpid-cpp
Status: CLOSED NOTABUG
Severity: high
Priority: high
Version: beta
Target Milestone: 1.3
Hardware: x86_64
OS: Linux
Reporter: Pete MacKinnon <pmackinn>
Assignee: Ken Giusti <kgiusti>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: gsim, jross, kgiusti, matt
Doc Type: Bug Fix
Last Closed: 2010-08-12 17:15:04 UTC
Bug Depends On: 622865
Description
Pete MacKinnon
2010-07-16 18:12:42 UTC
What's the memory use at this point? Is it just that all available memory has been consumed?

Anecdotally with top I've seen VIRT at 1.5Gb and over 3Gb. I'm assuming memory exhaustion but we don't have the raw data for that. Perhaps Justin can tee up a sluggy.py run today?

Justin, could you verify whether this is just memory exhaustion when you get a chance? If it is we won't fix it; there is no feature as yet that reliably prevents qpidd exceeding the memory limits (other than the OOM Killer). This is not a blocker unless it can't be attributed to running out of allocatable memory.

Pete, Gordon, gotcha, we'll run it again today. There is a question, however, about resource limits. Sluggy grows just one queue, and the default queue limits ought to prevent one filled-to-the-brim queue from causing so much memory usage, true? Pete suggested that there may be some other type of resource use that sluggy is triggering, and which may also need to be limited. At any rate, I'll do some memory math this time and see if queue sizes can account for what we see.

I believe I'm able to reproduce this behavior using a RHEL5 VM running the latest broker candidate (-11). I'm running several instances of a qmf traffic generator I've written: https://svn.devel.redhat.com/repos/mrg-team/trunk/people/kgiusti/Tools/qmf/nozzle against the broker, plus three instances of sluggy (from above link).
qpidd memory (VIRT + RES) slowly increases, and I get many of the following exceptions:

2010-08-03 17:35:53 error Exception caught in sendMessage: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104831652; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
2010-08-03 17:35:56 warning Exception received from broker: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104832912; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85) [caused by 10 \x00:\x00]
2010-08-03 17:35:56 error guest.qmfagent-4d73198c-de85-4b0d-a218-517884abaae4 error: resource-limit-exceeded: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.2591.1, policy: size: max=104857600, current=104832912; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)

I've run qpidd under valgrind looking for leaks. My VM has 512MB - once qpidd memory grows to approximately that size, I've stopped all traffic generators and sluggy instances, and shut down the broker (^C in foreground). Valgrind indicates that all memory is returned:

==19115== Memcheck, a memory error detector
==19115== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==19115== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==19115== Command: /home/kgiusti/qpid/cpp/src/.libs/lt-qpidd --auth no
==19115== Parent PID: 2938
==19115==
==19115== HEAP SUMMARY:
==19115==     in use at exit: 0 bytes in 0 blocks
==19115==   total heap usage: 264,177 allocs, 264,177 frees, 471,915,519 bytes allocated
==19115==
==19115== All heap blocks were freed -- no leaks are possible
==19115==
==19115== For counts of detected and suppressed errors, rerun with: -v
==19115== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)

Going to use valgrind + massif to get a profile of heap usage.

Massif results - it appears as if the majority of heap memory is due to std::string manipulation:

98.27% (151,810,905B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->81.74% (126,275,561B) 0x386369B7FF: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
| ->80.61% (124,518,967B) 0x386369D0AF: std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
| | ->80.61% (124,518,967B) 0x386369D22A: std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
| | | ->80.45% (124,282,074B) 0x525C3DD: qpid::framing::Buffer::getRawData(std::string&, unsigned int) (Buffer.cpp:289)
| | | | ->72.31% (111,702,759B) 0x52557D4: qpid::framing::AMQFrame::decode(qpid::framing::Buffer&) (AMQFrame.cpp:137)
| | | | | ->72.31% (111,702,759B) 0x4D0F169: qpid::amqp_0_10::Connection::decode(char const*, unsigned long) (Connection.cpp:56)
| | | | | ->72.31% (111,702,759B) 0x528256F: qpid::sys::AsynchIOHandler::readbuff(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) (AsynchIOHandler.cpp:135)
| | | | | ->72.31% (111,702,759B) 0x51B533D: boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator()(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) const (function_template.hpp:576)
| | | | | ->72.31% (111,702,759B) 0x51B3527: qpid::sys::posix::AsynchIO::readable(qpid::sys::DispatchHandle&) (AsynchIO.cpp:428)
| | | | | ->72.31% (111,702,759B) 0x5287E2D: boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator()(qpid::sys::DispatchHandle&) const (function_template.hpp:576)
| | | | | ->72.31% (111,702,759B) 0x5286CA5: qpid::sys::DispatchHandle::processEvent(qpid::sys::Poller::EventType) (DispatchHandle.cpp:285)
| | | | | | ->72.31% (111,702,759B) 0x51BF196: qpid::sys::Poller::run() (Poller.h:131)
| | | | | | ->51.06% (78,871,952B) 0x51B74C8: qpid::sys::(anonymous namespace)::runRunnable(void*) (Thread.cpp:35)
| | | | | | | ->51.06% (78,871,952B) 0x385500673B: start_thread (in /lib64/libpthread-2.5.so)
| | | | | | | ->51.06% (78,871,952B) 0x38544D3D1B: clone (in /lib64/libc-2.5.so)

The cpp/src/qpid/amqp_0_10/Connection.cpp - Connection::decode() routine is decoding the frame contents into an AMQContentBody, which is a string, using std::string::append(). This causes the heap allocation. The AMQFrame does go out of scope locally, but it uses an intrusive_ptr to the AMQContentBody, which probably means the body is still referenced (somewhere).

What does the connection count look like at the point the errors start occurring? Is the number of connections growing?

Created attachment 436889 [details]
The problem as exhibited in the mrg31 environment
The resident size of the broker is more than 2 gigs, and queue sizes (as reported by qpid-stat -q) can account for less than 400 megs.
qpid-stat -q grossly under-reports the memory required; it just reports the aggregate content size, not headers or any of the internal structures used to manage the queue and its messages. To get a comparison you could fill up a single queue with 40000 messages of e.g. 1k content and check memory use, then restart the broker and fill up a queue with 40000 messages with no content and check the memory use in that case.

Re: comment 7: It would appear that the connection counts (as well as queues, etc.) don't grow. Granted, when I push the broker to failure, I cannot use qpid-tool in realtime. But I have dumped the connection counts et al. *prior* to running a test, then again after I have experienced the error conditions:

Pre test (after attaching sluggy(s), but before starting traffic generators):

Summary of Objects by Type:
    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       30      0
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.broker  subscription  15      0
    org.apache.qpid.broker  connection    3       0
    org.apache.qpid.broker  session       3       0
    org.apache.qpid.broker  queue         15      0
    org.apache.qpid.broker  exchange      8       0
    org.apache.qpid.broker  vhost         1       0

Post test (after stopping traffic generators and re-invoking qpid-tool):

Summary of Objects by Type:
    Package                 Class         Active  Deleted
    =======================================================
    org.apache.qpid.broker  binding       30      0
    org.apache.qpid.broker  broker        1       0
    org.apache.qpid.broker  system        1       0
    org.apache.qpid.broker  subscription  15      0
    org.apache.qpid.broker  connection    3       0
    org.apache.qpid.broker  session       3       0
    org.apache.qpid.broker  queue         15      0
    org.apache.qpid.broker  exchange      8       0
    org.apache.qpid.broker  vhost         1       0

Created attachment 436928 [details]
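The two-run comparison suggested above (full content vs. empty content) lets you split memory into raw content and per-message bookkeeping overhead. A minimal sketch of the arithmetic; the byte figures passed in below are hypothetical, not measured values from this bug:

```python
def estimate_overhead(mem_with_content, mem_no_content, n_messages):
    """Split broker memory growth into per-message overhead vs. content.

    mem_with_content: memory growth after enqueuing n_messages with payload
    mem_no_content:   memory growth after enqueuing n_messages with empty payload
    """
    # The empty-payload run isolates fixed per-message cost
    # (headers, queue bookkeeping structures).
    fixed_per_msg = mem_no_content / n_messages
    # The difference between the two runs approximates what raw
    # content storage actually costs per message.
    content_per_msg = (mem_with_content - mem_no_content) / n_messages
    return fixed_per_msg, content_per_msg

# Hypothetical numbers: 40000 messages, ~1 KB content each.
fixed, content = estimate_overhead(120_000_000, 70_000_000, 40000)
# fixed = 1750.0 bytes/msg of bookkeeping, content = 1250.0 bytes/msg
```

This is why qpid-stat -q (content bytes only) can account for less than 400 MB while the process itself is far larger.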
Output from valgrind/massif heap profiler.
Ran the broker using the following command line:
valgrind --tool=massif --trace-children=yes --stacks=no ./qpidd --auth no --queue-purge-interval 60 --default-queue-limit 8000000
Immediately started traffic generator, and experienced the queue resource limit errors less than two minutes in:
2010-08-05 13:35:54 notice SASL disabled: No Authentication Performed
2010-08-05 13:35:54 notice Listening on TCP port 5672
2010-08-05 13:35:54 notice Broker running
2010-08-05 13:37:13 error Execution exception: resource-limit-exceeded: Policy exceeded on qmfc-v2-lo-localhost.localdomain.10968.1, policy: size: max=8000000, current=7971122; count: unlimited; type=reject (qpid/broker/QueuePolicy.cpp:85)
...
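The type=reject in these log lines means the broker refuses new messages once a queue's aggregate content size would exceed its limit (100 MB by default, 8 MB here via --default-queue-limit). A toy Python model of that size-based reject behaviour, purely illustrative and not the actual QueuePolicy.cpp logic:

```python
class ResourceLimitExceeded(Exception):
    pass

class RejectPolicyQueue:
    """Toy model of a size-limited queue with a 'reject' policy."""

    def __init__(self, max_size):
        self.max_size = max_size  # cf. --default-queue-limit
        self.current = 0          # aggregate content bytes enqueued
        self.messages = []

    def enqueue(self, body):
        # Reject (rather than block or ring) when the limit would be exceeded,
        # mirroring "Policy exceeded ... type=reject" in the broker log.
        if self.current + len(body) > self.max_size:
            raise ResourceLimitExceeded(
                "Policy exceeded: size: max=%d, current=%d"
                % (self.max_size, self.current))
        self.messages.append(body)
        self.current += len(body)

    def dequeue(self):
        # A consumer draining the queue frees capacity for new messages.
        body = self.messages.pop(0)
        self.current -= len(body)
        return body
```

With a slow consumer such as sluggy, the queue stays near its limit, so producers see a steady stream of these rejections until the consumer catches up.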
Ran with traffic generation for 5 minutes. Stopped the traffic generators (and the limit errors stopped as expected), then ran for an additional 5 minutes at idle.
Note the histogram in the massif output attached - it indicates that, even after 5 minutes of idle time, the heap usage did NOT decrease. The broker is not freeing this memory until it is shut down.
Created attachment 437920 [details]
Thorough massif heap profile.
The previous massif heap profile was incomplete and therefore inaccurate. It turns out that the slow consumer (sluggy) used for this test was *pathologically slow* - it would take on the order of minutes to process a single large message.
I've altered the slow consumer to make it fast enough to drain a full queue in an idle system within a few hours, and re-ran the test. Attached is the massif heap profile, which shows that the message content memory is actually returned as expected when the queues are finally drained.
Note that, during this test run, the RSS size of the broker peaked at 822MB as reported by top. Once the queues were drained, top still reported an RSS value of 769MB. According to massif, the heap peaked at over 500MB, and eventually dropped below ~100MB. We really can't use top/RSS to accurately estimate heap utilization.
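The RSS/heap gap above arises because the allocator keeps freed pages mapped rather than returning them to the OS, so top overstates live heap while massif tracks actual allocations. For anyone repeating this comparison on Linux, a small helper (an illustration, not part of the test tooling used here) to read RSS from /proc:

```python
def vmrss_kb(status_text):
    """Extract VmRSS (in kB) from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:    787456 kB"
            return int(line.split()[1])
    return None  # kernel did not report VmRSS (e.g. zombie process)

# Usage (Linux only):
# with open("/proc/self/status") as f:
#     print(vmrss_kb(f.read()))
```

Sampling this alongside massif snapshots makes the divergence between resident pages and live heap visible over the course of a run.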
In conclusion, there does not appear to be a bug with regard to heap/memory usage. However, in the absence of producer flow control, the QMF agent needs to implement a backoff mechanism when it detects a queue-full condition.
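Such a backoff might look like the following sketch; the names (send callable, ResourceLimitExceeded) are hypothetical stand-ins, not the actual QMF agent API:

```python
import time

class ResourceLimitExceeded(Exception):
    """Stand-in for the broker's queue-full rejection."""
    pass

def send_with_backoff(send, msg, max_retries=5, initial_delay=0.1):
    """Retry a send, doubling the delay after each queue-full rejection.

    'send' is any callable that raises ResourceLimitExceeded when the
    broker rejects the message. Returns True on success, False if the
    message was dropped after max_retries attempts.
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            send(msg)
            return True
        except ResourceLimitExceeded:
            time.sleep(delay)  # give the slow consumer time to drain
            delay *= 2         # exponential backoff
    return False
```

Backing off (rather than retrying immediately) gives a slow consumer like sluggy a chance to drain the reply queue instead of hammering the broker with sends that will keep being rejected.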
I will open a new bug to track QMF agent handling of queue full conditions.