Description of problem: There is a memory leak in failing over HA link of ready broker, after active broker shutdown. Version-Release number of selected component (if applicable): qpid-cpp 0.22-12 How reproducible: 100% Steps to Reproduce: 1. (in first terminal on node1) service qpidd start qpid-ha promote qpid-send -a "testQueue; {create:always}" -m 1000 --content-size=100 2. (in second terminal on node2) valgrind --leak-check=full --show-reachable=yes $(which qpidd) > valgrind.txt 2>&1 3. (in first terminal on node1) service qpidd stop sleep 10 4. (in second terminal on node2) (stop valgrind) Actual results: ==24677== 1,193,876 (480 direct, 1,193,396 indirect) bytes in 1 blocks are definitely lost in loss record 452 of 452 ==24677== at 0x4C285BC: operator new(unsigned long) (vg_replace_malloc.c:298) ==24677== by 0x5062B58: qpid::broker::LinkRegistry::declare(std::string const&, std::string const&, unsigned short, std::string const&, bool, std::string const&, std::string const&, std::string const&, bool) (LinkRegistry.cpp:137) ==24677== by 0x7A9D819: qpid::ha::Backup::setBrokerUrl(qpid::Url const&) (Backup.cpp:75) ==24677== by 0x7ABE2E4: qpid::ha::HaBroker::setBrokerUrl(qpid::Url const&) (HaBroker.cpp:196) ==24677== by 0x7AC1C7D: qpid::ha::HaBroker::initialize() (HaBroker.cpp:122) ==24677== by 0x550F340: void qpid::(anonymous namespace)::each_plugin<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > >(boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > const&) (mem_fn_template.hpp:162) ==24677== by 0x550F381: qpid::Plugin::initializeAll(qpid::Plugin::Target&) (Plugin.cpp:91) ==24677== by 0x4FD3459: qpid::broker::Broker::Broker(qpid::broker::Broker::Options const&) (Broker.cpp:354) ==24677== by 0x4072D5: qpid::broker::QpiddBroker::execute(qpid::broker::QpiddOptions*) (in /usr/sbin/qpidd) ==24677== by 0x40CB03: qpid::broker::run_broker(int, char**, bool) (in /usr/sbin/qpidd) ==24677== by 0x674BCDC: (below main) (in /lib64/libc-2.12.so) ==24677== ==24677== LEAK SUMMARY: ==24677== definitely lost: 480 bytes in 1 blocks ==24677== indirectly lost: 1,193,396 bytes in 10,337 blocks ==24677== possibly lost: 1,253 bytes in 16 blocks ==24677== still reachable: 78,604 bytes in 511 blocks ==24677== suppressed: 0 bytes in 0 blocks ==24677== ==24677== For counts of detected and suppressed errors, rerun with: -v ==24677== ERROR SUMMARY: 15 errors from 15 contexts (suppressed: 130 from 9) Expected results: No definitely lost bytes. Additional info: creating the queue is essential for the reproducer, as well as waiting some time ("sleep 10") to let node2 broker to fail-over "qpid.ha.link.<uuid>" link first
Created attachment 793720 [details] Full valgrind log And I forgot to add /etc/qpid/qpidd.conf: log-to-file=/tmp/qpidd.log acl-file=/etc/qpid/qpidd.acl ha-replicate=all ha-cluster=yes ha-brokers-url=train1,train2,train3 ha-username=guest ha-password=guest ha-mechanism=PLAIN (acl-file has just "acl allow all all", authentication might not be required)
Playing with the reproducer little more: - sufficient scenario is start&promote a broker, do some provisioning, start a backup broker under valgrind and once it gets "active" state, shut it down. - No primary broker shutdown is required. While provisioning updated from primary to backup is MUST. - just the 480 bytes are lost, regardless of number of repetitions of any test case like above - that seems like the very first link creation is the guilty one Anyway I feel valgrind is wrong here, as: - it complains Link object to be forgotten - that is crated from LinkRegistry.cpp (LinkRegistry::declare method) only, where it is wrapped in Link::shared_ptr. Therefore boost library should destroy Link object after removing the latest reference to it (in LinkRegistry::destroy) - moreover checking where&how Link::shared_ptr is used, I see no reason to leaving the Link object unreferenced.. - if there would be a leak in link itself, a simple federation route would trigger it as well - but I tested it is not the case; BUT Link usage outside link itself is only in Backup::setBrokerUrl where the shared_ptr reference is local variable.. Or could be the reference somehow mis-used in BrokerReplicator called there?
Fixed upstream: Commit 1565340 from Alan Conway in branch 'qpid/trunk' [ https://svn.apache.org/r1565340 ] QPID-5544: HA memory leak in backup broker after shutdown. The memory leaks were due to shared_ptr cycles between the QueueReplicator, Bridge and Link objects. This patch breaks the cycles using weak_ptrs in the appropriate places.
Done on branch: http://git.app.eng.bos.redhat.com/rh-qpid.git/log/?h=0.22-mrg-aconway-bz1004411 Note there are 2 commits on this branch: aconway@mrg32 rh-qpid (0.22-mrg-aconway-bz1004411)$ git log 0.22-mrg.. dd15a27 QPID-5544: HA memory leak in backup broker after shutdown. bbf82ec NO-JIRA: Minor refactor to improve code safety: calling shared_from_this on creation.
Tested on RHEL6.5, this issue has been fixed. Packages used for testing: python-qpid-0.22-11.el6 python-qpid-qmf-0.22-27.el6 qpid-cpp-client-0.22-35.el6 qpid-cpp-client-devel-0.22-35.el6 qpid-cpp-client-devel-docs-0.22-35.el6 qpid-cpp-client-rdma-0.22-35.el6 qpid-cpp-server-0.22-35.el6 qpid-cpp-server-devel-0.22-35.el6 qpid-cpp-server-ha-0.22-35.el6 qpid-cpp-server-linearstore-0.22-35.el6 qpid-cpp-server-rdma-0.22-35.el6 qpid-cpp-server-xml-0.22-35.el6 qpid-cpp-tar-0.22-11.el6 qpid-proton-c-0.6-1.el6 qpid-qmf-0.22-27.el6 qpid-qmf-devel-0.22-27.el6 qpid-tools-0.22-8.el6 -> VERIFIED