| Summary: | [store] Deadlock in BDB database del() function | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Kim van der Riet <kim.vdriet> | ||||||
| Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Petr Matousek <pematous> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | Development | CC: | gsim, iboverma, jneedle, jross, pematous | ||||||
| Target Milestone: | 2.0 | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | qpid-cpp-mrg-0.9.1079953 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2011-06-23 15:43:45 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Attachments: | ||||||||
Description
Kim van der Riet, 2011-02-24 18:25:56 UTC
Review of the code shows that the Db::del() function should be free-threaded and should not have a problem with multiple threads. However, adding a lock to the MessageStoreImpl::destroy() function has definitely solved this bug: the attached script, which reliably failed the 300-subscriber test, now runs the 300, 1000 and 3000 subscriber tests without a problem. Fixed in r.4444.

I got different results per RHEL version while testing the issue with the attached reproducer on my VMs.

RHEL5: perf-topic.sh stops execution in the 1-subscriber durable section. The connection was closed by the broker because the enqueue capacity threshold was exceeded:

2011-04-26 13:27:57 error Unexpected exception: Enqueue capacity threshold exceeded on queue "anonymous.016d17cc-3581-4162-ad17-ccf07fe6e351". (JournalImpl.cpp:587)
2011-04-26 13:27:57 error Connection 127.0.0.1:5672-127.0.0.1:46408 closed by error: Enqueue capacity threshold exceeded on queue "anonymous.016d17cc-3581-4162-ad17-ccf07fe6e351". (JournalImpl.cpp:587)(501)

RHEL6: perf-topic.sh stops execution in the 300-subscriber durable section. The connection was closed by the broker with a "Too many open files" exception:

2011-04-26 12:40:35 warning Broker closed connection: 501, Queue anonymous.a3243978-5ccf-4ad8-ba90-73eeaea402ea: create() failed: jexception 0x0400 fcntl::clean_file() threw JERR_FCNTL_OPENWR: Unable to open file for write. (open() failed: errno=24 (Too many open files)) (MessageStoreImpl.cpp:533)
SubscribeThread exception: framing-error: Queue anonymous.a3243978-5ccf-4ad8-ba90-73eeaea402ea: create() failed: jexception 0x0400 fcntl::clean_file() threw JERR_FCNTL_OPENWR: Unable to open file for write. (open() failed: errno=24 (Too many open files)) (MessageStoreImpl.cpp:533)

-> handing over to freznice for further testing on real hardware

Enqueue Threshold exceptions are well understood: they relate to the cumulative number of messages on the store as well as the correct ordering of message consumption, i.e. consuming messages in the order in which they were received. If one OS enqueues at a greater rate than another relative to the consume rate, then this exception may occur on that OS and not the other. Make the store larger to accommodate the slower OS.

"Too many open files" is also well known and typically occurs when there are a large number of persistent queues and/or --num-jfiles is set to a high number. Each journal file holds one file handle for the life of the queue on that broker, and by default each user may hold no more than 1024 file handles open at one time. To increase this limit, set a higher value in /etc/security/limits.conf:

userid - nofile 2048

See the man page for limits.conf for further details. If you are making this change for an installed broker, the userid would be "qpidd". If you are running many durable tests on a single broker instance, make sure the queues of previous tests are deleted, thus releasing the file handles associated with those tests. Your test is a topic test, which creates one queue (and hence one journal) per subscription; since each journal has 8 files, this soon consumes all 1024 available file handles. I have successfully tested up to 10,000 subscribers to a topic, but the file handle limit needs to be raised to 64k, and the AIO handle limit fs.aio-max-nr must also be raised via sysctl. Also make sure there is enough disk space for the total journal footprint.

This issue has been fixed.

Verified on RHEL5.6 and RHEL6.1, architectures i386 and x86_64. Successfully tested up to 10000 transient subscribers (3000 durable subscribers due to a disk space limit) with the attached reproducer on RHEL5.6 and RHEL6.1 x86_64.
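For reference, the tuning described above can be scripted roughly as follows. The nofile value of 65536 corresponds to the "64k" figure mentioned for the 10,000-subscriber run; the fs.aio-max-nr value and the qpid-config journal sizing numbers are illustrative assumptions, not values taken from this report, and should be sized to your queue count and workload.

```shell
# Raise the per-user open-file limit for the broker user ("qpidd" for an
# installed broker) by adding a line to /etc/security/limits.conf:
echo 'qpidd - nofile 65536' >> /etc/security/limits.conf

# Raise the kernel AIO request limit (value is an illustrative assumption):
sysctl -w fs.aio-max-nr=1048576

# Verify the current limits:
ulimit -n
sysctl fs.aio-max-nr

# The enqueue-threshold remedy ("make the store larger") can be applied per
# durable queue via the journal sizing options, e.g.:
qpid-config add queue my-durable-queue --durable --file-count 16 --file-size 64
```

Note that limits.conf changes take effect only on a new login session for the affected user, so the broker must be restarted under the new limit.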
Successfully tested up to 100 transient/durable subscribers (limited by insufficient resources to create further threads) on RHEL5.6 and RHEL6.1 i386. Since this issue was originally found on an x86_64 system and no hang occurred in repeated testing, moving to verified.

Packages installed:

python-qpid-0.10-1.el5.noarch
python-qpid-qmf-0.10-8.el5.x86_64
qpid-cpp-client-0.10-7.el5.x86_64
qpid-cpp-client-devel-0.10-7.el5.x86_64
qpid-cpp-client-devel-docs-0.10-7.el5.x86_64
qpid-cpp-client-ssl-0.10-7.el5.x86_64
qpid-cpp-mrg-debuginfo-0.10-6.el5.x86_64
qpid-cpp-server-0.10-7.el5.x86_64
qpid-cpp-server-cluster-0.10-7.el5.x86_64
qpid-cpp-server-devel-0.10-7.el5.x86_64
qpid-cpp-server-ssl-0.10-7.el5.x86_64
qpid-cpp-server-store-0.10-7.el5.x86_64
qpid-cpp-server-xml-0.10-7.el5.x86_64
qpid-java-client-0.10-6.el5.noarch
qpid-java-common-0.10-6.el5.noarch
qpid-java-example-0.10-6.el5.noarch
qpid-qmf-0.10-8.el5.x86_64
qpid-qmf-debuginfo-0.10-6.el5.x86_64
qpid-qmf-devel-0.10-8.el5.x86_64
qpid-tools-0.10-5.el5.noarch

-> VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html