Bug 499872
Summary: QMF sessions can cause cluster nodes to exit
Product: Red Hat Enterprise MRG
Reporter: Gordon Sim <gsim>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Frantisek Reznicek <freznice>
Severity: urgent
Priority: urgent
Version: 1.0
CC: esammons, freznice
Target Milestone: 1.1.2
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-06-12 17:38:52 UTC
Bug Blocks: 501015
Description
Gordon Sim
2009-05-08 17:07:45 UTC
Created attachment 343433
ping client
Gsim provided a patch to reproduce this problem more quickly. Completion criteria: test passes repeatedly.

Created attachment 343584
Patch to cause the error to occur more frequently.

Gordon Sim wrote:
> I had to actually modify the broker code to get a faster occurrence of the bug, specifically lowering the SPONTANEOUS_REQUEST_INTERVAL in the session state drastically.
>
> To reproduce, apply the patch and recompile, start two broker nodes and then run the patched ping example (in its current form it needs one of the nodes to be listening on port 5672).
>
> I find that the second node (i.e. the one the ping client is not connected to) then exits in a fairly short time.

Created attachment 343626
Proposed fix.
The attached patch appears to fix the issue for me; I can't reproduce it anymore.
gsim, can you verify this patch in your test environment before I commit?
Created attachment 343826
Fixed issue in previous fix.
Previous patch caused problems in other tests. This one looks good - passes make check on 2 boxes as well as the reproducer case.
I can confirm that the patch appears to prevent the problem occurring.

Committed fix for this specific issue in r774809, but we need a larger fix for cluster+management; see bug 501015.

Undid the commit in r775182. The fix is incorrect. The timer will go off in each member, and each one will send a response message which is replicated, resulting in a response from each member being enqueued rather than a single response. It's also heading in the wrong direction: it's making management more replicated, but the better solution will probably be to make management fully *un*replicated; see bug 501015.

Created attachment 344195
better reproducer
This is a better reproducing test case. It will issue continuous queries against a set of nodes. To use it, specify all the nodes to query in host:port format, e.g.:
./thrasher localhost:5672 localhost:5673 localhost:5674 localhost:5675
Declaring lots of queues before doing this will also make the error occur sooner.
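
The objection to the reverted fix in r774809 (the timer fires in every member, and every member's response is replicated) can be illustrated with a toy model. This is only a sketch, not qpid code: the function name, the node names, and the dict-of-lists queue model are all hypothetical, and it assumes each member's response is replicated to every member exactly once.

```python
def responses_after_replicated_timer(member_names):
    """Toy model: every member's timer fires once and each member sends
    a response that is replicated to all members of the cluster.

    Returns a dict mapping each member name to the list of responses
    enqueued on that member. Hypothetical illustration only; this does
    not use or mirror any real qpid API.
    """
    queues = {name: [] for name in member_names}
    for sender in member_names:
        # The timer went off on this member, so it produces a response.
        response = f"response from {sender}"
        # Replication delivers that response to every member's queue.
        for name in member_names:
            queues[name].append(response)
    return queues


queues = responses_after_replicated_timer(["node1", "node2", "node3"])
for name, enqueued in queues.items():
    print(name, len(enqueued), enqueued)
```

With three members, each queue ends up holding three responses (one per member) where the client expected a single one, which is the duplication described in the comment that undid the commit.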
Believed fixed in qpidd-0.5.752581-6.el5.

Created attachment 345421
Fix for 1.1.2
The attached patch is the fix that was applied for qpidd-0.5.752581-6.el5.
The issue has been fixed; validated on RHEL 5.3 i386 / x86_64 with these packages:

[root@intel-greencity-01 bz499872]# rpm -qa | egrep '(qpid|openais)' | sort -u
openais-0.80.3-22.el5_3.7
openais-debuginfo-0.80.3-22.el5_3.7
openais-devel-0.80.3-22.el5_3.7
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-10.el5
qpidc-debuginfo-0.5.752581-10.el5
qpidc-devel-0.5.752581-10.el5
qpidc-perftest-0.5.752581-10.el5
qpidc-rdma-0.5.752581-10.el5
qpidc-ssl-0.5.752581-10.el5
qpidd-0.5.752581-10.el5
qpidd-acl-0.5.752581-10.el5
qpidd-cluster-0.5.752581-10.el5
qpidd-devel-0.5.752581-10.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-10.el5
qpidd-ssl-0.5.752581-10.el5
qpidd-xml-0.5.752581-10.el5
qpid-java-client-0.5.751061-4.el5
qpid-java-common-0.5.751061-4.el5

->VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html