Created attachment 482724 [details]
Script to reproduce on a qpid installation

Description of problem:
If cluster_tests.py test_management is modified to enable durable messages, it fails the log comparison: the test shows messages like the following on one broker but not the others:

trace Changed V1 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104 len=NN
trace Changed V2 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104

To date this hasn't been seen to actually cause a cluster crash, but in principle it could. See https://issues.apache.org/jira/browse/QPID-3121

Version-Release number of selected component (if applicable):
r1077953

How reproducible:
100%

Steps to Reproduce:
1. Install qpid and store
2. Run attached runme.sh

Actual results:
Fails, usually on the first iteration, reporting log files that didn't match. Use diff or a similar tool to verify that lines like these are in one log but not another:

trace Changed V1 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104 len=NN
trace Changed V2 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104

Expected results:
Test passes repeatedly.
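For illustration, the kind of comparison verify_logs() performs can be sketched as follows. This is not the actual cluster_tests.py code; the function names and the exact filtering rules are assumptions. The idea is to strip per-broker timestamps and then compare what remains, so lines like the nondeterministic "Changed V1/V2 statistics" messages above show up as a mismatch:

```python
import re

# Lines that in this bug differ between brokers' logs.
CHANGED_STATS = re.compile(r"Changed V[12] statistics")
# Leading "YYYY-MM-DD HH:MM:SS " timestamp, which always differs.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")

def normalize(lines):
    """Return comparable log lines: timestamps stripped."""
    return [TIMESTAMP.sub("", line) for line in lines]

def logs_match(path_a, path_b):
    """True if two broker logs agree line-for-line after normalization."""
    with open(path_a) as a, open(path_b) as b:
        return normalize(a) == normalize(b)

def suspect_lines(lines):
    """The management-statistics lines that this bug makes inconsistent."""
    return [l for l in lines if CHANGED_STATS.search(l)]
```

With this sketch, `logs_match()` on two brokers' logs fails exactly when one log contains "Changed V1/V2 statistics" lines the other lacks, and `suspect_lines()` pulls out the offending lines for a diff.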
Fix committed upstream r1078947
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Internal test showed inconsistencies in management data in a cluster.
Consequence: Could potentially cause cluster brokers to exit with an "invalid argument" error.
Fix: Corrected the inconsistencies.
Result: Cluster brokers show consistent management data.
This was found & fixed during the investigation of bug 681026; it was not reported by a customer. Setting skip-errata?
We shouldn't set skip-errata for things that are in shipping code. Sure, we found this ourselves, but customers may have seen this, right? If so, it's definitely worth a mention in the errata.
The issue occurrence was greatly reduced but still persists.

Test Reproducer Results:
RHEL5.6 i386x  - FAILED on 18th iteration in verify_logs()
RHEL5.6 x86_64 - FAILED on 44th iteration in verify_logs()
RHEL6.1 i386x  - PASSED 100 iterations

Lines similar to the lines mentioned in Comment 1 can be seen in the log files diff.

--
CPU information:
processor  : 0 1 2 3 4 5 6 7
vendor_id  : AuthenticAMD
model name : Quad-Core AMD Opteron(tm) Processor 2376
cpu MHz    : 800.000
cpu cores  : 4
bogomips   : 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10

[05:31:00] Memory info:
             total       used       free     shared    buffers     cached
Mem:       8312316    1646804    6665512          0     504684     509964
-/+ buffers/cache:     632156    7680160
Swap:      5668856       1948    5666908
what package set was used to reproduce the problem in comment 6?
qpid-cpp-mrg-0.10-6
I've been testing qpid-cpp-server-0.10-7.el5 for about 11 hours, or roughly 400 iterations. So far no error. However, I don't see changes between 0.10-6 and 0.10-7 that would be likely to have fixed the problem. I'll continue the test.
Correction to comment 10: there are some changes between -6 and -7 that might have had an effect:

692bf60 Bug 698721 - durable messages transmission - performance issue
cd703ec Bug 697913, QPID-3214 Committing the suggested fix by Gordon Sim.
b8e612b BZ 693895 -- CyrusSasl ctor logic change for callback storage
Can this be reproduced on the latest packages? I haven't been able to.
I will retest the issue on the latest packages and I will make an update tomorrow.
Created attachment 503656 [details]
bz682815 - logs diff

Testing the issue with the reproducer over 16 hours, it passed more than 100 iterations. I've reproduced the failure in verify_logs(), but the difference in the logs is not the one originally reported; please see the attached text file.

Test Reproducer Results:
RHEL5.6 i386x  - PASSED 100 iterations
RHEL5.6 x86_64 - PASSED 100 iterations
RHEL6.1 i386x  - PASSED 100 iterations
RHEL6.1 x86_64 - FAILED on 18th iteration in verify_logs()

I will continue with the testing.
Correction: - RHEL6.1 x86_64 - FAILED on 18 iteration in verify_logs() + RHEL6.1 x86_64 - FAILED on 4th iteration in verify_logs()
Can you try to reproduce this with these config entries:

log-enable=info+
log-enable=debug+:cluster

That will give us more info on what happened.
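For reference, those entries go in the broker's config file; the path below is the conventional location for qpidd's config and may differ on your install (they can also be passed on the command line as --log-enable options):

```
# e.g. /etc/qpidd.conf (location varies by install)
log-enable=info+
log-enable=debug+:cluster
```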
(In reply to comment #16)
> Can you try to reproduce this with these config entries:
>
> log-enable=info+
> log-enable=debug+:cluster
>
> that will give us more info on what happened.

I see those log levels are already set in the reproducer. If you do reproduce the bug again, can you attach the full logs from all the brokers?
I was unable to reproduce the originally reported issue on the latest packages, but I've discovered several different failures while running the reproducer. There are three types of failures:

1. Failure in verify_logs() reported in Comment 14

2. BadProcessStatus failure:

cluster_tests.LongTests.test_management ................................. fail
Error during test: Traceback (most recent call last):
  File "./qpid-python-test", line 311, in run
    phase()
  File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 834, in test_management
    c.stop()
  File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 777, in stop
    StoppableThread.stop(self)
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 565, in stop
    if self.error: raise self.error
RethrownException: Error in ClientLoop.run
Traceback (most recent call last):
  File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 760, in run
    self.process.unexpected(
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 194, in unexpected
    raise BadProcessStatus("%s %s%s" % (self.pname, msg, err))
BadProcessStatus: testagent-31028 client of cluster0-6 exit code -11

Error during teardown: Traceback (most recent call last):
  File "./qpid-python-test", line 311, in run
    phase()
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 506, in tearDown
    if err: raise Exception("Unexpected process status:\n  "+"\n  ".join(err))
Exception: Unexpected process status:
  Wait failed qpid-tool-31737: [Errno 10] No child processes
  Wait failed qpid-perftest-31738: [Errno 10] No child processes
  Wait failed qpid-queue-stats-31739: [Errno 10] No child processes
  Wait failed testagent-31742: [Errno 10] No child processes
  Wait failed qpid-txtest-31740: [Errno 10] No child processes
  Wait failed qpid-perftest-31774: [Errno 10] No child processes

3. Timeout exception on the Broker:

cluster_tests.LongTests.test_management ................................. fail
Error during test: Traceback (most recent call last):
  File "./qpid-python-test", line 311, in run
    phase()
  File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 830, in test_management
    cluster.start()
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 458, in start
    self._brokers.append(self.test.broker(self.args+args, name, expect, wait, port=port))
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 525, in broker
    raise RethrownException("Failed to start broker %s(%s): %s" % (b.name, b.log, e))
RethrownException: Failed to start broker cluster0-7(cluster0-7.log): Timed out waiting for broker cluster0-7:
2011-06-11 13:04:39 notice Cluster store state: empty
2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) configuration change: 10.34.45.2:7122 10.34.45.2:7388 10.34.45.2:9210
2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) Members joined: 10.34.45.2:9210
2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:9210
2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:7122
Traceback (most recent call last):
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 523, in broker
    try: b.ready()
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 418, in ready
    raise Exception(
Exception: Timed out waiting for broker cluster0-7:
2011-06-11 13:04:39 notice Cluster store state: empty
2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) configuration change: 10.34.45.2:7122 10.34.45.2:7388 10.34.45.2:9210
2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) Members joined: 10.34.45.2:9210
2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:9210
2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:7122

Error during teardown: Traceback (most recent call last):
  File "./qpid-python-test", line 311, in run
    phase()
  File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 506, in tearDown
    if err: raise Exception("Unexpected process status:\n  "+"\n  ".join(err))
Exception: Unexpected process status:
  Wait failed testagent-7293: [Errno 10] No child processes
  Wait failed qpid-queue-stats-7295: [Errno 10] No child processes
  Wait failed testagent-7895: [Errno 10] No child processes
Please find below the results of my recent testing of this issue:

Test Reproducer Results:
RHEL5.6 i386x  - FAILED on 122nd iteration with the BadProcessStatus failure
RHEL5.6 x86_64 - FAILED on 181st iteration with the Timeout exception on the Broker
RHEL6.1 i386x  - FAILED on 192nd iteration with the BadProcessStatus failure
RHEL6.1 x86_64 - PASSED 550 iterations

The originally reported issue was not seen, but it is possible that the errors mentioned above block the reported issue's occurrence.

Alan, do you think we should solve the failures mentioned above as part of this BZ, or should I create separate bugzillas for them?
Let's keep them together for now.
For the failures in verify_logs() - both the original one and the one in comment 14 - a plausible fix is https://reviews.apache.org/r/943/. I have not been able to reproduce the problem, however, so I can't verify it.

For errors like:

BadProcessStatus: testagent-31028 client of cluster0-6 exit code -11

this is a core dump in the testagent client, so let's make a separate BZ for that.

For the timeout on the broker, I haven't made any progress with that yet, but let's also make that a separate BZ.
Fix for verify_logs failures committed to trunk r1138874
A new Bug 716457 was created for the 'BadProcessStatus' issue mentioned in Comment 19.
A new Bug 716458 was created for the 'Timeout on the broker' issue mentioned in Comment 19.

Removing 'Fixed In Version - qpid-cpp-mrg-0.9.1079953-1' for this bug according to Comment 24.
Fixed well prior to the 0.14 rebase
The fix from comment 24 is in build qpid-cpp-0.14-14, so there is something else going wrong.