Bug 682815 - Cluster management inconsistency when using persistent store.
Summary: Cluster management inconsistency when using persistent store.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0
Assignee: Alan Conway
QA Contact: Petr Matousek
URL:
Whiteboard:
Depends On: 716457 716458
Blocks: 818530 716249
 
Reported: 2011-03-07 16:35 UTC by Alan Conway
Modified: 2012-05-03 09:38 UTC
CC List: 7 users

Fixed In Version: qpid-cpp-mrg-0.9.1079953-1
Doc Type: Bug Fix
Doc Text:
Cause: Internal test showed inconsistencies in management data in a cluster. Consequence: Could potentially cause cluster brokers to exit with an "invalid argument" error. Fix: Corrected the inconsistencies. Result: Cluster brokers show consistent management data.
Clone Of:
Cloned As: 716249 818530
Environment:
Last Closed: 2012-03-29 20:05:21 UTC
Target Upstream Version:
Embargoed:


Attachments
Script to reproduce on a qpid installation (100.00 KB, application/x-tar)
2011-03-07 16:35 UTC, Alan Conway
bz682815 - logs diff (6.78 KB, text/plain)
2011-06-08 09:54 UTC, Petr Matousek

Description Alan Conway 2011-03-07 16:35:21 UTC
Created attachment 482724 [details]
Script to reproduce on a qpid installation

Description of problem:

If test_management in cluster_tests.py is modified to enable durable messages, it fails: the log comparison test shows messages like these on one broker but not on the others:

trace Changed V1 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104 len=NN
trace Changed V2 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104

To date this has not been seen to actually cause a cluster crash, but in principle it could.
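
For context, here is a minimal sketch (not taken from cluster_tests.py) of what enabling durable messages amounts to, using the qpid.messaging Python API; the broker address and queue name are illustrative only:

# Sketch only, not from cluster_tests.py: send a durable message to a durable
# queue with the qpid.messaging Python API. Broker address and queue name are
# illustrative.
from qpid.messaging import Connection, Message

conn = Connection("localhost:5672")
conn.open()
try:
    session = conn.session()
    # A durable queue plus a durable message means the store is exercised.
    sender = session.sender("test-queue; {create: always, node: {durable: True}}")
    sender.send(Message(content="payload", durable=True))
finally:
    conn.close()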

See https://issues.apache.org/jira/browse/QPID-3121

Version-Release number of selected component (if applicable): r1077953


How reproducible: 100%


Steps to Reproduce:
1. Install qpid and store 
2. Run attached runme.sh
  
Actual results:

Fails, usually on the first iteration, showing log files that do not match.
Use diff or a similar program to verify that lines like these appear in one log but not another:

Changed V1 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104 len=NN
trace Changed V2 statistics org.apache.qpid.broker:connection:127.0.0.1:52742-127.0.0.1:44104
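
As an illustration only (this is not the test suite's verify_logs(); the log file names are hypothetical), a small Python script can surface the lines that appear in one broker log but not another:

# Sketch only, not the verify_logs() from the test suite. Prints lines that
# appear in one broker log but not the other; log file names are hypothetical.
import difflib

def log_lines(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

a = log_lines("cluster0-0.log")
b = log_lines("cluster0-1.log")

# Keep only the added/removed lines from the unified diff.
for line in difflib.unified_diff(a, b, fromfile="cluster0-0.log",
                                 tofile="cluster0-1.log", lineterm=""):
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
        print(line)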

Expected results:

Test passes repeatedly.

Comment 1 Alan Conway 2011-03-07 21:13:59 UTC
Fix committed upstream r1078947

Comment 2 Alan Conway 2011-03-11 22:05:59 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Internal test showed inconsistencies in management data in a cluster.
Consequence: Could potentially cause cluster brokers to exit with an "invalid argument" error.
Fix: Corrected the inconsistencies.
Result: Cluster brokers show consistent management data.

Comment 3 Alan Conway 2011-03-15 14:07:05 UTC
This was found and fixed during the investigation of bug 681026; it was not reported by a customer. Setting skip-errata?

Comment 4 Jeff Needle 2011-03-15 14:48:15 UTC
We shouldn't set skip-errata for things that are in shipping code. Sure, we found this ourselves, but customers may have seen this, right? If so, it's definitely worth a mention in the errata.

Comment 6 Petr Matousek 2011-05-18 15:07:50 UTC
The issue occurrence was greatly reduced, but it still persists.

Test Reproducer Results:
RHEL5.6 i386x - FAILED on the 18th iteration in verify_logs()
RHEL5.6 x86_64 - FAILED on the 44th iteration in verify_logs()
RHEL6.1 i386x - PASSED 100 iterations

Lines similar to those mentioned in Comment 1 can be seen in the log files diff.

--
CPU information:
processor       : 0 1 2 3 4 5 6 7
vendor_id       : AuthenticAMD
model name      : Quad-Core AMD Opteron(tm) Processor 2376
cpu MHz         : 800.000
cpu cores       : 4
bogomips        : 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10 1596.10
[05:31:00] Memory info:
             total       used       free     shared    buffers     cached
Mem:       8312316    1646804    6665512          0     504684     509964
-/+ buffers/cache:     632156    7680160
Swap:      5668856       1948    5666908

Comment 8 Alan Conway 2011-05-25 18:36:10 UTC
What package set was used to reproduce the problem in comment 6?

Comment 9 Petr Matousek 2011-05-26 08:27:10 UTC
qpid-cpp-mrg-0.10-6

Comment 10 Alan Conway 2011-05-26 16:09:03 UTC
I've been testing qpid-cpp-server-0.10-7.el5 for about 11 hours, or 400 iterations. So far no error. However, I don't see changes between 0.10-6 and 0.10-7 that would be likely to have fixed the problem. I'll continue the test.

Comment 11 Alan Conway 2011-05-26 16:21:12 UTC
Correction to comment 10: there are some changes between -6 and -7 that might have had an effect:

692bf60 Bug 698721 - durable messages transmission - performance issue
cd703ec Bug 697913, QPID-3214 Committing the suggested fix by Gordon Sim. 
b8e612b BZ 693895 -- CyrusSasl ctor logic change for callback storage

Comment 12 Alan Conway 2011-06-06 20:53:10 UTC
Can this be reproduced on the latest packages? I haven't been able to.

Comment 13 Petr Matousek 2011-06-07 13:00:10 UTC
I will retest the issue on the latest packages and post an update tomorrow.

Comment 14 Petr Matousek 2011-06-08 09:54:25 UTC
Created attachment 503656 [details]
bz682815 - logs diff

Testing the issue with the reproducer ran over 16 hours and passed more than 100 iterations. I've reproduced the failure in verify_logs(), but the difference in the logs is not the one originally reported; please see the attached text file.

Results:

Test Reproducer Results:
RHEL5.6 i386x - PASSED 100 iterations
RHEL5.6 x86_64 - PASSED 100 iterations
RHEL6.1 i386x - PASSED 100 iterations
RHEL6.1 x86_64 - FAILED on the 18th iteration in verify_logs()

I will continue with the testing.

Comment 15 Petr Matousek 2011-06-08 11:32:47 UTC
Correction:
- RHEL6.1 x86_64 - FAILED on the 18th iteration in verify_logs()
+ RHEL6.1 x86_64 - FAILED on the 4th iteration in verify_logs()

Comment 16 Alan Conway 2011-06-09 15:49:05 UTC
Can you try to reproduce this with these config entries:

log-enable=info+
log-enable=debug+:cluster

That will give us more info on what happened.

Comment 17 Alan Conway 2011-06-09 19:50:50 UTC
(In reply to comment #16)
> Can you try to reproduce this with these config entries:
> 
> log-enable=info+
> log-enable=debug+:cluster
> 
> That will give us more info on what happened.

I see those log levels are already set in the reproducer. If you do reproduce the bug again, can you attach the full logs from all the brokers?

Comment 19 Petr Matousek 2011-06-13 08:59:53 UTC
I was unable to reproduce the originally reported issue on the latest packages, but I've discovered several different failures while running the reproducer.

There are three types of failures:
1. Failure in verify_logs() reported in Comment 14

2. BadProcessStatus failure:
cluster_tests.LongTests.test_management ................................. fail
Error during test:
  Traceback (most recent call last):
    File "./qpid-python-test", line 311, in run
      phase()
    File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 834, in test_management
      c.stop()
    File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 777, in stop
      StoppableThread.stop(self)
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 565, in stop
      if self.error: raise self.error
  RethrownException: Error in ClientLoop.run
  Traceback (most recent call last):
    File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 760, in run
      self.process.unexpected(
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 194, in unexpected
      raise BadProcessStatus("%s %s%s" % (self.pname, msg, err))
  BadProcessStatus: testagent-31028 client of cluster0-6 exit code -11
Error during teardown:
  Traceback (most recent call last):
    File "./qpid-python-test", line 311, in run
      phase()
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 506, in tearDown
      if err: raise Exception("Unexpected process status:\n    "+"\n    ".join(err))
  Exception: Unexpected process status:
      Wait failed qpid-tool-31737: [Errno 10] No child processes
      Wait failed qpid-perftest-31738: [Errno 10] No child processes
      Wait failed qpid-queue-stats-31739: [Errno 10] No child processes
      Wait failed testagent-31742: [Errno 10] No child processes
      Wait failed qpid-txtest-31740: [Errno 10] No child processes
      Wait failed qpid-perftest-31774: [Errno 10] No child processes

3. Timeout exception on the Broker:
cluster_tests.LongTests.test_management ................................. fail
Error during test:
  Traceback (most recent call last):
    File "./qpid-python-test", line 311, in run
      phase()
    File "/root/pematous/bz682815/run_cluster_tests2/cluster_tests.py", line 830, in test_management
      cluster.start()
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 458, in start
      self._brokers.append(self.test.broker(self.args+args, name, expect, wait, port=port))
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 525, in broker
      raise RethrownException("Failed to start broker %s(%s): %s" % (b.name, b.log, e))
  RethrownException: Failed to start broker cluster0-7(cluster0-7.log): Timed out waiting for broker cluster0-7:
      2011-06-11 13:04:39 notice Cluster store state: empty
      2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) configuration change: 10.34.45.2:7122 10.34.45.2:7388 10.34.45.2:9210
      2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) Members joined: 10.34.45.2:9210
      2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:9210
      2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:7122

  Traceback (most recent call last):
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 523, in broker
      try: b.ready()
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 418, in ready
      raise Exception(
  Exception: Timed out waiting for broker cluster0-7:
      2011-06-11 13:04:39 notice Cluster store state: empty
      2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) configuration change: 10.34.45.2:7122 10.34.45.2:7388 10.34.45.2:9210
      2011-06-11 13:04:39 notice cluster(10.34.45.2:9210 PRE_INIT) Members joined: 10.34.45.2:9210
      2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:9210
      2011-06-11 13:04:39 debug cluster(10.34.45.2:9210 PRE_INIT) received initial status from 10.34.45.2:7122
Error during teardown:
  Traceback (most recent call last):
    File "./qpid-python-test", line 311, in run
      phase()
    File "/root/pematous/bz682815/run_cluster_tests2/brokertest.py", line 506, in tearDown
      if err: raise Exception("Unexpected process status:\n    "+"\n    ".join(err))
  Exception: Unexpected process status:
      Wait failed testagent-7293: [Errno 10] No child processes
      Wait failed qpid-queue-stats-7295: [Errno 10] No child processes
      Wait failed testagent-7895: [Errno 10] No child processes

Comment 20 Petr Matousek 2011-06-13 09:01:54 UTC
Please find below results of my recent testing of this issue:

Test Reproducer Results:
RHEL5.6 i386x - FAILED on the 122nd iteration with the BadProcessStatus failure
RHEL5.6 x86_64 - FAILED on the 181st iteration with the Timeout exception on the Broker
RHEL6.1 i386x - FAILED on the 192nd iteration with the BadProcessStatus failure
RHEL6.1 x86_64 - PASSED 550 iterations

The originally reported issue was not seen, but it is possible that the errors mentioned above block the occurrence of the reported issue.

Alan, do you think we should address the failures mentioned above as part of this BZ, or should I create separate bugzillas for them?

Comment 22 Alan Conway 2011-06-13 12:52:27 UTC
Let's keep them together for now.

Comment 23 Alan Conway 2011-06-22 18:46:31 UTC
For the failures in verify_logs() - both the original one and the one in comment 14 - a plausible fix is https://reviews.apache.org/r/943/. However, I have not been able to reproduce the problem, so I can't verify it.

For the errors like:
  BadProcessStatus: testagent-31028 client of cluster0-6 exit code -11

This is a core dump in the testagent client, so let's make a separate BZ for that.

For the timeout on the broker, I haven't made any progress on that yet, but let's also make that a separate BZ.

Comment 24 Alan Conway 2011-06-23 13:41:47 UTC
Fix for verify_logs failures committed to trunk r1138874

Comment 25 Petr Matousek 2011-06-24 14:10:32 UTC
A new Bug 716457 was created for the 'BadProcessStatus' issue mentioned in Comment 19.

A new Bug 716458 was created for the 'Timeout on the broker' issue mentioned in Comment 19.

Removing 'Fixed In Version - qpid-cpp-mrg-0.9.1079953-1' for this bug according to Comment 24.

Comment 26 Ted Ross 2012-03-29 20:05:21 UTC
Fixed well prior to the 0.14 rebase

Comment 28 Alan Conway 2012-04-24 15:13:34 UTC
The fix from comment 24 is in build qpid-cpp-0.14-14, so there is something else going wrong.

