Bug 674338

Summary: Inconsistent management messages in a cluster, test fails sporadically
Product: Red Hat Enterprise MRG
Reporter: Alan Conway <aconway>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Frantisek Reznicek <freznice>
Severity: high
Docs Contact:
Priority: high
Version: 1.3
CC: esammons, freznice, gsim, iboverma, jneedle, tross
Target Milestone: 1.3.2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-mrg-0.7.946106-28
Doc Type: Bug Fix
Last Closed: 2011-02-15 12:11:38 UTC
Type: ---
Attachments:
Description: Reproducer that can be run on qpid installed from RPMs
Flags: none

Description Alan Conway 2011-02-01 14:46:21 UTC
Description of problem: An upstream test designed to verify consistent management messages in a cluster is failing due to inconsistencies.


Version-Release number of selected component (if applicable): trunk r1060879


How reproducible: moderate - fails about 1 out of 4 times


Steps to Reproduce:

Running the test in a qpid build: the test is currently disabled because it does not pass. To enable it, remove these lines from cpp/src/tests/cluster_test_logs.py (line 91):

    # FIXME aconway 2011-01-19: disable when called from unit tests
    # Causing sporadic failures, see https://issues.apache.org/jira/browse/QPID-3007
    if __name__ != "__main__": return

To run the test, from src/tests:

$ make check TESTS=run_cluster_tests  CLUSTER_TESTS='*Long*test_management* -DDURATION=4' &> make-check.log

Actual results: the test fails about 1 run in 4.

Expected results: no failures

Additional info: https://issues.apache.org/jira/browse/QPID-3007

Comment 1 Alan Conway 2011-02-01 21:35:05 UTC
Fixed by the following upstream revisions:

1066220 QPID-3007: Unique management identifier for connections.
1066219 QPID-3007: Ignore expected connection close warning in cluster_test_logs.py
1066217 QPID-3007: Don't hold on to consumer shared-pointers in UpdateClient::consumerNumbering
1066215 QPID-3007: Don't record management statistics in cluster-unsafe contexts.

Comment 2 Alan Conway 2011-02-02 19:17:59 UTC
Created attachment 476636 [details]
Reproducer that can be run on qpid installed from RPMs

The runme.sh script runs the test in a loop. Prior to the fix the test was failing every 4-5 iterations. With the fix it has not failed during an overnight run.
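For readers without the attachment, the looping approach can be sketched in Python. This is not the attached runme.sh; the function name run_repeatedly and the way the command is passed in are assumptions made for illustration only.

```python
import subprocess


def run_repeatedly(cmd, iterations):
    """Run cmd `iterations` times; return the 1-based iterations that failed.

    Hypothetical helper, not part of qpid or of the attached reproducer.
    """
    failed = []
    for i in range(1, iterations + 1):
        # A non-zero exit status is treated as a test failure.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failed.append(i)
    return failed
```

Called with the real test command as a list of arguments, an empty result after many iterations is what the fixed build would be expected to show; a failure every 4-5 iterations matches the behaviour described before the fix.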

Comment 5 Frantisek Reznicek 2011-02-04 09:01:51 UTC
The issue is under test at the moment.

Comment 7 Frantisek Reznicek 2011-02-04 11:03:16 UTC
Alan,
could you please confirm that the issue you saw is the following?
(the dump below is from a test run on -27)


...
cluster_tests.LongTests.test_management
................................................................................................................
pass
cluster_tests.LongTests.test_management_qmf2
...........................................................................................................
fail
Error during test:
  Traceback (most recent call last):
    File "./qpid-python-test", line 311, in run
      phase()
    File "/root/bz/bz674338/cluster_mgmt_674338/cluster_tests.py", line 454, in test_management_qmf2
      self.test_management(args=["--mgmt-qmf2=yes"])
    File "/root/bz/bz674338/cluster_mgmt_674338/cluster_tests.py", line 451, in test_management
      cluster_test_logs.verify_logs()
    File "/root/bz/bz674338/cluster_mgmt_674338/cluster_test_logs.py", line 106, in verify_logs
      raise Exception("Files differ in %s"%(os.getcwd())+"".join(errors))
  Exception: Files differ in /root/bz/bz674338/cluster_mgmt_674338/brokertest.tmp/cluster_tests.LongTests.test_management_qmf2
      cluster1-24.log.filter.8173859 cluster1-23.log.filter.8173859
Totals: 2 tests, 1 passed, 0 skipped, 0 ignored, 1 failed

Moreover, is it expected to execute just tests
cluster_tests.LongTests.test_management and
cluster_tests.LongTests.test_management_qmf2 for this defect?


Ongoing testing indicates the issue is fixed.
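The "Files differ" exception in the dump comes from comparing the filtered logs of the cluster members. The general idea can be sketched as follows. This is not the code in cluster_test_logs.py; the function name differing_logs is hypothetical, and the filtering step that produces the *.log.filter.* files is assumed to have already run.

```python
def differing_logs(paths):
    """Compare each file to the first; return (first, other) pairs that differ.

    Illustrative only: assumes the logs were already filtered so that
    consistent brokers produce byte-identical files.
    """
    if not paths:
        return []
    with open(paths[0], "rb") as f:
        reference = f.read()
    errors = []
    for other in paths[1:]:
        with open(other, "rb") as f:
            if f.read() != reference:
                errors.append((paths[0], other))
    return errors
```

A non-empty result corresponds to the failure shown above, where two filtered broker logs were reported as differing.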

Comment 8 Alan Conway 2011-02-04 15:08:50 UTC
Yes that is the issue.

> Moreover, it is expected to execute just tests
> cluster_tests.LongTests.test_management and
> cluster_tests.LongTests.test_management_qmf2 for this defect?

Those are the only tests that reliably show the defect. In sporadic cases where qpid-tool is used with a cluster it can cause brokers to exit with an "invalid-arg" error but I have not been able to reproduce that reliably.

Comment 9 Frantisek Reznicek 2011-02-04 16:33:06 UTC
Thanks Alan,
I was able to reproduce the issue reliably on the -27 build, and ran the tests with extended duration to prove that the issue has been fixed (on -28).

Extensive testing was run in parallel on 6 machines (more than 25 hours in total, more than 150 runs).
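As a rough sanity check on these numbers, one can estimate how likely an unfixed build would be to pass that many runs. The sketch below assumes the 1-in-4 per-run failure rate from the description, assumes runs are independent, and assumes all 150 runs were on the fixed build; none of these is confirmed by the report.

```python
failure_rate = 0.25  # assumed per-run failure rate before the fix (from the description)
runs = 150           # lower bound on the number of runs reported

# Probability that every run passes if the defect were still present,
# treating the runs as independent trials.
p_all_pass = (1 - failure_rate) ** runs
print(f"P(all {runs} runs pass | unfixed) = {p_all_pass:.1e}")
```

Under those assumptions the probability is on the order of 1e-19, so a clean series of this length is strong evidence that the sporadic failure is gone.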

The issue has been fixed, tested on RHEL 5.6 i386 / x86_64 on packages:
python-qpid-0.7.946106-15.el5.noarch
qpid-cpp-client-0.7.946106-28.el5.i386
qpid-cpp-client-devel-0.7.946106-28.el5.i386
qpid-cpp-client-devel-docs-0.7.946106-28.el5.i386
qpid-cpp-client-rdma-0.7.946106-28.el5.i386
qpid-cpp-client-ssl-0.7.946106-28.el5.i386
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5.i386
qpid-cpp-server-0.7.946106-28.el5.i386
qpid-cpp-server-cluster-0.7.946106-28.el5.i386
qpid-cpp-server-devel-0.7.946106-28.el5.i386
qpid-cpp-server-rdma-0.7.946106-28.el5.i386
qpid-cpp-server-ssl-0.7.946106-28.el5.i386
qpid-cpp-server-store-0.7.946106-28.el5.i386
qpid-cpp-server-xml-0.7.946106-28.el5.i386
qpid-dotnet-0.4.738274-2.el5.i386
qpid-java-client-0.7.946106-15.el5.noarch
qpid-java-common-0.7.946106-15.el5.noarch
qpid-java-example-0.7.946106-15.el5.noarch
qpid-tests-0.7.946106-1.el5.noarch
qpid-tools-0.7.946106-12.el5.noarch
rh-qpid-cpp-tests-0.7.946106-28.el5.i386


-> VERIFIED

Comment 10 errata-xmlrpc 2011-02-15 12:11:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html