Bug 1024528 - QPID-5275 HA transactions failing in qpid-cluster-benchmark
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Severity: high
Version: 3.0
Assigned To: Alan Conway
QA Contact: Leonid Zhaldybin
Reported: 2013-10-29 16:56 EDT by Alan Conway
Modified: 2014-11-09 17:39 EST

Fixed In Version: qpid-cpp-0.22-35
Doc Type: Bug Fix
Last Closed: 2014-09-24 11:08:57 EDT
Type: Bug


External Trackers
Tracker ID Priority Status Summary Last Updated
Apache JIRA QPID-5275 None None None Never
Red Hat Product Errata RHEA-2014:1296 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.0 Release 2014-09-24 15:00:06 EDT

Description Alan Conway 2013-10-29 16:56:07 EDT
Description of problem:

Run the following test (from qpid/cpp/src/tests) against a 3 node cluster:

qpid-cluster-benchmark b -n1 -m 1000 -q6 -s3 -r3 --send-arg=tx --send-arg=10 --receive-arg=-tx --receive-arg=10

Actual result: the log file from the primary broker contains many errors like this:

Primary transaction 393340e4: Backup disconnected: e4886fe3@tcp:

The log files on the backup brokers contain many messages like this:

Backup of transaction 6db1a2e2: Rollback

Version-Release number of selected component (if applicable):
Qpid trunk r1536857

How reproducible: 100%

Steps to Reproduce:

Run the following test (from qpid/cpp/src/tests) against a 3 node cluster:

qpid-cluster-benchmark b -n1 -m 1000 -q6 -s3 -r3 --send-arg=tx --send-arg=10 --receive-arg=-tx --receive-arg=10

Actual results: Log files contain errors and Rollback messages

Expected results: No errors, no rollback messages.

Additional info:
Comment 3 Frantisek Reznicek 2014-01-13 10:23:20 EST
I'm very sorry but I need to claim this one as not fixed for the moment.

Testing on a 3-node qpid HA cluster with a virtual IP gives the same results on RHEL6.5i with both -29 and -31.

I believe I am doing it right, although I'm not seeing exactly the same messages as those mentioned above.

Let me describe the way I'm reproducing it:

0] qpid HA cluster configured without authentication, with the following configuration:

[root@dl-216 i]# cat /etc/qpid/qpidd.conf
[root@dl-216 i]# cat /etc/cluster/cluster.conf 
<?xml version="1.0" ?>
<cluster config_version="1" name="dtests_ha">
  <clusternodes>
    <clusternode name="" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence_1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence_2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="fence_3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman>
    <multicast addr=""/>
  </cman>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="domain_qpidd_1" restricted="1">
        <failoverdomainnode name="" priority="1"/>
      </failoverdomain>
      <failoverdomain name="domain_qpidd_2" restricted="1">
        <failoverdomainnode name="" priority="1"/>
      </failoverdomain>
      <failoverdomain name="domain_qpidd_3" restricted="1">
        <failoverdomainnode name="" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <script file="/etc/init.d/qpidd" name="qpidd">
        <action depth="*" interval="5" name="status"/>
      </script>
      <script file="/etc/init.d/qpidd-primary" name="qpidd-primary">
        <action depth="*" interval="5" name="status"/>
      </script>
      <ip address="" monitor_link="1"/>
    </resources>
    <service domain="domain_qpidd_1" name="qpidd_1" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service domain="domain_qpidd_2" name="qpidd_2" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service domain="domain_qpidd_3" name="qpidd_3" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service autostart="1" exclusive="0" name="qpidd_primary" recovery="relocate">
      <script ref="qpidd-primary"/>
      <ip ref=""/>
    </service>
  </rm>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
  <fencedevices>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i0" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i1" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i2" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
  </fencedevices>
  <logging debug="on"/>
</cluster>

1] HA is up, including the primary broker and the RHCS virtual IP (set to
2] The HA transactional benchmark is run the following way:
   for i in `seq 200`; do
     ./qpid-cluster-benchmark -b -n1 -m 1000 -q6 -s3 -r3 \
         --send-arg=--tx --send-arg=10 --receive-arg=--tx --receive-arg=10
   done
   # Note: please double-check, as the switches quoted in the description above do not work as written
3] ./qpid-cluster-benchmark works roughly as expected.
   Monitoring the broker logs, I can see the following events on both -29 and -31:

  [root@dl-216 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:51 [Broker] info Inter-broker link disconnected from Closed by peer
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.]
  [root@d-37-165 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:53 [Broker] info Inter-broker link disconnected from Closed by peer
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.]
  [root@d-37-218 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:51 [Broker] info Inter-broker link disconnected from Connection refused
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:04:53 [Broker] info Inter-broker link disconnected from Closed by peer
  2014-01-13 16:05:27 [System] debug DISCONNECTED [qpid.[::1]:5672-[::1]:57452]
  2014-01-13 16:05:50 [HA] debug Primary transaction caf683b5: Rollback
  2014-01-13 16:05:50 [HA] debug Primary transaction fa3d360e: Rollback
  2014-01-13 16:05:51 [System] debug DISCONNECTED [qpid.]
  2014-01-13 16:05:51 [HA] debug Primary transaction 9799d1a7: Rollback
  2014-01-13 16:05:53 [HA] debug Primary transaction f4ab6065: Rollback
  2014-01-13 16:05:54 [HA] debug Primary transaction 5aa2196c: Rollback
  2014-01-13 16:05:56 [HA] debug Primary transaction 61af8bdb: Rollback
  2014-01-13 16:05:56 [System] debug DISCONNECTED [qpid.[::1]:5672-[::1]:57454]
  2014-01-13 16:05:57 [HA] debug Primary transaction 9a256e45: Rollback
  2014-01-13 16:06:00 [HA] debug Primary transaction b1ba97ef: Rollback
  [root@d-37-218 i]# grep -Ec 'Rollback' ~qpidd/qpidd.log
  # within 15 minutes

On the older version I can see slightly fewer rollbacks on the primary and slightly more disconnects on the backups. This was tested both with the legacy store and without stores, with the same results.

Although I'm not getting exactly the same messages as those listed in comment 0, the rollbacks I'm seeing contradict 'Expected results: No errors, no rollback messages.'


P.S. Feel free to correct my way of reproducing and/or disagree with my summary.
Comment 4 Alan Conway 2014-02-03 14:19:50 EST
The summary is quite accurate; apologies for leaving things in such a state. The following trunk commit should clean things up:

r1564010 | aconway | 2014-02-03 14:17:02 -0500 (Mon, 03 Feb 2014) | 36 lines

QPID-5528: HA Clean up error messages around rolled-back transactions.

A simple transaction test on a 3 node cluster generates a lot of errors and
rollback messages in the broker logs even though the test code never rolls back a
transaction. E.g.

  qpid-cluster-benchmark -b -n1 -m 1000 -q3 -s2 -r2 --send-arg=--tx --send-arg=10 --receive-arg=--tx --receive-arg=10

The errors are caused by queues being deleted while backup brokers are using
them. This happens a lot in the transaction test because a transactional session
must create a new transaction when the previous one closes. When the session
closes the open transaction is rolled back automatically. Thus there is almost
always an empty transaction that is created then immediately rolled back at the
end of the session. Backup brokers may still be in the process of subscribing to
the transaction's replication queue at this point, causing (harmless) errors.
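The lifecycle described in the paragraph above can be modeled in a few lines of plain Python (a standalone simulation; the class and method names are hypothetical, not the qpid-cpp API): every commit immediately opens a successor transaction, so closing the session always rolls back a trailing empty transaction, which backups may still be subscribing to.

```python
# Simulation of the transactional-session lifecycle described above.
# All names here are hypothetical illustrations, not the qpid-cpp API.

class TxSession:
    """A transactional session always has exactly one open transaction."""
    def __init__(self, log):
        self.log = log
        self.tx_ops = 0          # operations in the current transaction

    def send(self, msg):
        self.tx_ops += 1

    def commit(self):
        self.log.append(("commit", self.tx_ops))
        self.tx_ops = 0          # a new transaction starts immediately

    def close(self):
        # The still-open transaction is rolled back automatically.
        self.log.append(("rollback", self.tx_ops))

log = []
s = TxSession(log)
for batch in range(3):           # three committed batches of 10 messages
    for _ in range(10):
        s.send("m")
    s.commit()
s.close()                        # trailing empty transaction rolled back
```

Running the simulation yields three ("commit", 10) entries followed by one ("rollback", 0): the empty transaction that is created and immediately rolled back at session end, as described above.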

This commit takes the following steps to clean up the unwanted error and rollback messages:

HA TX messages cleaned up:
- Remove log messages about rolling back/destroying empty transactions.
- Remove misleading "backup disconnected" message for cancelled transactions.
- Remove spurious warning about ignored unreplicated dequeues.
- Include TxReplicator destroy in QueueReplicator mutex, idempotence check before destroy.

Allow HA to suppress/modify broker exception logging:
- Move broker exception logging into ErrorListener
- Every SessionHandler has DefaultErrorListener that does the same logging as before.
- Added SessionHandlerObserver to allow plugins to change the error listener.
- HA plugin set ErrorListeners to log harmless exceptions as HA debug messages.

Unrelated cleanup:
- Broker now logs "incoming execution exceptions" as debug messages rather than ignoring.
- Exception prefixes: don't add the prefix if already present.

The transaction test above should now pass without errors or rollback messages in the logs.
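The listener indirection described in the commit can be sketched as follows (a hypothetical Python simulation, not the actual qpid-cpp C++ classes; all names and error strings are illustrative): each session handler starts with a default listener that logs at error level, and the HA plugin swaps in one that downgrades known-harmless errors to debug messages.

```python
# Hypothetical sketch of the ErrorListener indirection, not the real qpid-cpp API.

class DefaultErrorListener:
    """Previous behaviour: every exception is logged at error level."""
    def on_error(self, exc, log):
        log.append(("error", str(exc)))

class HaErrorListener:
    """Installed by the HA plugin: known-harmless errors become debug messages."""
    HARMLESS = ("queue deleted",)    # illustrative pattern list
    def on_error(self, exc, log):
        level = "debug" if any(h in str(exc) for h in self.HARMLESS) else "error"
        log.append((level, str(exc)))

class SessionHandler:
    def __init__(self):
        self.listener = DefaultErrorListener()   # every handler gets a default
    def set_error_listener(self, listener):      # observer hook for plugins
        self.listener = listener
    def raise_exception(self, exc, log):
        self.listener.on_error(exc, log)

log = []
h = SessionHandler()
h.set_error_listener(HaErrorListener())          # HA plugin swaps the listener
h.raise_exception(Exception("queue deleted: qpid.ha-tx"), log)
h.raise_exception(Exception("connection refused"), log)
```

This mirrors the observer-style hook the commit describes: the logging policy lives behind an interface on the session handler, so a plugin can change it without touching the session code itself.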

Comment 5 Alan Conway 2014-02-03 17:38:05 EST
On Windows this also needs the following trunk commit:

r1564120 | aconway | 2014-02-03 17:37:29 -0500 (Mon, 03 Feb 2014) | 5 lines

QPID-5528: HA add missing QPID_BROKER_EXTERN declarations.

Missing from previous commit
 r1564010 | QPID-5528: HA Clean up error messages around rolled-back transactions.

Comment 7 Leonid Zhaldybin 2014-03-17 10:03:12 EDT
Tested on RHEL6.5, this issue is fixed.

Packages used for testing:

Comment 8 errata-xmlrpc 2014-09-24 11:08:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

