1024528 – QPID-5275 HA transactions failing in qpid-cluster-benchmark

Bug 1024528 - QPID-5275 HA transactions failing in qpid-cluster-benchmark

Summary: QPID-5275 HA transactions failing in qpid-cluster-benchmark

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	Development
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	3.0
Target Release:	---
Assignee:	Alan Conway
QA Contact:	Leonid Zhaldybin
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-10-29 20:56 UTC by Alan Conway
Modified:	2014-11-09 22:39 UTC (History)
CC List:	6 users (show)
Fixed In Version:	qpid-cpp-0.22-35
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-09-24 15:08:57 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Apache JIRA	QPID-5275	0	None	None	None	Never
Red Hat Product Errata	RHEA-2014:1296	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging 3.0 Release	2014-09-24 19:00:06 UTC

Description Alan Conway 2013-10-29 20:56:07 UTC

Description of problem:

Run the following test (from qpid/cpp/src/tests) against a 3 node cluster:

qpid-cluster-benchmark b 20.0.20.200 -n1 -m 1000 -q6 -s3 -r3 --send-arg=tx --send-arg=10 --receive-arg=-tx --receive-arg=10

Actual resullt: log file from the primary broker contains many errors like this:

Primary transaction 393340e4: Backup disconnected: e4886fe3@tcp:20.0.10.33:5672(ready)

The log file on the backup brokers contain many messages like this:

Backup of transaction 6db1a2e2: Rollback

Version-Release number of selected component (if applicable):
Qpid trunk r1536857
0.22-mrg
\

How reproducible: 100%


Steps to Reproduce:

Run the following test (from qpid/cpp/src/tests) against a 3 node cluster:

qpid-cluster-benchmark b 20.0.20.200 -n1 -m 1000 -q6 -s3 -r3 --send-arg=tx --send-arg=10 --receive-arg=-tx --receive-arg=10

Actual results: Log files contain errors and Rollback messages

Expected results: No errors, no rollback messages.

Additional info:

Comment 1 Alan Conway 2013-11-12 17:22:15 UTC

Fixed on branch:

http://git.app.eng.bos.redhat.com/rh-qpid.git/commit/?h=0.22-mrg-aconway-bz1024528

Comment 3 Frantisek Reznicek 2014-01-13 15:23:20 UTC

I'm very sorry but I need to claim this one as not fixed for the moment.

Testing on 3 node qpid HA cluster with virtual IP gives same results on RHEL6.5i with -29 and -31.

I believe I do it right although I'm not seeing exacly the same messages (mentioned above).

Let me describe the way I'm reproducing it:

0] qpid HA cluster configured without authentication with following configuration

[root@dl-216 i]# cat /etc/qpid/qpidd.conf
data-dir=/var/lib/qpidd
log-to-file=/var/lib/qpidd/qpidd.log
ha-brokers-url=192.168.6.60:5672,192.168.6.61:5672,192.168.6.62:5672
auth=no
acl-file=/etc/qpid/qpidd.acl
ha-public-url=192.168.6.251:5672
ha-cluster=yes
#log-enable=info+
log-enable=debug+
ha-replicate=all
port=5672
[root@dl-216 i]# cat /etc/cluster/cluster.conf 
<?xml version="1.0" ?>
<cluster config_version="1" name="dtests_ha">
  <clusternodes>
    <clusternode name="192.168.6.60" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence_1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="192.168.6.61" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence_2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="192.168.6.62" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="fence_3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman>
    <multicast addr="225.0.0.12"/>
  </cman>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="domain_qpidd_1" restricted="1">
        <failoverdomainnode name="192.168.6.60" priority="1"/>
      </failoverdomain>
      <failoverdomain name="domain_qpidd_2" restricted="1">
        <failoverdomainnode name="192.168.6.61" priority="1"/>
      </failoverdomain>
      <failoverdomain name="domain_qpidd_3" restricted="1">
        <failoverdomainnode name="192.168.6.62" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <script file="/etc/init.d/qpidd" name="qpidd">
        <action depth="*" interval="5" name="status"/>
      </script>
      <script file="/etc/init.d/qpidd-primary" name="qpidd-primary">
        <action depth="*" interval="5" name="status"/>
      </script>
      <ip address="192.168.6.251" monitor_link="1"/>
    </resources>
    <service domain="domain_qpidd_1" name="qpidd_1" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service domain="domain_qpidd_2" name="qpidd_2" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service domain="domain_qpidd_3" name="qpidd_3" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <service autostart="1" exclusive="0" name="qpidd_primary" recovery="relocate">
      <script ref="qpidd-primary"/>
      <ip ref="192.168.6.251"/>
    </service>
  </rm>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
  <fencedevices>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i0" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_1"/>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i1" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_2"/>
    <fencedevice action="reboot" agent="fence_xvm" domain="r6i2" ipport="1229" key_file="/etc/cluster/fence_xvm.key" name="fence_3"/>
  </fencedevices>
  <logging debug="on"/>
</cluster>

1] HA is up including primary broker and RHCS virtual ip (set to 192.168.6.251)
2] HA transactional benchmart is ran following way:
   for i in `seq 200`; do
     ./qpid-cluster-benchmark -b 192.168.6.251 -n1 -m 1000 -q6 -s3 -r3 
         --send-arg=--tx --send-arg=10 --receive-arg=--tx --receive-arg=10
   done
   # Note: Doublecheck please as original above switches do not work
3] ./qpid-cluster-benchmark works about as expected
   monitoring broker logs I can see on -29 as well as -31 following events:

  [root@dl-216 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:51 [Broker] info Inter-broker link disconnected from 192.168.6.60:5672 Closed by peer
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.60:50439]
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.60:50439]
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.60:50442]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.62:55092]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.62:55092]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.61:60145]
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.192.168.6.60:5672-192.168.6.61:60145]
  
  [root@d-37-165 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:53 [Broker] info Inter-broker link disconnected from 192.168.6.60:5672 Closed by peer
  2014-01-13 16:04:53 [System] debug DISCONNECTED [qpid.192.168.6.61:5672-192.168.6.61:33996]
  
  [root@d-37-218 i]# grep -iE 'disconnected|Rollback' ~qpidd/qpidd.log
  2014-01-13 16:04:51 [Broker] info Inter-broker link disconnected from 192.168.6.60:5672 Connection refused
  2014-01-13 16:04:51 [System] debug DISCONNECTED [qpid.192.168.6.62:5672-192.168.6.62:35763]
  2014-01-13 16:04:53 [Broker] info Inter-broker link disconnected from 192.168.6.60:5672 Closed by peer
  2014-01-13 16:05:27 [System] debug DISCONNECTED [qpid.[::1]:5672-[::1]:57452]
  2014-01-13 16:05:50 [HA] debug Primary transaction caf683b5: Rollback
  2014-01-13 16:05:50 [HA] debug Primary transaction fa3d360e: Rollback
  2014-01-13 16:05:51 [System] debug DISCONNECTED [qpid.192.168.6.251:5672-192.168.6.1:38922]
  2014-01-13 16:05:51 [HA] debug Primary transaction 9799d1a7: Rollback
  2014-01-13 16:05:53 [HA] debug Primary transaction f4ab6065: Rollback
  2014-01-13 16:05:54 [HA] debug Primary transaction 5aa2196c: Rollback
  2014-01-13 16:05:56 [HA] debug Primary transaction 61af8bdb: Rollback
  2014-01-13 16:05:56 [System] debug DISCONNECTED [qpid.[::1]:5672-[::1]:57454]
  2014-01-13 16:05:57 [HA] debug Primary transaction 9a256e45: Rollback
  ...
  2014-01-13 16:06:00 [HA] debug Primary transaction b1ba97ef: Rollback
  [root@d-37-218 i]# grep -Ec 'Rollback' ~qpidd/qpidd.log
  576
  # within 15 minutes


On older version I can see a bit less rollbacks on primary and a bit more disconnects on backups. It was tested with legacy and without stores, same results.

Although I'm not getting strictly exact messages as listed in comment 0, the rollbacks I'm seeing are in contrary with 'Expected results: No errors, no rollback messages.'


-> ASSIGNED

P.S. Feel free to correct my way of reproducing and/or disagree with my summary.

Comment 4 Alan Conway 2014-02-03 19:19:50 UTC

Summary is quite accurate, apologies for leaving things in such a state. The following trunk commit should clean things up:

------------------------------------------------------------------------
r1564010 | aconway | 2014-02-03 14:17:02 -0500 (Mon, 03 Feb 2014) | 36 lines

QPID-5528: HA Clean up error messages around rolled-back transactions.

A simple transaction test on a 3 node cluster generates a lot of errors and
rollback messages in the broker logs even though the test code never rolls back a
transaction. E.g.

qpid-cluster-benchmark -b 20.0.20.200 -n1 -m 1000 -q3 -s2 -r2 --send-arg=--tx --send-arg=10 --receive-arg=--tx --receive-arg=10

The errors are caused by queues being deleted while backup brokers are using
them. This happens a lot in the transaction test because a transactional session
must create a new transaction when the previous one closes. When the session
closes the open transaction is rolled back automatically. Thus there is almost
always an empty transaction that is created then immediately rolled back at the
end of the session. Backup brokers may still be in the process of subscribing to
the transaction's replication queue at this point, causing (harmlesss) errors.

This commit takes the following steps to clean up the unwanted error and rollback messages:

HA TX messages cleaned up:
- Remove log messages about rolling back/destroying empty transactions.
- Remove misleading "backup disconnected" message for cancelled transactions.
- Remove spurious warning about ignored unreplicated dequeues.
- Include TxReplicator destroy in QueueReplicator mutex, idempotence check before destroy.

Allow HA to suppress/modify broker exception logging:
- Move broker exception logging into ErrorListener
- Every SessionHandler has DefaultErrorListener that does the same logging as before.
- Added SessionHandlerObserver to allow plugins to change the error listener.
- HA plugin set ErrorListeners to log harmless exceptions as HA debug messages.

Unrelated cleanup:
- Broker now logs "incoming execution exceptions" as debug messages rather than ignoring.
- Exception prefixes: don't add the prefix if already present.

The exception test above should now pass without errors or rollback messages in the logs.

------------------------------------------------------------------------

Comment 5 Alan Conway 2014-02-03 22:38:05 UTC

On windows this also needs the following trunk commit:

------------------------------------------------------------------------
r1564120 | aconway | 2014-02-03 17:37:29 -0500 (Mon, 03 Feb 2014) | 5 lines

QPID-5528: HA add missing QPID_BROKER_EXTERN declarations.

Missing from previous commit
 r1564010 | QPID-5528: HA Clean up error messages around rolled-back transactions.

------------------------------------------------------------------------

Comment 7 Leonid Zhaldybin 2014-03-17 14:03:12 UTC

Tested on RHEL6.5, this issue is fixed.

Packages used for testing:
python-qpid-0.22-12.el6
python-qpid-qmf-0.22-28.el6
qpid-cpp-client-0.22-36
qpid-cpp-client-devel-0.22-36.el6
qpid-cpp-client-devel-docs-0.22-36.el6
qpid-cpp-client-rdma-0.22-36.el6
qpid-cpp-server-0.22-36.el6
qpid-cpp-server-devel-0.22-36.el6
qpid-cpp-server-ha-0.22-36.el6
qpid-cpp-server-linearstore-0.22-36.el6
qpid-cpp-server-rdma-0.22-36.el6
qpid-cpp-server-xml-0.22-36.el6
qpid-cpp-tar-0.22-23.el6
qpid-jca-0.22-2.el6
qpid-jca-xarecovery-0.22-2.el6
qpid-jca-zip-0.22-2.el6
qpid-proton-c-0.6-1.el6
qpid-qmf-0.22-28.el6
qpid-qmf-devel-0.22-28.el6
qpid-tests-0.22-14.el6
qpid-tools-0.22-9.el6
rh-qpid-cpp-tests-0.22-36.el6
ruby-qpid-qmf-0.22-28.el6

-> VERIFIED

Comment 8 errata-xmlrpc 2014-09-24 15:08:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1296.html

Note You need to log in before you can comment on or make changes to this bug.