Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 784890

Summary:	durable non-distributed transactions rarely trigger Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO
Product:	Red Hat Enterprise MRG	Reporter:	Frantisek Reznicek <freznice>
Component:	qpid-cpp	Assignee:	Kim van der Riet <kim.vdriet>
Status:	CLOSED DUPLICATE	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	2.1	CC:	esammons, jross, kim.vdriet
Target Milestone:	2.1.2
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-01-26 15:57:10 UTC	Type:	---

Description Frantisek Reznicek 2012-01-26 15:14:51 UTC

Description of problem:

A long-term cluster run with durable non-distributed transaction clients very occasionally causes
  internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)

which is logged on the broker as:
  2012-01-26 09:26:04 info LinkRegistry task late 1 times by 7448ms on average.
  2012-01-26 09:26:04 error Error preparing xid 7\x02\x00\x00\x00\x00\x00\x00\xF9\x06\x94\x90\x1D\xB7E\xD5\x81\xD8\xAE2\xB9\x8F\xDC\xD1: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
  2012-01-26 09:26:04 error Commit failed with exception: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
  2012-01-26 09:26:04 error Execution exception: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
  2012-01-26 09:26:04 critical cluster(192.168.6.4:29292 READY/error) local error 69712261 did not occur on member 192.168.6.2:10674: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
  2012-01-26 09:26:04 critical Error delivering frames: local error did not occur on all cluster members : internal-error: Commit failed (qpid/broker/SemanticState.cpp:157) (qpid/cluster/ErrorCheck.cpp:89)
  2012-01-26 09:26:04 notice cluster(192.168.6.4:29292 LEFT/error) leaving cluster mycluster
  2012-01-26 09:26:04 notice Shut down

This case was triggered on a 4-node RHEL 6.2 x86_64 VM cluster.

Version-Release number of selected component (if applicable):
  corosync-1.4.1-4.el6.x86_64
  corosynclib-1.4.1-4.el6.x86_64
  python-qpid-0.14-2.el6.noarch
  python-qpid-qmf-0.14-3.el6.x86_64
  python-saslwrapper-0.10-2.el6.x86_64
  qpid-cpp-client-0.14-1.el6.x86_64
  qpid-cpp-client-devel-0.14-1.el6.x86_64
  qpid-cpp-client-devel-docs-0.12-6.el6.noarch
  qpid-cpp-client-rdma-0.14-1.el6.x86_64
  qpid-cpp-client-ssl-0.14-1.el6.x86_64
  qpid-cpp-debuginfo-0.14-1.el6.x86_64
  qpid-cpp-server-0.14-1.el6.x86_64
  qpid-cpp-server-cluster-0.14-1.el6.x86_64
  qpid-cpp-server-devel-0.14-1.el6.x86_64
  qpid-cpp-server-rdma-0.14-1.el6.x86_64
  qpid-cpp-server-ssl-0.14-1.el6.x86_64
  qpid-cpp-server-store-0.14-1.el6.x86_64
  qpid-cpp-server-xml-0.14-1.el6.x86_64
  qpid-java-client-0.14-1.el6.noarch
  qpid-java-common-0.14-1.el6.noarch
  qpid-java-example-0.14-1.el6.noarch
  qpid-qmf-0.14-3.el6.x86_64
  qpid-qmf-debuginfo-0.14-3.el6.x86_64
  qpid-qmf-devel-0.14-3.el6.x86_64
  qpid-tests-0.14-1.el6.noarch
  qpid-tools-0.14-1.el6.noarch
  rh-qpid-cpp-tests-0.14-1.el6.x86_64
  ruby-qpid-qmf-0.14-3.el6.x86_64
  ruby-saslwrapper-0.10-2.el6.x86_64
  saslwrapper-0.10-2.el6.x86_64
  saslwrapper-debuginfo-0.10-2.el6.x86_64
  saslwrapper-devel-0.10-2.el6.x86_64
  sesame-1.0-2.el6.x86_64
  sesame-debuginfo-1.0-2.el6.x86_64


How reproducible:
<1%

Steps to Reproduce:
1. ./ctests.py --qmf-data-timeout=200 --cluster-maximize-uptime --log-to-file=120126_mrg-qe-17VMs_1.log   --selinux-state-force=0 --testset-loop-cnt=5 ...37.178 ...37.181 ...37.179 ...37.208 &>120126_mrg-qe-17VMs_1.tran.log
  
Actual results:
Cluster reduced to N-1 nodes due to "internal-error: Commit failed", caused by "Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO".

Expected results:
No Commit error, cluster width unaffected.

Additional info:

Comment 1 Frantisek Reznicek 2012-01-26 15:16:46 UTC

This issue was not detected on the RHEL 5.7 cluster, although it was stressed
more than the RHEL 6.2 cluster.

Comment 3 Kim van der Riet 2012-01-26 15:57:10 UTC

From code inspection, this error occurs when a transaction sync is waiting for all outstanding AIO events to return. This causes many calls to jctl::get_wr_events(), which in turn calls wmgr::get_events(). This is the location of the fix for BZ 768407, so I am reasonably confident this is another manifestation of that bug.

--> DUPLICATE

*** This bug has been marked as a duplicate of bug 768407 ***
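
Editorial illustration (not part of the original report): the log above shows io_getevents() failing with "Interrupted system call (-4)", which is EINTR returned in libaio's negative-errno convention. A common way to tolerate an interrupted call is to repeat it. The sketch below shows that general pattern only; it is not taken from the qpid store code and does not claim to reproduce the actual fix for BZ 768407. The names retry_on_eintr and FakeAio are hypothetical, and FakeAio merely stands in for a real io_getevents() call so that the example needs no AIO context.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper (not part of qpid or libaio): call `poll` again
// whenever it reports EINTR, up to `max_retries` additional attempts.
// `poll` is assumed to follow the libaio convention of returning the
// number of completed events, or a negative errno value on failure.
inline int retry_on_eintr(const std::function<int()>& poll, int max_retries = 16) {
    int ret = poll();
    for (int attempt = 0; ret == -EINTR && attempt < max_retries; ++attempt) {
        ret = poll();  // the call was interrupted by a signal; try again
    }
    return ret;  // event count, a non-EINTR error, or -EINTR if retries ran out
}

// Hypothetical stand-in for io_getevents(), used only for illustration:
// it reports -EINTR a fixed number of times, then returns an event count.
struct FakeAio {
    int interruptions_left;
    int events_ready;
    int calls;

    FakeAio(int interruptions, int events)
        : interruptions_left(interruptions), events_ready(events), calls(0) {}

    int get_events() {
        ++calls;
        if (interruptions_left > 0) {
            --interruptions_left;
            return -EINTR;
        }
        return events_ready;
    }
};
```

With this sketch, wrapping a FakeAio that is interrupted twice, for example retry_on_eintr([&aio] { return aio.get_events(); }), returns the event count on the third call instead of surfacing the interruption to the caller. The retry limit is an arbitrary choice made here so that a persistent failure still reaches the caller rather than looping forever.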