Bug 784890

Summary: durable non-distributed transactions rarely trigger Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO
Product: Red Hat Enterprise MRG
Reporter: Frantisek Reznicek <freznice>
Component: qpid-cpp
Assignee: Kim van der Riet <kim.vdriet>
Status: CLOSED DUPLICATE
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: unspecified
Version: 2.1
CC: esammons, jross, kim.vdriet
Target Milestone: 2.1.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-01-26 15:57:10 UTC

Description Frantisek Reznicek 2012-01-26 15:14:51 UTC
Description of problem:

A long-term cluster run with durable non-distributed transaction clients very occasionally causes
  internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)

reported in the broker log as:
  2012-01-26 09:26:04 info LinkRegistry task late 1 times by 7448ms on average.
  2012-01-26 09:26:04 error Error preparing xid 7\x02\x00\x00\x00\x00\x00\x00\xF9\x06\x94\x90\x1D\xB7E\xD5\x81\xD8\xAE2\xB9\x8F\xDC\xD1: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
  2012-01-26 09:26:04 error Commit failed with exception: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
  2012-01-26 09:26:04 error Execution exception: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
  2012-01-26 09:26:04 critical cluster(192.168.6.4:29292 READY/error) local error 69712261 did not occur on member 192.168.6.2:10674: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
  2012-01-26 09:26:04 critical Error delivering frames: local error did not occur on all cluster members : internal-error: Commit failed (qpid/broker/SemanticState.cpp:157) (qpid/cluster/ErrorCheck.cpp:89)
  2012-01-26 09:26:04 notice cluster(192.168.6.4:29292 LEFT/error) leaving cluster mycluster
  2012-01-26 09:26:04 notice Shut down

This case was triggered on a 4-node RHEL 6.2 x86_64 VM cluster.

Version-Release number of selected component (if applicable):
  corosync-1.4.1-4.el6.x86_64
  corosynclib-1.4.1-4.el6.x86_64
  python-qpid-0.14-2.el6.noarch
  python-qpid-qmf-0.14-3.el6.x86_64
  python-saslwrapper-0.10-2.el6.x86_64
  qpid-cpp-client-0.14-1.el6.x86_64
  qpid-cpp-client-devel-0.14-1.el6.x86_64
  qpid-cpp-client-devel-docs-0.12-6.el6.noarch
  qpid-cpp-client-rdma-0.14-1.el6.x86_64
  qpid-cpp-client-ssl-0.14-1.el6.x86_64
  qpid-cpp-debuginfo-0.14-1.el6.x86_64
  qpid-cpp-server-0.14-1.el6.x86_64
  qpid-cpp-server-cluster-0.14-1.el6.x86_64
  qpid-cpp-server-devel-0.14-1.el6.x86_64
  qpid-cpp-server-rdma-0.14-1.el6.x86_64
  qpid-cpp-server-ssl-0.14-1.el6.x86_64
  qpid-cpp-server-store-0.14-1.el6.x86_64
  qpid-cpp-server-xml-0.14-1.el6.x86_64
  qpid-java-client-0.14-1.el6.noarch
  qpid-java-common-0.14-1.el6.noarch
  qpid-java-example-0.14-1.el6.noarch
  qpid-qmf-0.14-3.el6.x86_64
  qpid-qmf-debuginfo-0.14-3.el6.x86_64
  qpid-qmf-devel-0.14-3.el6.x86_64
  qpid-tests-0.14-1.el6.noarch
  qpid-tools-0.14-1.el6.noarch
  rh-qpid-cpp-tests-0.14-1.el6.x86_64
  ruby-qpid-qmf-0.14-3.el6.x86_64
  ruby-saslwrapper-0.10-2.el6.x86_64
  saslwrapper-0.10-2.el6.x86_64
  saslwrapper-debuginfo-0.10-2.el6.x86_64
  saslwrapper-devel-0.10-2.el6.x86_64
  sesame-1.0-2.el6.x86_64
  sesame-debuginfo-1.0-2.el6.x86_64


How reproducible:
<1%

Steps to Reproduce:
1. ./ctests.py --qmf-data-timeout=200 --cluster-maximize-uptime --log-to-file=120126_mrg-qe-17VMs_1.log   --selinux-state-force=0 --testset-loop-cnt=5 ...37.178 ...37.181 ...37.179 ...37.208 &>120126_mrg-qe-17VMs_1.tran.log
  
Actual results:
Cluster reduced to N-1 nodes due to internal-error: Commit failed, caused by Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO

Expected results:
No Commit error, cluster width unaffected.

Additional info:

Comment 1 Frantisek Reznicek 2012-01-26 15:16:46 UTC
This issue was not detected on the RHEL 5.7 cluster, although it was stressed more than the RHEL 6.2 cluster.

Comment 3 Kim van der Riet 2012-01-26 15:57:10 UTC
From code inspection, this error occurs when a transaction sync is waiting for all outstanding AIO events to return. The wait makes many calls to jctl::get_wr_events(), which in turn calls wmgr::get_events(). This is the location of the fix for BZ 768407, so I am reasonably confident this is another manifestation of that bug.

--> DUPLICATE

*** This bug has been marked as a duplicate of bug 768407 ***
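
For illustration only: the broker log above shows io_getevents() failing with "Interrupted system call (-4)", i.e. -EINTR, which means a signal arrived during the wait and not that the I/O itself failed. The sketch below shows the general retry-on-EINTR pattern for a call that uses the libaio convention of returning a negative errno. It is not the actual fix for BZ 768407, which is not shown in this report. The names retry_on_eintr and FakeGetEvents are hypothetical and do not come from the qpid store code.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper, not taken from the qpid store code.
// `call` is assumed to follow the libaio convention: a non-negative
// event count on success, a negative errno value on failure.
int retry_on_eintr(const std::function<int()>& call, int max_retries) {
    assert(max_retries >= 0);
    int ret = call();
    for (int attempt = 0; ret == -EINTR && attempt < max_retries; ++attempt) {
        ret = call();  // interrupted by a signal, so simply ask again
    }
    // Event count, -EINTR if the retries ran out, or another -errno.
    return ret;
}

// Stand-in for io_getevents(): reports -EINTR `interruptions` times,
// then returns `result` on every later call.
struct FakeGetEvents {
    int interruptions;
    int result;
    int calls;

    FakeGetEvents(int interruptions_, int result_)
        : interruptions(interruptions_), result(result_), calls(0) {}

    int operator()() {
        ++calls;
        if (interruptions > 0) {
            --interruptions;
            return -EINTR;
        }
        return result;
    }
};
```

With a stand-in such as FakeGetEvents fake(2, 3), calling retry_on_eintr(std::ref(fake), 5) retries past the two interruptions and returns the event count. In real code the callable would wrap the io_getevents() call itself. Errors other than -EINTR are passed straight back to the caller, so genuine AIO failures are still reported.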