Description of problem:
A long-term cluster run with durable non-distributed transaction clients very occasionally causes "internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)", signaled on the broker as:

2012-01-26 09:26:04 info LinkRegistry task late 1 times by 7448ms on average.
2012-01-26 09:26:04 error Error preparing xid 7\x02\x00\x00\x00\x00\x00\x00\xF9\x06\x94\x90\x1D\xB7E\xD5\x81\xD8\xAE2\xB9\x8F\xDC\xD1: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
2012-01-26 09:26:04 error Commit failed with exception: Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO: AIO error. (io_getevents() failed: Interrupted system call (-4)) (TxnCtxt.cpp:90)
2012-01-26 09:26:04 error Execution exception: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
2012-01-26 09:26:04 critical cluster(192.168.6.4:29292 READY/error) local error 69712261 did not occur on member 192.168.6.2:10674: internal-error: Commit failed (qpid/broker/SemanticState.cpp:157)
2012-01-26 09:26:04 critical Error delivering frames: local error did not occur on all cluster members : internal-error: Commit failed (qpid/broker/SemanticState.cpp:157) (qpid/cluster/ErrorCheck.cpp:89)
2012-01-26 09:26:04 notice cluster(192.168.6.4:29292 LEFT/error) leaving cluster mycluster
2012-01-26 09:26:04 notice Shut down

This case was triggered on a RHEL 6.2 x86_64 4-node VM cluster.
Version-Release number of selected component (if applicable):
corosync-1.4.1-4.el6.x86_64
corosynclib-1.4.1-4.el6.x86_64
python-qpid-0.14-2.el6.noarch
python-qpid-qmf-0.14-3.el6.x86_64
python-saslwrapper-0.10-2.el6.x86_64
qpid-cpp-client-0.14-1.el6.x86_64
qpid-cpp-client-devel-0.14-1.el6.x86_64
qpid-cpp-client-devel-docs-0.12-6.el6.noarch
qpid-cpp-client-rdma-0.14-1.el6.x86_64
qpid-cpp-client-ssl-0.14-1.el6.x86_64
qpid-cpp-debuginfo-0.14-1.el6.x86_64
qpid-cpp-server-0.14-1.el6.x86_64
qpid-cpp-server-cluster-0.14-1.el6.x86_64
qpid-cpp-server-devel-0.14-1.el6.x86_64
qpid-cpp-server-rdma-0.14-1.el6.x86_64
qpid-cpp-server-ssl-0.14-1.el6.x86_64
qpid-cpp-server-store-0.14-1.el6.x86_64
qpid-cpp-server-xml-0.14-1.el6.x86_64
qpid-java-client-0.14-1.el6.noarch
qpid-java-common-0.14-1.el6.noarch
qpid-java-example-0.14-1.el6.noarch
qpid-qmf-0.14-3.el6.x86_64
qpid-qmf-debuginfo-0.14-3.el6.x86_64
qpid-qmf-devel-0.14-3.el6.x86_64
qpid-tests-0.14-1.el6.noarch
qpid-tools-0.14-1.el6.noarch
rh-qpid-cpp-tests-0.14-1.el6.x86_64
ruby-qpid-qmf-0.14-3.el6.x86_64
ruby-saslwrapper-0.10-2.el6.x86_64
saslwrapper-0.10-2.el6.x86_64
saslwrapper-debuginfo-0.10-2.el6.x86_64
saslwrapper-devel-0.10-2.el6.x86_64
sesame-1.0-2.el6.x86_64
sesame-debuginfo-1.0-2.el6.x86_64

How reproducible:
<1%

Steps to Reproduce:
1. ./ctests.py --qmf-data-timeout=200 --cluster-maximize-uptime --log-to-file=120126_mrg-qe-17VMs_1.log --selinux-state-force=0 --testset-loop-cnt=5 ...37.178 ...37.181 ...37.179 ...37.208 &>120126_mrg-qe-17VMs_1.tran.log

Actual results:
Cluster reduced to N-1 members due to "internal-error: Commit failed", caused by "Error during txn sync: jexception 0x0103 wmgr::get_events() threw JERR__AIO".

Expected results:
No Commit error, cluster width unaffected.

Additional info:
This issue was not detected on a RHEL 5.7 cluster, although that cluster was stressed more than the RHEL 6.2 cluster.
From code inspection, this error occurs when a transaction sync is waiting for all outstanding AIO events to return. The wait causes many calls to jctl::get_wr_events(), which in turn passes each call on to wmgr::get_events(). This is the location of the fix for BZ 768407, so I am reasonably confident this is another manifestation of that bug. --> DUPLICATE

*** This bug has been marked as a duplicate of bug 768407 ***
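
For readers unfamiliar with the failure mode in the log above: "Interrupted system call (-4)" is EINTR, reported in the negative-errno style that libaio's io_getevents() uses. A common way to tolerate EINTR is to retry the call instead of treating it as fatal. The sketch below only illustrates that general pattern and is not the actual change made for BZ 768407; retry_on_eintr and FlakyEventSource are hypothetical names that do not exist in qpid or its journal code, and FlakyEventSource merely stands in for a call such as io_getevents() so the example needs no AIO setup.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper, not part of qpid: calls `op` again while it reports
// -EINTR, giving up after `max_retries` repeated attempts.
// Assumption: `op` returns a negative errno value on failure, as libaio's
// io_getevents() does (EINTR is 4 on Linux, hence the -4 in the log).
inline int retry_on_eintr(const std::function<int()>& op, int max_retries = 16)
{
    int ret = op();
    for (int attempt = 0; ret == -EINTR && attempt < max_retries; ++attempt) {
        ret = op();  // interrupted by a signal: try the same call again
    }
    return ret;      // event count, a different error, or -EINTR if retries ran out
}

// Hypothetical stand-in for an event-polling call. It fails with -EINTR a
// fixed number of times, then reports `events_ready` completed events.
struct FlakyEventSource {
    int interruptions_left;
    int events_ready;
    int calls;

    FlakyEventSource(int interruptions, int events)
        : interruptions_left(interruptions), events_ready(events), calls(0) {}

    int poll()
    {
        ++calls;
        if (interruptions_left > 0) {
            --interruptions_left;
            return -EINTR;
        }
        return events_ready;
    }
};
```

With a wrapper of this kind, an interrupted poll would be repeated rather than raised as JERR__AIO, so a single signal arriving during a transaction sync would not fail the commit. The retry cap is a design choice in this sketch to avoid spinning forever; whether the real journal code bounds its retries is not stated in this report.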