Bug 772166

Summary: clustered distributed transaction integrity problem - message loss when distributed transaction client killed during process

Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: Development
Status: CLOSED UPSTREAM
Severity: high
Priority: high
Reporter: Frantisek Reznicek <freznice>
Assignee: Kim van der Riet <kim.vdriet>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: gsim, jross
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2025-02-10 03:14:36 UTC
Description
Frantisek Reznicek
2012-01-06 09:01:38 UTC
Reproduced twice during the last week on a RHEL 5.7 x86_64 4-node cluster using packages:

cman-2.0.115-85.el5_7.2 cman-devel-2.0.115-85.el5_7.2 openais-0.80.6-30.el5_7.4 openais-debuginfo-0.80.6-30.el5_7.4 openais-devel-0.80.6-30.el5_7.4 python-qpid-0.14-1.el5 python-qpid-qmf-0.14-2.el5 python-saslwrapper-0.10-4.el5 qpid-cpp-client-0.14-4.el5 qpid-cpp-client-devel-0.14-4.el5 qpid-cpp-client-devel-docs-0.14-4.el5 qpid-cpp-client-rdma-0.14-4.el5 qpid-cpp-client-ssl-0.14-4.el5 qpid-cpp-mrg-debuginfo-0.14-4.el5 qpid-cpp-server-0.14-4.el5 qpid-cpp-server-cluster-0.14-4.el5 qpid-cpp-server-devel-0.14-4.el5 qpid-cpp-server-rdma-0.14-4.el5 qpid-cpp-server-ssl-0.14-4.el5 qpid-cpp-server-store-0.14-4.el5 qpid-cpp-server-xml-0.14-4.el5 qpid-java-client-0.14-1.el5 qpid-java-common-0.14-1.el5 qpid-java-example-0.14-1.el5 qpid-qmf-0.14-2.el5 qpid-qmf-debuginfo-0.14-2.el5 qpid-qmf-devel-0.14-2.el5 qpid-tests-0.14-1.el5 qpid-tools-0.14-1.el5 rgmanager-2.0.52-21.el5 rh-qpid-cpp-tests-0.14-4.el5 ruby-qpid-qmf-0.14-2.el5 ruby-saslwrapper-0.10-4.el5 saslwrapper-0.10-4.el5 saslwrapper-debuginfo-0.10-4.el5 saslwrapper-devel-0.10-4.el5 sesame-1.0-2.el5 sesame-debuginfo-1.0-2.el5

Raising reproducibility to 50%.

Details of failures: another data loss in test_cluster_dtx_phased_local_clikill9 after about 7 hours of cluster operation / tests:

./ctests.py --cluster-maximize-uptime --qmf-data-timeout=200 --testset-loop-cnt=5 --log-to-file=120126_mrg-qe-16VMs_2.log --selinux-state-force=0 10.34.37.124 10.34.37.125 10.34.37.127 10.34.37.128

[23:18:04] DEBUG:TEST 'test_cluster_dtx_phased_local_clikill9' loop 1/1 started. [tst_cnt:12451, err_cnt:0, dur.:24511.3 (6.8h)]
...
[23:20:09] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=19 --messages-per-tx=119 --tx-count=150 --total-messages=2437 --heartbeat 2 --durable yes --size 1024 ', result:-1, dur.:0.05
[23:20:12] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
[23:20:15] INFO:10.34.37.128:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=19 --messages-per-tx=119 --tx-count=150 --total-messages=2437 --heartbeat 2 --durable yes --size 1024 ', result:0, dur.:0.98
[23:20:16] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no ', result:0, dur.:0.25
[23:20:16] INFO:10.34.37.128:'P=qpid-txtest;ks=9;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.09
[23:20:16] DEBUG:stdout: .qpid-txtest (pid:11552) killed -9 (ecode:0)
[23:20:16] DEBUG:test_cluster_dtx_phased_local_clikill9(): client finished (0.1 sec[s])
[23:20:16] INFO:10.34.37.128:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no ', result:-1, dur.:0.07
[23:20:19] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
[23:20:22] ERROR:10.34.37.124:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no ', result:1, dur.:0.33
[23:20:22] DEBUG:stdout: Drained 8197 messages from tx-test-1 Drained 112 messages from tx-test-2 Drained 0 messages from tx-test-3 Drained 0 messages from tx-test-4 Drained 0 messages from tx-test-5 Drained 0 messages from tx-test-6 Drained 0 messages from tx-test-7 Drained 0 messages from tx-test-8 Drained 0 messages from tx-test-9 Drained 0 messages from tx-test-10 Drained 0 messages from tx-test-11 Drained 0 messages from tx-test-12 Drained 0 messages from tx-test-13 Drained 0 messages from tx-test-14 Drained 0 messages from tx-test-15 Drained 0 messages from tx-test-16 Drained 0 messages from tx-test-17 Drained 0 messages from tx-test-18 The following ids were missing: 'msg-113' ... 'msg-224'
[23:20:22] DEBUG:stderr:
[23:20:22] INFO:10.34.37.124:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:0, dur.:0.18
[23:20:23] INFO:10.34.37.125:'P=qpid-txtest;ks=9;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.21
[23:20:23] DEBUG:stdout: .....qpid-txtest (pid:4357) killed -9 (ecode:0)
[23:20:23] DEBUG:test_cluster_dtx_phased_local_clikill9(): client finished (0.0 sec[s])
[23:20:23] INFO:10.34.37.125:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:-1, dur.:0.06
[23:20:26] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
[23:20:28] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:0, dur.:0.31

The broker logs show no error/critical entries, only warnings:

2012-01-26 17:16:58 warning CLOSING [127.0.0.1:5672-127.0.0.1:46504] unsent data (probably due to client disconnect)
2012-01-26 17:23:52 warning CLOSING [10.34.37.124:5672-10.34.37.127:56953] unsent data (probably due to client disconnect)
2012-01-26 17:25:03 warning CLOSING [10.34.37.124:5672-10.34.37.125:38412] unsent data (probably due to client disconnect)
2012-01-26 17:26:39 warning CLOSING [10.34.37.124:5672-10.34.37.125:38430] unsent data (probably due to client disconnect)

There was another failure the day before, in test_cluster_dtx_phased_nonlocal_clikill15, with the same symptoms:

[16:59:58] INFO:10.34.37.124:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279 --tcp-nodelay --durable no --size 16 --broker=10.34.37.125', result:0, dur.:0.13
[16:59:58] INFO:10.34.37.125:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.14
[16:59:58] DEBUG:stdout: ..qpid-txtest (pid:25905) killed -15 (ecode:0)
[16:59:58] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.0 sec[s])
[16:59:58] INFO:10.34.37.125:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279 --tcp-nodelay --durable no --size 16 --broker=10.34.37.127', result:-1, dur.:0.08
[17:00:02] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
[17:00:04] ERROR:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279 --tcp-nodelay --durable no --size 16 --broker=10.34.37.128', result:1, dur.:0.25
[17:00:04] DEBUG:stdout: Drained 2106 messages from tx-test-1 Drained 0 messages from tx-test-2 Drained 0 messages from tx-test-3 Drained 0 messages from tx-test-4 Drained 0 messages from tx-test-5 Drained 0 messages from tx-test-6 Drained 0 messages from tx-test-7 Drained 0 messages from tx-test-8 Drained 0 messages from tx-test-9 Drained 0 messages from tx-test-10 Drained 0 messages from tx-test-11 Drained 0 messages from tx-test-12 Drained 0 messages from tx-test-13 Drained 0 messages from tx-test-14 The following ids were missing: 'msg-1' ... 'msg-99'
[17:00:04] DEBUG:stderr:

The last two failures prove that message loss can happen when the dtx client is killed with either SIGTERM or SIGKILL.

I will attempt to reproduce; however, in the interests of getting more information from your tests, can you run this test again with the following additional broker parameter:

--log-enable debug+:msgstore

This will print debug-level messages for the store only, and I am keen to see some of the recovery details this will provide in a failure case. Thanks.

The test was re-run. I'll keep you updated about the results.

I've launched 3 test runs, and I have to say that the original issue discussed here is inhibited by the requested '--log-enable debug+:msgstore' broker logging option. Continuing...

(In reply to comment #7)
> I've launched 3 test runs and I need to say that original issue discussed here
> is inhibited by requested broker's logging '--log-enable debug+:msgstore'
> option.

Just to be clear: this option is in addition to the log-enable option you are already using (it looks as though you are using info+), so there will be a total of two log-enable parameters to the broker:

--log-enable info+ --log-enable debug+:msgstore

This should keep all the broker logs at info+ level, with only the store printing debug-level logs. I was going to suggest that you keep this logging arrangement permanently for this test; I guess we'll see if we can get a hit first.

Created attachment 559578 [details]
Issue logs with requested log levels
I managed to reproduce the issue with the logging requested above; see the attached logs.
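For reference, the kill-loop one-liner embedded in the log excerpts above expands to roughly the following readable form. This is a sketch, not the harness's exact code: pgrep/pkill stand in for the original pidof/killall, the progress dots are dropped, and the process name and signal are parameters (defaulting to the values used in the tests).

```shell
#!/bin/bash
# Sketch of the client kill loop from the test logs: keep signalling every
# process matching the given name until none remain, then report the last
# PID killed; give up after ~10 seconds if nothing ever matched.
kill_loop() {
    local proc=${1:-qpid-txtest}   # process name (qpid-txtest in the logs)
    local sig=${2:-9}              # signal: 9 (SIGKILL) or 15 (SIGTERM) in the tests
    local start=$SECONDS killed=0 ecode= pids= prev=
    while true; do
        prev=$pids
        pids=$(pgrep -x "$proc" | xargs)   # space-joined PID list, if any
        if [ -n "$pids" ]; then
            pkill "-$sig" -x "$proc"; ecode=$?
            killed=$((killed + 1))
        elif [ "$killed" -gt 0 ]; then
            # at least one signal was sent and the process is now gone
            echo "$proc (pid:$prev) killed -$sig (ecode:$ecode)"
            return 0
        fi
        sleep 0.01                 # original used: usleep $RANDOM
        [ $((SECONDS - start)) -gt 10 ] && return 1
    done
}
```

The loop deliberately keeps signalling until `pgrep` reports no survivors, mirroring the original's behaviour of confirming the client really died mid-transaction rather than trusting a single `killall`.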
Unfortunately, part of the qpidd log (the part you are looking for) is missing, which I cannot explain at the moment; it looks like a logging bug.
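The logging arrangement requested in the comments above amounts to a broker invocation along these lines (a sketch; the log file path is illustrative):

```shell
# All broker subsystems stay at info+; only the message store is raised
# to debug+ so its recovery details appear in the log (path illustrative).
qpidd --log-enable info+ \
      --log-enable debug+:msgstore \
      --log-to-file /var/log/qpidd.log
```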
Reproduced again on a 2-node cluster running RHEL 5.8 i386+x86_64:

[13:14:38] INFO:10.34.27.69:'rpm -qa | grep -E '(qpid|qmf|sesame|saslwrapper|corosync|openais|cman|rgmanager)' | sort', result:0, dur.:3.59
[13:14:38] DEBUG:stdout: cman-2.0.115-96.el5_8.3 openais-0.80.6-36.el5_8.2 python-qpid-0.14-11.el5 python-qpid-qmf-0.14-14.el5 python-saslwrapper-0.10-4.el5 qpid-cpp-client-0.14-21.el5 qpid-cpp-client-devel-0.14-21.el5 qpid-cpp-client-devel-docs-0.14-21.el5 qpid-cpp-client-rdma-0.14-21.el5 qpid-cpp-client-ssl-0.14-21.el5 qpid-cpp-server-0.14-21.el5 qpid-cpp-server-cluster-0.14-21.el5 qpid-cpp-server-devel-0.14-21.el5 qpid-cpp-server-rdma-0.14-21.el5 qpid-cpp-server-ssl-0.14-21.el5 qpid-cpp-server-store-0.14-21.el5 qpid-cpp-server-xml-0.14-21.el5 qpid-java-client-0.18-1.el5 qpid-java-common-0.18-1.el5 qpid-java-example-0.18-1.el5 qpid-jca-0.18-1.el5 qpid-jca-xarecovery-0.18-1.el5 qpid-jca-zip-0.18-1.el5 qpid-qmf-0.14-14.el5 qpid-qmf-devel-0.14-14.el5 qpid-tests-0.14-1.el5 qpid-tools-0.14-5.el5 rgmanager-2.0.52-28.el5_8.3 rh-qpid-cpp-tests-0.14-21.el5 ruby-qpid-qmf-0.14-14.el5 ruby-saslwrapper-0.10-4.el5 saslwrapper-0.10-4.el5 saslwrapper-devel-0.10-4.el5 sesame-1.0-4.el5
[14:11:49] ERROR:10.34.27.69:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=19 --messages-per-tx=157 --tx-count=114 --total-messages=4722 --durable yes --size 1024 ', result:1, dur.:0.87
[14:11:49] DEBUG:stdout: Drained 4408 messages from tx-test-1 Drained 157 messages from tx-test-2 Drained 0 messages from tx-test-3 Drained 0 messages from tx-test-4 Drained 0 messages from tx-test-5 Drained 0 messages from tx-test-6 Drained 0 messages from tx-test-7 Drained 0 messages from tx-test-8 Drained 0 messages from tx-test-9 Drained 0 messages from tx-test-10 Drained 0 messages from tx-test-11 Drained 0 messages from tx-test-12 Drained 0 messages from tx-test-13 Drained 0 messages from tx-test-14 Drained 0 messages from tx-test-15 Drained 0 messages from tx-test-16 Drained 0 messages from tx-test-17 Drained 0 messages from tx-test-18 Drained 0 messages from tx-test-19 The following ids were missing: 'msg-158' 'msg-159' 'msg-160' 'msg-161' 'msg-162' ... 'msg-311' 'msg-312' 'msg-313' 'msg-314'
[14:11:49] DEBUG:stderr:

This won't be reproducible on 2.4/0.22, since it won't have the clustering and dtx implementation referenced here.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.