Bug 772166 - clustered distributed transaction integrity problem - message loss when distribution transaction client killed during process
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kim van der Riet
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-01-06 09:01 UTC by Frantisek Reznicek
Modified: 2020-11-04 19:14 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments
Issue logs with requested log levels (333.12 KB, application/x-tbz)
2012-02-06 09:33 UTC, Frantisek Reznicek

Description Frantisek Reznicek 2012-01-06 09:01:38 UTC
Description of problem:

A distributed transaction test run in a 4-node cluster configuration very rarely fails. The test scenario is as follows:

  [03:27:58] INFO:dhcp-37-125 :'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-124 ', result:0, dur.:0.203347
  [03:27:58] INFO:dhcp-37-127 :'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-128 ' STARTED on background (*)
  [03:27:58] INFO:dhcp-37-127 :'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.089416
  [03:27:58] DEBUG:stdout:
    .qpid-txtest (pid:25691) killed -15 (ecode:0)
  [03:27:58] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.1 sec[s])
  [03:27:58] INFO:dhcp-37-127 :'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-128 ', result:-1, dur.:0.104056 (*)
  [03:28:01] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1
  [03:28:04] ERROR:dhcp-37-128 :'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-127 ', result:1, dur.:0.461266
  [03:28:04] DEBUG:stdout:
    Drained 3255 messages from tx-test-1
    Drained 148 messages from tx-test-2
    Drained 0 messages from tx-test-3
    Drained 0 messages from tx-test-4
    Drained 0 messages from tx-test-5
    Drained 0 messages from tx-test-6
    Drained 0 messages from tx-test-7
    Drained 0 messages from tx-test-8
    Drained 0 messages from tx-test-9
    Drained 0 messages from tx-test-10
    Drained 0 messages from tx-test-11
    Drained 0 messages from tx-test-12
    Drained 0 messages from tx-test-13
    Drained 0 messages from tx-test-14
    Drained 0 messages from tx-test-15
    Drained 0 messages from tx-test-16
    Drained 0 messages from tx-test-17
    The following ids were missing:
        'msg-1'
        'msg-10'
        'msg-100'
        'msg-101'
        ... (141 lines cut)
        'msg-97'
        'msg-98'
        'msg-99'
  [03:28:04] DEBUG:stderr:

The testing scenario uses qpid-txtest in durable dtx mode; the test steps are detailed below in Steps to Reproduce.

The test scenario above expects that if a client using transactions crashes, all messages should still be present (the client only circulates messages among a few queues).
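
The conservation invariant that the check phase enforces can be sketched as a small filter over the check-phase stdout. This is a hypothetical helper (check_totals is not part of qpid-txtest); the "Drained N messages from ..." line format is taken from the logs in this report:

```shell
#!/bin/sh
# Sketch: verify the dtx conservation invariant from qpid-txtest check output.
# Sums the "Drained N messages from ..." counts and compares the total with
# the expected --total-messages value.
check_totals() {
    expected=$1
    drained=$(awk '/Drained [0-9]+ messages from/ { sum += $2 } END { print sum + 0 }')
    if [ "$drained" -eq "$expected" ]; then
        echo "OK: drained $drained of $expected"
    else
        echo "FAIL: drained $drained of $expected (messages lost)"
        return 1
    fi
}
```

For the failing run in this report, piping the check output through "check_totals 3551" would report FAIL, since only 3255 + 148 = 3403 messages were drained.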

None of the broker logs show any problem (error/critical); see details below.
The issue was seen on a 4-node RHEL 5.7 x86_64 cluster.

Version-Release number of selected component (if applicable):
  cman-2.0.115-85.el5_7.2
  cman-devel-2.0.115-85.el5_7.2
  cman-devel-2.0.115-85.el5_7.2
  openais-0.80.6-30.el5_7.4
  openais-debuginfo-0.80.6-30.el5_7.4
  openais-devel-0.80.6-30.el5_7.4
  python-qpid-0.14-1.el5
  python-qpid-qmf-0.14-2.el5
  python-saslwrapper-0.10-4.el5
  qpid-cpp-client-0.14-3.el5
  qpid-cpp-client-devel-0.14-3.el5
  qpid-cpp-client-devel-docs-0.14-3.el5
  qpid-cpp-client-rdma-0.14-3.el5
  qpid-cpp-client-ssl-0.14-3.el5
  qpid-cpp-mrg-debuginfo-0.14-3.el5
  qpid-cpp-server-0.14-3.el5
  qpid-cpp-server-cluster-0.14-3.el5
  qpid-cpp-server-devel-0.14-3.el5
  qpid-cpp-server-rdma-0.14-3.el5
  qpid-cpp-server-ssl-0.14-3.el5
  qpid-cpp-server-store-0.14-3.el5
  qpid-cpp-server-xml-0.14-3.el5
  qpid-java-client-0.14-1.el5
  qpid-java-common-0.14-1.el5
  qpid-java-example-0.14-1.el5
  qpid-qmf-0.14-2.el5
  qpid-qmf-debuginfo-0.14-2.el5
  qpid-qmf-devel-0.14-2.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-1.el5
  rgmanager-2.0.52-21.el5
  rh-qpid-cpp-tests-0.14-3.el5
  ruby-qpid-qmf-0.14-2.el5
  ruby-saslwrapper-0.10-4.el5
  saslwrapper-0.10-4.el5
  saslwrapper-debuginfo-0.10-4.el5
  saslwrapper-devel-0.10-4.el5
  sesame-1.0-2.el5
  sesame-debuginfo-1.0-2.el5


How reproducible:
<5%

Steps to Reproduce:
original run
./ctests.py --cluster-maximize-uptime  --qmf-data-timeout=200
            --testset-loop-cnt=10      --log-to-file=a.log
            --selinux-state-force=0
            <host-A> <host-B> <host-C> <host-D>

or more focused on transactions only:
./ctests.py --cluster-maximize-uptime --qmf-data-timeout=200
            --testset-loop-cnt=10     --log-to-file=a.log
            --selinux-state-force=0
            --exec-tests=testset_cluster_tx,testset_cluster_dtx
            <host-A> <host-B> <host-C> <host-D>

or manually by looping:
1. run on foreground - init
   host-B:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=host-A'

2. run on background - start transfer
   host-C:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=host-D'

3. run on foreground - i.e. kill the client launched in step 2.
   sweep ks=(15|9)
   host-C:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P'
4. run on foreground - evaluation
   host-D:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=host-C'
5. goto 1
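
The one-liner in step 3 is dense, so here is a readable, functionally equivalent sketch. Assumptions: pgrep/pkill stand in for the original pidof/killall, and "sleep 0.01" replaces RHEL 5's "usleep $RANDOM"; the logic is otherwise the same (repeatedly signal the process until it is gone, or give up after a timeout):

```shell
#!/bin/sh
# Readable sketch of the step-3 kill loop: signal every live instance of
# $proc until none remain, then report; give up after $timeout seconds.
kill_until_gone() {
    proc=$1; sig=${2:-15}; timeout=${3:-10}
    start=$(date +%s); killed=0; ecode=0; pid=""; prev=""
    while :; do
        prev=$pid
        pid=$(pgrep -x "$proc")
        if [ -n "$pid" ]; then
            pkill "-$sig" -x "$proc"; ecode=$?
            killed=$((killed + 1))
        elif [ "$killed" -gt 0 ]; then
            # the process was signalled and is now gone: report and stop
            echo "$proc (pid:$prev) killed -$sig (ecode:$ecode)"
            return 0
        fi
        sleep 0.01
        [ $(( $(date +%s) - start )) -gt "$timeout" ] && return 1
    done
}
```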

Actual results:
  Message loss observed when dtx durable client killed in the middle of operation.

Expected results:
  No message loss when dtx durable client killed in the middle of operation.
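
The manual loop in Steps to Reproduce can be wrapped in a small driver script. This is only a sketch: the host names, ssh access, and the reduction of step 3 to a single killall are all assumptions (RUN defaults to ssh but can be overridden, e.g. RUN=echo for a dry run):

```shell
#!/bin/sh
# Driver sketch for steps 1-5 of the manual reproduction loop.
RUN=${RUN:-ssh}
OPTS="--dtx yes --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3 --durable yes"

one_iteration() {
    # step 1: init the queues
    $RUN host-B "qpid-txtest $OPTS --init 1 --transfer 0 --check 0 --broker=host-A" || return 1
    # step 2: start the transfer in the background
    $RUN host-C "qpid-txtest $OPTS --init 0 --transfer 1 --check 0 --broker=host-D" &
    # step 3 (simplified): kill the transferring client mid-flight
    $RUN host-C "killall -15 qpid-txtest"
    wait
    # step 4: check; a nonzero exit here means messages went missing
    $RUN host-D "qpid-txtest $OPTS --init 0 --transfer 0 --check 1 --broker=host-C"
}
```

Step 5 ("goto 1") is then just calling one_iteration in a loop until it fails.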

Additional info:

machine IDs & local time
  [16:50:20] INFO:dhcp-37-124:'date ; uname -a', result:0, dur.:0.0439711
  [16:50:20] DEBUG:stdout:
    Thu Jan  5 10:50:20 EST 2012
    Linux dhcp-37-124 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
  [16:50:20] INFO:dhcp-37-125:'date ; uname -a', result:0, dur.:0.0451019
  [16:50:20] DEBUG:stdout:
    Thu Jan  5 10:50:20 EST 2012
    Linux dhcp-37-125 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
  [16:50:21] INFO:dhcp-37-127:'date ; uname -a', result:0, dur.:0.047008
  [16:50:21] DEBUG:stdout:
    Thu Jan  5 10:50:21 EST 2012
    Linux dhcp-37-127 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
  [16:50:21] INFO:dhcp-37-128:'date ; uname -a', result:0, dur.:0.0472062
  [16:50:21] DEBUG:stdout:
    Thu Jan  5 10:50:21 EST 2012
    Linux dhcp-37-128 2.6.18-274.12.1.el5 #1 SMP Tue Nov 8 21:37:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
  
  
passing testing scenario:
  
  [03:27:51] INFO:dhcp-37-127:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=17 --messages-per-tx=158 --tx-count=115 --total-messages=7135 --heartbeat 3  --durable no --size 1024 --broker=dhcp-37-128', result:0, dur.:0.757424
  [03:27:52] INFO:dhcp-37-128:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.125782
  [03:27:52] DEBUG:stdout:
    .qpid-txtest (pid:8407) killed -15 (ecode:0)
  [03:27:52] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.0 sec[s])
  [03:27:52] INFO:dhcp-37-128:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=17 --messages-per-tx=158 --tx-count=115 --total-messages=7135 --heartbeat 3  --durable no --size 1024 --broker=dhcp-37-124', result:-1, dur.:0.0784519
  [03:27:55] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1
  [03:27:57] INFO:dhcp-37-124:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=17 --messages-per-tx=158 --tx-count=115 --total-messages=7135 --heartbeat 3  --durable no --size 1024 --broker=dhcp-37-125', result:0, dur.:0.592289
  
failing testing scenario:
  
  [03:27:58] INFO:dhcp-37-125:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-124', result:0, dur.:0.203347
  [03:27:58] INFO:dhcp-37-127:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.089416
  [03:27:58] DEBUG:stdout:
    .qpid-txtest (pid:25691) killed -15 (ecode:0)
  [03:27:58] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.1 sec[s])
  [03:27:58] INFO:dhcp-37-127:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-128', result:-1, dur.:0.104056
  [03:28:01] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1
  [03:28:04] ERROR:dhcp-37-128:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=17 --messages-per-tx=148 --tx-count=100 --total-messages=3551 --heartbeat 3  --durable yes  --broker=dhcp-37-127', result:1, dur.:0.461266
  [03:28:04] DEBUG:stdout:
    Drained 3255 messages from tx-test-1
    Drained 148 messages from tx-test-2
    Drained 0 messages from tx-test-3
    Drained 0 messages from tx-test-4
    Drained 0 messages from tx-test-5
    Drained 0 messages from tx-test-6
    Drained 0 messages from tx-test-7
    Drained 0 messages from tx-test-8
    Drained 0 messages from tx-test-9
    Drained 0 messages from tx-test-10
    Drained 0 messages from tx-test-11
    Drained 0 messages from tx-test-12
    Drained 0 messages from tx-test-13
    Drained 0 messages from tx-test-14
    Drained 0 messages from tx-test-15
    Drained 0 messages from tx-test-16
    Drained 0 messages from tx-test-17
    The following ids were missing:
        'msg-1'
        'msg-10'
        'msg-100'
        'msg-101'
        'msg-102'
        'msg-103'
        'msg-104'
        'msg-105'
        'msg-106'
        'msg-107'
        'msg-108'
        'msg-109'
        'msg-11'
        'msg-110'
        'msg-111'
        'msg-112'
        'msg-113'
        'msg-114'
        'msg-115'
        'msg-116'
        'msg-117'
        'msg-118'
        'msg-119'
        'msg-12'
        'msg-120'
        'msg-121'
        'msg-122'
        'msg-123'
        'msg-124'
        'msg-125'
        'msg-126'
        'msg-127'
        'msg-128'
        'msg-129'
        'msg-13'
        'msg-130'
        'msg-131'
        'msg-132'
        'msg-133'
        'msg-134'
        'msg-135'
        'msg-136'
        'msg-137'
        'msg-138'
        'msg-139'
        'msg-14'
        'msg-140'
        'msg-141'
        'msg-142'
        'msg-143'
        'msg-144'
        'msg-145'
        'msg-146'
        'msg-147'
        'msg-148'
        'msg-15'
        'msg-16'
        'msg-17'
        'msg-18'
        'msg-19'
        'msg-2'
        'msg-20'
        'msg-21'
        'msg-22'
        'msg-23'
        'msg-24'
        'msg-25'
        'msg-26'
        'msg-27'
        'msg-28'
        'msg-29'
        'msg-3'
        'msg-30'
        'msg-31'
        'msg-32'
        'msg-33'
        'msg-34'
        'msg-35'
        'msg-36'
        'msg-37'
        'msg-38'
        'msg-39'
        'msg-4'
        'msg-40'
        'msg-41'
        'msg-42'
        'msg-43'
        'msg-44'
        'msg-45'
        'msg-46'
        'msg-47'
        'msg-48'
        'msg-49'
        'msg-5'
        'msg-50'
        'msg-51'
        'msg-52'
        'msg-53'
        'msg-54'
        'msg-55'
        'msg-56'
        'msg-57'
        'msg-58'
        'msg-59'
        'msg-6'
        'msg-60'
        'msg-61'
        'msg-62'
        'msg-63'
        'msg-64'
        'msg-65'
        'msg-66'
        'msg-67'
        'msg-68'
        'msg-69'
        'msg-7'
        'msg-70'
        'msg-71'
        'msg-72'
        'msg-73'
        'msg-74'
        'msg-75'
        'msg-76'
        'msg-77'
        'msg-78'
        'msg-79'
        'msg-8'
        'msg-80'
        'msg-81'
        'msg-82'
        'msg-83'
        'msg-84'
        'msg-85'
        'msg-86'
        'msg-87'
        'msg-88'
        'msg-89'
        'msg-9'
        'msg-90'
        'msg-91'
        'msg-92'
        'msg-93'
        'msg-94'
        'msg-95'
        'msg-96'
        'msg-97'
        'msg-98'
        'msg-99'
  [03:28:04] DEBUG:stderr:
  
passing testing scenario:
  
  [03:28:04] INFO:dhcp-37-125:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=10 --messages-per-tx=181 --tx-count=112 --total-messages=4338 --heartbeat 3  --durable yes --size 16 --broker=dhcp-37-124', result:0, dur.:0.177393
  [03:28:05] INFO:dhcp-37-127:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.0975151
  [03:28:05] DEBUG:stdout:
    .qpid-txtest (pid:25789) killed -15 (ecode:0)
  [03:28:05] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.0 sec[s])
  [03:28:05] INFO:dhcp-37-127:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=10 --messages-per-tx=181 --tx-count=112 --total-messages=4338 --heartbeat 3  --durable yes --size 16 --broker=dhcp-37-125', result:-1, dur.:0.059232
  [03:28:08] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1
  [03:28:10] INFO:dhcp-37-128:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=10 --messages-per-tx=181 --tx-count=112 --total-messages=4338 --heartbeat 3  --durable yes --size 16 --broker=dhcp-37-127', result:0, dur.:0.208456
  
  
broker logs:
  
  [freznice@dhcp-26-251 cluster_test]$ for i in $(ls 120105_mrg-qe-16VMs_1.log._.dhcp-*.qpidd.log); do echo ${i} ; grep " 20:2" ${i} | awk '{print "  " $0}'; done
  120105_mrg-qe-16VMs_1.log._.dhcp-37-124.qpidd.log
    2012-01-05 20:21:53 warning CLOSING [10.34.37.124:5672-10.34.37.127:36312] unsent data (probably due to client disconnect)
    2012-01-05 20:21:53 info JournalInactive:test_cluster_toggle_load_w_sesame-00 task late and overran 1 times: late 4ms, overran 3ms (taking 7000ns) on average.
    2012-01-05 20:22:30 warning CLOSING [10.34.37.124:5672-10.34.37.128:32991] unsent data (probably due to client disconnect)
    2012-01-05 20:24:29 info JournalInactive:test_cluster_toggle_load_w_sesame-00 task late and overran 1 times: late 5ms, overran 3ms (taking 10000ns) on average.
    2012-01-05 20:26:12 warning CLOSING [10.34.37.124:5672-10.34.37.128:33041] unsent data (probably due to client disconnect)
    2012-01-05 20:27:00 info JournalInactive:test_cluster_toggle_load_w_sesame-00 task late and overran 1 times: late 7ms, overran 5ms (taking 45000ns) on average.
    2012-01-05 20:27:31 info JournalInactive:test_cluster_toggle_load_w_sesame-00 task late and overran 1 times: late 4ms, overran 3ms (taking 8000ns) on average.
    2012-01-05 20:28:32 info JournalInactive:tx-test-19 task late and overran 1 times: late 2ms, overran 2ms (taking 8000ns) on average.
  120105_mrg-qe-16VMs_1.log._.dhcp-37-125.qpidd.log
    2012-01-05 20:20:03 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 5 times: late 2ms, overran 2ms (taking 7400ns) on average.
    2012-01-05 20:20:20 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 3ms, overran 2ms (taking 8250ns) on average.
    2012-01-05 20:20:48 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 6000ns) on average.
    2012-01-05 20:21:04 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 2ms, overran 2ms (taking 8000ns) on average.
    2012-01-05 20:21:23 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 6000ns) on average.
    2012-01-05 20:21:34 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 2 times: late 2ms, overran 2ms (taking 10000ns) on average.
    2012-01-05 20:22:05 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 10000ns) on average.
    2012-01-05 20:22:20 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 2ms, overran 2ms (taking 8500ns) on average.
    2012-01-05 20:22:36 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 2 times: late 2ms, overran 2ms (taking 10500ns) on average.
    2012-01-05 20:22:48 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 7666ns) on average.
    2012-01-05 20:23:07 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 9666ns) on average.
    2012-01-05 20:23:23 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 7000ns) on average.
    2012-01-05 20:23:25 warning CLOSING [10.34.37.125:5672-10.34.37.127:39811] unsent data (probably due to client disconnect)
    2012-01-05 20:23:36 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 11333ns) on average.
    2012-01-05 20:23:51 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 6 times: late 2ms, overran 2ms (taking 9833ns) on average.
    2012-01-05 20:24:09 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 10666ns) on average.
    2012-01-05 20:24:21 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 8000ns) on average.
    2012-01-05 20:24:38 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 2ms, overran 2ms (taking 7000ns) on average.
    2012-01-05 20:24:53 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 6 times: late 2ms, overran 2ms (taking 8333ns) on average.
    2012-01-05 20:25:07 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 5 times: late 2ms, overran 2ms (taking 7800ns) on average.
    2012-01-05 20:25:25 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 6666ns) on average.
    2012-01-05 20:25:43 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 3ms, overran 2ms (taking 8000ns) on average.
    2012-01-05 20:25:54 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 8000ns) on average.
    2012-01-05 20:26:11 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 5 times: late 2ms, overran 2ms (taking 7000ns) on average.
    2012-01-05 20:26:25 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 2 times: late 2ms, overran 2ms (taking 9000ns) on average.
    2012-01-05 20:26:41 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 7333ns) on average.
    2012-01-05 20:27:00 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 2ms, overran 2ms (taking 292500ns) on average.
    2012-01-05 20:27:10 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 6000ns) on average.
    2012-01-05 20:27:26 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 8666ns) on average.
    2012-01-05 20:27:49 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 3ms, overran 2ms (taking 168000ns) on average.
    2012-01-05 20:28:00 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 5000ns) on average.
    2012-01-05 20:28:13 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 9000ns) on average.
    2012-01-05 20:28:31 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 6 times: late 2ms, overran 2ms (taking 8500ns) on average.
    2012-01-05 20:28:44 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 2 times: late 2ms, overran 2ms (taking 8500ns) on average.
    2012-01-05 20:28:59 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 3 times: late 2ms, overran 2ms (taking 7333ns) on average.
    2012-01-05 20:29:13 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 2 times: late 2ms, overran 2ms (taking 8500ns) on average.
    2012-01-05 20:29:28 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 4 times: late 2ms, overran 2ms (taking 8500ns) on average.
    2012-01-05 20:29:35 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 5 times: late 2ms, overran 2ms (taking 14600ns) on average.
    2012-01-05 20:29:45 info JournalInactive:test_cluster_toggle_load_w_sesame-40 task late and overran 1 times: late 2ms, overran 2ms (taking 9000ns) on average.
  120105_mrg-qe-16VMs_1.log._.dhcp-37-127.qpidd.log
    2012-01-05 20:25:38 warning CLOSING [10.34.37.127:5672-10.34.37.124:47894] unsent data (probably due to client disconnect)
    2012-01-05 20:25:38 warning CLOSING [10.34.37.127:5672-10.34.37.124:47893] unsent data (probably due to client disconnect)
    2012-01-05 20:27:14 warning CLOSING [10.34.37.127:5672-10.34.37.128:58336] unsent data (probably due to client disconnect)
  120105_mrg-qe-16VMs_1.log._.dhcp-37-128.qpidd.log
  [freznice@dhcp-26-251 cluster_test]$

Comment 2 Gordon Sim 2012-01-26 16:15:23 UTC
See also: https://bugzilla.redhat.com/show_bug.cgi?id=761023

Comment 3 Frantisek Reznicek 2012-01-27 13:12:05 UTC
Reproduced twice during the last week on a RHEL 5.7 x86_64 4-node cluster using packages:
  cman-2.0.115-85.el5_7.2
  cman-devel-2.0.115-85.el5_7.2
  cman-devel-2.0.115-85.el5_7.2
  openais-0.80.6-30.el5_7.4
  openais-debuginfo-0.80.6-30.el5_7.4
  openais-devel-0.80.6-30.el5_7.4
  python-qpid-0.14-1.el5
  python-qpid-qmf-0.14-2.el5
  python-saslwrapper-0.10-4.el5
  qpid-cpp-client-0.14-4.el5
  qpid-cpp-client-devel-0.14-4.el5
  qpid-cpp-client-devel-docs-0.14-4.el5
  qpid-cpp-client-rdma-0.14-4.el5
  qpid-cpp-client-ssl-0.14-4.el5
  qpid-cpp-mrg-debuginfo-0.14-4.el5
  qpid-cpp-server-0.14-4.el5
  qpid-cpp-server-cluster-0.14-4.el5
  qpid-cpp-server-devel-0.14-4.el5
  qpid-cpp-server-rdma-0.14-4.el5
  qpid-cpp-server-ssl-0.14-4.el5
  qpid-cpp-server-store-0.14-4.el5
  qpid-cpp-server-xml-0.14-4.el5
  qpid-java-client-0.14-1.el5
  qpid-java-common-0.14-1.el5
  qpid-java-example-0.14-1.el5
  qpid-qmf-0.14-2.el5
  qpid-qmf-debuginfo-0.14-2.el5
  qpid-qmf-devel-0.14-2.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-1.el5
  rgmanager-2.0.52-21.el5
  rh-qpid-cpp-tests-0.14-4.el5
  ruby-qpid-qmf-0.14-2.el5
  ruby-saslwrapper-0.10-4.el5
  saslwrapper-0.10-4.el5
  saslwrapper-debuginfo-0.10-4.el5
  saslwrapper-devel-0.10-4.el5
  sesame-1.0-2.el5
  sesame-debuginfo-1.0-2.el5

Raising reproducibility to 50%


Details of failures:
Another data loss occurred in test_cluster_dtx_phased_local_clikill9 after about 7 hours of cluster operation / tests:

./ctests.py --cluster-maximize-uptime  --qmf-data-timeout=200
            --testset-loop-cnt=5      --log-to-file=120126_mrg-qe-16VMs_2.log
            --selinux-state-force=0
            10.34.37.124 10.34.37.125 10.34.37.127 10.34.37.128
  [23:18:04] DEBUG:TEST 'test_cluster_dtx_phased_local_clikill9' loop 1/1 started. [tst_cnt:12451, err_cnt:0, dur.:24511.3 (6.8h)]
  ...
  [23:20:09] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=19 --messages-per-tx=119 --tx-count=150 --total-messages=2437 --heartbeat 2  --durable yes --size 1024 ', result:-1, dur.:0.05
  [23:20:12] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
  [23:20:15] INFO:10.34.37.128:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=19 --messages-per-tx=119 --tx-count=150 --total-messages=2437 --heartbeat 2  --durable yes --size 1024 ', result:0, dur.:0.98
  [23:20:16] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no  ', result:0, dur.:0.25
  [23:20:16] INFO:10.34.37.128:'P=qpid-txtest;ks=9;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.09
  [23:20:16] DEBUG:stdout:
    .qpid-txtest (pid:11552) killed -9 (ecode:0)
  [23:20:16] DEBUG:test_cluster_dtx_phased_local_clikill9(): client finished (0.1 sec[s])
  [23:20:16] INFO:10.34.37.128:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no  ', result:-1, dur.:0.07
  [23:20:19] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
  [23:20:22] ERROR:10.34.37.124:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=18 --messages-per-tx=112 --tx-count=104 --total-messages=8421 --heartbeat 2 --tcp-nodelay --durable no  ', result:1, dur.:0.33
  [23:20:22] DEBUG:stdout:
    Drained 8197 messages from tx-test-1
    Drained 112 messages from tx-test-2
    Drained 0 messages from tx-test-3
    Drained 0 messages from tx-test-4
    Drained 0 messages from tx-test-5
    Drained 0 messages from tx-test-6
    Drained 0 messages from tx-test-7
    Drained 0 messages from tx-test-8
    Drained 0 messages from tx-test-9
    Drained 0 messages from tx-test-10
    Drained 0 messages from tx-test-11
    Drained 0 messages from tx-test-12
    Drained 0 messages from tx-test-13
    Drained 0 messages from tx-test-14
    Drained 0 messages from tx-test-15
    Drained 0 messages from tx-test-16
    Drained 0 messages from tx-test-17
    Drained 0 messages from tx-test-18
    The following ids were missing:
        'msg-113'
        ...
        'msg-224'
  [23:20:22] DEBUG:stderr:
  
  [23:20:22] INFO:10.34.37.124:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:0, dur.:0.18
  [23:20:23] INFO:10.34.37.125:'P=qpid-txtest;ks=9;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.21
  [23:20:23] DEBUG:stdout:
    .....qpid-txtest (pid:4357) killed -9 (ecode:0)
  [23:20:23] DEBUG:test_cluster_dtx_phased_local_clikill9(): client finished (0.0 sec[s])
  [23:20:23] INFO:10.34.37.125:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:-1, dur.:0.06
  [23:20:26] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
  [23:20:28] INFO:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=20 --messages-per-tx=148 --tx-count=109 --total-messages=5382 --heartbeat 2 --tcp-nodelay --durable no --size 16 ', result:0, dur.:0.31

broker logs (no error/critical entries):
  2012-01-26 17:16:58 warning CLOSING [127.0.0.1:5672-127.0.0.1:46504] unsent data (probably due to client disconnect)
  2012-01-26 17:23:52 warning CLOSING [10.34.37.124:5672-10.34.37.127:56953] unsent data (probably due to client disconnect)
  2012-01-26 17:25:03 warning CLOSING [10.34.37.124:5672-10.34.37.125:38412] unsent data (probably due to client disconnect)
  2012-01-26 17:26:39 warning CLOSING [10.34.37.124:5672-10.34.37.125:38430] unsent data (probably due to client disconnect)


There was another failure a day earlier in test_cluster_dtx_phased_nonlocal_clikill15 with the same symptoms:

  [16:59:58] INFO:10.34.37.124:'qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279  --tcp-nodelay --durable no --size 16 --broker=10.34.37.125', result:0, dur.:0.13
  [16:59:58] INFO:10.34.37.125:'P=qpid-txtest;ks=15;sts=$SECONDS;kld=0;e=;while true; do pp=$p; p=`pidof $P`; if [ -n "$p" ]; then killall -$ks $P; e=$?; let "kld++"; else if [ $kld -gt 0 ]; then echo "$P (pid:$pp) killed -$ks (ecode:$e)"; break; fi;fi; echo -n "."; usleep $RANDOM; [ "$(($SECONDS-$sts))" -gt 10 ] && break; done; pidof $P', result:1, dur.:0.14
  [16:59:58] DEBUG:stdout:
    ..qpid-txtest (pid:25905) killed -15 (ecode:0)
  [16:59:58] DEBUG:test_cluster_dtx_phased_nonlocal_clikill15(): client finished (0.0 sec[s])
  [16:59:58] INFO:10.34.37.125:'qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279  --tcp-nodelay --durable no --size 16 --broker=10.34.37.127', result:-1, dur.:0.08
  [17:00:02] INFO:localhost:'Cluster should be (4 of 4 up)', result:True, dur.:-1.00
  [17:00:04] ERROR:10.34.37.127:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=14 --messages-per-tx=173 --tx-count=150 --total-messages=2279  --tcp-nodelay --durable no --size 16 --broker=10.34.37.128', result:1, dur.:0.25
  [17:00:04] DEBUG:stdout:
    Drained 2106 messages from tx-test-1
    Drained 0 messages from tx-test-2
    Drained 0 messages from tx-test-3
    Drained 0 messages from tx-test-4
    Drained 0 messages from tx-test-5
    Drained 0 messages from tx-test-6
    Drained 0 messages from tx-test-7
    Drained 0 messages from tx-test-8
    Drained 0 messages from tx-test-9
    Drained 0 messages from tx-test-10
    Drained 0 messages from tx-test-11
    Drained 0 messages from tx-test-12
    Drained 0 messages from tx-test-13
    Drained 0 messages from tx-test-14
    The following ids were missing:
        'msg-1'
        ...
        'msg-99'
  [17:00:04] DEBUG:stderr:

Comment 4 Frantisek Reznicek 2012-01-27 13:14:23 UTC
The last two failures prove that message loss can happen when the dtx client is killed with either SIGTERM or SIGKILL.

Comment 5 Kim van der Riet 2012-01-27 19:28:05 UTC
I will attempt to reproduce, however, in the interests of getting more information from your tests, can you run this test again with the following additional broker parameter:

--log-enable debug+:msgstore

This will print debug-level messages for the store only; I am keen to see some of the recovery details this will provide in a failure case.

Thanks

Comment 6 Frantisek Reznicek 2012-01-30 09:48:25 UTC
The test was re-run. I'll keep you updated about the results.

Comment 7 Frantisek Reznicek 2012-02-02 08:46:02 UTC
I've launched 3 test runs and have to say that the original issue discussed here is suppressed by the requested '--log-enable debug+:msgstore' broker logging option.

Continuing...

Comment 8 Kim van der Riet 2012-02-03 18:07:08 UTC
(In reply to comment #7)
> I've launched 3 test runs and I need to say that original issue discussed here
> is inhibited by requested broker's logging '--log-enable debug+:msgstore'
> option.

Just to be clear: This option is in addition to the log-enable option you are already using (it looks as though you are using info+). So there will be a total of two log-enable parameters to the broker:

--log-enable info+ --log-enable debug+:msgstore

This should keep all the broker logs at info+ level, and only the store prints debug level logs. I was going to suggest that for this test, you keep this logging arrangement permanently - I guess we'll see if we can get a hit first.
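Putting the two cumulative --log-enable flags together, the broker invocation would look something like this (a sketch only; port, auth, and log-file settings are illustrative, not from this report):

```shell
# Hypothetical qpidd invocation combining the two logging flags from
# Comment 8: general broker logging stays at info+, while the msgstore
# module additionally emits debug-level messages.
qpidd --auth no --port 5672 \
      --log-enable info+ \
      --log-enable debug+:msgstore \
      --log-to-file /var/log/qpidd.log
```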

Comment 9 Frantisek Reznicek 2012-02-06 09:33:08 UTC
Created attachment 559578 [details]
Issue logs with requested log levels

I succeeded in reproducing the issue with the requested logging above; see the attached logs.

Unfortunately, part of the qpidd log is missing (the part you are looking for), which I cannot currently explain; it looks like a logging bug.

Comment 11 Frantisek Reznicek 2012-08-27 14:25:33 UTC
Reproduced again in a 2-node cluster running RHEL 5.8 i386+x86_64:
[13:14:38] INFO:10.34.27.69:'rpm -qa | grep -E  '(qpid|qmf|sesame|saslwrapper|corosync|openais|cman|rgmanager)' | sort', result:0, dur.:3.59
[13:14:38] DEBUG:stdout:
  cman-2.0.115-96.el5_8.3
  openais-0.80.6-36.el5_8.2
  python-qpid-0.14-11.el5
  python-qpid-qmf-0.14-14.el5
  python-saslwrapper-0.10-4.el5
  qpid-cpp-client-0.14-21.el5
  qpid-cpp-client-devel-0.14-21.el5
  qpid-cpp-client-devel-docs-0.14-21.el5
  qpid-cpp-client-rdma-0.14-21.el5
  qpid-cpp-client-ssl-0.14-21.el5
  qpid-cpp-server-0.14-21.el5
  qpid-cpp-server-cluster-0.14-21.el5
  qpid-cpp-server-devel-0.14-21.el5
  qpid-cpp-server-rdma-0.14-21.el5
  qpid-cpp-server-ssl-0.14-21.el5
  qpid-cpp-server-store-0.14-21.el5
  qpid-cpp-server-xml-0.14-21.el5
  qpid-java-client-0.18-1.el5
  qpid-java-common-0.18-1.el5
  qpid-java-example-0.18-1.el5
  qpid-jca-0.18-1.el5
  qpid-jca-xarecovery-0.18-1.el5
  qpid-jca-zip-0.18-1.el5
  qpid-qmf-0.14-14.el5
  qpid-qmf-devel-0.14-14.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-5.el5
  rgmanager-2.0.52-28.el5_8.3
  rh-qpid-cpp-tests-0.14-21.el5
  ruby-qpid-qmf-0.14-14.el5
  ruby-saslwrapper-0.10-4.el5
  saslwrapper-0.10-4.el5
  saslwrapper-devel-0.10-4.el5
  sesame-1.0-4.el5
[14:11:49] ERROR:10.34.27.69:'qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=19 --messages-per-tx=157 --tx-count=114 --total-messages=4722   --durable yes --size 1024 ', result:1, dur.:0.87
[14:11:49] DEBUG:stdout:
  Drained 4408 messages from tx-test-1
  Drained 157 messages from tx-test-2
  Drained 0 messages from tx-test-3
  Drained 0 messages from tx-test-4
  Drained 0 messages from tx-test-5
  Drained 0 messages from tx-test-6
  Drained 0 messages from tx-test-7
  Drained 0 messages from tx-test-8
  Drained 0 messages from tx-test-9
  Drained 0 messages from tx-test-10
  Drained 0 messages from tx-test-11
  Drained 0 messages from tx-test-12
  Drained 0 messages from tx-test-13
  Drained 0 messages from tx-test-14
  Drained 0 messages from tx-test-15
  Drained 0 messages from tx-test-16
  Drained 0 messages from tx-test-17
  Drained 0 messages from tx-test-18
  Drained 0 messages from tx-test-19
  The following ids were missing:
      'msg-158'
      'msg-159'
      'msg-160'
      'msg-161'
      'msg-162'
      ...
      'msg-311'
      'msg-312'
      'msg-313'
      'msg-314'
[14:11:49] DEBUG:stderr:
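A quick arithmetic check on the counts in the log above is worth noting: the loss is exactly one transaction's worth of messages (all numbers below are taken directly from the log; only the script itself is new):

```python
# Sanity check on the Comment 11 failure: 4722 total messages were sent
# with 157 messages per transaction; 4408 + 157 were drained, and ids
# msg-158 through msg-314 were reported missing.
total = 4722                          # --total-messages
per_tx = 157                          # --messages-per-tx
drained = 4408 + 157                  # tx-test-1 + tx-test-2 (others drained 0)
missing = list(range(158, 314 + 1))   # ids msg-158 .. msg-314 inclusive

# The gap equals exactly one transaction of 157 messages,
# consistent with a single dtx transaction being lost.
assert total - drained == len(missing) == per_tx
print(f"lost {len(missing)} messages = one transaction of {per_tx}")
```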

Comment 12 Justin Ross 2013-02-23 11:53:18 UTC
This won't be reproducible on 2.4/0.22, since it won't have the clustering and dtx implementation referenced here.

