In the C++ broker and client at a minimum.
No test info. Putting NEEDINFO flag.
Heartbeat has now been implemented for the C++ client and broker. The implementation is as follows: the broker allows a heartbeat interval between no heartbeat (0) and 2 minutes (120s). This is sent to the client as part of the Tune message in the initial connection setup. Note that heartbeat is a connection property. As per the AMQP standard, the timeout is currently specified in seconds. The C++ broker implementation will at present send a heartbeat frame every heartbeat interval regardless of any other traffic on the connection; at 1 frame per second this is a fairly low communication overhead. Once heartbeat has been negotiated, the client ensures that there is never a period of 2 * heartbeat interval without any traffic on the connection. It does this by setting a timer for twice the heartbeat period and extending the timeout whenever there is traffic on the connection. If the client sees no traffic for twice the heartbeat interval, it tears down the connection. If the client is managing failover (in a clustered scenario), this will initiate failover.
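For illustration only, a minimal sketch of how a client could request this heartbeat with the 0.5-era C++ API (this is my sketch, not part of the qpid test suite; it assumes the qpid::client::ConnectionSettings heartbeat field described above):

    #include <qpid/client/Connection.h>
    #include <qpid/client/ConnectionSettings.h>

    int main() {
        qpid::client::ConnectionSettings settings;
        settings.host = "localhost";
        settings.port = 5672;
        settings.heartbeat = 1;   // request a 1-second heartbeat; the value is
                                  // negotiated during connection setup (the broker
                                  // caps it at 120s per the comment above)

        qpid::client::Connection connection;
        connection.open(settings);

        // ... use the connection as normal; once heartbeat is negotiated, the
        // client tears the connection down if it sees no traffic (heartbeat
        // frames included) for 2 * heartbeat seconds.

        connection.close();
        return 0;
    }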
Limitations: The current implementation only sends heartbeat controls from broker to client and not the other way round. The broker will accept heartbeat controls from the client and ignore them; however, the broker currently never checks for lack of traffic from the client.
Testing: There are 2 cases to test:

1. simple client disconnect upon loss of server heartbeat
2. cluster failover upon loss of server heartbeat

Case 1:
a. Set up a broker in window 1: qpidd --auth no ...
b. Start perftest in another window: perftest --heartbeat 1
c. Now stop the qpidd process, either by ^Z in window 1 or by sending it the STOP signal: kill -STOP ...
d. After 2s (we set the heartbeat interval to 1s in b.) perftest should close the connection with the broker.

Case 2:
a. Set up a simple cluster of brokers [I assume you already have clustering working], something like:
   src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir --port 12345 &
   src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir
b. Create a queue for the sender/receiver: qpid-config add queue test-queue
c. Run the receiver: src/tests/receiver --heartbeat 1
d. Run the sender (with a large number of messages to send):
   (for (( i=30000; i--;)) do echo $i; done) | src/tests/sender --heartbeat 1 --send-eos 1
e. Now stop the broker running on port 5672 (the one without --port 12345 on the command line), either by ^Z in the window you started it from or by using kill -STOP ...
f. After a 2s pause you should see that sending/receiving go on uninterrupted with no lost messages.
g. You can now kill -KILL the broker, or continue it with kill -CONT / fg.
The first case (1.) works well. The second case does not work as we expected.

a) Using kill -9 on "the broker without --port 12345" at step e., we can see that the receiver does not receive all messages sent by the sender, so there is message loss. Moreover, both receiver and sender exit immediately after the qpidd kill with exit code 139 (SIGSEGV).

[root@dell-pe2900-03 ~]# ./receiver --heartbeat 1 > file.txt; echo $? > rec_exit.txt
2009-feb-27 11:23:28 warning Connection closed
[3]+ Killed qpidd --auth no --cluster-name ams --no-data-dir
Segmentation fault
[root@dell-pe2900-03 ~]# cat rec receiver receiver.cpp rec_exit.txt
[root@dell-pe2900-03 ~]# cat rec_exit.txt
139
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:05:32 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:08:55 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1; echo $? > sender_exit.txt
2009-feb-27 11:23:28 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# cat sender_exit.txt
139

So it seems that failover from the qpidd on the default port to the qpidd on port 12345 did not go correctly -> message loss seen!

b) Using kill -STOP on "the broker without --port 12345" at step e. In this case the receiver gets even more messages than expected (100k sent, 100.1k received), i.e. a few tens of duplicated messages were detected. This probably happens because of the asynchronous communication between the receiver and the qpidd on the default port: after the qpidd on the default port is stopped, the other qpidd does not know about a few of the receiver's acks. That could explain why it happens, but we are not sure it is correct behaviour. Moreover, after bringing the qpidd on the default port up again (kill -CONT), we can see that a newly launched receiver gets approximately 20% old messages (from the previous session). We believe this is incorrect.

Putting in ASSIGNED with NEEDINFO
a) If you use kill -9 on the broker you aren't testing heartbeat, you are testing detection of disconnection at the client end. Heartbeat is only useful if the connection is still connected, so this doesn't test the heartbeat functionality.

b) The heartbeat functionality is not connected to reconnection at all; it is only responsible for ensuring that the connection gets cut if the heartbeat isn't detected. Test b) above, if it passed, would demonstrate that both heartbeat and reconnect work. But in this failure, at least in this instance, it appears that the bug is in the reconnect logic. I suggest you raise a new BZ against that logic.
Changing back to Modified as there is no heartbeat issue in Comment #8
Can we get a backtrace from the segfaults please? (Maybe also an indication of whether that is independent of the --heartbeat option.)
FYI, duplication is to be expected (you can avoid seeing it by using the --ignore-duplicates option to the receiver). Message loss _IS_ a problem, but not necessarily connected to the heartbeats.
Ignore the comments above (I was being lazy!). I have reproduced the issue:

#0  0x45b8b410 in __kernel_vsyscall ()
#1  0x45bd1069 in raise () from /lib/libc.so.6
#2  0x45bd2671 in abort () from /lib/libc.so.6
#3  0x006e4d60 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4  0x006e2795 in std::set_unexpected () from /usr/lib/libstdc++.so.6
#5  0x006e27d2 in std::terminate () from /usr/lib/libstdc++.so.6
#6  0x006e2eb5 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7  0x0053f946 in qpid::client::HeartbeatTask::fire (this=0x8f259a8) at qpid/client/ConnectionImpl.cpp:59
#8  0x00c6dbd5 in qpid::sys::TimerTask::fireTask (this=0x8f259a8) at qpid/sys/Timer.cpp:51
#9  0x00c6eadb in qpid::sys::Timer::run (this=0x5b80a0) at qpid/sys/Timer.cpp:91
#10 0x00c0e301 in qpid::sys::(anonymous namespace)::runRunnable (p=0x5b80a0) at qpid/sys/posix/Thread.cpp:35
#11 0x45f0640b in start_thread () from /lib/libpthread.so.0
#12 0x45c71b7e in clone () from /lib/libc.so.6
The above segfault should be resolved by r749297. I believe this is the cause of the apparent 'lost messages'. Duplicated messages are to be expected; use the --ignore-duplicates option to exclude them from the list reported by the receiver.
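For context on the crash above: the backtrace shows the heartbeat timer firing into a connection object that is already being torn down (hence the pure-virtual call in HeartbeatTask::fire). Below is a minimal, self-contained sketch of one general way to make such a timer callback safe; the class names are hypothetical stand-ins, std::weak_ptr is used for brevity where the 2009 codebase would use the boost equivalents, and the actual change in r749297 may take a different approach:

    #include <iostream>
    #include <memory>

    // Hypothetical stand-ins; not the real qpid classes.
    class Connection {
    public:
        void heartbeat() { std::cout << "send heartbeat\n"; }
    };

    class HeartbeatTask {
        std::weak_ptr<Connection> connection;   // weak: does not keep the connection alive
    public:
        explicit HeartbeatTask(const std::shared_ptr<Connection>& c) : connection(c) {}

        // Called from the timer thread; harmless if the connection is already gone.
        void fire() {
            if (std::shared_ptr<Connection> c = connection.lock())
                c->heartbeat();     // connection still alive: send the heartbeat
            // else: connection destroyed between scheduling and firing: do nothing
        }
    };

    int main() {
        std::shared_ptr<Connection> conn = std::make_shared<Connection>();
        HeartbeatTask task(conn);

        task.fire();    // fires normally while the connection exists
        conn.reset();   // simulate the connection being torn down (e.g. broker killed)
        task.fire();    // the timer may still fire, but it is now a no-op instead of a crash
        return 0;
    }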
When re-testing on RHEL 5.2 / 5.3 found misbehavior of test scenario 2 (kill -STOP)... I can see that not all messages are delivered to receiver after one of the cluster qpidd instances stopped. Below mentioned behavior should be reproducable in 40-60% of cases. It was seen that sometimes the test procedures result in no message loss. It is clearly seen that there are missing some messages from the middle of the input file i.e. some messages beeing processed in kill -STOP <pid> moment. - experiment #1 on RHEL 5.2 x86_64 - [root@dhcp-lab-200 bz461932]# rm -rf *.log *.ecode data* [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \ > echo "qpidd2 ecode:$?" > qpidd2.ecode ) & [1] 30585 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# sleep 2 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}') [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \ > echo "qpidd1 ecode:$?" > qpidd1.ecode ) & [2] 30609 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# sleep 2 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}') [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then > echo "1st:qpidd1&qpidd2 up & running" > else > echo "ERROR:1st:qpidd1&qpidd2 up & running" > fi 1st:qpidd1&qpidd2 up & running [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# qpid-config add queue test-queue [root@dhcp-lab-200 bz461932]# echo $? 0 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# # create data [root@dhcp-lab-200 bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in [root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]" data.in: 30000 lines [unique:30000] [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) & [3] 30647 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" 
> receiver.ecode ) & [4] 30649 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# sleep 1 [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}') [root@dhcp-lab-200 bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}') [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \ > -n "${receiver_pid}" -a -n "${sender_pid}" ]; then > echo "2nd:qpidd1&qpidd2&receiver&sender up & running" > else > echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running" > fi 2nd:qpidd1&qpidd2&receiver&sender up & running [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 12552 lines [unique:12761] [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# kill -STOP ${qpidd1_pid} [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# sleep 2 2009-mar-09 15:33:21 warning Connection closed [root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 12813 lines [unique:12813] [root@dhcp-lab-200 bz461932]# sleep 4 [3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode ) [4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" > receiver.ecode ) [root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 30122 lines [unique:29908] [root@dhcp-lab-200 bz461932]# sleep 4 [root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 30122 lines [unique:29908] [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (uniquified): $(grep -c ^msg_ receiver.log.su) l" data.in: 30000 l; receiver log (uniquified): 29908 l [root@dhcp-lab-200 bz461932]# [root@dhcp-lab-200 bz461932]# diff data.in receiver.log.su 12814,12905d12813 < msg_17186 < msg_17185 < msg_17184 < msg_17183 < msg_17182 < msg_17181 < msg_17180 < msg_17179 < msg_17178 < msg_17177 < msg_17176 < msg_17175 < msg_17174 < msg_17173 < msg_17172 < msg_17171 < msg_17170 < msg_17169 < msg_17168 < msg_17167 < msg_17166 < msg_17165 < msg_17164 < msg_17163 < msg_17162 < msg_17161 < msg_17160 < msg_17159 < msg_17158 < msg_17157 < msg_17156 < msg_17155 < msg_17154 < msg_17153 < msg_17152 < msg_17151 < msg_17150 < msg_17149 < msg_17148 < msg_17147 < msg_17146 < msg_17145 < msg_17144 < msg_17143 < msg_17142 < msg_17141 < msg_17140 < msg_17139 < msg_17138 < msg_17137 < msg_17136 < msg_17135 < msg_17134 < msg_17133 < msg_17132 < msg_17131 < msg_17130 < msg_17129 < msg_17128 < msg_17127 < msg_17126 < msg_17125 < msg_17124 < msg_17123 < msg_17122 < msg_17121 < msg_17120 < msg_17119 < msg_17118 < msg_17117 < msg_17116 < msg_17115 < msg_17114 < msg_17113 < msg_17112 < msg_17111 < msg_17110 < msg_17109 < msg_17108 < msg_17107 < msg_17106 < msg_17105 < msg_17104 < msg_17103 < msg_17102 < msg_17101 < msg_17100 < msg_17099 < msg_17098 < msg_17097 < msg_17096 < msg_17095 - experiment #2 on RHEL 5.3 i386 - [root@ibm-firefly bz461932]# rm -rf *.log *.ecode data* 
[root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \ > echo "qpidd2 ecode:$?" > qpidd2.ecode ) & [1] 27942 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# sleep 2 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}') [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \ > echo "qpidd1 ecode:$?" > qpidd1.ecode ) & [2] 27960 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# sleep 2 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}') [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then > echo "1st:qpidd1&qpidd2 up & running" > else > echo "ERROR:1st:qpidd1&qpidd2 up & running" > fi 1st:qpidd1&qpidd2 up & running [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# qpid-config add queue test-queue [root@ibm-firefly bz461932]# echo $? 0 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# # create data [root@ibm-firefly bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in [root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]" data.in: 30000 lines [unique:30000] [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) & [3] 27991 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" > receiver.ecode ) & [4] 27993 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# sleep 1 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}') [root@ibm-firefly bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}') [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \ > -n "${receiver_pid}" -a -n "${sender_pid}" ]; then > echo "2nd:qpidd1&qpidd2&receiver&sender up & running" > else > echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running" > fi 2nd:qpidd1&qpidd2&receiver&sender up & running [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 9651 lines [unique:9936] [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# kill -STOP ${qpidd1_pid} [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# sleep 2 2009-mar-09 10:13:13 warning Connection closed [root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 10639 lines [unique:10639] [root@ibm-firefly bz461932]# sleep 4 [3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode ) [4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" 
> receiver.ecode ) [root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 30058 lines [unique:29917] [root@ibm-firefly bz461932]# sleep 4 [root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 30058 lines [unique:29917] [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (iniquified): $(grep -c ^msg_ receiver.log.su) l" data.in: 30000 l; receiver log (iniquified): 29917 l [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# diff data.in receiver.log.su 10640,10722d10639 < msg_19360 < msg_19359 < msg_19358 < msg_19357 < msg_19356 < msg_19355 < msg_19354 < msg_19353 < msg_19352 < msg_19351 < msg_19350 < msg_19349 < msg_19348 < msg_19347 < msg_19346 < msg_19345 < msg_19344 < msg_19343 < msg_19342 < msg_19341 < msg_19340 < msg_19339 < msg_19338 < msg_19337 < msg_19336 < msg_19335 < msg_19334 < msg_19333 < msg_19332 < msg_19331 < msg_19330 < msg_19329 < msg_19328 < msg_19327 < msg_19326 < msg_19325 < msg_19324 < msg_19323 < msg_19322 < msg_19321 < msg_19320 < msg_19319 < msg_19318 < msg_19317 < msg_19316 < msg_19315 < msg_19314 < msg_19313 < msg_19312 < msg_19311 < msg_19310 < msg_19309 < msg_19308 < msg_19307 < msg_19306 < msg_19305 < msg_19304 < msg_19303 < msg_19302 < msg_19301 < msg_19300 < msg_19299 < msg_19298 < msg_19297 < msg_19296 < msg_19295 < msg_19294 < msg_19293 < msg_19292 < msg_19291 < msg_19290 < msg_19289 < msg_19288 < msg_19287 < msg_19286 < msg_19285 < msg_19284 < msg_19283 < msg_19282 < msg_19281 < msg_19280 < msg_19279 < msg_19278 [root@ibm-firefly bz461932]# [root@ibm-firefly bz461932]# setenforce 1 [root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]" receiver log: 30058 lines [unique:29917] [root@ibm-firefly bz461932]# jobs [1]- Running ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 > qpidd2.log 2>&1; echo "qpidd2 ecode:$?" > qpidd2.ecode ) & [2]+ Running ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 > qpidd1.log 2>&1; echo "qpidd1 ecode:$?" > qpidd1.ecode ) & [root@ibm-firefly bz461932]# ps auxw | grep qpidd root 27944 3.2 0.3 136696 15276 pts/1 Sl 10:13 0:03 qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 root 27962 1.3 0.3 111580 12912 pts/1 Tl 10:13 0:01 qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 root 28055 0.0 0.0 3912 708 pts/1 S+ 10:14 0:00 grep qpidd Putting back to ASSIGNED. Test scenario 1 passes on RHEL4.7 / 5.3 i386 / x86_64. ->ASSIGNED
Re-tested failover behaviour of clustered qpidd and it works without any issue (disabling heartbeat + killing the broker with kill -9 instead of kill -STOP).
I retested Frantisek's test scenario (comment 15) on RHEL 5.3 i386 with qpidc 0.5.752581-1.el5 and rhm 0.5.3153-1.el5, with the same result as in comment 15.
I observed message-holding and late delivery (not message loss). It looks like it is not related to heartbeat, but to cluster behavior when a broker becomes unresponsive (i.e. with kill -STOP). I submitted a separate BZ -- 490931 -- to cover this.

1. start_cluster {
     NOTE 1 {
       The first broker, which is the one that the sender and receiver will connect to, is advertising only one interface to the cluster. This was part of the solution to my experience of hanging-forever on mrg5. The other part of the solution was to modify cpp/src/tests/receiver.cpp so that, when it failed over due to heartbeat loss, it would not retry the same address and hang.
     }
     NOTE 2 {
       If you use --tcp-nodelay in the broker command lines, the number of "lost" messages will be greatly reduced.
     }
     script {
       #! /bin/bash

       pkill qpidd
       sleep 1

       rootdir=/home/mgoulish/461932

       rm $rootdir/qpidd1.log $rootdir/qpidd2.log $rootdir/qpidd3.log
       rm -rf $rootdir/data1 $rootdir/data2 $rootdir/data3
       mkdir $rootdir/data1 $rootdir/data2 $rootdir/data3

       QPIDD=/usr/sbin/qpidd

       $QPIDD --cluster-name micks_cluster \
              -p 5672 \
              --cluster-url amqp:tcp:10.16.43.11:5672 \
              --auth=no \
              --log-enable info+ \
              --log-to-file $rootdir/qpidd1.log \
              -d \
              --data-dir $rootdir/data1

       $QPIDD --cluster-name micks_cluster \
              -p 6666 \
              --auth=no \
              --log-enable info+ \
              --log-to-file $rootdir/qpidd2.log \
              -d \
              --data-dir $rootdir/data2

       $QPIDD --cluster-name micks_cluster \
              -p 6667 \
              --auth=no \
              --log-enable info+ \
              --log-to-file $rootdir/qpidd3.log \
              -d \
              --data-dir $rootdir/data3
     }
   }

2. declare the queues {
     I did this with a modified copy of cpp/examples/direct/declare_queues.cpp. The modification looks like this:

       session.queueDeclare(arg::queue="test-queue");
       session.exchangeBind(arg::exchange="amq.direct",
                            arg::queue="test-queue",
                            arg::bindingKey="test-queue");
   }

3. run sender to completion {
     script {
       #! /bin/bash
       dir=/home/mgoulish/trunk/qpid/cpp/src/tests
       (for (( i=100000; i--;)) do echo $i; done) | $dir/sender --heartbeat 1 --send-eos 1
     }
   }

4. run receiver and stop default broker {
     Run this script and then as quickly as possible kill -STOP the broker that is listening on the default port.
   }

5. receiver behavior {
     The receiver keeps running but puts a message like this in its output:

       91593
       91592
       91591
       2009-mar-18 09:41:33 warning Connection closed
       MDEBUG Receiver::editUrlList called !
       resuming_receiver first connect.
       91589
       91588
       91587
       91586

     NOTE {
       In this case only one message was lost because I used --tcp-nodelay in broker startup. Without --tcp-nodelay I would expect to see a few thousand missing.
     }
   }

6. kill -9 the default broker

7. wait for the receiver to terminate.

8. {
     Look at the receiver output to see out-of-order messages like this:

       52474
       52473
       91590   <-- yikes!
       52472
       52471
   }

NOTA BENE {
   When I made this change to cpp/src/tests/receiver.cpp, at gsim's suggestion, the lost-message problem went away.

   change {
     void Receiver::execute(AsyncSession& session, bool isRetry)
     {
         std::cout << "resuming_receiver " << (isRetry ? "first " : "re-") << "connect." << endl;

         if ( ! done ) {
             SubscriptionManager subs(session);
             /* ADD THIS! */
             settings.flowControl = FlowControl::messageWindow(1);
             subscription = subs.subscribe(*this, queue, settings);
             subs.run();
             if (settings.autoAck) {
                 subscription.accept(subscription.getUnaccepted());
             }
         }
     }
   }
}
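A side note on the change quoted above, with a minimal standalone sketch (my illustration and my reading of the flow-control API, not part of the original report): FlowControl::messageWindow(1) asks the broker for a credit window of a single message, so at most one unaccepted message is outstanding to the subscriber at any time, which limits how much can be in flight (and therefore held or re-delivered) across a failover.

    #include <qpid/client/Connection.h>
    #include <qpid/client/Session.h>
    #include <qpid/client/Message.h>
    #include <qpid/client/MessageListener.h>
    #include <qpid/client/SubscriptionManager.h>
    #include <iostream>

    using namespace qpid::client;

    // Trivial listener that just prints each message.
    class Printer : public MessageListener {
    public:
        void received(Message& message) {
            std::cout << message.getData() << std::endl;
        }
    };

    int main() {
        Connection connection;
        connection.open("localhost", 5672);
        Session session = connection.newSession();

        Printer listener;
        SubscriptionManager subs(session);

        SubscriptionSettings settings;
        // Window of one message: the broker keeps at most one unaccepted
        // message outstanding to this subscriber.
        settings.flowControl = FlowControl::messageWindow(1);

        subs.subscribe(listener, "test-queue", settings);
        subs.run();   // blocks, dispatching messages to the listener

        connection.close();
        return 0;
    }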
The feature has been implemented; no message loss is seen after the slight modification of the receiver and stopping + killing the broker. The holding of messages while qpidd is stopped is addressed in another BZ (490931).

Validated on RHEL 5.2 / 5.3 i386 / x86_64 with packages:

[root@hp-dl385-01 bz461932]# rpm -qa | egrep '(qpid|rhm|openai)'
rhm-docs-0.5.753238-1.el5
qpidc-rdma-0.5.752581-1.el5
qpidd-xml-0.5.752581-1.el5
qpidc-perftest-0.5.752581-1.el5
openais-0.80.3-22.el5_3.3
qpid-java-common-0.5.751061-1.el5
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-1.el5
qpidc-ssl-0.5.752581-1.el5
qpidd-ssl-0.5.752581-1.el5
qpidd-acl-0.5.752581-1.el5
rhm-0.5.3153-1.el5
qpidc-devel-0.5.752581-1.el5
qpidd-devel-0.5.752581-1.el5
qpid-java-client-0.5.751061-1.el5
qpidd-0.5.752581-1.el5
qpidd-rdma-0.5.752581-1.el5
qpidd-cluster-0.5.752581-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0434.html