Bug 461932
| Summary: | Support heartbeats | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Gordon Sim <gsim> |
| Component: | qpid-cpp | Assignee: | Andrew Stitcher <astitcher> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
| Severity: | high | Docs Contact: | |
| Priority: | high | CC: | cctrieloff, esammons, freznice, mgoulish, mkudlej, sergei.vorobiev |
| Version: | 1.0 | Target Milestone: | 1.1.1 |
| Target Release: | --- | Fixed In Version: | |
| Hardware: | All | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2009-04-21 16:17:46 UTC |
| Bug Blocks: | 478874 | | |
Description
Gordon Sim
2008-09-11 14:21:57 UTC
No test info. Putting NEEDINFO flag.

Heartbeat has now been implemented for the C++ client and broker. The implementation is as follows: the broker allows a heartbeat interval between no heartbeat (=0) and 2 minutes (120 s). This is sent to the client as part of the Tune message in the initial connection setup. Note that heartbeat is a Connection property. As per the AMQP standard, the timeout is currently specified in seconds. The C++ server implementation will at present send a heartbeat frame every heartbeat interval, regardless of any other traffic on the connection; at one frame per second this is a pretty low communication overhead. Once heartbeat has been negotiated, the client ensures that there is never a period of 2*timeout without any traffic on the connection. It does this by setting a timer for twice the heartbeat period and extending the timeout whenever there is traffic on the connection. If the client sees no traffic for twice the heartbeat interval, it tears down the connection; if the client is managing failover (in a clustered scenario), this will initiate failover.

Limitations: the current implementation only sends heartbeat controls from broker to client, not the other way round. The broker will accept heartbeat controls from the client and ignore them, but it will currently never check for lack of traffic from the client.
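To make the negotiated behaviour concrete, here is a minimal, self-contained sketch of the client-side rule described above. This is not the actual qpid::client code (which lives in qpid/client/ConnectionImpl.cpp and uses qpid::sys::Timer); the class and method names below are illustrative only, and std::chrono is used purely for brevity.

```cpp
#include <chrono>
#include <iostream>

// Sketch of the client-side heartbeat rule: once a heartbeat interval h has been
// negotiated via the connection Tune exchange, the client keeps a deadline 2*h in
// the future, pushes it forward on any inbound traffic, and treats an expired
// deadline as a dead connection (which in a clustered setup triggers failover).
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::chrono::seconds heartbeat)
        : timeout_(2 * heartbeat), deadline_(now() + timeout_) {}

    // Call whenever any frame (data or heartbeat) arrives on the connection.
    void onTraffic() { deadline_ = now() + timeout_; }

    // True once 2*heartbeat has elapsed with no traffic; the client would then
    // tear down the connection.
    bool expired() const { return now() > deadline_; }

private:
    static std::chrono::steady_clock::time_point now() {
        return std::chrono::steady_clock::now();
    }
    std::chrono::seconds timeout_;
    std::chrono::steady_clock::time_point deadline_;
};

int main() {
    HeartbeatMonitor monitor(std::chrono::seconds(1)); // heartbeat=1s, so timeout=2s
    monitor.onTraffic();                               // e.g. a heartbeat frame arrived
    std::cout << (monitor.expired() ? "connection dead" : "connection alive") << '\n';
}
```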
Testing: there are two cases to test:

1. Simple client disconnect upon loss of server heartbeat.
2. Cluster failover upon loss of server heartbeat.

Case 1:
a. Set up a broker in window 1: qpidd --auth no ...
b. Start perftest in another window: perftest --heartbeat 1
c. Now stop the qpidd process, either by ^z in window 1 or by sending it the STOP signal: kill -STOP ...
d. After 2 s (the heartbeat interval was set to 1 s in step b), perftest should close its connection to the broker.

Case 2:
a. Set up a simple cluster of brokers (I assume you already have clustering working), something like:
src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir --port 12345 &
src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir
b. Create a queue for the sender/receiver: qpid-config add queue test-queue
c. Run the receiver: src/tests/receiver --heartbeat 1
d. Run the sender (with a large number of messages to send):
(for (( i=30000; i--;)) do echo $i; done) | src/tests/sender --heartbeat 1 --send-eos 1
e. Now stop the broker running on port 5672 (the one without --port 12345 on the command line), either by ^z in the window you started it from or by using kill -STOP ...
f. After a 2 s pause you should see that sending/receiving goes on uninterrupted with no lost messages.
g. You can now kill -KILL the broker, or continue it with kill -CONT / fg.

The first case (1) works well. The second case does not work as we expected.

a) Using kill -9 on "the broker without --port 12345" at step e: we can see that the receiver does not receive all messages sent by the sender, so there is message loss. Moreover, both receiver and sender exit immediately after the qpidd kill with exit code 139 (SEGFAULT).

[root@dell-pe2900-03 ~]# ./receiver --heartbeat 1 > file.txt; echo $? > rec_exit.txt
2009-feb-27 11:23:28 warning Connection closed
[3]+ Killed qpidd --auth no --cluster-name ams --no-data-dir
Segmentation fault
[root@dell-pe2900-03 ~]# cat rec
receiver receiver.cpp rec_exit.txt
[root@dell-pe2900-03 ~]# cat rec_exit.txt
139
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:05:32 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:08:55 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1; echo $? > sender_exit.txt
2009-feb-27 11:23:28 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# cat sender_exit.txt
139

So it seems that the failover from the qpidd on the default port to the qpidd on port 12345 did not complete correctly, and message loss is seen.

b) Using kill -STOP on "the broker without --port 12345" at step e: in this case the receiver gets even more messages than expected (100k sent, 100.1k received), i.e. a few tens of duplicated messages were detected. This probably happens because of the asynchronous communication between the receiver and the qpidd on the default port: after the qpidd on the default port is stopped, the other qpidd does not know about a few of the receiver's acks. That could explain why it happens, but we are not sure about correctness. Moreover, after bringing the qpidd on the default port up again (kill -CONT), we can see that a newly launched receiver gets approximately 20% old messages (from the previous session). We believe this is incorrect.

Putting in ASSIGNED with NEEDINFO.

a) If you use kill -9 on the broker you aren't testing heartbeat, you are testing detection of disconnection at the client end. Heartbeat is only useful if the connection is still connected, so this doesn't test the heartbeat functionality.

b) The heartbeat functionality is not connected to reconnection at all; it is only responsible for ensuring that the connection gets cut if the heartbeat isn't detected. The test b) above, if it passed, would demonstrate that both heartbeat and reconnect work. But in this failure, at least in this instance, it appears that the bug is in the reconnect logic. I suggest you raise a new BZ against that logic.

Changing back to Modified as there is no heartbeat issue in Comment #8.

Can we get a backtrace from the segfaults please? (Maybe also an indication of whether it is independent of the --heartbeat option.) FYI, duplication is to be expected (you can avoid seeing it by using the --ignore-duplicates option to the receiver). Message loss _IS_ a problem, but not necessarily connected to the heartbeats.

Ignore the comments above (I was being lazy!). I have reproduced the issue:
#0 0x45b8b410 in __kernel_vsyscall ()
#1 0x45bd1069 in raise () from /lib/libc.so.6
#2 0x45bd2671 in abort () from /lib/libc.so.6
#3 0x006e4d60 in __gnu_cxx::__verbose_terminate_handler ()
from /usr/lib/libstdc++.so.6
#4 0x006e2795 in std::set_unexpected () from /usr/lib/libstdc++.so.6
#5 0x006e27d2 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x006e2eb5 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x0053f946 in qpid::client::HeartbeatTask::fire (this=0x8f259a8)
at qpid/client/ConnectionImpl.cpp:59
#8 0x00c6dbd5 in qpid::sys::TimerTask::fireTask (this=0x8f259a8)
at qpid/sys/Timer.cpp:51
#9 0x00c6eadb in qpid::sys::Timer::run (this=0x5b80a0)
at qpid/sys/Timer.cpp:91
#10 0x00c0e301 in qpid::sys::(anonymous namespace)::runRunnable (p=0x5b80a0)
at qpid/sys/posix/Thread.cpp:35
#11 0x45f0640b in start_thread () from /lib/libpthread.so.0
#12 0x45c71b7e in clone () from /lib/libc.so.6
The above segfault should be resolved by r749297. I believe this is the cause of the apparent 'lost messages'. Duplicated messages are to be expected; use the --ignore-duplicates option to exclude them from the list reported by the receiver.
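For illustration only: the pure virtual call in HeartbeatTask::fire in the backtrace above is the kind of crash one sees when a timer task fires against a connection object that is already being torn down. The sketch below shows that generic hazard and the usual cure (cancel the task before its owner goes away). It is not the actual qpid code and not the content of r749297; every name in it is made up.

```cpp
#include <iostream>
#include <memory>

// Generic sketch: a periodic task holds a raw back-pointer to its owner. If the
// owner dies without cancelling the task first, the next fire() touches a dead
// object. The cure shown here is to cancel the task in the owner's destructor.
struct Task {
    virtual ~Task() {}
    virtual void fire() = 0;
    bool cancelled = false;
    void cancel() { cancelled = true; }   // owner detaches the task before dying
};

struct Connection;
struct HeartbeatCheck : Task {
    explicit HeartbeatCheck(Connection* c) : conn(c) {}
    void fire();                          // defined below; uses conn
    Connection* conn;                     // raw back-pointer: the dangerous part
};

struct Connection {
    std::shared_ptr<HeartbeatCheck> heartbeatTask;
    void start() { heartbeatTask.reset(new HeartbeatCheck(this)); }
    void checkTraffic() { std::cout << "checking heartbeat deadline\n"; }
    ~Connection() {
        if (heartbeatTask) heartbeatTask->cancel();  // without this, a later fire() is unsafe
    }
};

void HeartbeatCheck::fire() {
    if (cancelled) return;                // safe: task was detached from its owner
    conn->checkTraffic();
}

int main() {
    std::shared_ptr<HeartbeatCheck> task;
    {
        Connection c;
        c.start();
        task = c.heartbeatTask;           // the timer thread keeps the task alive...
        task->fire();                     // ...and fires it while the connection exists
    }                                     // connection destroyed here, task cancelled
    task->fire();                         // no-op instead of touching a dead connection
}
```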
When re-testing on RHEL 5.2 / 5.3 I found misbehavior of test scenario 2 (kill -STOP): not all messages are delivered to the receiver after one of the clustered qpidd instances is stopped. The behavior below should be reproducible in 40-60% of cases; sometimes the test procedure results in no message loss. It is clearly visible that messages are missing from the middle of the input file, i.e. the messages being processed at the kill -STOP <pid> moment.

- experiment #1 on RHEL 5.2 x86_64 -

[root@dhcp-lab-200 bz461932]# rm -rf *.log *.ecode data*
[root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \
> echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[1] 30585
[root@dhcp-lab-200 bz461932]# sleep 2
[root@dhcp-lab-200 bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \
> echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[2] 30609
[root@dhcp-lab-200 bz461932]# sleep 2
[root@dhcp-lab-200 bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then
> echo "1st:qpidd1&qpidd2 up & running"
> else
> echo "ERROR:1st:qpidd1&qpidd2 up & running"
> fi
1st:qpidd1&qpidd2 up & running
[root@dhcp-lab-200 bz461932]# qpid-config add queue test-queue
[root@dhcp-lab-200 bz461932]# echo $?
0
[root@dhcp-lab-200 bz461932]# # create data
[root@dhcp-lab-200 bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in
[root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]"
data.in: 30000 lines [unique:30000]
[root@dhcp-lab-200 bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) &
[3] 30647
[root@dhcp-lab-200 bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" > receiver.ecode ) &
[4] 30649
[root@dhcp-lab-200 bz461932]# sleep 1
[root@dhcp-lab-200 bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}')
[root@dhcp-lab-200 bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}')
[root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \
> -n "${receiver_pid}" -a -n "${sender_pid}" ]; then
> echo "2nd:qpidd1&qpidd2&receiver&sender up & running"
> else
> echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running"
> fi
2nd:qpidd1&qpidd2&receiver&sender up & running
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 12552 lines [unique:12761]
[root@dhcp-lab-200 bz461932]# kill -STOP ${qpidd1_pid}
[root@dhcp-lab-200 bz461932]# sleep 2
2009-mar-09 15:33:21 warning Connection closed
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 12813 lines [unique:12813]
[root@dhcp-lab-200 bz461932]# sleep 4
[3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode )
[4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" > receiver.ecode )
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30122 lines [unique:29908]
[root@dhcp-lab-200 bz461932]# sleep 4
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30122 lines [unique:29908]
[root@dhcp-lab-200 bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su
[root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (uniquified): $(grep -c ^msg_ receiver.log.su) l"
data.in: 30000 l; receiver log (uniquified): 29908 l
[root@dhcp-lab-200 bz461932]# diff data.in receiver.log.su
12814,12905d12813
< msg_17186
< msg_17185
< msg_17184
< msg_17183
< msg_17182
< msg_17181
< msg_17180
< msg_17179
< msg_17178
< msg_17177
< msg_17176
< msg_17175
< msg_17174
< msg_17173
< msg_17172
< msg_17171
< msg_17170
< msg_17169
< msg_17168
< msg_17167
< msg_17166
< msg_17165
< msg_17164
< msg_17163
< msg_17162
< msg_17161
< msg_17160
< msg_17159
< msg_17158
< msg_17157
< msg_17156
< msg_17155
< msg_17154
< msg_17153
< msg_17152
< msg_17151
< msg_17150
< msg_17149
< msg_17148
< msg_17147
< msg_17146
< msg_17145
< msg_17144
< msg_17143
< msg_17142
< msg_17141
< msg_17140
< msg_17139
< msg_17138
< msg_17137
< msg_17136
< msg_17135
< msg_17134
< msg_17133
< msg_17132
< msg_17131
< msg_17130
< msg_17129
< msg_17128
< msg_17127
< msg_17126
< msg_17125
< msg_17124
< msg_17123
< msg_17122
< msg_17121
< msg_17120
< msg_17119
< msg_17118
< msg_17117
< msg_17116
< msg_17115
< msg_17114
< msg_17113
< msg_17112
< msg_17111
< msg_17110
< msg_17109
< msg_17108
< msg_17107
< msg_17106
< msg_17105
< msg_17104
< msg_17103
< msg_17102
< msg_17101
< msg_17100
< msg_17099
< msg_17098
< msg_17097
< msg_17096
< msg_17095
- experiment #2 on RHEL 5.3 i386 -

[root@ibm-firefly bz461932]# rm -rf *.log *.ecode data*
[root@ibm-firefly bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \
> echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[1] 27942
[root@ibm-firefly bz461932]# sleep 2
[root@ibm-firefly bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@ibm-firefly bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \
> echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[2] 27960
[root@ibm-firefly bz461932]# sleep 2
[root@ibm-firefly bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then
> echo "1st:qpidd1&qpidd2 up & running"
> else
> echo "ERROR:1st:qpidd1&qpidd2 up & running"
> fi
1st:qpidd1&qpidd2 up & running
[root@ibm-firefly bz461932]# qpid-config add queue test-queue
[root@ibm-firefly bz461932]# echo $?
0
[root@ibm-firefly bz461932]# # create data
[root@ibm-firefly bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in
[root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]"
data.in: 30000 lines [unique:30000]
[root@ibm-firefly bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) &
[3] 27991
[root@ibm-firefly bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" > receiver.ecode ) &
[4] 27993
[root@ibm-firefly bz461932]# sleep 1
[root@ibm-firefly bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}')
[root@ibm-firefly bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}')
[root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \
> -n "${receiver_pid}" -a -n "${sender_pid}" ]; then
> echo "2nd:qpidd1&qpidd2&receiver&sender up & running"
> else
> echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running"
> fi
2nd:qpidd1&qpidd2&receiver&sender up & running
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 9651 lines [unique:9936]
[root@ibm-firefly bz461932]# kill -STOP ${qpidd1_pid}
[root@ibm-firefly bz461932]# sleep 2
2009-mar-09 10:13:13 warning Connection closed
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 10639 lines [unique:10639]
[root@ibm-firefly bz461932]# sleep 4
[3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode )
[4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" > receiver.ecode )
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# sleep 4
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su
[root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (uniquified): $(grep -c ^msg_ receiver.log.su) l"
data.in: 30000 l; receiver log (uniquified): 29917 l
[root@ibm-firefly bz461932]# diff data.in receiver.log.su
10640,10722d10639
< msg_19360
< msg_19359
< msg_19358
< msg_19357
< msg_19356
< msg_19355
< msg_19354
< msg_19353
< msg_19352
< msg_19351
< msg_19350
< msg_19349
< msg_19348
< msg_19347
< msg_19346
< msg_19345
< msg_19344
< msg_19343
< msg_19342
< msg_19341
< msg_19340
< msg_19339
< msg_19338
< msg_19337
< msg_19336
< msg_19335
< msg_19334
< msg_19333
< msg_19332
< msg_19331
< msg_19330
< msg_19329
< msg_19328
< msg_19327
< msg_19326
< msg_19325
< msg_19324
< msg_19323
< msg_19322
< msg_19321
< msg_19320
< msg_19319
< msg_19318
< msg_19317
< msg_19316
< msg_19315
< msg_19314
< msg_19313
< msg_19312
< msg_19311
< msg_19310
< msg_19309
< msg_19308
< msg_19307
< msg_19306
< msg_19305
< msg_19304
< msg_19303
< msg_19302
< msg_19301
< msg_19300
< msg_19299
< msg_19298
< msg_19297
< msg_19296
< msg_19295
< msg_19294
< msg_19293
< msg_19292
< msg_19291
< msg_19290
< msg_19289
< msg_19288
< msg_19287
< msg_19286
< msg_19285
< msg_19284
< msg_19283
< msg_19282
< msg_19281
< msg_19280
< msg_19279
< msg_19278
[root@ibm-firefly bz461932]# setenforce 1
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# jobs
[1]- Running ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 > qpidd2.log 2>&1; echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[2]+ Running ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 > qpidd1.log 2>&1; echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[root@ibm-firefly bz461932]# ps auxw | grep qpidd
root 27944 3.2 0.3 136696 15276 pts/1 Sl 10:13 0:03 qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2
root 27962 1.3 0.3 111580 12912 pts/1 Tl 10:13 0:01 qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1
root 28055 0.0 0.0 3912 708 pts/1 S+ 10:14 0:00 grep qpidd

Putting back to ASSIGNED.

Test scenario 1 passes on RHEL 4.7 / 5.3 i386 / x86_64. -> ASSIGNED

Re-tested the failover behavior of clustered qpidd and it works without any issue (disabling heartbeat and killing the broker with kill -9 instead of kill -STOP).

I retested Frantisek's test scenario (comment 15) on RHEL 5.3 i386 with qpidc 0.5.752581-1.el5 and rhm 0.5.3153-1.el5, with the same result as in comment 15.
I observed message holding and late delivery (not message loss). It looks like it is not related to heartbeat, but to cluster behavior when a broker becomes unresponsive (i.e. with kill -STOP). I submitted a separate BZ, 490931, to cover this.
1. start_cluster
{
NOTE 1
{
The first broker, which is the one that the sender and
receiver will connect to, advertises only one interface
to the cluster. This was part of the solution to my experience
of hanging forever on mrg5.
The other part of the solution was to modify cpp/src/tests/receiver.cpp
so that, when it failed over due to heartbeat loss, it would not
retry the same address and hang (see the sketch after this note).
}
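For illustration, the spirit of that receiver.cpp change could be captured by a helper along these lines. The name editUrlList is taken from the debug output quoted in step 5 below; this standalone function is purely hypothetical and is not the actual modification.

```cpp
#include <string>
#include <vector>

// Hypothetical helper in the spirit of the Receiver::editUrlList mentioned in the
// receiver output below: drop the address of the broker that was just lost from the
// list of reconnect candidates, so the client does not spin retrying a dead address.
std::vector<std::string> editUrlList(const std::vector<std::string>& knownBrokers,
                                     const std::string& failedAddress)
{
    std::vector<std::string> candidates;
    for (std::vector<std::string>::const_iterator i = knownBrokers.begin();
         i != knownBrokers.end(); ++i) {
        if (*i != failedAddress) candidates.push_back(*i);
    }
    return candidates;
}
```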
NOTE 2
{
If you use --tcp-nodelay in the broker command lines, the
number of "lost" messages will be greatly reduced.
}
script
{
#! /bin/bash
pkill qpidd
sleep 1
rootdir=/home/mgoulish/461932
rm $rootdir/qpidd1.log $rootdir/qpidd2.log $rootdir/qpidd3.log
rm -rf $rootdir/data1 $rootdir/data2 $rootdir/data3
mkdir $rootdir/data1 $rootdir/data2 $rootdir/data3
QPIDD=/usr/sbin/qpidd
$QPIDD --cluster-name micks_cluster \
-p 5672 \
--cluster-url amqp:tcp:10.16.43.11:5672 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd1.log \
-d \
--data-dir $rootdir/data1
$QPIDD --cluster-name micks_cluster \
-p 6666 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd2.log \
-d \
--data-dir $rootdir/data2
$QPIDD --cluster-name micks_cluster \
-p 6667 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd3.log \
-d \
--data-dir $rootdir/data3
}
}
2. declare the queues
{
I did this with a modified copy of cpp/examples/direct/declare_queues.cpp.
The modification looks like this (a fuller self-contained sketch follows this block):
session.queueDeclare(arg::queue="test-queue");
session.exchangeBind(arg::exchange="amq.direct", arg::queue="test-queue", arg::bindingKey="test-queue");
}
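For context, a self-contained version of such a modified declare_queues.cpp would look roughly like the following. It is based on the stock cpp/examples/direct/declare_queues.cpp; the host/port handling and error handling here are illustrative.

```cpp
#include <qpid/client/Connection.h>
#include <qpid/client/Session.h>

#include <cstdlib>
#include <exception>
#include <iostream>

using namespace qpid::client;

// Declares test-queue and binds it to amq.direct with binding key test-queue,
// mirroring the modification described above.
int main(int argc, char** argv) {
    const char* host = argc > 1 ? argv[1] : "127.0.0.1";
    int port = argc > 2 ? atoi(argv[2]) : 5672;

    Connection connection;
    try {
        connection.open(host, port);
        Session session = connection.newSession();

        session.queueDeclare(arg::queue="test-queue");
        session.exchangeBind(arg::exchange="amq.direct",
                             arg::queue="test-queue",
                             arg::bindingKey="test-queue");

        connection.close();
        return 0;
    } catch (const std::exception& error) {
        std::cout << error.what() << std::endl;
    }
    return 1;
}
```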
3. run sender to completion
{
script
{
#! /bin/bash
dir=/home/mgoulish/trunk/qpid/cpp/src/tests
(for (( i=100000; i--;)) do echo $i; done) | $dir/sender --heartbeat 1 --send-eos 1
}
}
4. run receiver and stop default broker.
{
Run the receiver script and then, as quickly as possible, kill -STOP the
broker that is listening on the default port.
}
5. receiver behavior
{
the receiver keeps running but puts a message like this in its
output:
91593
91592
91591
2009-mar-18 09:41:33 warning Connection closed
MDEBUG Receiver::editUrlList called !
resuming_receiver first connect.
91589
91588
91587
91586
NOTE
{
In this case only one message was lost, because I used --tcp-nodelay
at broker startup. Without --tcp-nodelay I would expect to see
a few thousand missing.
}
}
6. kill -9 the default broker
7. wait for the receiver to terminate.
8.
{
look at the receiver output to see out-of-order messages
like this:
52474
52473
91590 <-- yikes!
52472
52471
}
NOTA BENE
{
When I made this change to cpp/src/tests/receiver.cpp, at gsim's
suggestion, the lost-message problem went away; presumably this is because
messageWindow(1) limits the subscription to one unaccepted message in flight,
so there is almost nothing prefetched to lose when the failover happens.
change
{
void Receiver::execute(AsyncSession& session, bool isRetry)
{
    std::cout << "resuming_receiver " << (isRetry ? "first " : "re-") << "connect." << endl;
    if ( ! done ) {
        SubscriptionManager subs(session);
        /* ADD THIS! */ settings.flowControl = FlowControl::messageWindow(1);
        subscription = subs.subscribe(*this, queue, settings);
        subs.run();
        if (settings.autoAck) {
            subscription.accept(subscription.getUnaccepted());
        }
    }
}
}
}
The feature has been implemented; no message loss is seen after a slight modification of the receiver and stopping + killing the broker. The holding of messages when qpidd is stopped is addressed in another BZ (490931).

Validated on RHEL 5.2 / 5.3 i386 / x86_64 with the following packages:

[root@hp-dl385-01 bz461932]# rpm -qa | egrep '(qpid|rhm|openai)'
rhm-docs-0.5.753238-1.el5
qpidc-rdma-0.5.752581-1.el5
qpidd-xml-0.5.752581-1.el5
qpidc-perftest-0.5.752581-1.el5
openais-0.80.3-22.el5_3.3
qpid-java-common-0.5.751061-1.el5
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-1.el5
qpidc-ssl-0.5.752581-1.el5
qpidd-ssl-0.5.752581-1.el5
qpidd-acl-0.5.752581-1.el5
rhm-0.5.3153-1.el5
qpidc-devel-0.5.752581-1.el5
qpidd-devel-0.5.752581-1.el5
qpid-java-client-0.5.751061-1.el5
qpidd-0.5.752581-1.el5
qpidd-rdma-0.5.752581-1.el5
qpidd-cluster-0.5.752581-1.el5

-> VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html