Bug 461932
| Summary: | Support heartbeats | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Gordon Sim <gsim> |
| Component: | qpid-cpp | Assignee: | Andrew Stitcher <astitcher> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
| Severity: | high | Docs Contact: | |
| Priority: | high | CC: | cctrieloff, esammons, freznice, mgoulish, mkudlej, sergei.vorobiev |
| Version: | 1.0 | Target Milestone: | 1.1.1 |
| Target Release: | --- | Fixed In Version: | |
| Hardware: | All | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2009-04-21 16:17:46 UTC |
| Bug Blocks: | 478874 | | |
Description
Gordon Sim
2008-09-11 14:21:57 UTC
No test info. Putting NEEDINFO flag.

Heartbeat has now been implemented for the C++ client and broker. The implementation is as follows: the broker allows a heartbeat interval between no heartbeat (=0) and 2 minutes (120 s). This is sent to the client as part of the Tune message in the initial connection setup. Note that heartbeat is a Connection property. As per the AMQP standard, the timeout is currently specified in seconds. The C++ server implementation will at present send a heartbeat frame every heartbeat interval, regardless of any other traffic on the connection; at one frame per second this is a pretty low communication overhead. Once heartbeat has been negotiated, the client ensures that there is never a period of 2*timeout without any traffic on the connection. It does this by setting a timer for twice the heartbeat period and extending the timeout whenever there is traffic on the connection. If the client sees no traffic for twice the heartbeat interval, it tears down the connection; if the client is managing failover (in a clustered scenario), this will initiate failover.

Limitations: the current implementation only sends heartbeat controls from broker to client, not the other way round. The broker will accept heartbeat controls from the client and ignore them, but it will currently never check for lack of traffic from the client.
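To make the negotiated behaviour concrete, here is a minimal, self-contained sketch of the client-side rule described above. This is not the actual qpid::client code (which lives in qpid/client/ConnectionImpl.cpp and uses qpid::sys::Timer); the class and method names below are illustrative only, and std::chrono is used purely for brevity.

```cpp
#include <chrono>
#include <iostream>

// Sketch of the client-side heartbeat rule: once a heartbeat interval h has been
// negotiated via the connection Tune exchange, the client keeps a deadline 2*h in
// the future, pushes it forward on any inbound traffic, and treats an expired
// deadline as a dead connection (which in a clustered setup triggers failover).
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::chrono::seconds heartbeat)
        : timeout_(2 * heartbeat), deadline_(now() + timeout_) {}

    // Call whenever any frame (data or heartbeat) arrives on the connection.
    void onTraffic() { deadline_ = now() + timeout_; }

    // True once 2*heartbeat has elapsed with no traffic; the client would then
    // tear down the connection.
    bool expired() const { return now() > deadline_; }

private:
    static std::chrono::steady_clock::time_point now() {
        return std::chrono::steady_clock::now();
    }
    std::chrono::seconds timeout_;
    std::chrono::steady_clock::time_point deadline_;
};

int main() {
    HeartbeatMonitor monitor(std::chrono::seconds(1)); // heartbeat=1s, so timeout=2s
    monitor.onTraffic();                               // e.g. a heartbeat frame arrived
    std::cout << (monitor.expired() ? "connection dead" : "connection alive") << '\n';
}
```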
Testing: there are two cases to test:

1. Simple client disconnect upon loss of server heartbeat.
2. Cluster failover upon loss of server heartbeat.

Case 1:
a. Set up a broker in window 1: qpidd --auth no ...
b. Start perftest in another window: perftest --heartbeat 1
c. Now stop the qpidd process, either by ^z in window 1 or by sending it the STOP signal: kill -STOP ...
d. After 2 s (the heartbeat interval was set to 1 s in step b), perftest should close its connection to the broker.

Case 2:
a. Set up a simple cluster of brokers (I assume you already have clustering working), something like:
src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir --port 12345 &
src/qpidd --auth no --load-module src/.libs/cluster.so --cluster-name ams --no-data-dir
b. Create a queue for the sender/receiver: qpid-config add queue test-queue
c. Run the receiver: src/tests/receiver --heartbeat 1
d. Run the sender (with a large number of messages to send):
(for (( i=30000; i--;)) do echo $i; done) | src/tests/sender --heartbeat 1 --send-eos 1
e. Now stop the broker running on port 5672 (the one without --port 12345 on the command line), either by ^z in the window you started it from or by using kill -STOP ...
f. After a 2 s pause you should see that sending/receiving goes on uninterrupted with no lost messages.
g. You can now kill -KILL the broker, or continue it with kill -CONT / fg.

The first case (1) works well. The second case does not work as we expected.

a) Using kill -9 on "the broker without --port 12345" at step e: we can see that the receiver does not receive all messages sent by the sender, so there is message loss. Moreover, both receiver and sender exit immediately after the qpidd kill with exit code 139 (SEGFAULT).

[root@dell-pe2900-03 ~]# ./receiver --heartbeat 1 > file.txt; echo $? > rec_exit.txt
2009-feb-27 11:23:28 warning Connection closed
[3]+ Killed qpidd --auth no --cluster-name ams --no-data-dir
Segmentation fault
[root@dell-pe2900-03 ~]# cat rec
receiver receiver.cpp rec_exit.txt
[root@dell-pe2900-03 ~]# cat rec_exit.txt
139
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:05:32 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1
2009-feb-27 11:08:55 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# (for (( i=100000; i--;)) do echo $i; done) | ./sender --heartbeat 1 --send-eos 1; echo $? > sender_exit.txt
2009-feb-27 11:23:28 warning Connection closed
Segmentation fault
[root@dell-pe2900-03 ~]# cat sender_exit.txt
139

So it seems that the failover from the qpidd on the default port to the qpidd on port 12345 did not complete correctly, and message loss is seen.

b) Using kill -STOP on "the broker without --port 12345" at step e: in this case the receiver gets even more messages than expected (100k sent, 100.1k received), i.e. a few tens of duplicated messages were detected. This probably happens because of the asynchronous communication between the receiver and the qpidd on the default port: after the qpidd on the default port is stopped, the other qpidd does not know about a few of the receiver's acks. That could explain why it happens, but we are not sure about correctness. Moreover, after bringing the qpidd on the default port up again (kill -CONT), we can see that a newly launched receiver gets approximately 20% old messages (from the previous session). We believe this is incorrect.

Putting in ASSIGNED with NEEDINFO.

a) If you use kill -9 on the broker you aren't testing heartbeat, you are testing detection of disconnection at the client end. Heartbeat is only useful if the connection is still connected, so this doesn't test the heartbeat functionality.

b) The heartbeat functionality is not connected to reconnection at all; it is only responsible for ensuring that the connection gets cut if the heartbeat isn't detected. The test b) above, if it passed, would demonstrate that both heartbeat and reconnect work. But in this failure, at least in this instance, it appears that the bug is in the reconnect logic. I suggest you raise a new BZ against that logic.

Changing back to Modified as there is no heartbeat issue in Comment #8.

Can we get a backtrace from the segfaults please? (Maybe also an indication of whether it is independent of the --heartbeat option.) FYI, duplication is to be expected (you can avoid seeing it by using the --ignore-duplicates option to the receiver). Message loss _IS_ a problem, but not necessarily connected to the heartbeats.

Ignore the comments above (I was being lazy!). I have reproduced the issue:
#0 0x45b8b410 in __kernel_vsyscall ()
#1 0x45bd1069 in raise () from /lib/libc.so.6
#2 0x45bd2671 in abort () from /lib/libc.so.6
#3 0x006e4d60 in __gnu_cxx::__verbose_terminate_handler ()
from /usr/lib/libstdc++.so.6
#4 0x006e2795 in std::set_unexpected () from /usr/lib/libstdc++.so.6
#5 0x006e27d2 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x006e2eb5 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x0053f946 in qpid::client::HeartbeatTask::fire (this=0x8f259a8)
at qpid/client/ConnectionImpl.cpp:59
#8 0x00c6dbd5 in qpid::sys::TimerTask::fireTask (this=0x8f259a8)
at qpid/sys/Timer.cpp:51
#9 0x00c6eadb in qpid::sys::Timer::run (this=0x5b80a0)
at qpid/sys/Timer.cpp:91
#10 0x00c0e301 in qpid::sys::(anonymous namespace)::runRunnable (p=0x5b80a0)
at qpid/sys/posix/Thread.cpp:35
#11 0x45f0640b in start_thread () from /lib/libpthread.so.0
#12 0x45c71b7e in clone () from /lib/libc.so.6
The above segfault should be resolved by r749297. I believe this is the cause of the apparent 'lost messages'. Duplicated messages are to be expected; use the --ignore-duplicates option to exclude them from the list reported by the receiver.
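For illustration only: the pure virtual call in HeartbeatTask::fire in the backtrace above is the kind of crash one sees when a timer task fires against a connection object that is already being torn down. The sketch below shows that generic hazard and the usual cure (cancel the task before its owner goes away). It is not the actual qpid code and not the content of r749297; every name in it is made up.

```cpp
#include <iostream>
#include <memory>

// Generic sketch: a periodic task holds a raw back-pointer to its owner. If the
// owner dies without cancelling the task first, the next fire() touches a dead
// object. The cure shown here is to cancel the task in the owner's destructor.
struct Task {
    virtual ~Task() {}
    virtual void fire() = 0;
    bool cancelled = false;
    void cancel() { cancelled = true; }   // owner detaches the task before dying
};

struct Connection;
struct HeartbeatCheck : Task {
    explicit HeartbeatCheck(Connection* c) : conn(c) {}
    void fire();                          // defined below; uses conn
    Connection* conn;                     // raw back-pointer: the dangerous part
};

struct Connection {
    std::shared_ptr<HeartbeatCheck> heartbeatTask;
    void start() { heartbeatTask.reset(new HeartbeatCheck(this)); }
    void checkTraffic() { std::cout << "checking heartbeat deadline\n"; }
    ~Connection() {
        if (heartbeatTask) heartbeatTask->cancel();  // without this, a later fire() is unsafe
    }
};

void HeartbeatCheck::fire() {
    if (cancelled) return;                // safe: task was detached from its owner
    conn->checkTraffic();
}

int main() {
    std::shared_ptr<HeartbeatCheck> task;
    {
        Connection c;
        c.start();
        task = c.heartbeatTask;           // the timer thread keeps the task alive...
        task->fire();                     // ...and fires it while the connection exists
    }                                     // connection destroyed here, task cancelled
    task->fire();                         // no-op instead of touching a dead connection
}
```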
When re-testing on RHEL 5.2 / 5.3 I found misbehavior of test scenario 2 (kill -STOP): not all messages are delivered to the receiver after one of the clustered qpidd instances is stopped. The behavior below should be reproducible in 40-60% of cases; sometimes the test procedure results in no message loss. It is clearly visible that messages are missing from the middle of the input file, i.e. the messages being processed at the kill -STOP <pid> moment.

- experiment #1 on RHEL 5.2 x86_64 -

[root@dhcp-lab-200 bz461932]# rm -rf *.log *.ecode data*
[root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \
> echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[1] 30585
[root@dhcp-lab-200 bz461932]# sleep 2
[root@dhcp-lab-200 bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@dhcp-lab-200 bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \
> echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[2] 30609
[root@dhcp-lab-200 bz461932]# sleep 2
[root@dhcp-lab-200 bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then
> echo "1st:qpidd1&qpidd2 up & running"
> else
> echo "ERROR:1st:qpidd1&qpidd2 up & running"
> fi
1st:qpidd1&qpidd2 up & running
[root@dhcp-lab-200 bz461932]# qpid-config add queue test-queue
[root@dhcp-lab-200 bz461932]# echo $?
0
[root@dhcp-lab-200 bz461932]# # create data
[root@dhcp-lab-200 bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in
[root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]"
data.in: 30000 lines [unique:30000]
[root@dhcp-lab-200 bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) &
[3] 30647
[root@dhcp-lab-200 bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" > receiver.ecode ) &
[4] 30649
[root@dhcp-lab-200 bz461932]# sleep 1
[root@dhcp-lab-200 bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}')
[root@dhcp-lab-200 bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}')
[root@dhcp-lab-200 bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \
> -n "${receiver_pid}" -a -n "${sender_pid}" ]; then
> echo "2nd:qpidd1&qpidd2&receiver&sender up & running"
> else
> echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running"
> fi
2nd:qpidd1&qpidd2&receiver&sender up & running
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 12552 lines [unique:12761]
[root@dhcp-lab-200 bz461932]# kill -STOP ${qpidd1_pid}
[root@dhcp-lab-200 bz461932]# sleep 2
2009-mar-09 15:33:21 warning Connection closed
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 12813 lines [unique:12813]
[root@dhcp-lab-200 bz461932]# sleep 4
[3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode )
[4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" > receiver.ecode )
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30122 lines [unique:29908]
[root@dhcp-lab-200 bz461932]# sleep 4
[root@dhcp-lab-200 bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30122 lines [unique:29908]
[root@dhcp-lab-200 bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su
[root@dhcp-lab-200 bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (uniquified): $(grep -c ^msg_ receiver.log.su) l"
data.in: 30000 l; receiver log (uniquified): 29908 l
[root@dhcp-lab-200 bz461932]# diff data.in receiver.log.su
12814,12905d12813
< msg_17186
< msg_17185
< msg_17184
< msg_17183
< msg_17182
< msg_17181
< msg_17180
< msg_17179
< msg_17178
< msg_17177
< msg_17176
< msg_17175
< msg_17174
< msg_17173
< msg_17172
< msg_17171
< msg_17170
< msg_17169
< msg_17168
< msg_17167
< msg_17166
< msg_17165
< msg_17164
< msg_17163
< msg_17162
< msg_17161
< msg_17160
< msg_17159
< msg_17158
< msg_17157
< msg_17156
< msg_17155
< msg_17154
< msg_17153
< msg_17152
< msg_17151
< msg_17150
< msg_17149
< msg_17148
< msg_17147
< msg_17146
< msg_17145
< msg_17144
< msg_17143
< msg_17142
< msg_17141
< msg_17140
< msg_17139
< msg_17138
< msg_17137
< msg_17136
< msg_17135
< msg_17134
< msg_17133
< msg_17132
< msg_17131
< msg_17130
< msg_17129
< msg_17128
< msg_17127
< msg_17126
< msg_17125
< msg_17124
< msg_17123
< msg_17122
< msg_17121
< msg_17120
< msg_17119
< msg_17118
< msg_17117
< msg_17116
< msg_17115
< msg_17114
< msg_17113
< msg_17112
< msg_17111
< msg_17110
< msg_17109
< msg_17108
< msg_17107
< msg_17106
< msg_17105
< msg_17104
< msg_17103
< msg_17102
< msg_17101
< msg_17100
< msg_17099
< msg_17098
< msg_17097
< msg_17096
< msg_17095
- experiment #2 on RHEL 5.3 i386 -

[root@ibm-firefly bz461932]# rm -rf *.log *.ecode data*
[root@ibm-firefly bz461932]# ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 >qpidd2.log 2>&1 ; \
> echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[1] 27942
[root@ibm-firefly bz461932]# sleep 2
[root@ibm-firefly bz461932]# qpidd2_pid=$(netstat -nlp | grep qpidd | grep 5673 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@ibm-firefly bz461932]# ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 >qpidd1.log 2>&1 ; \
> echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[2] 27960
[root@ibm-firefly bz461932]# sleep 2
[root@ibm-firefly bz461932]# qpidd1_pid=$(netstat -nlp | grep qpidd | grep 5672 | awk '{print $(NF)}' | awk -F/ '{print $1}')
[root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" ]; then
> echo "1st:qpidd1&qpidd2 up & running"
> else
> echo "ERROR:1st:qpidd1&qpidd2 up & running"
> fi
1st:qpidd1&qpidd2 up & running
[root@ibm-firefly bz461932]# qpid-config add queue test-queue
[root@ibm-firefly bz461932]# echo $?
0
[root@ibm-firefly bz461932]# # create data
[root@ibm-firefly bz461932]# for (( i=30000; i--;)) do printf "msg_%05d\n" ${i}; done > data.in
[root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) lines [unique:$(grep ^msg_ data.in | sort -u | wc -l)]"
data.in: 30000 lines [unique:30000]
[root@ibm-firefly bz461932]# ( ./receiver --heartbeat 1 >receiver.log 2>&1 ; echo "receiver ecode:$?" > receiver.ecode ) &
[3] 27991
[root@ibm-firefly bz461932]# ( cat data.in | ./sender --heartbeat 1 --send-eos 1 ; echo "sender ecode:$?" > receiver.ecode ) &
[4] 27993
[root@ibm-firefly bz461932]# sleep 1
[root@ibm-firefly bz461932]# receiver_pid=$(ps aux | grep ./receiver | grep -v grep | awk '{print $2}')
[root@ibm-firefly bz461932]# sender_pid=$(ps aux | grep ./sender | grep -v grep | awk '{print $2}')
[root@ibm-firefly bz461932]# if [ -n "${qpidd1_pid}" -a -n "${qpidd2_pid}" -a \
> -n "${receiver_pid}" -a -n "${sender_pid}" ]; then
> echo "2nd:qpidd1&qpidd2&receiver&sender up & running"
> else
> echo "ERROR:2nd:qpidd1&qpidd2&receiver&sender up & running"
> fi
2nd:qpidd1&qpidd2&receiver&sender up & running
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 9651 lines [unique:9936]
[root@ibm-firefly bz461932]# kill -STOP ${qpidd1_pid}
[root@ibm-firefly bz461932]# sleep 2
2009-mar-09 10:13:13 warning Connection closed
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 10639 lines [unique:10639]
[root@ibm-firefly bz461932]# sleep 4
[3]- Done ( ./receiver --heartbeat 1 > receiver.log 2>&1; echo "receiver ecode:$?" > receiver.ecode )
[4]+ Done ( cat data.in | ./sender --heartbeat 1 --send-eos 1; echo "sender ecode:$?" > receiver.ecode )
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# sleep 4
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# grep ^msg_ receiver.log | sort -ru > receiver.log.su
[root@ibm-firefly bz461932]# echo "data.in: $(grep -c ^msg_ data.in) l; receiver log (uniquified): $(grep -c ^msg_ receiver.log.su) l"
data.in: 30000 l; receiver log (uniquified): 29917 l
[root@ibm-firefly bz461932]# diff data.in receiver.log.su
10640,10722d10639
< msg_19360
< msg_19359
< msg_19358
< msg_19357
< msg_19356
< msg_19355
< msg_19354
< msg_19353
< msg_19352
< msg_19351
< msg_19350
< msg_19349
< msg_19348
< msg_19347
< msg_19346
< msg_19345
< msg_19344
< msg_19343
< msg_19342
< msg_19341
< msg_19340
< msg_19339
< msg_19338
< msg_19337
< msg_19336
< msg_19335
< msg_19334
< msg_19333
< msg_19332
< msg_19331
< msg_19330
< msg_19329
< msg_19328
< msg_19327
< msg_19326
< msg_19325
< msg_19324
< msg_19323
< msg_19322
< msg_19321
< msg_19320
< msg_19319
< msg_19318
< msg_19317
< msg_19316
< msg_19315
< msg_19314
< msg_19313
< msg_19312
< msg_19311
< msg_19310
< msg_19309
< msg_19308
< msg_19307
< msg_19306
< msg_19305
< msg_19304
< msg_19303
< msg_19302
< msg_19301
< msg_19300
< msg_19299
< msg_19298
< msg_19297
< msg_19296
< msg_19295
< msg_19294
< msg_19293
< msg_19292
< msg_19291
< msg_19290
< msg_19289
< msg_19288
< msg_19287
< msg_19286
< msg_19285
< msg_19284
< msg_19283
< msg_19282
< msg_19281
< msg_19280
< msg_19279
< msg_19278
[root@ibm-firefly bz461932]# setenforce 1
[root@ibm-firefly bz461932]# echo "receiver log: $(grep -c ^msg_ receiver.log) lines [unique:$(grep ^msg_ receiver.log | sort -u | wc -l)]"
receiver log: 30058 lines [unique:29917]
[root@ibm-firefly bz461932]# jobs
[1]- Running ( qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2 > qpidd2.log 2>&1; echo "qpidd2 ecode:$?" > qpidd2.ecode ) &
[2]+ Running ( qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1 > qpidd1.log 2>&1; echo "qpidd1 ecode:$?" > qpidd1.ecode ) &
[root@ibm-firefly bz461932]# ps auxw | grep qpidd
root 27944 3.2 0.3 136696 15276 pts/1 Sl 10:13 0:03 qpidd --auth no -p 5673 --cluster-name ams --log-enable warning+ --data-dir data2
root 27962 1.3 0.3 111580 12912 pts/1 Tl 10:13 0:01 qpidd --auth no -p 5672 --cluster-name ams --log-enable warning+ --data-dir data1
root 28055 0.0 0.0 3912 708 pts/1 S+ 10:14 0:00 grep qpidd

Putting back to ASSIGNED.

Test scenario 1 passes on RHEL 4.7 / 5.3 i386 / x86_64. -> ASSIGNED

Re-tested the failover behavior of clustered qpidd and it works without any issue (disabling heartbeat and killing the broker with kill -9 instead of kill -STOP).

I retested Frantisek's test scenario (comment 15) on RHEL 5.3 i386 with qpidc 0.5.752581-1.el5 and rhm 0.5.3153-1.el5, with the same result as in comment 15.
I observed message holding and late delivery (not message loss). It looks like it is not related to heartbeat, but to cluster behavior when a broker becomes unresponsive (i.e. with kill -STOP). I submitted a separate BZ, 490931, to cover this.
1. start_cluster
{
NOTE 1
{
The first broker, which is the one that the sender and
receiver will connect to, advertises only one interface
to the cluster. This was part of the solution to my experience
of hanging forever on mrg5.
The other part of the solution was to modify cpp/src/tests/receiver.cpp
so that, when it failed over due to heartbeat loss, it would not
retry the same address and hang (see the sketch after this note).
}
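For illustration, the spirit of that receiver.cpp change could be captured by a helper along these lines. The name editUrlList is taken from the debug output quoted in step 5 below; this standalone function is purely hypothetical and is not the actual modification.

```cpp
#include <string>
#include <vector>

// Hypothetical helper in the spirit of the Receiver::editUrlList mentioned in the
// receiver output below: drop the address of the broker that was just lost from the
// list of reconnect candidates, so the client does not spin retrying a dead address.
std::vector<std::string> editUrlList(const std::vector<std::string>& knownBrokers,
                                     const std::string& failedAddress)
{
    std::vector<std::string> candidates;
    for (std::vector<std::string>::const_iterator i = knownBrokers.begin();
         i != knownBrokers.end(); ++i) {
        if (*i != failedAddress) candidates.push_back(*i);
    }
    return candidates;
}
```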
NOTE 2
{
If you use --tcp-nodelay in the broker command lines, the
number of "lost" messages will be greatly reduced.
}
script
{
#! /bin/bash
pkill qpidd
sleep 1
rootdir=/home/mgoulish/461932
rm $rootdir/qpidd1.log $rootdir/qpidd2.log $rootdir/qpidd3.log
rm -rf $rootdir/data1 $rootdir/data2 $rootdir/data3
mkdir $rootdir/data1 $rootdir/data2 $rootdir/data3
QPIDD=/usr/sbin/qpidd
$QPIDD --cluster-name micks_cluster \
-p 5672 \
--cluster-url amqp:tcp:10.16.43.11:5672 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd1.log \
-d \
--data-dir $rootdir/data1
$QPIDD --cluster-name micks_cluster \
-p 6666 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd2.log \
-d \
--data-dir $rootdir/data2
$QPIDD --cluster-name micks_cluster \
-p 6667 \
--auth=no \
--log-enable info+ \
--log-to-file $rootdir/qpidd3.log \
-d \
--data-dir $rootdir/data3
}
}
2. declare the queues
{
I did this with a modified copy of cpp/examples/direct/declare_queues.cpp.
The modification looks like this (a fuller self-contained sketch follows this block):
session.queueDeclare(arg::queue="test-queue");
session.exchangeBind(arg::exchange="amq.direct", arg::queue="test-queue", arg::bindingKey="test-queue");
}
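For context, a self-contained version of such a modified declare_queues.cpp would look roughly like the following. It is based on the stock cpp/examples/direct/declare_queues.cpp; the host/port handling and error handling here are illustrative.

```cpp
#include <qpid/client/Connection.h>
#include <qpid/client/Session.h>

#include <cstdlib>
#include <exception>
#include <iostream>

using namespace qpid::client;

// Declares test-queue and binds it to amq.direct with binding key test-queue,
// mirroring the modification described above.
int main(int argc, char** argv) {
    const char* host = argc > 1 ? argv[1] : "127.0.0.1";
    int port = argc > 2 ? atoi(argv[2]) : 5672;

    Connection connection;
    try {
        connection.open(host, port);
        Session session = connection.newSession();

        session.queueDeclare(arg::queue="test-queue");
        session.exchangeBind(arg::exchange="amq.direct",
                             arg::queue="test-queue",
                             arg::bindingKey="test-queue");

        connection.close();
        return 0;
    } catch (const std::exception& error) {
        std::cout << error.what() << std::endl;
    }
    return 1;
}
```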
3. run sender to completion
{
script
{
#! /bin/bash
dir=/home/mgoulish/trunk/qpid/cpp/src/tests
(for (( i=100000; i--;)) do echo $i; done) | $dir/sender --heartbeat 1 --send-eos 1
}
}
4. run receiver and stop default broker.
{
Run the receiver script and then, as quickly as possible, kill -STOP the
broker that is listening on the default port.
}
5. receiver behavior
{
the receiver keeps running but puts a message like this in its
output:
91593
91592
91591
2009-mar-18 09:41:33 warning Connection closed
MDEBUG Receiver::editUrlList called !
resuming_receiver first connect.
91589
91588
91587
91586
NOTE
{
In this case only one message was lost, because I used --tcp-nodelay
at broker startup. Without --tcp-nodelay I would expect to see
a few thousand missing.
}
}
6. kill -9 the default broker
7. wait for the receiver to terminate.
8.
{
look at the receiver output to see out-of-order messages
like this:
52474
52473
91590 <-- yikes!
52472
52471
}
NOTA BENE
{
When I made this change to cpp/src/tests/receiver.cpp, at gsim's
suggestion, the lost-message problem went away; presumably this is because
messageWindow(1) limits the subscription to one unaccepted message in flight,
so there is almost nothing prefetched to lose when the failover happens.
change
{
void Receiver::execute(AsyncSession& session, bool isRetry)
{
    std::cout << "resuming_receiver " << (isRetry ? "first " : "re-") << "connect." << endl;
    if ( ! done ) {
        SubscriptionManager subs(session);
        /* ADD THIS! */ settings.flowControl = FlowControl::messageWindow(1);
        subscription = subs.subscribe(*this, queue, settings);
        subs.run();
        if (settings.autoAck) {
            subscription.accept(subscription.getUnaccepted());
        }
    }
}
}
}
The feature has been implemented; no message loss is seen after a slight modification of the receiver and stopping + killing the broker. The holding of messages when qpidd is stopped is addressed in another BZ (490931).

Validated on RHEL 5.2 / 5.3 i386 / x86_64 with the following packages:

[root@hp-dl385-01 bz461932]# rpm -qa | egrep '(qpid|rhm|openai)'
rhm-docs-0.5.753238-1.el5
qpidc-rdma-0.5.752581-1.el5
qpidd-xml-0.5.752581-1.el5
qpidc-perftest-0.5.752581-1.el5
openais-0.80.3-22.el5_3.3
qpid-java-common-0.5.751061-1.el5
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-1.el5
qpidc-ssl-0.5.752581-1.el5
qpidd-ssl-0.5.752581-1.el5
qpidd-acl-0.5.752581-1.el5
rhm-0.5.3153-1.el5
qpidc-devel-0.5.752581-1.el5
qpidd-devel-0.5.752581-1.el5
qpid-java-client-0.5.751061-1.el5
qpidd-0.5.752581-1.el5
qpidd-rdma-0.5.752581-1.el5
qpidd-cluster-0.5.752581-1.el5

-> VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html