Bug 543364
| Field | Value |
|---|---|
| Summary: | openais aisexec aborts when MRG 1.2 candidate ran on it |
| Product: | Red Hat Enterprise Linux 5 |
| Component: | openais |
| Version: | 5.4.z |
| Status: | CLOSED DUPLICATE |
| Severity: | high |
| Priority: | high |
| Reporter: | Frantisek Reznicek <freznice> |
| Assignee: | Steven Dake <sdake> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | cluster-maint, edamato, esammons |
| Target Milestone: | rc |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2009-12-03 17:14:44 UTC |
Description
Frantisek Reznicek, 2009-12-02 09:31:15 UTC
Steps to Reproduce:

1. Set up openais (see attached configuration).
2. `service openais start`
3. Set up qpidd (see attached configuration).
4. `service qpidd start`
5. Do the same on two or more machines, performing step 4 on all machines at almost the same moment.
6. Run the following loop on every machine just after step 4:
   ```
   while true; do qpid-cluster ; ps auxw | grep -E '(perftest|qpid|aisex)' | grep -v grep ; qpid-stat -b; done
   ```
7. Wait a couple of minutes; if nothing happens, run this on one of the machines:
   ```
   perftest --user guest --password guest --count 10000 -s
   ```
8. Wait first for the cluster to stop responding, and then for the aisexec abort.

Created attachment 375731 [details]
openais_abort_091203 logfiles and backtrace

I believe the problem occurs when the LAN switch does not propagate multicast packets properly. On RHTS machines I was able to work around the issue by changing the multicast address:

```
-mcastaddr: 226.94.1.1
+mcastaddr: 225.0.0.12
```

but on our BRQ lab network I still see the problem even when using the 225.0.0.12 multicast IP target. The lab switch hardware is a Cisco Catalyst 3560. It might be the issue described here: http://kbase.redhat.com/faq/docs/DOC-5933, but before I ask for that to be resolved I need to be sure it is the cause.

So I reproduced the abort (case A) with 225.0.0.12; data attached. In the openais log I can see a lot of retransmits.

The node which aborted:

```
...
Dec 2 9:47:53.360784 [TOTEM] Retransmit List: 81 83
Dec 2 9:47:53.550812 [TOTEM] Retransmit List: 81 83
Dec 2 9:47:53.739831 [TOTEM] Retransmit List: 81 83
...
Dec 2 9:48:11.089077 [TOTEM] Retransmit List: bb
Dec 2 9:48:11.089117 [TOTEM] FAILED TO RECEIVE
```

The other node:

```
Dec 2 9:48:46.417519 [CPG  ] got joinlist message from node 924918282
Dec 2 9:48:47.400935 [TOTEM] FAILED TO RECEIVE
Dec 2 9:48:47.401003 [TOTEM] entering GATHER state from 6.
Dec 2 9:48:47.648053 [TOTEM] FAILED TO RECEIVE
Dec 2 9:48:47.648138 [TOTEM] entering GATHER state from 6.
Dec 2 9:48:47.895028 [TOTEM] FAILED TO RECEIVE
Dec 2 9:48:47.895109 [TOTEM] entering GATHER state from 6.
Dec 2 9:48:48.142030 [TOTEM] FAILED TO RECEIVE
Dec 2 9:48:48.142110 [TOTEM] entering GATHER state from 6.
Dec 2 9:48:48.389045 [TOTEM] FAILED TO RECEIVE
Dec 2 9:48:48.389123 [TOTEM] entering GATHER state from 6.
Dec 2 9:48:48.442623 [TOTEM] Creating commit token because I am the rep.
Dec 2 9:48:48.442672 [TOTEM] Saving state aru 62c1 high seq received 62c1
Dec 2 9:48:48.442716 [TOTEM] Storing new sequence id for ring 5378
Dec 2 9:48:48.442811 [TOTEM] entering COMMIT state.
Dec 2 9:48:48.442986 [TOTEM] entering RECOVERY state.
Dec 2 9:48:48.443043 [TOTEM] position [0] member 10.34.33.54:
Dec 2 9:48:48.443065 [TOTEM] previous ring seq 21364 rep 10.34.33.54
Dec 2 9:48:48.443086 [TOTEM] aru 62c1 high delivered 62c1 received flag 1
Dec 2 9:48:48.443105 [TOTEM] position [1] member 10.34.33.55:
Dec 2 9:48:48.443129 [TOTEM] previous ring seq 21364 rep 10.34.33.54
Dec 2 9:48:48.443151 [TOTEM] aru 6203 high delivered 6203 received flag 0
Dec 2 9:48:48.443183 [TOTEM] copying all old ring messages from 6204-62c1.
Dec 2 9:48:48.443813 [TOTEM] Originated 190 messages in RECOVERY.
Dec 2 9:48:48.443836 [TOTEM] Originated for recovery: 6204 6205 6206 6207 6208 6209 620a 620b 620c 620d 620e 620f 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 621a 621b 621c 621d 621e 621f 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 622a 622b 622c 622d 622e 622f 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 623a 623b 623c 623d 623e 623f 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 624a 624b 624c 624d 624e 624f 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 625a 625b 625c 625d 625e 625f 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 626a 626b 626c 626d 626e 626f 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 627a 627b 627c 627d 627e 627f 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 628a 628b 628c 628d 628e 628f 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 629a 629b 629c 629d 629e 629f 62a0 62a1 62a2 62a3 62a4 62a5 62a6 62a7 62a8 62a9 62aa 62ab 62ac 62ad 62ae 62af 62b0 62b1 62b2 62b3 62b4 62b5 62b6 62b7 62b8 62b9 62ba 62bb 62bc 62bd 62be 62bf 62c0 62c1
```

See details in the attached package above. We're currently trying to determine multicast traffic using tcpdump.

This is a duplicate of #490856.

*** This bug has been marked as a duplicate of bug 490856 ***
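The symptoms above (a retransmit storm followed by `FAILED TO RECEIVE`) can be spotted quickly by grepping the openais log. Below is a minimal sketch of such a check; it generates its own small sample log so it is self-contained, and the log path on a real node (commonly `/var/log/openais.log`) is an assumption, not something stated in this report.

```shell
#!/bin/sh
# Sketch: count the two TOTEM symptoms seen in this bug in an openais log.
# A sample log is created here so the sketch runs standalone; on a real
# node you would point LOG at the actual openais log file (path assumed).

LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Dec 2 9:47:53.360784 [TOTEM] Retransmit List: 81 83
Dec 2 9:47:53.550812 [TOTEM] Retransmit List: 81 83
Dec 2 9:48:11.089077 [TOTEM] Retransmit List: bb
Dec 2 9:48:11.089117 [TOTEM] FAILED TO RECEIVE
EOF

# Retransmit storms suggest multicast loss; FAILED TO RECEIVE precedes
# the GATHER/RECOVERY cycle and, in this bug, the aisexec abort.
retrans=$(grep -c 'Retransmit List' "$LOG")
failed=$(grep -c 'FAILED TO RECEIVE' "$LOG")
echo "retransmit events: $retrans"
echo "failed-to-receive events: $failed"
rm -f "$LOG"
```

On a live cluster, the multicast path itself can be observed on each node with something like `tcpdump -n -i eth0 host 225.0.0.12` (interface name assumed), matching the tcpdump investigation the reporter mentions; a node that never sees the group's packets points at the switch rather than at openais.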