Bug 543364

Summary: openais aisexec aborts when MRG 1.2 candidate ran on it
Product: Red Hat Enterprise Linux 5 Reporter: Frantisek Reznicek <freznice>
Component: openaisAssignee: Steven Dake <sdake>
Status: CLOSED DUPLICATE QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 5.4.zCC: cluster-maint, edamato, esammons
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-12-03 17:14:44 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
openais and qpidd configuration and logs
none
openais_abort_091203 logfiles and backtrace none

Description Frantisek Reznicek 2009-12-02 09:31:15 UTC
Created attachment 375375 [details]
openais and qpidd configuration and logs

Description of problem:

The openais daemon aisexec rarely aborts when MRG 1.2 candidate messaging system ran on it.

The test is repeating qpid-cluster, qpid-stat -b and optionaly perftest -s commands.
After some time each MRG node stops to see each other and in some more time MRG cluster stops responding. While running above commands in the loop for some couple of minutes I detected that one MRG node shutdowns with:

  2009-dec-01 09:09:00 debug DeliveryRecord::setEnded() id=9
  2009-dec-01 09:09:00 debug Credit available for 'rdest' on 0x138fb2b8 bytes:   4294967295 msgs: 4294967295
  2009-dec-01 09:09:00 debug Credit allocated for 'rdest' on 0x138fb2b8, was    bytes: 4294967295 msgs: 4294967295 now bytes: 4294967295 msgs: 4294967295
  2009-dec-01 09:09:00 debug DeliveryRecord::setEnded() id=10
  2009-dec-01 09:09:00 debug Exception constructed: Error in CPG dispatch:  library (2)
  2009-dec-01 09:09:00 critical Error in cluster dispatch: Error in CPG dispatch: library (2)
  2009-dec-01 09:09:00 notice 10.34.33.55:11837(LEFT) leaving cluster xxxy

Problem appeared on RHEL 5.4 x86_64 when in cluster with another RHEL 5.4 i386, both updated to latest packages as of 2009-12-01.

After analyzing what happened I found generated aisexec core files...

aisexec abort case A:
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
...
Loaded symbols for /usr/libexec/lcrso/service_confdb.lcrso
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/aisexec'.
Program terminated with signal 6, Aborted.
[New process 11239]
[New process 11317]
[New process 11242]
[New process 11240]
#0  0x0000003238630265 in raise () from /lib64/libc.so.6
(gdb)   4 process 11240  0x00000032386cae46 in poll () from /lib64/libc.so.6
  3 process 11242  0x000000323867a974 in memset () from /lib64/libc.so.6
  2 process 11317  0x00000032386d4ff7 in semop () from /lib64/libc.so.6
* 1 process 11239  0x0000003238630265 in raise () from /lib64/libc.so.6
(gdb) 
Thread 4 (process 11240):
#0  0x00000032386cae46 in poll () from /lib64/libc.so.6
#1  0x000000000041cb95 in prioritized_timer_thread (data=<value optimized out>)
    at timer.c:125
#2  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 3 (process 11242):
#0  0x000000323867a974 in memset () from /lib64/libc.so.6
#1  0x0000003b418085b8 in ?? () from /lib64/libgcc_s.so.1
#2  0x0000003b41808d39 in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#3  0x000000323920ce10 in __pthread_unwind () from /lib64/libpthread.so.0
#4  0x00000032392053f0 in sigcancel_handler () from /lib64/libpthread.so.0
#5  <signal handler called>
#6  0x000000323920ab99 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#7  0x0000000000417d0f in worker_thread (thread_data_in=<value optimized out>)
    at wthread.c:73
#8  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 11317):
#0  0x00000032386d4ff7 in semop () from /lib64/libc.so.6
#1  0x000000000041b79b in pthread_ipc_consumer (conn=0x19904dc0) at ipc.c:320
#2  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 11239):
#0  0x0000003238630265 in raise () from /lib64/libc.so.6
#1  0x0000003238631d10 in abort () from /lib64/libc.so.6
#2  0x00000032386296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000041200c in message_handler_orf_token (instance=0x2aaaaaaad010, 
    msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at ../include/sq.h:171
#4  0x0000000000409f43 in rrp_deliver_fn (context=0x198e5bc0, msg=0x198e6284, 
    msg_len=70) at totemrrp.c:1308
#5  0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, data=0x198e5c00)
    at totemnet.c:695
#6  0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402
#7  0x0000000000418834 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:620
(gdb) quit

aisexec abort case B:
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
...
Loaded symbols for /usr/libexec/lcrso/service_confdb.lcrso
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/aisexec'.
Program terminated with signal 6, Aborted.
[New process 11529]
[New process 11532]
[New process 11530]
#0  0x0000003238630265 in raise () from /lib64/libc.so.6
(gdb)   3 process 11530  0x00000032386cae46 in poll () from /lib64/libc.so.6
  2 process 11532  0x0000003b4180a6e7 in ?? () from /lib64/libgcc_s.so.1
* 1 process 11529  0x0000003238630265 in raise () from /lib64/libc.so.6
(gdb) 
Thread 3 (process 11530):
#0  0x00000032386cae46 in poll () from /lib64/libc.so.6
#1  0x000000000041cb95 in prioritized_timer_thread (data=<value optimized out>)
    at timer.c:125
#2  0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 11532):
#0  0x0000003b4180a6e7 in ?? () from /lib64/libgcc_s.so.1
#1  0x0000003238708567 in dl_iterate_phdr () from /lib64/libc.so.6
#2  0x0000003b4180a626 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#3  0x0000003b418075b5 in ?? () from /lib64/libgcc_s.so.1
#4  0x0000003b418085db in ?? () from /lib64/libgcc_s.so.1
#5  0x0000003b41808d39 in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#6  0x000000323920ce10 in __pthread_unwind () from /lib64/libpthread.so.0
#7  0x00000032392053f0 in sigcancel_handler () from /lib64/libpthread.so.0
#8  <signal handler called>
#9  0x000000323920ab99 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#10 0x0000000000417d0f in worker_thread (thread_data_in=<value optimized out>)
    at wthread.c:73
#11 0x00000032392064a7 in start_thread () from /lib64/libpthread.so.0
#12 0x00000032386d3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 11529):
#0  0x0000003238630265 in raise () from /lib64/libc.so.6
#1  0x0000003238631d10 in abort () from /lib64/libc.so.6
#2  0x00000032386296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000040d867 in memb_state_gather_enter (instance=0x2aaaaaaad010, 
    gather_from=12) at totemsrp.c:1691
#4  0x000000000040e4b1 in message_handler_memb_join (instance=0x2aaaaaaad010, 
    msg=0x1c706284, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3971
#5  0x0000000000409f6e in rrp_deliver_fn (context=0x1c705bc0, msg=0x1c706284, 
    msg_len=112) at totemrrp.c:1319
#6  0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, data=0x1c705c00)
    at totemnet.c:695
#7  0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402
#8  0x0000000000418834 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:620
(gdb) quit


Version-Release number of selected component (if applicable):
openais-0.80.6-8.el5_4.1
openais-0.80.6-8.el5_4.1
qpidd-0.5.752581-34.el5
qpidc-perftest-0.5.752581-34.el5
python-qpid-0.5.758389-2.el5

How reproducible:
very rarely

Steps to Reproduce:
1.
2.
3.
  
Actual results:
openais aborts

Expected results:
openais should continue working...

Additional info:


openais config
# Please read the openais.conf.5 manual page

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
bindnetaddr: 10.34.32.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
#mcastaddr: 226.94.11.12
#mcastport: 11112
        }
}

logging {
debug: on
timestamp: on
to_file: yes
logfile: /tmp/openais.log
}

amf {
        mode: disabled
}


openias log, qpidd log:
see attached logfile

qpidd config
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Configuration file for qpidd. Entries are of the form:
#   name=value
#
# (Note: no spaces on either side of '='). Using default settings:
# "qpidd --help" or "man qpidd" for more details.
cluster-mechanism=ANONYMOUS
cluster-name=xxxy
cluster-cman=no
log-to-file=/tmp/qpidd.log
log-enable=debug+

Comment 1 Frantisek Reznicek 2009-12-02 09:36:57 UTC
Steps to Reproduce:
1. set-up openais (see attached configuration)
2. service openais start
3. set-up qpidd (see attached configuration)
4. service qpidd start
5. do the same on two or more machines
   perform step 4. on all machines almost in one moment
6. run following loop on every machine just after steps 4.
   while true; do qpid-cluster ; ps auxw | grep -E '(perftest|qpid|aisex)' | grep -v grep ;qpid-stat -b; done
7. wait couple of minutes, if nothing happens run on one of the machines this:
   perftest --user guest --password guest --count 10000 -s
8. wait first for non responding cluster and then for aisexec abort

Comment 2 Frantisek Reznicek 2009-12-03 11:15:26 UTC
Created attachment 375731 [details]
openais_abort_091203 logfiles and backtrace

I believe the problem occurs when lan switch does not propagate multicast packets proprly. On RHTS machines I was able to solve the issue by
-mcastaddr: 226.94.1.1
+mcastaddr: 225.0.0.12

but on our BRQ lab network I still see problem even when using 225.0.0.12 multicast ip target. The lab switch HW is Cisco Catalyst 3560
It might be following issue described here:
 http://kbase.redhat.com/faq/docs/DOC-5933
but befor I ask for resolving I need to be sure it is it.

So I reproduced the abort case A with 225.0.0.12, data attached

In the openais log I can see a lot of retransmits:
the node which aborted...
...
Dec  2  9:47:53.360784 [TOTEM] Retransmit List: 81 83
Dec  2  9:47:53.550812 [TOTEM] Retransmit List: 81 83
Dec  2  9:47:53.739831 [TOTEM] Retransmit List: 81 83
...
Dec  2  9:48:11.089077 [TOTEM] Retransmit List: bb
Dec  2  9:48:11.089117 [TOTEM] FAILED TO RECEIVE


the other node...
Dec  2  9:48:46.417519 [CPG  ] got joinlist message from node 924918282
Dec  2  9:48:47.400935 [TOTEM] FAILED TO RECEIVE
Dec  2  9:48:47.401003 [TOTEM] entering GATHER state from 6.
Dec  2  9:48:47.648053 [TOTEM] FAILED TO RECEIVE
Dec  2  9:48:47.648138 [TOTEM] entering GATHER state from 6.
Dec  2  9:48:47.895028 [TOTEM] FAILED TO RECEIVE
Dec  2  9:48:47.895109 [TOTEM] entering GATHER state from 6.
Dec  2  9:48:48.142030 [TOTEM] FAILED TO RECEIVE
Dec  2  9:48:48.142110 [TOTEM] entering GATHER state from 6.
Dec  2  9:48:48.389045 [TOTEM] FAILED TO RECEIVE
Dec  2  9:48:48.389123 [TOTEM] entering GATHER state from 6.
Dec  2  9:48:48.442623 [TOTEM] Creating commit token because I am the rep.
Dec  2  9:48:48.442672 [TOTEM] Saving state aru 62c1 high seq received 62c1
Dec  2  9:48:48.442716 [TOTEM] Storing new sequence id for ring 5378
Dec  2  9:48:48.442811 [TOTEM] entering COMMIT state.
Dec  2  9:48:48.442986 [TOTEM] entering RECOVERY state.
Dec  2  9:48:48.443043 [TOTEM] position [0] member 10.34.33.54:
Dec  2  9:48:48.443065 [TOTEM] previous ring seq 21364 rep 10.34.33.54
Dec  2  9:48:48.443086 [TOTEM] aru 62c1 high delivered 62c1 received flag 1
Dec  2  9:48:48.443105 [TOTEM] position [1] member 10.34.33.55:
Dec  2  9:48:48.443129 [TOTEM] previous ring seq 21364 rep 10.34.33.54
Dec  2  9:48:48.443151 [TOTEM] aru 6203 high delivered 6203 received flag 0
Dec  2  9:48:48.443183 [TOTEM] copying all old ring messages from 6204-62c1.
Dec  2  9:48:48.443813 [TOTEM] Originated 190 messages in RECOVERY.
Dec  2  9:48:48.443836 [TOTEM] Originated for recovery: 6204 6205 6206 6207 6208 6209 620a 620b 620c 620d 620e 620f 6210 6211 6212 6213 6214 6215
 6216 6217 6218 6219 621a 621b 621c 621d 621e 621f 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 622a 622b 622c 622d 622e 622f 6230 6231 6232
 6233 6234 6235 6236 6237 6238 6239 623a 623b 623c 623d 623e 623f 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 624a 624b 624c 624d 624e 624f
 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 625a 625b 625c 625d 625e 625f 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 626a 626b 626c
 626d 626e 626f 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 627a 627b 627c 627d 627e 627f 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289
 628a 628b 628c 628d 628e 628f 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 629a 629b 629c 629d 629e 629f 62a0 62a1 62a2 62a3 62a4 62a5 62a6
 62a7 62a8 62a9 62aa 62ab 62ac 62ad 62ae 62af 62b0 62b1 62b2 62b3 62b4 62b5 62b6 62b7 62b8 62b9 62ba 62bb 62bc 62bd 62be 62bf 62c0 62c1

see details in the above attached package...


We're currently trying to determine multicast traffic using tcpdump...

Comment 3 Steven Dake 2009-12-03 17:14:23 UTC
This is a duplicate of #490856.

Comment 4 Steven Dake 2009-12-03 17:14:44 UTC

*** This bug has been marked as a duplicate of bug 490856 ***

Comment 5 Steven Dake 2009-12-21 17:07:21 UTC
Please note aisexec abort case B is a separate bugzilla.  I didn't catch this originally.  Please file separate bugzillas for different abort cases in the future if possible.  That case is a dupe of bz#547828.