Hide Forgot
Description of problem: Sometimes, rgmanager crashes on startup: Sep 27 07:10:13 marathon-01 rgmanager[5525]: I am node #1 Sep 27 07:10:13 marathon-01 rgmanager[5525]: Resource Group Manager Starting Sep 27 07:10:13 marathon-01 rgmanager[5525]: Loading Service Data Sep 27 07:10:15 marathon-01 rgmanager[5525]: Initializing Services Sep 27 07:10:15 marathon-01 rgmanager[6421]: [clusterfs] /dev/dm-3 is not mounted Sep 27 07:10:15 marathon-01 rgmanager[6455]: [nfsclient] /mnt/ext3 is not a directory Sep 27 07:10:15 marathon-01 rgmanager[5525]: stop on nfsclient "*" returned 2 (invalid argument(s)) Sep 27 07:10:15 marathon-01 rgmanager[6491]: [nfsserver] Stopping NFS daemons Sep 27 07:10:16 marathon-01 rgmanager[6598]: [nfsserver] Stopping rpc.statd Sep 27 07:10:16 marathon-01 rgmanager[6728]: [fs] /dev/dm-4 is not mounted Sep 27 07:10:16 marathon-01 rgmanager[5525]: Services Initialized Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: Local UP Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-02 UP Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-03 UP Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-04 UP Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-05 UP Sep 27 07:10:21 marathon-01 abrt[6756]: saved core dump of pid 5525 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-09-27-07:10:21-5525.new/coredump (44290048 bytes) Version-Release number of selected component (if applicable): rgmanager-3.0.12.1-4.el6.x86_64 cman-3.0.12.1-22.el6.x86_64 dbus-1.2.24-6.el6.x86_64 dbus-libs-1.2.24-6.el6.x86_64 How reproducible: 20% Steps to Reproduce: 1. cluster-wide rgmanager start with service defined 2. 3. Actual results: sometimes one of the nodes crashes Expected results: no rgmanager crash Additional info: Core was generated by `rgmanager'. Program terminated with signal 6, Aborted. #0 0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig); Missing separate debuginfos, use: debuginfo-install libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-3.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64 (gdb) thread apply all bt Thread 6 (Thread 0x7ff2fd680700 (LWP 5526)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 #1 0x00000038b5e01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12.1/common/liblogthread/liblogthread.c:83 #2 0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fd680700) at pthread_create.c:301 #3 0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 5 (Thread 0x7ff2fcc7f700 (LWP 6751)): #0 0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x00000000004215ee in poll_cluster_messages (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:186 #2 cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:967 #3 0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fcc7f700) at pthread_create.c:301 #4 0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 4 (Thread 0x7ff2fe68b700 (LWP 6753)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216 #1 0x00000000004116b1 in _event_thread_f (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_event.c:393 #2 0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fe68b700) at pthread_create.c:301 #3 0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 3 (Thread 0x7ff2fe71e700 (LWP 5525)): #0 0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000000000409e3a in event_loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:695 #2 main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:1152 Thread 2 (Thread 0x7ff2f7fff700 (LWP 6752)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216 #1 0x00000000004202c7 in cluster_msg_wait (ctx=0xf74a10, timeout=3) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:270 #2 0x000000000041ef44 in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:801 #3 vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:871 #4 0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f7fff700) at pthread_create.c:301 #5 0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 1 (Thread 0x7ff2f75fe700 (LWP 6755)): #0 0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007ff2fd8b9065 in abort () at abort.c:92 #2 0x00007ff2fde58975 in _dbus_abort () at dbus-sysdeps.c:88 #3 0x00007ff2fde54845 in _dbus_warn_check_failed ( format=0x7ff2fde5d388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283 #4 0x00007ff2fde3ec62 in _dbus_connection_read_write_dispatch (connection=0xf67390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512 #5 0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130 #6 0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f75fe700) at pthread_create.c:301 #7 0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Created attachment 525112 [details] core file
relevant rgmanager snip from cluster.conf: <rm> <failoverdomains> <failoverdomain name="marathon_domain" ordered="0" restricted="0"> <failoverdomainnode name="marathon-01" priority="1"/> <failoverdomainnode name="marathon-02" priority="1"/> <failoverdomainnode name="marathon-03" priority="1"/> <failoverdomainnode name="marathon-04" priority="1"/> <failoverdomainnode name="marathon-05" priority="1"/> </failoverdomain> </failoverdomains> <resources> <ip address="10.15.89.211" monitor_link="1"/> <clusterfs device="/dev/marathon/marathon0" force_unmount="1" fsid="9039" fstype="gfs" mountpoint="/mnt/gfs1" name="marathon0" options="" self_fence="1"/> <fs device="/dev/marathon/marathon1" force_fsck="0" force_unmount="1" fsid="7083" fstype="ext3" mountpoint="/mnt/ext3" name="marathon1" options="" self_fence="1"/> <nfsserver name="marathon nfs server"/> <nfsclient name="*" options="rw" target="*"/> </resources> <service autostart="1" domain="marathon_domain" name="nfs_service"> <clusterfs ref="marathon0"> <nfsserver ref="marathon nfs server"> <nfsclient ref="*"/> </nfsserver> </clusterfs> <fs ref="marathon1"> <nfsserver ref="marathon nfs server"> <nfsclient ref="*"/> </nfsserver> </fs> <ip ref="10.15.89.211"/> </service> </rm>
This is odd; I did a lot of testing but it was with the .4 release of dbus. I'll have to see if there's anything new here.
This is always fun: [Thread debugging using libthread_db enabled] ../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Quit this debugging session? (y or n) EOF [assumed Y] ../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Create a core file of GDB? (y or n)
Core was generated by `rgmanager'. Program terminated with signal 6, Aborted. #0 0x0000003239e32885 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-23.el6.x86_64 corosynclib-1.4.1-4.el6.x86_64 glibc-2.12-1.44.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-4.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64 (gdb) bt #0 0x0000003239e32885 in raise () from /lib64/libc.so.6 #1 0x0000003239e34065 in abort () from /lib64/libc.so.6 #2 0x000000323ba2a975 in _dbus_abort () at dbus-sysdeps.c:88 #3 0x000000323ba26845 in _dbus_warn_check_failed ( format=0x323ba2f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283 #4 0x000000323ba10c62 in _dbus_connection_read_write_dispatch (connection=0x1cbc390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512 #5 0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130 #6 0x000000323a2077f1 in start_thread () from /lib64/libpthread.so.0 #7 0x0000003239ee570d in clone () from /lib64/libc.so.6 Oct 14 06:13:45 marathon-02 rgmanager[17874]: I am node #2 Oct 14 06:13:45 marathon-02 rgmanager[17874]: Resource Group Manager Starting Oct 14 06:13:45 marathon-02 rgmanager[17874]: Loading Service Data Oct 14 06:13:46 marathon-02 rgmanager[17874]: Initializing Services Oct 14 06:13:46 marathon-02 rgmanager[18769]: [clusterfs] /dev/dm-4 is not mounted Oct 14 06:13:46 marathon-02 rgmanager[18803]: [nfsclient] Removing export: *:/mnt/ext3 Oct 14 06:13:47 marathon-02 rgmanager[18840]: [nfsserver] Stopping NFS daemons Oct 14 06:13:47 marathon-02 rgmanager[18947]: [nfsserver] Stopping rpc.statd Oct 14 06:13:47 marathon-02 rgmanager[19077]: [fs] /dev/dm-3 is not mounted Oct 14 06:13:47 marathon-02 rgmanager[17874]: Services Initialized Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: Local UP Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-03 UP Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-01 UP Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-04 UP Oct 14 06:13:52 marathon-02 abrt[19105]: saved core dump of pid 17874 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-10-14-06:13:52-17874.new/coredump (44285952 bytes) $ rpm -qva | grep -E 'rgmanager|dbus' | sort dbus-1.2.24-5.el6_1.x86_64 dbus-debuginfo-1.2.24-5.el6_1.x86_64 dbus-glib-0.86-5.el6.x86_64 dbus-libs-1.2.24-5.el6_1.x86_64 dbus-python-0.83.0-6.1.el6.x86_64 eggdbus-0.6-3.el6.x86_64 rgmanager-3.0.12.1-4.el6.x86_64 rgmanager-debuginfo-3.0.12.1-4.el6.x86_64 On 17th iteration out of 500 planned 3 of 5 nodes were down and the game was over :). The error is still the same and the syslog messages are also like pasted above. Reproducer: ITER=500 NODES="marathon-01 marathon-02 marathon-03 marathon-04 marathon-05" for i in `seq 1 $ITER`; do echo "Starting iteration $i"; for node in $NODES; do qarsh root@$node service rgmanager start & done echo "Waiting for all nodes to start..." wait echo "sleeping 10s" sleep 10 for node in $NODES; do qarsh root@$node service rgmanager stop & done echo "Waiting for all nodes to stop..." wait done
Created attachment 528200 [details] 3 cores from the crashed nodes
This appears to be a concurrency issue with dbus -- that is, signals are perhaps not being handled as well as they could be in libdbus. This is why it only reproduces sometimes and only on startup/shutdown. Most of the the dbus calls in rgmanager are made with signals blocked. This is because the dbus signals are emitted when we run a callback on committed data (e.g. rgmanager service state changes). Unfortunately, the _dbus_auto_flush thread rgmanager -did not- have signals blocked, causing the SIGTERM to hit that thread while in dbus_connection_read_write(). It is my suspicion that depending on where we are in this function when the signal arrives, we either get a direct segfault, or an extra dbus_connection_unref() call - causing the SIGABRT and backtraces above. Blocking signals in the _dbus_auto_flush thread should consequently fix this. My first patch hit 44 iterations with no issues; I am testing a second (far simpler) patch currently.
Turns out the first patch was more robust; I still get segfaults on the second patch. Restarting testing with first patch.
I retested the patches over 55 iterations, so 55+44 = 99, with no segfaults and without dbus raising SIGABRT. However, the dlm ended up sticking on one of the hosts: INFO: task rgmanager:7665 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. rgmanager D 0000000000000002 0 7665 1 0x00000080 ffff88001bbb3c70 0000000000000086 0000000000000000 ffff88001bbb3c34 ffffffff00000000 ffff88001ec23480 ffff880002296000 000000000000027e ffff88001db22678 ffff88001bbb3fd8 000000000000f508 ffff88001db22678 Call Trace: [<ffffffff814ee9f5>] rwsem_down_failed_common+0x95/0x1d0 [<ffffffff814eeb86>] rwsem_down_read_failed+0x26/0x30 [<ffffffff81276b14>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffff814ee084>] ? down_read+0x24/0x30 [<ffffffffa031e1b7>] dlm_user_request+0x47/0x240 [dlm] [<ffffffff8113c2f4>] ? handle_mm_fault+0x1e4/0x2b0 [<ffffffff8115f19c>] ? __kmalloc+0x20c/0x220 [<ffffffffa032bc36>] device_write+0x5f6/0x7d0 [dlm] [<ffffffff81095e22>] ? hrtimer_cancel+0x22/0x30 [<ffffffff814edf63>] ? do_nanosleep+0x93/0xc0 [<ffffffff8120c0d6>] ? security_file_permission+0x16/0x20 [<ffffffff81176388>] vfs_write+0xb8/0x1a0 [<ffffffff810d4612>] ? audit_syscall_entry+0x272/0x2a0 [<ffffffff81176d91>] sys_write+0x51/0x90 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b This does not appear to be a bug in rgmanager.
https://www.redhat.com/archives/cluster-devel/2011-October/msg00064.html https://www.redhat.com/archives/cluster-devel/2011-October/msg00065.html In addition to the above testing results, a Coverity scan before and after did not reveal any new defects introduced by these patches.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1595.html