Bug 741607

Summary: rgmanager sometimes crashes on startup
Product: Red Hat Enterprise Linux 6
Component: rgmanager
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jaroslav Kortus <jkortus>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Target Milestone: rc
Fixed In Version: rgmanager-3.0.12.1-5.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 11:59:57 UTC
Bug Blocks: 748554
Attachments:
  core file
  3 cores from the crashed nodes

Description Jaroslav Kortus 2011-09-27 12:32:38 UTC
Description of problem:
Sometimes, rgmanager crashes on startup:
Sep 27 07:10:13 marathon-01 rgmanager[5525]: I am node #1
Sep 27 07:10:13 marathon-01 rgmanager[5525]: Resource Group Manager Starting
Sep 27 07:10:13 marathon-01 rgmanager[5525]: Loading Service Data
Sep 27 07:10:15 marathon-01 rgmanager[5525]: Initializing Services
Sep 27 07:10:15 marathon-01 rgmanager[6421]: [clusterfs] /dev/dm-3 is not mounted
Sep 27 07:10:15 marathon-01 rgmanager[6455]: [nfsclient] /mnt/ext3 is not a directory
Sep 27 07:10:15 marathon-01 rgmanager[5525]: stop on nfsclient "*" returned 2 (invalid argument(s))
Sep 27 07:10:15 marathon-01 rgmanager[6491]: [nfsserver] Stopping NFS daemons
Sep 27 07:10:16 marathon-01 rgmanager[6598]: [nfsserver] Stopping rpc.statd
Sep 27 07:10:16 marathon-01 rgmanager[6728]: [fs] /dev/dm-4 is not mounted
Sep 27 07:10:16 marathon-01 rgmanager[5525]: Services Initialized
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: Local UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-02 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-03 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-04 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-05 UP
Sep 27 07:10:21 marathon-01 abrt[6756]: saved core dump of pid 5525 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-09-27-07:10:21-5525.new/coredump (44290048 bytes)


Version-Release number of selected component (if applicable):
rgmanager-3.0.12.1-4.el6.x86_64
cman-3.0.12.1-22.el6.x86_64
dbus-1.2.24-6.el6.x86_64
dbus-libs-1.2.24-6.el6.x86_64


How reproducible:
20%

Steps to Reproduce:
1. Start rgmanager cluster-wide with a service defined.
Actual results:
rgmanager sometimes crashes on one of the nodes during startup

Expected results:
no rgmanager crash

Additional info:
Core was generated by `rgmanager'.
Program terminated with signal 6, Aborted.
#0  0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-3.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) thread apply all bt

Thread 6 (Thread 0x7ff2fd680700 (LWP 5526)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000038b5e01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12.1/common/liblogthread/liblogthread.c:83
#2  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fd680700) at pthread_create.c:301
#3  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7ff2fcc7f700 (LWP 6751)):
#0  0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000004215ee in poll_cluster_messages (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:186
#2  cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:967
#3  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fcc7f700) at pthread_create.c:301
#4  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7ff2fe68b700 (LWP 6753)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
#1  0x00000000004116b1 in _event_thread_f (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_event.c:393
#2  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fe68b700) at pthread_create.c:301
#3  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ff2fe71e700 (LWP 5525)):
#0  0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000409e3a in event_loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:695
#2  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:1152

Thread 2 (Thread 0x7ff2f7fff700 (LWP 6752)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
#1  0x00000000004202c7 in cluster_msg_wait (ctx=0xf74a10, timeout=3) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:270
#2  0x000000000041ef44 in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:801
#3  vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:871
#4  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f7fff700) at pthread_create.c:301
#5  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ff2f75fe700 (LWP 6755)):
#0  0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ff2fd8b9065 in abort () at abort.c:92
#2  0x00007ff2fde58975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x00007ff2fde54845 in _dbus_warn_check_failed (
    format=0x7ff2fde5d388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x00007ff2fde3ec62 in _dbus_connection_read_write_dispatch (connection=0xf67390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130
#6  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f75fe700) at pthread_create.c:301
#7  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 1 Jaroslav Kortus 2011-09-27 12:35:31 UTC
Created attachment 525112 [details]
core file

Comment 2 Jaroslav Kortus 2011-09-27 12:37:38 UTC
relevant rgmanager snip from cluster.conf:
	<rm>
		<failoverdomains>
			<failoverdomain name="marathon_domain" ordered="0" restricted="0">
				<failoverdomainnode name="marathon-01" priority="1"/>
				<failoverdomainnode name="marathon-02" priority="1"/>
				<failoverdomainnode name="marathon-03" priority="1"/>
				<failoverdomainnode name="marathon-04" priority="1"/>
				<failoverdomainnode name="marathon-05" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="10.15.89.211" monitor_link="1"/>
			<clusterfs device="/dev/marathon/marathon0" force_unmount="1" fsid="9039" fstype="gfs" mountpoint="/mnt/gfs1" name="marathon0" options="" self_fence="1"/>
			<fs device="/dev/marathon/marathon1" force_fsck="0" force_unmount="1" fsid="7083" fstype="ext3" mountpoint="/mnt/ext3" name="marathon1" options="" self_fence="1"/>
			<nfsserver name="marathon nfs server"/>
			<nfsclient name="*" options="rw" target="*"/>
		</resources>
		<service autostart="1" domain="marathon_domain" name="nfs_service">
			<clusterfs ref="marathon0">
				<nfsserver ref="marathon nfs server">
					<nfsclient ref="*"/>
				</nfsserver>
			</clusterfs>
			<fs ref="marathon1">
				<nfsserver ref="marathon nfs server">
					<nfsclient ref="*"/>
				</nfsserver>
			</fs>
			<ip ref="10.15.89.211"/>
		</service>
	</rm>

Comment 5 Lon Hohberger 2011-09-28 14:24:09 UTC
This is odd; I did a lot of testing but it was with the .4 release of dbus.  I'll have to see if there's anything new here.

Comment 7 Lon Hohberger 2011-10-11 23:12:11 UTC
This is always fun:

[Thread debugging using libthread_db enabled]
../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) EOF [assumed Y]
../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n)

Comment 8 Jaroslav Kortus 2011-10-14 12:31:47 UTC
Core was generated by `rgmanager'.
Program terminated with signal 6, Aborted.
#0  0x0000003239e32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-23.el6.x86_64 corosynclib-1.4.1-4.el6.x86_64 glibc-2.12-1.44.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-4.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003239e32885 in raise () from /lib64/libc.so.6
#1  0x0000003239e34065 in abort () from /lib64/libc.so.6
#2  0x000000323ba2a975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x000000323ba26845 in _dbus_warn_check_failed (
    format=0x323ba2f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s")
    at dbus-internals.c:283
#4  0x000000323ba10c62 in _dbus_connection_read_write_dispatch (connection=0x1cbc390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130
#6  0x000000323a2077f1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003239ee570d in clone () from /lib64/libc.so.6




Oct 14 06:13:45 marathon-02 rgmanager[17874]: I am node #2
Oct 14 06:13:45 marathon-02 rgmanager[17874]: Resource Group Manager Starting
Oct 14 06:13:45 marathon-02 rgmanager[17874]: Loading Service Data
Oct 14 06:13:46 marathon-02 rgmanager[17874]: Initializing Services
Oct 14 06:13:46 marathon-02 rgmanager[18769]: [clusterfs] /dev/dm-4 is not mounted
Oct 14 06:13:46 marathon-02 rgmanager[18803]: [nfsclient] Removing export: *:/mnt/ext3
Oct 14 06:13:47 marathon-02 rgmanager[18840]: [nfsserver] Stopping NFS daemons
Oct 14 06:13:47 marathon-02 rgmanager[18947]: [nfsserver] Stopping rpc.statd
Oct 14 06:13:47 marathon-02 rgmanager[19077]: [fs] /dev/dm-3 is not mounted
Oct 14 06:13:47 marathon-02 rgmanager[17874]: Services Initialized
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: Local UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-03 UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-01 UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-04 UP
Oct 14 06:13:52 marathon-02 abrt[19105]: saved core dump of pid 17874 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-10-14-06:13:52-17874.new/coredump (44285952 bytes)



$ rpm -qva | grep -E 'rgmanager|dbus' | sort
dbus-1.2.24-5.el6_1.x86_64
dbus-debuginfo-1.2.24-5.el6_1.x86_64
dbus-glib-0.86-5.el6.x86_64
dbus-libs-1.2.24-5.el6_1.x86_64
dbus-python-0.83.0-6.1.el6.x86_64
eggdbus-0.6-3.el6.x86_64
rgmanager-3.0.12.1-4.el6.x86_64
rgmanager-debuginfo-3.0.12.1-4.el6.x86_64


On the 17th of 500 planned iterations, 3 of the 5 nodes were down and the game was over :). The error is still the same, and the syslog messages match those pasted above.

Reproducer:
ITER=500
NODES="marathon-01 marathon-02 marathon-03 marathon-04 marathon-05"

for i in `seq 1 $ITER`; do
	echo "Starting iteration $i";
	for node in $NODES; do
		qarsh root@$node service rgmanager start &
	done
	echo "Waiting for all nodes to start..."
	wait
	echo "sleeping 10s"
	sleep 10
	for node in $NODES; do
		qarsh root@$node service rgmanager stop &
	done
	echo "Waiting for all nodes to stop..."
	wait
done

Comment 9 Jaroslav Kortus 2011-10-14 12:33:09 UTC
Created attachment 528200 [details]
3 cores from the crashed nodes

Comment 10 Lon Hohberger 2011-10-19 00:28:07 UTC
This appears to be a concurrency issue with dbus -- that is, signals are perhaps not being handled as well as they could be in libdbus.  This is why it only reproduces sometimes and only on startup/shutdown.

Most of the dbus calls in rgmanager are made with signals blocked.  This is because the dbus signals are emitted when we run a callback on committed data (e.g. rgmanager service state changes).

Unfortunately, the _dbus_auto_flush thread in rgmanager did -not- have signals blocked, allowing SIGTERM to hit that thread while it was in dbus_connection_read_write().

It is my suspicion that, depending on where we are in this function when the signal arrives, we either get a direct segfault or an extra dbus_connection_unref() call, causing the SIGABRT and the backtraces above.

Blocking signals in the _dbus_auto_flush thread should consequently fix this.

My first patch hit 44 iterations with no issues; I am testing a second (far simpler) patch currently.

Comment 11 Lon Hohberger 2011-10-19 01:34:43 UTC
Turns out the first patch was more robust; I still get segfaults with the second patch.  Restarting testing with the first patch.

Comment 12 Lon Hohberger 2011-10-19 03:29:40 UTC
I retested the patches over 55 iterations, so 55+44 = 99, with no segfaults and without dbus raising SIGABRT.  However, the dlm ended up sticking on one of the hosts:

INFO: task rgmanager:7665 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager     D 0000000000000002     0  7665      1 0x00000080
 ffff88001bbb3c70 0000000000000086 0000000000000000 ffff88001bbb3c34
 ffffffff00000000 ffff88001ec23480 ffff880002296000 000000000000027e
 ffff88001db22678 ffff88001bbb3fd8 000000000000f508 ffff88001db22678
Call Trace:
 [<ffffffff814ee9f5>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814eeb86>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81276b14>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee084>] ? down_read+0x24/0x30
 [<ffffffffa031e1b7>] dlm_user_request+0x47/0x240 [dlm]
 [<ffffffff8113c2f4>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff8115f19c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa032bc36>] device_write+0x5f6/0x7d0 [dlm]
 [<ffffffff81095e22>] ? hrtimer_cancel+0x22/0x30
 [<ffffffff814edf63>] ? do_nanosleep+0x93/0xc0
 [<ffffffff8120c0d6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176388>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4612>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176d91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

This does not appear to be a bug in rgmanager.

Comment 13 Lon Hohberger 2011-10-19 03:31:05 UTC
https://www.redhat.com/archives/cluster-devel/2011-October/msg00064.html
https://www.redhat.com/archives/cluster-devel/2011-October/msg00065.html

In addition to the above testing results, a Coverity scan before and after did not reveal any new defects introduced by these patches.

Comment 17 errata-xmlrpc 2011-12-06 11:59:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1595.html