Bug 741607

Summary: rgmanager sometimes crashes on startup
Product: Red Hat Enterprise Linux 6
Component: rgmanager
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jaroslav Kortus <jkortus>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Target Milestone: rc
Fixed In Version: rgmanager-3.0.12.1-5.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 11:59:57 UTC
Bug Blocks: 748554
Attachments:
  core file
  3 cores from the crashed nodes

Description Jaroslav Kortus 2011-09-27 12:32:38 UTC
Description of problem:
Sometimes, rgmanager crashes on startup:
Sep 27 07:10:13 marathon-01 rgmanager[5525]: I am node #1
Sep 27 07:10:13 marathon-01 rgmanager[5525]: Resource Group Manager Starting
Sep 27 07:10:13 marathon-01 rgmanager[5525]: Loading Service Data
Sep 27 07:10:15 marathon-01 rgmanager[5525]: Initializing Services
Sep 27 07:10:15 marathon-01 rgmanager[6421]: [clusterfs] /dev/dm-3 is not mounted
Sep 27 07:10:15 marathon-01 rgmanager[6455]: [nfsclient] /mnt/ext3 is not a directory
Sep 27 07:10:15 marathon-01 rgmanager[5525]: stop on nfsclient "*" returned 2 (invalid argument(s))
Sep 27 07:10:15 marathon-01 rgmanager[6491]: [nfsserver] Stopping NFS daemons
Sep 27 07:10:16 marathon-01 rgmanager[6598]: [nfsserver] Stopping rpc.statd
Sep 27 07:10:16 marathon-01 rgmanager[6728]: [fs] /dev/dm-4 is not mounted
Sep 27 07:10:16 marathon-01 rgmanager[5525]: Services Initialized
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: Local UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-02 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-03 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-04 UP
Sep 27 07:10:16 marathon-01 rgmanager[5525]: State change: marathon-05 UP
Sep 27 07:10:21 marathon-01 abrt[6756]: saved core dump of pid 5525 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-09-27-07:10:21-5525.new/coredump (44290048 bytes)


Version-Release number of selected component (if applicable):
rgmanager-3.0.12.1-4.el6.x86_64
cman-3.0.12.1-22.el6.x86_64
dbus-1.2.24-6.el6.x86_64
dbus-libs-1.2.24-6.el6.x86_64


How reproducible:
20%

Steps to Reproduce:
1. Start rgmanager cluster-wide with a service defined.
Actual results:
rgmanager sometimes crashes on one of the nodes during startup

Expected results:
no rgmanager crash

Additional info:
Core was generated by `rgmanager'.
Program terminated with signal 6, Aborted.
#0  0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-3.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) thread apply all bt

Thread 6 (Thread 0x7ff2fd680700 (LWP 5526)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000038b5e01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12.1/common/liblogthread/liblogthread.c:83
#2  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fd680700) at pthread_create.c:301
#3  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7ff2fcc7f700 (LWP 6751)):
#0  0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000004215ee in poll_cluster_messages (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:186
#2  cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:967
#3  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fcc7f700) at pthread_create.c:301
#4  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7ff2fe68b700 (LWP 6753)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
#1  0x00000000004116b1 in _event_thread_f (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/rg_event.c:393
#2  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2fe68b700) at pthread_create.c:301
#3  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ff2fe71e700 (LWP 5525)):
#0  0x00007ff2fd9632f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000409e3a in event_loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:695
#2  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/main.c:1152

Thread 2 (Thread 0x7ff2f7fff700 (LWP 6752)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
#1  0x00000000004202c7 in cluster_msg_wait (ctx=0xf74a10, timeout=3) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/msg_cluster.c:270
#2  0x000000000041ef44 in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:801
#3  vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/clulib/vft.c:871
#4  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f7fff700) at pthread_create.c:301
#5  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ff2f75fe700 (LWP 6755)):
#0  0x00007ff2fd8b7885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ff2fd8b9065 in abort () at abort.c:92
#2  0x00007ff2fde58975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x00007ff2fde54845 in _dbus_warn_check_failed (
    format=0x7ff2fde5d388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x00007ff2fde3ec62 in _dbus_connection_read_write_dispatch (connection=0xf67390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130
#6  0x00007ff2fe0767f1 in start_thread (arg=0x7ff2f75fe700) at pthread_create.c:301
#7  0x00007ff2fd96a70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 1 Jaroslav Kortus 2011-09-27 12:35:31 UTC
Created attachment 525112 [details]
core file

Comment 2 Jaroslav Kortus 2011-09-27 12:37:38 UTC
relevant rgmanager snip from cluster.conf:
	<rm>
		<failoverdomains>
			<failoverdomain name="marathon_domain" ordered="0" restricted="0">
				<failoverdomainnode name="marathon-01" priority="1"/>
				<failoverdomainnode name="marathon-02" priority="1"/>
				<failoverdomainnode name="marathon-03" priority="1"/>
				<failoverdomainnode name="marathon-04" priority="1"/>
				<failoverdomainnode name="marathon-05" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="10.15.89.211" monitor_link="1"/>
			<clusterfs device="/dev/marathon/marathon0" force_unmount="1" fsid="9039" fstype="gfs" mountpoint="/mnt/gfs1" name="marathon0" options="" self_fence="1"/>
			<fs device="/dev/marathon/marathon1" force_fsck="0" force_unmount="1" fsid="7083" fstype="ext3" mountpoint="/mnt/ext3" name="marathon1" options="" self_fence="1"/>
			<nfsserver name="marathon nfs server"/>
			<nfsclient name="*" options="rw" target="*"/>
		</resources>
		<service autostart="1" domain="marathon_domain" name="nfs_service">
			<clusterfs ref="marathon0">
				<nfsserver ref="marathon nfs server">
					<nfsclient ref="*"/>
				</nfsserver>
			</clusterfs>
			<fs ref="marathon1">
				<nfsserver ref="marathon nfs server">
					<nfsclient ref="*"/>
				</nfsserver>
			</fs>
			<ip ref="10.15.89.211"/>
		</service>
	</rm>

Comment 5 Lon Hohberger 2011-09-28 14:24:09 UTC
This is odd; I did a lot of testing but it was with the .4 release of dbus.  I'll have to see if there's anything new here.

Comment 7 Lon Hohberger 2011-10-11 23:12:11 UTC
This is always fun:

[Thread debugging using libthread_db enabled]
../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) EOF [assumed Y]
../../gdb/linux-thread-db.c:1048: internal-error: attach_thread: Assertion `ti_p->ti_tid != 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n)

Comment 8 Jaroslav Kortus 2011-10-14 12:31:47 UTC
Core was generated by `rgmanager'.
Program terminated with signal 6, Aborted.
#0  0x0000003239e32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-23.el6.x86_64 corosynclib-1.4.1-4.el6.x86_64 glibc-2.12-1.44.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libxml2-2.7.6-4.el6.x86_64 slang-2.2.1-1.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003239e32885 in raise () from /lib64/libc.so.6
#1  0x0000003239e34065 in abort () from /lib64/libc.so.6
#2  0x000000323ba2a975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x000000323ba26845 in _dbus_warn_check_failed (
    format=0x323ba2f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s")
    at dbus-internals.c:283
#4  0x000000323ba10c62 in _dbus_connection_read_write_dispatch (connection=0x1cbc390, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x000000000041a409 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:130
#6  0x000000323a2077f1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003239ee570d in clone () from /lib64/libc.so.6




Oct 14 06:13:45 marathon-02 rgmanager[17874]: I am node #2
Oct 14 06:13:45 marathon-02 rgmanager[17874]: Resource Group Manager Starting
Oct 14 06:13:45 marathon-02 rgmanager[17874]: Loading Service Data
Oct 14 06:13:46 marathon-02 rgmanager[17874]: Initializing Services
Oct 14 06:13:46 marathon-02 rgmanager[18769]: [clusterfs] /dev/dm-4 is not mounted
Oct 14 06:13:46 marathon-02 rgmanager[18803]: [nfsclient] Removing export: *:/mnt/ext3
Oct 14 06:13:47 marathon-02 rgmanager[18840]: [nfsserver] Stopping NFS daemons
Oct 14 06:13:47 marathon-02 rgmanager[18947]: [nfsserver] Stopping rpc.statd
Oct 14 06:13:47 marathon-02 rgmanager[19077]: [fs] /dev/dm-3 is not mounted
Oct 14 06:13:47 marathon-02 rgmanager[17874]: Services Initialized
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: Local UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-03 UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-01 UP
Oct 14 06:13:47 marathon-02 rgmanager[17874]: State change: marathon-04 UP
Oct 14 06:13:52 marathon-02 abrt[19105]: saved core dump of pid 17874 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-2011-10-14-06:13:52-17874.new/coredump (44285952 bytes)



$ rpm -qva | grep -E 'rgmanager|dbus' | sort
dbus-1.2.24-5.el6_1.x86_64
dbus-debuginfo-1.2.24-5.el6_1.x86_64
dbus-glib-0.86-5.el6.x86_64
dbus-libs-1.2.24-5.el6_1.x86_64
dbus-python-0.83.0-6.1.el6.x86_64
eggdbus-0.6-3.el6.x86_64
rgmanager-3.0.12.1-4.el6.x86_64
rgmanager-debuginfo-3.0.12.1-4.el6.x86_64


On the 17th of 500 planned iterations, 3 of the 5 nodes were down and the game was over :). The error is still the same, and the syslog messages match those pasted above.

Reproducer:
ITER=500
NODES="marathon-01 marathon-02 marathon-03 marathon-04 marathon-05"

for i in `seq 1 $ITER`; do
	echo "Starting iteration $i";
	for node in $NODES; do
		qarsh root@$node service rgmanager start &
	done
	echo "Waiting for all nodes to start..."
	wait
	echo "sleeping 10s"
	sleep 10
	for node in $NODES; do
		qarsh root@$node service rgmanager stop &
	done
	echo "Waiting for all nodes to stop..."
	wait
done

Comment 9 Jaroslav Kortus 2011-10-14 12:33:09 UTC
Created attachment 528200 [details]
3 cores from the crashed nodes

Comment 10 Lon Hohberger 2011-10-19 00:28:07 UTC
This appears to be a concurrency issue with dbus -- that is, signals are perhaps not being handled as well as they could be in libdbus.  This is why it only reproduces sometimes and only on startup/shutdown.

Most of the dbus calls in rgmanager are made with signals blocked.  This is because the dbus signals are emitted when we run a callback on committed data (e.g. rgmanager service state changes).

Unfortunately, the _dbus_auto_flush thread in rgmanager did -not- have signals blocked, allowing SIGTERM to hit that thread while it was in dbus_connection_read_write().

It is my suspicion that, depending on where we are in this function when the signal arrives, we either get a direct segfault or an extra dbus_connection_unref() call, causing the SIGABRT and the backtraces above.

Blocking signals in the _dbus_auto_flush thread should consequently fix this.

My first patch hit 44 iterations with no issues; I am testing a second (far simpler) patch currently.

Comment 11 Lon Hohberger 2011-10-19 01:34:43 UTC
Turns out the first patch was more robust; I still get segfaults with the second patch.  Restarting testing with the first patch.

Comment 12 Lon Hohberger 2011-10-19 03:29:40 UTC
I retested the patches over 55 iterations, so 55+44 = 99, with no segfaults and without dbus raising SIGABRT.  However, the dlm ended up sticking on one of the hosts:

INFO: task rgmanager:7665 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager     D 0000000000000002     0  7665      1 0x00000080
 ffff88001bbb3c70 0000000000000086 0000000000000000 ffff88001bbb3c34
 ffffffff00000000 ffff88001ec23480 ffff880002296000 000000000000027e
 ffff88001db22678 ffff88001bbb3fd8 000000000000f508 ffff88001db22678
Call Trace:
 [<ffffffff814ee9f5>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814eeb86>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81276b14>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee084>] ? down_read+0x24/0x30
 [<ffffffffa031e1b7>] dlm_user_request+0x47/0x240 [dlm]
 [<ffffffff8113c2f4>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff8115f19c>] ? __kmalloc+0x20c/0x220
 [<ffffffffa032bc36>] device_write+0x5f6/0x7d0 [dlm]
 [<ffffffff81095e22>] ? hrtimer_cancel+0x22/0x30
 [<ffffffff814edf63>] ? do_nanosleep+0x93/0xc0
 [<ffffffff8120c0d6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176388>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4612>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176d91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

This does not appear to be a bug in rgmanager.

Comment 13 Lon Hohberger 2011-10-19 03:31:05 UTC
https://www.redhat.com/archives/cluster-devel/2011-October/msg00064.html
https://www.redhat.com/archives/cluster-devel/2011-October/msg00065.html

In addition to the above testing results, a Coverity scan before and after did not reveal any new defects introduced by these patches.

Comment 17 errata-xmlrpc 2011-12-06 11:59:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1595.html