501561 – gfs_controld segfault during simultaneous gfs mounts

Summary: gfs_controld segfault during simultaneous gfs mounts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	openais
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Steven Dake
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	499734
Depends On:
Blocks:	502044 502940

Reported:	2009-05-19 18:28 UTC by Corey Marthaler
Modified:	2016-04-26 15:42 UTC
CC List:	6 users
Fixed In Version:	openais-0.80.6-2.e5_4
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	502044
Environment:
Last Closed:	2009-09-02 11:30:06 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
core file from hayes-01 (8.48 MB, application/octet-stream) 2009-05-19 21:15 UTC, Corey Marthaler	no flags
debug info (25.22 KB, text/plain) 2009-05-20 17:41 UTC, David Teigland	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:1366	0	normal	SHIPPED_LIVE	openais bug-fix and enhancement update	2009-09-01 11:00:17 UTC

Description Corey Marthaler 2009-05-19 18:28:56 UTC

Description of problem:
This may be the same issue as bug 480401. I saw this just after reinstalling my machines, so I didn't have core collection set up yet. I'll try to reproduce this and add more info.

May 19 13:02:41 hayes-01 gfs_controld[27759]: replace zero id for 0 with 4108050209
May 19 13:02:41 hayes-01 gfs_controld[27759]: Assertion failed on line 682 of file recover.c Assertion:  "memb"
May 19 13:02:41 hayes-01 kernel: gfs_controld[27759]: segfault at 0000000000000024 rip 000000000040ba3e rsp 00007fffb00af610 error 6
May 19 13:02:41 hayes-01 groupd[27546]: gfs daemon appears to be dead
May 19 13:02:41 hayes-01 groupd[27546]: mark_node_stopped: event not stopping/begin: state JOIN_START_WAIT from 1
May 19 13:02:41 hayes-01 kernel: Trying to join cluster "lock_dlm", "HAYES:0"
May 19 13:02:41 hayes-01 groupd[27546]: cman_set_dirty error -1
May 19 13:02:41 hayes-01 fenced[27612]: cluster is down, exiting
May 19 13:02:41 hayes-01 dlm_controld[27665]: cluster is down, exiting
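
The kernel segfault line above can be decoded mechanically. The following is a minimal sketch, not part of the original report: it assumes the RHEL 5 x86_64 kernel prints the page-fault error code in hexadecimal and that the bits follow the standard x86 layout (bit 0: page present, bit 1: write access, bit 2: user mode). The name `decode_segfault` is made up for illustration.

```python
import re

# Matches the 2.6.18-era x86_64 "segfault at ADDR rip RIP rsp RSP error N" line.
FAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) rip (?P<rip>[0-9a-f]+) "
    r"rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def decode_segfault(line):
    """Return the fault address and decoded error bits, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: the kernel prints this in hex
    return {
        "addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "page_present": bool(err & 1),  # bit 0: 0 means page not present
        "write": bool(err & 2),         # bit 1: 1 means write access
        "user_mode": bool(err & 4),     # bit 2: 1 means fault in user mode
    }

line = ("May 19 13:02:41 hayes-01 kernel: gfs_controld[27759]: segfault at "
        "0000000000000024 rip 000000000040ba3e rsp 00007fffb00af610 error 6")
info = decode_segfault(line)
print(hex(info["addr"]), info["write"], info["user_mode"], info["page_present"])
```

For this report the decode gives a user-mode write to a non-present page at address 0x24. An address that small would be consistent with a write through a NULL struct pointer right after the failed "memb" assertion, but that reading is an inference, not something the log states.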


Version-Release number of selected component (if applicable):
2.6.18-148.el5
gfs-utils-0.1.19-3.el5
gfs2-utils-0.1.56-1.el5
kmod-gfs-0.1.32-1.el5
cman-2.0.101-1.el5

Comment 1 Robert Peterson 2009-05-19 18:45:51 UTC

Reassigning to Dave Teigland; gfs_controld is his area of expertise.

Comment 2 Corey Marthaler 2009-05-19 21:14:52 UTC

Reproduced and got a core.

Comment 3 Corey Marthaler 2009-05-19 21:15:40 UTC

Created attachment 344712
core file from hayes-01

Comment 4 David Teigland 2009-05-19 22:23:33 UTC

from hayes-01

mount that works

1242770170 client 6: join /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770170 mount: /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770170 0 cluster name matches: HAYES
1242770170 0 do_mount: rv 0
1242770170 groupd cb: set_id 0 30001
1242770170 groupd cb: start 0 type 2 count 1 members 1 
1242770170 0 start 11 init 1 type 2 member_count 1
1242770170 0 add member 1
1242770170 0 total members 1 master_nodeid -1 prev -1
1242770170 0 start_first_mounter
1242770170 0 start_done 11
1242770170 notify_mount_client: nodir not found for lockspace 0
1242770170 notify_mount_client: ccs_disconnect
1242770170 notify_mount_client: hostdata=jid=0:id=196609:first=1
1242770170 groupd cb: finish 0
1242770170 0 finish 11 needs_recovery 0
1242770170 0 set /sys/fs/gfs/HAYES:0/lock_module/block to 0

then a mount (in parallel with other nodes) that doesn't

1242770261 client 6: join /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770261 mount: /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770261 0 cluster name matches: HAYES
1242770261 0 do_mount: rv 0
1242770261 groupd cb: stop 0
1242770261 0 set /sys/fs/gfs/HAYES:0/lock_module/block to 1
1242770261 0 set open /sys/fs/gfs/HAYES:0/lock_module/block error -1 2
1242770261 0 do_stop causes mount_client_delay
1242770261 groupd cb: set_id 0 0
1242770261 replace zero id for 0 with 4108050209
1242770261 groupd cb: start 0 type 2 count 1 members 2 
1242770261 0 start 23 init 1 type 2 member_count 1
1242770261 0 add member 2
1242770261 0 total members 1 master_nodeid -1 prev -1
1242770261 0 start_first_mounter
1242770261 Assertion failed on line 682 of file recover.c
Assertion:  "memb"

Seems to be bad callbacks/data from groupd.
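
The difference between the two start callbacks can be checked by testing whether the local node appears in the member list that groupd hands over. The sketch below is illustrative only. It assumes hayes-01 is nodeid 1 (inferred from the working mount, where it is the sole member) and that the "memb" assertion guards a lookup of the local node in that list. `check_start_callback` is a hypothetical helper, not gfs_controld code.

```python
import re

# Matches gfs_controld debug lines such as:
#   "groupd cb: start 0 type 2 count 1 members 2 "
START_RE = re.compile(
    r"groupd cb: start (?P<group>\S+) type (?P<type>\d+) "
    r"count (?P<count>\d+) members (?P<members>[\d ]*)"
)

def check_start_callback(line, local_nodeid):
    """Parse a start callback line and report whether local_nodeid is listed."""
    m = START_RE.search(line)
    if m is None:
        return None
    members = [int(x) for x in m.group("members").split()]
    count = int(m.group("count"))
    return {
        "count": count,
        "members": members,
        "count_matches": count == len(members),
        "local_present": local_nodeid in members,
    }

LOCAL_NODEID = 1  # assumption: hayes-01 is node 1, per the working mount

good = check_start_callback(
    "1242770170 groupd cb: start 0 type 2 count 1 members 1 ", LOCAL_NODEID)
bad = check_start_callback(
    "1242770261 groupd cb: start 0 type 2 count 1 members 2 ", LOCAL_NODEID)
print(good["local_present"], bad["local_present"])
```

Under these assumptions the working mount lists the local node, while the failing mount's callback lists only node 2, so a lookup of the local member would come back empty just before `start_first_mounter` hits the assertion.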

Comment 5 David Teigland 2009-05-19 22:24:53 UTC

maybe related to bug 480709?

Comment 6 Steven Dake 2009-05-20 17:21:33 UTC

openais regression.

Comment 7 David Teigland 2009-05-20 17:41:35 UTC

Created attachment 344852
debug info

Here are logs from groupd, dlm_controld, gfs_controld, group_tool, and /var/log/messages, along with some analysis. groupd seems to be getting some strange cpg confchg data, but I can't yet say whether there is one consistent problem with it.

Comment 10 Corey Marthaler 2009-05-22 19:30:13 UTC

Fix verified in openais-0.80.6-2.el5 / cman-2.0.103-1.el5.

Comment 11 Steven Dake 2009-05-23 09:34:05 UTC

*** Bug 499734 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2009-09-02 11:30:06 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html
