Bug 502044 - gfs_controld segfault during simultaneous gfs mounts
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Steven Dake
QA Contact: Cluster QE
Depends On: 501561
Reported: 2009-05-21 11:47 EDT by Corey Marthaler
Modified: 2016-04-26 12:34 EDT

Doc Type: Bug Fix
Clone Of: 501561
Last Closed: 2009-05-21 12:02:54 EDT
Description Corey Marthaler 2009-05-21 11:47:46 EDT
This is the 5.3.Z version of this bug

+++ This bug was initially created as a clone of Bug #501561 +++

Description of problem:
This may be the same issue as bug 480401. I saw this just after re-installing my machines, so I didn't have core-file collection set up yet. I'll try to reproduce this and add more info.

May 19 13:02:41 hayes-01 gfs_controld[27759]: replace zero id for 0 with 4108050209
May 19 13:02:41 hayes-01 gfs_controld[27759]: Assertion failed on line 682 of file recover.c Assertion:  "memb"
May 19 13:02:41 hayes-01 kernel: gfs_controld[27759]: segfault at 0000000000000024 rip 000000000040ba3e rsp 00007fffb00af610 error 6
May 19 13:02:41 hayes-01 groupd[27546]: gfs daemon appears to be dead
May 19 13:02:41 hayes-01 groupd[27546]: mark_node_stopped: event not stopping/begin: state JOIN_START_WAIT from 1
May 19 13:02:41 hayes-01 kernel: Trying to join cluster "lock_dlm", "HAYES:0"
May 19 13:02:41 hayes-01 groupd[27546]: cman_set_dirty error -1
May 19 13:02:41 hayes-01 fenced[27612]: cluster is down, exiting
May 19 13:02:41 hayes-01 dlm_controld[27665]: cluster is down, exiting


Version-Release number of selected component (if applicable):
2.6.18-148.el5
gfs-utils-0.1.19-3.el5
gfs2-utils-0.1.56-1.el5
kmod-gfs-0.1.32-1.el5
cman-2.0.101-1.el5

--- Additional comment from rpeterso@redhat.com on 2009-05-19 14:45:51 EDT ---

Reassigning to Dave Teigland; gfs_controld is his area of expertise.

--- Additional comment from cmarthal@redhat.com on 2009-05-19 17:14:52 EDT ---

Reproduced and got a core.

--- Additional comment from cmarthal@redhat.com on 2009-05-19 17:15:40 EDT ---

Created an attachment (id=344712)
core file from hayes-01

--- Additional comment from teigland@redhat.com on 2009-05-19 18:23:33 EDT ---

from hayes-01

mount that works

1242770170 client 6: join /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770170 mount: /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770170 0 cluster name matches: HAYES
1242770170 0 do_mount: rv 0
1242770170 groupd cb: set_id 0 30001
1242770170 groupd cb: start 0 type 2 count 1 members 1 
1242770170 0 start 11 init 1 type 2 member_count 1
1242770170 0 add member 1
1242770170 0 total members 1 master_nodeid -1 prev -1
1242770170 0 start_first_mounter
1242770170 0 start_done 11
1242770170 notify_mount_client: nodir not found for lockspace 0
1242770170 notify_mount_client: ccs_disconnect
1242770170 notify_mount_client: hostdata=jid=0:id=196609:first=1
1242770170 groupd cb: finish 0
1242770170 0 finish 11 needs_recovery 0
1242770170 0 set /sys/fs/gfs/HAYES:0/lock_module/block to 0

then a mount (in parallel with other nodes) that doesn't

1242770261 client 6: join /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770261 mount: /mnt/hayes0 gfs lock_dlm HAYES:0 rw /dev/mapper/HAYES-HAYES0
1242770261 0 cluster name matches: HAYES
1242770261 0 do_mount: rv 0
1242770261 groupd cb: stop 0
1242770261 0 set /sys/fs/gfs/HAYES:0/lock_module/block to 1
1242770261 0 set open /sys/fs/gfs/HAYES:0/lock_module/block error -1 2
1242770261 0 do_stop causes mount_client_delay
1242770261 groupd cb: set_id 0 0
1242770261 replace zero id for 0 with 4108050209
1242770261 groupd cb: start 0 type 2 count 1 members 2 
1242770261 0 start 23 init 1 type 2 member_count 1
1242770261 0 add member 2
1242770261 0 total members 1 master_nodeid -1 prev -1
1242770261 0 start_first_mounter
1242770261 Assertion failed on line 682 of file recover.c
Assertion:  "memb"

Seems to be bad callbacks/data from groupd.

--- Additional comment from teigland@redhat.com on 2009-05-19 18:24:53 EDT ---

maybe related to bug 480709?

--- Additional comment from sdake@redhat.com on 2009-05-20 13:21:33 EDT ---

openais regression.

--- Additional comment from teigland@redhat.com on 2009-05-20 13:41:35 EDT ---

Created an attachment (id=344852)
debug info

Here are logs from groupd, dlm_controld, gfs_controld, group_tool, /var/log/messages, along with some analysis.  groupd seems to be getting some strange cpg confchg data, but I can't say yet if there's one consistent problem with them.
Comment 1 Corey Marthaler 2009-05-21 11:49:20 EDT
[root@taft-02 ~]# uname -ar
Linux taft-02 2.6.18-128.1.10.el5 #1 SMP Wed Apr 29 13:53:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

[root@taft-02 ~]# rpm -q openais
openais-0.80.3-22.el5_3.6

[root@taft-02 ~]# rpm -q cman
cman-2.0.98-1.el5_3.1
Comment 3 Steven Dake 2009-05-21 12:02:54 EDT
This is not the zstream process.
