Bug 126971

Summary: mount of GFS with type lock_gulmd segfaults
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 3
Hardware: i686
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED CURRENTRELEASE
Reporter: Corey Marthaler <cmarthal>
Assignee: michael conrad tadpol tilstra <mtilstra>
QA Contact: Cluster QE <mspqa-list>
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2004-09-15 15:37:58 UTC

Description Corey Marthaler 2004-06-29 21:41:34 UTC
Description of problem: 
This was working last week, so I may be doing something wrong here, but 
even if so, an error message would be nicer than a seg fault. 
 
[root@morph-01 root]# gfs_mkfs -p lock_gulm -t morph-cluster:corey1 
-j 6 -J 32MB /dev/corey/lvol0 
This will destroy any data on /dev/corey/lvol0. 
  It appears to contain a GFS filesystem. 
 
Are you sure you want to proceed? [y/n] y 
 
Device:                    /dev/corey/lvol0 
Blocksize:                 4096 
Filesystem Size:           523344260 
Journals:                  6 
Resource Groups:           7988 
Locking Protocol:          lock_gulm 
Lock Table:                morph-cluster:corey1 
 
Syncing... 
All Done 
 
on all nodes (morph-01 - morph-06): 
 
lock_gulmd -s morph-01,morph-03,morph-05 -n morph-cluster 
 
[root@morph-01 root]# ps -ef | grep lock_gulmd 
root      3333     1  0 16:35 ?        00:00:00 lock_gulmd_core -s 
morph-01,morph-03,morph-05 
root      3336     1  0 16:35 ?        00:00:00 lock_gulmd_LT -s 
morph-01,morph-03,morph-05 
root      3339     1  0 16:35 ?        00:00:00 lock_gulmd_LTPX -s 
morph-01,morph-03,morph-05 
root      3385  2123  0 16:36 pts/0    00:00:00 grep lock_gulmd 
 
[root@morph-01 root]# mount -t gfs /dev/corey/lvol0 /mnt/corey 
Segmentation fault 
 
GFS: can't mount proto = lock_gulm, table = morph-cluster:corey1, 
hostdata = 
Unable to handle kernel NULL pointer dereference at virtual address 
00000427 
 printing eip: 
c01533f1 
*pde = 00000000 
Oops: 0000 [#2] 
Modules linked in: gfs lock_gulm lock_dlm dlm cman lock_harness ipv6 
parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode 
dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd 
qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<c01533f1>]    Not tainted 
EFLAGS: 00010246   (2.6.7) 
EIP is at do_kern_mount+0xb1/0x140 
eax: 00000000   ebx: 00000000   ecx: c031ca60   edx: 00000000 
esi: 000003eb   edi: f7f73980   ebp: f8a807e0   esp: ea58bf0c 
ds: 007b   es: 007b   ss: 0068 
Process mount (pid: 3402, threadinfo=ea58a000 task=f58822b0) 
Stack: 00000000 00000000 f5012000 00000000 ea5bd000 00000000 
ea5bd003 ea58bf60 
       c0165cab 00000000 c031ca60 00000000 ea58bf60 00000000 
c0165ff0 00000000 
       f5012000 00000000 00000000 ea5bd000 f5012000 f50316ac 
f7f73f80 c0316fcc 
Call Trace: 
 [<c0165cab>] do_add_mount+0x5b/0x170 
 [<c0165ff0>] do_mount+0x170/0x1b0 
 [<c01adce5>] copy_from_user+0x45/0x80 
 [<c0165e19>] copy_mount_options+0x59/0xc0 
 [<c016634a>] sys_mount+0x7a/0xe0 
 [<c0105cad>] sysenter_past_esp+0x52/0x71 
 
Code: 8b 56 3c 85 d2 74 08 8b 02 85 c0 74 3a ff 02 89 57 10 8b 46 
 
How reproducible: 
Always

Comment 1 michael conrad tadpol tilstra 2004-07-09 16:32:12 UTC
when it worked before, did you load both gulm and dlm lock modules?

Comment 2 Corey Marthaler 2004-08-26 22:36:36 UTC
When I attempted to use gulm originally (and when it worked for
me) it is possible that I didn't have the dlm mod loaded and was just
using a raw device, but maybe not; maybe I was using lvm2 then as
well. I can't remember anymore. :( 

When the seg fault was happening I did have dlm loaded, as I was using
lvm2. 

I haven't been able to reproduce this seg fault anymore; however, every
attempt to mount a filesystem using lock_gulm results in a hang,
whether it's a raw dev or an lvm2 volume.



Comment 3 michael conrad tadpol tilstra 2004-09-08 18:40:11 UTC
which code base is this from?

Comment 4 Corey Marthaler 2004-09-08 22:14:57 UTC
latest code base in cluster tree as of last week, 2.6.8.1 

Comment 5 michael conrad tadpol tilstra 2004-09-08 22:30:17 UTC
ah, well all of that just changed this afternoon. So give it another go.

Comment 6 Corey Marthaler 2004-09-09 21:15:16 UTC
This seems to be more of a lock_gulm state issue than a mount issue. 
 
I start lock_gulmd on all my nodes with the following cmd: 
lock_gulmd -s morph-01,morph-03,morph-05 -n morph-cluster 
 
But the servers get stuck either in "Arbitrating" or in "Pending", 
which causes the mount to hang or time out with a refused 
connection. 
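 
(To check all three lock servers in one pass, something like the following 
should work, assuming gulm_tool getstats accepts any reachable hostname the 
same way it accepts $(hostname) in the output below:) 
 
for h in morph-01 morph-03 morph-05; do 
        # ask each lock server for its state (I_am, quorum_has, quorum_needs, ...) 
        gulm_tool getstats $h 
done 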
 
[root@morph-01 root]# gulm_tool getstats $(hostname) 
I_am = Arbitrating 
quorum_has = 1 
quorum_needs = 2 
rank = 0 
GenerationID = 1094764224845226 
run time = 362 
pid = 4187 
verbosity = Default 
failover = enabled 
 
[root@morph-02 root]# gulm_tool getstats $(hostname) 
I_am = Client 
quorum_has = 1 
quorum_needs = 2 
rank = -1 
GenerationID = 0 
run time = 301 
pid = 3995 
verbosity = Default 
failover = enabled 
 
[root@morph-03 root]# gulm_tool getstats $(hostname) 
Command timed out. 
 
[root@morph-04 root]# gulm_tool getstats $(hostname) 
I_am = Client 
quorum_has = 1 
quorum_needs = 2 
rank = -1 
GenerationID = 0 
run time = 430 
pid = 3883 
verbosity = Default 
failover = enabled 
 
[root@morph-05 root]# gulm_tool getstats $(hostname) 
I_am = Pending 
quorum_has = 1 
quorum_needs = 2 
rank = 2 
GenerationID = 0 
run time = 276 
pid = 3922 
verbosity = Default 
failover = enabled 
 
[root@morph-06 root]# gulm_tool getstats $(hostname) 
I_am = Client 
quorum_has = 1 
quorum_needs = 2 
rank = -1 
GenerationID = 0 
run time = 431 
pid = 3947 
verbosity = Default 
failover = enabled 
 

Comment 7 Corey Marthaler 2004-09-09 21:17:29 UTC
This is with the latest code. 
 
I also see these errors in all the syslogs: 
 
Sep  9 16:17:53 morph-01 lock_gulmd_core[4187]: ERROR 
[src/core_io.c:1317] Node (morph-03.lab.msp.redhat.com 
::ffff:192.168.44.63) has been denied from connecting here. 
Sep  9 16:17:57 morph-01 lock_gulmd_core[4187]: ERROR 
[src/core_io.c:1317] Node (morph-05.lab.msp.redhat.com 
::ffff:192.168.44.65) has been denied from connecting here. 
Sep  9 16:18:02 morph-01 lock_gulmd_core[4187]: ERROR 
[src/core_io.c:1317] Node (morph-04.lab.msp.redhat.com 
::ffff:192.168.44.64) has been denied from connecting here. 
Sep  9 16:18:07 morph-01 lock_gulmd_core[4187]: ERROR 
[src/core_io.c:1317] Node (morph-02.lab.msp.redhat.com 
::ffff:192.168.44.62) has been denied from connecting here. 
Sep  9 16:18:07 morph-01 lock_gulmd_core[4187]: ERROR 
[src/core_io.c:1317] Node (morph-06.lab.msp.redhat.com 
::ffff:192.168.44.66) has been denied from connecting here. 
 

Comment 8 michael conrad tadpol tilstra 2004-09-09 21:32:20 UTC
Is ccs running? or are you doing all gulm config via cmdline?


Comment 9 Corey Marthaler 2004-09-09 21:42:57 UTC
ccsd is running and then I do the lock_gulmd cmdline I posted. 
 
 

Comment 10 michael conrad tadpol tilstra 2004-09-09 21:59:30 UTC
ok. what does the nodes section of the ccs conf look like?

Comment 11 Corey Marthaler 2004-09-09 22:06:45 UTC
[root@morph-01 root]# cat /etc/cluster/cluster.conf 
<?xml version="1.0"?> 
<cluster name="morph-cluster" config_version="1"> 
 
<cman> 
</cman> 
 
<dlm> 
</dlm> 
 
<nodes> 
        <node name="morph-01" votes="1"> 
                <fcdriver>qla2300</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="1"/> 
                        </method> 
                </fence> 
        </node> 
        <node name="morph-02" votes="1"> 
                <fcdriver>qla2300</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="2"/> 
                        </method> 
                </fence> 
        </node> 
        <node name="morph-03" votes="1"> 
                <fcdriver>qla2300</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="3"/> 
                        </method> 
                </fence> 
        </node> 
        <node name="morph-04" votes="1"> 
                <fcdriver>lpfc</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="4"/> 
                        </method> 
                </fence> 
        </node> 
        <node name="morph-05" votes="1"> 
                <fcdriver>lpfc</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="5"/> 
                        </method> 
                </fence> 
        </node> 
        <node name="morph-06" votes="1"> 
                <fcdriver>qla2300</fcdriver> 
                <fence> 
                        <method name="single"> 
                                <device name="apc" switch="1" 
port="6"/> 
                        </method> 
                </fence> 
        </node> 
 
</nodes> 
 
 
<fence_devices> 
        <device name="apc" agent="fence_apc" ipaddr="morph-apc" 
login="apc" passwd="apc"/> 
</fence_devices> 
 
 
<rm> 
</rm> 
 
</cluster> 
 

Comment 12 michael conrad tadpol tilstra 2004-09-09 22:33:06 UTC
well fun. gulm is looking in ccs for a node called
"morph-03.lab.msp.redhat.com", but it doesn't see any with that name.

("morph-03" != "morph-03.lab.msp.redhat.com")


Comment 13 Corey Marthaler 2004-09-10 16:50:04 UTC
This is related to bz132222 

Comment 14 Corey Marthaler 2004-09-15 15:37:58 UTC
no longer seg faults or gets stuck due to the FQDN mismatch after the fix for 132222 

Comment 15 michael conrad tadpol tilstra 2004-09-15 15:40:50 UTC
err, little confused how a fix to cman fixed gulm, but hey, if it
works it works i guess.