Bug 157472

Summary: assertion in util.c during recovery: "!ret"
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Kiersten (Kerri) Anderson <kanderso>
QA Contact: GFS Bugs <gfs-bugs>
CC: kpreslan
Fixed In Version: gfs 6.1
Doc Type: Bug Fix
Last Closed: 2006-11-06 19:43:15 UTC

Description Corey Marthaler 2005-05-11 21:28:44 UTC
Description of problem:
I had the revolver HEAVY I/O load running and shot two of the four nodes in the
gulm cluster.

GFS: fsid=tank-cluster:gfs1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs0.0: jid=2: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs3.0: jid=2: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs2.0: jid=2: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs0.0: jid=2: Busy
GFS: fsid=tank-cluster:gfs3.0: jid=2: Busy
GFS: fsid=tank-cluster:gfs2.0: jid=2: Busy
GFS: fsid=tank-cluster:gfs1.0: jid=2: Busy
GFS: fsid=tank-cluster:gfs3.0: jid=3: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs0.0: jid=3: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs1.0: jid=3: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs2.0: jid=3: Trying to acquire journal lock...
GFS: fsid=tank-cluster:gfs0.0: jid=3: Looking at journal...
GFS: fsid=tank-cluster:gfs3.0: jid=3: Looking at journal...
GFS: fsid=tank-cluster:gfs1.0: jid=3: Looking at journal...
GFS: fsid=tank-cluster:gfs2.0: jid=3: Looking at journal...
GFS: fsid=tank-cluster:gfs3.0: jid=3: Acquiring the transaction lock...
GFS: fsid=tank-cluster:gfs2.0: jid=3: Acquiring the transaction lock...
GFS: fsid=tank-cluster:gfs0.0: jid=3: Acquiring the transaction lock...
GFS: fsid=tank-cluster:gfs1.0: jid=3: Acquiring the transaction lock...
GFS: fsid=tank-cluster:gfs3.0: jid=3: Replaying journal...
GFS: fsid=tank-cluster:gfs0.0: warning: assertion "!ret" failed
GFS: fsid=tank-cluster:gfs0.0:   function = drop_bh
GFS: fsid=tank-cluster:gfs0.0:   file =
/usr/src/build/563141-i686/BUILD/hugemem/src/gfs/glock.c, line = 1189
GFS: fsid=tank-cluster:gfs0.0:   time = 1115839337
------------[ cut here ]------------
kernel BUG at /usr/src/build/563141-i686/BUILD/hugemem/src/gfs/util.c:289!
invalid operand: 0000 [#1]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_gulm(U) lock_harness(U)
md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd
hw_random e1000 floppy lpfc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    1
EIP:    0060:[<f8b89fb3>]    Not tainted VLI
EFLAGS: 00010202   (2.6.9-6.37.ELhugemem)
EIP is at gfs_assert_warn_i+0x7f/0x93 [gfs]
eax: 003285e3   ebx: f8b0624c   ecx: f5d95f00   edx: f8b9080f
esi: 00002710   edi: f8b06000   ebp: f8b8a6b1   esp: f5d95f24
ds: 007b   es: 007b   ss: 0068
Process gulm_Cb_Handler (pid: 5845, threadinfo=f5d95000 task=f42c8030)
Stack: f8b8bf62 00000001 00000001 ee511f10 f5fbae40 ee511f2c f8b680f3 f8b8bc80
       000004a5 f8b9c360 f8b06000 00000009 f8b06000 ee511f10 f5d95fc0 f5d95000
       f8b6950b f5b35b00 39e026b0 39e026d8 f8974cc7 39e026d0 39e026e0 00000000
Call Trace:
 [<f8b680f3>] drop_bh+0xb0/0x194 [gfs]
 [<f8b6950b>] gfs_glock_cb+0xa3/0x131 [gfs]
 [<f8974cc7>] handler+0x15f/0x180 [lock_gulm]
 [<0211dd64>] default_wake_function+0x0/0xc
 [<0211cc04>] finish_task_switch+0x30/0x66
 [<0211dd64>] default_wake_function+0x0/0xc
 [<f8974b68>] handler+0x0/0x180 [lock_gulm]
 [<021041f1>] kernel_thread_helper+0x5/0xb
Code: <3>Debug: sleeping function called from invalid context at
include/linux/rwsem.h:43
in_atomic():0[expected: 0], irqs_disabled():1
 [<0211f405>] __might_sleep+0x7d/0x88
 [<0215054b>] rw_vm+0xdb/0x282
 [<f8b89f88>] gfs_assert_warn_i+0x54/0x93 [gfs]
 [<f8b89f88>] gfs_assert_warn_i+0x54/0x93 [gfs]
 [<021509a5>] get_user_size+0x30/0x57
 [<f8b89f88>] gfs_assert_warn_i+0x54/0x93 [gfs]
 [<0210615b>] show_registers+0x115/0x16c
 [<021062f2>] die+0xdb/0x16b
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106664>] do_invalid_op+0x0/0xd5
 [<02106730>] do_invalid_op+0xcc/0xd5
 [<f8b89fb3>] gfs_assert_warn_i+0x7f/0x93 [gfs]
 [<0211f6e7>] autoremove_wake_function+0xd/0x2d
 [<0211dda6>] __wake_up_common+0x36/0x51
 [<0211ddea>] __wake_up+0x29/0x3c
 [<02121c5b>] release_console_sem+0xa4/0xa9
 [<f8b8007b>] gfs_permission_i+0xc7/0x176 [gfs]
 [<f8b89fb3>] gfs_assert_warn_i+0x7f/0x93 [gfs]
 [<f8b680f3>] drop_bh+0xb0/0x194 [gfs]
 [<f8b6950b>] gfs_glock_cb+0xa3/0x131 [gfs]
 [<f8974cc7>] handler+0x15f/0x180 [lock_gulm]
 [<0211dd64>] default_wake_function+0x0/0xc
 [<0211cc04>] finish_task_switch+0x30/0x66
 [<0211dd64>] default_wake_function+0x0/0xc
 [<f8974b68>] handler+0x0/0x180 [lock_gulm]
 [<021041f1>] kernel_thread_helper+0x5/0xb
 Bad EIP value.
 <0>Fatal exception: panic in 5 seconds
GFS: fsid=tank-cluster:gfs0.0: jid=3: Replaying journal...
GFS: fsid=tank-cluster:gfs2.0: jid=3: Replaying journal...
GFS: fsid=tank-cluster:gfs0.0: jid=3: Replayed 0 of 29 blocks
GFS: fsid=tank-cluster:gfs0.0: jid=3: replays = 0, skips = 24, sames = 5
GFS: fsid=tank-cluster:gfs0.0: jid=3: Journal replayed in 2s
GFS: fsid=tank-cluster:gfs2.0: jid=3: Replayed 0 of 6 blocks
GFS: fsid=tank-cluster:gfs2.0: jid=3: replays = 0, skips = 1, sames = 5
GFS: fsid=tank-cluster:gfs1.0: jid=3: Replaying journal...
GFS: fsid=tank-cluster:gfs2.0: jid=3: Journal replayed in 2s
GFS: fsid=tank-cluster:gfs1.0: jid=3: Replayed 0 of 6 blocks
GFS: fsid=tank-cluster:gfs1.0: jid=3: replays = 0, skips = 1, sames = 5
GFS: fsid=tank-cluster:gfs1.0: jid=3: Journal replayed in 2s
GFS: fsid=tank-cluster:gfs3.0: jid=3: Replayed 507 of 507 blocks
GFS: fsid=tank-cluster:gfs3.0: jid=3: replays = 507, skips = 0, sames = 0
GFS: fsid=tank-cluster:gfs3.0: jid=3: Journal replayed in 2s
GFS: fsid=tank-cluster:gfs0.0: jid=3: Done
GFS: fsid=tank-cluster:gfs2.0: jid=3: Done
GFS: fsid=tank-cluster:gfs1.0: jid=3: Done
GFS: fsid=tank-cluster:gfs3.0: jid=3: Done
Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
Gulm 2.6.9-32.0 (built May  5 2005 12:12:36) installed
GFS 2.6.9-32.0 (built May  5 2005 12:12:50) installed

Comment 1 Corey Marthaler 2005-05-11 21:35:09 UTC
This happened on tank-01, which was either a Slave or the Master at the time
tank-04 and tank-05 went down.

Comment 2 michael conrad tadpol tilstra 2005-05-12 13:18:48 UTC
So, are you going to describe the cluster, or just leave me guessing?
Only four nodes? Embedded or separate lock servers? Which nodes were shot?
Clients? Slaves? How many lock servers?

Comment 3 Corey Marthaler 2005-05-12 14:53:38 UTC
Tilstra, you know I'd never knowingly leave you hangin', man. :)
Here's a copy of the config file. These were embedded servers; again, tank-04 and
tank-05 were shot. tank-04 was a client, and tank-05 was either a Slave or the
Master, I really don't know. It was kind of a fluke that I hit this (I was trying
to set up a scenario for a different bug, so I wasn't paying as much attention as
I normally would).


<?xml version="1.0"?>
<cluster config_version="8" name="tank-cluster">
        <gulm>
                <lockserver name="tank-01.lab.msp.redhat.com"/>
                <lockserver name="tank-03.lab.msp.redhat.com"/>
                <lockserver name="tank-05.lab.msp.redhat.com"/>
        </gulm>
        <clusternodes>
                <clusternode name="tank-01.lab.msp.redhat.com" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="apc" port="1" switch="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tank-03.lab.msp.redhat.com" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="apc" port="3" switch="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tank-04.lab.msp.redhat.com" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="apc" port="4" switch="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tank-05.lab.msp.redhat.com" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="apc" port="5" switch="1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="tank-apc" login="apc" name="apc" passwd="apc"/>
        </fencedevices>
        <rm>
                <resources>
                        <ip address="192.168.45.91" monitor_link="1"/>
                        <ip address="192.168.45.92" monitor_link="1"/>
                        <ip address="192.168.45.93" monitor_link="1"/>
                        <ip address="192.168.45.94" monitor_link="1"/>
                        <ip address="192.168.45.95" monitor_link="1"/>
                </resources>
                <service name="test1">
                        <ip ref="192.168.45.91"/>
                </service>
                <failoverdomains/>
                <service exclusive="1" name="coreyservice">
                        <clusterfs device="111" fstype="gfs" mountpoint="111" name="111" options="111">
                                <clusterfs device="222" fstype="gfs" mountpoint="222" name="222" options="222">
                                        <clusterfs device="333" fstype="gfs" mountpoint="333" name="333" options="333">
                                                <clusterfs device="444" fstype="gfs" mountpoint="444" name="444" options="444"/>
                                        </clusterfs>
                                </clusterfs>
                                <clusterfs device="222b" fstype="gfs" mountpoint="222b" name="222b" options="222b"/>
                        </clusterfs>
                </service>
        </rm>
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
</cluster>

Comment 4 michael conrad tadpol tilstra 2005-05-18 15:34:39 UTC
GFS 6.1 doesn't like getting error codes from the lock modules. Prior versions
handled this by retrying the lock request. So the fix is to requeue lock requests
instead of telling GFS there was an error; a sketch of the idea follows.
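
For illustration only, here is a minimal userspace sketch of the idea in comment 4.
It is not the actual lock_gulm patch, and every name in it (send_to_lock_server,
complete_to_gfs, requeue, struct lock_request, etc.) is hypothetical: on a transient
error from the lock server, the request goes back onto a retry queue instead of being
completed back to GFS with an error, so GFS never sees the nonzero return that trips
the "!ret" assertion in drop_bh().

/* Hypothetical sketch of "requeue instead of erroring out to GFS";
 * none of these names come from the real lock_gulm module. */
#include <stdio.h>
#include <errno.h>

struct lock_request {
        struct lock_request *next;
        int id;
        int attempts;
};

struct request_queue {
        struct lock_request *head, *tail;
};

/* Append a request to the tail of the retry queue. */
static void requeue(struct request_queue *q, struct lock_request *req)
{
        req->next = NULL;
        if (q->tail)
                q->tail->next = req;
        else
                q->head = req;
        q->tail = req;
}

/* Stand-in for the lock server: fail transiently on the first attempt. */
static int send_to_lock_server(struct lock_request *req)
{
        return (req->attempts++ == 0) ? -EAGAIN : 0;
}

/* Stand-in for the callback that reports the final result to GFS. */
static void complete_to_gfs(struct lock_request *req, int error)
{
        printf("req %d completed to GFS with error %d\n", req->id, error);
}

/*
 * Old behaviour (the bug, sketched): complete_to_gfs(req, send_to_lock_server(req));
 * a nonzero error propagates into GFS and trips the "!ret" assertion.
 *
 * New behaviour (the fix, sketched): a transient error puts the request back
 * on the queue for a later retry, so GFS only ever sees success or a
 * genuinely fatal error.
 */
static void service_request(struct request_queue *q, struct lock_request *req)
{
        int ret = send_to_lock_server(req);

        if (ret == -EAGAIN) {
                requeue(q, req);        /* retry later; GFS never sees the error */
                return;
        }
        complete_to_gfs(req, ret);
}

int main(void)
{
        struct request_queue q = { NULL, NULL };
        struct lock_request r = { NULL, 2, 0 };  /* e.g. the jid=2 journal lock */

        service_request(&q, &r);        /* first try: transient error, requeued */
        while (q.head) {                /* drain the retry queue */
                struct lock_request *req = q.head;
                q.head = req->next;
                if (!q.head)
                        q.tail = NULL;
                service_request(&q, req);
        }
        return 0;
}

The point of the design is simply that retries stay inside the lock module;
GFS's recovery path is written to assert on any error it gets back, so the
module absorbs transient failures itself.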

Comment 5 Corey Marthaler 2006-11-06 19:43:15 UTC
This was fixed a while ago.