Bug 139863

Summary: GFS nodes panic when NFS exported fs mounted using noac
Product: [Retired] Red Hat Cluster Suite
Reporter: Rick Spurgeon <spurgeon>
Component: gfs
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 3
CC: chrisw, etay, kanderso, tao
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-12-21 15:58:29 UTC

Description Rick Spurgeon 2004-11-18 15:32:30 UTC
One of the Army Research Labs uses GFS in a large cluster (16 GFS
nodes serving 128 compute nodes via NFS).  The system has been in
place since May but is not yet in production.  LNXI is the reseller.

They have been disappointed with the I/O speed and, in an effort to
improve it, the /etc/fstab on each of the compute nodes was changed
to specify "noac" for the NFS imports.

Since making this modification, though, the GFS nodes have developed
a serious problem with panicking.  The problem seems to be related to
transaction volume (this is a database-type application) rather than
bandwidth.  More information about the panics is shown below.

They are using RHEL3 U2 with GFS 5.2.1 "Lrrr" and might be persuaded
to move to the latest RHEL/GFS if there is a plausible argument that
this will solve their problem.

According to the customer, "Best I can tell, this started two
days ago... which would coincide with fixing another NFS/GFS speed
problem: removing the "noac" from the client NFS mount attributes. 
This means this app would have hit the I/O subsystem harder, which may
have led to the current panics.  This is a guess.  I can't find
anything else that has changed. The GFS nodes are panicking.  I
documented some of the panic messages as shown below.  It is a
recursive panic... so, the node doesn't get restarted until it fences".

do_IRQ: stack overflow: 736
f5384be0 000002e0 00000001 c0435c80 00000001 00000c08 c68e4000 c68e4000
       c010da00 c03ec324 c0435180 c695f080 00000a00 c68e4000 c68e4000 f5384c78
       c0435c00 f62e0068 c0430068 ffffff00 c01232a4 00000060 00000282 c0435c80
Call Trace:   [<c010da00>] do_IRQ [kernel] 0x0 (0xf5384c00)
[<c01232a4>] schedule [kernel] 0x324 (0xf5384c30)
[<c01245aa>] io_schedule [kernel] 0x2a (0xf5384c7c)
[<c0161d5e>] __wait_on_buffer [kernel] 0x5e (0xf5384c88)
[<f8b996e5>] gfs_dreread [gfs] 0x61 (0xf5384cc0)
[<f8bc5906>] gfs_rgrp_read [gfs] 0xb6 (0xf5384ce0)
[<f8bb1826>] gfs_glock_xmote_th [gfs] 0x7a (0xf5384d20)
[<f8b9e219>] lock_rgrp [gfs] 0x2d (0xf5384d40)
[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d5c)
[<f8bb1da0>] glock_wait_internal [gfs] 0x17c (0xf5384d60)
[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5384d70)
[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d7c)
[<f8bb20e2>] gfs_glock_nq [gfs] 0x6a (0xf5384d90)
[<f8bb2780>] nq_m_sync [gfs] 0x70 (0xf5384db0)
[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5384dc0)
[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385490)
[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385520)
[<c01342f2>] timer_bh [kernel] 0x62 (0xf538579c)
[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385858)
[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5385864)
[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf53858c4)
[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf53858dc)
[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf53858f4)
[<c0246218>] ip_output [kernel] 0x88 (0xf5385914)
[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5385934)
[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385968)
[<f88165e9>] __scsi_end_request [scsi_mod] 0xc9 (0xf53859b4)
[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf53859cc)
[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf53859d8)
[<c02594c0>] tcp_transmit_skb [kernel] 0x2c0 (0xf53859ec)
[<c0134040>] process_timeout [kernel] 0x0 (0xf5385a34)
[<c0122046>] wake_up_process [kernel] 0x26 (0xf5385a44)
[<c01345d6>] __run_timers [kernel] 0xb6 (0xf5385a5c)
[<c01342f2>] timer_bh [kernel] 0x62 (0xf5385a88)
[<c012ef65>] bh_action [kernel] 0x55 (0xf5385a9c)
[<c012ee07>] tasklet_hi_action [kernel] 0x67 (0xf5385aa4)
[<c010db48>] do_IRQ [kernel] 0x148 (0xf5385ad8)
[<c010da00>] do_IRQ [kernel] 0x0 (0xf5385afc)
[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5385b20)
[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5385b30)
[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5385b50)
[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5385b80)
[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5385c10)
[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5385c60)
[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385c80)
[<f8ba67ea>] gfs_shrink [gfs] 0x34e (0xf5385cc0)
[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385ce0)
[<f8b9de28>] xmote_inode_bh [gfs] 0x44 (0xf5385d40)
[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5385d60)
[<f8bd7740>] gfs_inode_glops [gfs] 0x0 (0xf5385d6c)
[<f8bd7bc0>] gfs_sops [gfs] 0x0 (0xf5385d7c)
[<c017cb75>] iput [kernel] 0x55 (0xf5385d84)
[<f8ba13b7>] gfs_permission [gfs] 0xc3 (0xf5385da0)
[<f8ba6936>] gfs_truncatei [gfs] 0xc6 (0xf5385dc0)
[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385dd0)
[<c0140168>] vmtruncate [kernel] 0x98 (0xf5385e08)
[<f8b9eb6e>] gfs_setattr [gfs] 0x34a (0xf5385e20)
[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385e30)
[<f9032f8d>] find_fh_dentry [nfsd] 0x22d (0xf5385e44)
[<c0139773>] in_group_p [kernel] 0x23 (0xf5385e58)
[<c016e832>] vfs_permission [kernel] 0x82 (0xf5385e60)
[<c017dc3e>] notify_change [kernel] 0x2ce (0xf5385eb0)
[<f9034809>] nfsd_setattr [nfsd] 0x3f9 (0xf5385ecc)
[<f891ba0e>] svc_sock_enqueue [sunrpc] 0x1de (0xf5385ee8)
[<f903b6ef>] nfsd3_proc_setattr [nfsd] 0x7f (0xf5385f24)
[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f3c)
[<f903d873>] nfs3svc_decode_sattrargs [nfsd] 0x73 (0xf5385f40)
[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f50)
[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f58)
[<f903064e>] nfsd_dispatch [nfsd] 0xce (0xf5385f5c)
[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f70)
[<f891b65f>] svc_process_Rsmp_462cdaea [sunrpc] 0x42f (0xf5385f78)
[<f9030407>] nfsd [nfsd] 0x207 (0xf5385fb0)
[<f9030200>] nfsd [nfsd] 0x0 (0xf5385fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf5385ff0)


io05 login: lock_gulm: Checking for journals for dead node "io04"
GFS:  fsid=mhpcc:workspace4, jid=5:  Trying to acquire journal lock...
GFS:  fsid=mhpcc:workspace3, jid=5:  Trying to acquire journal lock...
GFS:  fsid=mhpcc:workspace2, jid=5:  Trying to acquire journal lock...
GFS:  fsid=mhpcc:workspace3, jid=5:  Busy
GFS:  fsid=mhpcc:workspace1, jid=5:  Trying to acquire journal lock...
GFS:  fsid=mhpcc:workspace2, jid=5:  Busy
GFS:  fsid=mhpcc:workspace1, jid=5:  Busy
GFS:  fsid=mhpcc:workspace4, jid=5:  Busy
do_IRQ: stack overflow: 984
f5c1ccd8 000003d8 00000000 74409c72 00000001 00000d00 00000001 f5c1d414
       c010da00 c03ec324 00000001 e5dc3768 00000000 00000001 f5c1d414 f5c1cd6c
       ffffffff f5c10068 f8bb0068 ffffff00 f8bb04df 00000060 00000282 f2d787e0
Call Trace:   [<c010da00>] do_IRQ [kernel] 0x0 (0xf5c1ccf8)
[<f8bb0068>] gfs_writei [gfs] 0x1a4 (0xf5c1cd20)
[<f8bb04df>] gfs_sort [gfs] 0x5b (0xf5c1cd28)
[<f8bb275b>] nq_m_sync [gfs] 0x4b (0xf5c1cd70)
[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5c1cd80)
[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5c1d470)
[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d864)
[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf5c1d8c4)
[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf5c1d8dc)
[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf5c1d8f4)
[<c0246218>] ip_output [kernel] 0x88 (0xf5c1d914)
[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5c1d934)
[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d9a0)
[<c026aa92>] arp_process [kernel] 0xa2 (0xf5c1d9b8)
[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf5c1d9cc)
[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf5c1d9d8)
[<c0222514>] alloc_skb [kernel] 0xc4 (0xf5c1d9f0)
[<c0134040>] process_timeout [kernel] 0x0 (0xf5c1da34)
[<c0122046>] wake_up_process [kernel] 0x26 (0xf5c1da44)
[<c0155b1b>] rmqueue [kernel] 0x35b (0xf5c1da64)
[<c0155d17>] __alloc_pages [kernel] 0x97 (0xf5c1daa0)
[<c0122ea1>] scheduler_tick [kernel] 0x3d1 (0xf5c1dab0)
[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5c1db20)
[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5c1db30)
[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5c1db50)
[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5c1db80)
[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5c1dc10)
[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5c1dc60)
[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5c1dc80)

Also, for the Red Hat EL folks, nodes don't come back to life after
the I/O node (NFS server) reboots:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=129861

Comment 1 Ken Preslan 2004-11-18 16:32:14 UTC
I don't think this has a whole lot to do with NFS.  It's probably more
of an issue with deallocating large files.

That and I don't really understand the backtraces.


Comment 2 Ben Marzinski 2004-11-20 22:11:49 UTC
I've been trying to recreate this bug without success.  If I could
get a more detailed description of the machines that GFS is running
on, that would be helpful.  Specifically, the output 
of "cat /proc/cpuinfo" would be a great help. I've also been looking
into the possibility that this bug isn't any one piece of software's
fault, but that the stack space was simply nickel-and-dimed away.
If that's the case, we can probably reduce the stack space used up
by gfs when it's deallocating files.
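
For reference, one quick way to collect that from all of the I/O
nodes in one pass, using the same pdsh/dshbak pattern shown later in
this bug (the io[01-16] node list is taken from there and may need
adjusting):

    # Gather CPU model/speed from every GFS node and collapse identical output
    pdsh -w io[01-16] "grep -E 'model name|cpu MHz' /proc/cpuinfo" | dshbak -c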

Comment 3 Derek Anderson 2004-11-22 14:41:41 UTC
FWIW, the client-side noac option might do the opposite of what you
intend (it will tend to decrease performance rather than increase it).
The noac option turns off all attribute caching and thus ensures that
all client-side attributes are in sync with the server, at the cost of
constantly checking attributes with the server.  You probably want to
set 'noatime' in the client mount options and try leaving ac on (which
is the default).

Also, please update this bug with the exact version of 5.2.1 they are
running.  Multiple fixes have been made since the introduction of the
Opteron to reduce GFS's use of stack space, and these may alleviate
the problem if they upgrade to the latest release.
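
In other words, something along these lines on the compute nodes
(server and path names are hypothetical), dropping noac and adding
noatime while leaving attribute caching at its default:

    # Attribute caching left enabled (the default); atime updates disabled
    io01:/workspace1  /workspace1  nfs  rw,hard,intr,noatime  0 0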

Comment 4 Rick Spurgeon 2004-11-22 17:33:45 UTC
From the customer:

# pdsh -w io[01-16] "rpm -qa | grep GFS" | dshbak -c
----------------
io[01-16]
----------------
 GFS-smp-5.2.1-25.3.1.11

Comment 6 Ben Marzinski 2004-11-29 17:53:54 UTC
*** Bug 139867 has been marked as a duplicate of this bug. ***

Comment 7 Ben Marzinski 2004-11-29 18:33:19 UTC
I am still not able to recreate this problem on my machines.  I have
an idea that will generate some more useful information.
Unfortunately, it involves having the customer run a modified gfs
module.  The new module would work exactly like their current one,
except that at the start of each gfs function it would perform the
check currently done in the interrupt handler.  If it found that the
available stack space was under 1K, it would print the stack (just
like the interrupt code currently does), but it would also print a
gfs-internal stack trace (to disambiguate the kernel stack trace, at
least for the gfs portions) and a raw hex dump of the entire stack.
Then it would halt the machine, so the output isn't buried by further
messages.  From this information, I could figure out exactly how much
stack space each function was using.  Most likely this will also make
the problem easier to recreate (since the check runs at every gfs
function, not just in interrupts).  Even if the check never catches
the overflow, that is still useful information, because it means that
whatever is using up the stack is running in interrupt context, which
points to device drivers.

Of course, this all hinges on the customer's willingness to run a
modified gfs module. If someone could find out whether or not they are
o.k. with this, that would be a big help.
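
For the curious, the check being described is roughly the one do_IRQ
already performs on i386: mask the stack pointer down to its offset
within the 8K task/stack area and see how much room remains above the
task_struct.  A minimal sketch of what such a per-function check might
look like on a 2.4 i386 kernel (this is not the actual instrumentation
patch; the threshold and names are assumptions):

/* Sketch only: an approximation of the do_IRQ stack check, callable at
 * the start of each gfs function.  On 2.4 i386 the task_struct sits at
 * the bottom of the 8K kernel stack, so (esp & 8191) is the stack
 * pointer's offset above that base. */
#include <linux/sched.h>
#include <linux/kernel.h>

#define GFS_STACK_WARN 1024	/* assumed threshold: complain when < 1K is left */

static inline void gfs_check_stack(const char *func)
{
	unsigned long esp;

	__asm__ __volatile__("andl %%esp,%0" : "=r" (esp) : "0" (8191UL));
	if (esp < sizeof(struct task_struct) + GFS_STACK_WARN) {
		printk(KERN_EMERG "gfs: low stack in %s: %lu bytes left\n",
		       func, esp - sizeof(struct task_struct));
		/* the real module would also dump a gfs-internal trace,
		 * hex-dump the whole stack, and halt the machine here */
	}
}

Each gfs entry point would then start with a call like
gfs_check_stack(__FUNCTION__).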

Comment 8 Ben Marzinski 2004-12-01 01:43:51 UTC
Forget about that last comment.  I found the bug.  There are some GFS
functions, namely gfs_glock_nq_m() and nq_m_sync(), that create
variable-size arrays on the stack, depending on their arguments.  For
some reason, the customer's load is causing them to create arrays that
eat up 3184 bytes of stack space.  I've been staring at backtraces for
far too long, and I'm going home now, but this should be fixed
tomorrow.
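
To illustrate the failure mode (schematic code only, not the real GFS
functions or the eventual patch): an array sized by the call's
argument count makes stack consumption proportional to the workload,
whereas moving the scratch array to the heap keeps the per-call stack
cost fixed:

/* Schematic illustration only -- not the real gfs_glock_nq_m()/nq_m_sync(). */
#include <linux/slab.h>
#include <linux/errno.h>

struct gfs_holder;	/* opaque stand-in for the real GFS holder type */

int nq_many_vla(struct gfs_holder **ghs, unsigned int num_gh)
{
	/* Variable-length array on the stack: a large num_gh can eat
	 * kilobytes of the fixed-size kernel stack in a single call. */
	struct gfs_holder *sorted[num_gh];

	(void)ghs; (void)sorted;	/* ... sort and enqueue the holders ... */
	return 0;
}

int nq_many_heap(struct gfs_holder **ghs, unsigned int num_gh)
{
	/* Bounded stack usage: the scratch array lives on the heap instead. */
	struct gfs_holder **sorted = kmalloc(num_gh * sizeof(*sorted), GFP_KERNEL);

	if (!sorted)
		return -ENOMEM;
	(void)ghs;			/* ... sort and enqueue the holders ... */
	kfree(sorted);
	return 0;
}

3184 bytes is a large fraction of the 8K i386 kernel stack, which is
shared with interrupt handling; that matches the do_IRQ stack overflow
messages shown above.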

Comment 9 Ben Marzinski 2004-12-02 15:35:28 UTC
The fix is in.  RPMs are either being generated or will be shortly.
I will post a message when they are ready.

Comment 10 Chris Worley 2004-12-03 14:36:06 UTC
It sounds like this will be a simple module replacement, correct?

When should we expect the RPM?

Will it be built against U2, or will we need to upgrade?

If it's U2-compatible and the RPM is available, we will be down for
service today... so we could try it out.

Comment 11 Ben Marzinski 2004-12-03 21:15:41 UTC
Yeah, it's just a module replacement. To verify that this fix solves
your problem, you can download a modified gfs.o module at

ftp://ftp.sistina.com/pub/misc/.test/gfs.o

This module was built from the GFS-smp-5.2.1-25.3.1.11 source
for linux-2.4.21-15.ELsmp, with a patch added to correct the problem
I found.

To cut down on the number of different permutations of kernel/gfs
module that we need to support, we are simply adding this bug fix to
our latest rebuild, which is against 2.4.21-27 (the kernel for
RHEL3-U4). If this patched module works for you, you can just run with
it until RHEL3-U4 is released; then you should upgrade to the latest
gfs release.
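
For anyone following along, a rough sketch of how a drop-in gfs.o
replacement is typically done (the module path and backup steps here
are assumptions, not instructions from the GFS team -- verify the real
location on your system first):

    # Sketch only
    find /lib/modules/$(uname -r) -name gfs.o            # locate installed module
    cp /path/to/installed/gfs.o /path/to/installed/gfs.o.orig   # back it up
    cp ./gfs.o /path/to/installed/gfs.o                  # drop in the patched one
    depmod -a
    # Then stop NFS/GFS services, unmount the GFS filesystems, and reload
    # the module (rmmod gfs; modprobe gfs) or simply reboot the node.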

How does that sound?

Comment 12 Chris Worley 2004-12-04 02:42:46 UTC
Sounds good.

It's too late to try it out today.  We'll need to wait for the next
allowed system downtime.

Thanks for all your help! We really do appreciate it!



Comment 13 John Flanagan 2004-12-21 15:58:29 UTC
An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-602.html