One of the Army Research Labs uses GFS in a large cluster (16 GFS
nodes serving 128 compute nodes via NFS). The system has been
installed since May, but is not yet in production. LNXI is the reseller.
They have been disappointed with the I/O speed and, in an effort to
improve this, the /etc/fstab on each of the compute nodes was changed
to specify "noac" for the nfs imports.
Since making this modification, though, the GFS nodes have started to
have a serious problem with panicking. The problem seems to be
related to transaction volume (this is a database type of app), as
opposed to bandwidth. More info concerning the panics is shown at the
bottom of this email.
They are using RHEL3 U2 with GFS 5.2.1 "Lrrr", and might be persuaded
to move to the latest RHEL/GFS, if there is a plausible argument to be
made that this will solve their problem.
According to the customer, "Best I can tell, this started two
days ago... which would coincide with fixing another NFS/GFS speed
problem: removing the "noac" from the client NFS mount attributes.
This means this app would have hit the I/O subsystem harder, which may
have led to the current panics. This is a guess. I can't find
anything else that has changed. The GFS nodes are panicking. I
documented some of the panic messages as shown below. It is a
recursive panic... so, the node doesn't get restarted until it fences".
do_IRQ: stack overflow: 736
^@f5384be0 000002e0 00000001 c0435c80 00000001 00000c08 c68e4000 c68e4000
^@ c010da00 c03ec324 c0435180 c695f080 00000a00 c68e4000
^@ c0435c00 f62e0068 c0430068 ffffff00 c01232a4 00000060
^@Call Trace: [<c010da00>] do_IRQ [kernel] 0x0 (0xf5384c00)
^@[<c01232a4>] schedule [kernel] 0x324 (0xf5384c30)
^@[<c01245aa>] io_schedule [kernel] 0x2a (0xf5384c7c)
^@[<c0161d5e>] __wait_on_buffer [kernel] 0x5e (0xf5384c88)
^@[<f8b996e5>] gfs_dreread [gfs] 0x61 (0xf5384cc0)
^@[<f8bc5906>] gfs_rgrp_read [gfs] 0xb6 (0xf5384ce0)
^@[<f8bb1826>] gfs_glock_xmote_th [gfs] 0x7a (0xf5384d20)
^@[<f8b9e219>] lock_rgrp [gfs] 0x2d (0xf5384d40)
^@[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d5c)
^@[<f8bb1da0>] glock_wait_internal [gfs] 0x17c (0xf5384d60)
^@[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5384d70)
^@[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d7c)
^@[<f8bb20e2>] gfs_glock_nq [gfs] 0x6a (0xf5384d90)
^@[<f8bb2780>] nq_m_sync [gfs] 0x70 (0xf5384db0)
^@[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5384dc0)
^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385490)
^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385520)
^@[<c01342f2>] timer_bh [kernel] 0x62 (0xf538579c)
^@[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385858)
^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5385864)
^@[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf53858c4)
^@[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf53858dc)
^@[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf53858f4)
^@[<c0246218>] ip_output [kernel] 0x88 (0xf5385914)
^@[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5385934)
^@[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385968)
^@[<f88165e9>] __scsi_end_request [scsi_mod] 0xc9 (0xf53859b4)
^@[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf53859cc)
^@[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf53859d8)
^@[<c02594c0>] tcp_transmit_skb [kernel] 0x2c0 (0xf53859ec)
^@[<c0134040>] process_timeout [kernel] 0x0 (0xf5385a34)
^@[<c0122046>] wake_up_process [kernel] 0x26 (0xf5385a44)
^@[<c01345d6>] __run_timers [kernel] 0xb6 (0xf5385a5c)
^@[<c01342f2>] timer_bh [kernel] 0x62 (0xf5385a88)
^@[<c012ef65>] bh_action [kernel] 0x55 (0xf5385a9c)
^@[<c012ee07>] tasklet_hi_action [kernel] 0x67 (0xf5385aa4)
^@[<c010db48>] do_IRQ [kernel] 0x148 (0xf5385ad8)
^@[<c010da00>] do_IRQ [kernel] 0x0 (0xf5385afc)
^@[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5385b20)
^@[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5385b30)
^@[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5385b50)
^@[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5385b80)
^@[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5385c10)
^@[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5385c60)
^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385c80)
^@[<f8ba67ea>] gfs_shrink [gfs] 0x34e (0xf5385cc0)
^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385ce0)
^@[<f8b9de28>] xmote_inode_bh [gfs] 0x44 (0xf5385d40)
^@[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5385d60)
^@[<f8bd7740>] gfs_inode_glops [gfs] 0x0 (0xf5385d6c)
^@[<f8bd7bc0>] gfs_sops [gfs] 0x0 (0xf5385d7c)
^@[<c017cb75>] iput [kernel] 0x55 (0xf5385d84)
^@[<f8ba13b7>] gfs_permission [gfs] 0xc3 (0xf5385da0)
^@[<f8ba6936>] gfs_truncatei [gfs] 0xc6 (0xf5385dc0)
^@[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385dd0)
^@[<c0140168>] vmtruncate [kernel] 0x98 (0xf5385e08)
^@[<f8b9eb6e>] gfs_setattr [gfs] 0x34a (0xf5385e20)
^@[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385e30)
^@[<f9032f8d>] find_fh_dentry [nfsd] 0x22d (0xf5385e44)
^@[<c0139773>] in_group_p [kernel] 0x23 (0xf5385e58)
^@[<c016e832>] vfs_permission [kernel] 0x82 (0xf5385e60)
^@[<c017dc3e>] notify_change [kernel] 0x2ce (0xf5385eb0)
^@[<f9034809>] nfsd_setattr [nfsd] 0x3f9 (0xf5385ecc)
^@[<f891ba0e>] svc_sock_enqueue [sunrpc] 0x1de (0xf5385ee8)
^@[<f903b6ef>] nfsd3_proc_setattr [nfsd] 0x7f (0xf5385f24)
^@[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f3c)
^@[<f903d873>] nfs3svc_decode_sattrargs [nfsd] 0x73 (0xf5385f40)
^@[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f50)
^@[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f58)
^@[<f903064e>] nfsd_dispatch [nfsd] 0xce (0xf5385f5c)
^@[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f70)
^@[<f891b65f>] svc_process_Rsmp_462cdaea [sunrpc] 0x42f (0xf5385f78)
^@[<f9030407>] nfsd [nfsd] 0x207 (0xf5385fb0)
^@[<f9030200>] nfsd [nfsd] 0x0 (0xf5385fe0)
^@[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf5385ff0)
io05 login: lock_gulm: Checking for journals for dead node "io04"
^@GFS: fsid=mhpcc:workspace4, jid=5: Trying to acquire journal lock...
^@GFS: fsid=mhpcc:workspace3, jid=5: Trying to acquire journal lock...
^@GFS: fsid=mhpcc:workspace2, jid=5: Trying to acquire journal lock...
^@GFS: fsid=mhpcc:workspace3, jid=5: Busy
^@GFS: fsid=mhpcc:workspace1, jid=5: Trying to acquire journal lock...
^@GFS: fsid=mhpcc:workspace2, jid=5: Busy
^@GFS: fsid=mhpcc:workspace1, jid=5: Busy
^@GFS: fsid=mhpcc:workspace4, jid=5: Busy
^@do_IRQ: stack overflow: 984
^@f5c1ccd8 000003d8 00000000 74409c72 00000001 00000d00 00000001
^@ c010da00 c03ec324 00000001 e5dc3768 00000000 00000001 f5c1d414
^@ ffffffff f5c10068 f8bb0068 ffffff00 f8bb04df 00000060 00000282
^@Call Trace: [<c010da00>] do_IRQ [kernel] 0x0 (0xf5c1ccf8)
^@[<f8bb0068>] gfs_writei [gfs] 0x1a4 (0xf5c1cd20)
^@[<f8bb04df>] gfs_sort [gfs] 0x5b (0xf5c1cd28)
^@[<f8bb275b>] nq_m_sync [gfs] 0x4b (0xf5c1cd70)
^@[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5c1cd80)
^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5c1d470)
^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d864)
^@[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf5c1d8c4)
^@[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf5c1d8dc)
^@[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf5c1d8f4)
^@[<c0246218>] ip_output [kernel] 0x88 (0xf5c1d914)
^@[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5c1d934)
^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d9a0)
^@[<c026aa92>] arp_process [kernel] 0xa2 (0xf5c1d9b8)
^@[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf5c1d9cc)
^@[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf5c1d9d8)
^@[<c0222514>] alloc_skb [kernel] 0xc4 (0xf5c1d9f0)
^@[<c0134040>] process_timeout [kernel] 0x0 (0xf5c1da34)
^@[<c0122046>] wake_up_process [kernel] 0x26 (0xf5c1da44)
^@[<c0155b1b>] rmqueue [kernel] 0x35b (0xf5c1da64)
^@[<c0155d17>] __alloc_pages [kernel] 0x97 (0xf5c1daa0)
^@[<c0122ea1>] scheduler_tick [kernel] 0x3d1 (0xf5c1dab0)
^@[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5c1db20)
^@[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5c1db30)
^@[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5c1db50)
^@[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5c1db80)
^@[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5c1dc10)
^@[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5c1dc60)
^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5c1dc80)
Also, for the Red Hat EL folks, nodes don't come back to life after the
I/O node (NFS server) reboots:
I don't think this has a whole lot to do with NFS. It's probably more
of an issue with deallocating large files.
That and I don't really understand the backtraces.
I've been trying to recreate this bug without success. If I could
get a more detailed description of the machines that GFS is running
on, that would be helpful. Specifically, the output
of "cat /proc/cpuinfo" would be a great help. I've also been looking
into the possibility that this bug isn't any one piece of software's
fault, but that the stack space was simply nickel-and-dimed away.
If that's the case, we can probably reduce the stack space used up
by gfs when it's deallocating files.
FWIW, the client-side noac might do the opposite of what you intend
(which is to increase performance). The noac option turns off all attribute
caching and, thus, ensures that all client-side attributes are in
sync with the server, at the cost of constant checking of attributes
with the server. You probably want to set 'noatime' on the client
mount options and try leaving ac on (which is the default).
Also, please update this bug with the exact version of 5.2.1 they
are running. Multiple fixes have been made since the introduction
of the Opteron to reduce GFS' use of stack space and may alleviate
this problem if they upgrade to the latest.
From the customer:
# pdsh -w io[01-16] "rpm -qa | grep GFS" | dshbak -c
*** Bug 139867 has been marked as a duplicate of this bug. ***
I am still not able to recreate this problem on my machines. I have
an idea that will generate some more useful information.
Unfortunately, it involves having the customer run a modified gfs
module. The new module would work exactly like their current one,
except that at the start of each gfs function, it would perform the
check currently being performed in the interrupt. If it found that the
available stack size was under 1K, it would print the stack (just like
the interrupt code currently does), but it would also print a gfs
internal stack trace (to disambiguate the kernel stack trace, at least
for the gfs portions), and a raw hex dump of the entire stack. Then it
would halt the machine, so stuff doesn't keep on getting printed.
From this information, I could figure out exactly how much stack space
each function was using. Most likely this will make the problem
easier to recreate (since you are checking on every gfs function, not
just in interrupts). Even if this check never finds the overflow, that
is still useful information, because it means that whatever is using
up the stack is running in an interrupt context, which points to
Of course, this all hinges on the customer's willingness to run a
modified gfs module. If someone could find out whether or not they are
o.k. with this, that would be a big help.
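The per-function check described above can be sketched in user-space C. This is only an illustration of the arithmetic, not the module code: it assumes an 8 KB, 8 KB-aligned stack that grows downward, and the names STACK_SIZE, LOW_WATER_BYTES, stack_bytes_free() and stack_is_low() are hypothetical. The size of the reserved area at the bottom of the stack is passed in as a parameter because it depends on the kernel build.

```c
#include <stdint.h>

/* Assumptions for illustration only: an 8 KB, 8 KB-aligned stack that
 * grows downward, with some bytes at the bottom reserved (for example
 * for a per-task structure). */
#define STACK_SIZE      8192UL
#define LOW_WATER_BYTES 1024UL   /* "under 1K" threshold from the comment */

/* Bytes still free below stack pointer `sp`, given `reserved` bytes at
 * the bottom of the stack region that must not be overwritten. A
 * negative result means the reserved area has already been overrun. */
static long stack_bytes_free(uintptr_t sp, unsigned long reserved)
{
    unsigned long offset = (unsigned long)(sp & (STACK_SIZE - 1));
    return (long)offset - (long)reserved;
}

/* Nonzero when fewer than LOW_WATER_BYTES remain. */
static int stack_is_low(uintptr_t sp, unsigned long reserved)
{
    return stack_bytes_free(sp, reserved) < (long)LOW_WATER_BYTES;
}
```

In a real module the stack pointer would have to be read from the CPU at the start of each function, and a low result would trigger the stack dump and halt described above. Here the value is just an argument so the arithmetic can be checked in isolation.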
Forget about that last comment. I found the bug. There are some GFS
functions, namely gfs_glock_nq_m() and nq_m_sync(), that create
variable size arrays on the stack, depending on their arguments. For
some reason, the customer's load is causing them to create arrays that
take up 3184 bytes of stack space. I've been staring at backtraces for far too
long, and I'm going home now, but this should be fixed tomorrow.
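For readers unfamiliar with the pattern, here is a minimal user-space sketch of why an argument-sized array on the stack is risky and how moving it to the heap bounds the stack cost. It is not the actual GFS patch: struct holder, compare_holders() and sorted_lock_numbers() are hypothetical stand-ins, and malloc() stands in for whatever allocator kernel code would use.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a lock holder; not the real GFS structure. */
struct holder {
    unsigned long lock_number;
};

/* qsort comparator over an array of pointers to holders. */
static int compare_holders(const void *a, const void *b)
{
    const struct holder *x = *(const struct holder *const *)a;
    const struct holder *y = *(const struct holder *const *)b;
    return (x->lock_number > y->lock_number) -
           (x->lock_number < y->lock_number);
}

/*
 * Writes the lock numbers of holders[0..n-1] to out[] in ascending order.
 * A stack-hungry version would declare `struct holder *scratch[n];` here,
 * costing n * sizeof(pointer) bytes of stack for every call. This version
 * takes the scratch array from the heap, so its stack use does not grow
 * with n. Returns 0 on success, -1 if the allocation fails.
 */
static int sorted_lock_numbers(struct holder *holders, unsigned int n,
                               unsigned long *out)
{
    struct holder **scratch;
    unsigned int i;

    if (n == 0)
        return 0;

    scratch = malloc(n * sizeof(*scratch));
    if (scratch == NULL)
        return -1;

    for (i = 0; i < n; i++)
        scratch[i] = &holders[i];

    qsort(scratch, n, sizeof(*scratch), compare_holders);

    for (i = 0; i < n; i++)
        out[i] = scratch[i]->lock_number;

    free(scratch);
    return 0;
}
```

The trade-off is an allocation that can fail and must be handled, in exchange for a stack footprint that no longer depends on the caller's arguments.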
The fix is in. rpms are either being generated, or will be shortly.
I will post a message when the rpms are ready.
It sounds like this will be a simple module replacement, correct?
When should we expect the RPM?
Will it be w.r.t. U2, or will we need to upgrade?
If it's U2 compatible, and the RPM is available, we will be down for
service today... so we could try it out.
Yeah, it's just a module replacement. To verify that this fix solves
your problem, you can download a modified gfs.o module at
This module was built from the GFS-smp-5.2.1-188.8.131.52 source
for linux-2.4.21-15.ELsmp, with a patch added to correct the problem
To cut down on the number of different permutations of kernel/gfs
module that we need to support, we are simply adding this bug fix to
our latest rebuild, which is against 2.4.21-27 (the kernel for
RHEL3-U4). If this patched module works for you, you can just run with
it until RHEL3-U4 is released; then you should upgrade to the latest
How does that sound?
It's too late to try out today. We'll need to wait for the next
allowed system downtime.
Thanks for all your help! We really do appreciate it!
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.