One of the Army Research Labs uses GFS in a large cluster (16 GFS nodes serving 128 compute nodes via NFS). The system has been installed since May, but is not yet in production. LNXI is the reseller. They have been disappointed with the I/O speed and, in an effort to improve this, the /etc/fstab on each of the compute nodes was changed to specify "noac" for the nfs imports. Since making this modification, though, the GFS nodes have started to have a serious problem with panicking. The problem seems to be related to transaction volume (this is a database type of app), as opposed to bandwidth. More info concerning the panics is shown at the bottom of this email. They are using RHEL3 U2 with GFS 5.2.1 "Lrrr", and might be persuaded to move to the latest RHEL/GFS, if there is a plausible argument to be made that this will solve their problem. According to the customer, "Best I can tell, this started two days ago... which would coincide with fixing another NFS/GFS speed problem: removing the "noac" from the client NFS mount attributes. This means this app would have hit the I/O subsystem harder, which may have led to the current panics. This is a guess. I can't find anything else that has changed. The GFS nodes are panicing. I documented some of the panic messages as shown below. It is a recursive panic... so, the node doesn't get restarted until it fences". do_IRQ: stack overflow: 736 ^@f5384be0 000002e0 00000001 c0435c80 00000001 00000c08 c68e4000 c68e4000 ^@ c010da00 c03ec324 c0435180 c695f080 00000a00 c68e4000 c68e4000 f5384c78 ^@ c0435c00 f62e0068 c0430068 ffffff00 c01232a4 00000060 00000282 c0435c80 ^@Call Trace: [<c010da00>] do_IRQ [kernel] 0x0 (0xf5384c00) ^@[<c01232a4>] schedule [kernel] 0x324 (0xf5384c30) ^@[<c01245aa>] io_schedule [kernel] 0x2a (0xf5384c7c) ^@[<c0161d5e>] __wait_on_buffer [kernel] 0x5e (0xf5384c88) ^@[<f8b996e5>] gfs_dreread [gfs] 0x61 (0xf5384cc0) ^@[<f8bc5906>] gfs_rgrp_read [gfs] 0xb6 (0xf5384ce0) ^@[<f8bb1826>] gfs_glock_xmote_th [gfs] 0x7a (0xf5384d20) ^@[<f8b9e219>] lock_rgrp [gfs] 0x2d (0xf5384d40) ^@[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d5c) ^@[<f8bb1da0>] glock_wait_internal [gfs] 0x17c (0xf5384d60) ^@[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5384d70) ^@[<f8bd7780>] gfs_rgrp_glops [gfs] 0x0 (0xf5384d7c) ^@[<f8bb20e2>] gfs_glock_nq [gfs] 0x6a (0xf5384d90) ^@[<f8bb2780>] nq_m_sync [gfs] 0x70 (0xf5384db0) ^@[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5384dc0) ^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385490) ^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5385520) ^@[<c01342f2>] timer_bh [kernel] 0x62 (0xf538579c) ^@[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385858) ^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5385864) ^@[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf53858c4) ^@[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf53858dc) ^@[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf53858f4) ^@[<c0246218>] ip_output [kernel] 0x88 (0xf5385914) ^@[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5385934) ^@[<f88377a2>] qla2x00_queuecommand [qla2300] 0x292 (0xf5385968) ^@[<f88165e9>] __scsi_end_request [scsi_mod] 0xc9 (0xf53859b4) ^@[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf53859cc) ^@[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf53859d8) ^@[<c02594c0>] tcp_transmit_skb [kernel] 0x2c0 (0xf53859ec) ^@[<c0134040>] process_timeout [kernel] 0x0 (0xf5385a34) ^@[<c0122046>] wake_up_process [kernel] 0x26 (0xf5385a44) ^@[<c01345d6>] __run_timers [kernel] 0xb6 (0xf5385a5c) ^@[<c01342f2>] timer_bh [kernel] 0x62 (0xf5385a88) ^@[<c012ef65>] bh_action [kernel] 0x55 (0xf5385a9c) ^@[<c012ee07>] tasklet_hi_action [kernel] 0x67 (0xf5385aa4) ^@[<c010db48>] do_IRQ [kernel] 0x148 (0xf5385ad8) ^@[<c010da00>] do_IRQ [kernel] 0x0 (0xf5385afc) ^@[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5385b20) ^@[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5385b30) ^@[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5385b50) ^@[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5385b80) ^@[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5385c10) ^@[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5385c60) ^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385c80) ^@[<f8ba67ea>] gfs_shrink [gfs] 0x34e (0xf5385cc0) ^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5385ce0) ^@[<f8b9de28>] xmote_inode_bh [gfs] 0x44 (0xf5385d40) ^@[<f8bb1cbf>] glock_wait_internal [gfs] 0x9b (0xf5385d60) ^@[<f8bd7740>] gfs_inode_glops [gfs] 0x0 (0xf5385d6c) ^@[<f8bd7bc0>] gfs_sops [gfs] 0x0 (0xf5385d7c) ^@[<c017cb75>] iput [kernel] 0x55 (0xf5385d84) ^@[<f8ba13b7>] gfs_permission [gfs] 0xc3 (0xf5385da0) ^@[<f8ba6936>] gfs_truncatei [gfs] 0xc6 (0xf5385dc0) ^@[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385dd0) ^@[<c0140168>] vmtruncate [kernel] 0x98 (0xf5385e08) ^@[<f8b9eb6e>] gfs_setattr [gfs] 0x34a (0xf5385e20) ^@[<f8ba2144>] gfs_truncator_page [gfs] 0x0 (0xf5385e30) ^@[<f9032f8d>] find_fh_dentry [nfsd] 0x22d (0xf5385e44) ^@[<c0139773>] in_group_p [kernel] 0x23 (0xf5385e58) ^@[<c016e832>] vfs_permission [kernel] 0x82 (0xf5385e60) ^@[<c017dc3e>] notify_change [kernel] 0x2ce (0xf5385eb0) ^@[<f9034809>] nfsd_setattr [nfsd] 0x3f9 (0xf5385ecc) ^@[<f891ba0e>] svc_sock_enqueue [sunrpc] 0x1de (0xf5385ee8) ^@[<f903b6ef>] nfsd3_proc_setattr [nfsd] 0x7f (0xf5385f24) ^@[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f3c) ^@[<f903d873>] nfs3svc_decode_sattrargs [nfsd] 0x73 (0xf5385f40) ^@[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f50) ^@[<f9043af0>] nfsd_version3 [nfsd] 0x0 (0xf5385f58) ^@[<f903064e>] nfsd_dispatch [nfsd] 0xce (0xf5385f5c) ^@[<f9044248>] nfsd_procedures3 [nfsd] 0x48 (0xf5385f70) ^@[<f891b65f>] svc_process_Rsmp_462cdaea [sunrpc] 0x42f (0xf5385f78) ^@[<f9030407>] nfsd [nfsd] 0x207 (0xf5385fb0) ^@[<f9030200>] nfsd [nfsd] 0x0 (0xf5385fe0) ^@[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf5385ff0) io05 login: lock_gulm: Checking for journals for dead node "io04" ^@GFS: fsid=mhpcc:workspace4, jid=5: Trying to acquire journal lock... ^@GFS: fsid=mhpcc:workspace3, jid=5: Trying to acquire journal lock... ^@GFS: fsid=mhpcc:workspace2, jid=5: Trying to acquire journal lock... ^@GFS: fsid=mhpcc:workspace3, jid=5: Busy ^@GFS: fsid=mhpcc:workspace1, jid=5: Trying to acquire journal lock... ^@GFS: fsid=mhpcc:workspace2, jid=5: Busy ^@GFS: fsid=mhpcc:workspace1, jid=5: Busy ^@GFS: fsid=mhpcc:workspace4, jid=5: Busy ^@do_IRQ: stack overflow: 984 ^@f5c1ccd8 000003d8 00000000 74409c72 00000001 00000d00 00000001 f5c1d414 ^@ c010da00 c03ec324 00000001 e5dc3768 00000000 00000001 f5c1d414 f5c1cd6c ^@ ffffffff f5c10068 f8bb0068 ffffff00 f8bb04df 00000060 00000282 f2d787e0 ^@Call Trace: [<c010da00>] do_IRQ [kernel] 0x0 (0xf5c1ccf8) ^@[<f8bb0068>] gfs_writei [gfs] 0x1a4 (0xf5c1cd20) ^@[<f8bb04df>] gfs_sort [gfs] 0x5b (0xf5c1cd28) ^@[<f8bb275b>] nq_m_sync [gfs] 0x4b (0xf5c1cd70) ^@[<f8bb259c>] glock_compare [gfs] 0x0 (0xf5c1cd80) ^@[<f8bb28d9>] gfs_glock_nq_m [gfs] 0x129 (0xf5c1d470) ^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d864) ^@[<c0235cbf>] qdisc_restart [kernel] 0x1f (0xf5c1d8c4) ^@[<c02289c0>] dev_queue_xmit [kernel] 0x290 (0xf5c1d8dc) ^@[<c02483ff>] ip_finish_output2 [kernel] 0xcf (0xf5c1d8f4) ^@[<c0246218>] ip_output [kernel] 0x88 (0xf5c1d914) ^@[<c0246560>] ip_queue_xmit [kernel] 0x310 (0xf5c1d934) ^@[<c0222909>] __kfree_skb [kernel] 0x139 (0xf5c1d9a0) ^@[<c026aa92>] arp_process [kernel] 0xa2 (0xf5c1d9b8) ^@[<c025f1bd>] tcp_v4_send_check [kernel] 0x4d (0xf5c1d9cc) ^@[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xf5c1d9d8) ^@[<c0222514>] alloc_skb [kernel] 0xc4 (0xf5c1d9f0) ^@[<c0134040>] process_timeout [kernel] 0x0 (0xf5c1da34) ^@[<c0122046>] wake_up_process [kernel] 0x26 (0xf5c1da44) ^@[<c0155b1b>] rmqueue [kernel] 0x35b (0xf5c1da64) ^@[<c0155d17>] __alloc_pages [kernel] 0x97 (0xf5c1daa0) ^@[<c0122ea1>] scheduler_tick [kernel] 0x3d1 (0xf5c1dab0) ^@[<f8bb0ca1>] gfs_init_holder [gfs] 0x21 (0xf5c1db20) ^@[<f8bb48c6>] gmalloc_wrapper [gfs] 0x1e (0xf5c1db30) ^@[<f8bc75e2>] gfs_rlist_alloc [gfs] 0x46 (0xf5c1db50) ^@[<f8ba5fd9>] do_strip [gfs] 0x179 (0xf5c1db80) ^@[<f8ba5d4f>] recursive_scan [gfs] 0x93 (0xf5c1dc10) ^@[<f8ba5de8>] recursive_scan [gfs] 0x12c (0xf5c1dc60) ^@[<f8ba5e60>] do_strip [gfs] 0x0 (0xf5c1dc80) Also, for the redhat EL folks, nodes don't come back to life after the I/O node (NFS server) reboots: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=129861
I don't think this has a whole lot to do with NFS. It's probably more of an issue with deallocating large files. That and I don't really understand the backtraces.
I've been trying to recreate this bug without success. If I could get a more detailed description of the machines that GFS is running on, that would be helpful. Specifically, the output of "cat /proc/cpuinfo" would be a great help. I've also been looking into the possibility that this bug isn't any one piece of software's fault, but that the stack space was simply nickled and dimed away. If that's the case, we can probably reduce the stack space used up by gfs when it's deallocating files.
FWIW, the client side noac might do the opposite of what you intend (increase performance). The noac option turns off all attribute caching and, thus, ensures that all client-side attributes are in sync with the server, at the cost of constant checking of attributes with the server. You probably want to set 'noatime' on the client mount options and try leaving ac on (which is the default). Also, please update this bug with the exact version of 5.2.1 they are running. Multiple fixes have been made since the introduction of the Opteron to reduce GFS' use of stack space and may alleviate this problem if they upgrade to the latest.
From the customer: # pdsh -w io[01-16] "rpm -qa | grep GFS" | dshbak -c ---------------- io[01-16] ---------------- GFS-smp-5.2.1-25.3.1.11
*** Bug 139867 has been marked as a duplicate of this bug. ***
I am still not able to recreate this problem on my machines. I have an idea that will generate some more useful information. Unfortunately, it involves having the customer run a modified gfs module. The new module would work exactly like their current one, except that at the start of each gfs function, it would perform the check currently being performed in the interrupt. If it found that the available stack size was under 1K, it would print the stack (just like the interrupt code currently does), but it would also print an gfs internal stack trace (to disabiguate the kernel stack trace, at least for the gfs portions), and a raw hex dump of the entire stack. Then it would halt the machine, so stuff doesn't keep on getting printed. From this information, I could figure out exactly how much stack space each function was using. Most likely this will make the problem easier to recreate (since you are checking on every gfs function, not just in interrupts). Even if this check never finds the overflow, that is still useful information, because it means that whatever is using up the stack is running in an interrupt context, which points to device drivers. Of course, this all hinges on the customer's willingness to run a modified gfs module. If someone could find out whether or not they are o.k. with this, that would be a big help.
Forget about that last comment. I found the bug. There are some GFS functions, namely gfs_glock_nq_m() and nq_m_sync(), that create variable size arrays on the stack, depending on their arguments. For some reason, the customer's load is causing them to create arrays that eat up 3184 bytes of stack space. I've been staring at backtraces for far too long, and I'm going home now, but this should be fixed tomorrow.
The fix is in. rpms are either being generated, or will be shortly. I will post a message when the rpms are ready.
It sounds like this will be a simple module replacement, correct? When should we expect the RPM? Will it be w.r.t. U2, or will we need to upgrade. If it's U2 compatible, and the RPM is available, we will be down for service today... so we could try it out.
Yeah, it's just a module replacement. To verify that this fix solves your problem, you can download a modified gfs.o module at ftp://ftp.sistina.com/pub/misc/.test/gfs.o This module was built from the GFS-smp-5.2.1-25.3.1.11 source for linux-2.4.21-15.ELsmp, with a patch added to correct the problem I found. To cut down on the number of different permutations of kernel/gfs module that we need to support, we are simply adding this bug fix to our latest rebuild, which is against 2.4.21-27 (the kernel for RHEL3-U4). If this patched module works for you, you can just run with it until RHEL3-U4 is released , then you should upgrade to the lastest gfs release. How does that sound?
Sounds good. It's too late to try out today. We'll need to wait for the next allowed system downtime. Thanks for all your help! We really do apprecate it!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-602.html