Bug 139863
Summary: | GFS nodes panic when NFS exported fs mounted using noac | | |
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Rick Spurgeon <spurgeon> |
Component: | gfs | Assignee: | Ben Marzinski <bmarzins> |
Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | medium | Priority: | medium |
Version: | 3 | CC: | chrisw, etay, kanderso, tao |
Hardware: | All | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2004-12-21 15:58:29 UTC |
Description
Rick Spurgeon
2004-11-18 15:32:30 UTC
I don't think this has a whole lot to do with NFS. It's probably more of an issue with deallocating large files. That, and I don't really understand the backtraces. I've been trying to recreate this bug without success. If I could get a more detailed description of the machines that GFS is running on, that would be helpful. Specifically, the output of "cat /proc/cpuinfo" would be a great help. I've also been looking into the possibility that this bug isn't any one piece of software's fault, but that the stack space was simply nickel-and-dimed away. If that's the case, we can probably reduce the stack space used up by gfs when it's deallocating files.

FWIW, the client-side noac option might do the opposite of what you intend: it decreases performance rather than increasing it. The noac option turns off all attribute caching and thus ensures that all client-side attributes are in sync with the server, at the cost of constantly checking attributes with the server. You probably want to set 'noatime' in the client mount options and try leaving attribute caching on (which is the default). Also, please update this bug with the exact version of 5.2.1 they are running. Multiple fixes have been made since the introduction of the Opteron to reduce GFS' use of stack space, and these may alleviate this problem if they upgrade to the latest.

From the customer:

    # pdsh -w io[01-16] "rpm -qa | grep GFS" | dshbak -c
    ----------------
    io[01-16]
    ----------------
    GFS-smp-5.2.1-25.3.1.11

*** Bug 139867 has been marked as a duplicate of this bug. ***

I am still not able to recreate this problem on my machines. I have an idea that will generate some more useful information. Unfortunately, it involves having the customer run a modified gfs module. The new module would work exactly like their current one, except that at the start of each gfs function, it would perform the check currently being performed in the interrupt. If it found that the available stack size was under 1K, it would print the stack (just like the interrupt code currently does), but it would also print a gfs-internal stack trace (to disambiguate the kernel stack trace, at least for the gfs portions) and a raw hex dump of the entire stack. Then it would halt the machine, so stuff doesn't keep getting printed. From this information, I could figure out exactly how much stack space each function was using. Most likely this will also make the problem easier to recreate (since you are checking in every gfs function, not just in interrupts). Even if this check never finds the overflow, that is still useful information, because it means that whatever is using up the stack is running in an interrupt context, which points to device drivers. Of course, this all hinges on the customer's willingness to run a modified gfs module. If someone could find out whether or not they are OK with this, that would be a big help.
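[Editor's note: to make that proposal concrete, here is a minimal userspace sketch of a per-function stack check of the kind described above. It is an illustration, not the actual GFS instrumentation: STACK_SIZE, STACK_REDLINE, check_stack(), and deep() are hypothetical names, and a real 2.4-era kernel version would compute the remaining space from the task's fixed 8K stack rather than from a recorded base address.]

    /*
     * Minimal userspace sketch (not the actual GFS code) of the proposed
     * per-function stack check: on entry to each instrumented function,
     * estimate the remaining stack and bail out loudly if it drops below
     * a redline.  All names and sizes here are illustrative.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define STACK_SIZE    (8 * 1024)  /* 2.4-era kernel stacks were 8K per task */
    #define STACK_REDLINE 1024        /* complain when less than 1K remains */

    static uintptr_t stack_base;      /* recorded near the top of the stack */

    /* Estimate bytes of stack left, assuming a downward-growing stack. */
    static size_t stack_remaining(void)
    {
        char marker;
        return STACK_SIZE - (stack_base - (uintptr_t)&marker);
    }

    /* Would be called at the start of every instrumented function. */
    static void check_stack(const char *func)
    {
        if (stack_remaining() < STACK_REDLINE) {
            fprintf(stderr, "%s: only %zu bytes of stack left\n",
                    func, stack_remaining());
            abort();  /* the kernel version would dump the stack and halt */
        }
    }

    static void deep(int n)
    {
        volatile char buf[512];       /* simulate a function's stack frame */
        check_stack(__func__);
        buf[0] = (char)n;             /* keep the array from being optimized out */
        if (n > 0)
            deep(n - 1);
    }

    int main(void)
    {
        char marker;
        stack_base = (uintptr_t)&marker;
        deep(32);                     /* 32 frames of ~512 bytes blow the 8K budget */
        return 0;
    }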
Forget about that last comment. I found the bug. There are some GFS functions, namely gfs_glock_nq_m() and nq_m_sync(), that create variable-size arrays on the stack, sized by their arguments. For some reason, the customer's load is causing them to create arrays that eat up 3184 bytes of stack space. (A sketch of this pattern and of the fix appears at the end of this report.) I've been staring at backtraces for far too long, and I'm going home now, but this should be fixed tomorrow.

The fix is in. RPMs are either being generated, or will be shortly. I will post a message when the RPMs are ready.

It sounds like this will be a simple module replacement, correct? When should we expect the RPM? Will it be w.r.t. U2, or will we need to upgrade? If it's U2 compatible and the RPM is available, we will be down for service today... so we could try it out.

Yeah, it's just a module replacement. To verify that this fix solves your problem, you can download a modified gfs.o module at ftp://ftp.sistina.com/pub/misc/.test/gfs.o This module was built from the GFS-smp-5.2.1-25.3.1.11 source for linux-2.4.21-15.ELsmp, with a patch added to correct the problem I found. To cut down on the number of different permutations of kernel/gfs module that we need to support, we are simply adding this bug fix to our latest rebuild, which is against 2.4.21-27 (the kernel for RHEL3-U4). If this patched module works for you, you can just run with it until RHEL3-U4 is released; then you should upgrade to the latest gfs release. How does that sound?

Sounds good. It's too late to try it out today; we'll need to wait for the next allowed system downtime. Thanks for all your help! We really do appreciate it!

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-602.html
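[Editor's note: the sketch promised above. This is a minimal userspace illustration, not the GFS source, of the variable-length-array pattern behind this bug alongside the heap-allocation fix. struct holder, num_gh, and the function names are hypothetical stand-ins; the 398 eight-byte elements were chosen to match the 3184 bytes of stack reported above, and in the kernel the fix would use kmalloc()/kfree() rather than malloc()/free().]

    /*
     * Userspace sketch (not the GFS source) of the bug class fixed here:
     * an array sized by a caller-supplied count lands on the stack, so a
     * large count silently eats kilobytes of a small fixed kernel stack.
     * All names are hypothetical.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct holder { int lock_id; int flags; };   /* 8 bytes on most ABIs */

    /* Buggy pattern: stack usage grows with num_gh.
     * 398 holders * 8 bytes = the 3184 bytes seen in this bug. */
    static void nq_m_stack(unsigned int num_gh)
    {
        struct holder h[num_gh];                 /* VLA: lives on the stack */
        for (unsigned int i = 0; i < num_gh; i++) {
            h[i].lock_id = (int)i;
            h[i].flags = 0;
        }
        printf("stack version queued %u holders\n", num_gh);
    }

    /* Fixed pattern: move the array to the heap
     * (kmalloc()/kfree() in the kernel, malloc()/free() here). */
    static int nq_m_heap(unsigned int num_gh)
    {
        struct holder *h = malloc(num_gh * sizeof(*h));
        if (!h)
            return -1;
        for (unsigned int i = 0; i < num_gh; i++) {
            h[i].lock_id = (int)i;
            h[i].flags = 0;
        }
        printf("heap version queued %u holders\n", num_gh);
        free(h);
        return 0;
    }

    int main(void)
    {
        nq_m_stack(398);   /* ~3184 bytes of stack: harmless in userspace,
                              fatal on an already-deep 8K kernel stack */
        return nq_m_heap(398);
    }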