Bug 139867
| Field | Value |
|---|---|
| Summary | NFS servers (running GFS) panic (when gfs_dreread calls wait_on_buffer) as database daemon starts on NFS client nodes |
| Product | Red Hat Enterprise Linux 3 |
| Component | nfs-utils |
| Version | 3.0 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | medium |
| Hardware | i686 |
| OS | Linux |
| Reporter | Chris Worley <chrisw> |
| Assignee | Ben Marzinski <bmarzins> |
| CC | danderso, kanderso, kpreslan, shillman |
| Doc Type | Bug Fix |
| Last Closed | 2006-02-21 19:07:07 UTC |
Description: Chris Worley, 2004-11-18 15:40:06 UTC
Created attachment 106964 [details]
The last dozen panics on an I/O node

The problem is not on just one I/O node; many I/O nodes have panicked. The attachment shows the stack dumps at panic time for about a dozen panics of one I/O node. The panics are recursive, so the attachment is lengthy.
I've been trying to recreate this bug without success. If I could get a more detailed description of the machines that GFS is running on, that would be helpful; specifically, the output of "cat /proc/cpuinfo" would be a great help. I've also been looking into the possibility that this bug isn't any one piece of software's fault, but that the stack space was simply nickel-and-dimed away. If that's the case, we can probably reduce the stack space GFS uses when it's deallocating files. This would also explain why this happens only with NFS: adding NFS to the kernel stack might be just enough to cause it to overflow.

Hardware configuration: 20 dual 3.06 GHz Xeons (E7501 chipset), each with 4 GB RAM, connected through a QLA2312 HBA (per lspci) to QLogic SANboxes, which connect to two DDN S2A8000 couplets (four controllers altogether; 3 GB/s sustained throughput out 16 FC ports). The servers run GFS and NFS servers. Output to NFS clients is via channel-bonded GigE (e1000) to a Foundry FastIron GigE switch with ~200 GigE ports (using 16-port blades). The servers have no local disks; they use ext3 partitions off the same SAN for their local partitions.

The 128 NFS clients are the same configuration, without the SAN hardware and with only one GigE port per client. They run disklessly: mostly an NFS root boot, but really a RAM disk with NFS mounts on all the big directories under the root.

The problem seems to be application specific. The panic occurs during startup (the application runs about 5 threads on all client nodes).
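The stack-exhaustion hypothesis above can be illustrated with a small sketch: many moderately sized frames from stacked subsystems (NFS server, VFS, GFS, SCSI) can together overflow a fixed kernel stack even though no single frame is large. Some function names below echo the call path named in this bug's summary, but every frame size is invented purely for illustration; nothing here is taken from the actual panic traces.

```python
# Hypothetical per-frame stack usage (bytes) for an NFS-on-GFS call
# chain. All sizes are invented for illustration only.
call_chain = [
    ("nfsd",                  600),
    ("svc_process",           500),
    ("nfsd_write",            400),
    ("vfs_write",             300),
    ("gfs_write",             800),
    ("gfs_trans_begin",       700),
    ("gfs_glock_nq",          600),
    ("gfs_dreread",           500),
    ("wait_on_buffer",        300),
    ("scsi_dispatch",         900),
    ("qla2300_queuecommand",  800),
    ("irq_entry",            1200),
    ("softirq",              1000),
]

KERNEL_STACK = 8 * 1024  # fixed 8 KiB kernel stack, typical of i686 kernels of this era


def first_overflow(chain, limit=KERNEL_STACK):
    """Return (frame_name, cumulative_bytes) for the first frame that
    pushes cumulative usage past `limit`, or None if the chain fits."""
    used = 0
    for name, size in chain:
        used += size
        if used > limit:
            return name, used
    return None
```

With these invented sizes, no single frame comes close to the limit, yet the chain as a whole crosses 8 KiB only once the interrupt-time frames are stacked on top; trimming a few hundred bytes from the GFS frames (as suggested above) would make the same chain fit.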
An offline request asked that I add /proc/cpuinfo for one of the I/O nodes to this thread:

```
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3065.847
cache size      : 512 KB
physical id     : 0
siblings        : 1
runqueue        : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 6121.06

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3065.847
cache size      : 512 KB
physical id     : 3
siblings        : 1
runqueue        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 6121.06
```

*** This bug has been marked as a duplicate of 139863 ***

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
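When the same dump has to be collected from many I/O nodes, the requested fields can be pulled out mechanically. The sketch below is a generic parser for /proc/cpuinfo-style text, written for this discussion; it is not part of any tool mentioned in this report.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor block.
    Blocks are separated by blank lines; each line is 'key : value'."""
    cpus, cur = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:            # blank line ends the current processor block
            if cur:
                cpus.append(cur)
                cur = {}
            continue
        key, sep, val = line.partition(":")
        if sep:
            cur[key.strip()] = val.strip()
    if cur:
        cpus.append(cur)
    return cpus
```

For example, `parse_cpuinfo(open("/proc/cpuinfo").read())` yields a list of dicts, so the "model name", "siblings", and "flags" fields the assignee asked about can be compared across nodes with ordinary dict lookups.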