Bug 139867

Summary: NFS servers (running GFS) panic (when gfs_dreread calls wait_on_buffer) as database daemon starts on NFS client nodes
Product: Red Hat Enterprise Linux 3
Reporter: Chris Worley <chrisw>
Component: nfs-utils
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Version: 3.0
CC: danderso, kanderso, kpreslan, shillman
Hardware: i686
OS: Linux
Last Closed: 2006-02-21 19:07:07 UTC
Attachments: The last dozen panics on an IO node

Description Chris Worley 2004-11-18 15:40:06 UTC

Description of problem:
The NFS servers (running GFS) panic, with wait_on_buffer called from
gfs_dreread on the stack, as the database daemon starts up on the NFS
client nodes.
Version-Release number of selected component (if applicable):
GFS 5.2, RHEL 3 U2, kernel 2.4.21-15smp

How reproducible:
Always

Steps to Reproduce:
1. The user starts the app, which iterates through the NFS client
nodes, starting daemons on each (a stand-in load generator is sketched
under Additional info below).
2. The topology is 8 NFS client compute nodes per I/O server (16 I/O
servers, 128 compute nodes).
3. One or more of the I/O servers panics.

Actual Results:  One or more of the I/O servers panics.

Expected Results:  None of the I/O servers should panic.

Additional info:
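As a hypothetical stand-in for the application's startup burst in step
1 of the reproduction, the sketch below (userspace C) forks a handful
of daemons per client node that immediately write into an NFS-mounted
directory.  The daemon count comes from comment 3 (about 5 threads per
node); everything else (the mount point, file sizes, write pattern) is
invented, since the actual database app isn't described here.

/* nfsload.c: hypothetical per-client load generator (not the real app) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NDAEMONS 5      /* comment 3: the app runs about 5 threads per node */

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "/mnt/nfs";  /* assumed mount point */
    static char buf[65536];
    memset(buf, 0xAA, sizeof buf);

    for (int i = 0; i < NDAEMONS; i++) {
        if (fork() == 0) {                      /* child: one "daemon" */
            char path[256];
            snprintf(path, sizeof path, "%s/load.%d", dir, getpid());
            int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) { perror(path); _exit(1); }
            for (int j = 0; j < 1024; j++)      /* ~64 MB of writes */
                if (write(fd, buf, sizeof buf) < 0) { perror("write"); break; }
            close(fd);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)                      /* parent: reap all daemons */
        ;
    return 0;
}

Run one instance per client node against the NFS mount (e.g.
"./nfsload /mnt/nfs") to approximate the simultaneous startup across
all 128 clients.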

Comment 1 Chris Worley 2004-11-18 16:02:35 UTC
Created attachment 106964 [details]
The last dozen panics on an IO node

The problem is not confined to just one I/O node... many I/O nodes have panicked.

The attachment shows the stack dumps at panic time for about a dozen
panics of one I/O node.

The panics are recursive, so the attachment is lengthy.

Comment 2 Ben Marzinski 2004-11-20 22:20:23 UTC
I've been trying to recreate this bug, without success.  A more
detailed description of the machines GFS is running on would be
helpful; specifically, the output of "cat /proc/cpuinfo" would be a
great help.  I've also been looking into the possibility that this bug
isn't any one piece of software's fault, but that the kernel stack
space was simply nickel-and-dimed away.  If that's the case, we can
probably reduce the stack space used by GFS when it's deallocating
files.  This would also explain why the problem shows up only with
NFS: adding NFS's frames to the kernel stack might be just enough to
make it overflow.
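
To illustrate the arithmetic behind this theory: on i686, the 2.4
kernel gives each task a fixed 8 KB allocation shared between the
task_struct and the kernel stack, and interrupts run on the stack of
whichever task they land on, so a moderately deep nfsd-into-GFS call
chain can run out of room even though no single frame is outrageous.
The sketch below is userspace C and purely illustrative: gfs_dreread
and wait_on_buffer come from this bug's summary, but the rest of the
chain and every frame size are invented.

/* stack-budget.c: toy model of 2.4 i386 kernel stack exhaustion */
#include <stdio.h>

#define KSTACK_BYTES 8192   /* 2.4 i386: two pages shared by task_struct + stack */
#define TASK_STRUCT  1600   /* rough size of a 2.4 task_struct (assumed) */

struct frame {
    const char *func;       /* function on the hypothetical call chain */
    unsigned    bytes;      /* invented per-frame stack cost */
};

int main(void)
{
    static const struct frame chain[] = {
        { "nfsd",                 512 },
        { "nfsd_dispatch",        768 },
        { "nfsd_write",           896 },
        { "gfs_write",           1024 },
        { "gfs_writei",           768 },
        { "gfs_dreread",          896 },
        { "wait_on_buffer",       256 },
        { "<irq on same stack>", 2048 },
    };
    unsigned used = TASK_STRUCT;

    for (unsigned i = 0; i < sizeof chain / sizeof chain[0]; i++) {
        used += chain[i].bytes;
        printf("%-22s +%4u -> %5u / %u bytes%s\n",
               chain[i].func, chain[i].bytes, used, KSTACK_BYTES,
               used > KSTACK_BYTES ? "   <-- overflow" : "");
    }
    return 0;
}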

Comment 3 Chris Worley 2004-11-21 17:43:23 UTC
Hardware configuration:

20 dual 3.06GHz Xeons (E7501 chipset), each with 4GB RAM, connected
through a QLA2312 HBA (according to lspci) to QLogic SANboxes, which
in turn connect to two DDN S2A8000 couplets (4 controllers
altogether... 3GB/s sustained throughput out 16 FC ports).  These
machines run GFS and act as the NFS servers.  Output to the NFS
clients is via channel-bonded GigE (e1000) to a Foundry FastIron GigE
switch with ~200 GigE ports (using 16-port blades).  The servers have
no local disks... they use EXT3 partitions off the same SAN for their
local partitions.

The 128 NFS clients have the same configuration, minus the SAN
hardware and with only one GigE port per client.  They run
disklessly... mostly an NFS root boot, but really a RAM disk with NFS
mounts on all the big directories under the root.

The problem seems to be application-specific.  The panic occurs during
application startup (the application runs about 5 threads on every
client node).

Comment 4 Chris Worley 2004-11-21 22:54:30 UTC
An offline request asked that I add /proc/cpuinfo output for one of
the I/O nodes to this thread:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3065.847
cache size      : 512 KB
physical id     : 0
siblings        : 1
runqueue        : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 6121.06

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 9
cpu MHz         : 3065.847
cache size      : 512 KB
physical id     : 3
siblings        : 1
runqueue        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 6121.06



Comment 5 Ben Marzinski 2004-11-29 17:53:46 UTC

*** This bug has been marked as a duplicate of 139863 ***

Comment 6 Red Hat Bugzilla 2006-02-21 19:07:07 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.