Bug 193851

Summary: corrupt dlm message during heavy nfs usage
Product: Red Hat Enterprise Linux 5
Reporter: Andrej Filipcic <andrej.filipcic>
Component: kernel
Assignee: David Teigland <teigland>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: medium
Version: 5.0
CC: ccaulfie, cluster-maint
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-12-12 21:49:04 UTC

Description Andrej Filipcic 2006-06-02 06:20:35 UTC
Description of problem:

An NFS-exported GFS partition has severe problems with dlm. After heavy NFS
load from several NFS clients, dlm_recvd starts to consume all available CPU
and eventually blocks GFS access from the cluster nodes. nfsd consumes the rest
of the CPU on the (SMP) system. The only workaround is to reboot the server.
The same problem occurs with cluster-1.02.00, but the CVS version does not
corrupt the cluster state, so the rest of the cluster remains operational after
the problematic server is rebooted. The server also rejoins the cluster after
the reboot without any problem.


Version-Release number of selected component (if applicable):
cluster CVS stable, 2006-05-30

How reproducible:
Occurs after heavy NFS load, when lots of small files are accessed from several
NFS clients.

Additional info:
dmesg log
----------------
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=2640, ret=256,
limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98
00 81 ff ff
15 d4 16 80
ff
ff
ff
ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=2896, iov_len=4096,
iov_base[0]=ffff810048b83b50, read=2896
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=3840, ret=256,
limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98
00 81 ff ff
15 d4 16 80
ff
ff
ff
ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=4096, iov_len=1200,
iov_base[0]=ffff810048b84000, read=1200
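
Note on reading the dump: the sketch below decodes the first 16 dumped bytes and reproduces the values that midcomms printed. The field layout used (32-bit version, 8-bit cmd, 8-bit flags, 16-bit length, 32-bit lkid, 32-bit lockspace, all little-endian) is an assumption inferred from the log lines themselves, not taken from the dlm source, and the names HEADER_LAYOUT, decode_header and find_all are hypothetical helpers. The same sketch also searches the dump for the byte sequence 01 00 01 00, which recurs in the data.

```python
import struct

# The 256 dumped bytes from the first "bad header version" record, transcribed
# row by row. The log prints the last row across several lines; it is joined here.
DUMP_HEX = """
78 03 ce 18 00 00 00 01 00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff 00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff 01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00 9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff 15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00 00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00 01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01 00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff 00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff 01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00 63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff 15 d4 16 80 ff ff ff ff
"""
DUMP = bytes.fromhex("".join(DUMP_HEX.split()))

# ASSUMPTION: header layout inferred from the logged values, little-endian:
# u32 version, u8 cmd, u8 flags, u16 length, u32 lkid, u32 lockspace.
HEADER_LAYOUT = "<IBBHII"
FIELDS = ("version", "cmd", "flags", "length", "lkid", "lockspace")


def decode_header(buf, offset=0):
    """Decode 16 bytes at `offset` using the assumed header layout."""
    return dict(zip(FIELDS, struct.unpack_from(HEADER_LAYOUT, buf, offset)))


def find_all(buf, pattern):
    """Return every offset at which `pattern` occurs in `buf`."""
    offsets = []
    pos = buf.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(pattern, pos + 1)
    return offsets


if __name__ == "__main__":
    hdr = decode_header(DUMP)
    print("version=%08x cmd=%d flags=%d length=%d lkid=%d lockspace=%d" % (
        hdr["version"], hdr["cmd"], hdr["flags"],
        hdr["length"], hdr["lkid"], hdr["lockspace"]))

    # Offsets of the recurring 01 00 01 00 sequence, and what the assumed
    # layout yields when applied at each of those offsets.
    for off in find_all(DUMP, bytes.fromhex("01000100")):
        print(off, decode_header(DUMP, off))
```

Decoded at offset 0, the assumed layout gives version=18ce0378, cmd=0, flags=0, length=256, lkid=4290794496 and lockspace=4294934785, matching the logged line. The sequence 01 00 01 00 appears at dump offsets 64, 136 and 208, a stride of 72 bytes, and the length field decoded at each of those offsets is also 72. That would be consistent with well-formed 72-byte messages whose headers start 8 bytes before the position the parser used, but this is only an inference from the dump and is not confirmed anywhere in this report.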

System info:
Gentoo Base System version 1.12.0_pre19
Portage 2.1_rc2-r3 (default-linux/amd64/2005.1, gcc-4.1.1, glibc-2.4-r3,
2.6.16-gentoo-r4 x86_64)
=================================================================
System uname: 2.6.16-gentoo-r4 x86_64 AMD Opteron(tm) Processor 246
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
dev-lang/python:     2.3.5-r2, 2.4.3-r1
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.18
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r2
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r3

Comment 1 David Teigland 2006-10-17 17:07:39 UTC
Is the cman/dlm traffic running on a separate network from the
nfs traffic?  If not, is it possible to try that?


Comment 2 Nate Straz 2007-12-13 17:40:50 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.