Description of problem:
An nfs-exported GFS partition has severe problems with dlm. Under heavy NFS load from several NFS clients, dlm_recvd starts to consume all available CPU and eventually locks out GFS access from the cluster nodes; nfsd consumes the rest of the (SMP) CPU in the system. The only workaround is to reboot the server. The same problem exists with cluster-1.02.00, but the CVS version does not corrupt the cluster state, so the rest of the cluster remains operational after the problematic server is rebooted. The server also rejoins the cluster after reboot without any problem.

Version-Release number of selected component (if applicable):
cluster CVS stable, 2006-05-30

How reproducible:
Under heavy NFS load, when lots of small files are accessed from several NFS clients.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

dmesg log
----------------
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=2640, ret=256, limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=2896, iov_len=4096, iov_base[0]=ffff810048b83b50, read=2896
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=3840, ret=256, limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=4096, iov_len=1200, iov_base[0]=ffff810048b84000, read=1200

System info:
Gentoo Base System version 1.12.0_pre19
Portage 2.1_rc2-r3 (default-linux/amd64/2005.1, gcc-4.1.1, glibc-2.4-r3, 2.6.16-gentoo-r4 x86_64)
=================================================================
System uname: 2.6.16-gentoo-r4 x86_64 AMD Opteron(tm) Processor 246
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
dev-lang/python:     2.3.5-r2, 2.4.3-r1
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.18
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r2
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r3
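For reference, the field values that dlm_midcomms complains about can be cross-checked against the leading bytes of the hex dump. A minimal decoding sketch (assuming little-endian 32-bit words, consistent with this x86_64 host; the field layout shown is an assumption for illustration, not the actual dlm header struct):

```python
import struct

# First 16 bytes of the dumped receive buffer, copied from the dmesg log above:
# 78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
raw = bytes.fromhex("7803ce18" "00000001" "0054c0ff" "0181ffff")

# Decode as four little-endian u32s (hypothetical layout).
version, word1, lkid, lockspace = struct.unpack("<4I", raw)

print(hex(version))  # 0x18ce0378 -> the "bad header version" in the log
print(lkid)          # 4290794496 -> the lkid reported in the log
print(lockspace)     # 4294934785 -> the lockspace reported in the log
```

The decoded "lockspace" (0xffff8101) looks like the upper half of a kernel virtual address, which suggests the receive path is parsing a header at the wrong offset in the buffer rather than a stream containing garbage data.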
Is the cman/dlm traffic running on a separate network from the nfs traffic? If not, is it possible to try that?
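One common way to separate the two is to have each clusternode name in /etc/cluster/cluster.conf resolve to an address on a dedicated cluster network, so cman/dlm traffic never shares an interface with NFS. A minimal sketch, with hypothetical node names and addresses (the "-priv" hostnames would be mapped in /etc/hosts to the private network):

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: node1-priv/node2-priv resolve to addresses on a
     private network (e.g. 10.0.0.x) separate from the NFS-facing network. -->
<cluster name="gfscluster" config_version="1">
  <clusternodes>
    <clusternode name="node1-priv" votes="1"/>
    <clusternode name="node2-priv" votes="1"/>
  </clusternodes>
</cluster>
```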
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.