Description of problem:
An nfs-exported GFS partition has severe problems with dlm. Under heavy NFS load from several NFS clients, dlm_recvd starts to consume all available CPU and eventually locks out GFS access from the cluster nodes; nfsd consumes the rest of the (SMP) CPU in the system. The only workaround is to reboot the server. The same problem exists with cluster-1.02.00, but the CVS version does not corrupt the cluster state, so the rest of the cluster remains operational after the problematic server is rebooted. The server also rejoins the cluster after reboot without any problem.

Version-Release number of selected component (if applicable):
cluster CVS stable, 2006-05-30

How reproducible:
Under heavy NFS load, when lots of small files are accessed from several NFS clients.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

dmesg log
----------------
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=2640, ret=256, limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=2896, iov_len=4096, iov_base[0]=ffff810048b83b50, read=2896
dlm: midcomms: bad header version 18ce0378
dlm: midcomms: cmd=0, flags=0, length=256, lkid=4290794496, lockspace=4294934785
dlm: midcomms: base=ffff810048b83000, offset=256, len=3840, ret=256, limit=00001000 newbuf=0
78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-9c 01 09 19 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff-15 d4 16 80 ff ff ff ff
01 00 00 00 00 00 00 00-00 54 c0 ff 01 81 ff ff
d0 00 00 00 00 00 00 00-01 00 01 00 05 00 48 00
22 00 ee 18 08 00 00 01-00 54 c0 ff 01 81 ff ff
fe ff fe ff 02 81 ff ff-00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff-01 00 00 00 00 00 00 00
00 54 c0 ff 01 81 ff ff-d0 00 00 00 00 00 00 00
01 00 01 00 05 00 48 00-63 02 c9 18 08 00 00 01
00 54 c0 ff 01 81 ff ff-fe ff fe ff 02 81 ff ff
00 02 64 98 00 81 ff ff
15 d4 16 80 ff ff ff ff
dlm: lowcomms: addr=ffff810048b83000, base=0, len=4096, iov_len=1200, iov_base[0]=ffff810048b84000, read=1200

System info:
Gentoo Base System version 1.12.0_pre19
Portage 2.1_rc2-r3 (default-linux/amd64/2005.1, gcc-4.1.1, glibc-2.4-r3, 2.6.16-gentoo-r4 x86_64)
=================================================================
System uname: 2.6.16-gentoo-r4 x86_64 AMD Opteron(tm) Processor 246
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
dev-lang/python:     2.3.5-r2, 2.4.3-r1
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.18
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r2
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r3
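For reference, the field values that dlm_midcomms complains about can be cross-checked against the leading bytes of the hex dump. A minimal decoding sketch (assuming little-endian 32-bit words, consistent with this x86_64 host; the field layout shown is an assumption for illustration, not the actual dlm header struct):

```python
import struct

# First 16 bytes of the dumped receive buffer, copied from the dmesg log above:
# 78 03 ce 18 00 00 00 01-00 54 c0 ff 01 81 ff ff
raw = bytes.fromhex("7803ce18" "00000001" "0054c0ff" "0181ffff")

# Decode as four little-endian u32s (hypothetical layout).
version, word1, lkid, lockspace = struct.unpack("<4I", raw)

print(hex(version))  # 0x18ce0378 -> the "bad header version" in the log
print(lkid)          # 4290794496 -> the lkid reported in the log
print(lockspace)     # 4294934785 -> the lockspace reported in the log
```

The decoded "lockspace" (0xffff8101) looks like the upper half of a kernel virtual address, which suggests the receive path is parsing a header at the wrong offset in the buffer rather than a stream containing garbage data.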
Is the cman/dlm traffic running on a separate network from the nfs traffic? If not, is it possible to try that?
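One common way to separate the two is to have each clusternode name in /etc/cluster/cluster.conf resolve to an address on a dedicated cluster network, so cman/dlm traffic never shares an interface with NFS. A minimal sketch, with hypothetical node names and addresses (the "-priv" hostnames would be mapped in /etc/hosts to the private network):

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: node1-priv/node2-priv resolve to addresses on a
     private network (e.g. 10.0.0.x) separate from the NFS-facing network. -->
<cluster name="gfscluster" config_version="1">
  <clusternodes>
    <clusternode name="node1-priv" votes="1"/>
    <clusternode name="node2-priv" votes="1"/>
  </clusternodes>
</cluster>
```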
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.