From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.7.8) Gecko/20050517 Firefox/1.0.4 (Debian package 1.0.4-2)

Description of problem:
After forming a 3-node cluster, it takes anywhere from 1 h to 12 h until the whole cluster comes to a halt. Then one (random) node produces a kernel log message similar to the following:

Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: bad header version 7fffffff
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: cmd=0, flags=0, length=91, lkid=2916746111, lockspace=16777219
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: base=d59c4000, offset=215, len=0, ret=215, limit=00001000 newbuf=0
Nov 21 14:24:29 jimknopf kernel: ff ff ff 7f 00 00 5b 00-7f 03 da ad 03 00 00 01
Nov 21 14:24:29 jimknopf kernel: 04 01 b2 02 96 00 00 00-01 00 00 00 84 77 51 c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00 64 cc 99 cb-00 00 00 00 00 00 00 00
Nov 21 14:24:29 jimknopf kernel: f8 39 ba e2 34 cc 99 cb-05 00 00 00 34 cc 99 cb
Nov 21 14:24:29 jimknopf kernel: c1 23 99 f8 34 cc 99 cb-f8 39 ba e2 01 ff ff ff
Nov 21 14:24:29 jimknopf kernel: 20 c5 3e c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00
Nov 21 14:24:29 jimknopf kernel: 20
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: dlm: lowcomms: addr=d59c4000, base=0, len=215, iov_len=4096, iov_base[0]=d59c40d7, read=215

There is no attempt to fence the affected node, and the reported bad version is always 7fffffff. After rebooting the affected machine, the other nodes continue to work normally.

We run exim (MTA) on all nodes, which all share one spool directory on the cluster filesystem (GFS). In an earlier configuration, each node had its own spool subdirectory, and the problem did not appear then.

Version-Release number of selected component (if applicable):
DLM 1.01.00 (built Nov 20 2005 14:10:36)

How reproducible:
Always

Steps to Reproduce:
1. form a cluster
2. wait...
3.
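The "bad header version 7fffffff" line is the receive path rejecting a message whose header no longer matches the expected protocol version. A minimal sketch of that style of sanity check is below; the struct layout, field names, and `EXPECTED_VERSION` value are illustrative assumptions, not the actual dlm-kernel code:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wire header, loosely modeled on the fields printed
 * in the log above (version, cmd, flags, length, lkid, lockspace).
 * The real DLM header differs. */
struct msg_header {
    uint32_t version;
    uint8_t  cmd;
    uint8_t  flags;
    uint16_t length;
    uint32_t lkid;
    uint32_t lockspace;
};

#define EXPECTED_VERSION 0x00010001u  /* placeholder value */

/* Returns 0 if the header looks sane, -1 otherwise.  Memory
 * corruption in the receive buffer (e.g. a stray write from an
 * adjacent, under-sized allocation) shows up here as a nonsense
 * version such as 0x7fffffff. */
int check_header(const struct msg_header *hd)
{
    if (hd->version != EXPECTED_VERSION) {
        fprintf(stderr, "midcomms: bad header version %x\n",
                (unsigned)hd->version);
        fprintf(stderr, "midcomms: cmd=%u, flags=%u, length=%u, "
                        "lkid=%u, lockspace=%u\n",
                (unsigned)hd->cmd, (unsigned)hd->flags,
                (unsigned)hd->length, (unsigned)hd->lkid,
                (unsigned)hd->lockspace);
        return -1;
    }
    return 0;
}
```

Note that the check only detects corruption after the fact; it cannot say where the bad bytes came from, which is why the reporter's logs show the symptom far from the root cause.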
Additional info:
All three nodes run a vanilla 2.6.14.2 kernel, device-mapper 1.02.00 and LVM2 2.02.00.

[root@jimknopf ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6

[root@jimknopf ~]# cat /proc/cluster/services
Service          Name                     GID LID State     Code
Fence Domain:    "default"                  1   2 run       -
[1 3 2]

DLM Lock Space:  "clvmd"                    2   3 run       -
[1 3 2]

DLM Lock Space:  "clusterfs"                3   4 run       -
[1 3 2]

GFS Mount Group: "clusterfs"                4   5 run       -
[1 3 2]
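The status output shows Expected_votes: 3 and Quorum: 2, i.e. the usual simple-majority rule. A sketch of that arithmetic (this is just the formula implied by the numbers above, not cman's actual implementation):

```c
/* Simple-majority quorum: strictly more than half of the expected
 * votes.  For 3 expected votes this gives 2, matching the
 * /proc/cluster/status output above. */
int quorum(int expected_votes)
{
    return expected_votes / 2 + 1;
}
```

With quorum at 2, any single node can reboot (as the reporter does) while the remaining two nodes keep the cluster quorate.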
When you say "it takes anywhere from 1h to 12h until the whole cluster comes to a halt" do you mean that cman takes this long to form a cluster with all three nodes recognizing each other? What versions of cman and dlm are the other nodes in the cluster running?
The cluster is formed almost instantly; there's no problem at all. But after running for some unspecified time (roughly between 1 h and 12 h), one of the nodes produces said kernel log entry and the cluster is blocked.

The versions are, on all nodes:
Lock_Harness 1.01.00 (built Nov 20 2005 13:57:52) installed
CMAN 1.01.00 (built Nov 20 2005 13:57:29) installed
DLM 1.01.00 (built Nov 20 2005 13:57:45) installed
Lock_DLM (built Nov 20 2005 13:57:56) installed
GFS 1.01.00 (built Nov 20 2005 13:58:23) installed
Does the kernel have PREEMPT or hugemem enabled? If so, could you disable those things and try? I'm going to be trying to reproduce this on my own cluster.
The kernels had "Preempt the Big Kernel Lock" and highmem (4 GB) enabled:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y

I disabled both:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set

...and recompiled the cluster software (device-mapper, LVM, cluster) on all three nodes. After the cluster had run for about 10 minutes under some load, one of the nodes produced 33 error messages like the one in my first post. All 33 messages occurred in under a minute. After rebooting the node, it rejoined the cluster, and now I'm waiting for the next crash ;-) If you need any more information, I would be glad to provide it.
Sorry for asking a dumb question, but did you recompile the kernel itself after disabling the options (not just the cluster software)? Also, your output from /proc/cluster/status doesn't look like mine. I have "Node name" and "Node addresses" fields at the end. Could you run cman_tool -V to see if you have the latest userland tools?
I'm pretty sure about recompiling the kernel, because I did it from home over ssh, thought we had switched to GRUB on all nodes (which, of course, wasn't the case), and therefore didn't run lilo after installing the new kernel... Let's just say two of them didn't boot until I confronted them with my Knoppix CD the next day ;-)

Also, yesterday I read about enabling CONFIG_EXT2_FS_POSIX_ACL on the mailing list. Are there any more kernel features I should be aware of?

OK, I have to admit I didn't copy the last two lines of /proc/cluster/status:

[root@jimknopf log]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6
Node name: jimknopf.cluster.uni-frankfurt.de
Node addresses: 10.1.1.6

[root@jimknopf log]# cman_tool -V
cman_tool 1.01.00 (built Nov 28 2005 16:37:59)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Created attachment 121602 [details]
Debugging output from one node (all within 1 sec)

After this event the cluster FS was blocking. Hope this sheds some light on the issue.
It looks like dlm packets are being corrupted by something. No one has been able to reproduce this.
We just received another instance of this, and I think I've found the cause. It's a bug in the lock query code: it does not allocate enough space for the data structure it is populating.

-rRHEL4
Checking in queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.2.1; previous revision: 1.9
done

-rSTABLE
Checking in src/queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.8.1; previous revision: 1.9
done
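For illustration, here is a minimal sketch of that class of bug: an allocation smaller than the structure being populated, so writes spill into adjacent memory and corrupt unrelated data (here surfacing later as a garbage midcomms header). The struct and function names are hypothetical, not the actual queries.c code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical query-reply structure, standing in for the one
 * populated by dlm-kernel's queries.c.  The real layout differs. */
struct query_reply {
    int      status;
    unsigned lkid;
    char     name[64];
};

/* Buggy pattern: sizing the allocation by the pointer type instead
 * of the object.  Populating the struct then writes far past the
 * end of the allocation, corrupting whatever is adjacent. */
struct query_reply *alloc_reply_buggy(void)
{
    return malloc(sizeof(struct query_reply *)); /* 4-8 bytes, not 72+! */
}

/* Fixed pattern: allocate the size of the object being populated. */
struct query_reply *alloc_reply_fixed(void)
{
    struct query_reply *r = malloc(sizeof(*r));
    if (r)
        memset(r, 0, sizeof(*r));
    return r;
}
```

The `sizeof(*r)` idiom makes the allocation track the pointee type automatically, which is why it is generally preferred over repeating the struct name.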
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0558.html