Bug 173811 - dlm: midcomms: bad header version 7fffffff
Summary: dlm: midcomms: bad header version 7fffffff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-11-21 15:04 UTC by Andreas Renner
Modified: 2009-04-16 20:00 UTC
CC List: 3 users

Fixed In Version: RHBA-2006-0558
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:26:55 UTC
Embargoed:


Attachments
debugging output from one node (all within 1 sec) (3.32 KB, text/plain)
2005-11-29 17:53 UTC, Andreas Renner


Links
Red Hat Product Errata RHBA-2006:0558 (status: SHIPPED_LIVE, priority: normal): dlm-kernel bug fix update, last updated 2006-08-10 04:00:00 UTC

Description Andreas Renner 2005-11-21 15:04:35 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.7.8) Gecko/20050517 Firefox/1.0.4 (Debian package 1.0.4-2)

Description of problem:
After forming a 3-node cluster, it takes anywhere from 1h to 12h until the whole cluster comes to a halt. Then one (random) node produces kernel log messages similar to the following:

Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: bad header version 7fffffff
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: cmd=0, flags=0, length=91, lkid=2916746111, lockspace=16777219
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: base=d59c4000, offset=215, len=0, ret=215, limit=00001000 newbuf=0
Nov 21 14:24:29 jimknopf kernel: ff ff ff 7f 00 00 5b 00-7f 03 da ad 03 00 00 01
Nov 21 14:24:29 jimknopf kernel: 04 01 b2 02 96 00 00 00-01 00 00 00 84 77 51 c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00 64 cc 99 cb-00 00 00 00 00 00 00 00
Nov 21 14:24:29 jimknopf kernel: f8 39 ba e2 34 cc 99 cb-05 00 00 00 34 cc 99 cb
Nov 21 14:24:29 jimknopf kernel: c1 23 99 f8 34 cc 99 cb-f8 39 ba e2 01 ff ff ff
Nov 21 14:24:29 jimknopf kernel: 20 c5 3e c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00
Nov 21 14:24:29 jimknopf kernel: 20
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: dlm: lowcomms: addr=d59c4000, base=0, len=215, iov_len=4096, iov_base[0]=d59c40d7, read=215


There is no attempt to fence the affected node, and the reported bad version is always 7fffffff. After rebooting the affected machine, the other nodes continue to work normally.

We have exim (MTA) running on all nodes, which all share one spool directory on the cluster filesystem (GFS). In an earlier configuration, each node had its own spool subdirectory, and the problem did not appear then.
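
Reading the dumped bytes as little-endian 32-bit words reproduces the values printed in the log above: ff ff ff 7f gives the reported version 7fffffff, 7f 03 da ad gives lkid=2916746111, and 03 00 00 01 gives lockspace=16777219, so the version field of that message really does contain garbage rather than being misread. The short C sketch below only illustrates that decoding; the field offsets are inferred from matching the log values and are illustrative assumptions, not the actual dlm-kernel header layout.

/*
 * Illustrative decode of the first 16 bytes of the hex dump above.
 * The offsets are inferred from the log values and are assumptions,
 * not the real dlm-kernel header definition.
 */
#include <stdint.h>
#include <stdio.h>

static unsigned int le32(const uint8_t *p)
{
        return (unsigned int)p[0] | (unsigned int)p[1] << 8 |
               (unsigned int)p[2] << 16 | (unsigned int)p[3] << 24;
}

int main(void)
{
        const uint8_t buf[16] = {
                0xff, 0xff, 0xff, 0x7f,  0x00, 0x00, 0x5b, 0x00,
                0x7f, 0x03, 0xda, 0xad,  0x03, 0x00, 0x00, 0x01,
        };

        unsigned int version   = le32(buf);               /* 0x7fffffff */
        unsigned int cmd       = buf[4];                  /* 0          */
        unsigned int flags     = buf[5];                  /* 0          */
        unsigned int length    = buf[6] | buf[7] << 8;    /* 91         */
        unsigned int lkid      = le32(buf + 8);           /* 2916746111 */
        unsigned int lockspace = le32(buf + 12);          /* 16777219   */

        printf("version=%x cmd=%u flags=%u length=%u lkid=%u lockspace=%u\n",
               version, cmd, flags, length, lkid, lockspace);
        return 0;
}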

Version-Release number of selected component (if applicable):
DLM 1.01.00 (built Nov 20 2005 14:10:36)

How reproducible:
Always

Steps to Reproduce:
1. Form a cluster
2. Wait...

Additional info:

All three nodes run a vanilla 2.6.14.2 kernel, device-mapper.1.02.00 and LVM2.2.02.00.

[root@jimknopf ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6

[root@jimknopf ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 3 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 3 2]

DLM Lock Space:  "clusterfs"                         3   4 run       -
[1 3 2]

GFS Mount Group: "clusterfs"                         4   5 run       -
[1 3 2]

Comment 1 David Teigland 2005-11-21 21:19:42 UTC
When you say "it takes anywhere from 1h to 12h until
the whole cluster comes to a halt" do you mean that
cman takes this long to form a cluster with all three
nodes recognizing each other?

What versions of cman and dlm are the other nodes
in the cluster running?


Comment 2 Andreas Renner 2005-11-22 06:22:37 UTC
The cluster is formed almost instantly; there's no problem at all. But after running for some unspecified time (roughly between 1h and 12h), one of the nodes produces the kernel log entry quoted above and the cluster is blocked.

The versions on all nodes are:

Lock_Harness 1.01.00 (built Nov 20 2005 13:57:52) installed
CMAN 1.01.00 (built Nov 20 2005 13:57:29) installed
DLM 1.01.00 (built Nov 20 2005 13:57:45) installed
Lock_DLM (built Nov 20 2005 13:57:56) installed
GFS 1.01.00 (built Nov 20 2005 13:58:23) installed


Comment 3 David Teigland 2005-11-23 15:28:46 UTC
Does the kernel have PREEMPT or hugemem enabled?
If so, could you disable those things and try?
I'm going to be trying to reproduce this on my own cluster.

Comment 4 Andreas Renner 2005-11-24 12:14:31 UTC
The kernels had "Preempt the Big Kernel Lock" and highmem (4GB) enabled:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y

# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y


I disabled both:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set

...and recompiled the cluster software (device-mapper, lvm, cluster) on all three nodes. After the cluster ran for about 10 minutes under some load, one of the nodes produced 33 error messages like the one in my first post. All 33 messages occurred in under 1 minute.
After rebooting the node, it rejoined the cluster, and now I'm waiting for the next crash ;-)

If you need any more information, I would be glad to provide it.

Comment 5 David Teigland 2005-11-28 22:47:44 UTC
Sorry for asking a dumb question, but did you recompile
the kernel itself after disabling the options (not just
the cluster software)?

Also, your output from /proc/cluster/status doesn't
look like mine.  I have "Node name" and "Node addresses"
fields at the end.  Could you run cman_tool -V to see
if you have the latest userland tools?


Comment 6 Andreas Renner 2005-11-29 06:17:19 UTC
I'm pretty sure about recompiling the kernel, because I did it from home over SSH, thought we had switched to GRUB on all nodes (which, of course, wasn't the case), and therefore didn't run lilo after installing the new kernel... Let's just say two of them didn't boot until I confronted them with my Knoppix CD the next day ;-)

Also, yesterday I read on the mailing list about enabling CONFIG_EXT2_FS_POSIX_ACL. Are there any more kernel features I should be aware of?

OK, I have to admit I didn't copy the last two lines of /proc/cluster/status.

[root@jimknopf log]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6
Node name: jimknopf.cluster.uni-frankfurt.de
Node addresses: 10.1.1.6

[root@jimknopf log]# cman_tool -V
cman_tool 1.01.00 (built Nov 28 2005 16:37:59)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.


Comment 7 Andreas Renner 2005-11-29 17:53:30 UTC
Created attachment 121602 [details]
debugging output from one node (all within 1 sec)

After this event the cluster FS was blocking. Hope this sheds some light on the issue.

Comment 8 David Teigland 2006-01-04 16:24:24 UTC
It looks like dlm packets are being corrupted by something.
No one has been able to reproduce this.


Comment 9 Christine Caulfield 2006-05-08 08:49:28 UTC
We just received another instance of this and I think I've found the cause. It's
a bug in the lock query code not allocating enough space for the data structure
it is populating.
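
A minimal, hypothetical sketch of that class of bug follows; the struct, field and function names are invented for illustration and are not the actual queries.c code. A reply structure carrying a variable-length list is allocated with room for its fixed part only, so filling in the list writes past the end of the allocation. That can corrupt adjacent memory, such as a pending send buffer, which could then arrive on another node with a garbage header like the 7fffffff reported above.

/*
 * Hypothetical illustration of the bug class described in comment 9.
 * All names are invented; this is not the dlm-kernel queries.c code.
 */
#include <stdlib.h>

struct query_reply {
        unsigned int lkid;
        unsigned int num_locks;
        unsigned int lock_ids[];        /* variable-length payload */
};

/* Buggy: only the fixed part of the structure is allocated. */
struct query_reply *alloc_reply_buggy(unsigned int num_locks)
{
        (void)num_locks;                /* payload size ignored */
        return malloc(sizeof(struct query_reply));
}

/* Fixed: the allocation accounts for the variable-length payload too. */
struct query_reply *alloc_reply_fixed(unsigned int num_locks)
{
        return malloc(sizeof(struct query_reply) +
                      num_locks * sizeof(unsigned int));
}

/* Populating the reply overruns the buggy allocation as soon as
 * num_locks > 0, scribbling over whatever was allocated next to it. */
void fill_reply(struct query_reply *r, unsigned int num_locks)
{
        r->num_locks = num_locks;
        for (unsigned int i = 0; i < num_locks; i++)
                r->lock_ids[i] = i;
}

int main(void)
{
        struct query_reply *r = alloc_reply_fixed(4);
        fill_reply(r, 4);               /* safe with the fixed allocator */
        free(r);
        return 0;
}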


-rRHEL4
Checking in queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.2.1; previous revision: 1.9
done

-rSTABLE
Checking in src/queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.8.1; previous revision: 1.9
done

Comment 13 Red Hat Bugzilla 2006-08-10 21:26:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0558.html


