From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.7.8) Gecko/20050517 Firefox/1.0.4 (Debian package 1.0.4-2)

Description of problem:
After forming a 3-node cluster, it takes anywhere from 1 h to 12 h until the whole cluster comes to a halt. Then one (random) node produces a kernel log message similar to the following:

Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: bad header version 7fffffff
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: cmd=0, flags=0, length=91, lkid=2916746111, lockspace=16777219
Nov 21 14:24:29 jimknopf kernel: dlm: midcomms: base=d59c4000, offset=215, len=0, ret=215, limit=00001000 newbuf=0
Nov 21 14:24:29 jimknopf kernel: ff ff ff 7f 00 00 5b 00-7f 03 da ad 03 00 00 01
Nov 21 14:24:29 jimknopf kernel: 04 01 b2 02 96 00 00 00-01 00 00 00 84 77 51 c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00 64 cc 99 cb-00 00 00 00 00 00 00 00
Nov 21 14:24:29 jimknopf kernel: f8 39 ba e2 34 cc 99 cb-05 00 00 00 34 cc 99 cb
Nov 21 14:24:29 jimknopf kernel: c1 23 99 f8 34 cc 99 cb-f8 39 ba e2 01 ff ff ff
Nov 21 14:24:29 jimknopf kernel: 20 c5 3e c0
Nov 21 14:24:29 jimknopf kernel: 80 00 00 00
Nov 21 14:24:29 jimknopf kernel: 20
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: 00
Nov 21 14:24:29 jimknopf kernel: dlm: lowcomms: addr=d59c4000, base=0, len=215, iov_len=4096, iov_base[0]=d59c40d7, read=215

There is no attempt to fence the affected node, and the reported bad version is always 7fffffff. After rebooting the affected machine, the other nodes continue to work normally.

We run exim (MTA) on all nodes, which all share one spool directory on the cluster filesystem (GFS). In an earlier configuration, each node had its own spool subdirectory, and the problem did not appear then.

Version-Release number of selected component (if applicable):
DLM 1.01.00 (built Nov 20 2005 14:10:36)

How reproducible:
Always

Steps to Reproduce:
1. form a cluster
2. wait...
3.
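The "bad header version 7fffffff" line is the receive path rejecting a message whose header no longer matches the expected protocol version. A minimal sketch of that style of sanity check is below; the struct layout, field names, and `EXPECTED_VERSION` value are illustrative assumptions, not the actual dlm-kernel code:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wire header, loosely modeled on the fields printed
 * in the log above (version, cmd, flags, length, lkid, lockspace).
 * The real DLM header differs. */
struct msg_header {
    uint32_t version;
    uint8_t  cmd;
    uint8_t  flags;
    uint16_t length;
    uint32_t lkid;
    uint32_t lockspace;
};

#define EXPECTED_VERSION 0x00010001u  /* placeholder value */

/* Returns 0 if the header looks sane, -1 otherwise.  Memory
 * corruption in the receive buffer (e.g. a stray write from an
 * adjacent, under-sized allocation) shows up here as a nonsense
 * version such as 0x7fffffff. */
int check_header(const struct msg_header *hd)
{
    if (hd->version != EXPECTED_VERSION) {
        fprintf(stderr, "midcomms: bad header version %x\n",
                (unsigned)hd->version);
        fprintf(stderr, "midcomms: cmd=%u, flags=%u, length=%u, "
                        "lkid=%u, lockspace=%u\n",
                (unsigned)hd->cmd, (unsigned)hd->flags,
                (unsigned)hd->length, (unsigned)hd->lkid,
                (unsigned)hd->lockspace);
        return -1;
    }
    return 0;
}
```

Note that the check only detects corruption after the fact; it cannot say where the bad bytes came from, which is why the reporter's logs show the symptom far from the root cause.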
Additional info:
All three nodes run a vanilla 2.6.14.2 kernel, device-mapper 1.02.00 and LVM2 2.02.00.

[root@jimknopf ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6

[root@jimknopf ~]# cat /proc/cluster/services
Service          Name                     GID LID State     Code
Fence Domain:    "default"                  1   2 run       -
[1 3 2]

DLM Lock Space:  "clvmd"                    2   3 run       -
[1 3 2]

DLM Lock Space:  "clusterfs"                3   4 run       -
[1 3 2]

GFS Mount Group: "clusterfs"                4   5 run       -
[1 3 2]
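The status output shows Expected_votes: 3 and Quorum: 2, i.e. the usual simple-majority rule. A sketch of that arithmetic (this is just the formula implied by the numbers above, not cman's actual implementation):

```c
/* Simple-majority quorum: strictly more than half of the expected
 * votes.  For 3 expected votes this gives 2, matching the
 * /proc/cluster/status output above. */
int quorum(int expected_votes)
{
    return expected_votes / 2 + 1;
}
```

With quorum at 2, any single node can reboot (as the reporter does) while the remaining two nodes keep the cluster quorate.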
When you say "it takes anywhere from 1h to 12h until the whole cluster comes to a halt" do you mean that cman takes this long to form a cluster with all three nodes recognizing each other? What versions of cman and dlm are the other nodes in the cluster running?
The cluster is formed almost instantly; there's no problem at all. But after running for some unspecified time (roughly between 1 h and 12 h), one of the nodes produces said kernel log entry and the cluster is blocked.

The versions are, on all nodes:
Lock_Harness 1.01.00 (built Nov 20 2005 13:57:52) installed
CMAN 1.01.00 (built Nov 20 2005 13:57:29) installed
DLM 1.01.00 (built Nov 20 2005 13:57:45) installed
Lock_DLM (built Nov 20 2005 13:57:56) installed
GFS 1.01.00 (built Nov 20 2005 13:58:23) installed
Does the kernel have PREEMPT or hugemem enabled? If so, could you disable those things and try? I'm going to be trying to reproduce this on my own cluster.
The kernels had "Preempt the Big Kernel Lock" and highmem (4 GB) enabled:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y

I disabled both:

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set

...and recompiled the cluster software (device-mapper, LVM, cluster) on all three nodes. After the cluster had run for about 10 minutes under some load, one of the nodes produced 33 error messages like the one in my first post. All 33 messages occurred in under a minute. After rebooting the node, it rejoined the cluster, and now I'm waiting for the next crash ;-) If you need any more information, I would be glad to provide it.
Sorry for asking a dumb question, but did you recompile the kernel itself after disabling the options (not just the cluster software)? Also, your output from /proc/cluster/status doesn't look like mine. I have "Node name" and "Node addresses" fields at the end. Could you run cman_tool -V to see if you have the latest userland tools?
I'm pretty sure about recompiling the kernel, because I did it from home over ssh, thought we had switched to GRUB on all nodes (which, of course, wasn't the case), and therefore didn't run lilo after installing the new kernel... Let's just say two of them didn't boot until I confronted them with my Knoppix CD the next day ;-)

Also, yesterday I read about enabling CONFIG_EXT2_FS_POSIX_ACL on the mailing list. Are there any more kernel features I should be aware of?

OK, I have to admit I didn't copy the last two lines of /proc/cluster/status:

[root@jimknopf log]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6
Node name: jimknopf.cluster.uni-frankfurt.de
Node addresses: 10.1.1.6

[root@jimknopf log]# cman_tool -V
cman_tool 1.01.00 (built Nov 28 2005 16:37:59)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Created attachment 121602 [details]
Debugging output from one node (all within 1 sec)

After this event the cluster FS was blocking. Hope this sheds some light on the issue.
It looks like dlm packets are being corrupted by something. No one has been able to reproduce this.
We just received another instance of this, and I think I've found the cause. It's a bug in the lock query code: it does not allocate enough space for the data structure it is populating.

-rRHEL4
Checking in queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.2.1; previous revision: 1.9
done

-rSTABLE
Checking in src/queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.8.1; previous revision: 1.9
done
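For illustration, here is a minimal sketch of that class of bug: an allocation smaller than the structure being populated, so writes spill into adjacent memory and corrupt unrelated data (here surfacing later as a garbage midcomms header). The struct and function names are hypothetical, not the actual queries.c code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical query-reply structure, standing in for the one
 * populated by dlm-kernel's queries.c.  The real layout differs. */
struct query_reply {
    int      status;
    unsigned lkid;
    char     name[64];
};

/* Buggy pattern: sizing the allocation by the pointer type instead
 * of the object.  Populating the struct then writes far past the
 * end of the allocation, corrupting whatever is adjacent. */
struct query_reply *alloc_reply_buggy(void)
{
    return malloc(sizeof(struct query_reply *)); /* 4-8 bytes, not 72+! */
}

/* Fixed pattern: allocate the size of the object being populated. */
struct query_reply *alloc_reply_fixed(void)
{
    struct query_reply *r = malloc(sizeof(*r));
    if (r)
        memset(r, 0, sizeof(*r));
    return r;
}
```

The `sizeof(*r)` idiom makes the allocation track the pointee type automatically, which is why it is generally preferred over repeating the struct name.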
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0558.html