Bug 173811
| Field | Value |
| --- | --- |
| Summary | dlm: midcomms: bad header version 7fffffff |
| Product | [Retired] Red Hat Cluster Suite |
| Component | dlm |
| Version | 4 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Andreas Renner <arenner> |
| Assignee | Christine Caulfield <ccaulfie> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cluster-maint, rkenna |
| Keywords | Reopened |
| Fixed In Version | RHBA-2006-0558 |
| Doc Type | Bug Fix |
| Last Closed | 2006-08-10 21:26:55 UTC |
Description (Andreas Renner, 2005-11-21 15:04:35 UTC)
When you say "it takes anywhere from 1h to 12h until the whole cluster comes to a halt", do you mean that cman takes this long to form a cluster with all three nodes recognizing each other? What versions of cman and dlm are the other nodes in the cluster running?

The cluster is formed almost instantly; there's no problem at all. But after running for some unspecified time (roughly between 1h and 12h), one of the nodes produces said kernel log entry and the cluster is blocked. The versions on all nodes are:

```
Lock_Harness 1.01.00 (built Nov 20 2005 13:57:52) installed
CMAN 1.01.00 (built Nov 20 2005 13:57:29) installed
DLM 1.01.00 (built Nov 20 2005 13:57:45) installed
Lock_DLM (built Nov 20 2005 13:57:56) installed
GFS 1.01.00 (built Nov 20 2005 13:58:23) installed
```

Does the kernel have PREEMPT or hugemem enabled? If so, could you disable those things and try? I'm going to be trying to reproduce this on my own cluster.

The kernels had "preempt the big kernel lock" and highmem (4 GB) enabled:

```
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
```

I disabled both:

```
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
```

...and recompiled the cluster software (device-mapper, lvm, cluster) on all three nodes. After the cluster ran for about 10 minutes under some load, one of the nodes produced 33 error messages like the one in my first post. All 33 messages occurred in under 1 minute. After rebooting the node, it rejoined the cluster and now I'm waiting for the next crash ;-) If you need any more information, I would be glad to provide it.

Sorry for asking a dumb question, but did you recompile the kernel itself after disabling the options (not just the cluster software)?
Also, your output from /proc/cluster/status doesn't look like mine. I have "Node name" and "Node addresses" fields at the end. Could you run cman_tool -V to see if you have the latest userland tools?

I'm pretty sure about recompiling the kernel, because I did it from home over ssh, thought we had switched to grub on all nodes (which, of course, wasn't the case), and therefore didn't run lilo after installing the new kernel... Let's just say two of them didn't boot until I confronted them with my Knoppix CD the next day ;-) Also, yesterday I read about enabling CONFIG_EXT2_FS_POSIX_ACL on the mailing list. Are there any more kernel features I should be aware of?

OK, I have to admit I didn't copy the last two lines of /proc/cluster/status:

```
[root@jimknopf log]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6
Node name: jimknopf.cluster.uni-frankfurt.de
Node addresses: 10.1.1.6
[root@jimknopf log]# cman_tool -V
cman_tool 1.01.00 (built Nov 28 2005 16:37:59)
Copyright (C) Red Hat, Inc. 2004  All rights reserved.
```

Created attachment 121602 [details]
Debugging output from one node (all within 1 sec). After this event the cluster FS was blocking. Hope this sheds some light on the issue.
It looks like dlm packets are being corrupted by something. No one has been able to reproduce this.

We just received another instance of this, and I think I've found the cause. It's a bug in the lock query code: it does not allocate enough space for the data structure it is populating.

```
-rRHEL4
Checking in queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.2.1; previous revision: 1.9
done
-rSTABLE
Checking in src/queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.8.1; previous revision: 1.9
done
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0558.html