Bug 173811
| Field | Value |
| --- | --- |
| Summary | dlm: midcomms: bad header version 7fffffff |
| Product | [Retired] Red Hat Cluster Suite |
| Component | dlm |
| Version | 4 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Andreas Renner <arenner> |
| Assignee | Christine Caulfield <ccaulfie> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cluster-maint, rkenna |
| Keywords | Reopened |
| Fixed In Version | RHBA-2006-0558 |
| Doc Type | Bug Fix |
| Last Closed | 2006-08-10 21:26:55 UTC |
Description (Andreas Renner, 2005-11-21 15:04:35 UTC)
When you say "it takes anywhere from 1h to 12h until the whole cluster comes to a halt", do you mean that cman takes this long to form a cluster with all three nodes recognizing each other? What versions of cman and dlm are the other nodes in the cluster running?

The cluster is formed almost instantly; there's no problem at all. But after running for some unspecified time (roughly between 1h and 12h), one of the nodes produces said kernel log entry and the cluster is blocked. The versions on all nodes are:

```
Lock_Harness 1.01.00 (built Nov 20 2005 13:57:52) installed
CMAN 1.01.00 (built Nov 20 2005 13:57:29) installed
DLM 1.01.00 (built Nov 20 2005 13:57:45) installed
Lock_DLM (built Nov 20 2005 13:57:56) installed
GFS 1.01.00 (built Nov 20 2005 13:58:23) installed
```

Does the kernel have PREEMPT or hugemem enabled? If so, could you disable those things and try? I'm going to be trying to reproduce this on my own cluster.

The kernels had "preempt the big kernel lock" and highmem (4 GB) enabled:

```
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
```

I disabled both:

```
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
```

...and recompiled the cluster software (device-mapper, lvm, cluster) on all three nodes. After the cluster ran for about 10 minutes under some load, one of the nodes produced 33 error messages like the one in my first post. All 33 messages occurred in under 1 minute. After rebooting the node, it rejoined the cluster and now I'm waiting for the next crash ;-) If you need any more information, I would be glad to provide it.

Sorry for asking a dumb question, but did you recompile the kernel itself after disabling the options (not just the cluster software)?
Also, your output from /proc/cluster/status doesn't look like mine. I have "Node name" and "Node addresses" fields at the end. Could you run cman_tool -V to see if you have the latest userland tools?

I'm pretty sure about recompiling the kernel, because I did it from home over ssh, thought we had switched to grub on all nodes (which, of course, wasn't the case), and therefore didn't run lilo after installing the new kernel... Let's just say two of them didn't boot until I confronted them with my Knoppix CD the next day ;-) Also, yesterday I read about enabling CONFIG_EXT2_FS_POSIX_ACL on the mailing list. Are there any more kernel features I should be aware of?

OK, I have to admit I didn't copy the last two lines of /proc/cluster/status:

```
[root@jimknopf log]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 15
Cluster name: arnold
Cluster ID: 6568
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 6
Node name: jimknopf.cluster.uni-frankfurt.de
Node addresses: 10.1.1.6
[root@jimknopf log]# cman_tool -V
cman_tool 1.01.00 (built Nov 28 2005 16:37:59)
Copyright (C) Red Hat, Inc. 2004  All rights reserved.
```

Created attachment 121602 [details]
Debugging output from one node (all within 1 sec). After this event the cluster FS was blocking. Hope this sheds some light on the issue.
It looks like dlm packets are being corrupted by something. No one has been able to reproduce this.

We just received another instance of this, and I think I've found the cause. It's a bug in the lock query code: it does not allocate enough space for the data structure it is populating.

```
-rRHEL4
Checking in queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.2.1; previous revision: 1.9
done
-rSTABLE
Checking in src/queries.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/queries.c,v  <--  queries.c
new revision: 1.9.8.1; previous revision: 1.9
done
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0558.html