210359 – Cluster nodes hang in vgscan at reboot time

Summary: Cluster nodes hang in vgscan at reboot time

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	David Teigland
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-10-11 17:41 UTC by Robert Peterson
Modified:	2009-09-03 16:51 UTC
CC List:	2 users
Fixed In Version:	5.0.0
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-11-28 21:28:50 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
console output, /proc/net/sctp/assocs and group_tool -v for all nodes (22.46 KB, text/plain) 2006-10-11 17:41 UTC, Robert Peterson
Time-adjusted, sorted and collated cman_tool dump from all nodes (186.82 KB, text/plain) 2006-10-11 17:46 UTC, Robert Peterson

Description Robert Peterson 2006-10-11 17:41:03 UTC

Description of problem:
During a "revolver" cluster recovery test on the smoke cluster,
three of the five nodes (camel, merit, and salem) were deliberately
killed by the test, leaving kool and winston to recover when they
rebooted.  Instead of recovering, all three killed nodes hung
in vgscan at reboot time.  It looks to me like some kind
of cman kernel communication error.

Version-Release number of selected component (if applicable):
RHEL5 beta 1 plus latest CVS code from 10 Oct 2006.

How reproducible:
Unknown - happened once so far.

Steps to Reproduce:
1. cd /root/sts-test on "smoke"
2. ../gfs/bin/revolver -f var/share/resource_files/smoke.xml -l $PWD -i 0 -L LITE -I -t 1
  
Actual results:
vgscan hangs forever.
Console has error messages such as:
dlm: Error sending to node 4 -32
dlm: clvmd: dlm_wait_function aborted
(See attached file.)

Expected results:
No vgscan hang

Additional info:
See attached files for console messages from all nodes
and group_tool dump information.

Comment 1 Robert Peterson 2006-10-11 17:41:03 UTC

Created attachment 138261
console output, /proc/net/sctp/assocs and group_tool -v for all nodes

Comment 2 Robert Peterson 2006-10-11 17:46:13 UTC

Created attachment 138262
Time-adjusted, sorted and collated cman_tool dump from all nodes

This is the output from a tool I wrote called grimoire.
Its function is to figure out all nodes in a cluster from cluster.conf,
collect daemon information from each (group_tool -dump), time-adjust
them all and collate them together.  The result is a timeline of what
happened from a groupd daemon point of view.
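The time-adjust-and-collate step described above can be sketched roughly as follows. This is not grimoire's code, which is not shown in this report; the struct layout, the per-node clock offsets, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one daemon log line collected from a node. */
struct log_entry {
    int node;          /* index of the node that logged the line */
    long timestamp;    /* seconds on that node's local clock */
    const char *text;  /* the log message itself */
};

/* Order entries by adjusted time, breaking ties by node index. */
static int compare_by_time(const void *a, const void *b)
{
    const struct log_entry *x = a;
    const struct log_entry *y = b;

    if (x->timestamp != y->timestamp)
        return (x->timestamp < y->timestamp) ? -1 : 1;
    return (x->node > y->node) - (x->node < y->node);
}

/*
 * Shift every entry onto a common clock using an assumed per-node offset,
 * then sort all entries from all nodes into a single timeline.
 */
void collate(struct log_entry *entries, size_t count,
             const long *clock_offset, size_t node_count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        assert(entries[i].node >= 0 &&
               (size_t)entries[i].node < node_count);
        entries[i].timestamp += clock_offset[entries[i].node];
    }
    qsort(entries, count, sizeof entries[0], compare_by_time);
}
```

How the clock offsets are measured is left open here; the sketch only shows that once each node's skew is known, collation reduces to an adjust-then-sort pass.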

Comment 3 Christine Caulfield 2006-10-12 09:20:25 UTC

The DLM sctp messages are the clue.
Here's a patch to fix it; I'll send this upstream.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 7bcea7c..867f93d 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -548,7 +548,7 @@ static int receive_from_sock(void)
 	}
 	len = iov[0].iov_len + iov[1].iov_len;
 
-	r = ret = kernel_recvmsg(sctp_con.sock, &msg, iov, 1, len,
+	r = ret = kernel_recvmsg(sctp_con.sock, &msg, iov, msg.msg_iovlen, len,
 				 MSG_NOSIGNAL | MSG_DONTWAIT);
 	if (ret <= 0)
 		goto out_close;
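
The patch above replaces a hard-coded segment count of 1 with msg.msg_iovlen, while len is already the sum of two iovec lengths. As a rough userspace analogue (not the DLM code, and using readv() on a pipe rather than kernel_recvmsg() on an SCTP socket), the sketch below shows how the segment count decides whether the wrapped part of a ring buffer gets filled. RING_SIZE, ring_receive and demo_receive are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define RING_SIZE 8

/*
 * Hypothetical helper: receive into a ring buffer whose free space wraps
 * past the end of the array, so it is described by two iovec segments.
 * nsegs is how many of those segments are handed to readv().
 * Returns the number of bytes stored, or -1 on error.
 */
ssize_t ring_receive(int fd, char *ring, size_t write_pos, int nsegs)
{
    struct iovec iov[2];

    assert(write_pos < RING_SIZE);
    assert(nsegs == 1 || nsegs == 2);

    /* Segment 0: from the write position to the end of the array. */
    iov[0].iov_base = ring + write_pos;
    iov[0].iov_len = RING_SIZE - write_pos;
    /* Segment 1: the wrapped-around space at the start of the array. */
    iov[1].iov_base = ring;
    iov[1].iov_len = write_pos;

    return readv(fd, iov, nsegs);
}

/* Demo: push 6 bytes through a pipe and receive with 1 or 2 segments. */
ssize_t demo_receive(int nsegs, char *ring)
{
    int fds[2];
    ssize_t got;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "ABCDEF", 6) != 6) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    memset(ring, '.', RING_SIZE);
    got = ring_receive(fds[0], ring, 5, nsegs);  /* 3 bytes left before wrap */
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

With a write position of 5, calling demo_receive(1, ring) stores only the three bytes that fit before the end of the array, whereas demo_receive(2, ring) stores all six, with the tail landing at the start of the buffer. Whether the kernel code failed in exactly this way is not stated in this report; the sketch only illustrates why the segment count should match the number of iovecs that were set up.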

Comment 4 Kiersten (Kerri) Anderson 2006-10-12 13:26:53 UTC

DLM kernel module change required to pass the cluster beta2 release criteria. 
Changing the component to dlm-kernel and rhel beta product.  Devel ACK.

Comment 5 RHEL Program Management 2006-10-12 13:33:01 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 7 Linda Wang 2006-10-18 14:51:25 UTC

Yes, this patch is in RHEL5B2.
