175805 – DLM closes connections when it shouldn't

Summary: DLM closes connections when it shouldn't

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm
Sub Component:
Version:	4
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-12-15 09:05 UTC by Christine Caulfield
Modified:	2009-04-16 20:00 UTC
CC List:	2 users
Fixed In Version:	RHBA-2006-0237
Clone Of:
Environment:
Last Closed:	2006-03-09 19:55:03 UTC
Embargoed:

Attachments
putative fix for the problem (466 bytes, patch) 2005-12-19 08:29 UTC, Christine Caulfield	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0237	0	normal	SHIPPED_LIVE	dlm-kernel bug fix update	2006-03-09 05:00:00 UTC

Description Christine Caulfield 2005-12-15 09:05:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
Reported by rvchan.com as
"GFS6.1 hangs - after fence_tool join succeeds"

Running a clvmd up/down script on a 3-node cluster causes one to hang after some time.

Ravi can reproduce this - I can't.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run Ravi's script

Actual Results:  One node hangs after several iterations. As it's stuck in recovery, the others get stuck too.

Expected Results:  Recovery completing normally.

Additional info:

This is the latest output I was sent after including some debug printks in lowcomms.c. The first entry is the key one. 


1. On the node that hung (gfs1)
==============================SNIP===============================================
Dec 14 18:03:52 gfs1-pvt kernel: PJC: sock_release before connect
==============================SNIP===============================================
Please note that the hang on this node occurred at 18:03:52. I noticed it
at around 19:13:40 and then rebooted this node.


2. On the node that fenced the hung node (gfs2)
==============================SNIP===============================================
Dec 14 19:14:26 gfs2-pvt kernel: dlm: clvmd: dlm_dir_rebuild_wait failed -1
Dec 14 19:14:30 gfs2-pvt kernel: PJC: closing connection because node 1 left the cluster
==============================SNIP===============================================


3. On the other node (gfs3)
==============================SNIP===============================================
Dec 14 18:04:05 gfs3-pvt kernel: PJC: closing connection after bad send: ret = -104
.........
Dec 14 19:14:29 gfs3-pvt kernel: dlm: clvmd: nodes_reconfig failed -1
Dec 14 19:14:30 gfs3-pvt kernel: PJC: closing connection because node 1 left the cluster
==============================SNIP===============================================
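
The three excerpts above are easier to follow when merged into one timeline. The following is a minimal sketch for doing that, not part of the dlm code or of the attached patch. The helper names (parse, annotate, timeline) are hypothetical, the year 2005 is assumed from the report date, and the only errno mapping included is the Linux value 104 = ECONNRESET, which is what "ret = -104" on gfs3 corresponds to.

```python
import re
from datetime import datetime

# Log lines copied from the excerpts in this report (wrapped lines rejoined).
LOG_LINES = [
    "Dec 14 18:03:52 gfs1-pvt kernel: PJC: sock_release before connect",
    "Dec 14 19:14:26 gfs2-pvt kernel: dlm: clvmd: dlm_dir_rebuild_wait failed -1",
    "Dec 14 19:14:30 gfs2-pvt kernel: PJC: closing connection because node 1 left the cluster",
    "Dec 14 18:04:05 gfs3-pvt kernel: PJC: closing connection after bad send: ret = -104",
    "Dec 14 19:14:29 gfs3-pvt kernel: dlm: clvmd: nodes_reconfig failed -1",
    "Dec 14 19:14:30 gfs3-pvt kernel: PJC: closing connection because node 1 left the cluster",
]

# Only the errno relevant to this report; on Linux, 104 is ECONNRESET.
LINUX_ERRNO_NAMES = {104: "ECONNRESET"}

LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) kernel: (.*)$")


def parse(line, year=2005):
    """Split a syslog line into (timestamp, host, message). Year is assumed."""
    stamp, host, msg = LINE_RE.match(line).groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, msg


def annotate(msg):
    """Append the errno name to messages of the form 'ret = -N', if known."""
    m = re.search(r"ret = -(\d+)", msg)
    if m:
        name = LINUX_ERRNO_NAMES.get(int(m.group(1)))
        if name:
            return f"{msg} ({name})"
    return msg


def timeline(lines):
    """Return all events sorted by time as (HH:MM:SS, host, message) tuples."""
    events = sorted(parse(line) for line in lines)
    return [(w.strftime("%H:%M:%S"), h, annotate(m)) for w, h, m in events]


if __name__ == "__main__":
    for when, host, msg in timeline(LOG_LINES):
        print(when, host, msg)
```

Sorted this way, the gfs1 "sock_release before connect" entry at 18:03:52 comes first, followed 13 seconds later by the connection reset seen on gfs3. The remaining entries are from more than an hour later, around the time gfs1 was rebooted.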

Comment 1 Christine Caulfield 2005-12-19 08:29:50 UTC

Created attachment 122392
putative fix for the problem

Ravi has been running this patch for 70 hours now with no hangs, but two
reported "incidents", so it looks like it might be the fix.

Comment 2 Christine Caulfield 2005-12-20 16:25:05 UTC

Checked into -rSTABLE & -rRHEL4 (but not U3)

Comment 3 Christine Caulfield 2006-01-24 16:43:28 UTC

Checked in for U3:

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.11.2.1; previous revision: 1.22.2.11
done

Comment 6 Red Hat Bugzilla 2006-03-09 19:55:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0237.html
