Description of problem:
Cluster still locks up on recovery after several rounds of killing master and
slave gulm servers.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Kill the master and slave gulm servers many times.
Actual results:
Cluster eventually hangs.
Expected results:
Cluster recovers successfully.
Taking this off the blocker list; some of the issues have been fixed, but there
may still be outstanding problems.
I hit this today during RHEL4-U3 errata testing. I was running gulm-1.0.6-0.
2 of 3 server nodes were shot. It doesn't appear that the server that rejoined
to form quorum expired the locks it had held prior to being shot.
I'm still hitting this in RHEL4-U4 testing.
The problem occurs if you kill enough masters for the remaining gulm server to
lose quorum. It then may not fence all of the killed gulm servers, resulting in
an inconsistent lock state. The problem can be fixed easily by fencing the lock
servers that were killed but not fenced previously. I'm working on a solution.
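For reference, losing quorum here follows from simple majority math: with N
configured lock servers, quorum requires floor(N/2)+1 live members. A minimal
sketch (hypothetical helper name, not gulm's actual API) of why killing two of
three servers drops quorum:

```python
def has_quorum(total_servers: int, live_servers: int) -> bool:
    """Majority quorum: strictly more than half the configured servers."""
    return live_servers >= total_servers // 2 + 1

# With 3 configured gulm servers, a lone survivor is below the majority of 2:
assert has_quorum(3, 3)      # full membership
assert has_quorum(3, 2)      # one server shot, quorum holds
assert not has_quorum(3, 1)  # two servers shot, quorum lost
```

Once quorum is lost, the survivor cannot safely decide on its own which dead
servers' locks to expire, which is why fencing the killed servers matters.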
I'm still hitting this in RHEL4-U4 testing. x86 cluster.
Hit this over the weekend on x86_64 during the "GULM kill Master and all but one
Slave" revolver scenario.
OK, so it appears that gulm was not properly propagating the full slave/client
list to the slaves. This should fix one type of lockup, and hopefully the
lockup that was occurring in this bug.
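A minimal sketch (generic, with hypothetical class and method names — not
gulm's actual implementation) of why a slave that never received the full
client list cannot expire stale locks after being promoted to master:

```python
# Hypothetical illustration: a master must replicate its client list to every
# slave; otherwise a slave promoted to master cannot expire dead clients'
# locks, and the cluster hangs waiting on locks that are never released.

class LockServer:
    def __init__(self):
        self.known_clients = set()   # membership replicated from the master
        self.held_locks = {}         # lock name -> owning client

    def replicate_membership(self, clients):
        """Master pushes its full client list to this slave."""
        self.known_clients = set(clients)

    def expire_dead_client(self, client):
        """On promotion, drop all locks held by a client known to be dead."""
        if client not in self.known_clients:
            return []  # never heard of this client: its locks leak silently
        freed = [name for name, owner in self.held_locks.items()
                 if owner == client]
        for name in freed:
            del self.held_locks[name]
        return freed

slave = LockServer()
slave.held_locks = {"dlm/inode42": "nodeA"}

# Without the replicated list, promotion expires nothing and the lock leaks:
assert slave.expire_dead_client("nodeA") == []

# With full propagation, the promoted slave frees the dead client's locks:
slave.replicate_membership(["nodeA", "nodeB"])
assert slave.expire_dead_client("nodeA") == ["dlm/inode42"]
```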
The fix is built in gulm-1.0.9-2.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.