From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.7.6) Gecko/20050322

Description of problem:
When the gulm master node also mounts a GFS filesystem, the fencing process does not run properly if the gulm master node has to be fenced. I may lose data because the recovery process begins too early (before fencing is finished).

Version-Release number of selected component (if applicable):
GFS-6.0.2.20-2

How reproducible:
Always

Steps to Reproduce:
1. Get an 8-node cluster.
2. Choose 5 nodes as gulm servers.
3. Mount a GFS filesystem on your 8 nodes.
4. Unplug the network of the current gulm master.
5. Wait until another gulm server becomes the master.
6. Do NOT run fence_ack_manual and check whether the locks of the unplugged node are released.

Actual Results:
1. The locks are released immediately when another gulm server becomes the master.
2. The journal is recovered by another node immediately as well.

Expected Results:
The recovery process should wait until the user runs fence_ack_manual.

Additional info:
More explanations here: https://www.redhat.com/archives/linux-cluster/2005-July/msg00000.html
I filed this bugzilla as requested here: https://www.redhat.com/archives/linux-cluster/2005-July/msg00006.html
Have you tried this with only three nodes as gulm servers? How does it behave then?
Just checked with three nodes; the bug is there too.
check_for_stale_expires() is tripping on everyone. It should only act when a jid mapping is marked 1 (live mappings are marked 2). The only time a jid mapping is marked 1 is when a node other than the owner is replaying the journal. Why are the live mappings getting switched from 2 to 1? I don't know, but I bet that's the bug right there. I'll look deeper.
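For illustration only, here is a minimal C sketch of the jid-mapping states described above. The names jid_map_state, JID_REPLAY, and JID_LIVE, and the whole structure, are assumptions made for this example; they are not the actual GFS/lock_gulm source. It only shows the intended distinction between state 1 (journal being replayed by a non-owner) and state 2 (live mapping), which the bug confuses.

/* Hypothetical sketch of the jid-mapping check described in this comment.
 * Identifiers are invented for illustration, not actual GFS code. */
#include <stdio.h>

#define JID_UNUSED 0
#define JID_REPLAY 1  /* marked 1: journal being replayed by a node other than its owner */
#define JID_LIVE   2  /* marked 2: live mapping, the owner still holds the journal */

#define MAX_JIDS 8
static int jid_map_state[MAX_JIDS];

/* Only mappings marked 1 (JID_REPLAY) should be treated as stale expires.
 * The bug: live mappings (2) were being flipped to 1, so this check
 * tripped on every node instead of only on journals awaiting replay. */
static void check_for_stale_expires(void)
{
	int jid;

	for (jid = 0; jid < MAX_JIDS; jid++) {
		if (jid_map_state[jid] == JID_REPLAY)
			printf("jid %d: stale expire, replay journal\n", jid);
		else if (jid_map_state[jid] == JID_LIVE)
			printf("jid %d: live mapping, leave it alone\n", jid);
	}
}

int main(void)
{
	jid_map_state[0] = JID_LIVE;    /* healthy node */
	jid_map_state[1] = JID_REPLAY;  /* failed node whose journal needs replay */
	check_for_stale_expires();
	return 0;
}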
Fixing the issue that was in comment #3 didn't fix the bug. Digging more.
The bug only appears when the master lock server is also mounting GFS. So a workaround is to put the lock servers onto dedicated nodes (see the sketch below).
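As a rough sketch of that workaround, the cluster.ccs lock_gulm section might look like the following. The hostnames lock1, lock2, lock3 (dedicated lock-server nodes that do not mount GFS) and the cluster name are assumptions for this example; check the GFS 6.0 documentation for the exact syntax on your release.

cluster {
        name = "example"
        lock_gulm {
                servers = ["lock1", "lock2", "lock3"]
        }
}

The nodes that mount the GFS filesystem are simply left out of the servers list, so the gulm master can never be a node that also has the filesystem mounted.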
There was a kludge that tried to fix something, but I cannot find or figure out what it was supposed to fix. That kludge was causing this. Betting on this being a bigger problem than whatever it was trying to fix, and removing the kludge. I think what it tried to fix was some weird edge case where multiple clients and lock servers failed in some way.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-723.html
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-733.html