Bug 162422 - Recovery problem when the gulm master node is fenced
Summary: Recovery problem when the gulm master node is fenced
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gulm
Version: 3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: michael conrad tadpol tilstra
QA Contact: Cluster QE
URL: https://www.redhat.com/archives/linux...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-07-04 12:28 UTC by Alban Crequy
Modified: 2009-04-16 20:25 UTC
CC List: 1 user

Fixed In Version: RHBA-2005-723
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-10 15:26:07 UTC
Embargoed:




Links
Red Hat Product Errata RHBA-2005:723, SHIPPED_LIVE: GFS bug fix update (last updated 2005-09-30 04:00:00 UTC)
Red Hat Product Errata RHBA-2005:733, SHIPPED_LIVE: gulm bug fix update (last updated 2005-10-07 04:00:00 UTC)

Description Alban Crequy 2005-07-04 12:28:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.7.6) Gecko/20050322

Description of problem:
When the gulm master node also mounts a GFS filesystem, the fencing process does not run properly if the gulm master node has to be fenced.

I may lose data because the recovery process begins too early (before fencing has finished).


Version-Release number of selected component (if applicable):
GFS-6.0.2.20-2

How reproducible:
Always

Steps to Reproduce:
1. Set up an 8-node cluster.
2. Choose 5 nodes to be gulm servers (see the cluster.ccs sketch after these steps).
3. Mount a GFS filesystem on all 8 nodes.
4. Unplug the network of the current gulm master.
5. Wait until another gulm server becomes the master.
6. Do NOT run fence_ack_manual, and check whether the locks held by the unplugged node are released.
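
For reference, step 2's server set would be declared along these lines in cluster.ccs; this is a sketch in the GFS 6.0 CCS format with placeholder node names, not the reporter's actual configuration:

    cluster {
        name = "alpha"
        lock_gulm {
            servers = ["node1", "node2", "node3", "node4", "node5"]
        }
    }

With five lock servers, the cluster keeps gulm quorum (3 of 5) after the master is unplugged in step 4, so another server can take over as master.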
  

Actual Results:  1. The locks are released immediately when another gulm server becomes the master.
2. The journal is also recovered by another node immediately.

Expected Results:  The recovery process should wait until the user runs fence_ack_manual.
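
That is, the expected ordering is: mark the failed node expired, block until the fence completes (here, until the administrator runs fence_ack_manual), and only then release its locks and replay its journal. A minimal C sketch of that ordering, with invented names standing in for the real gulm operations:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real cluster operations. */
    static bool fence_acknowledged;  /* set when fence_ack_manual runs */

    static void mark_node_expired(const char *node)
    {
        printf("%s expired, fence pending\n", node);
    }

    static void replay_journal_and_release_locks(const char *node)
    {
        printf("replaying %s's journal and releasing its locks\n", node);
    }

    static void recover_node(const char *node)
    {
        mark_node_expired(node);
        /* The reported bug: recovery proceeded past this point without
         * waiting.  Correct behaviour is to block until the fence has
         * been confirmed. */
        while (!fence_acknowledged)
            ;  /* in gulm this wait is event-driven, not a busy loop */
        replay_journal_and_release_locks(node);
    }

    int main(void)
    {
        fence_acknowledged = true;  /* simulate the fence_ack_manual ack */
        recover_node("node3");
        return 0;
    }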

Additional info:

More explanations here:
https://www.redhat.com/archives/linux-cluster/2005-July/msg00000.html

I am filing this bugzilla as requested here:
https://www.redhat.com/archives/linux-cluster/2005-July/msg00006.html

Comment 1 michael conrad tadpol tilstra 2005-07-05 13:40:33 UTC
Have you tried this with only three nodes as gulm servers?  How does it behave
then?

Comment 2 michael conrad tadpol tilstra 2005-07-05 14:01:38 UTC
Just checked with three nodes; the bug is there too.

Comment 3 michael conrad tadpol tilstra 2005-07-05 15:08:36 UTC
check_for_stale_expires() is tripping on everyone.  It only runs if a jid
mapping is marked 1 (live mappings are marked 2).  The only time a jid mapping
is marked 1 is when a node other than the owner is replaying the journal.  Why
are the live mappings getting switched from 2 to 1?  I don't know, but I bet
that's the bug right there.  I'll look deeper.
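
To make the invariant concrete, here is a small C sketch of the check described above; the constant and structure names are invented for illustration and are not the actual gulm source:

    #include <stdio.h>

    #define JIDMAP_REPLAYING 1  /* journal being replayed by a non-owner */
    #define JIDMAP_LIVE      2  /* journal owned by a live, mounted node */

    struct jid_map {
        int jid;    /* journal id */
        int state;  /* JIDMAP_REPLAYING or JIDMAP_LIVE */
    };

    /* Should act only on mappings marked 1; the symptom above is that
     * live mappings (2) were being flipped to 1, making every journal
     * look like a stale replay candidate. */
    static void check_for_stale_expires(const struct jid_map *maps, int n)
    {
        for (int i = 0; i < n; i++) {
            if (maps[i].state == JIDMAP_REPLAYING)
                printf("jid %d: stale replay mapping, cleaning up\n",
                       maps[i].jid);
            /* JIDMAP_LIVE entries must be left alone */
        }
    }

    int main(void)
    {
        struct jid_map maps[] = { { 0, JIDMAP_LIVE }, { 1, JIDMAP_REPLAYING } };
        check_for_stale_expires(maps, 2);
        return 0;
    }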

Comment 4 michael conrad tadpol tilstra 2005-07-05 18:00:22 UTC
Fixing the issue described in comment #3 didn't fix the bug.  Digging more.

Comment 5 michael conrad tadpol tilstra 2005-07-05 18:05:17 UTC
The bug only appears when the master lock server is also mounting gfs.
So a workaround is to put the lock servers onto dedicated nodes.


Comment 7 michael conrad tadpol tilstra 2005-07-19 15:21:13 UTC
There was a kludge that tried to fix something, but I cannot find or figure out
what it was supposed to fix.  That kludge was causing this.  Betting on this
being a bigger problem than whatever it was trying to fix, and removing the
kludge.


I think what it tried to fix was some weird edge case where multiple clients
and lock servers failed in some way.

Comment 9 Red Hat Bugzilla 2005-09-30 14:56:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-723.html


Comment 10 Red Hat Bugzilla 2005-10-07 16:43:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-733.html


Comment 11 Red Hat Bugzilla 2005-10-10 15:26:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-723.html


