From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3) Gecko/20040803

Description of problem:
5 node cluster (morph-01 - morph-05), all running I/O load to 4 filesystems.
Revolver shoots 3 nodes (morph-03, morph-04, morph-05). morph-01 and morph-02
detect that the nodes are gone:

morph-01:
Nov 11 11:36:09 morph-01 kernel: CMAN: quorum lost, blocking activity

morph-02:
Nov 11 11:37:02 morph-02 kernel: CMAN: no HELLO from morph-03, removing from the cluster
Nov 11 11:37:06 morph-02 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Nov 11 11:37:06 morph-02 kernel: CMAN: node morph-04 is not responding - removing from the cluster
Nov 11 11:37:08 morph-02 kernel: CMAN: quorum lost, blocking activity

Revolver waits a couple of minutes for the nodes to come back up and, when they do, attempts to bring them back into the cluster. When the cman join is attempted on the shot nodes, morph-01 decides to fence all three of them:

morph-01:
Nov 11 11:40:32 morph-01 kernel: CMAN: node morph-05 rejoining
Nov 11 11:40:33 morph-01 kernel: CMAN: quorum regained, resuming activity
Nov 11 11:40:33 morph-01 fenced[4015]: fencing node "morph-05"
Nov 11 11:40:33 morph-01 kernel: CMAN: node morph-03 rejoining
Nov 11 11:40:37 morph-01 kernel: CMAN: node morph-04 rejoining
Nov 11 11:40:38 morph-01 fenced[4015]: fence "morph-05" success
Nov 11 11:40:39 morph-01 fenced[4015]: fencing node "morph-04"
Nov 11 11:40:41 morph-01 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Nov 11 11:40:43 morph-01 fenced[4015]: fence "morph-04" success
Nov 11 11:40:44 morph-01 fenced[4015]: fencing node "morph-03"
Nov 11 11:40:48 morph-01 fenced[4015]: fence "morph-03" success
Nov 11 11:40:53 morph-01 kernel: CMAN: node morph-03 is not responding - removing from the cluster

morph-02:
Nov 11 11:41:31 morph-02 kernel: CMAN: node morph-05 rejoining
Nov 11 11:41:32 morph-02 kernel: CMAN: quorum regained, resuming activity
Nov 11 11:41:32 morph-02 kernel: CMAN: node morph-03 rejoining
Nov 11 11:41:32 morph-02 fenced[3848]: fencing deferred to 1
Nov 11 11:41:36 morph-02 kernel: CMAN: node morph-04 rejoining
Nov 11 11:41:52 morph-02 kernel: CMAN: node morph-03 is not responding - removing from the cluster

Both morph-01 and morph-02 then start replaying journals and recovering the filesystems. morph-02 then panics due to what appears to be a corrupted filesystem:

Nov 11 11:41:59 morph-02 kernel: GFS: fsid=morph-cluster:corey0.1: jid=3: Replaying journal...
  mh_magic = 0x01161970
  mh_type = 8
  mh_generation = 0
  mh_format = 800
  mh_incarn = 0
  lh_flags = 0x00000000
  lh_pad = 0
  lh_first = 65775200
  lh_sequence = 1003
  lh_tail = 65775008
  lh_last_dump = 65774992
  lh_reserved = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  start = 65775120
Kernel panic - not syncing: GFS: Assertion failed on line 362 of file /usr/src/cluster/gfs-kernel/src/gfs/recovery.c
GFS: assertion: "header.lh_first == start"
GFS: time = 1100194920
GFS: fsid=morph-cluster:corey0.1

How reproducible:
Didn't try
I'll try to reproduce this and maybe narrow down where the bug really lives (cman, fenced, or gfs).
FWIW, gfs_fsck was able to make this fs mountable again
I was able to reproduce the delayed fencing (after a cluster rejoin is attempted) but not yet the fs corruption.
What version of the code is this? A fix I added on 11/12 appears to be missing.

Here's what should happen:
- 5 nodes in cluster (A, B, X, Y, Z), all running fenced, dlm, and gfs
- kill 3 (X, Y, Z), leaving 2 (A, B)
- the cluster loses quorum, meaning all services are suspended on A+B (this includes fencing, dlm, and gfs services; no fencing should occur, no dlm recovery should occur, and no gfs journal replay should occur, until...)
- X is brought back into the cluster; Y+Z are left inactive
- 3 of 5 nodes (A+B+X) now satisfies quorum
- on A+B, the services are now unsuspended and they do recovery:
  * first, fence domain recovery: any node that was in the cluster but is not any longer is fenced. This means that Y+Z are fenced. X, which was also killed but has also just joined the cluster, should /not/ be fenced. Here's where the bug is: the fencing daemon was incorrectly fencing node X in addition to correctly fencing Y+Z. I fixed this on 11/12/04 in response to a report by Patrick on cluster-list. [Note: I should add an additional delay before fencing Y+Z in this situation, in the hope that they'll have enough time to rejoin the cluster and avoid being fenced, like X.]
  * second, dlm recovery occurs
  * third, gfs recovery occurs; A and B are responsible for recovering the journals used by X, Y, and Z. X will not be allowed to mount until these recoveries are complete.

I don't have an explanation for the gfs assertion. There was another obscure bug I fixed on 11/15/04 where a machine would be allowed to mount gfs while it was /joining/ the fencing domain but was not yet a full domain member. This could result in a machine having gfs mounted, being killed, but not getting fenced. It's a possible explanation for this, although I'm not sure of the details.
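For illustration only, the fence-domain recovery rule described above can be sketched as simple set logic. This is a hypothetical Python sketch, not the actual fenced source; the function name `fence_victims` and both parameter names are my own invention. It shows why a node like X that rejoins before recovery runs must not be selected, which is exactly what the pre-11/12 code got wrong.

```python
def fence_victims(members_before_outage, members_at_recovery):
    """Return the nodes to fence: former domain members that have NOT
    rejoined the cluster by the time quorum returns and recovery runs.
    (Hypothetical sketch of the rule, not the real fenced code.)"""
    return set(members_before_outage) - set(members_at_recovery)

# Scenario from the comment above: A and B survive; X, Y, Z are killed;
# X rejoins before fence domain recovery starts.
before = {"A", "B", "X", "Y", "Z"}
now = {"A", "B", "X"}  # X made it back in time; Y and Z did not
print(sorted(fence_victims(before, now)))  # prints ['Y', 'Z']; X is spared
```

The buggy behavior amounted to also including X in the victim set even though it was a current cluster member; the proposed extra delay just widens the window in which Y and Z can get back into `members_at_recovery`.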
Dave, You mentioned a possible additional delay in the comment above so that nodes Y and Z can avoid being fenced like X. Did that ever go in? I still see cases where I have 6 nodes, 3 get shot and brought back within a few seconds of each other and one or two of them end up fenced.
Never mind, I found it: the post_fail_delay variable.
Can we mark this resolved or not a bug now?
Setting the post_join_delay and post_fail_delay variables to 15 seconds or more appears to fix this issue.
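For anyone hitting this later: these fenced parameters are set on the fence_daemon element in /etc/cluster/cluster.conf. A minimal sketch, assuming the standard cluster.conf layout (the cluster name and config_version here are placeholders; the 15-second values come from the comment above):

```xml
<?xml version="1.0"?>
<cluster name="morph-cluster" config_version="1">
  <!-- post_join_delay: seconds fenced waits after a node joins the
       fence domain before fencing victims, giving freshly rebooted
       nodes time to rejoin and avoid being fenced.
       post_fail_delay: seconds fenced waits after a member fails
       before fencing it. -->
  <fence_daemon post_join_delay="15" post_fail_delay="15"/>
  <!-- clusternodes, fencedevices, etc. omitted -->
</cluster>
```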