Bug 169087 - still a small window where another node can mount during a gfs_fsck
Summary: still a small window where another node can mount during a gfs_fsck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Abhijith Das
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-09-22 21:10 UTC by Corey Marthaler
Modified: 2010-01-12 03:07 UTC (History)
0 users

Fixed In Version: RHBA-2006-0233
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-09 19:43:11 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2006:0233 0 normal SHIPPED_LIVE GFS bug fix update 2006-03-09 05:00:00 UTC

Description Corey Marthaler 2005-09-22 21:10:32 UTC
Description of problem:
For the most part, the "mount during an in-progress gfs_fsck" issue was fixed by
the checkin for bz 160108. The lock_proto gets set to "fsck_gulm" (or "fsck_dlm")
just before pass1 starts, well after gfs_fsck initialization and after the
journals are cleared. But even once this is set on the node doing the fsck, and a
read of the superblock on that machine verifies that the lock_proto has indeed
changed, there is still a 5-15 second window during which the other nodes still
see the old lock_proto, which allows mounts to occur. This may be a caching
issue on the fsck node.

How reproducible:
Every time.

Steps to Reproduce:
1. start gfs_fsck on node
2. verify on that node that the lock_proto in the SB has changed
3. check the lock_proto on another node, it will not have changed yet
4. wait 5 - 15 seconds and then verify on another node that the lock_proto in
the SB has changed

Comment 1 Abhijith Das 2005-12-20 21:27:23 UTC
There are two parts to the problem. Fixed both parts and checked the code in to
RHEL4, STABLE, and HEAD.
1 - When gfs_fsck starts, it sets lock_proto to fsck_gulm (or fsck_dlm). This
happens after the initialization phase, which itself takes a good 10 seconds (on
my setup).
FIX: Split the function fill_super_block() into two, read_super_block() and
fill_super_block(). block_mounters() is called between these two functions, so
the ~10 second delay disappears.

2 - When gfs_fsck modifies the lock_proto to fsck_gulm (or fsck_dlm) in
block_mounters(), it doesn't fsync() the change to the superblock out to disk.
This was causing the other nodes to keep using the old value of lock_proto
(lock_gulm or lock_dlm), thereby allowing gfs mounts.
FIX: Added an fsync() so that all nodes see the change immediately.

Comment 4 Red Hat Bugzilla 2006-03-09 19:43:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0233.html


