196047 – split brain on shutdown causes file system corruption.

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	clumanager
Version:	3
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE

Reported:	2006-06-20 16:48 UTC by Issue Tracker
Modified:	2009-04-16 19:52 UTC
CC List:	2 users
Fixed In Version:	RHBA-2006-0505
Last Closed:	2006-08-10 14:14:14 UTC

Attachments
Waits for all children before notifying the quorum daemon that we're exiting. (806 bytes, patch) 2006-06-21 17:33 UTC, Lon Hohberger

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0505	0	normal	SHIPPED_LIVE	clumanager bug fix update	2006-08-10 04:00:00 UTC

Comment 6 Lon Hohberger 2006-06-21 17:15:06 UTC

		/*
		 * If we failed to stop the service, we're done.  At this
		 * point, we can't determine the service's status - so
		 * trying to start it on other nodes is right out.
		 */
		return ABORT;

...

ABORT = service failed to stop.  This can happen because of a simple lock
failure, or it can indicate a fatal problem with the service or file systems.
Manual intervention is required.  Relocating the service after anything *but* a
lock failure can increase the risk of data/filesystem corruption.

I think the correct thing to do here is wait for all the child processes which
are forked off to return prior to letting the main clusvcmgrd exit (and for some
reason, I thought it did this correctly).  If there's a clusvcmgrd process still
running / recovering after cluquorumd has exited, that's definitely a problem.

Comment 7 Lon Hohberger 2006-06-21 17:33:49 UTC

Created attachment 131299 [details]
Waits for all children before notifying the quorum daemon that we're exiting.

This should fix it in a rather robust fashion.

Taking the lock should no longer fail (that failure is what caused the service
to return ABORT), because clusvcmgrd won't tell cluquorumd it's ready to exit
until services have completed the operations they were performing at the time
of shutdown.
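
For readers without access to the attachment, here is a minimal sketch of the
general idea: reap every forked child first, and only then report to the quorum
daemon.  This is not the attached patch.  wait_for_all_children(),
notify_quorum_daemon_exiting() and shutdown_service_manager() are hypothetical
names made up for illustration, and the notification is reduced to setting a
flag.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child of the calling process, blocking until none remain.
 * Returns the number of children reaped, or -1 on an unexpected error.
 */
static int
wait_for_all_children(void)
{
	int reaped = 0;
	int status;
	pid_t pid;

	for (;;) {
		pid = waitpid(-1, &status, 0);
		if (pid > 0) {
			reaped++;
			continue;
		}
		if (pid == -1 && errno == EINTR)
			continue;	/* interrupted by a signal; retry */
		if (pid == -1 && errno == ECHILD)
			return reaped;	/* no children left */
		return -1;
	}
}

/*
 * Hypothetical stand-in for telling the quorum daemon we are exiting.
 * The real daemon-to-daemon message is not shown in this report.
 */
static int quorum_notified = 0;

static void
notify_quorum_daemon_exiting(void)
{
	quorum_notified = 1;
}

/*
 * Shutdown ordering: children first, notification second.
 * Returns the number of children reaped, or -1 if waiting failed
 * (in which case the quorum daemon is not notified).
 */
static int
shutdown_service_manager(void)
{
	int reaped = wait_for_all_children();

	if (reaped < 0)
		return -1;
	notify_quorum_daemon_exiting();
	return reaped;
}
```

The point of the ordering is that no child can still be stopping or recovering
a service once the quorum daemon has been told the service manager is gone.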

Comment 8 Lon Hohberger 2006-06-21 17:40:01 UTC

FWIW, this is what rgmanager on RHEL4 does (except rgmanager has per-service
event queues).

Comment 14 Red Hat Bugzilla 2006-08-10 14:14:16 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0505.html
