If a machine is recovering a region of a mirror, all other machines trying to write to that region are delayed until it finishes. This is good. However, once recovery completes, the state of the region changes from not-in-sync to in-sync, and that affects how mirror writes should be carried out. The problem is that the machines being delayed picked up their region state while the region was not-in-sync, but are allowed to write again once it is in-sync. They still think they only need to write to the primary device - thus the mirror becomes out-of-sync. There are two ways to fix this: 1) make the mirror write to all mirror disks regardless of sync state, or 2) re-try recovery if a collision occurs. #1 is the preferred method, but #2 is less invasive... So, I'm going to do #2 for 4.6 and #1 for 4.7/5.2.
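To make the write-path issue concrete, here is a minimal, self-contained user-space sketch (not the actual dm-mirror code; names such as write_primary_only() and do_write_fix1() are hypothetical) of how a stale cached region state leads to a primary-only write, and what fix #1 would do instead:

/* Sketch only: models the stale-cache hazard described above. */
#include <stdio.h>

enum region_state { RS_NOSYNC, RS_INSYNC };

/* State this node cached while it was being delayed - possibly stale,
 * because another node may have finished recovering the region since. */
static enum region_state cached_state = RS_NOSYNC;

static void write_primary_only(unsigned long region)
{
	printf("region %lu: write primary leg only (expect recovery to copy it later)\n",
	       region);
}

static void write_all_legs(unsigned long region)
{
	printf("region %lu: write every mirror leg\n", region);
}

/* Problematic in the cluster case: the cache still says not-in-sync, so the
 * node writes only the primary leg - but recovery has already finished,
 * so the secondary legs silently fall out of sync. */
static void do_write(unsigned long region)
{
	if (cached_state == RS_INSYNC)
		write_all_legs(region);
	else
		write_primary_only(region);
}

/* Fix #1 from above: write every leg regardless of sync state. */
static void do_write_fix1(unsigned long region)
{
	write_all_legs(region);
}

int main(void)
{
	do_write(42);		/* demonstrates the stale-cache hazard */
	do_write_fix1(42);	/* what fix #1 would do instead */
	return 0;
}

The stale cache is only a problem for cluster mirroring; on a single node the machine that recovers the region also updates its own view of it.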
#2 is insufficient due to the way the region-handling code caches region state. We must prevent nodes from getting erroneous/stale region state by checking with 'is_remote_recovering' first... a function that had been pulled out because it was thought to be no longer needed. assigned -> post
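Here is a rough, self-contained sketch (user-space, with hypothetical helpers such as queue_write(); only the name is_remote_recovering comes from the comment above) of how the write path can consult that check before trusting any cached region state:

#include <stdbool.h>
#include <stdio.h>

/* Would ask the cluster log whether another node is recovering the region. */
static bool is_remote_recovering(unsigned long region)
{
	(void)region;
	return true;	/* pretend a remote node is mid-recovery */
}

static void queue_write(unsigned long region)
{
	printf("region %lu: remote recovery in progress, queueing the write\n", region);
}

static void write_using_cached_state(unsigned long region)
{
	printf("region %lu: no remote recovery, cached state can be trusted\n", region);
}

int main(void)
{
	unsigned long region = 42;

	if (is_remote_recovering(region))
		queue_write(region);		/* don't act on stale cached state */
	else
		write_using_cached_state(region);
	return 0;
}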
Bad news: Because a node can cache the state of a region indefinitely (especially for blocks that are used a lot - e.g. the journaling area of a file system), we must deny writes to any region of the mirror that is not yet recovered. This is only the case with cluster mirroring. It means poor - probably really bad - performance of nominal I/O during recovery. However, this is absolutely necessary for mirror reliability.

Good news: The time I spent coding different fixes for this bug wasn't a complete waste. I've been able to reuse some of that code to optimize the recovery process. Now, rather than going through the mirror from front to back, recovery skips ahead to regions that have pending writes (see the sketch after this comment).

Bottom line: performance will be bad during recovery, but it will be better than RHEL4.5.

Need for testing: I've tested mirror consistency during recovery fairly heavily. However, I haven't tested it after machine/disk failures. One particular point of concern I have is:
- I/O + recovery (or machine failure), followed by
- non-primary disk failure
This is a concern because the mirror cannot be put in-sync at this point and may try to block I/O to non-synced regions. If the mirror can't complete I/O, then it can't suspend and reconfigure - meaning, it hangs. I should have this case covered, but it will be important to test... This should be a standard QA thing, as I often see their tests doing failure of secondary devices while doing I/O during recovery.

Need another respin of the package.
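For the recovery-ordering optimization mentioned under "Good news", here is an illustrative, self-contained sketch (the data structures and names are invented for the example, not taken from the actual patch) of preferring regions with pending writes over a strict front-to-back sweep:

#include <stdbool.h>
#include <stdio.h>

#define NR_REGIONS 8

static bool recovered[NR_REGIONS];
static bool write_pending[NR_REGIONS];	/* nominal I/O blocked on recovery */

/* Pick the next region to recover: regions with pending writes first,
 * then fall back to the usual front-to-back order. */
static int next_region_to_recover(void)
{
	int r;

	for (r = 0; r < NR_REGIONS; r++)
		if (write_pending[r] && !recovered[r])
			return r;
	for (r = 0; r < NR_REGIONS; r++)
		if (!recovered[r])
			return r;
	return -1;	/* mirror fully in sync */
}

int main(void)
{
	int r;

	write_pending[5] = true;	/* e.g. a journal region a node keeps writing */

	while ((r = next_region_to_recover()) >= 0) {
		printf("recovering region %d%s\n", r,
		       write_pending[r] ? " (had pending writes)" : "");
		recovered[r] = true;
	}
	return 0;
}

The idea is simply that regions blocking nominal I/O get recovered first, so the writes queued against them are released as early as possible.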
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0991.html