Bug 291521
Summary: | Cluster mirror can become out-of-sync if nominal I/O overlaps recovery I/O | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Jonathan Earl Brassow <jbrassow> |
Component: | cmirror-kernel | Assignee: | Jonathan Earl Brassow <jbrassow> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 4 | CC: | rkenna |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0991 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-11-21 21:15:25 UTC | Type: | --- |
Description
Jonathan Earl Brassow
2007-09-14 19:31:04 UTC
#2 is insufficient due to the way the region handling code caches region state. We must prevent nodes from getting erroneous/stale region state by checking with 'is_remote_recovering' first... a function that had been pulled out because it was thought to be no longer needed.

assigned -> post

Bad news: Because a node can cache the state of a region indefinitely (especially for blocks that are used a lot, such as the journaling area of a file system), we must deny writes to any region of the mirror that is not yet recovered. This is only the case with cluster mirroring. This means poor performance of nominal I/O during recovery, probably really bad performance. However, this is absolutely necessary for mirror reliability.

Good news: The time I spent coding different fixes for this bug wasn't a complete waste. I've been able to reuse some of that code to optimize the recovery process. Now, rather than going through the mirror from front to back, it skips ahead to recover regions that have pending writes.

Bottom line: performance will be bad during recovery, but it will be better than RHEL4.5.

Need for testing: I've tested mirror consistency during recovery fairly heavily. However, I haven't tested this after machine/disk failures. One particular point of concern I have is:
- I/O + recovery (or machine failure) followed by
- non-primary disk failure

This is a concern because the mirror cannot be brought in-sync at this point and may try to block I/O to non-synced regions. If the mirror can't complete I/O, then it can't suspend and reconfigure, meaning it hangs. I should have this case covered, but it will be important to test... This should be a standard QA thing, as I often see their tests doing failure of secondary devices while doing I/O during recovery.

Need another respin of package.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0991.html
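
For readers trying to picture the recovery behavior described in the comments above, here is a minimal Python sketch of the two ideas: writes to a region that is not yet recovered are held back, and recovery skips ahead to regions with pending writes before resuming its front-to-back pass. This is not the cmirror-kernel code. The class `MirrorSketch` and all of its method names are hypothetical, and the sketch models a single node only, so it does not cover the cached region state or the 'is_remote_recovering' check that the actual fix relies on.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorSketch:
    """Toy single-node model of a mirror split into regions (hypothetical)."""
    num_regions: int
    in_sync: set = field(default_factory=set)    # regions already recovered
    pending: dict = field(default_factory=dict)  # region -> queued writes
    cursor: int = 0                              # front-to-back position

    def write(self, region, data):
        """Accept a write only if the region is recovered; else queue it."""
        if region in self.in_sync:
            return True
        self.pending.setdefault(region, []).append(data)
        return False

    def next_region_to_recover(self):
        """Prefer regions with pending writes, else continue front to back."""
        if self.pending:
            return min(self.pending)
        while self.cursor < self.num_regions and self.cursor in self.in_sync:
            self.cursor += 1
        return self.cursor if self.cursor < self.num_regions else None

    def recover_one(self):
        """Recover one region; return (region, released writes) or None."""
        region = self.next_region_to_recover()
        if region is None:
            return None
        self.in_sync.add(region)
        return region, self.pending.pop(region, [])


mirror = MirrorSketch(num_regions=4)
blocked = not mirror.write(2, "journal block")  # region 2 is not recovered yet
first = mirror.recover_one()                    # recovery skips ahead to region 2
```

In this model a write to region 2 is queued rather than applied, and the next recovery step handles region 2 ahead of regions 0 and 1, which is the trade-off the comments describe: nominal I/O waits during recovery, but the regions it is waiting on are recovered first.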