Bug 235686
| Summary: | Kernel BUG at dm_cmirror_server while recovering region | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | agk, cfeist, dwysocha, jbrassow, kanderso, mbroz, prockai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-08-05 21:41:53 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2007-04-09 16:27:16 UTC
> no machines killed during this test?

No machines were killed as part of this test. Here is the code causing the machine to panic:

```c
if (lr->u.lr_region != lc->recovering_region) {
        DMERR("Recovering region mismatch: (%Lu/%Lu)",
              lr->u.lr_region, lc->recovering_region);
        BUG();
}
```

The question is: how do we have a record state saying that there is a region in recovery _and_ lc->recovering_region saying that there is not a region in recovery?

Several fixes have gone in to fix the handling of this bug:

1) During server relocation (which can happen due to machine failure or normal mirror suspension), the server value could get set before the client had a chance to clean up. This caused the server to become confused and issue a BUG().
2) Perform a flush of the log before suspending. This ensures that regions which are in-sync get correctly flushed to the disk log. Without this, there will always be recovery work to be done when a mirror starts up, even if it was properly in-sync during shutdown.
3) Clean up memory used to record region users when a mirror is shut down. It was possible for some regions to be left over (causing a memory leak) during certain fault scenarios.
4) Properly initialize the state field (ru_rw) in the region user structure when a mark occurs. Without the initialization, it was sometimes possible for the region to be misinterpreted as recovering instead of marked.
5) Resolve an unhandled case in server_complete_resync_work.
6) Reset a variable in cluster_complete_resync_work. Failure to do so was causing a retry to include the wrong value for the completion of the resync work, confusing the server.

assigned -> post

Important repair for cluster mirror release in 4.5.

pm-ack

This BUG can still be tripped when attempting the 'non synced secondary leg failure' scenario. Marking back to ASSIGNED, with the QA Whiteboard tag.

Here are the comments I put into the code validating the scenario encountered and obviating the need for the BUG():
```diff
<       BUG();
---
>       /*
>        * This is a valid case, when the following happens:
>        * 1) a region is recovering and has waiting writes
>        * 2) recovery fails and calls complete_resync_work (w/ failure)
>        * 2.1) RU is removed from our list
>        * 3) waiting writes are released
>        * 3.1) writes do not mark, because region state != RH_CLEAN
>        * 4) write fails and calls complete_resync_work (w/ failure)
>        * 5) boom, we are here.
>        *
>        * Not a bug to be here
>        */
```
assigned -> post
post -> modified

Fix verified with the latest code:
- cmirror-kernel-2.6.9-33.2
- 2.6.9-56.ELsmp

Fixed in current release (4.7).