Bug 431696
Summary: | RHEL5 cmirror tracker: any device failure on a 3-way mirror can leave write path deadlocked
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Corey Marthaler <cmarthal>
Component: | cmirror
Assignee: | Jonathan Earl Brassow <jbrassow>
Status: | CLOSED ERRATA
QA Contact: | Cluster QE <mspqa-list>
Severity: | high
Priority: | high
Version: | 5.2
CC: | agk, ccaulfie, dwysocha, edamato, heinzm, mbroz
Target Milestone: | rc
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2009-01-20 21:25:36 UTC
Bug Blocks: | 430797
Description
Corey Marthaler
2008-02-06 16:02:12 UTC
Created attachment 294120: messages from taft-01 during the failure
Created attachment 294121: messages from taft-02 during the failure
Created attachment 294123: messages from taft-03 during the failure
Created attachment 294124: messages from taft-04 during the failure
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This is reproducible. This may be less of a 'primary leg failure' and more of a 'failure to a 3-way mirror' thing.

It's possible that this is related to or the same as bz 432064. I need to reproduce this again and verify that the write path is indeed deadlocked and not just really slow.

I reproduced this again, and it's definitely deadlocked, not just slow. So this is not the same bz as 432064.

I'm fairly confident that this issue is due to the fact I'm using 3-way mirrors.

I was able to reproduce this with just 2-way mirrors, so although it doesn't happen as often, this isn't just a 3-way issue.

Two competing theories:
1) A bug in OpenAIS checkpointing could be causing an inability to reload a mirror with the same UUID, making it impossible to down-convert from 3-way -> 2-way.
2) One region in the log may not be in-sync. The machines (for whatever reason) are not recovering it. A write to that region would be blocked forever under cluster mirroring rules, because writes are not allowed to out-of-sync regions.

Looking at the evidence:
1) The mirror never becomes fully synced again.
2) There are no messages like:
   clogd[1814]: saCkptSectionIterationNext failure: 27
   clogd[1814]: import_checkpoint: 0 checkpoint sections found

It is a good bet that this is _not_ an openAIS issue, but an issue with mirror recovery.
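The second piece of evidence above (no checkpoint failure messages) can be checked mechanically against the attached message logs. The following is only a rough sketch: the two message strings are the ones quoted above, while the function name `find_checkpoint_failures`, the regular expressions, and the idea of scanning line by line are assumptions made for illustration and are not part of cmirror or OpenAIS.

```python
import re

# Message texts are taken from the comment above. The exact regexes are an
# assumption; the pid and the error code are allowed to vary.
CKPT_FAILURE_PATTERNS = (
    re.compile(r"clogd\[\d+\]: saCkptSectionIterationNext failure: \d+"),
    re.compile(r"clogd\[\d+\]: import_checkpoint: 0 checkpoint sections found"),
)


def find_checkpoint_failures(lines):
    """Return the log lines that match either checkpoint-failure message.

    `lines` is any iterable of strings, for example an open log file.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in CKPT_FAILURE_PATTERNS)
    ]
```

Run over each node's saved messages file (for instance by passing an open file object), an empty result would be consistent with the conclusion that OpenAIS checkpointing is not the culprit, although it would not prove it.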
It is also a good bet that this was fixed on March 24th, by this checkin:

commit 2200d92f9ebc30fca8f4107929fc4707b57bcebd
Author: Jonathan Brassow <jbrassow>
Date: Mon Mar 24 16:09:52 2008 -0500

    clogd: do not process requests after calling cpg_leave

However, with the above fix, I can envision seeing messages like:

    LOG_ERROR("[%s] sync_count(%llu) does not match bitmap count(%llu)",
              SHORT_UUID(lc->uuid), (unsigned long long)lc->sync_count, reset);
    LOG_ERROR("[%s] Resetting sync_count = %llu",
              SHORT_UUID(lc->uuid), reset);

If you see those messages, file a new bug (because although the issue is handled, the messages indicate the underlying problem is not gone).

This issue is no longer being seen, marking verified in:
2.6.18-104.el5
lvm2-2.02.32-4.el5 BUILT: Fri Apr 4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5 BUILT: Wed Apr 2 03:56:50 CDT 2008
cmirror-1.1.22-1.el5 BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5 BUILT: Thu Jul 24 16:00:48 CDT 2008

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html
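For readers wondering what the sync_count messages quoted in the fix comment imply, here is a toy model of that kind of consistency check. It is an assumption-laden sketch, not the clogd code: the function name `check_sync_count`, the list-of-bits bitmap, and the return shape are all hypothetical. Only the message wording and the names `sync_count` and `reset` come from the quoted LOG_ERROR calls.

```python
def check_sync_count(short_uuid, bitmap, sync_count):
    """Compare a cached sync_count with the number of in-sync bits.

    `bitmap` is an iterable of truthy/falsy values, one per region
    (a stand-in for the real sync bitmap). Returns the corrected count
    and any messages that would be logged.
    """
    # Recount from the bitmap; this plays the role of `reset` in the quoted code.
    reset = sum(1 for bit in bitmap if bit)
    messages = []
    if sync_count != reset:
        messages.append(
            "[%s] sync_count(%d) does not match bitmap count(%d)"
            % (short_uuid, sync_count, reset)
        )
        messages.append("[%s] Resetting sync_count = %d" % (short_uuid, reset))
    return reset, messages
```

In this model a mirror whose cached count says "all regions in sync" while one bit is still clear gets corrected, which is why the messages matter: the state is repaired, but something upstream still let the two values diverge.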