Bug 193728
| Summary: | A write to a cluster mirror volume not in sync will hang and also cause the sync to hang as well | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | kernel | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0 | CC: | agk, mbroz, mjenner, rkenna |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHSA-2006-0575 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-08-10 23:25:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 180185, 181411 | | |
| Attachments: | | | |
Description
Corey Marthaler
2006-05-31 21:21:57 UTC
Looks like I/O will hang even to an in-sync mirror:

May 31 11:34:26 taft-03 clvmd: Activating VGs: succeeded
device-mapper: unable to get server (2) to mark region (120)
device-mapper: Reason :: 1
May 31 11:35:52 taft-03 lvm[4538]: mirror_1-coreymirror is now in-sync
May 31 11:35:52 taft-03 kernel: device-mapper: unable to get server (2) to mark region (120)
May 31 11:35:52 taft-03 kernel: device-mapper: Reason :: 1
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 128
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 128
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 256
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 384
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 512
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 512
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 640
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 640
May 31 11:35:58 taft-03 kernel: device-mapper: clear_region_count :: 768

I created a GFS filesystem on a 900M cmirror last night and attempted to run simple I/O on it, and this morning it was all hung.

Switching this to RHEL4/kernel.

Created attachment 130973
Potential Patch (waiting for a little feedback before posting)
The problem was in drivers/md/dm-raid1.c:do_writes, where I was adding writes
that are blocked behind remote recovering regions ('requeue') to a bio list
('writes') that was only valid in that function. The result is that the memory
is leaked and the request lost - stalling all I/O for that process as it
waits for the write to complete.
The fix simply adds the bio to the main write queue via queue_bio - effectively
causing the write to be deferred until the region is recovered.
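
The difference between the two behaviours can be sketched in userspace C. This is only an illustrative model, not the patch itself: the `bio`, `bio_list`, and `mirror_set` types are simplified stand-ins, `recovering_region`, `use_fix`, and `queued_after_one_pass` are hypothetical names, and the real fix goes through queue_bio rather than adding to the list directly.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's bio and bio_list. */
struct bio {
    struct bio *next;
    int region;                 /* region this write targets */
};

struct bio_list {
    struct bio *head;
    struct bio *tail;
};

static void bio_list_init(struct bio_list *bl)
{
    bl->head = bl->tail = NULL;
}

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
    assert(bio != NULL);
    bio->next = NULL;
    if (bl->tail)
        bl->tail->next = bio;
    else
        bl->head = bio;
    bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
    struct bio *bio = bl->head;

    if (bio) {
        bl->head = bio->next;
        if (!bl->head)
            bl->tail = NULL;
        bio->next = NULL;
    }
    return bio;
}

static int bio_list_size(const struct bio_list *bl)
{
    int n = 0;

    for (const struct bio *b = bl->head; b; b = b->next)
        n++;
    return n;
}

struct mirror_set {
    struct bio_list writes;     /* main write queue; persists between passes */
    int recovering_region;      /* hypothetical: one region under remote recovery */
};

/* Sort one batch of writes. Blocked writes are put back either on the
 * caller's short-lived list (the bug) or on ms->writes (the fix). */
static void do_writes(struct mirror_set *ms, struct bio_list *writes, int use_fix)
{
    struct bio_list batch = *writes;
    struct bio *bio;

    bio_list_init(writes);
    while ((bio = bio_list_pop(&batch)) != NULL) {
        if (bio->region != ms->recovering_region)
            continue;                       /* would be dispatched to the mirrors */
        if (use_fix)
            bio_list_add(&ms->writes, bio); /* deferred until a later pass */
        else
            bio_list_add(writes, bio);      /* forgotten once the caller returns */
    }
}

/* One worker pass: take everything off the main queue, then process it.
 * Returns how many writes are still queued for a later pass. */
int queued_after_one_pass(int use_fix)
{
    struct bio bios[3] = { { NULL, 0 }, { NULL, 1 }, { NULL, 2 } };
    struct mirror_set ms;
    struct bio_list local;

    bio_list_init(&ms.writes);
    ms.recovering_region = 1;
    for (int i = 0; i < 3; i++)
        bio_list_add(&ms.writes, &bios[i]);

    local = ms.writes;          /* grab the queued writes ... */
    bio_list_init(&ms.writes);  /* ... and empty the main queue */
    do_writes(&ms, &local, use_fix);

    return bio_list_size(&ms.writes);
}
```

In this model, the write to the recovering region is no longer reachable from the mirror set after a pass without the fix, whereas with the fix it is still on the main queue and can be retried.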
The reason for the change to do_mirror is that the call to 'wake' in queue_bio
has no effect here, since the thread is effectively telling itself to wake up.
(That wake-up call is disregarded because the thread is already running.) The
while loop in do_mirror continues until do_writes no longer requeues bios due
to remote recovery.
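
A rough userspace model of why the loop is needed, under the behaviour described above (a wake-up issued while the worker is running is ignored). All names and fields here are hypothetical simplifications: the write queue is reduced to a counter, and remote recovery is assumed to finish after a fixed number of passes, which is not how the real code tracks it.

```c
#include <assert.h>
#include <stdbool.h>

struct mirror_set {
    int queued_writes;        /* stand-in for the ms->writes bio list */
    int completed_writes;
    int recovery_passes_left; /* assumption: recovery ends after this many passes */
    bool worker_running;
    bool wakeup_pending;
};

/* Model of wake(): disregarded when the worker is already running. */
static void wake(struct mirror_set *ms)
{
    if (!ms->worker_running)
        ms->wakeup_pending = true;
}

static void queue_bio(struct mirror_set *ms)
{
    ms->queued_writes++;
    wake(ms);
}

/* One pass over the queue: while the region is still recovering,
 * every write is requeued; afterwards every write completes. */
static void do_writes(struct mirror_set *ms)
{
    int batch = ms->queued_writes;
    bool recovering = ms->recovery_passes_left > 0;

    ms->queued_writes = 0;
    if (recovering)
        ms->recovery_passes_left--;  /* recovery progresses elsewhere */

    for (int i = 0; i < batch; i++) {
        if (recovering)
            queue_bio(ms);           /* its wake() is ignored: we are running */
        else
            ms->completed_writes++;
    }
}

static void do_mirror(struct mirror_set *ms, bool loop_until_drained)
{
    ms->worker_running = true;
    do {
        do_writes(ms);
    } while (loop_until_drained && ms->queued_writes > 0);
    ms->worker_running = false;
}

/* Queue two writes, run the worker once, and report how many completed. */
int completed_after_one_wakeup(bool loop_until_drained)
{
    struct mirror_set ms = { 0 };

    ms.recovery_passes_left = 2;
    queue_bio(&ms);
    queue_bio(&ms);
    assert(ms.wakeup_pending);       /* worker was idle, so the wake-up counts */

    ms.wakeup_pending = false;       /* the worker consumes the wake-up */
    do_mirror(&ms, loop_until_drained);
    return ms.completed_writes;
}
```

Without the loop, the requeued writes sit on the queue with no wake-up pending, so nothing processes them; with the loop, the worker keeps going until nothing was requeued.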
You may ask why it is OK to check 'ms->writes.head' without holding the spin
lock. The reason is that do_mirror is the only one that can clear the list,
and any additions to the list that we are concerned about happen in do_writes
(that is, before the check, in the same thread).
Committed in stream U4 build 39.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html