Bug 749883
Summary: | mirrored filesystem turned read only after leg and log leg device failure | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal> | ||||
Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> | ||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 5.8 | CC: | agk, dwysocha, emi2fast, heinzm, jbrassow, mcsontos, nperic, prajnoha, prockai, slevine, thornber, zkabelac | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | lvm2-2.02.88-11.el5 | Doc Type: | Bug Fix | ||||
Doc Text: |
A mirror logical volume can itself have a mirrored log device. When a device in an image of the mirror and its log failed at the same time, it was possible for I/O errors to appear on the mirror LV when they should have been handled. That is, the kernel would not absorb the I/O errors from the failed device by relying on the remaining device. The cause was found to be that the mirror was not suspended for repair using the 'noflush' flag. This flag allows the kernel to requeue I/O requests that need to be retried. Because the kernel was not allowed to requeue the requests, it had no choice but to return the I/O as errored. This issue has been corrected and the mirror is now properly suspended with the 'noflush' flag.
|
Story Points: | --- | ||||
Clone Of: | 732124 | Environment: | |||||
Last Closed: | 2013-10-01 00:27:02 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 732124 | ||||||
Bug Blocks: | 807971, 928849 | ||||||
Attachments: |
|
Description
Corey Marthaler
2011-10-28 19:08:30 UTC
This exists in the latest 5.8 rpms as well.

Scenario: Kill primary leg and primary log of synced 2 leg redundant log mirror(s)

********* Mirror hash info for this scenario *********
* names: syncd_pri_leg_pri_log_2legs_2logs_1
* sync: 1
* striped: 0
* leg devices: /dev/sdc1 /dev/sdd1
* log devices: /dev/sdb1 /dev/sdf1
* no MDA devices:
* failpv(s): /dev/sdc1 /dev/sdb1
* failnode(s): taft-01
* leg fault policy: remove
* log fault policy: allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate --mirrorlog mirrored -m 1 -n syncd_pri_leg_pri_log_2legs_2logs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdd1:0-1000 /dev/sdb1:0-150 /dev/sdf1:0-150

PV=/dev/sdb1 syncd_pri_leg_pri_log_2legs_2logs_1_mlog_mimage_0: 1.1
PV=/dev/sdc1 syncd_pri_leg_pri_log_2legs_2logs_1_mimage_0: 6
PV=/dev/sdb1 syncd_pri_leg_pri_log_2legs_2logs_1_mlog_mimage_0: 1.1
PV=/dev/sdc1 syncd_pri_leg_pri_log_2legs_2logs_1_mimage_0: 6

Waiting until all mirrors become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.39 (29-May-2006)
Mounting mirrored ext filesystems on taft-01...

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----

Verifying files (checkit) on mirror(s) on...
---- taft-01 ----

Disabling device sdc on taft-01
Disabling device sdb on taft-01

<fail name="taft-01_syncd_pri_leg_pri_log_2legs_2logs_1" pid="2936" time="Fri Oct 28 11:43:52 2011" type="cmd" duration="30" ec="1" />

Attempting I/O to cause mirror down conversion(s) on taft-01
dd: opening `/mnt/syncd_pri_leg_pri_log_2legs_2logs_1/ddfile': Read-only file system
couldn't write to syncd_pri_leg_pri_log_2legs_2logs_1

[root@taft-01 ~]# lvs -a -o +devices
Couldn't find device with uuid rg8epN-uz9b-mYUZ-6SmJ-Wjmo-ikmP-Gb5dp6.
Couldn't find device with uuid 34ybGk-EjXg-Ivj5-3a4G-Zdiq-qE0s-XV86Bd.
LV                                   Attr   LSize   Copy%  Devices
syncd_pri_leg_pri_log_2legs_2logs_1  -wi-ao 600.00M        /dev/sdd1(0)

[root@taft-01 ~]# touch /mnt/syncd_pri_leg_pri_log_2legs_2logs_1/foo
touch: cannot touch `/mnt/syncd_pri_leg_pri_log_2legs_2logs_1/foo': Read-only file system

Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26626
Oct 28 11:43:57 taft-01 kernel: sd 1:0:0:2: rejecting I/O to offline device
Oct 28 11:43:57 taft-01 kernel: device-mapper: raid1: A read failure occurred on a mirror device.
Oct 28 11:43:57 taft-01 kernel: device-mapper: raid1: Trying different device.
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26627
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26628
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26629
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26630
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26631
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26632
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Buffer I/O error on device dm-7, logical block 26633
Oct 28 11:43:57 taft-01 kernel: lost page write due to I/O error on dm-7
Oct 28 11:43:57 taft-01 kernel: Aborting journal on device dm-7.
Oct 28 11:43:57 taft-01 kernel: device-mapper: raid1: log postsuspend failed
Oct 28 11:43:57 taft-01 kernel: ext3_abort called.
Oct 28 11:43:57 taft-01 kernel: EXT3-fs error (device dm-7): ext3_journal_start_sb: Detected aborted journal
Oct 28 11:43:57 taft-01 kernel: Remounting filesystem read-only
Oct 28 11:43:57 taft-01 xinetd[6465]: EXIT: qarsh status=0 pid=15273 duration=30(sec)
Oct 28 11:43:57 taft-01 kernel: sd 1:0:0:2: rejecting I/O to offline device
Oct 28 11:44:00 taft-01 last message repeated 76 times
Oct 28 11:44:13 taft-01 lvm[11842]: Mirror status: 1 of 2 images failed.
Oct 28 11:44:13 taft-01 lvm[11842]: Mirror log status: 1 of 2 images failed.
Oct 28 11:44:13 taft-01 lvm[11842]: Repair of mirrored LV helter_skelter/syncd_pri_leg_pri_log_2legs_2logs_1 finished successfully.
Oct 28 11:44:13 taft-01 lvm[11842]: Log device 253:4 has failed (D).
Oct 28 11:44:13 taft-01 lvm[11842]: Device failure in helter_skelter-syncd_pri_leg_pri_log_2legs_2logs_1.
Oct 28 11:44:13 taft-01 lvm[11842]: dm_task_run failed, errno = 6, No such device or address
Oct 28 11:44:13 taft-01 lvm[11842]: helter_skelter-syncd_pri_leg_pri_log_2legs_2logs_1_mlog disappeared, detaching
Oct 28 11:44:13 taft-01 lvm[11842]: No longer monitoring mirror device helter_skelter-syncd_pri_leg_pri_log_2legs_2logs_1_mlog for events.
Oct 28 11:44:13 taft-01 lvm[11842]: Couldn't find device with uuid rg8epN-uz9b-mYUZ-6SmJ-Wjmo-ikmP-Gb5dp6.
Oct 28 11:44:13 taft-01 lvm[11842]: Couldn't find device with uuid 34ybGk-EjXg-Ivj5-3a4G-Zdiq-qE0s-XV86Bd.
Oct 28 11:44:14 taft-01 lvm[11842]: Repair of mirrored LV helter_skelter/syncd_pri_leg_pri_log_2legs_2logs_1 finished successfully.
Oct 28 11:44:14 taft-01 lvm[11842]: helter_skelter-syncd_pri_leg_pri_log_2legs_2logs_1 has unmirrored portion.

2.6.18-274.el5

lvm2-2.02.88-2.el5 BUILT: Fri Oct 21 09:48:50 CDT 2011
lvm2-cluster-2.02.88-2.el5 BUILT: Fri Oct 21 09:49:24 CDT 2011
device-mapper-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009

Created attachment 551694 [details]
Patch for 769731 fix (which this bug was cloned from
Comment on attachment 551694 [details]
Patch for 769731 fix (which this bug was cloned from
patch put in wrong bug
This bug is the same as (and is a clone of) bug 732124, which was fixed by the upstream commit listed below. This commit can be used to fix this bug also. It applies cleanly (except for WHATS_NEW) to release 2.02.88.

commit 54c73b7723713f43413584d59ca0bdd42c1d8241
Author: Jonathan Brassow <jbrassow>
Date: Wed Nov 14 14:58:47 2012 -0600

mirror: Mirrored log should be fixed before mirror when double fault occurs

This patch is intended to fix bug 825323 - FS turns read-only during a double fault of a mirror leg and mirrored log's leg at the same time. It only affects a 2-way mirror with a mirrored log. 3+-way mirrors and mirrors without a mirrored log are not affected.

The problem resulted from the fact that the top level mirror was not using 'noflush' when suspending before its "down-convert". When a mirror image fails, the bios are queued until a suspend is received. If it is a 'noflush' suspend, the bios can be safely requeued in the DM core. If 'noflush' is not used, the bios must be pushed through the target and if a device is failed for a mirror, that means issuing an error. When an error is received by a file system, it results in it turning read-only (depending on the FS).

Part of the problem is due to the nature of the stacking involved in using a mirror as a mirror's log. When an image in each fails, the top level mirror stalls because it is waiting for a log flush. The other stalls waiting for corrective action. When the repair command is issued, the entire stacked arrangement is collapsed to a linear LV. The log flush then fails (somewhat uncleanly) and the top-level mirror is suspended without 'noflush' because it is a linear device.

This patch allows the log to be repaired first, which in turn allows the top-level mirror's log flush to complete cleanly. The top-level mirror is then secondarily reduced to a linear device - at which time this mirror is suspended properly with 'noflush'.
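
The difference between the two suspend modes described in the commit message can be illustrated with a small toy model. This is only a sketch of the behaviour as the commit message describes it, not device-mapper code: the class `ToyMirror`, its methods, and the `double_fault` helper are hypothetical names invented for this illustration, and the model ignores the mirrored log entirely.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyMirror:
    """Hypothetical toy model of a 2-way mirror target (not kernel code)."""
    leg_failed: bool = False
    held: deque = field(default_factory=deque)     # bios queued after a leg failure
    requeued: list = field(default_factory=list)   # bios handed back for a later retry
    errored: list = field(default_factory=list)    # bios returned as I/O errors
    completed: list = field(default_factory=list)  # bios that finished successfully

    def write(self, bio):
        # Once an image has failed, bios are queued until a suspend arrives.
        if self.leg_failed:
            self.held.append(bio)
        else:
            self.completed.append(bio)

    def suspend(self, noflush):
        # noflush: queued bios can be requeued safely and retried later.
        # otherwise: they must be pushed through the target, which has a
        # failed device, so they come back as errors.
        while self.held:
            bio = self.held.popleft()
            (self.requeued if noflush else self.errored).append(bio)

    def resume_repaired(self):
        # After the down-convert the surviving leg serves the retried bios.
        self.leg_failed = False
        while self.requeued:
            self.completed.append(self.requeued.pop(0))


def double_fault(noflush):
    """Run one failure/repair cycle and return the final toy state."""
    mirror = ToyMirror()
    mirror.write("bio-0")          # completes before the failure
    mirror.leg_failed = True
    mirror.write("bio-1")          # queued, waiting for the suspend
    mirror.write("bio-2")
    mirror.suspend(noflush=noflush)
    mirror.resume_repaired()
    return mirror


if __name__ == "__main__":
    for flag in (True, False):
        state = double_fault(flag)
        print(f"noflush={flag}: completed={state.completed} errored={state.errored}")
```

In this model the queued writes only survive the repair when the suspend uses `noflush`; without it they are reported as errors, which is the condition that led ext3 to abort its journal and remount read-only in the logs above.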

Tested this with multiple iterations of the following scenario without problems:

Scenario kill_pri_log_and_pri_leg_2_legs_2_logs: Kill primary leg and primary log of synced 2 leg redundant log mirror(s)

********* Mirror hash info for this scenario *********
* names: syncd_pri_leg_pri_log_2legs_2logs_1 syncd_pri_leg_pri_log_2legs_2logs_2 syncd_pri_leg_pri_log_2legs_2logs_3
* sync: 1
* striped: 0
* leg devices: /dev/sdh1 /dev/sdf1
* log devices: /dev/sdc1 /dev/sdd1
* no MDA devices:
* failpv(s): /dev/sdh1 /dev/sdc1
* failnode(s): r5-node02
* leg fault policy: remove
* log fault policy: remove
******************************************************

Tested with: lvm2-2.02.88-11.el5

Tested again with 15 iterations of syncd_pri_leg_pri_log_2legs_2logs without issues (except the known dm_task_run)

Marking verified with lvm2-2.02.88-11.el5

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1352.html

Why doesn't this problem happen with XFS?