Description of problem:
Helter_skelter failed both the primary leg and log device at the same time, and this resulted in a corrupted volume.

================================================================================
Iteration 0.1 started at Thu Apr 30 15:54:10 CDT 2009
================================================================================
Scenario: Kill primary leg and disk log of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:        syncd_primary_log_2legs
* sync:        1
* num mirrors: 1
* disklog:     /dev/sdh1
* failpv(s):   /dev/sde1 /dev/sdh1
* leg devices: /dev/sde1 /dev/sdf1
************************************************

Creating mirror(s) on taft-02-bond...
taft-02-bond: lvcreate -m 1 -n syncd_primary_log_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdh1:0-150

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 1=2.75% )
   0/1 mirror(s) are fully synced: ( 1=36.75% )
   0/1 mirror(s) are fully synced: ( 1=36.75% )
   0/1 mirror(s) are fully synced: ( 1=41.50% )
   0/1 mirror(s) are fully synced: ( 1=41.50% )
   0/1 mirror(s) are fully synced: ( 1=46.67% )
   0/1 mirror(s) are fully synced: ( 1=52.17% )
   0/1 mirror(s) are fully synced: ( 1=80.75% )
   1/1 mirror(s) are fully synced: ( 1=100.00% )

Creating gfs on top of mirror(s) on taft-01-bond...
Mounting mirrored gfs filesystems on taft-01-bond...
Mounting mirrored gfs filesystems on taft-02-bond...
Mounting mirrored gfs filesystems on taft-03-bond...
Mounting mirrored gfs filesystems on taft-04-bond...

Writing verification files (checkit) to mirror(s) on...
---- taft-01-bond ----
checkit starting with: CREATE
Num files:          100
Random Seed:        12027
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-02-bond ----
checkit starting with: CREATE
Num files:          100
Random Seed:        11738
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-03-bond ----
checkit starting with: CREATE
Num files:          100
Random Seed:        11625
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-04-bond ----
checkit starting with: CREATE
Num files:          100
Random Seed:        11611
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

<start name="taft-01-bond_1" pid="6572" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-02-bond_1" pid="6574" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-03-bond_1" pid="6576" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-04-bond_1" pid="6578" time="Thu Apr 30 15:55:49 2009" type="cmd" />

Sleeping 10 seconds to get some outstanding GFS I/O locks before the failure

Verifying files (checkit) on mirror(s) on...
---- taft-01-bond ----
checkit starting with: VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-02-bond ----
checkit starting with: VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-03-bond ----
checkit starting with: VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

---- taft-04-bond ----
checkit starting with: VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

Disabling device sde on taft-01-bond
Disabling device sdh on taft-01-bond
Disabling device sde on taft-02-bond
Disabling device sdh on taft-02-bond
Disabling device sde on taft-03-bond
Disabling device sdh on taft-03-bond
Disabling device sde on taft-04-bond
Disabling device sdh on taft-04-bond

Attempting I/O to cause mirror down conversion(s) on taft-02-bond
[HANG]

[root@taft-04 ~]# lvs -a -o +devices
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
[HANG]

[root@taft-01 ~]# ls /mnt/syncd_primary_log_2legs_1
ls: /mnt/syncd_primary_log_2legs_1: Input/output error

Apr 30 15:56:39 taft-01 clogd[6672]: [vqQS1eu0] rw_log: write failure: Input/output error
Apr 30 15:56:39 taft-01 clogd[6672]: [vqQS1eu0] Error writing to disk log
Apr 30 15:56:39 taft-01 lvm[8622]: Log device, 253:2, has failed.
Apr 30 15:56:39 taft-01 lvm[8622]: Device failure in helter_skelter-syncd_primary_log_2legs_1
Apr 30 15:56:39 taft-01 lvm[8622]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Apr 30 15:56:39 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Apr 30 15:56:39 taft-01 kernel: device-mapper: raid1: Primary mirror device has failed while mirror is not in-sync
Apr 30 15:56:39 taft-01 kernel: device-mapper: raid1: Unable to choose alternative primary device
Apr 30 15:56:39 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: fatal: I/O error
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   block = 63440
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   function = gfs_logbh_wait
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/dio.c, line = 925
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   time = 1241124999

I'll attach the logs from each of the 4 nodes as well as include a kern dump of the running processes.

Version-Release number of selected component (if applicable):
2.6.18-128.el5
lvm2-2.02.40-6.el5          BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5  BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5 BUILT: Fri Sep 19 02:50:32 CDT 2008
Created attachment 342012 [details] log from taft-01
Created attachment 342013 [details] log from taft-02
Created attachment 342014 [details] log from taft-03
Created attachment 342015 [details] log from taft-04
This is reproducible. Attaching kern dumps...
Created attachment 342103 [details] dump from taft-01
Created attachment 342105 [details] dump from taft-02
Created attachment 342106 [details] dump from taft-03
Created attachment 342108 [details] dump from taft-04
I don't see any corruption. What I do see is that GFS got an I/O error from the mirror and withdrew. The data should still be fine, but we need to find the reason for the I/O error.
All operations to the mirror get stuck at the point of failure. As far as I can tell, there is no way to get at the data without a reboot. After rebooting my entire cluster, the mirror could not be activated, which is expected. I then attempted a down conversion, which was probably a mistake, because it converts down to the failed primary leg, at which point the data is lost. What would have been the proper way to fix this issue and keep the data?

# After a reboot:
  syncd_primary_log_2legs_1            helter_skelter mwi-a- 600.00M syncd_primary_log_2legs_1_mlog 0.00 syncd_primary_log_2legs_1_mimage_0(0),syncd_primary_log_2legs_1_mimage_1(0)
  [syncd_primary_log_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M
  [syncd_primary_log_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M /dev/sdd1(0)
  [syncd_primary_log_2legs_1_mlog]     helter_skelter lwi-ao   4.00M

[root@taft-01 ~]# lvconvert -m 0 helter_skelter/syncd_primary_log_2legs_1
  Logical volume syncd_primary_log_2legs_1 converted.

[root@taft-01 ~]# lvs -a -o +devices
  LV                        VG             Attr   LSize   Origin Snap% Move Log Copy% Convert Devices
  LogVol00                  VolGroup00     -wi-ao  58.38G                                      /dev/sda2(0)
  LogVol01                  VolGroup00     -wi-ao   9.75G                                      /dev/sda2(1868)
  syncd_primary_log_2legs_1 helter_skelter vwi-a- 600.00M

[root@taft-01 ~]# mount /dev/helter_skelter/syncd_primary_log_2legs_1 /mnt/taft
mount: you must specify the filesystem type

printk: 25 messages suppressed.
Buffer I/O error on device dm-5, logical block 153584
Buffer I/O error on device dm-5, logical block 153584
Buffer I/O error on device dm-5, logical block 0
Buffer I/O error on device dm-5, logical block 1
Buffer I/O error on device dm-5, logical block 2
Buffer I/O error on device dm-5, logical block 3
Buffer I/O error on device dm-5, logical block 4
Buffer I/O error on device dm-5, logical block 5
Buffer I/O error on device dm-5, logical block 6
Buffer I/O error on device dm-5, logical block 7
hfs: unable to find HFS+ superblock
The second time I attempted this today, the VG didn't even show up after the reboot.

[root@taft-04 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@taft-04 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@taft-04 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:
  /dev/sdd1: Checksum error
  /dev/sde1: Checksum error
  /dev/sdh1: Checksum error
  2 logical volume(s) in volume group "VolGroup00" now active
                                                           [  OK  ]
[root@taft-04 ~]# lvs -a -o +devices
  /dev/sdd1: Checksum error
  /dev/sde1: Checksum error
  /dev/sdh1: Checksum error
  LV       VG         Attr   LSize  Origin Snap% Move Log Copy% Convert Devices
  LogVol00 VolGroup00 -wi-ao 58.38G                                     /dev/sda2(0)
  LogVol01 VolGroup00 -wi-ao  9.75G                                     /dev/sda2(1868)
Comment #12 sounds like a completely different bug to me. The volume group metadata seems to be corrupted in that instance. The cluster mirror components know nothing about LVM metadata and cannot even address those portions of the disk. A valid bug for sure - especially if you can reproduce it, but I don't think it is related to the original post.

The problem being seen in the original post is:
- failure of the primary (or any other device) write causes the mirror to be "out-of-sync"
- the primary cannot be switched due to the out-of-sync nature of the mirror
- a read results in an I/O error because the mirror can't be sure the back-up leg is up-to-date

Since the logic failure is in drivers/md/dm-raid1.c, this should be reproducible on single machine mirrors also. Can the bug be reproduced by just failing the primary leg? Can the bug be reproduced in single machine mode?

I think there may be a workaround, but I would have to try it on a setup with a reproduction of the problem. Changing the logic to fix the problem in drivers/md/dm-raid1.c has the potential to introduce other regressions. I will be checking to see if there is a minimally intrusive way to resolve the issue - so for now, I'm adding the devel-ack.
Jon and I tested and concluded that this doesn't happen on single machine mirrors. We're working on device failure testing in cluster mode to get a better handle on this bug.
I had thought the bug could be recreated by:
1) create cluster mirror and wait for sync
2) kill primary device and log
3) write to the mirror on a machine that has not yet reported "in-sync"

This method for reproducing the bug has not worked. We are currently running tests with a modified kernel that we are hoping will provide us enough information to create a flow chart of how the bug is happening. Waiting to reproduce....
We are pretty sure we have discovered the cause. From my pending upstream patch:

<patch>
This patch fixes a bug which was triggering a case where the primary leg could not be changed (even when the mirror was in-sync) on failure.

The case involves the failure of the primary device along with the transient failure of the log device. The problem is that bios can be put on the 'failures' list - due to log failure - before 'fail_mirror' is called due to the primary device failure. Normally, this is fine, but if the log device failure is transient, a subsequent iteration of the work thread, 'do_mirror', will reset 'log_failure'. The 'do_failures' function then resets the 'in_sync' variable when processing bios on the failures list. The 'in_sync' variable is what is used to determine if the primary device can be switched in the event of a failure. Since this has been reset, the primary device is incorrectly assumed to be not switchable.

The case has been seen in the cluster mirror context, where one machine realizes the log device is dead before the other machines. As the responsibilities of the server migrate from one node to another (because the mirror is being reconfigured due to the failure), the new server may think for a moment that the log device is fine - thus resetting the 'log_failure' variable.

In any case, it is inappropriate for us to reset the 'log_failure' variable. The above bug simply illustrates that it can actually hurt us.

Signed-off-by: Jonathan Brassow <jbrassow>

Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c
+++ linux-2.6/drivers/md/dm-raid1.c
@@ -647,7 +647,13 @@ static void do_writes(struct mirror_set
 	 */
 	dm_rh_inc_pending(ms->rh, &sync);
 	dm_rh_inc_pending(ms->rh, &nosync);
-	ms->log_failure = dm_rh_flush(ms->rh) ? 1 : 0;
+
+	/*
+	 * If the flush fails on a previous call and succeeds here,
+	 * we must not reset the log_failure variable. We need
+	 * userspace interaction to do that.
+	 */
+	ms->log_failure = dm_rh_flush(ms->rh) ? 1 : ms->log_failure;
 
 	/*
 	 * Dispatch io.
</patch>

It seems to me that the bug has pretty low priority, given what it takes to reproduce it. OTOH, the patch changes one line and already has some testing mileage on it.
file affected is: drivers/md/dm-raid1.c
A fix for this is included in kernel-2.6.18-175.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html