+++ This bug was initially created as a clone of Bug #1307111 +++

Description of problem:
I'm guessing this is related to, if not the same as, bug 1304045.

Non-synced mirrors experiencing device failures are not being properly repaired. This test case works fine in both 6.7 and 7.2. Could it be that failed PVs are being recognized as still existing, but as having 0 sectors available?

I see this in all the failures:

Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG helter_skelter have changed sizes.
Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address
Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG helter_skelter have changed sizes.
Feb 12 10:48:03 host-116 lvm[1795]: nonsyncd_secondary_3legs_1 is consistent. Nothing to repair.

Scenario kill_secondary_non_synced_3_legs: Kill secondary leg of non synced 3 leg mirror(s)
********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_3legs_1
* sync:               0
* striped:            0
* leg devices:        /dev/sdh1 /dev/sdd1 /dev/sda1
* log devices:        /dev/sdc1
* no MDA devices:
* failpv(s):          /dev/sdd1
* failnode(s):        host-116.virt.lab.msp.redhat.com
* lvmetad:            1
* leg fault policy:   remove
* log fault policy:   remove
******************************************************
Creating mirror(s) on host-116.virt.lab.msp.redhat.com...
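The telltale pattern above can be pulled out of a log mechanically. A minimal sketch, using a scratch file with a made-up path and a small excerpt of the messages quoted above (on the failing node the input would be /var/log/messages):

```shell
# Write a tiny sample of the messages above to a scratch file
# (hypothetical path, illustrative excerpt only).
cat > /tmp/lvm-failure.log <<'EOF'
Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address
Feb 12 10:48:03 host-116 lvm[1795]: nonsyncd_secondary_3legs_1 is consistent. Nothing to repair.
EOF

# List every PV that lvm still "sees" but reports at 0 sectors --
# the signature of a failed device being treated as merely resized.
grep -o 'Device [^ ]* has size of 0 sectors' /tmp/lvm-failure.log |
    awk '{ print $2 }' | sort -u
# -> /dev/sdd1
```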
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror -m 2 -n nonsyncd_secondary_3legs_1 -L 3G helter_skelter /dev/sdh1:0-2400 /dev/sdd1:0-2400 /dev/sda1:0-2400 /dev/sdc1:0-150

Current mirror/raid device structure(s):
  LV                                    Attr       LSize Cpy%Sync Devices
  nonsyncd_secondary_3legs_1            mwi-a-m--- 3.00g 0.65     nonsyncd_secondary_3legs_1_mimage_0(0),nonsyncd_secondary_3legs_1_mimage_1(0),nonsyncd_secondary_3legs_1_mimage_2(0)
  [nonsyncd_secondary_3legs_1_mimage_0] Iwi-aom--- 3.00g          /dev/sdh1(0)
  [nonsyncd_secondary_3legs_1_mimage_1] Iwi-aom--- 3.00g          /dev/sdd1(0)
  [nonsyncd_secondary_3legs_1_mimage_2] Iwi-aom--- 3.00g          /dev/sda1(0)
  [nonsyncd_secondary_3legs_1_mlog]     lwi-aom--- 4.00m          /dev/sdc1(0)

Creating ext on top of mirror(s) on host-116.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-116.virt.lab.msp.redhat.com...

PV=/dev/sdd1
        nonsyncd_secondary_3legs_1_mimage_1: 6
PV=/dev/sdd1
        nonsyncd_secondary_3legs_1_mimage_1: 6

Writing verification files (checkit) to mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----
<start name="host-116.virt.lab.msp.redhat.com_nonsyncd_secondary_3legs_1" pid="6908" time="Fri Feb 12 10:46:11 2016" type="cmd" />

Verifying files (checkit) on mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----

Current sync percent just before failure ( 11.26% )

Disabling device sdd on host-116.virt.lab.msp.redhat.com
rescan device...
<fail name="host-116.virt.lab.msp.redhat.com_nonsyncd_secondary_3legs_1" pid="6908" time="Fri Feb 12 10:46:21 2016" type="cmd" duration="10" ec="1" />
ALL STOP!
/dev/sdd1: read failed after 0 of 512 at 16104947712: Input/output error /dev/sdd1: read failed after 0 of 512 at 16105054208: Input/output error /dev/sdd1: read failed after 0 of 512 at 0: Input/output error /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error Getting recovery check start time from /var/log/messages: Feb 12 10:46 Attempting I/O to cause mirror down conversion(s) on host-116.virt.lab.msp.redhat.com dd if=/dev/zero of=/mnt/nonsyncd_secondary_3legs_1/ddfile count=10 bs=4M dd: opening `/mnt/nonsyncd_secondary_3legs_1/ddfile': Read-only file system [root@host-116 ~]# lvs -a -o +devices /dev/sdd1: open failed: No such device or address Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized? One or more devices used as PVs in VG helter_skelter have changed sizes. LV VG Attr LSize Log Cpy%Sync Devices nonsyncd_secondary_3legs_1 helter_skelter mwi-aom--- 3.00g [nonsyncd_secondary_3legs_1_mlog] 100.00 nonsyncd_secondary_3legs_1_mimage_0(0),nonsyncd_secondary_3legs_1_mimage_2(0) [nonsyncd_secondary_3legs_1_mimage_0] helter_skelter iwi-aom--- 3.00g /dev/sdh1(0) [nonsyncd_secondary_3legs_1_mimage_2] helter_skelter iwi-aom--- 3.00g /dev/sda1(0) [nonsyncd_secondary_3legs_1_mlog] helter_skelter lwi-aom--- 4.00m /dev/sdc1(0) Feb 12 10:46:20 host-116 qarshd[4122]: Running cmdline: echo offline > /sys/block/sdd/device/state Feb 12 10:46:20 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:20 host-116 kernel: device-mapper: raid1: Write error during recovery (error = 0x1) Feb 12 10:46:20 host-116 lvm[1795]: Secondary mirror device 253:4 sync failed. 
Feb 12 10:46:20 host-116 kernel: device-mapper: raid1: Write error during recovery (error = 0x1) Feb 12 10:46:20 host-116 kernel: device-mapper: raid1: Write error during recovery (error = 0x1) Feb 12 10:46:20 host-116 kernel: device-mapper: raid1: Write error during recovery (error = 0x1) Feb 12 10:46:20 host-116 qarshd[4124]: Running cmdline: pvscan --cache /dev/sdd1 Feb 12 10:46:20 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:20 host-116 kernel: device-mapper: raid1: Write error during recovery (error = 0x1) Feb 12 10:46:20 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:20 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:20 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 lvm[1795]: Secondary mirror device 253:4 sync failed. Feb 12 10:46:21 host-116 lvm[1795]: Secondary mirror device 253:4 has failed (D). Feb 12 10:46:21 host-116 lvm[1795]: Device failure in helter_skelter-nonsyncd_secondary_3legs_1. Feb 12 10:46:21 host-116 lvm[1795]: /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error Feb 12 10:46:21 host-116 lvm[1795]: No PV label found on /dev/sdd1. Feb 12 10:46:21 host-116 lvm[1795]: WARNING: Device for PV eRz6z2-UT3p-g8dA-SrzG-41UQ-NhbR-m0SrVw not found or rejected by a filter. Feb 12 10:46:21 host-116 lvm[1795]: /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error Feb 12 10:46:21 host-116 lvm[1795]: WARNING: Failed to write an MDA of VG helter_skelter. 
Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 kernel: lost page write due to I/O error on dm-6 Feb 12 10:46:21 host-116 kernel: Buffer I/O error on device dm-6, logical block 6152 Feb 12 10:46:21 host-116 kernel: lost page write due to I/O error on dm-6 Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 kernel: EXT3-fs: ext3_journal_dirty_data: aborting transaction: IO failure in ext3_journal_dirty_data Feb 12 10:46:21 host-116 kernel: EXT3-fs (dm-6): error in ext3_ordered_write_end: IO failure Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 kernel: EXT3-fs (dm-6): I/O error while writing superblock Feb 12 10:46:21 host-116 kernel: JBD: Detected IO errors while flushing file data on dm-6 Feb 12 10:46:21 host-116 kernel: Aborting journal on device dm-6. Feb 12 10:46:21 host-116 kernel: JBD: I/O error detected when updating journal superblock for dm-6. Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 lvm[1795]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 multipathd: dm-4: remove map (uevent) Feb 12 10:46:21 host-116 multipathd: dm-4: devmap not registered, can't remove Feb 12 10:46:21 host-116 kernel: sd 5:0:0:1: rejecting I/O to offline device Feb 12 10:46:21 host-116 lvm[1795]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 12 10:46:21 host-116 lvm[1795]: Mirror status: 1 of 3 images failed. 
Feb 12 10:46:27 host-116 qarshd[4187]: Running cmdline: dd if=/dev/zero of=/mnt/nonsyncd_secondary_3legs_1/ddfile count=10 bs=4M Feb 12 10:46:27 host-116 kernel: EXT3-fs (dm-6): error: ext3_journal_start_sb: Detected aborted journal Feb 12 10:46:27 host-116 kernel: EXT3-fs (dm-6): error: remounting filesystem read-only Feb 12 10:46:27 host-116 xinetd[2017]: EXIT: qarsh status=0 pid=4187 duration=0(sec) Feb 12 10:48:03 host-116 lvm[1795]: Secondary mirror device 253:4 sync failed. Feb 12 10:48:03 host-116 lvm[1795]: Secondary mirror device 253:4 has failed (D). Feb 12 10:48:03 host-116 lvm[1795]: Device failure in helter_skelter-nonsyncd_secondary_3legs_1. Feb 12 10:48:03 host-116 lvm[1795]: WARNING: Device for PV eRz6z2-UT3p-g8dA-SrzG-41UQ-NhbR-m0SrVw not found or rejected by a filter. Feb 12 10:48:03 host-116 lvm[1795]: WARNING: Device for PV eRz6z2-UT3p-g8dA-SrzG-41UQ-NhbR-m0SrVw not found or rejected by a filter. Feb 12 10:48:03 host-116 lvm[1795]: Missing device /dev/sdd1 reappeared, updating metadata for VG helter_skelter to version 6. Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address Feb 12 10:48:03 host-116 lvm[1795]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized? Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG helter_skelter have changed sizes. Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized? 
Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG helter_skelter have changed sizes.
Feb 12 10:48:03 host-116 lvm[1795]: nonsyncd_secondary_3legs_1 is consistent. Nothing to repair.

Version-Release number of selected component (if applicable):
2.6.32-615.el6.x86_64
lvm2-2.02.141-2.el6                        BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6                   BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6                BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.66.el6                          BUILT: Mon Jan 18 02:42:20 CST 2016
device-mapper-1.02.115-2.el6               BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6          BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6         BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.1-1.el6  BUILT: Wed Feb 10 05:09:45 CST 2016
cmirror-2.02.141-2.el6                     BUILT: Wed Feb 10 07:49:03 CST 2016

How reproducible:
Every time

--- Additional comment from Heinz Mauelshagen on 2016-02-24 12:02:11 EST ---

Using test script:

lvcreate -y -L512 -m1 -nm tst
mkfs -t ext4 /dev/tst/m
fsck -fn /dev/tst/m
lvs tst/m
echo offline > /sys/block/sda/device/state
lvs tst/m
sleep 1
fsck -fn /dev/tst/m

and the "allocate" image fault policy: WFM (works for me). Output of the script showing replacement during partial sync:

Logical volume "m" created.
mke2fs 1.41.12 (17-May-2010) Discarding device blocks: done Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=0 blocks, Stripe width=0 blocks 32768 inodes, 131072 blocks 6553 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=134217728 4 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304 Writing inode tables: done Creating journal (4096 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 31 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/mapper/tst-m: 11/32768 files (0.0% non-contiguous), 6257/131072 blocks LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert m tst mwi-a-m--- 512.00m [m_mlog] 24.22 /dev/sda: open failed: No such device or address Couldn't find device with uuid GGOtu2-pwqj-lXs6-ShKU-juTe-iU2K-RTpPdq. LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert m tst mwi-a-m--- 512.00m [m_mlog] 3.12 fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/mapper/tst-m: 11/32768 files (0.0% non-contiguous), 6257/131072 blocks [root@rhel-6 ~]# lvs ^C [root@rhel-6 ~]# lvs -a -o+seg_pe_ranges tst Couldn't find device with uuid GGOtu2-pwqj-lXs6-ShKU-juTe-iU2K-RTpPdq. 
  LV           VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log      Cpy%Sync Convert PE Ranges
  m            tst  mwi-a-m--- 512.00m                               [m_mlog]  100.00           m_mimage_0:0-127 m_mimage_1:0-127
  [m_mimage_0] tst  iwi-aom--- 512.00m                                                          /dev/sdb:0-127
  [m_mimage_1] tst  iwi-aom--- 512.00m                                                          /dev/sdc:0-127
  [m_mlog]     tst  lwi-aom--- 4.00m                                                            /dev/sdh:0-0

--- Additional comment from Corey Marthaler on 2016-02-24 17:39:35 EST ---

It appears that this is more difficult to reproduce with the latest kernel/userspace, but it still exists for me after a few iterations with both the remove and allocate policies.

2.6.32-621.el6.x86_64
lvm2-2.02.143-1.el6                            BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-libs-2.02.143-1.el6                       BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-cluster-2.02.143-1.el6                    BUILT: Wed Feb 24 07:59:50 CST 2016
udev-147-2.71.el6                              BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.117-1.el6                   BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-libs-1.02.117-1.el6              BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-1.02.117-1.el6             BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-libs-1.02.117-1.el6        BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc5.el6  BUILT: Wed Feb 24 07:07:09 CST 2016
cmirror-2.02.143-1.el6                         BUILT: Wed Feb 24 07:59:50 CST 2016

================================================================================
Iteration 0.2 started at Wed Feb 24 15:48:57 CST 2016
================================================================================
Scenario kill_secondary_non_synced_4_legs: Kill secondary leg of non synced 4 leg mirror(s)
********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_4legs_1
* sync:               0
* striped:            0
* leg devices:        /dev/sdd1 /dev/sdh1 /dev/sde1 /dev/sdc1
* log devices:        /dev/sdb1
* no MDA devices:
* failpv(s):          /dev/sdh1
* failnode(s):        host-137.virt.lab.msp.redhat.com
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   remove
******************************************************
Creating mirror(s) on
host-137.virt.lab.msp.redhat.com... host-137.virt.lab.msp.redhat.com: lvcreate --type mirror -m 3 -n nonsyncd_secondary_4legs_1 -L 3G helter_skelter /dev/sdd1:0-2400 /dev/sdh1:0-2400 /dev/sde1:0-2400 /dev/sdc1:0-2400 /dev/sdb1:0-150 Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices nonsyncd_secondary_4legs_1 mwi-a-m--- 3.00g 1.56 nonsyncd_secondary_4legs_1_mimage_0(0),nonsyncd_secondary_4legs_1_mimage_1(0),nonsyncd_secondary_4legs_1_mimage_2(0),nonsyncd_secondary_4legs_1_mimage_3(0) [nonsyncd_secondary_4legs_1_mimage_0] Iwi-aom--- 3.00g /dev/sdd1(0) [nonsyncd_secondary_4legs_1_mimage_1] Iwi-aom--- 3.00g /dev/sdh1(0) [nonsyncd_secondary_4legs_1_mimage_2] Iwi-aom--- 3.00g /dev/sde1(0) [nonsyncd_secondary_4legs_1_mimage_3] Iwi-aom--- 3.00g /dev/sdc1(0) [nonsyncd_secondary_4legs_1_mlog] lwi-aom--- 4.00m /dev/sdb1(0) Creating ext on top of mirror(s) on host-137.virt.lab.msp.redhat.com... mke2fs 1.41.12 (17-May-2010) Mounting mirrored ext filesystems on host-137.virt.lab.msp.redhat.com... PV=/dev/sdh1 nonsyncd_secondary_4legs_1_mimage_1: 6 PV=/dev/sdh1 nonsyncd_secondary_4legs_1_mimage_1: 6 Writing verification files (checkit) to mirror(s) on... ---- host-137.virt.lab.msp.redhat.com ---- Verifying files (checkit) on mirror(s) on... ---- host-137.virt.lab.msp.redhat.com ---- Current sync percent just before failure ( 36.21% ) Disabling device sdh on host-137.virt.lab.msp.redhat.com <fail name="host-137.virt.lab.msp.redhat.com_nonsyncd_secondary_4legs_1" pid="30771" time="Wed Feb 24 15:49:23 2016" type="cmd" duration="14" ec="1" /> ALL STOP! 
Getting recovery check start time from /var/log/messages: Feb 24 15:49 Attempting I/O to cause mirror down conversion(s) on host-137.virt.lab.msp.redhat.com dd if=/dev/zero of=/mnt/nonsyncd_secondary_4legs_1/ddfile count=10 bs=4M dd: opening `/mnt/nonsyncd_secondary_4legs_1/ddfile': Read-only file system couldn't write to nonsyncd_secondary_4legs_1 Feb 24 15:49:23 host-137 kernel: Buffer I/O error on device dm-7, logical block 2184 Feb 24 15:49:23 host-137 kernel: lost page write due to I/O error on dm-7 Feb 24 15:49:23 host-137 kernel: Buffer I/O error on device dm-7, logical block 2185 Feb 24 15:49:23 host-137 kernel: lost page write due to I/O error on dm-7 Feb 24 15:49:23 host-137 kernel: Buffer I/O error on device dm-7, logical block 2186 Feb 24 15:49:23 host-137 kernel: lost page write due to I/O error on dm-7 Feb 24 15:49:23 host-137 kernel: EXT3-fs: ext3_journal_dirty_data: aborting transaction: IO failure in ext3_journal_dirty_data Feb 24 15:49:23 host-137 kernel: EXT3-fs (dm-7): error in ext3_ordered_write_end: IO failure Feb 24 15:49:23 host-137 kernel: sd 8:0:0:1: rejecting I/O to offline device Feb 24 15:49:23 host-137 kernel: EXT3-fs (dm-7): I/O error while writing superblock Feb 24 15:49:23 host-137 kernel: JBD: Detected IO errors while flushing file data on dm-7 Feb 24 15:49:23 host-137 kernel: Aborting journal on device dm-7. Feb 24 15:49:23 host-137 multipathd: dm-8: remove map (uevent) Feb 24 15:49:23 host-137 multipathd: dm-8: devmap not registered, can't remove Feb 24 15:49:23 host-137 multipathd: dm-4: remove map (uevent) Feb 24 15:49:23 host-137 multipathd: dm-4: devmap not registered, can't remove Feb 24 15:49:23 host-137 lvm[4038]: Mirror status: 1 of 4 images failed. 
Feb 24 15:49:25 host-137 qarshd[4599]: Talking to peer 10.15.80.224:39244 (IPv4) Feb 24 15:49:25 host-137 qarshd[4599]: Running cmdline: pvs -a Feb 24 15:49:26 host-137 qarshd[4601]: Talking to peer 10.15.80.224:39250 (IPv4) Feb 24 15:49:26 host-137 qarshd[4601]: Running cmdline: tail -n1 /var/log/messages | cut -c 1-12 Feb 24 15:49:26 host-137 qarshd[4605]: Talking to peer 10.15.80.224:39252 (IPv4) Feb 24 15:49:26 host-137 qarshd[4605]: Running cmdline: dd if=/dev/zero of=/mnt/nonsyncd_secondary_4legs_1/ddfile count=10 bs=4M Feb 24 15:49:26 host-137 kernel: EXT3-fs (dm-7): error: ext3_journal_start_sb: Detected aborted journal Feb 24 15:49:26 host-137 kernel: EXT3-fs (dm-7): error: remounting filesystem read-only Feb 24 15:50:34 host-137 lvm[4038]: Secondary mirror device 253:4 sync failed. Feb 24 15:50:34 host-137 lvm[4038]: Secondary mirror device 253:4 has failed (D). Feb 24 15:50:34 host-137 lvm[4038]: Device failure in helter_skelter-nonsyncd_secondary_4legs_1. Feb 24 15:50:34 host-137 lvm[4038]: Couldn't find device with uuid Gy2BMu-0Pz2-G2tA-EUr7-9XqX-PZiU-4pIkLt. Feb 24 15:50:34 host-137 lvm[4038]: Couldn't find device with uuid Gy2BMu-0Pz2-G2tA-EUr7-9XqX-PZiU-4pIkLt. Feb 24 15:50:34 host-137 lvm[4038]: nonsyncd_secondary_4legs_1 is consistent. Nothing to repair. 
## Another reproduction:

================================================================================
Iteration 0.1 started at Wed Feb 24 15:45:08 CST 2016
================================================================================
Scenario kill_secondary_non_synced_core_log_2_legs: Kill secondary leg of non synced core log 2 leg mirror(s)
********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_core_2legs_1
* sync:               0
* striped:            0
* leg devices:        /dev/sdh1 /dev/sda1
* log devices:
* no MDA devices:
* failpv(s):          /dev/sda1
* failnode(s):        host-119.virt.lab.msp.redhat.com
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************
Creating mirror(s) on host-119.virt.lab.msp.redhat.com...
host-119.virt.lab.msp.redhat.com: lvcreate --type mirror --mirrorlog core -m 1 -n nonsyncd_secondary_core_2legs_1 -L 3G helter_skelter /dev/sdh1:0-2400 /dev/sda1:0-2400

Current mirror/raid device structure(s):
  LV                                         Attr       LSize Cpy%Sync Devices
  nonsyncd_secondary_core_2legs_1            mwi-a-m--- 3.00g 0.52     nonsyncd_secondary_core_2legs_1_mimage_0(0),nonsyncd_secondary_core_2legs_1_mimage_1(0)
  [nonsyncd_secondary_core_2legs_1_mimage_0] Iwi-aom--- 3.00g          /dev/sdh1(0)
  [nonsyncd_secondary_core_2legs_1_mimage_1] Iwi-aom--- 3.00g          /dev/sda1(0)

Creating ext on top of mirror(s) on host-119.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-119.virt.lab.msp.redhat.com...

PV=/dev/sda1
        nonsyncd_secondary_core_2legs_1_mimage_1: 5.1
PV=/dev/sda1
        nonsyncd_secondary_core_2legs_1_mimage_1: 5.1

Writing verification files (checkit) to mirror(s) on...
        ---- host-119.virt.lab.msp.redhat.com ----
<start name="host-119.virt.lab.msp.redhat.com_nonsyncd_secondary_core_2legs_1" pid="29121" time="Wed Feb 24 15:45:22 2016" type="cmd" />

Verifying files (checkit) on mirror(s) on...
---- host-119.virt.lab.msp.redhat.com ---- Current sync percent just before failure ( 14.94% ) Disabling device sda on host-119.virt.lab.msp.redhat.com <fail name="host-119.virt.lab.msp.redhat.com_nonsyncd_secondary_core_2legs_1" pid="29121" time="Wed Feb 24 15:45:32 2016" type="cmd" duration="10" ec="1" /> ALL STOP! Getting recovery check start time from /var/log/messages: Feb 24 15:45 Attempting I/O to cause mirror down conversion(s) on host-119.virt.lab.msp.redhat.com dd if=/dev/zero of=/mnt/nonsyncd_secondary_core_2legs_1/ddfile count=10 bs=4M dd: opening `/mnt/nonsyncd_secondary_core_2legs_1/ddfile': Read-only file system couldn't write to nonsyncd_secondary_core_2legs_1 FI_engine: recover() method failed --- Additional comment from Corey Marthaler on 2016-02-24 18:23:35 EST --- Saw a similar error with a log device failure. ================================================================================ Iteration 0.8 started at Wed Feb 24 17:02:47 CST 2016 ================================================================================ Scenario kill_log_synced_2_legs: Kill disk log of synced 2 leg mirror(s) ********* Mirror hash info for this scenario ********* * names: syncd_log_2legs_1 * sync: 1 * striped: 0 * leg devices: /dev/sde1 /dev/sdf1 * log devices: /dev/sdg1 * no MDA devices: * failpv(s): /dev/sdg1 * failnode(s): host-115.virt.lab.msp.redhat.com * lvmetad: 0 * leg fault policy: allocate * log fault policy: allocate ****************************************************** Creating mirror(s) on host-115.virt.lab.msp.redhat.com... 
host-115.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_log_2legs_1 -L 500M helter_skelter /dev/sde1:0-2400 /dev/sdf1:0-2400 /dev/sdg1:0-150

Current mirror/raid device structure(s):
  LV                           Attr       LSize   Cpy%Sync Devices
  syncd_log_2legs_1            mwi-a-m--- 500.00m 14.40    syncd_log_2legs_1_mimage_0(0),syncd_log_2legs_1_mimage_1(0)
  [syncd_log_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(0)
  [syncd_log_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdf1(0)
  [syncd_log_2legs_1_mlog]     lwi-aom--- 4.00m            /dev/sdg1(0)

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-115.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-115.virt.lab.msp.redhat.com...

PV=/dev/sdg1
        syncd_log_2legs_1_mlog: 1.2
PV=/dev/sdg1
        syncd_log_2legs_1_mlog: 1.2

Writing verification files (checkit) to mirror(s) on...
        ---- host-115.virt.lab.msp.redhat.com ----
<start name="host-115.virt.lab.msp.redhat.com_syncd_log_2legs_1" pid="17217" time="Wed Feb 24 17:03:29 2016" type="cmd" />

Sleeping 15 seconds to get some outstanding I/O locks before the failure

Verifying files (checkit) on mirror(s) on...
---- host-115.virt.lab.msp.redhat.com ---- Disabling device sdg on host-115.virt.lab.msp.redhat.com Getting recovery check start time from /var/log/messages: Feb 24 17:03 Attempting I/O to cause mirror down conversion(s) on host-115.virt.lab.msp.redhat.com dd if=/dev/zero of=/mnt/syncd_log_2legs_1/ddfile count=10 bs=4M dd: opening `/mnt/syncd_log_2legs_1/ddfile': Invalid argument couldn't write to syncd_log_2legs_1 EXT3-fs error (device dm-5): ext3_xattr_block_get: inode 12: bad block 319745 EXT3-fs error (device dm-5): ext3_xattr_block_get: inode 12: bad block 319745 EXT3-fs error (device dm-5): ext3_xattr_block_get: inode 12: bad block 319745 --- Additional comment from Corey Marthaler on 2016-02-25 12:39:26 EST --- Changing the summary to reflect that this affects all types of mirrors, synced or non synced. All the machines I had running this test on over night using synced mirrors eventually ended up failing with this issue, even if it took many iterations to get there. Attaching test output and log from one of the test runs. It appears that maybe the filesystems are being shut down before lvm has time to properly repair the mirror since in most cases the mirrors do appear to be fixed. 
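One way to quantify the "repair came too late" hypothesis is to compare timestamps in the log: the filesystem's read-only remount versus dmeventd's later verdict. A rough sketch against a canned two-line excerpt (the file path is made up; the sample lines are taken from the messages quoted in the original report):

```shell
# Two key events from the log quoted in comment #0 (illustrative excerpt).
cat > /tmp/messages.excerpt <<'EOF'
Feb 12 10:46:27 host-116 kernel: EXT3-fs (dm-6): error: remounting filesystem read-only
Feb 12 10:48:03 host-116 lvm[1795]: nonsyncd_secondary_3legs_1 is consistent. Nothing to repair.
EOF

# Convert the syslog HH:MM:SS fields to seconds and print the gap between
# the filesystem going read-only and the "nothing to repair" decision.
awk '
/remounting filesystem read-only/ { split($3, a, ":"); ro  = a[1]*3600 + a[2]*60 + a[3] }
/Nothing to repair/               { split($3, b, ":"); fix = b[1]*3600 + b[2]*60 + b[3] }
END { printf "repair check lagged the read-only remount by %ds\n", fix - ro }
' /tmp/messages.excerpt
# -> repair check lagged the read-only remount by 96s
```

Against the real /var/log/messages this would need date handling across midnight, but for a single failure window the minute-scale gap is the interesting number.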
2.6.32-616.el6.x86_64
lvm2-2.02.143-1.el6                            BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-libs-2.02.143-1.el6                       BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-cluster-2.02.143-1.el6                    BUILT: Wed Feb 24 07:59:50 CST 2016
udev-147-2.71.el6                              BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.117-1.el6                   BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-libs-1.02.117-1.el6              BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-1.02.117-1.el6             BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-libs-1.02.117-1.el6        BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc5.el6  BUILT: Wed Feb 24 07:07:09 CST 2016
cmirror-2.02.143-1.el6                         BUILT: Wed Feb 24 07:59:50 CST 2016

--- Additional comment from Heinz Mauelshagen on 2016-03-04 11:03:24 EST ---

Getting plenty of "rejecting I/O to offline device" messages here:

...
sd 2:0:0:1: rejecting I/O to offline device
sd 2:0:0:1: rejecting I/O to offline device
sd 2:0:0:1: rejecting I/O to offline device
sd 2:0:0:1: rejecting I/O to offline device
sd 2:0:0:1: rejecting I/O to offline device
JBD2: Detected IO errors while flushing file data on dm-6-8
Aborting journal on device dm-6-8.
journal commit I/O error
EXT4-fs error (device dm-6): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-6): Remounting filesystem read-only
EXT4-fs error (device dm-6): ext4_put_super: Couldn't clean up the journal

The journal abort, followed by the read-only remount, should never have happened. The "mirror" target is behaving severely broken!

--- Additional comment from Heinz Mauelshagen on 2016-03-04 11:06:32 EST ---

Corey, even without "lvconvert --repair ...", I/O errors get promoted to JBD2. That confirms a severe flaw in the "mirror" target.

--- Additional comment from Heinz Mauelshagen on 2016-03-07 08:18:45 EST ---

This bz is reproducible neither upstream nor in RHEL7. Analyzing the diffs in dm-raid1.c/dm-log.c/dm-region-hash.c does not show any relevant changes.
--- Additional comment from Alasdair Kergon on 2016-03-10 02:42:20 EST ---

Not necessarily related to the problem, but this sequence of messages is at least wrong:

Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device or address
Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?

--- Additional comment from Peter Rajnoha on 2016-03-10 03:01:37 EST ---

(In reply to Corey Marthaler from comment #0)
> Description of problem:
> I'm guessing this is related to, if not the same as bug 1304045.
>
> Non sync'ed mirrors experiencing device failures are not being properly
> repaired. This test case works fine in both 6.7 and 7.2. Could it be that
> failed PVs are being recognized as still existing, but having 0 sectors
> available?
>
> I see this in all the failures:
> Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors
> which is smaller than corresponding PV size of 31455207 sectors. Was device
> resized?
> Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG
> helter_skelter have changed sizes.
> Feb 12 10:48:03 host-116 lvm[1795]: /dev/sdd1: open failed: No such device
> or address
> Feb 12 10:48:03 host-116 lvm[1795]: Device /dev/sdd1 has size of 0 sectors
> which is smaller than corresponding PV size of 31455207 sectors. Was device
> resized?
> Feb 12 10:48:03 host-116 lvm[1795]: One or more devices used as PVs in VG
> helter_skelter have changed sizes.

Note: these warning messages about the size difference are a new addition in RHEL6 - it's new code that checks actual PV sizes against the sizes recorded in metadata. But it's only a warning - it doesn't otherwise change the usual behaviour of the code (there is no condition based on this check that would change the flow of the code).
We're checking whether the PV is marked as missing with a flag before doing the size checks - we should probably also add a check that skips this warning when the device open fails and we therefore can't get the device size. But otherwise, ignore these size warnings for now - they shouldn't be causing the bug as reported.

--- Additional comment from Heinz Mauelshagen on 2016-03-21 12:47:37 EDT ---

I found a long-standing flaw in dm-raid1.c: hold_bio() errors the bio erroneously in the !DM_NOFLUSH_SUSPENDING case. The fix works; testing for side effects.

--- Additional comment from Heinz Mauelshagen on 2016-03-23 15:35:38 EDT ---

Test passed ~200 iterations OK -> fix valid.

--- Additional comment from Corey Marthaler on 2016-03-23 16:53:35 EDT ---

We tested this patched kernel and it seems to fix the problem.
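The guard Peter proposes - skip the size comparison entirely when the device can't even be opened - can be sketched in shell. This is not LVM code: `get_dev_sectors` is a made-up stand-in for LVM's internal device-size lookup, and the device names and sector counts are illustrative.

```shell
# Stand-in for LVM's device-size lookup: fails (like an open() on a dead
# device) for anything not in the table. Devices and sizes are made up.
get_dev_sectors() {
    case "$1" in
        /dev/sdh1) echo 31455207 ;;
        *) return 1 ;;   # open failed: no size available
    esac
}

# Proposed behaviour: only compare sizes when a real size could be read,
# so a dead PV is not misreported as "resized to 0 sectors".
check_pv_size() {
    actual=$(get_dev_sectors "$1") || return 0   # guard: skip the warning
    if [ "$actual" -lt "$2" ]; then
        echo "WARNING: $1 has $actual sectors, smaller than PV size of $2"
    fi
    return 0
}

check_pv_size /dev/sdd1 31455207   # dead device: prints nothing under the guard
check_pv_size /dev/sdh1 31455207   # healthy device, size matches: silent
```

With the guard in place, a genuinely shrunken device still warns, while an unopenable one is left to the existing missing-PV handling.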
Bug cloned, because the same issue as in RHEL6 kernel applies to RHEL7 kernel.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
This bug was not fixed via a kernel patch. We provided an lvm2 userspace regression fix via these two upstream commits:

https://www.redhat.com/archives/lvm-devel/2016-April/msg00011.html
https://www.redhat.com/archives/lvm-devel/2016-April/msg00012.html

A third commit also fixes a possible problem, but it is likely unneeded, as it may be impossible to hit with the current stacking logic:

https://www.redhat.com/archives/lvm-devel/2016-April/msg00013.html

This fix makes automatic mirror recovery work again. Without it, the filesystem may shut down.
(In reply to Zdenek Kabelac from comment #6)
> This bug was not fixed via kernel patch.
>
> We provided lvm2 user space regression fix via these two upstream commits:
>
> https://www.redhat.com/archives/lvm-devel/2016-April/msg00011.html
> https://www.redhat.com/archives/lvm-devel/2016-April/msg00012.html
>
> this 3rd. commit also fixes possible problem but likely unneeded as it might
> be impossible to hit with current stacking logic:
>
> https://www.redhat.com/archives/lvm-devel/2016-April/msg00013.html
>
> This fix makes again automatic mirror recovery working.
> Without out filesystem may shutdown.

We already have lvm2-2.02.152-1.el7 built, which includes these patches (they come from lvm2 v2.02.150), so I'm adding this to the errata.
Verified. Running the mirror failure test a hundred times passed with both the "remove" and "allocate" fault policies.

3.10.0-445.el7.x86_64
lvm2-2.02.156-1.el7                          BUILT: Mon Jun 13 10:05:51 CEST 2016
lvm2-libs-2.02.156-1.el7                     BUILT: Mon Jun 13 10:05:51 CEST 2016
lvm2-cluster-2.02.156-1.el7                  BUILT: Mon Jun 13 10:05:51 CEST 2016
device-mapper-1.02.126-1.el7                 BUILT: Mon Jun 13 10:05:51 CEST 2016
device-mapper-libs-1.02.126-1.el7            BUILT: Mon Jun 13 10:05:51 CEST 2016
device-mapper-event-1.02.126-1.el7           BUILT: Mon Jun 13 10:05:51 CEST 2016
device-mapper-event-libs-1.02.126-1.el7      BUILT: Mon Jun 13 10:05:51 CEST 2016
device-mapper-persistent-data-0.6.2-0.1.rc8.el7  BUILT: Wed May  4 09:56:34 CEST 2016
cmirror-2.02.156-1.el7                       BUILT: Mon Jun 13 10:05:51 CEST 2016
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1445.html