Bug 1304045

Summary:          type mirror no longer able to survive device failure
Product:          Red Hat Enterprise Linux 6
Version:          6.8
Component:        lvm2
Sub component:    Mirroring and RAID (RHEL6)
Hardware:         x86_64
OS:               Linux
Status:           CLOSED WORKSFORME
Severity:         high
Priority:         unspecified
Keywords:         Regression, TestBlocker
Target Milestone: rc
Reporter:         Corey Marthaler <cmarthal>
Assignee:         Heinz Mauelshagen <heinzm>
QA Contact:       cluster-qe <cluster-qe>
CC:               agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2016-02-29 16:02:04 UTC
Attachments:
        log from host-115
        log from host-116

Description Corey Marthaler 2016-02-02 18:03:46 UTC
Description of problem:
This is the most basic cmirror device failure case and it causes everything to go south.
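
For reference, the harness run below boils down to roughly the following; this is a sketch, with LV/mount names and the failed device assumed, while the real run pins allocation to per-PV extent ranges as shown further down:

    # assumes VG helter_skelter already exists (all names illustrative)
    lvcreate --type mirror -m 1 -n mirror_1 -L 500M helter_skelter
    lvs -a -o +devices helter_skelter             # wait for Cpy%Sync to reach 100.00
    mkfs.gfs2 -p lock_dlm -t STSRHTS3987:gfs1 -j 3 /dev/helter_skelter/mirror_1
    mount /dev/helter_skelter/mirror_1 /mnt/mirror_1
    echo offline > /sys/block/sdh/device/state    # fail the primary leg (device assumed)
    dd if=/dev/zero of=/mnt/mirror_1/ddfile count=10 bs=4M   # I/O triggers the down-conversion
    lvs -a -o +devices helter_skelter             # expect a linear LV on the surviving leg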


./helter_skelter -l /usr/tests/sts-rhel6.8 -r /usr/tests/sts-rhel6.8 -R ../../../resource-STSRHTS3987.xml -i 1 -e kill_primary_synced_2_legs                                                                                                       
8 disk(s) to be used:                                                                                                                                                                                                                              
        host-123=/dev/sdc /dev/sdh /dev/sdg /dev/sdf /dev/sda /dev/sde /dev/sdd /dev/sdb                                                                                                                                                                      
        host-124=/dev/sdb /dev/sda /dev/sdh /dev/sdd /dev/sdg /dev/sde /dev/sdf /dev/sdc                                                                                                                                                                      
        host-125=/dev/sda /dev/sde /dev/sdh /dev/sdf /dev/sdc /dev/sdg /dev/sdb /dev/sdd                                                                                                                                                                           

on host-123...
dicing /dev/sdc into 1... 
dicing /dev/sdh into 1... 
dicing /dev/sdg into 1... 
dicing /dev/sdf into 1... 
dicing /dev/sda into 1... 
dicing /dev/sde into 1... 
dicing /dev/sdd into 1... 
dicing /dev/sdb into 1... 
re-reading disks on host-125...
re-reading disks on host-123...
re-reading disks on host-124...
Zeroing out the new partitions.../dev/sdc1.../dev/sdh1.../dev/sdg1.../dev/sdf1.../dev/sda1.../dev/sde1.../dev/sdd1.../dev/sdb1...
Reload the device cache on all nodes...

creating lvm devices...
Create 7 PV(s) for helter_skelter on host-123
Create VG helter_skelter on host-123
================================================================================
Iteration 0.1 started at Tue Feb  2 11:48:34 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdh1 /dev/sdf1
* log devices:        /dev/sdb1
* no MDA devices:     
* failpv(s):          /dev/sdh1
* failnode(s):        host-123 host-124 host-125
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************
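
For reference, the leg/log fault policies above correspond to the dmeventd repair handling configured in lvm.conf; an illustrative snippet:

    # /etc/lvm/lvm.conf (activation section)
    activation {
        mirror_image_fault_policy = "remove"      # failed leg: down-convert instead of re-allocating
        mirror_log_fault_policy   = "allocate"    # failed log: re-allocate it on another PV
    }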

Creating mirror(s) on host-123...
host-123: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdh1:0-2400 /dev/sdf1:0-2400 /dev/sdb1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices                                                            
   syncd_primary_2legs_1            mwi-a-m--- 500.00m 5.60     syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
   [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdh1(0)                                                       
   [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdf1(0)                                                       
   [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdb1(0)                                                       

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 74.80% )
   1/1 mirror(s) are fully synced: ( 100.00% )
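
(The sync wait amounts to polling copy_percent; a minimal equivalent for the LV above:)

    while [ "$(lvs --noheadings -o copy_percent helter_skelter/syncd_primary_2legs_1 | tr -d ' ')" != "100.00" ]; do
        sleep 5    # re-check until the mirror reports fully synced
    done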

Creating gfs2 on top of mirror(s) on host-123...
Mounting mirrored gfs2 filesystems on host-123...
Mounting mirrored gfs2 filesystems on host-124...
Mounting mirrored gfs2 filesystems on host-125...

PV=/dev/sdh1
        syncd_primary_2legs_1_mimage_0: 6
PV=/dev/sdh1

        syncd_primary_2legs_1_mimage_0: 6
Writing verification files (checkit) to mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
        ---- host-125 ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
        ---- host-125 ----

Disabling device sdh on host-123
Disabling device sda on host-124
Disabling device sde on host-125
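
(Per the host-124 log further down, the harness offlines each device through sysfs; a minimal equivalent, device name taken from above:)

    echo offline > /sys/block/sdh/device/state    # kernel rejects all further I/O to the device
    # echo running > /sys/block/sdh/device/state  # would bring the device back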

Getting recovery check start time from /var/log/messages: Feb  2 11:50
Attempting I/O to cause mirror down conversion(s) on host-123
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.272817 s, 154 MB/s
Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
  LV                    Attr       LSize   Cpy%Sync Devices        
   syncd_primary_2legs_1 -wi-ao---- 500.00m          /dev/sdf1(0)   


Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
olog: 4
Verifying LOG device(s) /dev/sdb1 are *NOT* in the mirror/linear(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
Verifying LEG device /dev/sdf1 *IS* in the volume(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
verify the dm devices associated with /dev/sdh1 have been removed as expected
Checking REMOVAL of syncd_primary_2legs_1_mimage_0 on:  host-123 host-124 host-125
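
(The removal check is essentially the dmsetup query visible in the host-124 log below, run on each node:)

    dmsetup ls | grep syncd_primary_2legs_1_mimage_0   # no output expected once the leg is removed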

Verifying files (checkit) on mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
Can not open ufyybfhhhrcmxqbbiuexuqrwvl: Input/output error
<fail name="host-124_syncd_primary_2legs_1" pid="24757" time="Tue Feb  2 11:50:23 2016" type="cmd" duration="55" ec="1" />
ALL STOP!
checkit write verify failed

Feb  2 11:49:23 host-124 qarshd[3931]: Running cmdline: mount /dev/helter_skelter/syncd_primary_2legs_1 /mnt/syncd_primary_2legs_1
Feb  2 11:49:23 host-124 kernel: Slow work thread pool: Starting up
Feb  2 11:49:23 host-124 kernel: Slow work thread pool: Ready
Feb  2 11:49:23 host-124 kernel: GFS2 (built Jan 28 2016 16:50:00) installed
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS3987:gfs1"
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: Joined cluster. Now mounting FS...
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1, already locked for use
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1: Looking at journal...
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1: Done
Feb  2 11:49:26 host-124 qarshd[3952]: Running cmdline: mkdir -p /mnt/syncd_primary_2legs_1/checkit
Feb  2 11:49:27 host-124 qarshd[3954]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -n 100
Feb  2 11:49:28 host-124 qarshd[3956]: Running cmdline: /usr/tests/sts-rhel6.8/bin/xiogen -f buffered -m random -s write,writev -t 100 -T 1000000  -F 1000000:/mnt/syncd_primary_2legs_1/host-124_load | /usr/tests/sts-rhel6.8/bin/xdoio -v
Feb  2 11:49:51 host-124 qarshd[3962]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -v
Feb  2 11:49:58 host-124 qarshd[3964]: Running cmdline: echo offline > /sys/block/sda/device/state
Feb  2 11:49:59 host-124 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb  2 11:49:59 host-124 lvm[2846]: Primary mirror device 253:3 has failed (D).
Feb  2 11:49:59 host-124 lvm[2846]: Device failure in helter_skelter-syncd_primary_2legs_1.
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 16104947712: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 16105054208: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 0: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 4096: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25436
Feb  2 11:50:00 host-124 kernel: lost page write due to I/O error on dm-5
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25437
Feb  2 11:50:00 host-124 kernel: lost page write due to I/O error on dm-5
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25523
Feb  2 11:50:01 host-124 lvm[2846]: Mirror status: 1 of 2 images failed.
Feb  2 11:50:12 host-124 qarshd[4017]: Running cmdline: dmsetup ls | grep syncd_primary_2legs_1_mimage_0
Feb  2 11:50:21 host-124 qarshd[4021]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -v
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: fatal: I/O error
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1:   block = 32251
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1:   function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 199
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: about to withdraw this file system
Feb  2 11:50:24 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: telling LM to unmount
Feb  2 11:50:24 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: withdrawn
Feb  2 11:50:24 host-124 kernel: Pid: 3948, comm: gfs2_logd Not tainted 2.6.32-610.el6.x86_64 #1
Feb  2 11:50:24 host-124 kernel: Call Trace:
Feb  2 11:50:24 host-124 kernel: [<ffffffffa04b0fa8>] ? gfs2_lm_withdraw+0x128/0x160 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff81065990>] ? __dequeue_entity+0x30/0x50
Feb  2 11:50:24 host-124 kernel: [<ffffffffa04b1020>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff8108f260>] ? process_timeout+0x0/0x10
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0497a2c>] ? gfs2_ail1_empty+0x11c/0x1b0 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0498f2c>] ? gfs2_logd+0x7c/0x140 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0498eb0>] ? gfs2_logd+0x0/0x140 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff810a662e>] ? kthread+0x9e/0xc0
Feb  2 11:50:24 host-124 kernel: [<ffffffffa045eff0>] ? process_send_sockets+0x0/0x260 [dlm]
Feb  2 11:50:24 host-124 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Feb  2 11:50:24 host-124 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Feb  2 11:50:24 host-124 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Feb  2 11:52:35 host-124 kernel: INFO: task gfs2_logd:3948 blocked for more than 120 seconds.
Feb  2 11:52:35 host-124 kernel:      Not tainted 2.6.32-610.el6.x86_64 #1
Feb  2 11:52:35 host-124 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  2 11:52:35 host-124 kernel: gfs2_logd     D 0000000000000000     0  3948      2 0x00000080
Feb  2 11:52:35 host-124 kernel: ffff88003c777c10 0000000000000046 ffff880000000000 0000abd9b1dfc40f
Feb  2 11:52:35 host-124 kernel: ffff88003c777bd0 ffff88003c78d2f0 000000000003c33e ffffffffa94f14fd
Feb  2 11:52:35 host-124 kernel: 00000000207aa707 ffffffff81aa8340 ffff880037af5ad8 ffff88003c777fd8
Feb  2 11:52:35 host-124 kernel: Call Trace:
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff81545853>] io_schedule+0x73/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2dd>] sync_page+0x3d/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff8154633f>] __wait_on_bit+0x5f/0x90
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e513>] wait_on_page_bit+0x73/0x80
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a6b40>] ? wake_bit_function+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff81144635>] ? pagevec_lookup_tag+0x25/0x40
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e93b>] wait_on_page_writeback_range+0xfb/0x190
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e9ff>] filemap_fdatawait+0x2f/0x40
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498b09>] gfs2_log_flush+0x259/0x600 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498f89>] gfs2_logd+0xd9/0x140 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498eb0>] ? gfs2_logd+0x0/0x140 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a662e>] kthread+0x9e/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffffa045eff0>] ? process_send_sockets+0x0/0x260 [dlm]
Feb  2 11:52:35 host-124 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20



[root@host-124 ~]# lvs
[...]
stat("/dev/helter_skelter/syncd_primary_2legs_1", {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 5), ...}) = 0
open("/dev/helter_skelter/syncd_primary_2legs_1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 5), ...}) = 0
ioctl(5, BLKBSZGET, 4096)               = 0
ioctl(5, BLKPBSZGET, 512)               = 0
lseek(5, 0, SEEK_SET)                   = 0
read(5, 
(the read() never returns; lvs appears to hang here reading the mirror device)


Version-Release number of selected component (if applicable):
2.6.32-610.el6.x86_64

lvm2-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-libs-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-cluster-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
udev-147-2.69.el6    BUILT: Thu Jan 28 08:41:45 CST 2016
device-mapper-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-libs-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-libs-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-persistent-data-0.6.0-2.el6    BUILT: Thu Jan 21 02:40:25 CST 2016
cmirror-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016


How reproducible:
Every time

Comment 3 Corey Marthaler 2016-02-11 23:16:07 UTC
This isn't just a cluster mirror issue; many single-machine mirror failure cases also appear to be failing in 6.8.

I ran this same scenario on both 6.7 and 7.3, and both passed 10/10 iterations. On 6.8, however, it fails quickly.

2.6.32-615.el6.x86_64
lvm2-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.66.el6    BUILT: Mon Jan 18 02:42:20 CST 2016
device-mapper-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.1-1.el6    BUILT: Wed Feb 10 05:09:45 CST 2016
cmirror-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016


================================================================================
Iteration 4.1 started at Thu Feb 11 15:08:55 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdg1 /dev/sdb1
* log devices:        /dev/sde1
* no MDA devices:     
* failpv(s):          /dev/sdg1
* failnode(s):        host-115.virt.lab.msp.redhat.com
* lvmetad:            1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on host-115.virt.lab.msp.redhat.com...
host-115.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdg1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-150
host-115.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sdg1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-a-m--- 500.00m 15.20    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdg1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sde1(0)
  syncd_primary_2legs_2            mwi-a-m--- 500.00m 0.80     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] Iwi-aom--- 500.00m          /dev/sdg1(125)
  [syncd_primary_2legs_2_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sde1(1)

Waiting until all mirror|raid volumes become fully syncd...
   1/2 mirror(s) are fully synced: ( 100.00% 94.70% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on host-115.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-115.virt.lab.msp.redhat.com...

PV=/dev/sdg1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sdg1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1

Writing verification files (checkit) to mirror(s) on...
        ---- host-115.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-115.virt.lab.msp.redhat.com ----

Disabling device sdg on host-115.virt.lab.msp.redhat.com
rescan device...
  /dev/sdg1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error

Getting recovery check start time from /var/log/messages: Feb 11 15:10
Attempting I/O to cause mirror down conversion(s) on host-115.virt.lab.msp.redhat.com
dd if=/dev/zero of=/mnt/syncd_primary_2legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.225524 s, 186 MB/s
dd if=/dev/zero of=/mnt/syncd_primary_2legs_2/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.337796 s, 124 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  LV                               Attr       LSize   Cpy%Sync Devices                                                            
   syncd_primary_2legs_1            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
   [syncd_primary_2legs_1_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(0)                                                       
   [syncd_primary_2legs_1_mimage_1] iwi-aom--- 500.00m          /dev/sdc1(0)                                                       
   [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdf1(0)                                                       
   syncd_primary_2legs_2            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
   [syncd_primary_2legs_2_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(125)                                                     
   [syncd_primary_2legs_2_mimage_1] iwi-aom--- 500.00m          /dev/sdc1(125)                                                     


Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
olog: 1
Verifying LOG device(s) /dev/sde1 *ARE* in the mirror(s)
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
log device /dev/sde1 should still be present on host-115.virt.lab.msp.redhat.com

searching for *any* log after this failure...
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
NO LOG WAS FOUND for syncd_primary_2legs_2 on host-115.virt.lab.msp.redhat.com


[root@host-115 ~]# lvs -a -o +devices
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  LV                               VG             Attr       LSize   Log                          Cpy%Sync Devices
  syncd_primary_2legs_1            helter_skelter mwi-aom--- 500.00m [syncd_primary_2legs_1_mlog] 100.00   syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] helter_skelter iwi-aom--- 500.00m                                       /dev/sdb1(0)
  [syncd_primary_2legs_1_mimage_1] helter_skelter iwi-aom--- 500.00m                                       /dev/sdc1(0)
  [syncd_primary_2legs_1_mlog]     helter_skelter lwi-aom---   4.00m                                       /dev/sdf1(0)
  syncd_primary_2legs_2            helter_skelter mwi-aom--- 500.00m                              100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] helter_skelter iwi-aom--- 500.00m                                       /dev/sdb1(125)
  [syncd_primary_2legs_2_mimage_1] helter_skelter iwi-aom--- 500.00m                                       /dev/sdc1(125)

Comment 4 Corey Marthaler 2016-02-11 23:22:21 UTC
Created attachment 1123307 [details]
log from host-115

Comment 5 Corey Marthaler 2016-02-11 23:25:34 UTC
Another 6.8 failure, same test case:

================================================================================
Iteration 1.1 started at Thu Feb 11 16:46:54 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sde1 /dev/sdb1
* log devices:        /dev/sdc1
* no MDA devices:     
* failpv(s):          /dev/sde1
* failnode(s):        host-116.virt.lab.msp.redhat.com
* lvmetad:            1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on host-116.virt.lab.msp.redhat.com...
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sde1:0-2400 /dev/sdb1:0-2400 /dev/sdc1:0-150
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sde1:0-2400 /dev/sdb1:0-2400 /dev/sdc1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-a-m--- 500.00m 52.80    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdc1(0)
  syncd_primary_2legs_2            mwi-a-m--- 500.00m 5.60     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(125)
  [syncd_primary_2legs_2_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sdc1(1)

Waiting until all mirror|raid volumes become fully syncd...
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on host-116.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-116.virt.lab.msp.redhat.com...

PV=/dev/sde1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sde1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
Writing verification files (checkit) to mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----

Disabling device sde on host-116.virt.lab.msp.redhat.com
rescan device...
<fail name="host-116.virt.lab.msp.redhat.com_syncd_primary_2legs_2" pid="25029" time="Thu Feb 11 16:48:05 2016" type="cmd" duration="30" ec="1" />
ALL STOP!
  /dev/sde1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sde1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error

<stop name="host-116.virt.lab.msp.redhat.com_syncd_primary_2legs_1" pid="25028" time="Thu Feb 11 16:48:08 2016" type="cmd" duration="33" signal="2" />
Getting recovery check start time from /var/log/messages: Feb 11 16:48
Attempting I/O to cause mirror down conversion(s) on host-116.virt.lab.msp.redhat.com
dd if=/dev/zero of=/mnt/syncd_primary_2legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.127945 s, 328 MB/s

dd if=/dev/zero of=/mnt/syncd_primary_2legs_2/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.166799 s, 251 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-aom--- 500.00m 0.00     syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sde1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sda1(1)
  syncd_primary_2legs_2            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mimage_1] iwi-aom--- 500.00m          /dev/sdd1(0)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sda1(0)

Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
failed device /dev/sde1 should no longer be in volume on host-116.virt.lab.msp.redhat.com

Comment 6 Corey Marthaler 2016-02-11 23:27:58 UTC
Created attachment 1123308 [details]
log from host-116

Comment 7 Corey Marthaler 2016-02-12 17:50:09 UTC
Bug 1307111 is likely related to this one.

Comment 18 Corey Marthaler 2016-02-29 16:02:04 UTC
Closing this bug as it hasn't been seen with the latest rpms. The issue that these mirror failure test scenarios still run into is tracked as bug 1307111.