RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1304045 - type mirror no longer able to survive device failure
Summary: type mirror no longer able to survive device failure
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.8
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-02-02 18:03 UTC by Corey Marthaler
Modified: 2016-02-29 16:02 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-29 16:02:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
log from host-115 (38.31 KB, text/plain)
2016-02-11 23:22 UTC, Corey Marthaler
no flags Details
log from host-116 (166.79 KB, text/plain)
2016-02-11 23:27 UTC, Corey Marthaler
no flags Details

Description Corey Marthaler 2016-02-02 18:03:46 UTC
Description of problem:
This is the most basic cmirror device failure case and it causes everything to go south.


./helter_skelter -l /usr/tests/sts-rhel6.8 -r /usr/tests/sts-rhel6.8 -R ../../../resource-STSRHTS3987.xml -i 1 -e kill_primary_synced_2_legs                                                                                                       
8 disk(s) to be used:                                                                                                                                                                                                                              
        host-123=/dev/sdc /dev/sdh /dev/sdg /dev/sdf /dev/sda /dev/sde /dev/sdd /dev/sdb                                                                                                                                                                      
        host-124=/dev/sdb /dev/sda /dev/sdh /dev/sdd /dev/sdg /dev/sde /dev/sdf /dev/sdc                                                                                                                                                                      
        host-125=/dev/sda /dev/sde /dev/sdh /dev/sdf /dev/sdc /dev/sdg /dev/sdb /dev/sdd                                                                                                                                                                           

on host-123...
dicing /dev/sdc into 1... 
dicing /dev/sdh into 1... 
dicing /dev/sdg into 1... 
dicing /dev/sdf into 1... 
dicing /dev/sda into 1... 
dicing /dev/sde into 1... 
dicing /dev/sdd into 1... 
dicing /dev/sdb into 1... 
re-reading disks on host-125...
re-reading disks on host-123...
re-reading disks on host-124...
Zeroing out the new partitions.../dev/sdc1.../dev/sdh1.../dev/sdg1.../dev/sdf1.../dev/sda1.../dev/sde1.../dev/sdd1.../dev/sdb1...
Reload the device cache on all nodes...

creating lvm devices...
Create 7 PV(s) for helter_skelter on host-123
Create VG helter_skelter on host-123
================================================================================
Iteration 0.1 started at Tue Feb  2 11:48:34 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdh1 /dev/sdf1
* log devices:        /dev/sdb1
* no MDA devices:     
* failpv(s):          /dev/sdh1
* failnode(s):        host-123 host-124 host-125
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on host-123...
host-123: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdh1:0-2400 /dev/sdf1:0-2400 /dev/sdb1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices                                                            
   syncd_primary_2legs_1            mwi-a-m--- 500.00m 5.60     syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
   [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdh1(0)                                                       
   [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdf1(0)                                                       
   [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdb1(0)                                                       

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 74.80% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on host-123...
Mounting mirrored gfs2 filesystems on host-123...
Mounting mirrored gfs2 filesystems on host-124...
Mounting mirrored gfs2 filesystems on host-125...

PV=/dev/sdh1
        syncd_primary_2legs_1_mimage_0: 6
PV=/dev/sdh1

        syncd_primary_2legs_1_mimage_0: 6
Writing verification files (checkit) to mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
        ---- host-125 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
        ---- host-125 ----

Disabling device sdh on host-123
Disabling device sda on host-124
Disabling device sde on host-125

Getting recovery check start time from /var/log/messages: Feb  2 11:50
Attempting I/O to cause mirror down conversion(s) on host-123
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.272817 s, 154 MB/s
Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
  LV                    Attr       LSize   Cpy%Sync Devices        
   syncd_primary_2legs_1 -wi-ao---- 500.00m          /dev/sdf1(0)   


Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
olog: 4
Verifying LOG device(s) /dev/sdb1 are *NOT* in the mirror/linear(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
Verifying LEG device /dev/sdf1 *IS* in the volume(s)
  Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
verify the dm devices associated with /dev/sdh1 have been removed as expected
Checking REMOVAL of syncd_primary_2legs_1_mimage_0 on:  host-123 host-124 host-125

Verifying files (checkit) on mirror(s) on...
        ---- host-123 ----
        ---- host-124 ----
Can not open ufyybfhhhrcmxqbbiuexuqrwvl: Input/output error
<fail name="host-124_syncd_primary_2legs_1" pid="24757" time="Tue Feb  2 11:50:23 2016" type="cmd" duration="55" ec="1" />
ALL STOP!
checkit write verify failed

Feb  2 11:49:23 host-124 qarshd[3931]: Running cmdline: mount /dev/helter_skelter/syncd_primary_2legs_1 /mnt/syncd_primary_2legs_1
Feb  2 11:49:23 host-124 kernel: Slow work thread pool: Starting up
Feb  2 11:49:23 host-124 kernel: Slow work thread pool: Ready
Feb  2 11:49:23 host-124 kernel: GFS2 (built Jan 28 2016 16:50:00) installed
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS3987:gfs1"
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: Joined cluster. Now mounting FS...
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1, already locked for use
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1: Looking at journal...
Feb  2 11:49:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: jid=1: Done
Feb  2 11:49:26 host-124 qarshd[3952]: Running cmdline: mkdir -p /mnt/syncd_primary_2legs_1/checkit
Feb  2 11:49:27 host-124 qarshd[3954]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -n 100
Feb  2 11:49:28 host-124 qarshd[3956]: Running cmdline: /usr/tests/sts-rhel6.8/bin/xiogen -f buffered -m random -s write,writev -t 100 -T 1000000  -F 1000000:/mnt/syncd_primary_2legs_1/host-124_load | /usr/tests/sts-rhel6.8/bin/xdoio -v
Feb  2 11:49:51 host-124 qarshd[3962]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -v
Feb  2 11:49:58 host-124 qarshd[3964]: Running cmdline: echo offline > /sys/block/sda/device/state
Feb  2 11:49:59 host-124 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb  2 11:49:59 host-124 lvm[2846]: Primary mirror device 253:3 has failed (D).
Feb  2 11:49:59 host-124 lvm[2846]: Device failure in helter_skelter-syncd_primary_2legs_1.
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 16104947712: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 16105054208: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 0: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: /dev/sda1: read failed after 0 of 512 at 4096: Input/output error
Feb  2 11:49:59 host-124 lvm[2846]: Couldn't find device with uuid Vwn1jQ-ndo2-YhuJ-VaqX-bllI-a25v-oR2hbt.
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25436
Feb  2 11:50:00 host-124 kernel: lost page write due to I/O error on dm-5
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25437
Feb  2 11:50:00 host-124 kernel: lost page write due to I/O error on dm-5
Feb  2 11:50:00 host-124 kernel: Buffer I/O error on device dm-5, logical block 25523
Feb  2 11:50:01 host-124 lvm[2846]: Mirror status: 1 of 2 images failed.
Feb  2 11:50:12 host-124 qarshd[4017]: Running cmdline: dmsetup ls | grep syncd_primary_2legs_1_mimage_0
Feb  2 11:50:21 host-124 qarshd[4021]: Running cmdline: /usr/tests/sts-rhel6.8/bin/checkit -w /mnt/syncd_primary_2legs_1/checkit -f /tmp/checkit_syncd_primary_2legs_1 -v
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: fatal: I/O error
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1:   block = 32251
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1:   function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 199
Feb  2 11:50:23 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: about to withdraw this file system
Feb  2 11:50:24 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: telling LM to unmount
Feb  2 11:50:24 host-124 kernel: GFS2: fsid=STSRHTS3987:gfs1.1: withdrawn
Feb  2 11:50:24 host-124 kernel: Pid: 3948, comm: gfs2_logd Not tainted 2.6.32-610.el6.x86_64 #1
Feb  2 11:50:24 host-124 kernel: Call Trace:
Feb  2 11:50:24 host-124 kernel: [<ffffffffa04b0fa8>] ? gfs2_lm_withdraw+0x128/0x160 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff81065990>] ? __dequeue_entity+0x30/0x50
Feb  2 11:50:24 host-124 kernel: [<ffffffffa04b1020>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff8108f260>] ? process_timeout+0x0/0x10
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0497a2c>] ? gfs2_ail1_empty+0x11c/0x1b0 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0498f2c>] ? gfs2_logd+0x7c/0x140 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffffa0498eb0>] ? gfs2_logd+0x0/0x140 [gfs2]
Feb  2 11:50:24 host-124 kernel: [<ffffffff810a662e>] ? kthread+0x9e/0xc0
Feb  2 11:50:24 host-124 kernel: [<ffffffffa045eff0>] ? process_send_sockets+0x0/0x260 [dlm]
Feb  2 11:50:24 host-124 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Feb  2 11:50:24 host-124 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Feb  2 11:50:24 host-124 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Feb  2 11:52:35 host-124 kernel: INFO: task gfs2_logd:3948 blocked for more than 120 seconds.
Feb  2 11:52:35 host-124 kernel:      Not tainted 2.6.32-610.el6.x86_64 #1
Feb  2 11:52:35 host-124 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  2 11:52:35 host-124 kernel: gfs2_logd     D 0000000000000000     0  3948      2 0x00000080
Feb  2 11:52:35 host-124 kernel: ffff88003c777c10 0000000000000046 ffff880000000000 0000abd9b1dfc40f
Feb  2 11:52:35 host-124 kernel: ffff88003c777bd0 ffff88003c78d2f0 000000000003c33e ffffffffa94f14fd
Feb  2 11:52:35 host-124 kernel: 00000000207aa707 ffffffff81aa8340 ffff880037af5ad8 ffff88003c777fd8
Feb  2 11:52:35 host-124 kernel: Call Trace:
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff81545853>] io_schedule+0x73/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e2dd>] sync_page+0x3d/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff8154633f>] __wait_on_bit+0x5f/0x90
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e513>] wait_on_page_bit+0x73/0x80
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a6b40>] ? wake_bit_function+0x0/0x50
Feb  2 11:52:35 host-124 kernel: [<ffffffff81144635>] ? pagevec_lookup_tag+0x25/0x40
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e93b>] wait_on_page_writeback_range+0xfb/0x190
Feb  2 11:52:35 host-124 kernel: [<ffffffff8112e9ff>] filemap_fdatawait+0x2f/0x40
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498b09>] gfs2_log_flush+0x259/0x600 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498f89>] gfs2_logd+0xd9/0x140 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffffa0498eb0>] ? gfs2_logd+0x0/0x140 [gfs2]
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a662e>] kthread+0x9e/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffffa045eff0>] ? process_send_sockets+0x0/0x260 [dlm]
Feb  2 11:52:35 host-124 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Feb  2 11:52:35 host-124 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Feb  2 11:52:35 host-124 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20



[root@host-124 ~]# lvs
[...]
stat("/dev/helter_skelter/syncd_primary_2legs_1", {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 5), ...}) = 0
open("/dev/helter_skelter/syncd_primary_2legs_1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 5), ...}) = 0
ioctl(5, BLKBSZGET, 4096)               = 0
ioctl(5, BLKPBSZGET, 512)               = 0
lseek(5, 0, SEEK_SET)                   = 0
read(5, 


Version-Release number of selected component (if applicable):
2.6.32-610.el6.x86_64

lvm2-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-libs-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-cluster-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
udev-147-2.69.el6    BUILT: Thu Jan 28 08:41:45 CST 2016
device-mapper-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-libs-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-libs-1.02.114-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-persistent-data-0.6.0-2.el6    BUILT: Thu Jan 21 02:40:25 CST 2016
cmirror-2.02.140-3.el6    BUILT: Thu Jan 21 05:40:10 CST 2016


How reproducible:
Everytime

Comment 3 Corey Marthaler 2016-02-11 23:16:07 UTC
This isn't just a cluster mirror issue. It appears many single machine mirror failure cases are no longer passing in 6.8. 

I ran this same scenario on both 6.7 and 7.3 and both passed 10/10 iterations. However on 6.8 it fails pretty quickly.

2.6.32-615.el6.x86_64
lvm2-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.66.el6    BUILT: Mon Jan 18 02:42:20 CST 2016
device-mapper-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.1-1.el6    BUILT: Wed Feb 10 05:09:45 CST 2016
cmirror-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016


Iteration 4.1 started at Thu Feb 11 15:08:55 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdg1 /dev/sdb1
* log devices:        /dev/sde1
* no MDA devices:     
* failpv(s):          /dev/sdg1
* failnode(s):        host-115.virt.lab.msp.redhat.com
* lvmetad:            1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on host-115.virt.lab.msp.redhat.com...
host-115.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdg1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-150
host-115.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sdg1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-a-m--- 500.00m 15.20    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdg1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sde1(0)
  syncd_primary_2legs_2            mwi-a-m--- 500.00m 0.80     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] Iwi-aom--- 500.00m          /dev/sdg1(125)
  [syncd_primary_2legs_2_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sde1(1)

Waiting until all mirror|raid volumes become fully syncd...
   1/2 mirror(s) are fully synced: ( 100.00% 94.70% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on host-115.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-115.virt.lab.msp.redhat.com...

PV=/dev/sdg1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sdg1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1

Writing verification files (checkit) to mirror(s) on...
        ---- host-115.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-115.virt.lab.msp.redhat.com ----

Disabling device sdg on host-115.virt.lab.msp.redhat.comrescan device...
  /dev/sdg1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error

Getting recovery check start time from /var/log/messages: Feb 11 15:10
Attempting I/O to cause mirror down conversion(s) on host-115.virt.lab.msp.redhat.com
dd if=/dev/zero of=/mnt/syncd_primary_2legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.225524 s, 186 MB/s
dd if=/dev/zero of=/mnt/syncd_primary_2legs_2/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.337796 s, 124 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  LV                               Attr       LSize   Cpy%Sync Devices                                                            
   syncd_primary_2legs_1            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
   [syncd_primary_2legs_1_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(0)                                                       
   [syncd_primary_2legs_1_mimage_1] iwi-aom--- 500.00m          /dev/sdc1(0)                                                       
   [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdf1(0)                                                       
   syncd_primary_2legs_2            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
   [syncd_primary_2legs_2_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(125)                                                     
   [syncd_primary_2legs_2_mimage_1] iwi-aom--- 500.00m          /dev/sdc1(125)                                                     


Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
olog: 1
Verifying LOG device(s) /dev/sde1 *ARE* in the mirror(s)
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
log device /dev/sde1 should still be present on host-115.virt.lab.msp.redhat.com

seaching for *any* log after this failure...
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
NO LOG WAS FOUND for syncd_primary_2legs_2 on host-115.virt.lab.msp.redhat.com


[root@host-115 ~]# lvs -a -o +devices
  /dev/sdg1: open failed: No such device or address
  Device /dev/sdg1 has size of 0 sectors which is smaller than corresponding PV size of 31455207 sectors. Was device resized?
  One or more devices used as PVs in VG helter_skelter have changed sizes.
  LV                               VG             Attr       LSize   Log                          Cpy%Sync Devices
  syncd_primary_2legs_1            helter_skelter mwi-aom--- 500.00m [syncd_primary_2legs_1_mlog] 100.00   syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] helter_skelter iwi-aom--- 500.00m                                       /dev/sdb1(0)
  [syncd_primary_2legs_1_mimage_1] helter_skelter iwi-aom--- 500.00m                                       /dev/sdc1(0)
  [syncd_primary_2legs_1_mlog]     helter_skelter lwi-aom---   4.00m                                       /dev/sdf1(0)
  syncd_primary_2legs_2            helter_skelter mwi-aom--- 500.00m                              100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] helter_skelter iwi-aom--- 500.00m                                       /dev/sdb1(125)
  [syncd_primary_2legs_2_mimage_1] helter_skelter iwi-aom--- 500.00m                                       /dev/sdc1(125)

Comment 4 Corey Marthaler 2016-02-11 23:22:21 UTC
Created attachment 1123307 [details]
log from host-115

Comment 5 Corey Marthaler 2016-02-11 23:25:34 UTC
Another 6.8 failure, same test case:

================================================================================
Iteration 1.1 started at Thu Feb 11 16:46:54 CST 2016
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sde1 /dev/sdb1
* log devices:        /dev/sdc1
* no MDA devices:     
* failpv(s):          /dev/sde1
* failnode(s):        host-116.virt.lab.msp.redhat.com
* lvmetad:            1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on host-116.virt.lab.msp.redhat.com...
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sde1:0-2400 /dev/sdb1:0-2400 /dev/sdc1:0-150
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sde1:0-2400 /dev/sdb1:0-2400 /dev/sdc1:0-150

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-a-m--- 500.00m 52.80    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sdc1(0)
  syncd_primary_2legs_2            mwi-a-m--- 500.00m 5.60     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(125)
  [syncd_primary_2legs_2_mimage_1] Iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sdc1(1)

Waiting until all mirror|raid volumes become fully syncd...
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on host-116.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-116.virt.lab.msp.redhat.com...

PV=/dev/sde1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sde1
        syncd_primary_2legs_1_mimage_0: 5.1
        syncd_primary_2legs_2_mimage_0: 5.1
Writing verification files (checkit) to mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-116.virt.lab.msp.redhat.com ----

Disabling device sde on host-116.virt.lab.msp.redhat.comrescan device...
<fail name="host-116.virt.lab.msp.redhat.com_syncd_primary_2legs_2" pid="25029" time="Thu Feb 11 16:48:05 2016" type="cmd" duration="30" ec="1" />
ALL STOP!
  /dev/sde1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sde1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error

<stop name="host-116.virt.lab.msp.redhat.com_syncd_primary_2legs_1" pid="25028" time="Thu Feb 11 16:48:08 2016" type="cmd" duration="33" signal="2" />
Getting recovery check start time from /var/log/messages: Feb 11 16:48
Attempting I/O to cause mirror down conversion(s) on host-116.virt.lab.msp.redhat.com
dd if=/dev/zero of=/mnt/syncd_primary_2legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.127945 s, 328 MB/s

dd if=/dev/zero of=/mnt/syncd_primary_2legs_2/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.166799 s, 251 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  LV                               Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1            mwi-aom--- 500.00m 0.00     syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdb1(0)
  [syncd_primary_2legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sde1(0)
  [syncd_primary_2legs_1_mlog]     lwi-aom---   4.00m          /dev/sda1(1)
  syncd_primary_2legs_2            mwi-aom--- 500.00m 100.00   syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0] iwi-aom--- 500.00m          /dev/sdb1(125)
  [syncd_primary_2legs_2_mimage_1] iwi-aom--- 500.00m          /dev/sdd1(0)
  [syncd_primary_2legs_2_mlog]     lwi-aom---   4.00m          /dev/sda1(0)

Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
failed device /dev/sde1 should no longer be in volume on host-116.virt.lab.msp.redhat.com

Comment 6 Corey Marthaler 2016-02-11 23:27:58 UTC
Created attachment 1123308 [details]
log from host-116

Comment 7 Corey Marthaler 2016-02-12 17:50:09 UTC
Bug 1307111 is likely related to this one.

Comment 18 Corey Marthaler 2016-02-29 16:02:04 UTC
Closing this bug as it hasn't been seen with the latest rpms. The issue that mirror failure test scenarios lead to is bug 1307111.


Note You need to log in before you can comment on or make changes to this bug.