Bug 1380448

Summary: cmirror log device failure scenarios lead to deadlocks
Product: Red Hat Enterprise Linux 7 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Heinz Mauelshagen <heinzm>
lvm2 sub component: Clustered Mirror / cmirrord QA Contact: cluster-qe <cluster-qe>
Status: CLOSED WONTFIX Docs Contact:
Severity: medium    
Priority: unspecified CC: agk, heinzm, jbrassow, msnitzer, prajnoha, zkabelac
Version: 7.3   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1411900 (view as bug list) Environment:
Last Closed: 2020-02-28 21:14:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1411900    

Description Corey Marthaler 2016-09-29 15:51:26 UTC
Description of problem:
./helter_skelter -f -e kill_secondary_and_log_2_devices
[...]
================================================================================
Iteration 0.3 started at Wed Sep 28 18:38:26 CDT 2016
================================================================================
Scenario kill_secondary_and_log_2_devices: Kill secondary and log of synced 2 device alloc anywhere mirror(s)
********* Mirror hash info for this scenario *********
* names:              syncd_secondary_log_2devices_1
* sync:               1
* striped:            0
* leg devices:        /dev/mapper/mpathg1 /dev/mapper/mpathe1
* log devices:        /dev/mapper/mpathe1
* no MDA devices:     
* failpv(s):          /dev/mapper/mpathe1
* failnode(s):        mckinley-01 mckinley-02 mckinley-03 mckinley-04
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   remove
******************************************************

Creating mirror(s) on mckinley-04...
mckinley-04: lvcreate -ay --type mirror --alloc anywhere -m 1 -n syncd_secondary_log_2devices_1 -L 500M helter_skelter /dev/mapper/mpathg1:0-2400 /dev/mapper/mpathe1:0-2400
WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_log_2devices_1 at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.

Current mirror/raid device structure(s):
  LV                                        Attr       LSize   Cpy%Sync Devices                                                                              
  syncd_secondary_log_2devices_1            mwa-a-m--- 500.00m 1.60     syncd_secondary_log_2devices_1_mimage_0(0),syncd_secondary_log_2devices_1_mimage_1(0)
  [syncd_secondary_log_2devices_1_mimage_0] Iwa-aom--- 500.00m          /dev/mapper/mpathg1(0)
  [syncd_secondary_log_2devices_1_mimage_1] Iwa-aom--- 500.00m          /dev/mapper/mpathe1(0)
  [syncd_secondary_log_2devices_1_mlog]     lwa-aom---   4.00m          /dev/mapper/mpathe1(125)


Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 26.90% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on mckinley-01...
mkfs.gfs2 -o align=0 -J 32M -j 4 -p lock_dlm -t MCKINLEY:gfs1 /dev/helter_skelter/syncd_secondary_log_2devices_1 -O
Mounting mirrored gfs2 filesystems on mckinley-01...
Mounting mirrored gfs2 filesystems on mckinley-02...
Mounting mirrored gfs2 filesystems on mckinley-03...
Mounting mirrored gfs2 filesystems on mckinley-04...

PV=/dev/mapper/mpathe1
        syncd_secondary_log_2devices_1_mimage_1: 6
        syncd_secondary_log_2devices_1_mlog: 2
PV=/dev/mapper/mpathe1
        syncd_secondary_log_2devices_1_mimage_1: 6
        syncd_secondary_log_2devices_1_mlog: 2

Writing verification files (checkit) to mirror(s) on...
        ---- mckinley-01 ----
        ---- mckinley-02 ----
        ---- mckinley-03 ----
        ---- mckinley-04 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- mckinley-01 ----
        ---- mckinley-02 ----
        ---- mckinley-03 ----
        ---- mckinley-04 ----

Failing path sdf on mckinley-01
Failing path sdn on mckinley-01
Failing path sdv on mckinley-01
Failing path sdad on mckinley-01

Failing path sdg on mckinley-02
Failing path sdo on mckinley-02
Failing path sdw on mckinley-02
Failing path sdae on mckinley-02

Failing path sdg on mckinley-03
Failing path sdo on mckinley-03
Failing path sdw on mckinley-03
Failing path sdae on mckinley-03

Failing path sdg on mckinley-04
Failing path sdo on mckinley-04
Failing path sdw on mckinley-04
Failing path sdae on mckinley-04

Getting recovery check start time from /var/log/messages: Sep 28 18:41
Attempting I/O to cause mirror down conversion(s) on mckinley-01
dd if=/dev/zero of=/mnt/syncd_secondary_log_2devices_1/ddfile count=10 bs=4M

[root@mckinley-01 ~]# lvs -a -o +devices
  /dev/mapper/mpathe: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/mpathe: read failed after 0 of 4096 at 268435390464: Input/output error
  /dev/mapper/mpathe: read failed after 0 of 4096 at 268435447808: Input/output error
  /dev/mapper/mpathe: read failed after 0 of 4096 at 4096: Input/output error
  /dev/mapper/mpathe1: read failed after 0 of 2048 at 0: Input/output error
  /dev/mapper/mpathe1: read failed after 0 of 2048 at 268431851520: Input/output error
  /dev/mapper/mpathe1: read failed after 0 of 2048 at 268431970304: Input/output error
  /dev/mapper/mpathe1: read failed after 0 of 2048 at 4096: Input/output error
  Couldn't find device with uuid hkhWA7-05OK-M5DV-xJeo-DIBc-LvLw-HKSDv7.
  WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_log_2devices_1_mimage_1 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_log_2devices_1_mlog while checking used and assumed devices.
  LV                                        VG              Attr       LSize   Log                                   Cpy%Sync Devices
  syncd_secondary_log_2devices_1            helter_skelter  mwa-aom-p- 500.00m [syncd_secondary_log_2devices_1_mlog] 100.00   syncd_secondary_log_2devices_1_mimage_0(0),syncd_secondary_log_2devices_1_mimage_1(0)
  [syncd_secondary_log_2devices_1_mimage_0] helter_skelter  iwa-aom--- 500.00m                                                /dev/mapper/mpathg1(0)
  [syncd_secondary_log_2devices_1_mimage_1] helter_skelter  iwa-aom-p- 500.00m                                                [unknown](0)
  [syncd_secondary_log_2devices_1_mlog]     helter_skelter  lwa-aom-p-   4.00m                                                [unknown](125)


Sep 28 18:40:58 mckinley-01 qarshd[49969]: Running cmdline: multipathd fail path sdv
Sep 28 18:40:58 mckinley-01 multipathd: mpathe: fail path 65:80 (operator)
Sep 28 18:40:58 mckinley-01 kernel: device-mapper: multipath: Failing path 65:80.
Sep 28 18:40:59 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.224:53666).
Sep 28 18:40:59 mckinley-01 systemd: Starting qarsh Per-Connection Server (10.15.80.224:53666)...
Sep 28 18:40:59 mckinley-01 qarshd[49974]: Talking to peer ::ffff:10.15.80.224:53666 (IPv6)
Sep 28 18:40:59 mckinley-01 qarshd[49974]: Running cmdline: multipathd fail path sdad
Sep 28 18:40:59 mckinley-01 multipathd: mpathe: fail path 65:208 (operator)
Sep 28 18:40:59 mckinley-01 kernel: device-mapper: multipath: Failing path 65:208.
Sep 28 18:40:59 mckinley-01 kernel: blk_update_request: 14 callbacks suppressed
Sep 28 18:40:59 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:40:59 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:40:59 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:40:59 mckinley-01 multipathd: sdn: mark as failed
Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 2
Sep 28 18:40:59 mckinley-01 multipathd: sdad: mark as failed
Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 1
Sep 28 18:40:59 mckinley-01 multipathd: sdv: mark as failed
Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 0
Sep 28 18:41:00 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:00 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:00 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:01 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:01 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:01 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:02 mckinley-01 multipathd: sdf: unusable path
Sep 28 18:41:02 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:02 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:02 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:03 mckinley-01 multipathd: sdn: unusable path
Sep 28 18:41:03 mckinley-01 multipathd: sdv: unusable path
Sep 28 18:41:03 mckinley-01 multipathd: sdad: unusable path
Sep 28 18:41:04 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:04 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:04 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:05 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:05 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:05 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:06 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:06 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
Sep 28 18:41:06 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599
Sep 28 18:41:07 mckinley-01 multipathd: sdf: unusable path
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524287872
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524287984
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 8
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 63
Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524281023
Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:0:4: Asymmetric access state changed
Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:2:4: Asymmetric access state changed
Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:1:4: Asymmetric access state changed
Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:3:4: Asymmetric access state changed
Sep 28 18:41:08 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.224:53796).
Sep 28 18:41:08 mckinley-01 systemd: Starting qarsh Per-Connection Server (10.15.80.224:53796)...
Sep 28 18:41:08 mckinley-01 qarshd[50012]: Talking to peer ::ffff:10.15.80.224:53796 (IPv6)
Sep 28 18:41:08 mckinley-01 multipathd: sdn: unusable path
Sep 28 18:41:08 mckinley-01 multipathd: sdv: unusable path
Sep 28 18:41:08 mckinley-01 multipathd: sdad: unusable path
Sep 28 18:41:08 mckinley-01 qarshd[50012]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_secondary_log_2devices_1/ddfile count=10 bs=4M
Sep 28 18:41:08 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log:  write failure: Input/output error
Sep 28 18:41:08 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log
[...]
Sep 28 18:43:29 mckinley-01 kernel: INFO: task kworker/u194:1:49875 blocked for more than 120 seconds.
Sep 28 18:43:29 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 28 18:43:29 mckinley-01 kernel: kworker/u194:1  D ffff881ff665b9b8     0 49875      2 0x00000080
Sep 28 18:43:29 mckinley-01 kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:22)
Sep 28 18:43:30 mckinley-01 kernel: ffff881ff665b850 0000000000000046 ffff881ffe6fde20 ffff881ff665bfd8
Sep 28 18:43:30 mckinley-01 kernel: ffff881ff665bfd8 ffff881ff665bfd8 ffff881ffe6fde20 ffff881ffead6c40
Sep 28 18:43:30 mckinley-01 kernel: 0000000000000000 7fffffffffffffff ffffffff81688ef0 ffff881ff665b9b8
Sep 28 18:43:30 mckinley-01 kernel: Call Trace:
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688ef0>] ? bit_wait+0x50/0x50
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168ae79>] schedule+0x29/0x70
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff816888c9>] schedule_timeout+0x239/0x2d0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688ef0>] ? bit_wait+0x50/0x50
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168a41e>] io_schedule_timeout+0xae/0x130
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168a4b8>] io_schedule+0x18/0x20
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688f01>] bit_wait_io+0x11/0x50
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688c1f>] __wait_on_bit_lock+0x5f/0xc0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8117fd14>] __lock_page+0x74/0x90
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b1590>] ? wake_bit_function+0x40/0x40
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118be80>] write_cache_pages+0x420/0x4d0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8123ea20>] ? clean_buffers+0x60/0x60
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac80>] ? gfs2_writepages+0x20/0x20 [gfs2]
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8123e959>] mpage_writepages+0x69/0xd0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac80>] ? gfs2_writepages+0x20/0x20 [gfs2]
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac75>] gfs2_writepages+0x15/0x20 [gfs2]
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118d02e>] do_writepages+0x1e/0x40
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff812283e0>] __writeback_single_inode+0x40/0x210
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff812290ce>] writeback_sb_inodes+0x25e/0x420
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8122932f>] __writeback_inodes_wb+0x9f/0xd0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81229b73>] wb_writeback+0x263/0x2f0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118c6d0>] ? bdi_dirty_limit+0x40/0xe0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8122ba6c>] bdi_writeback_workfn+0x1cc/0x460
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a7e0b>] process_one_work+0x17b/0x470
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a8c46>] worker_thread+0x126/0x410
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a8b20>] ? rescuer_thread+0x460/0x460
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b03ff>] kthread+0xcf/0xe0
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b0330>] ? kthread_create_on_node+0x140/0x140
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81695d18>] ret_from_fork+0x58/0x90
Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b0330>] ? kthread_create_on_node+0x140/0x140






Version-Release number of selected component (if applicable):
3.10.0-505.el7.x86_64

lvm2-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-libs-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
lvm2-cluster-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-event-libs-1.02.135-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.166-1.el7    BUILT: Wed Sep 28 02:26:52 CDT 2016

Comment 2 Corey Marthaler 2016-09-29 20:54:23 UTC
This is reproducible. Another similar scenario:  kill_secondary_and_log_synced_4_legs

Comment 5 Corey Marthaler 2017-07-06 21:55:08 UTC
Even the basic log only failure scenario can cause this.

================================================================================
Iteration 0.1 started at Thu Jul  6 16:37:21 CDT 2017
================================================================================
 ACTUAL LEG ORDER: /dev/sda1 /dev/sdb1 /dev/sdc1
 Scenario kill_random_devices: Kill random devices (log and legs)
 ********* Mirror info for this scenario *********
 * mirrors:            mirror_1
 * leg devices:        /dev/sda1 /dev/sdb1 /dev/sdc1
 * log devices:        /dev/sdd1
 * failpv(s):          /dev/sdd1
 * failnode(s):        host-117 host-121 host-122
 * lvmetad:            0
 * leg fault policy:   allocate
 * log fault policy:   allocate
 *************************************************
 
 Current mirror/raid device structure(s):
   LV                  Attr       LSize   Cpy%Sync Devices
    mirror_1            mwi-aom---   2.00g 100.00   mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
    [mirror_1_mimage_0] iwi-aom---   2.00g          /dev/sda1(0)
    [mirror_1_mimage_1] iwi-aom---   2.00g          /dev/sdb1(0)
    [mirror_1_mimage_2] iwi-aom---   2.00g          /dev/sdc1(0)
    [mirror_1_mlog]     lwi-aom---   4.00m          /dev/sdd1(0)
 
 PV=/dev/sdd1
        mirror_1_mlog: 1.2
 PV=/dev/sdd1
        mirror_1_mlog: 1.2
 
 Writing verification files (checkit) to mirror(s) on...
        ---- host-117 ----
        ---- host-121 ----
        ---- host-122 ----
 
 <start name="host-117_mirror_1"  pid="434" time="Thu Jul  6 16:37:30 2017 -0500" type="cmd" />
 <start name="host-121_mirror_1"  pid="435" time="Thu Jul  6 16:37:30 2017 -0500" type="cmd" />
 <start name="host-122_mirror_1"  pid="437" time="Thu Jul  6 16:37:30 2017 -0500" type="cmd" />

 Sleeping 15 seconds to get some outsanding I/O locks before the failure 
 Verifying files (checkit) on mirror(s) on...
        ---- host-117 ----
        ---- host-121 ----
        ---- host-122 ----

 
 Disabling device sdd on host-117rescan device...
 Disabling device sdd on host-121rescan device...
 Disabling device sdd on host-122rescan device...

 [DEADLOCK]



3.10.0-689.el7.x86_64
lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017