| Summary: | cmirror log device failure scenarios lead to deadlocks | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> | |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> | |
| lvm2 sub component: | Clustered Mirror / cmirrord | QA Contact: | cluster-qe <cluster-qe> | |
| Status: | CLOSED WONTFIX | Docs Contact: | ||
| Severity: | medium | |||
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, zkabelac | |
| Version: | 7.3 | |||
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1411900 (view as bug list) | Environment: | ||
| Last Closed: | 2020-02-28 21:14:45 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1411900 | |||
This is reproducible. Another similar scenario: kill_secondary_and_log_synced_4_legs Even the basic log only failure scenario can cause this.
================================================================================
Iteration 0.1 started at Thu Jul 6 16:37:21 CDT 2017
================================================================================
ACTUAL LEG ORDER: /dev/sda1 /dev/sdb1 /dev/sdc1
Scenario kill_random_devices: Kill random devices (log and legs)
********* Mirror info for this scenario *********
* mirrors: mirror_1
* leg devices: /dev/sda1 /dev/sdb1 /dev/sdc1
* log devices: /dev/sdd1
* failpv(s): /dev/sdd1
* failnode(s): host-117 host-121 host-122
* lvmetad: 0
* leg fault policy: allocate
* log fault policy: allocate
*************************************************
Current mirror/raid device structure(s):
LV Attr LSize Cpy%Sync Devices
mirror_1 mwi-aom--- 2.00g 100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
[mirror_1_mimage_0] iwi-aom--- 2.00g /dev/sda1(0)
[mirror_1_mimage_1] iwi-aom--- 2.00g /dev/sdb1(0)
[mirror_1_mimage_2] iwi-aom--- 2.00g /dev/sdc1(0)
[mirror_1_mlog] lwi-aom--- 4.00m /dev/sdd1(0)
PV=/dev/sdd1
mirror_1_mlog: 1.2
PV=/dev/sdd1
mirror_1_mlog: 1.2
Writing verification files (checkit) to mirror(s) on...
---- host-117 ----
---- host-121 ----
---- host-122 ----
<start name="host-117_mirror_1" pid="434" time="Thu Jul 6 16:37:30 2017 -0500" type="cmd" />
<start name="host-121_mirror_1" pid="435" time="Thu Jul 6 16:37:30 2017 -0500" type="cmd" />
<start name="host-122_mirror_1" pid="437" time="Thu Jul 6 16:37:30 2017 -0500" type="cmd" />
Sleeping 15 seconds to get some outsanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- host-117 ----
---- host-121 ----
---- host-122 ----
Disabling device sdd on host-117rescan device...
Disabling device sdd on host-121rescan device...
Disabling device sdd on host-122rescan device...
[DEADLOCK]
3.10.0-689.el7.x86_64
lvm2-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7 BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017
|
Description of problem: ./helter_skelter -f -e kill_secondary_and_log_2_devices [...] ================================================================================ Iteration 0.3 started at Wed Sep 28 18:38:26 CDT 2016 ================================================================================ Scenario kill_secondary_and_log_2_devices: Kill secondary and log of synced 2 device alloc anywhere mirror(s) ********* Mirror hash info for this scenario ********* * names: syncd_secondary_log_2devices_1 * sync: 1 * striped: 0 * leg devices: /dev/mapper/mpathg1 /dev/mapper/mpathe1 * log devices: /dev/mapper/mpathe1 * no MDA devices: * failpv(s): /dev/mapper/mpathe1 * failnode(s): mckinley-01 mckinley-02 mckinley-03 mckinley-04 * lvmetad: 0 * leg fault policy: remove * log fault policy: remove ****************************************************** Creating mirror(s) on mckinley-04... mckinley-04: lvcreate -ay --type mirror --alloc anywhere -m 1 -n syncd_secondary_log_2devices_1 -L 500M helter_skelter /dev/mapper/mpathg1:0-2400 /dev/mapper/mpathe1:0-2400 WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_log_2devices_1 at offset 65536. Wipe it? [y/n]: [n] Aborted wiping of gfs2. 1 existing signature left on the device. Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_secondary_log_2devices_1 mwa-a-m--- 500.00m 1.60 syncd_secondary_log_2devices_1_mimage_0(0),syncd_secondary_log_2devices_1_mimage_1(0) [syncd_secondary_log_2devices_1_mimage_0] Iwa-aom--- 500.00m /dev/mapper/mpathg1(0) [syncd_secondary_log_2devices_1_mimage_1] Iwa-aom--- 500.00m /dev/mapper/mpathe1(0) [syncd_secondary_log_2devices_1_mlog] lwa-aom--- 4.00m /dev/mapper/mpathe1(125) Waiting until all mirror|raid volumes become fully syncd... 0/1 mirror(s) are fully synced: ( 26.90% ) 1/1 mirror(s) are fully synced: ( 100.00% ) Creating gfs2 on top of mirror(s) on mckinley-01... mkfs.gfs2 -o align=0 -J 32M -j 4 -p lock_dlm -t MCKINLEY:gfs1 /dev/helter_skelter/syncd_secondary_log_2devices_1 -O Mounting mirrored gfs2 filesystems on mckinley-01... Mounting mirrored gfs2 filesystems on mckinley-02... Mounting mirrored gfs2 filesystems on mckinley-03... Mounting mirrored gfs2 filesystems on mckinley-04... PV=/dev/mapper/mpathe1 syncd_secondary_log_2devices_1_mimage_1: 6 syncd_secondary_log_2devices_1_mlog: 2 PV=/dev/mapper/mpathe1 syncd_secondary_log_2devices_1_mimage_1: 6 syncd_secondary_log_2devices_1_mlog: 2 Writing verification files (checkit) to mirror(s) on... ---- mckinley-01 ---- ---- mckinley-02 ---- ---- mckinley-03 ---- ---- mckinley-04 ---- Sleeping 15 seconds to get some outsanding I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- mckinley-01 ---- ---- mckinley-02 ---- ---- mckinley-03 ---- ---- mckinley-04 ---- Failing path sdf on mckinley-01 Failing path sdn on mckinley-01 Failing path sdv on mckinley-01 Failing path sdad on mckinley-01 Failing path sdg on mckinley-02 Failing path sdo on mckinley-02 Failing path sdw on mckinley-02 Failing path sdae on mckinley-02 Failing path sdg on mckinley-03 Failing path sdo on mckinley-03 Failing path sdw on mckinley-03 Failing path sdae on mckinley-03 Failing path sdg on mckinley-04 Failing path sdo on mckinley-04 Failing path sdw on mckinley-04 Failing path sdae on mckinley-04 Getting recovery check start time from /var/log/messages: Sep 28 18:41 Attempting I/O to cause mirror down conversion(s) on mckinley-01 dd if=/dev/zero of=/mnt/syncd_secondary_log_2devices_1/ddfile count=10 bs=4M [root@mckinley-01 ~]# lvs -a -o +devices /dev/mapper/mpathe: read failed after 0 of 4096 at 0: Input/output error /dev/mapper/mpathe: read failed after 0 of 4096 at 268435390464: Input/output error /dev/mapper/mpathe: read failed after 0 of 4096 at 268435447808: Input/output error /dev/mapper/mpathe: read failed after 0 of 4096 at 4096: Input/output error /dev/mapper/mpathe1: read failed after 0 of 2048 at 0: Input/output error /dev/mapper/mpathe1: read failed after 0 of 2048 at 268431851520: Input/output error /dev/mapper/mpathe1: read failed after 0 of 2048 at 268431970304: Input/output error /dev/mapper/mpathe1: read failed after 0 of 2048 at 4096: Input/output error Couldn't find device with uuid hkhWA7-05OK-M5DV-xJeo-DIBc-LvLw-HKSDv7. WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_log_2devices_1_mimage_1 while checking used and assumed devices. WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_log_2devices_1_mlog while checking used and assumed devices. LV VG Attr LSize Log Cpy%Sync Devices syncd_secondary_log_2devices_1 helter_skelter mwa-aom-p- 500.00m [syncd_secondary_log_2devices_1_mlog] 100.00 syncd_secondary_log_2devices_1_mimage_0(0),syncd_secondary_log_2devices_1_mimage_1(0) [syncd_secondary_log_2devices_1_mimage_0] helter_skelter iwa-aom--- 500.00m /dev/mapper/mpathg1(0) [syncd_secondary_log_2devices_1_mimage_1] helter_skelter iwa-aom-p- 500.00m [unknown](0) [syncd_secondary_log_2devices_1_mlog] helter_skelter lwa-aom-p- 4.00m [unknown](125) Sep 28 18:40:58 mckinley-01 qarshd[49969]: Running cmdline: multipathd fail path sdv Sep 28 18:40:58 mckinley-01 multipathd: mpathe: fail path 65:80 (operator) Sep 28 18:40:58 mckinley-01 kernel: device-mapper: multipath: Failing path 65:80. Sep 28 18:40:59 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.224:53666). Sep 28 18:40:59 mckinley-01 systemd: Starting qarsh Per-Connection Server (10.15.80.224:53666)... Sep 28 18:40:59 mckinley-01 qarshd[49974]: Talking to peer ::ffff:10.15.80.224:53666 (IPv6) Sep 28 18:40:59 mckinley-01 qarshd[49974]: Running cmdline: multipathd fail path sdad Sep 28 18:40:59 mckinley-01 multipathd: mpathe: fail path 65:208 (operator) Sep 28 18:40:59 mckinley-01 kernel: device-mapper: multipath: Failing path 65:208. Sep 28 18:40:59 mckinley-01 kernel: blk_update_request: 14 callbacks suppressed Sep 28 18:40:59 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:40:59 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:40:59 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:40:59 mckinley-01 multipathd: sdn: mark as failed Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 2 Sep 28 18:40:59 mckinley-01 multipathd: sdad: mark as failed Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 1 Sep 28 18:40:59 mckinley-01 multipathd: sdv: mark as failed Sep 28 18:40:59 mckinley-01 multipathd: mpathe: remaining active paths: 0 Sep 28 18:41:00 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:00 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:00 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:01 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:01 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:01 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:02 mckinley-01 multipathd: sdf: unusable path Sep 28 18:41:02 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:02 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:02 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:03 mckinley-01 multipathd: sdn: unusable path Sep 28 18:41:03 mckinley-01 multipathd: sdv: unusable path Sep 28 18:41:03 mckinley-01 multipathd: sdad: unusable path Sep 28 18:41:04 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:04 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:04 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:05 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:05 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:05 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:06 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:06 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log Sep 28 18:41:06 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 1089599 Sep 28 18:41:07 mckinley-01 multipathd: sdf: unusable path Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524287872 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524287984 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 8 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 0 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 63 Sep 28 18:41:07 mckinley-01 kernel: blk_update_request: I/O error, dev dm-11, sector 524281023 Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:0:4: Asymmetric access state changed Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:2:4: Asymmetric access state changed Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:1:4: Asymmetric access state changed Sep 28 18:41:08 mckinley-01 kernel: sd 7:0:3:4: Asymmetric access state changed Sep 28 18:41:08 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.224:53796). Sep 28 18:41:08 mckinley-01 systemd: Starting qarsh Per-Connection Server (10.15.80.224:53796)... Sep 28 18:41:08 mckinley-01 qarshd[50012]: Talking to peer ::ffff:10.15.80.224:53796 (IPv6) Sep 28 18:41:08 mckinley-01 multipathd: sdn: unusable path Sep 28 18:41:08 mckinley-01 multipathd: sdv: unusable path Sep 28 18:41:08 mckinley-01 multipathd: sdad: unusable path Sep 28 18:41:08 mckinley-01 qarshd[50012]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_secondary_log_2devices_1/ddfile count=10 bs=4M Sep 28 18:41:08 mckinley-01 cmirrord[11251]: [LbMWHQ6e] rw_log: write failure: Input/output error Sep 28 18:41:08 mckinley-01 cmirrord[11251]: [LbMWHQ6e] Error writing to disk log [...] Sep 28 18:43:29 mckinley-01 kernel: INFO: task kworker/u194:1:49875 blocked for more than 120 seconds. Sep 28 18:43:29 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 28 18:43:29 mckinley-01 kernel: kworker/u194:1 D ffff881ff665b9b8 0 49875 2 0x00000080 Sep 28 18:43:29 mckinley-01 kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:22) Sep 28 18:43:30 mckinley-01 kernel: ffff881ff665b850 0000000000000046 ffff881ffe6fde20 ffff881ff665bfd8 Sep 28 18:43:30 mckinley-01 kernel: ffff881ff665bfd8 ffff881ff665bfd8 ffff881ffe6fde20 ffff881ffead6c40 Sep 28 18:43:30 mckinley-01 kernel: 0000000000000000 7fffffffffffffff ffffffff81688ef0 ffff881ff665b9b8 Sep 28 18:43:30 mckinley-01 kernel: Call Trace: Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688ef0>] ? bit_wait+0x50/0x50 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168ae79>] schedule+0x29/0x70 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff816888c9>] schedule_timeout+0x239/0x2d0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688ef0>] ? bit_wait+0x50/0x50 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168a41e>] io_schedule_timeout+0xae/0x130 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8168a4b8>] io_schedule+0x18/0x20 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688f01>] bit_wait_io+0x11/0x50 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81688c1f>] __wait_on_bit_lock+0x5f/0xc0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8117fd14>] __lock_page+0x74/0x90 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b1590>] ? wake_bit_function+0x40/0x40 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118be80>] write_cache_pages+0x420/0x4d0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8123ea20>] ? clean_buffers+0x60/0x60 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac80>] ? gfs2_writepages+0x20/0x20 [gfs2] Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8123e959>] mpage_writepages+0x69/0xd0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac80>] ? gfs2_writepages+0x20/0x20 [gfs2] Sep 28 18:43:30 mckinley-01 kernel: [<ffffffffa066ac75>] gfs2_writepages+0x15/0x20 [gfs2] Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118d02e>] do_writepages+0x1e/0x40 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff812283e0>] __writeback_single_inode+0x40/0x210 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff812290ce>] writeback_sb_inodes+0x25e/0x420 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8122932f>] __writeback_inodes_wb+0x9f/0xd0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81229b73>] wb_writeback+0x263/0x2f0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8118c6d0>] ? bdi_dirty_limit+0x40/0xe0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff8122ba6c>] bdi_writeback_workfn+0x1cc/0x460 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a7e0b>] process_one_work+0x17b/0x470 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a8c46>] worker_thread+0x126/0x410 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810a8b20>] ? rescuer_thread+0x460/0x460 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b03ff>] kthread+0xcf/0xe0 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b0330>] ? kthread_create_on_node+0x140/0x140 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff81695d18>] ret_from_fork+0x58/0x90 Sep 28 18:43:30 mckinley-01 kernel: [<ffffffff810b0330>] ? kthread_create_on_node+0x140/0x140 Version-Release number of selected component (if applicable): 3.10.0-505.el7.x86_64 lvm2-2.02.166-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 lvm2-libs-2.02.166-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 lvm2-cluster-2.02.166-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 device-mapper-1.02.135-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 device-mapper-libs-1.02.135-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 device-mapper-event-1.02.135-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 device-mapper-event-libs-1.02.135-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016 device-mapper-persistent-data-0.6.3-1.el7 BUILT: Fri Jul 22 05:29:13 CDT 2016 cmirror-2.02.166-1.el7 BUILT: Wed Sep 28 02:26:52 CDT 2016