Bug 1439934
| Summary: | RAID TAKEOVER: potential for deadlock when doing takeover | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED DUPLICATE | Severity: | high |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 7.4 | Target Milestone: | rc |
| Hardware: | x86_64 | OS: | Linux |
| Type: | Bug | Last Closed: | 2017-04-24 14:35:07 UTC |
Description
Corey Marthaler
2017-04-06 22:15:31 UTC
Reproduced this with raid6 -> raid6_la_6 as well.

```
================================================================================
Iteration 0.1 started at Mon Apr 10 11:31:39 CDT 2017
================================================================================
Scenario raid6: Convert Striped raid6 volume

********* Take over hash info for this scenario *********
* from type:  raid6
* to type:    raid6_la_6
* from legs:  3
* to legs:    3
* contiguous: 0
* snapshot:   0
******************************************************

Creating original volume on host-077...
host-077: lvcreate --type raid6 -i 3 -n takeover -L 500M centipede2
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Current volume device structure:
  LV                  Attr       LSize   Cpy%Sync Devices
  takeover            rwi-a-r--- 504.00m 100.00   takeover_rimage_0(0),takeover_rimage_1(0),takeover_rimage_2(0),takeover_rimage_3(0),takeover_rimage_4(0)
  [takeover_rimage_0] iwi-aor--- 168.00m          /dev/sdf1(1)
  [takeover_rimage_1] iwi-aor--- 168.00m          /dev/sdb1(1)
  [takeover_rimage_2] iwi-aor--- 168.00m          /dev/sdc1(1)
  [takeover_rimage_3] iwi-aor--- 168.00m          /dev/sdd1(1)
  [takeover_rimage_4] iwi-aor--- 168.00m          /dev/sda1(1)
  [takeover_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
  [takeover_rmeta_1]  ewi-aor---   4.00m          /dev/sdb1(0)
  [takeover_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(0)
  [takeover_rmeta_3]  ewi-aor---   4.00m          /dev/sdd1(0)
  [takeover_rmeta_4]  ewi-aor---   4.00m          /dev/sda1(0)

Vg=centipede2 names=takeover
Creating xfs on top of mirror(s) on host-077...
host-077 mkfs.xfs -f /dev/centipede2/takeover
Mounting mirrored xfs filesystems on host-077...

Writing verification files (checkit) to mirror(s) on...
        ---- host-077 ----
<start name="host-077_takeover" pid="30295" time="Mon Apr 10 11:32:16 2017 -0500" type="cmd" />
execvp() failed.

Sleeping 15 seconds to get some outsanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
```
```
        ---- host-077 ----
Name          GrpID RgID ObjType ArID ArStart ArSize  RMrg/s WMrg/s R/s  W/s   RSz/s WSz/s   AvgRqSz QSize Util% AWait RdAWait WrAWait
host-077_load 0     0    group   0    128.00k 980.00k 0.00   0.00   0.00 17.00 0     980.00k 57.50k  0.62  4.90  36.76 0.00    36.76
Name          GrpID RgID ObjType RgStart RgSize  #Areas ArSize  ProgID
host-077_load 0     0    group   128.00k 980.00k 1      980.00k dmstats

lvconvert --yes --type raid6_la_6 centipede2/takeover

Apr 10 11:32:39 host-077 qarshd[30609]: Running cmdline: lvconvert --yes --type raid6_la_6 centipede2/takeover
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-3 operational as raid disk 0
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-5 operational as raid disk 1
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-7 operational as raid disk 2
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-9 operational as raid disk 3
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-11 operational as raid disk 4
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: raid level 6 active with 5 out of 5 devices, algorithm 8
Apr 10 11:32:39 host-077 dmeventd[30499]: No longer monitoring RAID device centipede2-takeover for events.
Apr 10 11:32:39 host-077 kernel: md: reshape of RAID array mdX
Apr 10 11:32:39 host-077 lvm[30499]: Monitoring RAID device centipede2-takeover for events.
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-3 operational as raid disk 0
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-5 operational as raid disk 1
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-7 operational as raid disk 2
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-9 operational as raid disk 3
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-11 operational as raid disk 4
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: raid level 6 active with 5 out of 5 devices, algorithm 16
Apr 10 11:32:40 host-077 dmeventd[30499]: No longer monitoring RAID device centipede2-takeover for events.
```
```
Apr 10 11:32:40 host-077 kernel: md: mdX: reshape interrupted.
Apr 10 11:33:40 host-077 systemd-udevd: worker [30613] /devices/virtual/block/dm-12 is taking a long time
Apr 10 11:35:40 host-077 systemd-udevd: worker [30613] /devices/virtual/block/dm-12 timeout; kill it
Apr 10 11:35:40 host-077 systemd-udevd: seq 114807 '/devices/virtual/block/dm-12' killed
Apr 10 11:35:50 host-077 kernel: INFO: task lvconvert:30610 blocked for more than 120 seconds.
Apr 10 11:35:50 host-077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 11:35:50 host-077 kernel: lvconvert       D ffff88003ce00fb0     0 30610  30609 0x00000080
Apr 10 11:35:50 host-077 kernel: ffff880029f83ac0 0000000000000082 ffff880029f83fd8 ffff880029f83fd8
Apr 10 11:35:50 host-077 kernel: ffff880029f83fd8 0000000000016cc0 ffff88001a1d6dd0 ffff880029f83c08
Apr 10 11:35:50 host-077 kernel: 7fffffffffffffff ffff880029f83c00 ffff88003ce00fb0 0000000000000001
Apr 10 11:35:50 host-077 kernel: Call Trace:
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c960f9>] schedule+0x29/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c93d69>] schedule_timeout+0x239/0x2c0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb869806a>] ? mod_timer+0x14a/0x220
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c964ad>] wait_for_completion+0xfd/0x140
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86c2280>] ? wake_up_state+0x20/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b454e>] __synchronize_srcu+0xfe/0x150
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b4070>] ? call_srcu+0x70/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8b06929>] ? md_stop_writes+0x29/0x30
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b45bd>] synchronize_srcu+0x1d/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f77d0>] __dm_suspend+0x60/0x210 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f9ea0>] dm_suspend+0xc0/0xd0 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ff414>] dev_suspend+0x194/0x250 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ff280>] ? table_load+0x390/0x390 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ffc45>] ctl_ioctl+0x1e5/0x500 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01fff73>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb881264d>] do_vfs_ioctl+0x33d/0x540
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88b072f>] ? file_has_perm+0x9f/0xb0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88128f1>] SyS_ioctl+0xa1/0xc0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8ca1bc9>] system_call_fastpath+0x16/0x1b
Apr 10 11:35:50 host-077 kernel: INFO: task systemd-udevd:30613 blocked for more than 120 seconds.
Apr 10 11:35:50 host-077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 11:35:50 host-077 kernel: systemd-udevd   D ffff88003ad2bec0     0 30613    484 0x00000084
Apr 10 11:35:50 host-077 kernel: ffff880013317798 0000000000000082 ffff880013317fd8 ffff880013317fd8
Apr 10 11:35:50 host-077 kernel: ffff880013317fd8 0000000000016cc0 ffff880025166dd0 ffff880017104c00
Apr 10 11:35:50 host-077 kernel: 0000000000000000 0000000000000002 ffff88000e47d218 00000000000fbf80
Apr 10 11:35:50 host-077 kernel: Call Trace:
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c960f9>] schedule+0x29/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffc05d463c>] raid5_make_request+0x5ec/0xcd0 [raid456]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86afa00>] ? wake_up_atomic_t+0x30/0x30
Apr 10 11:35:50 host-077 kernel: [<ffffffffc05f154a>] raid_map+0x2a/0x40 [dm_raid]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f80a0>] __map_bio+0x90/0x190 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f6680>] ? queue_io+0x80/0x80 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f839f>] __clone_and_map_data_bio+0x16f/0x280 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f8781>] __split_and_process_bio+0x2d1/0x520 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88e0000>] ? sha256_transform+0x15d0/0x1c40
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f8cdc>] dm_make_request+0x11c/0x190 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88f0d16>] generic_make_request+0x106/0x1e0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88f0e60>] submit_bio+0x70/0x150
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8840954>] mpage_readpages+0x124/0x160
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8839c30>] ? I_BDEV+0x10/0x10
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8839c30>] ? I_BDEV+0x10/0x10
Apr 10 11:35:50 host-077 kernel: [<ffffffffb883a60d>] blkdev_readpages+0x1d/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d15c>] __do_page_cache_readahead+0x1cc/0x250
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d6d9>] force_page_cache_readahead+0x99/0xe0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d7b7>] page_cache_sync_readahead+0x97/0xb0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878160b>] generic_file_aio_read+0x29b/0x790
Apr 10 11:35:50 host-077 kernel: [<ffffffffb883aa4c>] blkdev_aio_read+0x4c/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87fd97d>] do_sync_read+0x8d/0xd0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87fe36c>] vfs_read+0x9c/0x170
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87ff22f>] SyS_read+0x7f/0xe0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8ca1bc9>] system_call_fastpath+0x16/0x1b
```

```
3.10.0-635.el7.x86_64

lvm2-2.02.169-3.el7                             BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-libs-2.02.169-3.el7                        BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-cluster-2.02.169-3.el7                     BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-1.02.138-3.el7                    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-libs-1.02.138-3.el7               BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-1.02.138-3.el7              BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-libs-1.02.138-3.el7         BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7 BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.169-3.el7                          BUILT: Wed Mar 29 09:17:46 CDT 2017
sanlock-3.4.0-1.el7                             BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7                         BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.169-3.el7                       BUILT: Wed Mar 29 09:17:46 CDT 2017
```

Can't reproduce yet:

```
# lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r
  LV   LSize Type     Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ra rwi-aor--- 100.00      4     3    2 128.00k
# for t in raid6 raid6_n_6 raid6_ls_6 raid5 raid5_n striped raid6;do lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r;rm -fr /mnt/r/*;rsync -lr /usr/lib /mnt/r;sleep 1;lvconvert -y --ty $t nvm/r;wait;done
  LV   LSize Type     Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ra rwi-aor--- 100.00      4     3    2 128.00k
  Using default stripesize 64.00 KiB.
  Replaced LV type raid6 with possible type raid6_ra_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_ra_6 rwi-aor--- 100.00      5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid6_ra_6 LV nvm/r to raid6_n_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type      Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_n_6 rwi-aor-s- 47.13       5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid6_n_6 LV nvm/r to raid6_ls_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_ls_6 rwi-aor-s- 37.93       5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Replaced LV type raid5 with possible type raid5_ls.
  Logical volume nvm/r successfully converted.
  LV   LSize Type     Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ls rwi-aor--- 100.00      4     3    2 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid5_ls LV nvm/r to raid5_n.
  Logical volume nvm/r successfully converted.
  LV   LSize Type    Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_n rwi-aor-s- 47.13       4     3    2 128.00k
  Logical volume nvm/r successfully converted.
  LV   LSize Type    Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g striped -wi-ao----             3     3    1      0
  Using default stripesize 64.00 KiB.
  Replaced LV type raid6 with possible type raid6_n_6.
  Logical volume nvm/r successfully converted.
```
```
# lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r
  LV   LSize Type      Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_n_6 rwi-aor--- 100.00      5     3    3 512.00k
# df -h /mnt/r
Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/nvm-r  984M  849M   68M  93% /mnt/r
```

*** This bug has been marked as a duplicate of bug 1442137 ***
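For anyone revisiting this report, the failing harness run boils down to a short command sequence. The sketch below is a hedged paraphrase, not the test harness itself: the VG name, LV name, and device list come from the log, but the `vgcreate` step and the background `dd` load are assumptions (the log begins with the VG already in place, and the harness used its own `checkit` I/O generator). It is destructive to the listed PVs, so it is wrapped in a guard and does nothing unless explicitly invoked with `--run` on a disposable test machine.

```shell
#!/bin/bash
# Condensed reproducer sketch for the raid6 -> raid6_la_6 takeover hang.
# Assumption: /dev/sd{a,b,c,d,f}1 are free PVs (names taken from the log).

repro() {
    set -e
    # assumed setup step; the original log starts with centipede2 existing
    vgcreate centipede2 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1

    # 3 data stripes + 2 parity = 5-leg raid6 LV, as in the harness run
    lvcreate --type raid6 -i 3 -n takeover -L 500M centipede2

    # wait for the initial sync to complete before attempting takeover
    until lvs --noheadings -o copy_percent centipede2/takeover | grep -q 100.00; do
        sleep 5
    done

    mkfs.xfs -f /dev/centipede2/takeover
    mkdir -p /mnt/takeover
    mount /dev/centipede2/takeover /mnt/takeover

    # keep I/O outstanding during the conversion (stand-in for checkit load)
    dd if=/dev/zero of=/mnt/takeover/load bs=1M count=100 oflag=direct &

    # the conversion that preceded "reshape interrupted" and the blocked tasks
    lvconvert --yes --type raid6_la_6 centipede2/takeover
}

# Guard: destructive, root-only; run nothing unless explicitly asked.
if [ "${1:-}" = "--run" ]; then
    repro
else
    echo "dry run: pass --run on a scratch machine with spare PVs to execute"
fi
```

The in-flight I/O appears to matter: the traces above show `lvconvert` stuck in `dm_suspend`/`synchronize_srcu` while the udev worker's read is blocked inside `raid5_make_request` after the reshape was interrupted, which is consistent with a suspend/I-O deadlock rather than a simple slow reshape.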