Bug 1439934
| Summary: | RAID TAKEOVER: potential for deadlock when doing takeover | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 7.4 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-04-24 14:35:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2017-04-06 22:15:31 UTC
Reproduced this with raid6 -> raid6_la_6 as well.
================================================================================
Iteration 0.1 started at Mon Apr 10 11:31:39 CDT 2017
================================================================================
Scenario raid6: Convert Striped raid6 volume
********* Take over hash info for this scenario *********
* from type: raid6
* to type: raid6_la_6
* from legs: 3
* to legs: 3
* contiguous: 0
* snapshot: 0
******************************************************
Creating original volume on host-077...
host-077: lvcreate --type raid6 -i 3 -n takeover -L 500M centipede2
Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Current volume device structure:
LV Attr LSize Cpy%Sync Devices
takeover rwi-a-r--- 504.00m 100.00 takeover_rimage_0(0),takeover_rimage_1(0),takeover_rimage_2(0),takeover_rimage_3(0),takeover_rimage_4(0)
[takeover_rimage_0] iwi-aor--- 168.00m /dev/sdf1(1)
[takeover_rimage_1] iwi-aor--- 168.00m /dev/sdb1(1)
[takeover_rimage_2] iwi-aor--- 168.00m /dev/sdc1(1)
[takeover_rimage_3] iwi-aor--- 168.00m /dev/sdd1(1)
[takeover_rimage_4] iwi-aor--- 168.00m /dev/sda1(1)
[takeover_rmeta_0] ewi-aor--- 4.00m /dev/sdf1(0)
[takeover_rmeta_1] ewi-aor--- 4.00m /dev/sdb1(0)
[takeover_rmeta_2] ewi-aor--- 4.00m /dev/sdc1(0)
[takeover_rmeta_3] ewi-aor--- 4.00m /dev/sdd1(0)
[takeover_rmeta_4] ewi-aor--- 4.00m /dev/sda1(0)
Vg=centipede2
names=takeover
Creating xfs on top of mirror(s) on host-077...
host-077 mkfs.xfs -f /dev/centipede2/takeover
Mounting mirrored xfs filesystems on host-077...
Writing verification files (checkit) to mirror(s) on...
---- host-077 ----
<start name="host-077_takeover" pid="30295" time="Mon Apr 10 11:32:16 2017 -0500" type="cmd" />
execvp() failed.
Sleeping 15 seconds to get some outstanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- host-077 ----
Name GrpID RgID ObjType ArID ArStart ArSize RMrg/s WMrg/s R/s W/s RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
host-077_load 0 0 group 0 128.00k 980.00k 0.00 0.00 0.00 17.00 0 980.00k 57.50k 0.62 4.90 36.76 0.00 36.76
Name GrpID RgID ObjType RgStart RgSize #Areas ArSize ProgID
host-077_load 0 0 group 128.00k 980.00k 1 980.00k dmstats
lvconvert --yes --type raid6_la_6 centipede2/takeover
Apr 10 11:32:39 host-077 qarshd[30609]: Running cmdline: lvconvert --yes --type raid6_la_6 centipede2/takeover
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-3 operational as raid disk 0
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-5 operational as raid disk 1
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-7 operational as raid disk 2
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-9 operational as raid disk 3
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: device dm-11 operational as raid disk 4
Apr 10 11:32:39 host-077 kernel: md/raid:mdX: raid level 6 active with 5 out of 5 devices, algorithm 8
Apr 10 11:32:39 host-077 dmeventd[30499]: No longer monitoring RAID device centipede2-takeover for events.
Apr 10 11:32:39 host-077 kernel: md: reshape of RAID array mdX
Apr 10 11:32:39 host-077 lvm[30499]: Monitoring RAID device centipede2-takeover for events.
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-3 operational as raid disk 0
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-5 operational as raid disk 1
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-7 operational as raid disk 2
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-9 operational as raid disk 3
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: device dm-11 operational as raid disk 4
Apr 10 11:32:40 host-077 kernel: md/raid:mdX: raid level 6 active with 5 out of 5 devices, algorithm 16
Apr 10 11:32:40 host-077 dmeventd[30499]: No longer monitoring RAID device centipede2-takeover for events.
Apr 10 11:32:40 host-077 kernel: md: mdX: reshape interrupted.
Apr 10 11:33:40 host-077 systemd-udevd: worker [30613] /devices/virtual/block/dm-12 is taking a long time
Apr 10 11:35:40 host-077 systemd-udevd: worker [30613] /devices/virtual/block/dm-12 timeout; kill it
Apr 10 11:35:40 host-077 systemd-udevd: seq 114807 '/devices/virtual/block/dm-12' killed
Apr 10 11:35:50 host-077 kernel: INFO: task lvconvert:30610 blocked for more than 120 seconds.
Apr 10 11:35:50 host-077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 11:35:50 host-077 kernel: lvconvert D ffff88003ce00fb0 0 30610 30609 0x00000080
Apr 10 11:35:50 host-077 kernel: ffff880029f83ac0 0000000000000082 ffff880029f83fd8 ffff880029f83fd8
Apr 10 11:35:50 host-077 kernel: ffff880029f83fd8 0000000000016cc0 ffff88001a1d6dd0 ffff880029f83c08
Apr 10 11:35:50 host-077 kernel: 7fffffffffffffff ffff880029f83c00 ffff88003ce00fb0 0000000000000001
Apr 10 11:35:50 host-077 kernel: Call Trace:
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c960f9>] schedule+0x29/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c93d69>] schedule_timeout+0x239/0x2c0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb869806a>] ? mod_timer+0x14a/0x220
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c964ad>] wait_for_completion+0xfd/0x140
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86c2280>] ? wake_up_state+0x20/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b454e>] __synchronize_srcu+0xfe/0x150
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b4070>] ? call_srcu+0x70/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8b06929>] ? md_stop_writes+0x29/0x30
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86b45bd>] synchronize_srcu+0x1d/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f77d0>] __dm_suspend+0x60/0x210 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f9ea0>] dm_suspend+0xc0/0xd0 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ff414>] dev_suspend+0x194/0x250 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ff280>] ? table_load+0x390/0x390 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01ffc45>] ctl_ioctl+0x1e5/0x500 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01fff73>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb881264d>] do_vfs_ioctl+0x33d/0x540
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88b072f>] ? file_has_perm+0x9f/0xb0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88128f1>] SyS_ioctl+0xa1/0xc0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8ca1bc9>] system_call_fastpath+0x16/0x1b
Apr 10 11:35:50 host-077 kernel: INFO: task systemd-udevd:30613 blocked for more than 120 seconds.
Apr 10 11:35:50 host-077 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 11:35:50 host-077 kernel: systemd-udevd D ffff88003ad2bec0 0 30613 484 0x00000084
Apr 10 11:35:50 host-077 kernel: ffff880013317798 0000000000000082 ffff880013317fd8 ffff880013317fd8
Apr 10 11:35:50 host-077 kernel: ffff880013317fd8 0000000000016cc0 ffff880025166dd0 ffff880017104c00
Apr 10 11:35:50 host-077 kernel: 0000000000000000 0000000000000002 ffff88000e47d218 00000000000fbf80
Apr 10 11:35:50 host-077 kernel: Call Trace:
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8c960f9>] schedule+0x29/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffc05d463c>] raid5_make_request+0x5ec/0xcd0 [raid456]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb86afa00>] ? wake_up_atomic_t+0x30/0x30
Apr 10 11:35:50 host-077 kernel: [<ffffffffc05f154a>] raid_map+0x2a/0x40 [dm_raid]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f80a0>] __map_bio+0x90/0x190 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f6680>] ? queue_io+0x80/0x80 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f839f>] __clone_and_map_data_bio+0x16f/0x280 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f8781>] __split_and_process_bio+0x2d1/0x520 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88e0000>] ? sha256_transform+0x15d0/0x1c40
Apr 10 11:35:50 host-077 kernel: [<ffffffffc01f8cdc>] dm_make_request+0x11c/0x190 [dm_mod]
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88f0d16>] generic_make_request+0x106/0x1e0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb88f0e60>] submit_bio+0x70/0x150
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8840954>] mpage_readpages+0x124/0x160
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8839c30>] ? I_BDEV+0x10/0x10
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8839c30>] ? I_BDEV+0x10/0x10
Apr 10 11:35:50 host-077 kernel: [<ffffffffb883a60d>] blkdev_readpages+0x1d/0x20
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d15c>] __do_page_cache_readahead+0x1cc/0x250
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d6d9>] force_page_cache_readahead+0x99/0xe0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878d7b7>] page_cache_sync_readahead+0x97/0xb0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb878160b>] generic_file_aio_read+0x29b/0x790
Apr 10 11:35:50 host-077 kernel: [<ffffffffb883aa4c>] blkdev_aio_read+0x4c/0x70
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87fd97d>] do_sync_read+0x8d/0xd0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87fe36c>] vfs_read+0x9c/0x170
Apr 10 11:35:50 host-077 kernel: [<ffffffffb87ff22f>] SyS_read+0x7f/0xe0
Apr 10 11:35:50 host-077 kernel: [<ffffffffb8ca1bc9>] system_call_fastpath+0x16/0x1b
3.10.0-635.el7.x86_64
lvm2-2.02.169-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-libs-2.02.169-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-cluster-2.02.169-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-1.02.138-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-libs-1.02.138-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-1.02.138-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-libs-1.02.138-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7 BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.169-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
sanlock-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.169-3.el7 BUILT: Wed Mar 29 09:17:46 CDT 2017
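
For reference, the reproduction boils down to the following sequence (a condensed sketch distilled from the log above; the VG/LV names and sizes are the ones the harness used, while the mount point and the dd load are assumptions standing in for the checkit I/O):

    # create a raid6 LV with 3 data stripes and let it sync completely
    lvcreate --type raid6 -i 3 -n takeover -L 500M centipede2
    # wait until 'lvs -o copy_percent centipede2/takeover' reports 100.00
    mkfs.xfs -f /dev/centipede2/takeover
    mount /dev/centipede2/takeover /mnt/takeover            # hypothetical mount point
    dd if=/dev/zero of=/mnt/takeover/load bs=1M count=128 &  # keep I/O outstanding during the conversion
    # the takeover below starts a reshape; with I/O in flight the subsequent
    # device suspend can hang, as seen in the lvconvert/systemd-udevd traces above
    lvconvert --yes --type raid6_la_6 centipede2/takeover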
Can't reproduce yet:

# lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ra   rwi-aor--- 100.00      4     3    2 128.00k

# for t in raid6 raid6_n_6 raid6_ls_6 raid5 raid5_n striped raid6;do lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r;rm -fr /mnt/r/*;rsync -lr /usr/lib /mnt/r;sleep 1;lvconvert -y --ty $t nvm/r;wait;done
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ra   rwi-aor--- 100.00      4     3    2 128.00k
  Using default stripesize 64.00 KiB.
  Replaced LV type raid6 with possible type raid6_ra_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_ra_6 rwi-aor--- 100.00      5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid6_ra_6 LV nvm/r to raid6_n_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_n_6  rwi-aor-s- 47.13       5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid6_n_6 LV nvm/r to raid6_ls_6.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_ls_6 rwi-aor-s- 37.93       5     3    3 128.00k
  Using default stripesize 64.00 KiB.
  Replaced LV type raid5 with possible type raid5_ls.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_ls   rwi-aor--- 100.00      4     3    2 128.00k
  Using default stripesize 64.00 KiB.
  Converting raid5_ls LV nvm/r to raid5_n.
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid5_n    rwi-aor-s- 47.13       4     3    2 128.00k
  Logical volume nvm/r successfully converted.
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g striped    -wi-ao----              3     3    1       0
  Using default stripesize 64.00 KiB.
  Replaced LV type raid6 with possible type raid6_n_6.
  Logical volume nvm/r successfully converted.

# lvs -oname,size,segtype,attr,copypercent,stripes,datastripes,datacopies,regionsize nvm/r
  LV   LSize Type       Attr       Cpy%Sync #Str #DStr #Cpy Region
  r    1.01g raid6_n_6  rwi-aor--- 100.00      5     3    3 512.00k

# df -h /mnt/r
  Filesystem        Size  Used Avail Use% Mounted on
  /dev/mapper/nvm-r 984M  849M   68M  93% /mnt/r

*** This bug has been marked as a duplicate of bug 1442137 ***