Bug 1439399
| Summary: | RAID TAKEOVER: takeover on raid volumes containing snapshots doesn't work | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, rbednar, zkabelac |
| Version: | 7.4 | Keywords: | Reopened |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.175-1.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-10 15:20:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1782045 | | |
| Bug Blocks: | 1469559 | | |
| Attachments: | | | |

Description
Corey Marthaler
2017-04-05 22:48:38 UTC
This appears to be the case w/ all raid types.

Scenario raid6_nr: Convert Striped raid6_nr volume
********* Take over hash info for this scenario *********
* from type: raid6_nr
* to type: raid6_la_6
* snapshot: 1
******************************************************
Creating original volume on host-121...
host-121: lvcreate --type raid6_nr -i 3 -n takeover -L 500M centipede2
Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Current volume device structure:
  LV                  Attr       LSize   Cpy%Sync Devices
  takeover            rwi-a-r--- 504.00m 100.00   takeover_rimage_0(0),takeover_rimage_1(0),takeover_rimage_2(0),takeover_rimage_3(0),takeover_rimage_4(0)
  [takeover_rimage_0] iwi-aor--- 168.00m          /dev/sdg1(1)
  [takeover_rimage_1] iwi-aor--- 168.00m          /dev/sde1(1)
  [takeover_rimage_2] iwi-aor--- 168.00m          /dev/sda1(1)
  [takeover_rimage_3] iwi-aor--- 168.00m          /dev/sdd1(1)
  [takeover_rimage_4] iwi-aor--- 168.00m          /dev/sdc1(1)
  [takeover_rmeta_0]  ewi-aor---   4.00m          /dev/sdg1(0)
  [takeover_rmeta_1]  ewi-aor---   4.00m          /dev/sde1(0)
  [takeover_rmeta_2]  ewi-aor---   4.00m          /dev/sda1(0)
  [takeover_rmeta_3]  ewi-aor---   4.00m          /dev/sdd1(0)
  [takeover_rmeta_4]  ewi-aor---   4.00m          /dev/sdc1(0)

Creating ext on top of mirror(s) on host-121...
mke2fs 1.42.9 (28-Dec-2013)
Mounting mirrored ext filesystems on host-121...
Writing verification files (checkit) to mirror(s) on...
---- host-121 ----
Sleeping 15 seconds to get some outsanding I/O locks before the failure
Creating a snapshot volume of raid to be changed
lvcreate --type snapshot -L 100M -n snap -s centipede2/takeover
Verifying files (checkit) on mirror(s) on...
---- host-121 ----

lvconvert --yes --type raid6_la_6 centipede2/takeover
  Internal error: Writing metadata in critical section.

Apr 12 15:24:30 host-121 qarshd[31678]: Running cmdline: lvconvert --yes --type raid6_la_6 centipede2/takeover
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: device dm-3 operational as raid disk 0
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: device dm-5 operational as raid disk 1
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: device dm-7 operational as raid disk 2
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: device dm-9 operational as raid disk 3
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: device dm-11 operational as raid disk 4
Apr 12 15:24:31 host-121 kernel: md/raid:mdX: raid level 6 active with 5 out of 5 devices, algorithm 9
Apr 12 15:24:31 host-121 lvm[9616]: No longer monitoring RAID device centipede2-takeover-real for events.
Apr 12 15:24:31 host-121 dmeventd[9616]: No longer monitoring snapshot centipede2-snap.
Apr 12 15:26:31 host-121 kernel: INFO: task jbd2/dm-12-8:31512 blocked for more than 120 seconds.
Apr 12 15:26:31 host-121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 12 15:26:31 host-121 kernel: jbd2/dm-12-8 D ffff88003b003ec0 0 31512 2 0x00000080
Apr 12 15:26:31 host-121 kernel: ffff88000496ba60 0000000000000046 ffff88000496bfd8 ffff88000496bfd8
Apr 12 15:26:31 host-121 kernel: ffff88000496bfd8 0000000000016cc0 ffff880020845e20 ffff88003fc16cc0
Apr 12 15:26:31 host-121 kernel: 0000000000000000 7fffffffffffffff ffff88003ff5a260 ffffffffbe494380
Apr 12 15:26:31 host-121 kernel: Call Trace:
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494380>] ? bit_wait+0x50/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4960f9>] schedule+0x29/0x70
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe493d69>] schedule_timeout+0x239/0x2c0
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a1799>] ? __split_and_process_bio+0x2e9/0x520 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde60ede>] ? kvm_clock_get_cycles+0x1e/0x20
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdee6c3c>] ? ktime_get_ts64+0x4c/0xf0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494380>] ? bit_wait+0x50/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4958dd>] io_schedule_timeout+0xad/0x130
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe495978>] io_schedule+0x18/0x1a
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494391>] bit_wait_io+0x11/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe493eb5>] __wait_on_bit+0x65/0x90
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494380>] ? bit_wait+0x50/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe493f61>] out_of_line_wait_on_bit+0x81/0xb0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeafac0>] ? wake_bit_function+0x40/0x40
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe03394a>] __wait_on_buffer+0x2a/0x30
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06b3110>] jbd2_write_superblock+0xa0/0x180 [jbd2]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06b3229>] jbd2_journal_update_sb_log_tail+0x39/0xa0 [jbd2]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06ac7f4>] jbd2_journal_commit_transaction+0x17a4/0x1990 [jbd2]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdec803e>] ? account_entity_dequeue+0xae/0xd0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdecba5c>] ? dequeue_entity+0x11c/0x5d0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde60ebe>] ? kvm_clock_read+0x1e/0x20
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde29557>] ? __switch_to+0xd7/0x4c0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde96edb>] ? lock_timer_base.isra.34+0x2b/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde9738e>] ? try_to_del_timer_sync+0x5e/0x90
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06b1a89>] kjournald2+0xc9/0x260 [jbd2]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeafa00>] ? wake_up_atomic_t+0x30/0x30
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06b19c0>] ? commit_timeout+0x10/0x10 [jbd2]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeae9bf>] kthread+0xcf/0xe0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde8bf0b>] ? do_exit+0x6bb/0xa40
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeae8f0>] ? insert_kthread_work+0x40/0x40
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4a1b18>] ret_from_fork+0x58/0x90
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeae8f0>] ? insert_kthread_work+0x40/0x40
Apr 12 15:26:31 host-121 kernel: INFO: task xdoio:31533 blocked for more than 120 seconds.
Apr 12 15:26:31 host-121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 12 15:26:31 host-121 kernel: xdoio D ffff88003b005e20 0 31533 31532 0x00000080
Apr 12 15:26:31 host-121 kernel: ffff8800186efdb0 0000000000000082 ffff8800186effd8 ffff8800186effd8
Apr 12 15:26:31 host-121 kernel: ffff8800186effd8 0000000000016cc0 ffff880020c41f60 ffff88002379f000
Apr 12 15:26:31 host-121 kernel: 0000000000000001 0000000000000001 0000000000000000 ffff88002379f308
Apr 12 15:26:31 host-121 kernel: Call Trace:
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4960f9>] schedule+0x29/0x70
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe0014ae>] __sb_start_write+0xde/0x110
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeafa00>] ? wake_up_atomic_t+0x30/0x30
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdfff81e>] do_readv_writev+0x20e/0x260
Apr 12 15:26:31 host-121 kernel: [<ffffffffc06d4e10>] ? ext4_dax_fault+0x150/0x150 [ext4]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdffd9c0>] ? do_sync_read+0xd0/0xd0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde60ede>] ? kvm_clock_get_cycles+0x1e/0x20
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdee816a>] ? __getnstimeofday64+0x3a/0xd0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdfff905>] vfs_writev+0x35/0x60
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdfffabf>] SyS_writev+0x7f/0x110
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4a1bc9>] system_call_fastpath+0x16/0x1b
Apr 12 15:26:31 host-121 kernel: INFO: task lvconvert:31679 blocked for more than 120 seconds.
Apr 12 15:26:31 host-121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 12 15:26:31 host-121 kernel: lvconvert D ffff880021862f10 0 31679 31678 0x00000080
Apr 12 15:26:31 host-121 kernel: ffff88000107f8b0 0000000000000086 ffff88000107ffd8 ffff88000107ffd8
Apr 12 15:26:31 host-121 kernel: ffff88000107ffd8 0000000000016cc0 ffff880020840000 ffff88003fc16cc0
Apr 12 15:26:31 host-121 kernel: 0000000000000000 7fffffffffffffff ffff88003ff5d7e8 ffffffffbe494380
Apr 12 15:26:31 host-121 kernel: Call Trace:
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494380>] ? bit_wait+0x50/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4960f9>] schedule+0x29/0x70
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe493d69>] schedule_timeout+0x239/0x2c0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdec803e>] ? account_entity_dequeue+0xae/0xd0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdecba5c>] ? dequeue_entity+0x11c/0x5d0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbde60ede>] ? kvm_clock_get_cycles+0x1e/0x20
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494380>] ? bit_wait+0x50/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4958dd>] io_schedule_timeout+0xad/0x130
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe495978>] io_schedule+0x18/0x1a
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe494391>] bit_wait_io+0x11/0x50
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe493eb5>] __wait_on_bit+0x65/0x90
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdf7f231>] wait_on_page_bit+0x81/0xa0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdeafac0>] ? wake_bit_function+0x40/0x40
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdf7f361>] __filemap_fdatawait_range+0x111/0x190
Apr 12 15:26:31 host-121 kernel: [<ffffffffbdf82157>] filemap_fdatawait_keep_errors+0x27/0x30
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe02af9d>] sync_inodes_sb+0x16d/0x1f0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe030833>] sync_filesystem+0x63/0xb0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe0017bf>] freeze_super+0x8f/0x130
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe03b705>] freeze_bdev+0x75/0xd0
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a0868>] __dm_suspend+0xf8/0x210 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a2ea0>] dm_suspend+0xc0/0xd0 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a8414>] dev_suspend+0x194/0x250 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a8280>] ? table_load+0x390/0x390 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a8c45>] ctl_ioctl+0x1e5/0x500 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffc02a8f73>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe01264d>] do_vfs_ioctl+0x33d/0x540
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe0b072f>] ? file_has_perm+0x9f/0xb0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe0009ee>] ? ____fput+0xe/0x10
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe0128f1>] SyS_ioctl+0xa1/0xc0
Apr 12 15:26:31 host-121 kernel: [<ffffffffbe4a1bc9>] system_call_fastpath+0x16/0x1b

Created attachment 1273641
verbose lvconvert w/ snapshot attempt
This was attempted w/o running I/O so it wouldn't deadlock.

Disallowing reshape/takeover while LV is under a snapshot until future release

Output for disallowing For completeness related to comment #6:

[root@vm254 ~]# lvs -aoname,size,segtype,stripes,datastripes,syncpercent,reshapelen,origin,devices nvm
  LV           LSize   Type   #Str #DStr Cpy%Sync RSize Origin Devices
  r            128.00m raid1     2     2 100.00                r_rimage_0(0),r_rimage_1(0)
  [r_rimage_0] 128.00m linear    1     1                       /dev/sda(1)
  [r_rimage_1] 128.00m linear    1     1                       /dev/sdq(1)
  [r_rmeta_0]    4.00m linear    1     1                       /dev/sda(0)
  [r_rmeta_1]    4.00m linear    1     1                       /dev/sdq(0)
  s             12.00m linear    1     1                r      /dev/sda(33)

[root@vm254 ~]# lvconvert --ty raid5 -y nvm/r
  Using default stripesize 64.00 KiB.
  Can't convert snapshot origin nvm/r.

CANTFIX reasoning:
- though commit f1b78665ef181ccd630209243b74df0627322a35 fixes the 2-legged raid1 -> raid5 conversion, this does not provide any advantage over just keeping the raid1 layout unless additionally reshaping to more stripes
- reshaping to more (or fewer; not in this BZ's context) stripes involves a RaidLV size change after adding (or before removing) stripes
- active classic snapshots require the size of an origin LV to be constant and hence need the origin LV to be inactive when resizing via e.g. lvresize or "lvconvert --stripes ..."
- on the other hand, inactive RaidLVs can't be resized/converted because kernel state is not available but mandatory to decide if the RaidLV is fully synchronized/reshaped
-> we can't allow active RaidLVs to be reshaped when classic snapshots are on top of them (done in commit f342e803ba3c32890a2b08736fa94bdd541d5e9c as of comment #6)

(In reply to Heinz Mauelshagen from comment #11)
> Output for disallowing For completeness related to comment #6:
>
> [root@vm254 ~]# lvs
> -aoname,size,segtype,stripes,datastripes,syncpercent,reshapelen,origin,
> devices nvm
> LV LSize Type #Str #DStr Cpy%Sync RSize Origin Devices
> r 128.00m raid1 2 2 100.00
> r_rimage_0(0),r_rimage_1(0)
> [r_rimage_0] 128.00m linear 1 1 /dev/sda(1)
> [r_rimage_1] 128.00m linear 1 1 /dev/sdq(1)
> [r_rmeta_0] 4.00m linear 1 1 /dev/sda(0)
> [r_rmeta_1] 4.00m linear 1 1 /dev/sdq(0)
> s 12.00m linear 1 1 r /dev/sda(33)
>
> [root@vm254 ~]# lvconvert --ty raid5 -y nvm/r
> Using default stripesize 64.00 KiB.
> Can't convert snapshot origin nvm/r.

Can we get a clean-up of that error message? Something like:
"Unable to convert nvm/r while under snapshot(s)."
or
"Snapshots must be removed in order to convert nvm/r."

Otherwise, the user will simply ask, "why the hell not? what's wrong?".

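For anyone scripting around this restriction: since reshape/takeover is rejected while classic snapshots sit on top of the RaidLV, it may help to list those snapshots before calling lvconvert. The following is only a rough sketch, not part of lvm2. It assumes the report text came from something like `lvs --noheadings --separator , -o lv_name,lv_attr,origin <vg>`, and the helper name `classic_snapshots_of` as well as the sample rows (modeled on LV names used in this report) are made up for illustration.

```python
def classic_snapshots_of(report_text, lv_name):
    """Return names of classic snapshot LVs whose origin is lv_name.

    report_text is assumed to hold comma-separated rows of
    lv_name,lv_attr,origin (an assumption of this sketch, see lead-in).
    """
    snapshots = []
    for line in report_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, attr, origin = fields
        # The first lv_attr character is 's' for a classic snapshot and
        # 'S' for a merging one; thin snapshots are not matched here.
        if origin == lv_name and attr[:1] in ("s", "S"):
            snapshots.append(name)
    return snapshots


# Hypothetical sample rows, not real command output.
sample = """
  snap,swi-a-s---,takeover
  takeover,owi-a-r---,
"""
print(classic_snapshots_of(sample, "takeover"))
```

A non-empty result would mean the listed snapshots have to be removed before the conversion can be attempted; an empty list says nothing about other reasons lvconvert may still refuse.
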
(In reply to Jonathan Earl Brassow from comment #13)
> (In reply to Heinz Mauelshagen from comment #11)
> > Output for disallowing For completeness related to comment #6:
> >
> > [root@vm254 ~]# lvs
> > -aoname,size,segtype,stripes,datastripes,syncpercent,reshapelen,origin,
> > devices nvm
> > LV LSize Type #Str #DStr Cpy%Sync RSize Origin Devices
> > r 128.00m raid1 2 2 100.00
> > r_rimage_0(0),r_rimage_1(0)
> > [r_rimage_0] 128.00m linear 1 1 /dev/sda(1)
> > [r_rimage_1] 128.00m linear 1 1 /dev/sdq(1)
> > [r_rmeta_0] 4.00m linear 1 1 /dev/sda(0)
> > [r_rmeta_1] 4.00m linear 1 1 /dev/sdq(0)
> > s 12.00m linear 1 1 r /dev/sda(33)
> >
> > [root@vm254 ~]# lvconvert --ty raid5 -y nvm/r
> > Using default stripesize 64.00 KiB.
> > Can't convert snapshot origin nvm/r.
>
> Can we get a clean-up of that error message? Something like:
> "Unable to convert nvm/r while under snapshot(s)."
> or
> "Snapshots must be removed in order to convert nvm/r."
>
> Otherwise, the user will simply ask, "why the hell not? what's wrong?".

Done, commit a95f656d0df0fb81d68fa27bfee2350953677174 enhances the rejection message.

Fix verified in the latest rpms.

3.10.0-772.el7.x86_64

lvm2-2.02.176-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
lvm2-libs-2.02.176-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
lvm2-cluster-2.02.176-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
lvm2-lockd-2.02.176-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
lvm2-python-boom-0.8-3.el7    BUILT: Fri Nov 10 07:16:45 CST 2017
cmirror-2.02.176-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
device-mapper-1.02.145-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
device-mapper-libs-1.02.145-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
device-mapper-event-1.02.145-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
device-mapper-event-libs-1.02.145-3.el7    BUILT: Fri Nov 10 07:12:10 CST 2017
device-mapper-persistent-data-0.7.3-2.el7    BUILT: Tue Oct 10 04:00:07 CDT 2017

[root@host-116 ~]# lvs -o +segtype
  LV       VG         Attr       LSize   Pool Origin   Data%  Meta%  Move Log Cpy%Sync Convert Type
  snap     centipede2 swi-a-s--- 100.00m      takeover 1.22                                    linear
  takeover centipede2 owi-a-r---   4.06g                                      100.00           raid6_rs_6

[root@host-116 ~]# lvconvert --yes -R 16384.00k --type raid5_rs centipede2/takeover
  Using default stripesize 64.00 KiB.
  Can't convert RAID LV centipede2/takeover while under snapshot.

[root@host-116 ~]# lvconvert --yes --stripes 2 centipede2/takeover
  Using default stripesize 64.00 KiB.
  Can't convert RAID LV centipede2/takeover while under snapshot.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853