Bug 1602173
| Field | Value |
|---|---|
| Summary: | Kernel bugcheck on LV conversion from RAID6 to RAID5 |
| Product: | [Community] LVM and device-mapper |
| Reporter: | Douglas Paul <doug-rh> |
| Component: | lvm2 |
| Assignee: | LVM and device-mapper development team <lvm-team> |
| lvm2 sub component: | Changing Logical Volumes |
| QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | unspecified |
| Priority: | unspecified |
| CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, zkabelac |
| Version: | 2.02.179 |
| Flags: | rule-engine: lvm-technical-solution? rule-engine: lvm-test-coverage? |
| Hardware: | x86_64 |
| OS: | Linux |
| Last Closed: | 2018-07-20 14:42:31 UTC |
| Type: | Bug |
Description

Douglas Paul, 2018-07-18 00:04:38 UTC
Additional dmesg log from LV creation until the bugcheck:

(lvcreate)

```
[174109.655087] device-mapper: raid: Superblocks created for new raid set
[174109.661767] md/raid:mdX: not clean -- starting background reconstruction
[174109.666772] md/raid:mdX: device dm-124 operational as raid disk 0
[174109.671694] md/raid:mdX: device dm-179 operational as raid disk 1
[174109.676459] md/raid:mdX: device dm-208 operational as raid disk 2
[174109.681185] md/raid:mdX: device dm-210 operational as raid disk 3
[174109.685906] md/raid:mdX: device dm-212 operational as raid disk 4
[174109.690438] md/raid:mdX: device dm-214 operational as raid disk 5
[174109.696012] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 8
[174109.799204] mdX: bitmap file is out of date, doing full recovery
[174109.901833] md: resync of RAID array mdX
[174120.746961] md: mdX: resync done.
```

(lvconvert #1)

```
[174163.621595] md/raid:mdX: device dm-124 operational as raid disk 0
[174163.626293] md/raid:mdX: device dm-179 operational as raid disk 1
[174163.630954] md/raid:mdX: device dm-208 operational as raid disk 2
[174163.635530] md/raid:mdX: device dm-210 operational as raid disk 3
[174163.639992] md/raid:mdX: device dm-212 operational as raid disk 4
[174163.644374] md/raid:mdX: device dm-214 operational as raid disk 5
[174163.649437] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 8
[174164.601362] md/raid:mdX: device dm-124 operational as raid disk 0
[174164.605900] md/raid:mdX: device dm-179 operational as raid disk 1
[174164.610393] md/raid:mdX: device dm-208 operational as raid disk 2
[174164.614821] md/raid:mdX: device dm-210 operational as raid disk 3
[174164.619149] md/raid:mdX: device dm-212 operational as raid disk 4
[174164.623360] md/raid:mdX: device dm-214 operational as raid disk 5
[174164.628161] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 18
[174164.939292] md: reshape of RAID array mdX
[174179.978504] md: mdX: reshape done.
```
(lvconvert #2)

```
[174203.005817] md/raid:mdX: not clean -- starting background reconstruction
[174203.010048] md/raid:mdX: device dm-124 operational as raid disk 0
[174203.014219] md/raid:mdX: device dm-179 operational as raid disk 1
[174203.018203] md/raid:mdX: device dm-208 operational as raid disk 2
[174203.021943] md/raid:mdX: device dm-210 operational as raid disk 3
[174203.025475] md/raid:mdX: device dm-212 operational as raid disk 4
[174203.028877] md/raid:mdX: device dm-214 operational as raid disk 5
[174203.032852] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 18
[174203.836002] md/raid:mdX: not clean -- starting background reconstruction
```

Tested ok on Fedora 27 (kernel 4.17.6-100.fc27.x86_64 / LVM version: 2.02.175(2) (2017-10-06)). Which distribution is this? Please share 'uname -r' and 'lvm version', thanks.

As in the description, this is with LVM tools 2.02.173 and 2.02.179. The uname -r was in the bugcheck report in dmesg, 4.14.52-gentoo.

The distribution is Gentoo. Version 2.02.173 was from their package; I updated to 2.02.179 myself to test it.

It seems to be adding an additional extent during the second run of lvconvert, and I don't understand why (at least this is what I see from the LVM VG config backups).

Is the 4.14 series of kernels supported? I can try building a 4.17 series kernel to see if that works better.
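An aside for readers following the logs above: the "algorithm N" values in the md assembly messages name the parity layout of the array, which is how the raid6_zr -> raid6_ls_6 -> raid5_ls progression shows up in dmesg. A small decoding sketch; the numeric constants are taken from the kernel's drivers/md/raid5.h, and the mapping to lvm2 segment-type names is my reading of that header, so treat it as an assumption:

```python
import re

# ALGORITHM_* constants from the mainline kernel's drivers/md/raid5.h,
# with the lvm2 segment type they correspond to (assumed mapping).
MD_ALGORITHMS = {
    0: "left_asymmetric",        # raid5_la
    1: "right_asymmetric",       # raid5_ra
    2: "left_symmetric",         # raid5_ls (the default raid5 layout)
    3: "right_symmetric",        # raid5_rs
    8: "rotating_zero_restart",  # raid6_zr
    18: "left_symmetric_6",      # raid6_ls_6: raid5_ls layout plus a dedicated Q drive
}

def decode(line):
    """Name the 'algorithm N' value found in an md dmesg line, if any."""
    m = re.search(r"algorithm (\d+)", line)
    return MD_ALGORITHMS.get(int(m.group(1)), "unknown") if m else None

print(decode("md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 8"))
# -> rotating_zero_restart
print(decode("md/raid:mdX: raid level 5 active with 5 out of 5 devices, algorithm 2"))
# -> left_symmetric
```

Reading the logs with this table, comment #1's first lvconvert switches algorithm 8 (raid6_zr) to 18 (raid6_ls_6), and the later 4.17.8 logs end at algorithm 2 (raid5_ls), matching the conversion path the tool reports.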
I just realized that 'lvm version' is a command:

```
LVM version: 2.02.179(2) (2018-06-18)
Library version: 1.02.148 (2018-06-18)
Driver version: 4.37.0
Configuration: ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --disable-dependency-tracking --docdir=/usr/share/doc/lvm2-2.02.179 --htmldir=/usr/share/doc/lvm2-2.02.179/html --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --exec-prefix= --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d --disable-lvmlockd-sanlock --disable-udev-systemd-background-jobs --with-systemdsystemunitdir=/lib/systemd/system --enable-dmeventd --enable-cmdlib --enable-applib --enable-fsadm --enable-lvmetad --with-mirrors=internal --with-snapshots=internal --with-thin=internal --with-cache=internal --with-thin-check=/sbin/thin_check --with-cache-check=/sbin/cache_check --with-thin-dump=/sbin/thin_dump --with-cache-dump=/sbin/cache_dump --with-thin-repair=/sbin/thin_repair --with-cache-repair=/sbin/cache_repair --with-thin-restore=/sbin/thin_restore --with-cache-restore=/sbin/cache_restore --with-lvm1=none --with-clvmd=none --with-cluster=none CLDFLAGS=-Wl,-O1 -Wl,--as-needed
```

(In reply to Douglas Paul from comment #3)
> As in the description, this is with LVM tools 2.02.173 and 2.02.179. The uname -r was in the bugcheck report in dmesg, 4.14.52-gentoo.
>
> The distribution is Gentoo. Version 2.02.173 was from their package; I updated to 2.02.179 myself to test it.
>
> It seems to be adding an additional extent during the second run of lvconvert, and I don't understand why (at least this is what I see from the LVM VG config backups).
That's expected behaviour: it is adding out-of-place reshape space, used during the reshape from raid6(_zr) to raid6_ls_6 in order to avoid writing over data in place.

> Is the 4.14 series of kernels supported? I can try building a 4.17 series kernel to see if that works better.

Yes, those are ok. Try a newer kernel though, to test whether it makes a difference in your test case.

My confusion is that the first phase of lvconvert seems to be fine, and I do end up with a raid6_ls_6.

I notice I forgot to attach the vgcfgbackup files made before and after the second command. I misread them at first, but what seems to happen is that the final extent in the segment (I guess originally added for the reshape) is moved into its own segment. Could that trigger the kernel into thinking the array is unclean? It seems to me that it should be equivalent ...

In any case, I don't know why it does this split. I would have expected it just to remove the segment containing the Q syndrome (the P should be the same as the parity for RAID5)? (This is why I suspected a problem in the user-space tools.)

The bugcheck I am hitting seems to be due to the array being unclean at the point of takeover, I guess. It seems each time I call lvconvert, it first reassembles the RAID as it currently is, then starts the reshape process, so I see two RAID assembly operations in the dmesg: first with the current format and second with the target format (see comment #1).

The second time I run lvconvert (after the reshape, for the takeover) it was complaining (in dmesg) that the array is unclean at that point. Then, at the point of the second assembly operation, I again get the message about it being unclean, then it hits the bugcheck and freezes the array (the repair in progress never completes).
I will try again with 4.17.8, and if it still fails, I guess I will build a debug version of the LVM tools to understand why it is doing this segment splitting, or try to find out why the kernel thinks the array is not clean.

Created attachment 1460029 [details]
pruned vgcfgbackup before failing command
Created attachment 1460030 [details]
pruned vgcfgbackup 'after' failing command
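The parity premise in comment #7 above is correct: RAID6's P syndrome is the same plain byte-wise XOR that RAID5 uses for its parity, and Q is the additional Reed-Solomon syndrome. A toy sketch illustrating this on in-memory blocks (just the arithmetic, not lvm2 or md code):

```python
import os
from functools import reduce

def xor_parity(blocks):
    """Byte-wise XOR of equal-sized data blocks.

    This is both RAID5's parity and RAID6's P syndrome; RAID6 merely
    adds a second, independent Q syndrome (a Reed-Solomon code word).
    """
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Four data blocks of a hypothetical stripe.
data = [os.urandom(64) for _ in range(4)]

raid5_parity = xor_parity(data)
raid6_p = xor_parity(data)          # P is computed the same way,
assert raid5_parity == raid6_p      # so dropping Q leaves a valid RAID5 stripe.

# Sanity check: XOR parity lets us rebuild any single lost block.
rebuilt = xor_parity(data[1:] + [raid5_parity])
assert rebuilt == data[0]
```

This is why the reporter expected the takeover to simply drop the Q SubLV; the extra segment he saw instead is explained in the next comment as reshape space, not parity data.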
(In reply to Douglas Paul from comment #7)
> My confusion is that the first phase of lvconvert seems to be fine, and I do end up with a raid6_ls_6.
>
> I notice I forgot to attach the vgcfgbackup files done before and after the second command. I misread them at first, but what seems to happen is that the final extent in the segment (I guess originally added for the reshape) is moved into its own segment. Could that trigger the kernel into thinking the array is unclean? It seems to me that it should be equivalent ...
>
> In any case, I don't know why it does this split. I would have expected it just to remove the segment containing the Q syndrome (the P should be the same as the parity for RAID5)? (this is why I suspected a problem in the user-space tools)

The 2 segments of each rimage LV result from allocating/moving the out-of-place reshape space at the proper offset (it is either at the beginning or at the end, depending on whether a forward reshape adding stripes or a backward reshape removing stripes is needed).

> The bugcheck I am hitting seems to be due to the array being unclean at the point of takeover, I guess.

By that you mean the respective "md: resync of ..." messages, which result from unconditionally requesting synchronization at each md array activation; this can be a no-op when the array already finished synchronizing before.

> It seems each time I call lvconvert, it first reassembles the RAID as it currently is, then starts the reshape process, so I see two RAID assembly operations in the dmesg: first with the current format and second with the target format. (see comment #1)

Device-mapper has an active and an inactive mapping table slot per mapped device (i.e. a raid device in this case). A mapping table defines mapped device address segments with start sector and length in sectors, a mapping target (e.g. "raid", "striped", ...) and target parameters in ASCII format. The mapping table in the active slot processes mapped device I/O.
When the lvconvert command performs a conversion, the new mapping defining e.g. a reshape (via the "raid" target parameters passed in) is loaded into the inactive slot, allocating all necessary resources (memory, threads, ...). This avoids resource constraints, such as failing allocations, causing blocked I/O. When that succeeds, lvm2 swaps the inactive table with the active one, quiescing mapped device I/O before doing so and unquiescing afterwards.

The 2 consecutive MD array assembly messages result from lvm2 performing this load cycle twice in order to keep the lvm2 userspace metadata in sync with the kernel. First it passes the reshape configuration in (e.g. for a raid6_zr -> raid6_ls_6 conversion), which is stored in the raid superblocks on the rmeta SubLVs (see 'lvs -o+devices $vg' for those); secondly it removes that information and reloads. This is a simplification to show the principle.

> The second time I run lvconvert (after the reshape, for the takeover) it was complaining (in dmesg) that the array is unclean at that point. Then, at the point of the second assembly operation, I again get the message about it being unclean, then it hits the bugcheck and freezes the array (the repair in progress never completes).

The BUG_ON triggered shows that the reshape has finished but the old and new raid levels aren't the same, which shouldn't be the case on takeover. Please report if a newer kernel still fails for you, thanks.

> I will try again with 4.17.8, and if it still fails, I guess I will build a debug version of the LVM tools to understand why it is doing this segment splitting, or try to find out why the kernel thinks the array is not clean.

As mentioned, any reload involves requesting synchronization, causing the respective kernel message, which may be a no-op when synchronization already finished before. Let's see a newer kernel...
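The two-slot mechanism described above can be sketched as a toy model. This is hypothetical Python mirroring the load/suspend/resume cycle (as exposed by the dmsetup commands of the same names), not any real device-mapper API; table strings are illustrative placeholders:

```python
class MappedDevice:
    """Toy model of device-mapper's per-device mapping table slots."""

    def __init__(self, table):
        self.active = table    # table currently mapping I/O
        self.inactive = None   # staging slot for the next table
        self.suspended = False

    def load(self, table):
        # Like 'dmsetup load': the new table is parsed and its resources
        # allocated in the inactive slot; a failure here leaves the
        # active mapping, and therefore in-flight I/O, untouched.
        self.inactive = table

    def suspend(self):
        self.suspended = True  # quiesce I/O on the active table

    def resume(self):
        # Like 'dmsetup resume': swap inactive -> active, then unquiesce.
        if self.inactive is not None:
            self.active, self.inactive = self.inactive, None
        self.suspended = False

dev = MappedDevice("raid raid6_zr ...")    # placeholder table strings
dev.load("raid raid6_ls_6 ...")            # reshape table staged, I/O still on old table
dev.suspend()
dev.resume()
print(dev.active)                          # -> raid raid6_ls_6 ...
```

The design point the comment makes is visible in load(): all fallible allocation happens before the swap, so the quiesced window around resume() only covers an atomic pointer exchange.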
The suspicious (to me) kernel messages I mentioned getting only in the second lvconvert were these:

```
[174203.005817] md/raid:mdX: not clean -- starting background reconstruction
```

Running 4.17.8, the results are definitely different. On the first lvconvert, I get this:

```
[  514.269796] md/raid:mdX: device dm-226 operational as raid disk 0
...
[  514.269801] md/raid:mdX: device dm-234 operational as raid disk 4
[  514.269802] md/raid:mdX: device dm-236 operational as raid disk 5
[  514.270462] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 8
[  515.331847] md: reshape of RAID array mdX
[  517.767100] md/raid:mdX: device dm-226 operational as raid disk 0
...
[  517.767105] md/raid:mdX: device dm-234 operational as raid disk 4
[  517.767106] md/raid:mdX: device dm-236 operational as raid disk 5
[  517.767718] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 18
[  517.832551] md: mdX: reshape interrupted.
[  518.756190] md: reshape of RAID array mdX
[  532.952524] md: mdX: reshape done.
```

Looks fine. In the second lvconvert, I get a new warning during the command execution:

```
# lvconvert --type raid5 Depot/reshape_test
  Using default stripesize 64.00 KiB.
  Replaced LV type raid5 (same as raid5_ls) with possible type raid5_ls.
  Repeat this command to convert to raid5 after an interim conversion has finished.
Are you sure you want to convert raid6_ls_6 LV Depot/reshape_test to raid5_ls type? [y/n]: y
  WARNING: Sync status for Depot/reshape_test is inconsistent.   <==== NEW WARNING
  Logical volume Depot/reshape_test successfully converted.
```

And in dmesg:

```
[  542.454610] md/raid:mdX: not clean -- starting background reconstruction
[  542.454648] md/raid:mdX: device dm-226 operational as raid disk 0
...
[  542.454652] md/raid:mdX: device dm-234 operational as raid disk 4
[  542.454653] md/raid:mdX: device dm-236 operational as raid disk 5
[  542.455270] md/raid:mdX: raid level 6 active with 6 out of 6 devices, algorithm 18
[  543.371373] md: resync of RAID array mdX
[  543.371387] md: mdX: resync done.
[  543.877618] md/raid:mdX: device dm-226 operational as raid disk 0
...
[  543.877623] md/raid:mdX: device dm-234 operational as raid disk 4
[  543.878445] md/raid:mdX: raid level 5 active with 5 out of 5 devices, algorithm 2
[  544.924005] md/raid:mdX: not clean -- starting background reconstruction
[  544.924060] md/raid:mdX: device dm-226 operational as raid disk 0
...
[  544.924069] md/raid:mdX: device dm-234 operational as raid disk 4
[  544.924862] md/raid:mdX: raid level 5 active with 5 out of 5 devices, algorithm 2
[  545.780715] md: resync of RAID array mdX
[  545.780732] md: mdX: resync done.
```

Looks fine to me. So I guess there is something missing in the 4.14 series of kernels. Maybe some fix needs to be backported, or LVM reshaping needs to be disabled on those kernels?

For an extra check, I did a dd from /dev/urandom after the LV was created and checked the data after the full conversion to RAID5, and the data matched fine. And looking at a new vgcfgbackup, the segments look sane, with the reshape data cleaned up (they have no extra flag on the LV type).

Yes, there are activation-related patches. Please rely on distro-supported kernels or use the newer kernel you built.