Bug 1197305 - NULL pointer deref in raid5_free, causing mdadm --stop to hang
Summary: NULL pointer deref in raid5_free, causing mdadm --stop to hang
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
Reported: 2015-02-28 07:49 UTC by Richard W.M. Jones
Modified: 2015-03-25 13:23 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2015-03-25 13:23:20 UTC
Type: Bug

Attachments
Full log (221.93 KB, text/plain)
2015-02-28 07:49 UTC, Richard W.M. Jones
rhbz1197305.tar (70.00 KB, application/x-tar)
2015-03-02 16:14 UTC, Richard W.M. Jones
Log using virtio-blk.txt (27.28 KB, text/plain)
2015-03-02 17:53 UTC, Richard W.M. Jones

External bug: Linux Kernel bugzilla 94381

Description Richard W.M. Jones 2015-02-28 07:49:15 UTC
Created attachment 996381 [details]
Full log

Description of problem:

mdadm --stop /dev/md127 hangs forever.  There are no kernel messages
printed at the point of the hang.

The complete sequence of md commands is shown in the attached full log.

This is a regression.  The same test worked a couple of weeks ago.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

See attachment.

Comment 1 Jes Sorensen 2015-03-02 14:57:11 UTC

Your bug report is incomplete. Please provide:

1) full sequence of mdadm commands executed
2) /proc/mdstat output
3) SCSI driver used to provide the drives used in the MD array
4) Is there a LVM volume sitting on top of the RAID?

There is an OOPS in the logs; that's almost certainly why it's hanging, so this
isn't an mdadm bug, but a kernel bug.


Comment 2 Richard W.M. Jones 2015-03-02 16:10:22 UTC
I wonder if it's better to supply the actual disk image?

(1) The sequence of commands is:

# we have a guest with 4 disks and 4 partitions per disk

mdadm --create --run r1t1 --level raid1 --raid-devices 2 /dev/sda1 /dev/sdb1
mdadm --create --run r1t2 --level raid1 --raid-devices 2 --chunk 64 /dev/sdc1 /dev/sdd1
mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing
mdadm --create --run r5t2 --level 5 --raid-devices 3 missing /dev/sda3 /dev/sdb3
mdadm --create --run r5t3 --level 5 --raid-devices 2 --spare-devices 2 /dev/sdc3 missing missing /dev/sdd3
mdadm -D --export /dev/md127

# we then create filesystems on all 5 of the above arrays
# we then mount them and write some stuff

# at this point we reboot the guest

mdadm --stop /dev/md127

(2) /proc/mdstat output is:

Personalities : [raid6] [raid5] [raid4] 
md127 : inactive sdc3[0] sdd3[3]
      2048 blocks super 1.2
unused devices: <none>

(3) virtio-scsi

(4) No LVM is involved here.

Comment 3 Richard W.M. Jones 2015-03-02 16:14:22 UTC
Created attachment 997143 [details]
rhbz1197305.tar
What I've done is to take the 4 disk images, xz-compress
them (they are almost completely sparse), and then tar
them into a single tarball.

To reproduce this (on the very latest Rawhide):

tar xf rhbz1197305.tar
unxz mdadm-?.img.xz
virt-rescue --ro -a mdadm-1.img -a mdadm-2.img -a mdadm-3.img -a mdadm-4.img 

At the virt-rescue prompt, type:

mdadm --stop /dev/md127 &

The mdadm command will hang forever, and at the same time you can
investigate what's going on in the guest.

Comment 4 Richard W.M. Jones 2015-03-02 16:16:37 UTC
The raid5_free oops is 100% reproducible under virt-rescue.  It happens
before mdadm is run, during boot, when the init script runs
mdadm -As --auto=yes --run.

[    2.486615] md: md127 stopped.
[    2.496163] md: bind<sdd3>
[    2.498713] md: bind<sdc3>
[    2.536029] raid6: sse2x1    6734 MB/s
[    2.554018] raid6: sse2x2    9046 MB/s
[    2.572021] raid6: sse2x4   11132 MB/s
[    2.573465] raid6: using algorithm sse2x4 (11132 MB/s)
[    2.575432] raid6: using ssse3x2 recovery algorithm
[    2.579368] async_tx: api initialized (async)
[    2.583378] xor: automatically using best checksumming function:
[    2.595015]    avx       : 23052.000 MB/sec
[    2.613338] md: raid6 personality registered for level 6
[    2.615164] md: raid5 personality registered for level 5
[    2.616940] md: raid4 personality registered for level 4
[    2.619531] md/raid:md127: not clean -- starting background reconstruction
[    2.621873] md/raid:md127: device sdc3 operational as raid disk 0
[    2.625679] md/raid:md127: allocated 0kB
[    2.627058] md/raid:md127: cannot start dirty degraded array.
[    2.630797] md/raid:md127: failed to run raid set.
[    2.632339] md: pers->run() failed ...
[    2.633617] BUG: unable to handle kernel NULL pointer dereference at 00000000000005f8
[    2.634535] IP: [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535] PGD 1b9a4067 PUD 1bd34067 PMD 0 
[    2.634535] Oops: 0000 [#1] SMP 
[    2.634535] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq iosf_mbi kvm_intel kvm snd_pcsp snd_pcm snd_timer ghash_clmulni_intel snd soundcore ata_generic i2c_piix4 serio_raw pata_acpi libcrc32c crc8 crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_balloon virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox megaraid_mm megaraid ideapad_laptop rfkill sparse_keymap virtio_net virtio_console virtio_rng virtio_blk virtio_ring virtio crc32 crct10dif_pclmul crc32c_intel crc32_pclmul
[    2.634535] CPU: 0 PID: 140 Comm: mdadm Not tainted 4.0.0-0.rc1.git1.1.fc23.x86_64 #1
[    2.634535] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.0-20150221_201410- 04/01/2014
[    2.634535] task: ffff88001bbf0000 ti: ffff88001a104000 task.ti: ffff88001a104000
[    2.634535] RIP: 0010:[<ffffffffa0251d55>]  [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535] RSP: 0018:ffff88001a107c48  EFLAGS: 00010296
[    2.634535] RAX: 0000000000000000 RBX: ffff88001df84800 RCX: 0000000000000000
[    2.634535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    2.634535] RBP: ffff88001a107c68 R08: 0000000000000001 R09: 0000000000000000
[    2.634535] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    2.634535] R13: ffff88001df84800 R14: ffff88001df84818 R15: ffffffffa0262300
[    2.634535] FS:  00007f39168fc700(0000) GS:ffff88001ec00000(0000) knlGS:0000000000000000
[    2.634535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.634535] CR2: 00000000000005f8 CR3: 000000001a2b6000 CR4: 00000000000407f0
[    2.634535] Stack:
[    2.634535]  ffff88001a107c58 ffff88001df84800 ffff88001df84818 ffff88001df84800
[    2.634535]  ffff88001a107c88 ffffffffa0251e89 ffff88001df84818 00000000fffffffb
[    2.634535]  ffff88001a107d38 ffffffff816ac0a1 0000000000000246 ffff88001df84a70
[    2.634535] Call Trace:
[    2.634535]  [<ffffffffa0251e89>] raid5_free+0x19/0x30 [raid456]
[    2.634535]  [<ffffffff816ac0a1>] md_run+0x901/0xa30
[    2.634535]  [<ffffffff816ad6ca>] ? md_ioctl+0x50a/0x1d20
[    2.634535]  [<ffffffff816ad6ca>] ? md_ioctl+0x50a/0x1d20
[    2.634535]  [<ffffffff816ac1e4>] do_md_run+0x14/0xa0
[    2.634535]  [<ffffffff816ae24e>] md_ioctl+0x108e/0x1d20
[    2.634535]  [<ffffffff810e7d55>] ? sched_clock_local+0x25/0x90
[    2.634535]  [<ffffffff8129ce6f>] ? mntput_no_expire+0x6f/0x360
[    2.634535]  [<ffffffff810e7f98>] ? sched_clock_cpu+0x98/0xd0
[    2.634535]  [<ffffffff814013fe>] blkdev_ioctl+0x1ce/0x850
[    2.634535]  [<ffffffff8129ce87>] ? mntput_no_expire+0x87/0x360
[    2.634535]  [<ffffffff8129ce05>] ? mntput_no_expire+0x5/0x360
[    2.634535]  [<ffffffff812ba993>] block_ioctl+0x43/0x50
[    2.634535]  [<ffffffff8128cf08>] do_vfs_ioctl+0x2e8/0x530
[    2.634535]  [<ffffffff811271e5>] ? rcu_read_lock_held+0x65/0x70
[    2.634535]  [<ffffffff81299bde>] ? __fget_light+0xbe/0xe0
[    2.634535]  [<ffffffff8128d1d1>] SyS_ioctl+0x81/0xa0
[    2.634535]  [<ffffffff81881b29>] system_call_fastpath+0x12/0x17
[    2.634535] Code: 1f 80 00 00 00 00 48 c7 c0 ea ff ff ff eb bb e8 52 94 e5 e0 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 49 89 fc 48 83 ec 08 <48> 8b bf f8 05 00 00 48 85 ff 74 11 48 8b 7f 18 e8 66 e5 ff e0 
[    2.634535] RIP  [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535]  RSP <ffff88001a107c48>
[    2.634535] CR2: 00000000000005f8
[    2.776821] ---[ end trace 06a77318965f6cfb ]---
/init: line 93:   140 Killed                  mdadm -As --auto=yes --run

Comment 5 Jes Sorensen 2015-03-02 16:20:38 UTC

Can you reproduce this using real storage as opposed to virtio-scsi?

I have no way of reproducing your case:
[jes@ultrasam ~]$ type -a virt-rescue
bash: type: virt-rescue: not found

I don't run virt at all!


Comment 6 Doug Ledford 2015-03-02 16:39:09 UTC
(In reply to Richard W.M. Jones from comment #2)
> I wonder if it's better to supply the actual disk image?
> (1) The sequence of commands is:
> # we have a guest with 4 disks and 4 partitions per disk
> mdadm --create --run r1t1 --level raid1 --raid-devices 2 /dev/sda1 /dev/sdb1


> mdadm --create --run r1t2 --level raid1 --raid-devices 2 --chunk 64
> /dev/sdc1 /dev/sdd1

--chunk has no effect on raid1; this device is actually a duplicate of your first device.

> mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1
> /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing

A missing spare device might as well not be specified.  This is functionally no different from a 4 disk raid5 array with no spare.

> mdadm --create --run r5t2 --level 5 --raid-devices 3 missing /dev/sda3
> /dev/sdb3

ok, a degraded raid5... although creating a degraded raid5 array is actually problematic; I'll get to that later

> mdadm --create --run r5t3 --level 5 --raid-devices 2 --spare-devices 2
> /dev/sdc3 missing missing /dev/sdd3

I'm actually surprised (very much so) that mdadm will even let you create this: a degraded, 2 disk raid5 array with a hot spare already attached.

The problem is that when you create a raid5 array, the first thing we need to do is bring it into sync.  Until that happens, rebuilds of any sort are non-deterministic.  We can't bring it into sync because you created it with too few data disks.  But you also gave it a spare, so theoretically it should be able to rebuild itself.  But it's never been initialized, so rebuilding onto the spare is a bad idea.

It might help if you pass --assume-clean on this one, but in general a 2 disk raid5 is a bad idea and mdadm should block it.  Creating a degraded raid5 array with active spares will also likely interfere with how mdadm does the initial sync of a raid5 array: namely, it adds all of the disks except one, marks that one as a spare, then triggers a rebuild onto the spare.  That's all done internally by mdadm/md because it's faster/more efficient than a resync operation, so this particular setup confounds how mdadm normally wants to initialize raid5 arrays.

> mdadm -D --export /dev/md127
> # we then create filesystems on all 5 of the above arrays
> # we then mount them and write some stuff
> # at this point we reboot the guest
> mdadm --stop /dev/md127
> (2) /proc/mdstat output is:
> Personalities : [raid6] [raid5] [raid4] 
> md127 : inactive sdc3[0] sdd3[3]
>       2048 blocks super 1.2
> unused devices: <none>
> (3) virtio-scsi
> (4) No LVM is involved here.

Well, md127 *is* stopped.  Being in the inactive state is the same thing.  If you did mdadm -r /dev/md127 /dev/sdc3 /dev/sdd3, the device would go away and the system could be brought down.  I'm guessing this device never even made it to an active state, which is why a normal mdadm -S operation won't work on it.

Comment 7 Richard W.M. Jones 2015-03-02 16:44:27 UTC
It's not really good to have a disk image that causes the kernel
to segfault on boot, no matter how crazy we are to try creating such
an image ...

Comment 8 Jes Sorensen 2015-03-02 16:56:54 UTC
(In reply to Richard W.M. Jones from comment #7)
> It's not really good to have a disk image that causes the kernel
> to segfault on boot, no matter how crazy we are to try creating such
> an image ...

An image executing bad commands can crash the kernel any time, but yes
mdadm shouldn't.

Without answers to the previous questions and a proper bug report, this
isn't going to get debugged, though.  In particular: can you reproduce
this without using virtio-scsi?


Comment 9 Richard W.M. Jones 2015-03-02 17:53:39 UTC
Created attachment 997179 [details]
Log using virtio-blk.txt

I really don't have 4 spare SCSI disks to test this with, and I
only care that it fails under virt.

I hacked up virt-rescue to use virtio-blk drives instead, and it fails
with the exact same OOPS.  See the attachment.

Comment 10 Jes Sorensen 2015-03-05 14:24:47 UTC
As Doug already pointed out, you are creating your arrays in a broken manner.
It is up to you to show us that this happens on a type of disk that anyone
actually cares about, to use your own words.

In addition, running RAID inside a virt guest makes little sense in the first place.

I am not going to install some random obscure virt tools just to be able to
reproduce this.


Comment 11 Richard W.M. Jones 2015-03-05 14:38:17 UTC
(a) We don't control what disk images people will upload to (eg)
OpenStack.  This is a *security issue* if a malformed disk image
causes a crash / hang in the kernel.

(b) RAID inside virt does make sense in the context of virt-p2v,
and is a supported case for RHEL.

(c) I have shown (comment 9) this is not an issue specific to virtio-scsi.

If you don't want to look at the bug, ignore it or remove yourself from
the CC.

Comment 12 Jes Sorensen 2015-03-05 14:46:35 UTC
Well first of all this is a Fedora bug, not a RHEL one!

Second, I maintain the RAID stack and I asked you to provide proper data, but
you ask people to run obscure virt tools that I don't have installed anywhere
and have no use for.

Third, running RAID inside a guest is silly and inefficient; it should be run
at the host level.  Sure, people can do silly things, and they will.  They can
also run 'cat /dev/random > /dev/mem' in a script running as root.

Fourth, Doug pointed out you were assembling and forcing arrays to run in a
broken way.

Last, I'm the RAID maintainer and I cannot just unsubscribe myself from RAID
bugs, but as the RAID maintainer I ask you to test this on proper storage
so I have some data to work from!  If you do not find that worth your while,
I will have to close this as INSUFFICIENT_DATA!


Comment 13 Richard W.M. Jones 2015-03-05 14:51:48 UTC
Taking - I'll fix this myself when I have the time to look at it.

Comment 14 Richard W.M. Jones 2015-03-05 22:24:50 UTC
git bisect points to one of these 4 commits.  Other errors meant
I could not narrow it further.

# There are only 'skip'ped commits left to test.                                 
# The first bad commit could be any of:                                          
# afa0f557cb15176570a18fb2a093e348a793afd4                                       
# 5aa61f427e4979be733e4847b9199ff9cc48a47e                                       
# db721d32b74b51a5ac9ec9fab1d85cba90dbdbd3                                       
# 36d091f4759d194c99f0705d412afe208622b45a                                       
# We cannot bisect more!                                                         


Comment 15 Richard W.M. Jones 2015-03-06 10:21:07 UTC
Upstream bug filed, with potential fix:

Comment 16 Richard W.M. Jones 2015-03-25 13:23:20 UTC
Fixed upstream (commit 0c35bd4723e4a39ba2da4c13a22cb97986ee10c8).
