Bug 1197305 - NULL pointer deref in raid5_free, causing mdadm --stop to hang
Summary: NULL pointer deref in raid5_free, causing mdadm --stop to hang
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
Reported: 2015-02-28 07:49 UTC by Richard W.M. Jones
Modified: 2015-03-25 13:23 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-25 13:23:20 UTC
Type: Bug


Attachments
Full log (221.93 KB, text/plain)
2015-02-28 07:49 UTC, Richard W.M. Jones
rhbz1197305.tar (70.00 KB, application/x-tar)
2015-03-02 16:14 UTC, Richard W.M. Jones
Log using virtio-blk.txt (27.28 KB, text/plain)
2015-03-02 17:53 UTC, Richard W.M. Jones


Links
System ID Priority Status Summary Last Updated
Linux Kernel 94381 None None None Never

Description Richard W.M. Jones 2015-02-28 07:49:15 UTC
Created attachment 996381
Full log

Description of problem:

mdadm --stop /dev/md127 hangs forever.  There are no kernel
messages.

The complete sequence of md commands is shown in the
attachment.

This is a regression.  The same test worked a couple of weeks ago.

Version-Release number of selected component (if applicable):

mdadm-3.3.2-1.fc22.x86_64

How reproducible:

100%

Steps to Reproduce:

See attachment.

Comment 1 Jes Sorensen 2015-03-02 14:57:11 UTC
Richard,

Your bug report is incomplete. Please provide:

1) full sequence of mdadm commands executed
2) /proc/mdstat output
3) SCSI driver used to provide the drives used in the MD array
4) Is there a LVM volume sitting on top of the RAID?

There is an OOPS in the logs, and that's almost certainly why it's hanging, so
this isn't an mdadm bug, but a kernel bug.

Jes

Comment 2 Richard W.M. Jones 2015-03-02 16:10:22 UTC
I wonder if it's better to supply the actual disk image?

(1) The sequence of commands is:

# we have a guest with 4 disks and 4 partitions per disk

mdadm --create --run r1t1 --level raid1 --raid-devices 2 /dev/sda1 /dev/sdb1
mdadm --create --run r1t2 --level raid1 --raid-devices 2 --chunk 64 /dev/sdc1 /dev/sdd1
mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing
mdadm --create --run r5t2 --level 5 --raid-devices 3 missing /dev/sda3 /dev/sdb3
mdadm --create --run r5t3 --level 5 --raid-devices 2 --spare-devices 2 /dev/sdc3 missing missing /dev/sdd3
mdadm -D --export /dev/md127

# we then create filesystems on all 5 of the above arrays
# we then mount them and write some stuff

# at this point we reboot the guest

mdadm --stop /dev/md127

(2) /proc/mdstat output is:

Personalities : [raid6] [raid5] [raid4] 
md127 : inactive sdc3[0] sdd3[3]
      2048 blocks super 1.2
       
unused devices: <none>

(3) virtio-scsi

(4) No LVM is involved here.

Comment 3 Richard W.M. Jones 2015-03-02 16:14:22 UTC
Created attachment 997143
rhbz1197305.tar

What I've done is to take the 4 disk images, xz-compress
them (they are almost completely sparse), and then tar
them into a single tarball.

To reproduce this (on the very latest Rawhide):

tar xf rhbz1197305.tar
unxz mdadm-?.img.xz
virt-rescue --ro -a mdadm-1.img -a mdadm-2.img -a mdadm-3.img -a mdadm-4.img 

At the virt-rescue prompt, type:

mdadm --stop /dev/md127 &

The mdadm command will hang forever, and at the same time you can
investigate what's going on in the guest.

Comment 4 Richard W.M. Jones 2015-03-02 16:16:37 UTC
The raid5_free oops is 100% reproducible under virt-rescue.  It happens
before mdadm is run, during boot, when we run the mdadm -As --auto=yes --run
command:

[    2.486615] md: md127 stopped.
[    2.496163] md: bind<sdd3>
[    2.498713] md: bind<sdc3>
[    2.536029] raid6: sse2x1    6734 MB/s
[    2.554018] raid6: sse2x2    9046 MB/s
[    2.572021] raid6: sse2x4   11132 MB/s
[    2.573465] raid6: using algorithm sse2x4 (11132 MB/s)
[    2.575432] raid6: using ssse3x2 recovery algorithm
[    2.579368] async_tx: api initialized (async)
[    2.583378] xor: automatically using best checksumming function:
[    2.595015]    avx       : 23052.000 MB/sec
[    2.613338] md: raid6 personality registered for level 6
[    2.615164] md: raid5 personality registered for level 5
[    2.616940] md: raid4 personality registered for level 4
[    2.619531] md/raid:md127: not clean -- starting background reconstruction
[    2.621873] md/raid:md127: device sdc3 operational as raid disk 0
[    2.625679] md/raid:md127: allocated 0kB
[    2.627058] md/raid:md127: cannot start dirty degraded array.
[    2.630797] md/raid:md127: failed to run raid set.
[    2.632339] md: pers->run() failed ...
[    2.633617] BUG: unable to handle kernel NULL pointer dereference at 00000000000005f8
[    2.634535] IP: [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535] PGD 1b9a4067 PUD 1bd34067 PMD 0 
[    2.634535] Oops: 0000 [#1] SMP 
[    2.634535] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq iosf_mbi kvm_intel kvm snd_pcsp snd_pcm snd_timer ghash_clmulni_intel snd soundcore ata_generic i2c_piix4 serio_raw pata_acpi libcrc32c crc8 crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_balloon virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox megaraid_mm megaraid ideapad_laptop rfkill sparse_keymap virtio_net virtio_console virtio_rng virtio_blk virtio_ring virtio crc32 crct10dif_pclmul crc32c_intel crc32_pclmul
[    2.634535] CPU: 0 PID: 140 Comm: mdadm Not tainted 4.0.0-0.rc1.git1.1.fc23.x86_64 #1
[    2.634535] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.0-20150221_201410- 04/01/2014
[    2.634535] task: ffff88001bbf0000 ti: ffff88001a104000 task.ti: ffff88001a104000
[    2.634535] RIP: 0010:[<ffffffffa0251d55>]  [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535] RSP: 0018:ffff88001a107c48  EFLAGS: 00010296
[    2.634535] RAX: 0000000000000000 RBX: ffff88001df84800 RCX: 0000000000000000
[    2.634535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    2.634535] RBP: ffff88001a107c68 R08: 0000000000000001 R09: 0000000000000000
[    2.634535] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    2.634535] R13: ffff88001df84800 R14: ffff88001df84818 R15: ffffffffa0262300
[    2.634535] FS:  00007f39168fc700(0000) GS:ffff88001ec00000(0000) knlGS:0000000000000000
[    2.634535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.634535] CR2: 00000000000005f8 CR3: 000000001a2b6000 CR4: 00000000000407f0
[    2.634535] Stack:
[    2.634535]  ffff88001a107c58 ffff88001df84800 ffff88001df84818 ffff88001df84800
[    2.634535]  ffff88001a107c88 ffffffffa0251e89 ffff88001df84818 00000000fffffffb
[    2.634535]  ffff88001a107d38 ffffffff816ac0a1 0000000000000246 ffff88001df84a70
[    2.634535] Call Trace:
[    2.634535]  [<ffffffffa0251e89>] raid5_free+0x19/0x30 [raid456]
[    2.634535]  [<ffffffff816ac0a1>] md_run+0x901/0xa30
[    2.634535]  [<ffffffff816ad6ca>] ? md_ioctl+0x50a/0x1d20
[    2.634535]  [<ffffffff816ad6ca>] ? md_ioctl+0x50a/0x1d20
[    2.634535]  [<ffffffff816ac1e4>] do_md_run+0x14/0xa0
[    2.634535]  [<ffffffff816ae24e>] md_ioctl+0x108e/0x1d20
[    2.634535]  [<ffffffff810e7d55>] ? sched_clock_local+0x25/0x90
[    2.634535]  [<ffffffff8129ce6f>] ? mntput_no_expire+0x6f/0x360
[    2.634535]  [<ffffffff810e7f98>] ? sched_clock_cpu+0x98/0xd0
[    2.634535]  [<ffffffff814013fe>] blkdev_ioctl+0x1ce/0x850
[    2.634535]  [<ffffffff8129ce87>] ? mntput_no_expire+0x87/0x360
[    2.634535]  [<ffffffff8129ce05>] ? mntput_no_expire+0x5/0x360
[    2.634535]  [<ffffffff812ba993>] block_ioctl+0x43/0x50
[    2.634535]  [<ffffffff8128cf08>] do_vfs_ioctl+0x2e8/0x530
[    2.634535]  [<ffffffff811271e5>] ? rcu_read_lock_held+0x65/0x70
[    2.634535]  [<ffffffff81299bde>] ? __fget_light+0xbe/0xe0
[    2.634535]  [<ffffffff8128d1d1>] SyS_ioctl+0x81/0xa0
[    2.634535]  [<ffffffff81881b29>] system_call_fastpath+0x12/0x17
[    2.634535] Code: 1f 80 00 00 00 00 48 c7 c0 ea ff ff ff eb bb e8 52 94 e5 e0 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 49 89 fc 48 83 ec 08 <48> 8b bf f8 05 00 00 48 85 ff 74 11 48 8b 7f 18 e8 66 e5 ff e0 
[    2.634535] RIP  [<ffffffffa0251d55>] free_conf+0x15/0x130 [raid456]
[    2.634535]  RSP <ffff88001a107c48>
[    2.634535] CR2: 00000000000005f8
[    2.776821] ---[ end trace 06a77318965f6cfb ]---
/init: line 93:   140 Killed                  mdadm -As --auto=yes --run
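
Reading the trace above: the byte marked <48> in the Code: line starts the
sequence 48 8b bf f8 05 00 00, which decodes to a 64-bit load from offset
0x5f8 relative to RDI.  RDI is 0 in the register dump, so the load touches
address 0x5f8, which is exactly the CR2 value.  That is consistent with
free_conf being entered with a NULL pointer as its first argument.  A minimal
sketch of that arithmetic (the function name fault_addr is made up for
illustration, it is not part of any tool):

```shell
# fault_addr BASE_HEX DISP_HEX
# Prints BASE + DISP as a 16-digit hex address, the way CR2 appears in an oops.
# Both arguments are hex digits without a leading 0x.
fault_addr() {
    printf '%016x\n' "$(( 0x$1 + 0x$2 ))"
}

# RDI from the register dump, displacement from the faulting instruction.
fault_addr 0000000000000000 5f8
```

This should print 00000000000005f8, matching the CR2 line in the oops.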

Comment 5 Jes Sorensen 2015-03-02 16:20:38 UTC
Richard,

Can you reproduce this using real storage as opposed to virtio-scsi?

I have no way of reproducing your case:
[jes@ultrasam ~]$ type -a virt-rescue
bash: type: virt-rescue: not found

I don't run virt at all!

Jes

Comment 6 Doug Ledford 2015-03-02 16:39:09 UTC
(In reply to Richard W.M. Jones from comment #2)
> I wonder if it's better to supply the actual disk image?
> 
> (1) The sequence of commands is:
> 
> # we have a guest with 4 disks and 4 partitions per disk
> 
> mdadm --create --run r1t1 --level raid1 --raid-devices 2 /dev/sda1 /dev/sdb1

OK.

> mdadm --create --run r1t2 --level raid1 --raid-devices 2 --chunk 64
> /dev/sdc1 /dev/sdd1

--chunk has no effect on raid1; this device is actually a duplicate of your first device.

> mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1
> /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing

a missing spare device might as well not be specified.  This is functionally no different than a 4 disk raid5 array with no spare.

> mdadm --create --run r5t2 --level 5 --raid-devices 3 missing /dev/sda3
> /dev/sdb3

ok, a degraded raid5...although creating a degraded raid5 array is actually problematic, but I'll get to that later

> mdadm --create --run r5t3 --level 5 --raid-devices 2 --spare-devices 2
> /dev/sdc3 missing missing /dev/sdd3

I'm actually surprised (very much so) that mdadm will even let you create this: a degraded, 2 disk raid5 array with a hot spare already attached.  The problem is that when you create a raid5 array, the first thing we need to do is to bring it into sync.  Until that happens, rebuilds of any sort are non-deterministic.  We can't bring it into sync because you created it with too few data disks.  But you also gave it a spare, so theoretically it should be able to rebuild itself.  But it's never been initialized, so rebuilding onto the spare is a bad idea.  It might help if you pass --assume-clean on this one, but in general a 2 disk raid5 is a bad idea and mdadm should block it.

Creating a degraded raid5 array with active spares will also likely interfere with how mdadm likes to do the initial sync of a raid5 array: it adds all of the disks except one, marks that one as a spare, then triggers a rebuild onto the spare.  That's all done internally by mdadm/md because it's faster and more efficient than a resync operation, so this particular setup confounds how mdadm normally wants to initialize raid5 arrays.

> mdadm -D --export /dev/md127
> 
> # we then create filesystems on all 5 of the above arrays
> # we then mount them and write some stuff
> 
> # at this point we reboot the guest
> 
> mdadm --stop /dev/md127
> 
> (2) /proc/mdstat output is:
> 
> Personalities : [raid6] [raid5] [raid4] 
> md127 : inactive sdc3[0] sdd3[3]
>       2048 blocks super 1.2
>        
> unused devices: <none>
> 
> (3) virtio-scsi
> 
> (4) No LVM is involved here.

Well, md127 *is* stopped.  Being in inactive state is the same.  If you did mdadm -r /dev/md127 /dev/sdc3 /dev/sdd3, the device would go away and the system could be brought down.  I'm guessing this device never even made it to an active state, which is why a normal mdadm -S operation won't work on it.
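
The inactive state discussed here is visible in the /proc/mdstat output quoted
above.  As a small sketch (the function name inactive_arrays is an assumption,
and the field layout is taken only from the sample in comment 2), arrays in
that state could be listed like this:

```shell
# inactive_arrays: read mdstat-formatted text on stdin and print the name of
# every array whose state field is "inactive".
# Assumes array lines of the form "md127 : inactive sdc3[0] sdd3[3]".
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# On a live system this would be run as:
#   inactive_arrays < /proc/mdstat
```

For the mdstat output in this report it prints md127 and nothing else.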

Comment 7 Richard W.M. Jones 2015-03-02 16:44:27 UTC
It's not really good to have a disk image that causes the kernel
to segfault on boot, no matter how crazy we are to try creating such
an image ...

Comment 8 Jes Sorensen 2015-03-02 16:56:54 UTC
(In reply to Richard W.M. Jones from comment #7)
> It's not really good to have a disk image that causes the kernel
> to segfault on boot, no matter how crazy we are to try creating such
> an image ...

An image executing bad commands can crash the kernel any time, but yes
mdadm shouldn't.

Without answers to the previous questions and a proper bug report, this
isn't going to get debugged though. In particular, can you reproduce this
without using virtio-scsi?

Jes

Comment 9 Richard W.M. Jones 2015-03-02 17:53:39 UTC
Created attachment 997179
Log using virtio-blk.txt

I really don't have 4 spare SCSI disks to test this with, and I
only care that it fails under virt.

I hacked up virt-rescue to use virtio-blk drives instead, and it fails
with the exact same OOPS.  See the attachment.

Comment 10 Jes Sorensen 2015-03-05 14:24:47 UTC
As Doug already pointed out, you are creating your arrays in a broken manner.
It is up to you to show us that this happens on a type of disk that anyone
actually cares about, to use your own words.

In addition, running RAID inside a virt guest makes little sense in the first
place.

I am not going to install some random obscure virt tools just to be able to
reproduce this.

Jes

Comment 11 Richard W.M. Jones 2015-03-05 14:38:17 UTC
(a) We don't control what disk images people will upload to (eg)
OpenStack.  This is a *security issue* if a malformed disk image
causes a crash / hang in the kernel.

(b) RAID inside virt does make sense in the context of virt-p2v,
and is a supported case for RHEL.

(c) I have shown (comment 9) this is not an issue specific to virtio-scsi.

If you don't want to look at the bug, ignore it or remove yourself from
the CC.

Comment 12 Jes Sorensen 2015-03-05 14:46:35 UTC
Well first of all this is a Fedora bug, not a RHEL one!

Second I maintain the RAID stack and I asked you to provide proper data, but
you ask people to run obscure virt tools that I don't have installed anywhere
and I have no use for.

Third, running RAID inside a guest is silly and inefficient; it should be run
at the host level. Sure, people can do silly things, and they will. They can
also run 'cat /dev/random > /dev/mem' in a script running as root.

Fourth, Doug pointed out you were assembling and forcing arrays to run in a
broken way.

Last, I'm the RAID maintainer and I cannot just unsubscribe myself from RAID
bugs, but as the RAID maintainer I ask for you to test this on proper storage
so I have some data to work from! If you do not find that worth your while,
I will have to close this as INSUFFICIENT_DATA!

Jes

Comment 13 Richard W.M. Jones 2015-03-05 14:51:48 UTC
Taking - I'll fix this myself when I have the time to look at it.

Comment 14 Richard W.M. Jones 2015-03-05 22:24:50 UTC
git bisect points to one of these 4 commits.  Other errors meant
I could not narrow it further.

# There are only 'skip'ped commits left to test.                                 
# The first bad commit could be any of:                                          
# afa0f557cb15176570a18fb2a093e348a793afd4                                       
# 5aa61f427e4979be733e4847b9199ff9cc48a47e                                       
# db721d32b74b51a5ac9ec9fab1d85cba90dbdbd3                                       
# 36d091f4759d194c99f0705d412afe208622b45a                                       
# We cannot bisect more!                                                         

https://github.com/torvalds/linux/commit/5aa61f427e4979be733e4847b9199ff9cc48a47e
https://github.com/torvalds/linux/commit/afa0f557cb15176570a18fb2a093e348a793afd4
https://github.com/torvalds/linux/commit/db721d32b74b51a5ac9ec9fab1d85cba90dbdbd3
https://github.com/torvalds/linux/commit/36d091f4759d194c99f0705d412afe208622b45a
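
For context on the "skip" result: with git bisect run, the test script's exit
status decides the verdict for each commit.  Status 0 means good, 125 means
"cannot be tested, skip", and other statuses from 1 to 127 mean bad.  The
following is only a sketch of how a verdict could be derived for this bug, not
the script that was actually used; the helper name bisect_verdict and its two
arguments are hypothetical:

```shell
# bisect_verdict STATUS LOGFILE
#   STATUS   "ok" if the kernel built and booted far enough to run the test,
#            anything else if it did not (hypothetical convention)
#   LOGFILE  path to the captured guest console log
# Returns an exit status in the form git bisect run expects.
bisect_verdict() {
    if [ "$1" != "ok" ]; then
        return 125    # untestable commit: ask git bisect to skip it
    fi
    if grep -q 'NULL pointer dereference' "$2"; then
        return 1      # the oops from this bug was seen: bad
    fi
    return 0          # no oops: good
}
```

When every remaining candidate is skipped, git bisect can only report a range
of possible first bad commits, which is what the output above shows.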

Comment 15 Richard W.M. Jones 2015-03-06 10:21:07 UTC
Upstream bug filed, with potential fix:
https://bugzilla.kernel.org/show_bug.cgi?id=94381#c2

Comment 16 Richard W.M. Jones 2015-03-25 13:23:20 UTC
Fixed upstream (commit 0c35bd4723e4a39ba2da4c13a22cb97986ee10c8).

