Bug 962079
Summary: | Kernel panics when creating RAID 5 array using mdadm --create | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Richard W.M. Jones <rjones>
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint>
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | rawhide | CC: | fedora-kernel-raid, gansalmon, itamar, jonathan, kchamart, kernel-maint, madhu.chinakonda, pbonzini, tom
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 978834 | Environment: |
Last Closed: | 2013-06-18 14:33:03 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 978834 | |
Attachments: | | |

Description
Richard W.M. Jones
2013-05-11 14:37:02 UTC
CC-ing Paolo since the stack trace indicates that virtio-scsi might be involved in this.

Could you please try this against real physical disks and see if it does the same thing?

I don't want to disappoint, so basically no. I don't have a physical machine that runs Rawhide. Any test by necessity would be running in a VM, hence using virtio at some point.

No, this is not virtio-scsi (doesn't seem like, at least), and yes, you should test using another driver. megasas can work or, with QEMU 1.5, pvscsi too. megasas is supported by libvirt (it calls it lsisas1078).

(In reply to comment #4)
> No, this is not virtio-scsi (doesn't seem like, at least), and yes, you
> should test using another driver. megasas can work or, with QEMU 1.5,
> pvscsi too. megasas is supported by libvirt (it calls it lsisas1078).

I'll see if I can work out the libvirt voodoo for this tomorrow.

Created attachment 747775
Complete messages (using lsisas1078 driver)
I tried lsisas1078 as suggested by Paolo Bonzini, and to the
untrained eye it appears to be failing in exactly the same way.
I attached the complete libguestfs debug + kernel messages in case
they are helpful.

By the way the actual error is:

    /*
     * Filesystem requests must transfer data.
     */
    BUG_ON(!req->nr_phys_segments);

This is on the way out to QEMU, so it seems likely to be a kernel problem rather than a virt problem.

Can you try with an external USB disk and USB passthrough? That would be the smoking gun that it is a kernel problem. :)

This seems to have "fixed itself" with the new kernel (3.10.0-0.rc2.git0.3.fc20). I'm just running a build to confirm this.

Dammit unfortunately it's not fixed. It has, however, changed so that instead of just hanging the kernel, it now panics allowing us to catch the error and report it. The stack trace is the same as before (http://kojipkgs.fedoraproject.org//work/tasks/8778/5408778/build.log)

Created attachment 752953
BUG of existing RAID5
Attached is a very similar BUG from an existing level 5 array with physical disks. The bug happened when trying to read a file from it.
This is from vanilla 3.10-rc2 on an Arch Linux host, for whatever it's worth. It could underline the "it's a kernel bug" assumption.
The box also has a level 1 array - no issues with that one.

Tom,

Your bug report does not seem to resemble the previous bug report at all, there are traces of plugging in your output and also drive errors coming from the ATA driver:

2013-05-24T17:59:42+02:00 sirius kernel: ata7: lost interrupt (Status 0x58)
2013-05-24T17:59:42+02:00 sirius kernel: ata7: drained 65536 bytes to clear DRQ
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
2013-05-24T17:59:42+02:00 sirius kernel: sr 6:0:0:0: CDB:
2013-05-24T17:59:42+02:00 sirius kernel: cdb[0]=0x0: 00 00 00 00 00 00
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
2013-05-24T17:59:42+02:00 sirius kernel: res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: status: { DRDY }
2013-05-24T17:59:42+02:00 sirius kernel: ata7: soft resetting link
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: configured for UDMA/33
2013-05-24T17:59:42+02:00 sirius kernel: ata7: EH complete

What driver are you running?

Jes

Jes,

the "very similar" part is just the first trace (cut here.. end trace), indeed. The ATA errors could be caused by me unplugging a USB connected mobile while the box was slowly locking up, so please ignore those. IIRC they didn't happen when I tried vanilla -rc1.

Anyways, the driver is PATA_ATIIXP. The mdraid error was almost the same on -rc1, beside the fact that -rc1 went to death immediately, while with -rc2 it took 2-3 minutes until the system got completely unresponsive. The disks are good, and back on 2.9.1 mdraid is working fine.

If this is fully unrelated, I'm sorry. This bug is what I found crawling google and lkml about the bug headline "drivers/scsi/scsi_lib.c:1196"

Tom,

The ATA errors almost certainly came from the harddrives, not the USB mobile. A USB disk would not normally show up with the name 'ata<X>'. A USB disk will normally show up as 'sd<Y>'.

It is certainly possible that the problem is due to interrupts not getting through to the ATA driver, or that the newer kernel introduced a bug in said driver.

Jes

This thread appeared a couple of weeks ago on LKML. There is a suggested patch in the third comment.
http://comments.gmane.org/gmane.linux.kernel/1492771

Here is a similar-but-a-bit-different patch claiming to fix the same bug:
http://permalink.gmane.org/gmane.linux.raid/42949

Neither of those is upstream in Linus's tree.

I tested http://thread.gmane.org/gmane.linux.kernel/1492771/focus=42953 on the basis that it was the simpler of the two patches. Here is my scratch-build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5437990

This fixes the problem. My test which previously failed now works. I asked on LKML for this patch to be included.

Fix is now in Linus' tree:

commit 4997b72ee62930cb841d185398ea547d979789f4
Author: Kent Overstreet <koverstreet>
Date: Thu May 30 08:44:39 2013 +0200

    raid5: Initialize bi_vcnt

    The patch that converted raid5 to use bio_reset() forgot to
    initialize bi_vcnt.

    Signed-off-by: Kent Overstreet <koverstreet>
    Cc: NeilBrown <neilb>
    Cc: linux-raid.org
    Tested-by: Ilia Mirkin <imirkin.edu>
    Signed-off-by: Jens Axboe <axboe>

Should show up in the next resync I presume.

fixed in rc6 builds and newer.
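
For readers who want to see how the commit message relates to the BUG_ON quoted earlier, here is a minimal userspace sketch of that class of bug. It is a toy model, not kernel code: `struct toy_bio`, `toy_bio_reset()`, `toy_prepare_buggy()`, `toy_prepare_fixed()` and `toy_request_is_valid()` are hypothetical names invented for illustration, and the segment count of 1 set in the fixed variant is an assumption. The idea is only that a reset helper zeroes the segment count, and a caller that forgets to set it again hands down a request that "transfers no data".

```c
#include <string.h>

/* Toy stand-in for a block I/O descriptor. NOT the kernel's struct bio. */
struct toy_bio {
    unsigned short vcnt; /* number of data segments (models bi_vcnt) */
    unsigned int size;   /* bytes the request is supposed to transfer */
};

/* Models a reset helper: wipes the descriptor so it can be reused. */
static void toy_bio_reset(struct toy_bio *b)
{
    memset(b, 0, sizeof(*b));
}

/* Buggy reuse: sets the size again but forgets the segment count. */
static void toy_prepare_buggy(struct toy_bio *b, unsigned int size)
{
    toy_bio_reset(b);
    b->size = size;
}

/* Fixed reuse: also re-initializes the segment count after the reset. */
static void toy_prepare_fixed(struct toy_bio *b, unsigned int size)
{
    toy_bio_reset(b);
    b->size = size;
    b->vcnt = 1; /* assumed value, for illustration only */
}

/*
 * Models the sanity check from the trace: a request with zero segments
 * is invalid. Returns 1 if the request is fine, 0 where the real kernel
 * would hit the BUG_ON.
 */
static int toy_request_is_valid(const struct toy_bio *b)
{
    return b->vcnt != 0;
}
```

Calling `toy_prepare_buggy()` and then `toy_request_is_valid()` on the same descriptor returns 0, while the `toy_prepare_fixed()` path returns 1. That mirrors why a one-line initialization was enough to make the failing test pass.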