Bug 962079

Summary:

Kernel panics when creating RAID 5 array using mdadm --create

Product:

[Fedora] Fedora

Reporter:

Richard W.M. Jones <rjones>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

rawhide

CC:

fedora-kernel-raid, gansalmon, itamar, jonathan, kchamart, kernel-maint, madhu.chinakonda, pbonzini, tom

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

978834 (view as bug list)

Environment:

Last Closed:

2013-06-18 14:33:03 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

978834

Attachments:

Description	Flags
Complete messages (using lsisas1078 driver)	none
BUG of existing RAID5	none

Description Richard W.M. Jones 2013-05-11 14:37:02 UTC

Description of problem:

The kernel reliably panics when creating a RAID 5 array
using the command:

mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing

Version-Release number of selected component (if applicable):

kernel 0:3.10.0-0.rc0.git24.1.fc20
mdadm-3.2.6-19.fc20.x86_64

(Version of other components can be found here:
http://koji.fedoraproject.org/koji/getfile?taskID=5360743&name=root.log )

How reproducible:

100%

Steps to Reproduce:
1. Run the test tests/md/test-mdadm.sh in the libguestfs test suite.

Additional information:

mdadm --create --run r5t1 --level 5 --raid-devices 4 --spare-devices 1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 missing
mdadm: Defaulting to version 1.2 metadata
[  185.041140] md: bind<sda2>
[  185.062137] md: bind<sdb2>
[  185.086133] md: bind<sdc2>
[  185.095132] md: bind<sdd2>
[  185.947002] async_tx: api initialized (async)
[  186.013992] xor: measuring software checksum speed
[  186.024991]    prefetch64-sse:   960.000 MB/sec
[  186.035989]    generic_sse:   964.000 MB/sec
[  186.035989] xor: using function: generic_sse (964.000 MB/sec)
[  186.207963] raid6: sse2x1     195 MB/s
[  186.225960] raid6: sse2x2     417 MB/s
[  186.243957] raid6: sse2x4     277 MB/s
[  186.243957] raid6: using algorithm sse2x2 (417 MB/s)
[  186.245957] raid6: using intx1 recovery algorithm
[  186.610902] md: raid6 personality registered for level 6
[  186.611901] md: raid5 personality registered for level 5
[  186.612901] md: raid4 personality registered for level 4
[  186.670892] md/raid:md125: device sdc2 operational as raid disk 2
[  186.671892] md/raid:md125: device sdb2 operational as raid disk 1
[  186.673892] md/raid:md125: device sda2 operational as raid disk 0
[  186.791874] md/raid:md125: allocated 4356kB
[  186.797873] md/raid:md125: raid level 5 active with 3 out of 4 devices, algorithm 2
[  186.810871] md125: detected capacity change from 0 to 4718592
[  186.826869] md: recovery of RAID array md125
[  186.827869] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  186.828868] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[  186.829868] md: using 128k window, over a total of 1536k.
mdadm: array /dev/md/r5t1 started.
[  186.878861] ------------[ cut here ]------------
[  186.878861] kernel BUG at drivers/scsi/scsi_lib.c:1197!
[  186.878861] invalid opcode: 0000 [#1] SMP 
[  186.878861] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 kvm_amd kvm snd_pcsp snd_pcm snd_page_alloc snd_timer snd soundcore virtio_net virtio_scsi virtio_blk virtio_rng virtio_balloon virtio_mmio sparse_keymap rfkill sym53c8xx scsi_transport_spi crc8 crc_ccitt crc32 crc_itu_t libcrc32c
[  186.878861] CPU: 0 PID: 50 Comm: kworker/0:1H Tainted: G        W    3.10.0-0.rc0.git24.1.fc20.x86_64 #1
[  186.878861] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  186.878861] Workqueue: kblockd blk_delay_work
[  186.878861] task: ffff88001a10cdc0 ti: ffff88001a178000 task.ti: ffff88001a178000
[  186.878861] RIP: 0010:[<ffffffff8149d192>]  [<ffffffff8149d192>] scsi_setup_fs_cmnd+0x92/0xa0
[  186.878861] RSP: 0018:ffff88001a179c18  EFLAGS: 00000046
[  186.878861] RAX: 0000000000000000 RBX: ffff88001a020000 RCX: 000000000000002f
[  186.878861] RDX: 0000000000000000 RSI: ffff88001a040000 RDI: ffff88001a020000
[  186.878861] RBP: ffff88001a179c28 R08: 0000000000019000 R09: 0000000000000000
[  186.878861] R10: 0000000000002018 R11: 00000000000000c8 R12: ffff88001a040000
[  186.878861] R13: ffff88001a179fd8 R14: ffff88001a020000 R15: ffff88001a086f60
[  186.878861] FS:  0000000000000000(0000) GS:ffff88001f000000(0000) knlGS:0000000000000000
[  186.878861] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  186.878861] CR2: 00007fff2ebf6788 CR3: 0000000019daf000 CR4: 00000000000006f0
[  186.878861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  186.878861] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[  186.878861] Stack:
[  186.878861]  ffff88001a040000 ffff88001a028000 ffff88001a179c90 ffffffff814ab2b0
[  186.878861]  0000000000000001 ffffffff8258a970 0000000000000046 ffffffff000000c8
[  186.878861]  0000000000002018 0000000000000000 ffff88001a028000 ffff88001a179fd8
[  186.878861] Call Trace:
[  186.878861]  [<ffffffff814ab2b0>] sd_prep_fn+0x340/0xf90
[  186.878861]  [<ffffffff8132ee3d>] blk_peek_request+0x13d/0x300
[  186.878861]  [<ffffffff8149d492>] scsi_request_fn+0x52/0x540
[  186.878861]  [<ffffffff8132af33>] __blk_run_queue+0x33/0x40
[  186.878861]  [<ffffffff8132b125>] blk_delay_work+0x25/0x40
[  186.878861]  [<ffffffff8108d251>] process_one_work+0x211/0x6c0
[  186.878861]  [<ffffffff8108d1e5>] ? process_one_work+0x1a5/0x6c0
[  186.878861]  [<ffffffff8108d81d>] worker_thread+0x11d/0x3a0
[  186.878861]  [<ffffffff8108d700>] ? process_one_work+0x6c0/0x6c0
[  186.878861]  [<ffffffff81095cbd>] kthread+0xed/0x100
[  186.878861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.878861]  [<ffffffff8172af2c>] ret_from_fork+0x7c/0xb0
[  186.878861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.878861] Code: c3 48 8b 00 48 85 c0 74 b7 48 8b 40 48 48 85 c0 74 ae 4c 89 e6 48 89 df ff d0 85 c0 74 a2 eb dc b8 02 00 00 00 0f 1f 40 00 eb d1 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 
[  186.878861] RIP  [<ffffffff8149d192>] scsi_setup_fs_cmnd+0x92/0xa0
[  186.878861]  RSP <ffff88001a179c18>
[  186.878861] ---[ end trace 6ed7a728fbe3e4df ]---
[  186.878861] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
[  186.878861] in_atomic(): 1, irqs_disabled(): 1, pid: 50, name: kworker/0:1H
[  186.878861] INFO: lockdep is turned off.
[  186.878861] irq event stamp: 68212
[  186.878861] hardirqs last  enabled at (68211): [<ffffffff817213dc>] _raw_spin_unlock_irq+0x2c/0x50
[  186.878861] hardirqs last disabled at (68212): [<ffffffff8172123f>] _raw_spin_lock_irq+0x1f/0x90
[  186.878861] softirqs last  enabled at (68198): [<ffffffff81071ae3>] __do_softirq+0x193/0x400
[  186.878861] softirqs last disabled at (68155): [<ffffffff81071f25>] irq_exit+0xc5/0xd0
[  186.878861] CPU: 0 PID: 50 Comm: kworker/0:1H Tainted: G      D W    3.10.0-0.rc0.git24.1.fc20.x86_64 #1
[  186.878861] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  186.878861] Workqueue: kblockd blk_delay_work
[  186.878861]  ffffffff81a1c635 ffff88001a179918 ffffffff81719788 ffff88001a179940
[  186.878861]  ffffffff810a3309 ffff88001a06d268 ffff88001a06d2c8 ffff88001a10cdc0
[  186.878861]  ffff88001a179968 ffffffff8171e4ba ffff88001a179978 ffff88001a10cdc0
[  186.878861] Call Trace:
[  186.878861]  [<ffffffff81719788>] dump_stack+0x19/0x1b
[  186.878861]  [<ffffffff810a3309>] __might_sleep+0x179/0x230
[  186.878861]  [<ffffffff8171e4ba>] down_read+0x2a/0xa0
[  186.878861]  [<ffffffff81081694>] exit_signals+0x24/0x130
[  186.878861]  [<ffffffff8106ddbc>] do_exit+0xbc/0xcc0
[  186.878861]  [<ffffffff8106b0b8>] ? kmsg_dump+0x1b8/0x240
[  186.878861]  [<ffffffff8106af25>] ? kmsg_dump+0x25/0x240
[  186.878861]  [<ffffffff81722ff2>] oops_end+0xa2/0xe0
[  186.878861]  [<ffffffff8101bb4b>] die+0x4b/0x70
[  186.878861]  [<ffffffff81722880>] do_trap+0x60/0x170
[  186.878861]  [<ffffffff81018ff5>] do_invalid_op+0x95/0xb0
[  186.878861]  [<ffffffff8149d192>] ? scsi_setup_fs_cmnd+0x92/0xa0
[  186.878861]  [<ffffffff81326b2e>] ? elv_dispatch_sort+0xce/0xe0
[  186.878861]  [<ffffffff8134bf68>] ? cfq_dispatch_insert+0x158/0x280
[  186.878861]  [<ffffffff813615cd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  186.878861]  [<ffffffff817220e3>] ? restore_args+0x30/0x30
[  186.878861]  [<ffffffff8172c22e>] invalid_op+0x1e/0x30
[  186.878861]  [<ffffffff8149d192>] ? scsi_setup_fs_cmnd+0x92/0xa0
[  186.878861]  [<ffffffff8149d117>] ? scsi_setup_fs_cmnd+0x17/0xa0
[  186.878861]  [<ffffffff814ab2b0>] sd_prep_fn+0x340/0xf90
[  186.878861]  [<ffffffff8132ee3d>] blk_peek_request+0x13d/0x300
[  186.878861]  [<ffffffff8149d492>] scsi_request_fn+0x52/0x540
[  186.878861]  [<ffffffff8132af33>] __blk_run_queue+0x33/0x40
[  186.878861]  [<ffffffff8132b125>] blk_delay_work+0x25/0x40
[  186.878861]  [<ffffffff8108d251>] process_one_work+0x211/0x6c0
[  186.878861]  [<ffffffff8108d1e5>] ? process_one_work+0x1a5/0x6c0
[  186.878861]  [<ffffffff8108d81d>] worker_thread+0x11d/0x3a0
[  186.878861]  [<ffffffff8108d700>] ? process_one_work+0x6c0/0x6c0
[  186.878861]  [<ffffffff81095cbd>] kthread+0xed/0x100
[  186.878861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.878861]  [<ffffffff8172af2c>] ret_from_fork+0x7c/0xb0
[  186.878861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.878861] note: kworker/0:1H[50] exited with preempt_count 1
[  186.879861] BUG: unable to handle kernel paging request at ffffffffffffff98
[  186.879861] IP: [<ffffffff81096630>] kthread_data+0x10/0x20
[  186.879861] PGD 1c0f067 PUD 1c11067 PMD 0 
[  186.879861] Oops: 0000 [#2] SMP 
[  186.879861] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 kvm_amd kvm snd_pcsp snd_pcm snd_page_alloc snd_timer snd soundcore virtio_net virtio_scsi virtio_blk virtio_rng virtio_balloon virtio_mmio sparse_keymap rfkill sym53c8xx scsi_transport_spi crc8 crc_ccitt crc32 crc_itu_t libcrc32c
[  186.879861] CPU: 0 PID: 50 Comm: kworker/0:1H Tainted: G      D W    3.10.0-0.rc0.git24.1.fc20.x86_64 #1
[  186.879861] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  186.879861] task: ffff88001a10cdc0 ti: ffff88001a178000 task.ti: ffff88001a178000
[  186.879861] RIP: 0010:[<ffffffff81096630>]  [<ffffffff81096630>] kthread_data+0x10/0x20
[  186.879861] RSP: 0018:ffff88001a179900  EFLAGS: 00000002
[  186.879861] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000000f
[  186.879861] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff88001a10cdc0
[  186.879861] RBP: ffff88001a179900 R08: ffff88001a10ce30 R09: 0000002b82e86108
[  186.879861] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff88001f1d50c0
[  186.879861] R13: 0000000000000000 R14: ffff88001a1794f8 R15: 0000000000000000
[  186.879861] FS:  0000000000000000(0000) GS:ffff88001f000000(0000) knlGS:0000000000000000
[  186.879861] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  186.879861] CR2: 0000000000000028 CR3: 0000000019daf000 CR4: 00000000000006f0
[  186.879861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  186.879861] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[  186.879861] Stack:
[  186.879861]  ffff88001a179918 ffffffff8108e751 ffff88001a10cdc0 ffff88001a179978
[  186.879861]  ffffffff8171f0ba 00000000001d50c0 ffff88001a179fd8 ffff88001a179fd8
[  186.879861]  00000000001d50c0 ffff88001a10cdc0 ffff88001a10d5a8 ffff88001a1794f8
[  186.879861] Call Trace:
[  186.879861]  [<ffffffff8108e751>] wq_worker_sleeping+0x11/0x90
[  186.879861]  [<ffffffff8171f0ba>] __schedule+0x76a/0xa20
[  186.879861]  [<ffffffff8171f399>] schedule+0x29/0x70
[  186.879861]  [<ffffffff8106e498>] do_exit+0x798/0xcc0
[  186.879861]  [<ffffffff81722ff2>] oops_end+0xa2/0xe0
[  186.879861]  [<ffffffff8101bb4b>] die+0x4b/0x70
[  186.879861]  [<ffffffff81722880>] do_trap+0x60/0x170
[  186.879861]  [<ffffffff81018ff5>] do_invalid_op+0x95/0xb0
[  186.879861]  [<ffffffff8149d192>] ? scsi_setup_fs_cmnd+0x92/0xa0
[  186.879861]  [<ffffffff81326b2e>] ? elv_dispatch_sort+0xce/0xe0
[  186.879861]  [<ffffffff8134bf68>] ? cfq_dispatch_insert+0x158/0x280
[  186.879861]  [<ffffffff813615cd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  186.879861]  [<ffffffff817220e3>] ? restore_args+0x30/0x30
[  186.879861]  [<ffffffff8172c22e>] invalid_op+0x1e/0x30
[  186.879861]  [<ffffffff8149d192>] ? scsi_setup_fs_cmnd+0x92/0xa0
[  186.879861]  [<ffffffff8149d117>] ? scsi_setup_fs_cmnd+0x17/0xa0
[  186.879861]  [<ffffffff814ab2b0>] sd_prep_fn+0x340/0xf90
[  186.879861]  [<ffffffff8132ee3d>] blk_peek_request+0x13d/0x300
[  186.879861]  [<ffffffff8149d492>] scsi_request_fn+0x52/0x540
[  186.879861]  [<ffffffff8132af33>] __blk_run_queue+0x33/0x40
[  186.879861]  [<ffffffff8132b125>] blk_delay_work+0x25/0x40
[  186.879861]  [<ffffffff8108d251>] process_one_work+0x211/0x6c0
[  186.879861]  [<ffffffff8108d1e5>] ? process_one_work+0x1a5/0x6c0
[  186.879861]  [<ffffffff8108d81d>] worker_thread+0x11d/0x3a0
[  186.879861]  [<ffffffff8108d700>] ? process_one_work+0x6c0/0x6c0
[  186.879861]  [<ffffffff81095cbd>] kthread+0xed/0x100
[  186.879861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.879861]  [<ffffffff8172af2c>] ret_from_fork+0x7c/0xb0
[  186.879861]  [<ffffffff81095bd0>] ? insert_kthread_work+0x80/0x80
[  186.879861] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 70 05 00 00 55 48 89 e5 <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
[  186.879861] RIP  [<ffffffff81096630>] kthread_data+0x10/0x20
[  186.879861]  RSP <ffff88001a179900>
[  186.879861] CR2: ffffffffffffff98
[  186.879861] ---[ end trace 6ed7a728fbe3e4e0 ]---
[  186.879861] Fixing recursive fault but reboot is needed!

Comment 1 Richard W.M. Jones 2013-05-11 14:39:10 UTC

CC-ing Paolo since the stack trace indicates that
virtio-scsi might be involved in this.

Comment 2 Jes Sorensen 2013-05-13 14:45:43 UTC

Could you please try this against real physical disks and see if it does the
same thing?

Comment 3 Richard W.M. Jones 2013-05-13 14:56:34 UTC

I don't want to disappoint, so basically no.  I don't have
a physical machine that runs Rawhide.  Any test by necessity
would be running in a VM, hence using virtio at some point.

Comment 4 Paolo Bonzini 2013-05-13 21:23:24 UTC

No, this is not virtio-scsi (doesn't seem like, at least), and yes, you should test using another driver.  megasas can work or, with QEMU 1.5, pvscsi too.  megasas is supported by libvirt (it calls it lsisas1078).

Comment 5 Richard W.M. Jones 2013-05-13 21:37:23 UTC

(In reply to comment #4)
> No, this is not virtio-scsi (doesn't seem like, at least), and yes, you
> should test using another driver.  megasas can work or, with QEMU 1.5,
> pvscsi too.  megasas is supported by libvirt (it calls it lsisas1078).

I'll see if I can work out the libvirt voodoo for this tomorrow.

Comment 6 Richard W.M. Jones 2013-05-14 18:26:39 UTC

Created attachment 747775 [details]
Complete messages (using lsisas1078 driver)

I tried lsisas1078 as suggested by Paolo Bonzini, and to the
untrained eye it appears to be failing in exactly the same way.
I attached the complete libguestfs debug + kernel messages in case
they are helpful.

Comment 7 Richard W.M. Jones 2013-05-14 18:40:00 UTC

By the way the actual error is:

        /*
         * Filesystem requests must transfer data.
         */
        BUG_ON(!req->nr_phys_segments);

Comment 8 Paolo Bonzini 2013-05-16 14:44:19 UTC

This is on the way out to QEMU, so it seems likely to be a kernel problem rather than a virt problem.  Can you try with an external USB disk and USB passthrough?  That would be the smoking gun that it is a kernel problem. :)

Comment 9 Richard W.M. Jones 2013-05-22 12:20:06 UTC

This seems to have "fixed itself" with the new kernel
(3.10.0-0.rc2.git0.3.fc20).  I'm just running a build to
confirm this.

Comment 10 Richard W.M. Jones 2013-05-22 13:05:43 UTC

Dammit unfortunately it's not fixed.  It has, however, changed
so that instead of just hanging the kernel, it now panics allowing
us to catch the error and report it.

The stack trace is the same as before
(http://kojipkgs.fedoraproject.org//work/tasks/8778/5408778/build.log)

Comment 11 Tom Kuther 2013-05-25 08:29:02 UTC

Created attachment 752953 [details]
BUG of existing RAID5

Attached is a very similar BUG from an existing level 5 array with physical disks. The bug happened when trying to read a file from it.

This is from vanilla 3.10-rc2 on an ArchLinux host, for whatever it's worth.. could underline the "it's a kernel bug" assumption.

The box also has a level 1 array - no issues with that one.

Comment 12 Jes Sorensen 2013-05-27 13:52:56 UTC

Tom,

Your bug report does not seem to resemble the previous bug report at all,
there are traces of plugging in your output and also drive errors coming
from the ATA driver:

2013-05-24T17:59:42+02:00 sirius kernel: ata7: lost interrupt (Status 0x58)
2013-05-24T17:59:42+02:00 sirius kernel: ata7: drained 65536 bytes to clear DRQ
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
2013-05-24T17:59:42+02:00 sirius kernel: sr 6:0:0:0: CDB: 
2013-05-24T17:59:42+02:00 sirius kernel: cdb[0]=0x0: 00 00 00 00 00 00
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
2013-05-24T17:59:42+02:00 sirius kernel: res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: status: { DRDY }
2013-05-24T17:59:42+02:00 sirius kernel: ata7: soft resetting link
2013-05-24T17:59:42+02:00 sirius kernel: ata7.00: configured for UDMA/33
2013-05-24T17:59:42+02:00 sirius kernel: ata7: EH complete

What driver are you running?

Jes

Comment 13 Tom Kuther 2013-05-27 22:13:37 UTC

Jes, the "very similar" part is just the first trace (cut here.. end trace), indeed. 
The ATA errors could be caused by me unplugging a USB connected mobile while the box was slowly locking up, so please ignore those. IIRC they didn't happen when I tried vanilla -rc1. Anyways, the driver is PATA_ATIIXP.

The mdraid error was almost the same on -rc1, beside the fact that -rc1 went to death immediately, while with -rc2 it took 2-3 minutes until the system got completely unresponsive. The disks are good, and back on 2.9.1 mdraid is working fine.

If this is fully unrelated, I'm sorry. This bug is what I found crawling google and lkml about the bug headline "drivers/scsi/scsi_lib.c:1196"

Comment 14 Jes Sorensen 2013-05-28 06:51:42 UTC

Tom,

The ATA errors almost certainly came from the harddrives, not the USB mobile.
A USB disk would not normally show up with the name 'ata<X>'. A USB disk will
normally show up as 'sd<Y>'.

It is certainly possible that the problem is due to interrupts not getting
through to the ATA driver, or that the newer kernel introduced a bug in said
driver.

Jes

Comment 15 Richard W.M. Jones 2013-05-28 07:30:56 UTC

This thread appeared a couple of weeks ago on LKML.  There is
a suggested patch in the third comment.

http://comments.gmane.org/gmane.linux.kernel/1492771

Comment 16 Richard W.M. Jones 2013-05-28 07:33:20 UTC

Here is a similar-but-a-bit-different patch claiming to fix
the same bug:

http://permalink.gmane.org/gmane.linux.raid/42949

Comment 17 Richard W.M. Jones 2013-05-28 07:49:33 UTC

Neither of those is upstream in Linus's tree.

Comment 18 Richard W.M. Jones 2013-05-29 20:28:07 UTC

I tested http://thread.gmane.org/gmane.linux.kernel/1492771/focus=42953
on the basis that it was the simpler of the two patches.
Here is my scratch-build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5437990

This fixes the problem.  My test which previously failed
now works.

I asked on LKML for this patch to be included.

Comment 19 Jes Sorensen 2013-06-18 14:21:11 UTC

Fix is now in Linus' tree:

commit 4997b72ee62930cb841d185398ea547d979789f4
Author: Kent Overstreet <koverstreet>
Date:   Thu May 30 08:44:39 2013 +0200

    raid5: Initialize bi_vcnt
    
    The patch that converted raid5 to use bio_reset() forgot to initialize
    bi_vcnt.
    
    Signed-off-by: Kent Overstreet <koverstreet>
    Cc: NeilBrown <neilb>
    Cc: linux-raid.org
    Tested-by: Ilia Mirkin <imirkin.edu>
    Signed-off-by: Jens Axboe <axboe>

Should show up in the next resync I presume.

Comment 20 Dave Jones 2013-06-18 14:33:03 UTC

fixed in rc6 builds and newer.