Bug 741560

Summary:	Heavy IO to Hitachi XL2000 USB hard disc attached to Asus P7H55 (Intel H55 chipset) causes repeatable kernel panic
Product:	[Fedora] Fedora	Reporter:	Alex Butcher <bugzilla>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	14	CC:	gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-08-16 13:52:24 UTC	Type:	---

Description Alex Butcher 2011-09-27 09:43:39 UTC

Description of problem:

When performing heavy IO (e.g. cp -r of a 50G directory, or fsck) to a Hitachi XL2000 external USB hard disc connected to an Asus P7H55/iH55 motherboard fitted with an i5-760, the kernel panics and the machine becomes completely unresponsive. This happens every time I've tried it.

The crash log is 'Corrupted or bad crash', according to abrtd. I don't have a serial console, or even a monitor that can display the console attached.

Fitting the exact same disc to another machine (built around an older P5Q/iP45 chipset motherboard fitted with a Q6600) running the same kernel works fine. (Well, I had one hang on that machine whilst running badblocks -w over two USB discs simultaneously for 60+ hours, whilst also using it for normal desktop tasks, so I don't think that hang is very meaningful.)

The disc has a single ext3 filesystem in a partition taking up the entire usable space (created using Linux fdisk and mke2fs).

Version-Release number of selected component (if applicable):

kernel-2.6.35.14-96.fc14.x86_64

How reproducible:

Every time I've tried it.

Steps to Reproduce:
1. Attach Hitachi XL2000 disc to (seemingly any) USB port of Asus P7H55 motherboard
2. Allow to auto-mount
3. Run cp -r, or unmount and fsck
4. Wait for kernel panic

Actual results:

Kernel panic, no usable crash log in /var/spool/abrt/.

Expected results:

IO operation completes successfully without crashing the machine.

Additional info:

Machine is a MythTV frontend/backend, so it has a Hauppauge Nova-T PCI DVB-T tuner card and a Nova-T-500 dual USB-over-PCI DVB-T tuner card fitted. It runs the nVidia drivers, but is otherwise rock-solid between mains power failures.

Comment 1 Josh Boyer 2011-09-27 12:45:44 UTC

There really isn't anything we can go on without a backtrace or a vmcore or something to debug from.  If you can capture something like that without the nvidia drivers loaded, please let us know.

Comment 2 Alex Butcher 2011-09-27 12:58:37 UTC

(In reply to comment #1)
> There really isn't anything we can go on without a backtrace or a vmcore or
> something to debug from.  If you can capture something like that without the
> nvidia drivers loaded, please let us know.

Would using the procedure documented at http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes be adequately useful?

Any other alternatives to a serial console if not?
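A note on the kdump route asked about above: kdump can only write a vmcore if the kernel was booted with a crashkernel= memory reservation for the capture kernel. The following is a minimal sketch of a pre-flight check, not part of the wiki procedure itself; the function name has_crashkernel and the example command lines are made up for illustration.

```shell
# Succeeds if a kernel command line reserves memory for a kdump capture kernel.
# $1 (optional): command line to inspect; defaults to the running kernel's
# /proc/cmdline. has_crashkernel is a hypothetical helper name.
has_crashkernel() {
    cmdline=${1-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (illustrative only):
#   has_crashkernel || echo "no crashkernel= reservation; kdump cannot capture"
```

Run without arguments it inspects the running kernel, which is a quick way to rule out a missing reservation before trying to reproduce the panic.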

Comment 3 Alex Butcher 2011-09-27 19:18:40 UTC

(In reply to comment #1)
> There really isn't anything we can go on without a backtrace or a vmcore or
> something to debug from.  If you can capture something like that without the
> nvidia drivers loaded, please let us know.

OK, this is the first time I've used kdump etc, so please be gentle. :-)

I disabled the nvidia kernel module, booted into run level 3, and kicked off an fsck of the ext3 partition on the XL2000. It panic'ed pretty quickly and this was the result:

# crash /var/crash/2011-09-27-20\:04/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

crash 5.0.6-2.fc14
Copyright (C) 2002-2010  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.35.14-96.fc14.x86_64/vmlinux
    DUMPFILE: /var/crash/2011-09-27-20:04/vmcore
        CPUS: 4
        DATE: Tue Sep 27 20:02:02 2011
      UPTIME: 00:04:22
LOAD AVERAGE: 1.80, 0.87, 0.35
       TASKS: 312
    NODENAME: mythtv.xxx.xxx.xxx
     RELEASE: 2.6.35.14-96.fc14.x86_64
     VERSION: #1 SMP Thu Sep 1 11:59:56 UTC 2011
     MACHINE: x86_64  (2809 Mhz)
      MEMORY: 4 GB
       PANIC: "[  262.575493] Oops: 0000 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff81a4a020  (1 of 4)  [THREAD_INFO: ffffffff81a00000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0      TASK: ffffffff81a4a020  CPU: 0   COMMAND: "swapper"
 #0 [ffff88000a203ba8] __pskb_pull_tail at ffffffff813b8e02
 #1 [ffff88000a203bf8] dev_queue_xmit at ffffffff813c2e46
 #2 [ffff88000a203c38] ip_finish_output2 at ffffffff813f557c
 #3 [ffff88000a203c68] ip_finish_output at ffffffff813f5621
 #4 [ffff88000a203c88] ip_output at ffffffff813f5e48
 #5 [ffff88000a203ca8] ip_forward_finish at ffffffff813f35dd
 #6 [ffff88000a203cc8] ip_forward at ffffffff813f38ba
 #7 [ffff88000a203d08] ip_rcv_finish at ffffffff813f2171
 #8 [ffff88000a203d48] NF_HOOK.clone.8 at ffffffff813f2412
 #9 [ffff88000a203d78] ip_rcv at ffffffff813f27a1
#10 [ffff88000a203da8] __netif_receive_skb at ffffffff813bf812
#11 [ffff88000a203e08] process_backlog at ffffffff813c1064
#12 [ffff88000a203e68] net_rx_action at ffffffff813c11e6
#13 [ffff88000a203ec8] __do_softirq at ffffffff81053db9
#14 [ffff88000a203f38] call_softirq at ffffffff8100ab9c
#15 [ffff88000a203f50] do_softirq at ffffffff8100c2f8
#16 [ffff88000a203f70] irq_exit at ffffffff81053f45
#17 [ffff88000a203f80] do_IRQ at ffffffff814715c5
--- <IRQ stack> ---
#18 [ffffffff81a01db8] ret_from_intr at ffffffff8146bad3
    [exception RIP: intel_idle+273]
    RIP: ffffffff81265bfc  RSP: ffffffff81a01e68  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: ffffffff81a01ec8  RCX: 00000000000000bb
    RDX: 00000000000000bb  RSI: 0000000000000000  RDI: 00000000000003e8
    RBP: ffffffff8146bace   R8: 0000000000000000   R9: 00000000000002b3
    R10: 0000003d2d072cee  R11: 0000000000000000  R12: 0000000000000000
    R13: ffffffff81a01df8  R14: ffffffff8146ea81  R15: ffffffff81a01df8
    ORIG_RAX: ffffffffffffff86  CS: 0010  SS: 0018
#19 [ffffffff81a01ed0] cpuidle_idle_call at ffffffff813955b5
#20 [ffffffff81a01ef0] cpu_idle at ffffffff8100830b

# tail -64 /var/crash/2011-09-27-20\:04/dmesg 
<1>[  262.574738] BUG: unable to handle kernel NULL pointer dereference at (null)
<1>[  262.574991] IP: [<ffffffff810dca57>] put_page+0x10/0x7c
<4>[  262.575213] PGD 10fd81067 PUD 10fe18067 PMD 0 
<0>[  262.575493] Oops: 0000 [#1] SMP 
<0>[  262.575736] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
<4>[  262.576067] CPU 0 
<4>[  262.576106] Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp xt_limit ipt_LOG iptable_mangle ipt_MASQUERADE iptable_nat nf_nat ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 jfs uinput dvb_pll cx22702 cx88_dvb cx88_vp3054_i2c videobuf_dvb rc_hauppauge_new mt2060 snd_hda_codec_via ir_lirc_codec cx8800 dvb_usb_dib0700 cx8802 lirc_dev cx88xx snd_hda_intel dib7000p dib0090 dib7000m dib0070 ir_sony_decoder snd_hda_codec dvb_usb ir_jvc_decoder dib8000 ir_rc6_decoder dib9000 ir_rc5_decoder dvb_core ir_nec_decoder dib3000mc rc_core snd_hwdep dibx000_common snd_seq snd_seq_device i2c_algo_bit tveeprom v4l2_common videodev microcode snd_pcm v4l2_compat_ioctl32 snd_timer sundance videobuf_dma_sg snd shpchp btcx_risc videobuf_core soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core r8169 mii asus_atk0110 joydev raid1 usb_storage [last unloaded: scsi_wait_scan]
<4>[  262.581273] 
<4>[  262.581450] Pid: 0, comm: swapper Tainted: G          I 2.6.35.14-96.fc14.x86_64 #1 P7H55/System Product Name
<4>[  262.581785] RIP: 0010:[<ffffffff810dca57>]  [<ffffffff810dca57>] put_page+0x10/0x7c
<4>[  262.582147] RSP: 0018:ffff88000a203b80  EFLAGS: 00010246
<4>[  262.582332] RAX: 0000000000000030 RBX: ffff88012115fd00 RCX: ffff880120859670
<4>[  262.582519] RDX: ffff880120859640 RSI: 1506b29c96c716b9 RDI: 0000000000000000
<4>[  262.582707] RBP: ffff88000a203ba0 R08: ffff880127f2da58 R09: ffff880120859042
<4>[  262.582895] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[  262.583083] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
<4>[  262.583272] FS:  0000000000000000(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
<4>[  262.583601] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[  262.583786] CR2: 0000000000000000 CR3: 000000010fd65000 CR4: 00000000000006f0
<4>[  262.583974] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  262.584162] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[  262.584351] Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a4a020)
<0>[  262.584679] Stack:
<4>[  262.584855]  ffff88012115fd00 0000000000000000 0000000000000000 0000000000000000
<4>[  262.585140] <0> ffff88000a203bf0 ffffffff813b8e02 0000001400000000 0000000000000030
<4>[  262.585631] <0> 9b07280a0002bb01 ffff88012115fd00 ffff88012a538000 ffff8800b7851800
<0>[  262.586289] Call Trace:
<0>[  262.586466]  <IRQ> 
<4>[  262.586679]  [<ffffffff813b8e02>] __pskb_pull_tail+0x1e1/0x293
<4>[  262.586867]  [<ffffffff813c2e46>] dev_queue_xmit+0x70/0x3ce
<4>[  262.587057]  [<ffffffff813f55bc>] ? ip_finish_output+0x0/0x6a
<4>[  262.587245]  [<ffffffff813f557c>] ip_finish_output2+0x1d6/0x216
<4>[  262.587468]  [<ffffffff813f5621>] ip_finish_output+0x65/0x6a
<4>[  262.587654]  [<ffffffff813f5e48>] ip_output+0x91/0x96
<4>[  262.587841]  [<ffffffff813f35dd>] ip_forward_finish+0x49/0x4d
<4>[  262.588028]  [<ffffffff813f38ba>] ip_forward+0x2d9/0x347
<4>[  262.588215]  [<ffffffff813f2171>] ip_rcv_finish+0x324/0x34a
<4>[  262.588402]  [<ffffffff813f1e4d>] ? ip_rcv_finish+0x0/0x34a
<4>[  262.588588]  [<ffffffff813f2412>] NF_HOOK.clone.8+0x51/0x58
<4>[  262.588775]  [<ffffffff813f27a1>] ip_rcv+0x21e/0x24d
<4>[  262.588962]  [<ffffffff813bf812>] __netif_receive_skb+0x3ed/0x412
<4>[  262.589151]  [<ffffffff813c1064>] process_backlog+0x87/0x15d
<4>[  262.589338]  [<ffffffff813c11e6>] net_rx_action+0xac/0x1bb
<4>[  262.589527]  [<ffffffff81053db9>] __do_softirq+0xf0/0x1bf
<4>[  262.589715]  [<ffffffff81023795>] ? apic_write+0x16/0x18
<4>[  262.589902]  [<ffffffff8101054b>] ? native_sched_clock+0x35/0x37
<4>[  262.590090]  [<ffffffff8100ab9c>] call_softirq+0x1c/0x30
<4>[  262.590276]  [<ffffffff8100c2f8>] do_softirq+0x46/0x82
<4>[  262.590462]  [<ffffffff81053f45>] irq_exit+0x49/0x8b
<4>[  262.590647]  [<ffffffff814715c5>] do_IRQ+0x9d/0xb4
<4>[  262.590835]  [<ffffffff8146bad3>] ret_from_intr+0x0/0x11
<0>[  262.591018]  <EOI> 
<4>[  262.591233]  [<ffffffff81265bfc>] ? intel_idle+0x111/0x139
<4>[  262.591419]  [<ffffffff81265bdb>] ? intel_idle+0xf0/0x139
<4>[  262.591607]  [<ffffffff813955b5>] cpuidle_idle_call+0x8b/0xe9
<4>[  262.591795]  [<ffffffff8100830b>] cpu_idle+0xaa/0xcc
<4>[  262.591982]  [<ffffffff81453186>] rest_init+0x8a/0x8c
<4>[  262.592169]  [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
<4>[  262.592357]  [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
<4>[  262.592546]  [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107
<0>[  262.592731] Code: c1 e8 35 48 c1 ea 37 83 e0 03 48 69 c0 00 07 00 00 48 03 04 d5 70 0e b8 81 c9 c3 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 <48> f7 07 00 c0 00 00 48 89 fb 74 07 e8 3f fe ff ff eb 50 e8 c5 
<1>[  262.595307] RIP  [<ffffffff810dca57>] put_page+0x10/0x7c
<4>[  262.595524]  RSP <ffff88000a203b80>
<0>[  262.595703] CR2: 0000000000000000

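The saved dmesg above is long, so it can help to pull out just the faulting instruction pointer and the reliable call-trace entries. This is a minimal sketch, assuming the log keeps the exact line format shown above (a "<level>[ timestamp]" prefix on every line); oops_ip and oops_trace are hypothetical helper names, not part of crash or abrt.

```shell
# Both helpers read a saved oops log on stdin.

# Print the symbol+offset/size from the first "IP: [<addr>] symbol" line.
oops_ip() {
    sed -n 's/^<[0-9]>\[ *[0-9]*\.[0-9]*] IP: \[<[0-9a-f]*>] \(.*\)$/\1/p' | head -n 1
}

# Print the function names from call-trace lines, one per line.
# Entries marked "?" (unreliable stack guesses) are skipped, as are the
# RIP/IP summary lines, which have text between the timestamp and "[<".
oops_trace() {
    sed -n 's/^<[0-9]>\[ *[0-9]*\.[0-9]*]  \[<[0-9a-f]*>] \([A-Za-z_][A-Za-z0-9_.]*\)+0x.*$/\1/p'
}

# Example (illustrative only):
#   oops_ip    < /var/crash/2011-09-27-20:04/dmesg
#   oops_trace < /var/crash/2011-09-27-20:04/dmesg
```

Fed the log above, oops_ip should report put_page+0x10/0x7c, and oops_trace should start at __pskb_pull_tail, which matches frame #0 of the crash bt output.
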
Comment 4 Dave Jones 2011-09-27 20:30:05 UTC

I posted this to the upstream networking developers, hoping that it would ring a bell and that they could point me to a commit that might fix this.

The bad news is that this isn't an obvious bug. That trace indicates pretty serious corruption of the fragment list. This should obviously never happen.

At this stage in f14's lifecycle, we're not going to rebase to a newer kernel release, though it would be very useful to know if this is still a bug in a more modern kernel, so that it gets fixed for the newer releases. (For f15 onwards, we're going back to our old methodology of frequent rebases to prevent situations like this).

If you can't upgrade to f15 yet, perhaps try building your own local 3.1rc kernel with the same config options as the fedora kernel, to see if it still occurs there.

Comment 5 Dave Jones 2011-09-27 20:30:53 UTC

Also, just for kicks, can you see if this still reproduces if you disable all the iptables rules? That might narrow it down a little.

Comment 6 Alex Butcher 2011-10-01 15:01:54 UTC

(In reply to comment #5)
> also, just for kicks, can you see if this still reproduces if you disable all
> the iptables rules ? That might narrow it down a little.

FYI, whilst running 2.6.35.14-96.fc14.x86_64, I took down the host's external interface using ifdown ethX, then ran service iptables stop to disable all the iptables rules. I then took the machine into runlevel 3 so X wasn't running (the nvidia kernel module was still loaded, though). I kicked off an fsck. It seemed to be doing OK, even with a vmstat 1 running over ssh and whilst serving NFS.

Eventually, it panic'ed, though. It may be coincidence, but the panic happened within 50s of starting to ping a host connected to one interface of this machine from another host connected to a different interface (i.e. it was actively forwarding packets).
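The isolation test described in this comment can be written down as a short checklist. The sketch below only prints the commands rather than running them. The interface and device names are placeholders (the comment elides the interface as ethX), telinit 3 is an assumed way of reaching runlevel 3, the umount step is taken from the original reproduction steps, and isolation_steps is a hypothetical name.

```shell
# Print (do not run) the isolation steps, one command per line.
# $1: external network interface (placeholder, e.g. the elided "ethX")
# $2: block device holding the ext3 filesystem (placeholder)
isolation_steps() {
    iface=$1
    dev=$2
    printf '%s\n' \
        "ifdown $iface" \
        "service iptables stop" \
        "telinit 3" \
        "umount $dev" \
        "fsck $dev"
}

# Example (illustrative only): review the list before running anything as root.
#   isolation_steps ethX /dev/sdX1
```

Printing first keeps the sketch harmless; the ping between hosts on two different interfaces still has to be started by hand from another machine.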

I'll have a try at getting a 3.x kernel built and installed.

Comment 7 Alex Butcher 2011-10-02 09:59:41 UTC

(In reply to comment #4)
> If you can't upgrade to f15 yet, perhaps try building your own local 3.1rc
> kernel with the same config options as the fedora kernel, to see if it still
> occurs there.

OK, I built the 3.1.0-0.rc7.git0.0.fc16.src.rpm with configs stolen from the 2.6.35.14-96.fc14.src.rpm. I booted into it with kdump enabled and the nvidia module unbuilt and therefore not loaded. I left my iptables rules in place, and set up a ping as before. The fsck completed normally.

I had to install the kernel with --nodeps as it also claimed to need updated versions of mdadm and module-init-tools, which I didn't really feel like updating. I'm not sure what side-effects that might have for production use.

I don't suppose it'd be possible for someone to use my backtrace to identify the necessary changes between 2.6.35.14-96 and 3.1.0-0.rc7.git0.0 to produce a backported fix for 2.6.35.14 or so?

Comment 8 Dave Jones 2011-10-03 16:20:33 UTC

Thanks for testing. Well, it's good to know it got fixed in newer versions at least.
I'll bring this up again with the upstream maintainers, but it's unlikely that we'll find the right fix to backport: given the high volume of changes in net/ between those two versions, I'm not optimistic.

This is probably going to end up being a CLOSED->NEXTRELEASE bug.
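To get a feel for the volume of changes in net/ mentioned above, one could count the commits touching that directory between two tags in a kernel git checkout. This is a rough sketch under assumptions: count_commits is a hypothetical helper, and the upstream tags in the example (v2.6.35 and v3.1-rc7) only approximate the Fedora builds, which carry stable updates and extra patches.

```shell
# Count commits in the range $2..$3 of the repository at $1 that touch
# the path $4 (defaults to the whole tree).
count_commits() {
    git -C "$1" rev-list --count "$2..$3" -- "${4:-.}"
}

# Example (illustrative only; needs a mainline kernel checkout at ./linux):
#   count_commits ./linux v2.6.35 v3.1-rc7 net/
```

The count says nothing about which commit fixed the corruption; it only shows how large the search space for a backport would be.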

Comment 9 Fedora End Of Life 2012-08-16 13:52:27 UTC

This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping