Bug 176208 - Oops on bootup, 2.6.14-1.1644_FC4smp ksoftirqd scsi_mod e1000 root over NFSv3
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Dave Jones
QA Contact: Brian Brock
Reported: 2005-12-20 03:46 EST by Carl-Johan Kjellander
Modified: 2015-01-04 17:23 EST
Doc Type: Bug Fix
Last Closed: 2006-02-26 23:48:09 EST
Description Carl-Johan Kjellander 2005-12-20 03:46:02 EST
Description of problem:
We get a repeatable Oops when we boot several machines diskless.
They are all running on an ASUS P5ND2-SLI or P5ND2-SLI Deluxe motherboard.

Version-Release number of selected component (if applicable):
2.6.14-1.1644_FC4smp kernel-smp-2.6.14-1.1644_FC4

How reproducible:
It happens roughly once in every 10-20 boots, before the system even gets as far as pivot_root.

Here are 2 oopses caught over serial console.

Here is Oops number 1:

*********************

Unable to handle kernel NULL pointer dereferenceACPI: PCI Interrupt Link [APSJ]
enabled at IRQ 22
  printing eip:
*pde = 00426001
Oops: 0000 [#1]
SMP
Modules linked in: sata_nv libata scsi_mod e1000 nfs lockd nfs_acl sunrpc
CPU:    0
EIP:    0060:[<e09528e6>]    Not tainted VLI
EFLAGS: 00010286   (2.6.14-1.1644_FC4smp)
EIP is at scsi_run_queue+0x10/0xaf [scsi_mod]
eax: 00000000   ebx: dd98007c   ecx: dffef880   edx: 00000001
esi: ddc99e00   edi: 00000246   ebp: dd9071fc   esp: c042af14
ds: 007b   es: 007b   ss: 0068
Process ksoftirqd/0 (pid: 3, threadinfo=c042a000 task=dfc41ab0)
Stack: dd9071fc dd98007c ddc99e00 00000246 dd9071fc e0952a7d ddc99e00 00000000
        00000000 dd98007c e0952e88 00000001 e088260b dcd3a570 c15e3380 dcd3a570
        00000000 00000000 00040000 00000024 dd9071fc 00000000 00000000 00000292
Call Trace:
  [<e0952a7d>] scsi_end_request+0x83/0xb0 [scsi_mod]
  [<e0952e88>] scsi_io_completion+0x29e/0x4d2 [scsi_mod]
  [<e088260b>] e1000_clean_rx_irq+0x95/0x4f1 [e1000]
  [<e094dcb2>] scsi_finish_command+0x82/0xb5 [scsi_mod]
  [<e094db97>] scsi_softirq+0xc0/0x133 [scsi_mod]
  [<c02bdf4e>] net_rx_action+0xb7/0x1bb
  [<c01258c2>] __do_softirq+0x72/0xdc
  [<c0105c43>] do_softirq+0x4b/0x4f
  =======================
  [<c0125ec2>] ksoftirqd+0x9c/0xe8
  [<c0125e26>] ksoftirqd+0x0/0xe8
  [<c0133d89>] kthread+0x93/0x97
  [<c0133cf6>] kthread+0x0/0x97
  [<c0101d5d>] kernel_thread_helper+0x5/0xb
Code: c5 8f df 8b 14 24 8b 42 44 e8 37 b6 9c df 89 44 24 04 89 d8 e8 2e b6 ff ff
eb b1 55 57 56 53 83 ec \
04 89 04 24 8b 80 10 01 00 00 <8b> 38 80 b8 85 01 00 00 00 0f 88 86 00 00 00 8b
47 44 e8 03 b6  <0>Kernel \
                panic - not syncing: Fatal exception in interrupt
ata3: no device found (phy stat 00000000)
scsi2 : sata_nv
  [<c0120358>] panic+0x45/0x1c4
  [<c0104caf>] die+0x17b/0x185
  [<c031ec40>] do_page_fault+0x0/0x700
  [<c031ee49>] do_page_fault+0x209/0x700
  [<c031ec40>] do_page_fault+0x0/0x700
  [<c010457f>] error_code+0x4f/0x54
  [<e09528e6>] scsi_run_queue+0x10/0xaf [scsi_mod]
  [<e0952a7d>] scsi_end_request+0x83/0xb0 [scsi_mod]
  [<e0952e88>] scsi_io_completion+0x29e/0x4d2 [scsi_mod]
  [<e088260b>] e1000_clean_rx_irq+0x95/0x4f1 [e1000]
  [<e094dcb2>] scsi_finish_command+0x82/0xb5 [scsi_mod]
  [<e094db97>] scsi_softirq+0xc0/0x133 [scsi_mod]
  [<c02bdf4e>] net_rx_action+0xb7/0x1bb
  [<c01258c2>] __do_softirq+0x72/0xdc
  [<c0105c43>] do_softirq+0x4b/0x4f
  =======================
  [<c0125ec2>] ksoftirqd+0x9c/0xe8
  [<c0125e26>] ksoftirqd+0x0/0xe8
  [<c0133d89>] kthread+0x93/0x97
  [<c0133cf6>] kthread+0x0/0x97
  [<c0101d5d>] kernel_thread_helper+0x5/0xb

And here is Oops number 2:

*********************

Unable to handle kernel NULL pointer dereference at virtual address 00000000
  printing eip:
*pde = 00426001
Oops: 0000 [#1]
SMP
Modules linked in: sata_nv libata scsi_mod e1000 nfs lockd nfs_acl sunrpc
CPU:    0
EIP:    0060:[<e09528e6>]    Not tainted VLI
EFLAGS: 00010286   (2.6.14-1.1644_FC4smp)
EIP is at scsi_run_queue+0x10/0xaf [scsi_mod]
eax: 00000000   ebx: dd9acb1c   ecx: dffef880   edx: 00000001
esi: dd50fc80   edi: 00000246   ebp: dd49a3b8   esp: c042af14
ds: 007b   es: 007b   ss: 0068
Process ksoftirqd/0 (pid: 3, threadinfo=c042a000 task=dfc41ab0)
Stack: dd49a3b8 dd9acb1c dd50fc80 00000246 dd49a3b8 e0952a7d dd50fc80 00000000
        00000000 dd9acb1c e0952e88 00000001 e088260b c04ac408 ded6b380 c1407fe0
        00000000 00000000 00040000 00000024 dd49a3b8 00000000 00000000 00000292
Call Trace:
  [<e0952a7d>] scsi_end_request+0x83/0xb0 [scsi_mod]
  [<e0952e88>] scsi_io_completion+0x29e/0x4d2 [scsi_mod]
  [<e088260b>] e1000_clean_rx_irq+0x95/0x4f1 [e1000]
  [<e094dcb2>] scsi_finish_command+0x82/0xb5 [scsi_mod]
  [<e094db97>] scsi_softirq+0xc0/0x133 [scsi_mod]
  [<c02bdf4e>] net_rx_action+0xb7/0x1bb
  [<c01258c2>] __do_softirq+0x72/0xdc
  [<c0105c43>] do_softirq+0x4b/0x4f
  =======================
  [<c0125ec2>] ksoftirqd+0x9c/0xe8
  [<c0125e26>] ksoftirqd+0x0/0xe8
  [<c0133d89>] kthread+0x93/0x97
  [<c0133cf6>] kthread+0x0/0x97
  [<c0101d5d>] kernel_thread_helper+0x5/0xb
Code: c5 8f df 8b 14 24 8b 42 44 e8 37 b6 9c df 89 44 24 04 89 d8 e8 2e b6 ff ff
eb b1 55 57 56 53 83 ec \
04 89 04 24 8b 80 10 01 00 00 <8b> 38 80 b8 85 01 00 00 00 0f 88 86 00 00 00 8b
47 44 e8 03 b6  <0>Kernel \
panic - not syncing: Fatal exception in interrupt  [<c0120358>] panic+0x45/0x1c4
  [<c0104caf>] die+0x17b/0x185
  [<c031ec40>] do_page_fault+0x0/0x700
  [<c031ee49>] do_page_fault+0x209/0x700
  [<c031ec40>] do_page_fault+0x0/0x700
  [<c010457f>] error_code+0x4f/0x54
  [<e09528e6>] scsi_run_queue+0x10/0xaf [scsi_mod]
  [<e0952a7d>] scsi_end_request+0x83/0xb0 [scsi_mod]
  [<e0952e88>] scsi_io_completion+0x29e/0x4d2 [scsi_mod]
  [<e088260b>] e1000_clean_rx_irq+0x95/0x4f1 [e1000]
  [<e094dcb2>] scsi_finish_command+0x82/0xb5 [scsi_mod]
  [<e094db97>] scsi_softirq+0xc0/0x133 [scsi_mod]
  [<c02bdf4e>] net_rx_action+0xb7/0x1bb
  [<c01258c2>] __do_softirq+0x72/0xdc
  [<c0105c43>] do_softirq+0x4b/0x4f
  =======================
  [<c0125ec2>] ksoftirqd+0x9c/0xe8
  [<c0125e26>] ksoftirqd+0x0/0xe8
  [<c0133d89>] kthread+0x93/0x97
  [<c0133cf6>] kthread+0x0/0x97
  [<c0101d5d>] kernel_thread_helper+0x5/0xb


Is there any kernel parameter we can try to work around this? Since
it only happens sometimes, it would be nice to just work around it
for now, until the NULL pointer dereference is fixed.
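
Both oopses fault at the same place, scsi_run_queue+0x10, with eax equal to zero. As a rough cross-check, the Python sketch below parses those fields out of the first oops and confirms that the byte marked <8b> in the Code: line sits 0x10 bytes after what looks like a function prologue (55 is push %ebp). The excerpt is copied from the report with the wrapped Code: line joined. The function name parse_oops and the returned field names are illustrative assumptions, not part of any kernel tool.

```python
import re

# Lines copied from the first oops; the wrapped "Code:" line has been joined
# and the trailing "\" characters from the capture removed.
OOPS_EXCERPT = """\
EIP is at scsi_run_queue+0x10/0xaf [scsi_mod]
eax: 00000000   ebx: dd98007c   ecx: dffef880   edx: 00000001
Code: c5 8f df 8b 14 24 8b 42 44 e8 37 b6 9c df 89 44 24 04 89 d8 e8 2e b6 ff ff eb b1 55 57 56 53 83 ec 04 89 04 24 8b 80 10 01 00 00 <8b> 38 80 b8 85 01 00 00 00 0f 88 86 00 00 00 8b 47 44 e8 03 b6
"""


def parse_oops(text):
    """Extract the faulting symbol, registers and marked code byte (hypothetical helper)."""
    eip = re.search(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]", text)
    regs = dict(re.findall(r"\b(e[a-d]x): ([0-9a-f]{8})", text))
    code = re.search(r"^Code: (.+)$", text, re.MULTILINE).group(1).split()

    # The kernel wraps the byte at EIP in <...>.
    fault_index = next(i for i, tok in enumerate(code) if tok.startswith("<"))
    offset = int(eip.group(2), 16)

    return {
        "symbol": eip.group(1),
        "offset": offset,
        "module": eip.group(4),
        "eax": int(regs["eax"], 16),
        "faulting_byte": code[fault_index].strip("<>"),
        # Only meaningful when the dump reaches back to the function start,
        # i.e. when offset <= fault_index, as it does here.
        "function_start_byte": code[fault_index - offset],
    }


info = parse_oops(OOPS_EXCERPT)
print(info)
```

Decoded by hand, the bytes 8b 80 10 01 00 00 just before the marker are mov 0x110(%eax),%eax and the faulting 8b 38 is mov (%eax),%edi, so the zero in eax appears to be a pointer that had just been loaded from offset 0x110 of whatever eax held on entry. Which structure field that corresponds to depends on the kernel build and is not determined here.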
Comment 1 Dave Jones 2005-12-23 14:42:15 EST
I'm very interested to hear whether the test kernels at
http://people.redhat.com/davej/kernels/Fedora/FC4 exhibit the same behaviour.
Comment 2 Dave Jones 2005-12-23 15:08:26 EST
You mention this is diskless, so I'm puzzled why we're doing SCSI I/O at all in
this backtrace.

Is there any device at all connected to the sata_nv controller?
(A CD drive, perhaps?)
Comment 3 Carl-Johan Kjellander 2005-12-24 20:29:00 EST
We'll try those kernels on the 28th. We have a setup that hard-restarts
the computers every 20 minutes, so we should know in
less than a day whether they work or not.

The reason we are doing SCSI I/O is that we actually have a disk on
the sata_nv for logging while we are developing. When we develop
we might actually get >10 GB of logs per machine, and we don't want to
do that over NFS. Also, the logging isn't multithreaded, so any
delays would show on the graphics.

So logging to disk is the key here.
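
The "less than a day" estimate can be sanity-checked with a little arithmetic. The sketch below assumes each boot is an independent trial and takes the reported failure rate of 1 in 10 to 1 in 20 boots at face value; the function name boots_needed and the 95% confidence threshold are illustrative choices, not something stated in this report.

```python
import math


def boots_needed(p_oops_per_boot, confidence=0.95):
    """Smallest n with 1 - (1 - p)**n >= confidence.

    Assumes every boot hits the oops independently with the same probability.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_oops_per_boot))


MINUTES_PER_BOOT = 20  # one hard restart every 20 minutes, as described above

for rate in (1 / 10, 1 / 20):
    n = boots_needed(rate)
    hours = n * MINUTES_PER_BOOT / 60
    print(f"rate {rate:.2f}: {n} boots, about {hours:.1f} hours")
```

Under those assumptions, a kernel that still has the bug would most likely show at least one oops after 29 boots at the 1-in-10 rate and 59 boots at the 1-in-20 rate, which is under 20 hours of 20-minute cycles.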
Comment 4 Dave Jones 2005-12-29 01:57:26 EST
There's a 2.6.14-1.1654_FC4 built which should appear in updates-testing in a
few days (as soon as the release team gets back from vacation :-) ). That has a
patch which should fix this.
Comment 5 Carl-Johan Kjellander 2006-01-03 08:20:07 EST
That patch has fixed the crashes completely. The machine has had 50
reboots without any oopses.

It's linux-2.6-scsi-runqueue-oops.patch that fixes it, right? Is it in
the mainline or something you did over Xmas? Anyway, thanks for fixing
it so quickly.
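
For a sense of how strong "50 reboots without any oopses" is as evidence, the sketch below computes the chance of seeing 50 clean boots in a row if the bug were still present. It assumes independent boots and the originally reported rate of 1 in 10 to 1 in 20; p_clean_run is a hypothetical helper name.

```python
def p_clean_run(p_oops_per_boot, boots):
    """Probability of `boots` consecutive clean boots on a still-buggy kernel.

    Assumes each boot fails independently with probability p_oops_per_boot.
    """
    return (1.0 - p_oops_per_boot) ** boots


for rate in (1 / 10, 1 / 20):
    print(f"rate {rate:.2f}: P(50 clean boots) = {p_clean_run(rate, 50):.4f}")
```

With these assumptions, 50 clean boots by luck alone would happen about 0.5% of the time at the 1-in-10 rate and about 8% of the time at the 1-in-20 rate, so the result points strongly, though not conclusively, at the patch.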
Comment 6 Dave Jones 2006-01-04 00:30:13 EST
Yep, that's the patch. It came from 2.6.15.
The 2.6.14 update kernel-in-progress is going live in a few days, then a 2.6.15
based update will go into testing.
Comment 7 Dave Jones 2006-02-03 02:02:13 EST
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.
Comment 8 Carl-Johan Kjellander 2006-02-08 07:38:32 EST
Tested with kernel-smp-2.6.15-1.1831_FC4 and no problems.
