471098 – consistent null pointer dereference on Atom/D945GCLF (not realtek related)

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-11-11 19:18 UTC by Kasper Pedersen
Modified:	2008-11-22 06:48 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-11-22 06:48:41 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
2.6.27.4-68.fc10.x86_64 kernel panic (18.69 KB, text/plain) 2008-11-11 19:18 UTC, Kasper Pedersen	no flags	Details
-68 variation: repeated null deref (17.79 KB, text/plain) 2008-11-11 19:33 UTC, Kasper Pedersen	no flags	Details
-79 using serial console (75.48 KB, text/plain) 2008-11-11 19:36 UTC, Kasper Pedersen	no flags	Details
-68 with larger ramdisk (29.32 KB, text/plain) 2008-11-11 20:03 UTC, Kasper Pedersen	no flags	Details
-79 panic output (45.75 KB, text/plain) 2008-11-11 20:26 UTC, Kasper Pedersen	no flags	Details
2.6.27.5-94.fc10.i686 D945GCLF dmesg log (28.19 KB, text/plain) 2008-11-12 16:45 UTC, Yanko Kaneti	no flags	Details
2.6.27.5-94.fc10.x86_64 panic (25.45 KB, text/plain) 2008-11-12 18:20 UTC, Kasper Pedersen	no flags	Details
2.6.27.5-94.fc10.x86_64 D945GCLF dmesg (28.88 KB, text/plain) 2008-11-12 19:39 UTC, Yanko Kaneti	no flags	Details
2.6.27.5-101.fc10.x86_64 (two attempts) (47.08 KB, text/plain) 2008-11-13 19:31 UTC, Kasper Pedersen	no flags	Details
2.6.27.5-113.fc10.x86_64 panic (21.40 KB, text/plain) 2008-11-18 17:41 UTC, Kasper Pedersen	no flags	Details
2.6.27.5-120.fc10.x86_64-D945GCLF crash (26.05 KB, text/plain) 2008-11-20 02:13 UTC, Yanko Kaneti	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	279186	0	None	None	None	Never

Description Kasper Pedersen 2008-11-11 19:18:08 UTC

Created attachment 323217 [details]
2.6.27.4-68.fc10.x86_64 kernel panic

Description of problem:

  The kernel consistently fails with 'unable to handle kernel paging request' or 'unable to handle kernel NULL pointer dereference' on Atom230/D945GCLF hardware.

  This is not realtek related (tested with the realtek NIC disabled).

Version-Release number of selected component (if applicable):

2.6.27.4-68.fc10.x86_64 (FC10-x86_64-Preview)
2.6.27.4-79.fc10.x86_64 (today)

How reproducible:

always

Steps to Reproduce:
1. Attempt to boot FC10-x86_64-Preview using PXE
 or
1. Install FC10-x86_64-Preview on a Core2 machine, update to -79.
2. Move the HDD to the Atom230/D945GCLF and attempt to boot
  
Actual results:

hang, BUG output and then hang, or kernel panic

Expected results:

Normal startup

Additional info:

This was tested on two different D945GCLF boards, with different memory modules, different power supplies, and different BIOS versions (103 is the latest).
I tried disabling hyperthreading.
I tried disabling the onboard realtek NIC and PXE booting from a PRO/1000GT.
My Core2Quad machine installs successfully from the same PXE server.

-79 hangs at 'Write protecting the kernel read-only data: 4952k' when using the serial console. Using the normal console, it sometimes partially outputs a panic after printing that message.

The attached -68 log complains about "BIOS bug, no explicit IRQ entries, using default mptable"; -79 does not.

Comment 1 Kasper Pedersen 2008-11-11 19:33:28 UTC

Created attachment 323221 [details]
-68 variation: repeated null deref

Comment 2 Kasper Pedersen 2008-11-11 19:36:46 UTC

Created attachment 323222 [details]
-79 using serial console

When using serial console, -79 hangs and does not generate BUG lines or panic

Comment 3 Dave Jones 2008-11-11 19:40:52 UTC

why are you booting with 'noapic acpi=off' ?

on modern machines that pretty much is a fast path to failure.

Comment 4 Dave Jones 2008-11-11 19:42:22 UTC

ok, disregard that last question, as you stopped doing it for the -79 kernel it seems.

No clue why it hangs yet though.

Comment 5 Yanko Kaneti 2008-11-11 20:03:32 UTC

I won't be able to test for a few days, so sharing perhaps a datapoint.

I've had D945GCLF behave quite erratically with recent f9 kernels, instantly dying or oopsing with something that mostly looked like acpi errors.
Then I did something in the bios. It was either disabling hyperthreading or disabling HPET, or both. Since then no problems with that board and the latest f9 kernel. Perhaps experimenting with those two could give some results here.

Comment 6 Kasper Pedersen 2008-11-11 20:03:36 UTC

Created attachment 323227 [details]
-68 with larger ramdisk

-68 with a larger ramdisk, and without noapic or acpi=off. Lots of panic output (the kernel starts out untainted, faults, becomes tainted, faults again, and so on).

Comment 7 Kasper Pedersen 2008-11-11 20:08:40 UTC

I've tried HT on/off, and HPET on/off. It seems to make no difference.

I've also tried switching video from the default of DVMT/128/256 to FIXED/32/128. Among the earlier F10 alphas, the earliest ones liked the FIXED setting and the later ones liked the DVMT setting. However, it only made a difference when starting X, and I'm staying in runlevel 3.

Comment 8 Kasper Pedersen 2008-11-11 20:26:40 UTC

Created attachment 323239 [details]
-79 panic output

Contains two startups of -79: my only working startup so far, and one where it panics.

Comment 9 Yanko Kaneti 2008-11-12 16:45:55 UTC

Created attachment 323345 [details]
2.6.27.5-94.fc10.i686  D945GCLF dmesg log

I've selectively upgraded the F9 install to kernel 2.6.27.5-94.fc10.i686 from rawhide, and so far it's working without problems. dmesg attached. I've rechecked the BIOS options that I think I played with last time: HT is on and HPET is off.

Comment 10 Kasper Pedersen 2008-11-12 18:20:26 UTC

Created attachment 323366 [details]
2.6.27.5-94.fc10.x86_64 panic

i686 does indeed boot. 
x86_64 (which this is against) still panics.

Comment 11 Yanko Kaneti 2008-11-12 19:39:30 UTC

Created attachment 323380 [details]
2.6.27.5-94.fc10.x86_64 D945GCLF dmesg

Now this is weird. I installed the x86_64 kernel alongside the 32-bit userspace.
Booted normally, it crashes consistently with a trace that I cannot capture at the moment for lack of equipment (not even a camera... :(). The RIP is in on_each_cpu. But it consistently works if I add a vga=0x317 parameter. A dmesg of that boot is attached. What gives?

Comment 12 Kasper Pedersen 2008-11-13 19:31:43 UTC

Created attachment 323493 [details]
2.6.27.5-101.fc10.x86_64 (two attempts)

-101 still dies.
vga=0x317 gives me a pair of penguins but dies just the same. 
Tried nopat just in case but it makes no difference.

Comment 13 Kasper Pedersen 2008-11-18 17:41:16 UTC

Created attachment 323935 [details]
2.6.27.5-113.fc10.x86_64 panic

Still dies.

Comment 14 Yanko Kaneti 2008-11-19 14:54:29 UTC

I think I've got something. Could you try with retain_initrd on the command line?

Comment 15 Kasper Pedersen 2008-11-19 19:12:50 UTC

With retain_initrd, -113 still panics, and the call trace looks the same with and without it.

Comment 16 Chuck Ebbert 2008-11-19 21:54:57 UTC

(In reply to comment #13)
> Created an attachment (id=323935) [details]
> 2.6.27.5-113.fc10.x86_64 panic
> 
> Still dies.

   0:   55                      push   %ebp  <=====
   1:   48                      dec    %eax

This oops makes no sense whatsoever. The fault address is ffffffff81009000, which is in RAX, but the faulting instruction would write data to *RSP, which is ffff88003f1ffe08.

Does that board run any other OS successfully?
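
To make the mismatch above concrete, here is a minimal Python sketch that sorts the two addresses into regions of the x86_64 kernel address space. The region boundaries are assumptions based on the documented memory map for kernels of roughly this era; they are not taken from the attached logs, and the upper bound of the direct mapping in particular is approximate.

```python
# Approximate x86_64 kernel address-space regions for 2.6.27-era kernels.
# ASSUMPTION: boundaries follow the kernel's documented memory map of that
# period; the end of the direct mapping is approximate.
REGIONS = [
    ("kernel text mapping", 0xFFFFFFFF80000000, 0xFFFFFFFFA0000000),
    ("direct mapping of physical memory", 0xFFFF880000000000, 0xFFFFC10000000000),
]


def classify(addr: int) -> str:
    """Return the name of the region containing addr, or 'unknown'."""
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    return "unknown"


fault_addr = 0xFFFFFFFF81009000  # reported fault address (also in RAX)
stack_ptr = 0xFFFF88003F1FFE08   # RSP, where a push would write

print(hex(fault_addr), "->", classify(fault_addr))
print(hex(stack_ptr), "->", classify(stack_ptr))
```

Under these assumptions the fault address falls in the kernel text mapping while RSP falls in the direct mapping, so a stack write should not have touched the reported fault address.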

Comment 17 Kasper Pedersen 2008-11-19 23:33:57 UTC

The two boards in my possession, with different brands and speed ratings of memory modules, different power supplies, and different HDDs, successfully run F8-i386/x86_64, F10-preview-i386, and Windows2003-64. One has booted Windows2008-64 at least once (out of one try) when I got my by now rather large pile of HDDs mixed up. I have attempted to prove hardware failure, without success.

C3 is a ret, so 55 48 is the first two instructions of the function. ffffffff81009000 is also found in R12. If Atom wasn't an in-order machine I'd guess that this was a late effect of an earlier access in the caller, only that's not the way x86_64 is _supposed_ to work.

This is Atom, new and shiny, so don't rule out fresh CPU bugs.
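
A side note on the quoted bytes: the listing in comment 16 decodes them as 32-bit code, while in 64-bit mode the byte 0x48 is a REX prefix with the W bit set, not a separate `dec %eax`. The Python sketch below only illustrates that difference for the two bytes quoted in this thread; `describe` is a hypothetical helper, and since the bytes following 48 are not quoted here, the sketch does not guess which instruction the prefix belongs to.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def describe(byte: int, mode: int) -> str:
    """Describe one opcode byte as a 32-bit or 64-bit decoder would see it.

    Hypothetical helper: covers only 0x55 and the 0x40-0x4F range.
    """
    if byte == 0x55:
        # push of the frame pointer; either way it writes to the stack
        return "push %rbp" if mode == 64 else "push %ebp"
    if 0x40 <= byte <= 0x4F:
        if mode == 64:
            # In 64-bit mode 0x40-0x4F are REX prefixes; bit 3 is REX.W
            return f"REX prefix (W={(byte >> 3) & 1})"
        # In 32-bit mode 0x40-0x47 are inc r32 and 0x48-0x4F are dec r32
        op = "inc" if byte < 0x48 else "dec"
        return f"{op} %{REG32[byte & 7]}"
    return "not covered by this sketch"


for b in (0x55, 0x48):
    print(f"{b:02x}: 32-bit: {describe(b, 32):<12} 64-bit: {describe(b, 64)}")
```

Either way, the first instruction is a push of the frame pointer, which writes to the stack, consistent with the observation that the faulting instruction should have written to *RSP.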

Comment 18 Chuck Ebbert 2008-11-20 01:17:27 UTC

Can someone try booting with the kernel option 'noreplace-paravirt'?
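
With this many boot options tried across the thread, it can help to record exactly which ones a given boot used. A minimal Python sketch follows; the option list is collected from the comments in this report, and the sample command line is hypothetical.

```python
# Boot options mentioned in this report (comments 3, 11, 12, 14 and 18).
OPTIONS_TRIED = [
    "noapic",
    "acpi=off",
    "vga=0x317",
    "nopat",
    "retain_initrd",
    "noreplace-paravirt",
]


def options_present(cmdline: str) -> dict:
    """Map each option from the thread to whether it appears in cmdline."""
    tokens = cmdline.split()
    return {opt: opt in tokens for opt in OPTIONS_TRIED}


# HYPOTHETICAL command line, for illustration only.
sample = "ro root=/dev/sda1 console=ttyS0,115200 noreplace-paravirt"

for opt, present in options_present(sample).items():
    print(f"{opt:<20} {'yes' if present else 'no'}")
```

On a running system the string could be read from /proc/cmdline instead of the sample, and the result pasted next to each attached log.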

Comment 19 Yanko Kaneti 2008-11-20 02:13:20 UTC

Created attachment 324132 [details]
2.6.27.5-120.fc10.x86_64-D945GCLF crash

Here is mine for completeness. The fact that it did boot sometimes led me on a silly kernel-parameter goose chase, sorry about that.

Comment 20 Yanko Kaneti 2008-11-21 12:31:35 UTC

Well, look at that. I just updated to the very recently released BIOS for this board, version 0122, and it's all good. It survived a number of reboots into x86_64 without a hitch.

Comment 21 Chuck Ebbert 2008-11-22 06:48:41 UTC

Looks like the latest BIOS update fixes this one.
