Red Hat Bugzilla – Bug 471098
consistent null pointer dereference on Atom/D945GCLF (not realtek related)
Last modified: 2008-11-22 01:48:41 EST
Created attachment 323217 [details]
126.96.36.199-68.fc10.x86_64 kernel panic
Description of problem:
Kernel consistently fails with 'unable to handle kernel paging request' or 'unable to handle kernel NULL pointer dereference' on Atom230/D945GCLF hardware.
This is not realtek related (tested with realtek disabled)
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Attempt to boot FC10-x86_64-Preview using PXE
or
1. Install FC10-x86_64-Preview on a Core2 machine, update to -79.
2. Move the HDD to the Atom230/D945GCLF and attempt to boot
hang, BUG output and then hang, or kernel panic
This is tested on two different D945GCLF boards, different memory modules, different power supplies, different bios versions (103 is latest).
Disabling hyperthreading tried.
Disabling the onboard realtek nic and PXE booting from PRO/1000GT tried.
My Core2Quad machine installs successfully from the same PXE server.
-79 hangs at 'Write protecting the kernel read-only data: 4952k' when using the serial console. Using the normal console, it sometimes partially outputs a panic after printing that message.
The attached -68 log complains about "BIOS bug, no explicit IRQ entries, using default mptable"; -79 does not.
Created attachment 323221 [details]
-68 variation: repeated null deref
Created attachment 323222 [details]
-79 using serial console
When using serial console, -79 hangs and does not generate BUG lines or panic
why are you booting with 'noapic acpi=off' ?
on modern machines that pretty much is a fast path to failure.
ok, disregard that last question, as you stopped doing it for the -79 kernel it seems.
No clue why it hangs yet though.
I won't be able to test for a few days, so sharing perhaps a datapoint.
I've had D945GCLF behave quite erratically with recent f9 kernels, instantly dying or oopsing with something that mostly looked like acpi errors.
Then I did something in the bios. It was either disabling hyperthreading or disabling HPET, or both. Since then no problems with that board and the latest f9 kernel. Perhaps experimenting with those two could give some results here.
Created attachment 323227 [details]
-68 with larger ramdisk
-68 with larger ramdisk, and no noapic and no noacpi. Lots of panic output (the kernel starts out untainted, faults again then becomes tainted, faults again..)
I've tried HT on/off, and HPET on/off. It seems to make no difference.
I've also tried switching video from the default of DVMT/128/256 to FIXED/32/128; On earlier F10 alphas, the earliest ones liked the FIXED setting, the later ones liked the DVMT setting. It however only made a difference when starting X, and I'm keeping to runlevel 3.
Created attachment 323239 [details]
-79 panic output
contains two startups of -79: My only ever working startup, and one where it panics.
Created attachment 323345 [details]
188.8.131.52-94.fc10.i686 D945GCLF dmesg log
I've selectively upgraded the F9 install to kernel 184.108.40.206-94.fc10.i686 from rawhide and so far it's working without problems. dmesg attached. I've rechecked the BIOS options that I think I played with last time, and it's HT on, HPET off.
Created attachment 323366 [details]
i686 does indeed boot.
x86_64 (which this is against) still panics.
Created attachment 323380 [details]
220.127.116.11-94.fc10.x86_64 D945GCLF dmesg
Now this is weird. I installed the x86_64 kernel alongside the 32-bit userspace.
Booted normally, it crashes consistently with a trace that I cannot capture currently due to lack of equipment (not even a camera... :( ). The RIP is at on_each_cpu. But it consistently works if I add a vga=0x317 parameter. A dmesg of that is attached. What gives?
Created attachment 323493 [details]
18.104.22.168-101.fc10.x86_64 (two attempts)
-101 still dies.
vga=0x317 gives me a pair of penguins but dies just the same.
Tried nopat just in case but it makes no difference.
Created attachment 323935 [details]
I think I got something, could you try with retain_initrd on the command line.
retain_initrd: -113: panics, and the call trace looks the same with and without.
(In reply to comment #13)
> Created an attachment (id=323935) [details]
> 22.214.171.124-113.fc10.x86_64 panic
> Still dies.
0: 55 push %ebp <=====
1: 48 dec %eax
This oops makes no sense whatsoever. The fault address is ffffffff81009000, which is in RAX, but the faulting instruction would write data to *RSP, which is ffff88003f1ffe08.
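The mismatch described above can be checked with a few lines of arithmetic. This is only a sketch: the register values are the ones quoted in this comment, and the names `FAULT_ADDR`, `RSP` and `push_target` are made up for illustration. A `push` on x86_64 stores 8 bytes at RSP - 8, so that should be the only address the faulting instruction can touch.

```python
# Values quoted from the oops discussed in this report.
FAULT_ADDR = 0xFFFFFFFF81009000  # reported fault address (same value as RAX)
RSP = 0xFFFF88003F1FFE08         # stack pointer when the fault was taken


def push_target(rsp: int) -> int:
    """Address written by a 64-bit push: the stack grows down by 8 bytes."""
    return (rsp - 8) & 0xFFFFFFFFFFFFFFFF


target = push_target(RSP)
print(f"push would write to   {target:#018x}")
print(f"fault was reported at {FAULT_ADDR:#018x}")
if target == FAULT_ADDR:
    print("consistent")
else:
    print("inconsistent: a push at this RSP should not fault at that address")
```

The two addresses are not even in the same region of the address space, which is why the oops looks impossible for this instruction.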
Does that board run any other OS successfully?
The two boards in my possession, with different brands and speed ratings of memory modules, different power supplies, and different HDDs, successfully run F8-i386/x86_64, F10-preview-i386, and Windows2003-64. One has booted Windows2008-64 at least once (out of one try) when I got my by now rather large pile of HDDs mixed up. I have attempted to prove hardware failure, without success.
C3 is a ret, so 55 48 is the first two instructions of the function. ffffffff81009000 is also found in R12. If Atom wasn't an in-order machine I'd guess that this was a late effect of an earlier access in the caller, only that's not the way x86_64 is _supposed_ to work.
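For anyone following the byte-level argument, here is a rough sketch of how the three opcode bytes mentioned (C3, 55, 48) decode. It handles only these one-byte opcode ranges and is in no way a real disassembler; the function name `describe_byte` is hypothetical. It also shows that the listing quoted earlier was decoded as 32-bit code: in 32-bit mode 48 is `dec %eax`, while in 64-bit mode 48 is a REX.W prefix that belongs to the instruction following it.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def describe_byte(byte: int, long_mode: bool) -> str:
    """Describe a single opcode byte in 32-bit or 64-bit (long) mode."""
    if byte == 0xC3:
        return "ret"
    if 0x50 <= byte <= 0x57:
        # push of a general-purpose register; 64-bit wide in long mode
        regs = REGS64 if long_mode else REGS32
        return f"push %{regs[byte - 0x50]}"
    if 0x40 <= byte <= 0x4F:
        if long_mode:
            # 0100WRXB: a REX prefix, not an instruction of its own
            w, r, x, b = [(byte >> shift) & 1 for shift in (3, 2, 1, 0)]
            return f"REX prefix (W={w} R={r} X={x} B={b})"
        op = "inc" if byte < 0x48 else "dec"
        return f"{op} %{REGS32[byte & 7]}"
    return "(not handled by this sketch)"


for b in (0xC3, 0x55, 0x48):
    print(f"{b:02x}: 32-bit: {describe_byte(b, False):<12} "
          f"64-bit: {describe_byte(b, True)}")
```

So in a 64-bit kernel, `55 48` is `push %rbp` followed by the first byte of the next instruction, which fits a normal function prologue.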
This is Atom, new and shiny, so don't rule out fresh CPU bugs.
Can someone try booting with the kernel option 'noreplace-paravirt'?
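When testing options such as 'noreplace-paravirt', 'retain_initrd' or 'vga=0x317', it can help to confirm, on a boot that succeeds, what the running kernel actually received. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper name `parse_cmdline` is made up, and the parsing ignores quoted values that contain spaces.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {option: value, or None for bare flags}."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            opts = parse_cmdline(f.read())
    except OSError:
        opts = {}  # not on Linux, or /proc not mounted
    # Options that came up in this report
    for name in ("noreplace-paravirt", "retain_initrd", "nopat", "vga",
                 "noapic", "acpi"):
        if name not in opts:
            print(f"{name}: absent")
        elif opts[name] is None:
            print(f"{name}: present")
        else:
            print(f"{name}: present, value {opts[name]}")
```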
Created attachment 324132 [details]
Here is mine for completeness. The fact that it did boot sometimes led me on a silly kernel parameter goose chase, sorry about that.
Well, look at that. I just updated to the very recently released new BIOS for this board, ver. 0122, and it's all good. Survived a number of reboots into x86_64 without a hitch.
Looks like the latest BIOS update fixes this one.