Bug 249174

Summary: [agp] 2.6.22.1-27.fc7.x86_64 fails to boot on some x86_64 machines
Product: [Fedora] Fedora
Reporter: Callum Lerwick <seg>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: rawhide
CC: airlied, fedora, jbacik, jmorris, michal, mlichvar, permonik, sundaram, twanno
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.23.1-41.fc8
Doc Type: Bug Fix
Last Closed: 2007-10-30 19:14:40 UTC
Bug Blocks: 235703
Attachments:
dmesg from working kernel on m6805
a part of differences from dmesg for different kernels on two different boards
first screen from oops with 2.6.22.1-32.fc6
second screen from oops with 2.6.22.1-32.fc6
dmesg from 'mem=510M' boot with 2.6.22.4-45.fc6
Screenshot of kernel-2.6.22.9-91.fc7.x86_64 backtrace
Patch to revert broken commit

Description Callum Lerwick 2007-07-22 03:30:35 UTC
Description of problem:
Kernel 2.6.22.1-27.fc7.x86_64 fails to boot on my eMachines m6805, aka Arima
W720-K8. Shortly after beginning to boot the kernel, it reboots, and falls into
an endless reboot loop.

Booting with "quiet" removed, it seems to get down to "agpgart: Detected AGP
bridge 0" before it reboots. It's only displayed for a fraction of a second, so
it's hard to be sure.

Kernel 2.6.21-1.3228.fc7.x86_64 and earlier works fine.

Version-Release number of selected component (if applicable):
kernel-2.6.22.1-27.fc7.x86_64

How reproducible:
Always

Comment 1 Josef Bacik 2007-07-22 13:12:04 UTC
Well, since it reboots so quickly, let's try disabling things and see if you can
get it to come up.  Add to your kernel line in your grub.conf

acpi=off noagp

and try to boot into that kernel.  If that works, remove one of those options
and see if it still boots.  Post whichever option makes it play nicely to this
ticket; it will help narrow down the problem.

Comment 2 Callum Lerwick 2007-07-23 01:59:49 UTC
Neither helps.

And it seems that sometimes it will just lock up rather than reboot. It is in fact
hanging/rebooting at "agpgart: Detected AGP bridge 0".

Comment 3 Chuck Ebbert 2007-07-23 17:29:14 UTC
(In reply to comment #1)
> Well since it reboots so quickly lets try and disable things and see if you can
> get it to come up.  Add to your kernel line in your grub.conf
> 
> acpi=off noagp
> 

That's     ^^^^^

   agp=off



Comment 4 Callum Lerwick 2007-07-24 05:02:26 UTC
Oooh, agp=off worked. I need my DRI though. :)

Comment 5 Michal Jaegermann 2007-07-24 19:01:36 UTC
I got hit by the same on an SK8V from ASUSTeK Computer Inc.  I have no
idea how far this gets, as the screen blinks and the machine reboots
before I have a chance to read anything at all.

It boots with agp=off but, of course, DRI is killed. No problems
with earlier F7 kernels.

A long succession of rawhide kernels, including various "2.6.22"
kernels, usually booted on the same hardware.  The current
rawhide 2.6.23-0.43.rc0.git16.fc8 is fine.

Comment 6 Chuck Ebbert 2007-07-24 20:03:30 UTC
Can someone post the output of 'lspci' and also 'lspci -n' from the failing
machines?


Comment 7 Callum Lerwick 2007-07-24 20:22:16 UTC
00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:0a.0 CardBus bridge: ENE Technology Inc CB1410 Cardbus Controller
00:0c.0 Network controller: Broadcom Corporation BCM4306 802.11b/g Wireless LAN
Controller (rev 03)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237
AC97 Audio Controller (rev 50)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller
(rev 80)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
00:13.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller
(rev 80)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV350 [Mobility Radeon
9600 M10]


00:00.0 0600: 1106:3188 (rev 01)
00:01.0 0604: 1106:b188
00:0a.0 0607: 1524:1410
00:0c.0 0280: 14e4:4320 (rev 03)
00:10.0 0c03: 1106:3038 (rev 80)
00:10.1 0c03: 1106:3038 (rev 80)
00:10.2 0c03: 1106:3038 (rev 80)
00:10.3 0c03: 1106:3104 (rev 82)
00:11.0 0601: 1106:3177
00:11.1 0101: 1106:0571 (rev 06)
00:11.5 0401: 1106:3059 (rev 50)
00:11.6 0780: 1106:3068 (rev 80)
00:12.0 0200: 1106:3065 (rev 74)
00:13.0 0c00: 1106:3044 (rev 80)
00:18.0 0600: 1022:1100
00:18.1 0600: 1022:1101
00:18.2 0600: 1022:1102
00:18.3 0600: 1022:1103
01:00.0 0300: 1002:4e50


Comment 8 Michal Jaegermann 2007-07-24 22:27:26 UTC
This is for MSI board MS-6741 (x86_64). It also fails to boot starting
with 2.6.22.1-27.fc7.x86_64, with multiple kernel exceptions, but I
cannot get those details.  It may be a different problem. 'lspci' shows
there:

00:00.0 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:00.1 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:00.2 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:00.3 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:00.4 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:00.7 Host bridge: VIA Technologies, Inc. K8M800 Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:06.0 Network controller: Broadcom Corporation BCM4306 802.11b/g Wireless LAN
Controller (rev 03)
00:0e.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller
(rev 80)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID
Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge
[KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237
AC97 Audio Controller (rev 60)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 78)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: VIA Technologies, Inc. S3 Unichrome Pro VGA
Adapter (rev 01)

00:00.0 0600: 1106:0204
00:00.1 0600: 1106:1204
00:00.2 0600: 1106:2204
00:00.3 0600: 1106:3204
00:00.4 0600: 1106:4204
00:00.7 0600: 1106:7204
00:01.0 0604: 1106:b188
00:06.0 0280: 14e4:4320 (rev 03)
00:0e.0 0c00: 1106:3044 (rev 80)
00:0f.0 0104: 1106:3149 (rev 80)
00:0f.1 0101: 1106:0571 (rev 06)
00:10.0 0c03: 1106:3038 (rev 81)
00:10.1 0c03: 1106:3038 (rev 81)
00:10.2 0c03: 1106:3038 (rev 81)
00:10.3 0c03: 1106:3038 (rev 81)
00:10.4 0c03: 1106:3104 (rev 86)
00:11.0 0601: 1106:3227
00:11.5 0401: 1106:3059 (rev 60)
00:12.0 0200: 1106:3065 (rev 78)
00:18.0 0600: 1022:1100
00:18.1 0600: 1022:1101
00:18.2 0600: 1022:1102
00:18.3 0600: 1022:1103
01:00.0 0300: 1106:3108 (rev 01)

This is the same for SK8V from comment #5:

00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:07.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller
(rev 80)
00:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak
378/SATA 378) (rev 02)
00:0a.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell]
(rev 12)
00:0e.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 0c)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID
Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge
[KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237
AC97 Audio Controller (rev 60)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller
(rev 80)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 PRO]
(rev 01)
01:00.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200 PRO]
(Secondary) (rev 01)

00:00.0 0600: 1106:3188 (rev 01)
00:01.0 0604: 1106:b188
00:07.0 0c00: 1106:3044 (rev 80)
00:08.0 0104: 105a:3373 (rev 02)
00:0a.0 0200: 10b7:1700 (rev 12)
00:0e.0 0200: 8086:1229 (rev 0c)
00:0f.0 0104: 1106:3149 (rev 80)
00:0f.1 0101: 1106:0571 (rev 06)
00:10.0 0c03: 1106:3038 (rev 81)
00:10.1 0c03: 1106:3038 (rev 81)
00:10.2 0c03: 1106:3038 (rev 81)
00:10.3 0c03: 1106:3038 (rev 81)
00:10.4 0c03: 1106:3104 (rev 86)
00:11.0 0601: 1106:3227
00:11.5 0401: 1106:3059 (rev 60)
00:11.6 0780: 1106:3068 (rev 80)
00:18.0 0600: 1022:1100
00:18.1 0600: 1022:1101
00:18.2 0600: 1022:1102
00:18.3 0600: 1022:1103
01:00.0 0300: 1002:5960 (rev 01)
01:00.1 0380: 1002:5940 (rev 01)

-[0000:00]-+-00.0  VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge
           +-01.0-[0000:01]--+-00.0  ATI Technologies Inc RV280 [Radeon 9200 PRO]
           |                 \-00.1  ATI Technologies Inc RV280 [Radeon 9200 PRO] (Secondary)
           +-07.0  VIA Technologies, Inc. IEEE 1394 Host Controller
           +-08.0  Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378)
           +-0a.0  3Com Corporation 3c940 10/100/1000Base-T [Marvell]
           +-0e.0  Intel Corporation 82557/8/9 Ethernet Pro 100
           +-0f.0  VIA Technologies, Inc. VIA VT6420 SATA RAID Controller
           +-0f.1  VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE
           +-10.0  VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
           +-10.1  VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
           +-10.2  VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
           +-10.3  VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
           +-10.4  VIA Technologies, Inc. USB 2.0
           +-11.0  VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
           +-11.5  VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller
           +-11.6  VIA Technologies, Inc. AC'97 Modem Controller
           +-18.0  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
           +-18.1  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
           +-18.2  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
           \-18.3  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control



Comment 9 Chuck Ebbert 2007-07-24 22:57:07 UTC
It looks like we can slow down the boot messages by adding

   boot_delay=N

to the kernel command line. Try values of N starting with 200,
then try 400 etc. until it prints slowly enough to see the messages.
Try taking a picture of the screen if there's anything relevant,
then attach that to the bugzilla.


Comment 10 Michal Jaegermann 2007-07-25 00:46:22 UTC
It was just immediately rebooting when I was trying that before.
At this moment 2.6.22.1-27.fc7 on SK8V prints assorted things
until it gets down to these lines:

.....
NetLabel: unlabeled traffic allowed by default
agpgart: Detected AGP bridge 0

and after that it just sits there, with the power switch the only
remaining way to affect anything.

No idea why that changed after powering down in the meantime.  I did
try to boot another working kernel, like 2.6.23-0.45.rc0.git16.fc8,
and restart with 2.6.22.1-27.fc7.  No visible influence.

Like I wrote - with 'agp=off' I can boot.


Comment 11 Michal Jaegermann 2007-07-26 22:13:23 UTC
Kernel 2.6.22.1-33.fc7 from updates-testing is stuck on SK8V
in precisely the same way as described in comment #10.

Comment 12 Chuck Ebbert 2007-07-26 22:46:49 UTC
There really doesn't seem to be much change in AGP to cause this.


Comment 13 Dave Airlie 2007-07-26 23:16:40 UTC
Can you get me a dmesg from a previous kernel and attach it?

My guess is it's a VIA chipset quirk that is missing or doing something different
somewhere else...



Comment 14 Callum Lerwick 2007-07-27 02:50:27 UTC
Created attachment 160089
dmesg from working kernel on m6805

Comment 15 Dave Airlie 2007-07-27 03:03:44 UTC
I'm guessing it might be the e820 stuff in amd64-agp.c

b92e9fac400d4ae5bc7a75c568e9844ec53ea329

is the lk commit, I'm guessing. Chuck, I'm unsure of procedure, so can you follow up?

Comment 16 Michal Jaegermann 2007-07-27 20:09:48 UTC
Created attachment 160133
a part of differences from dmesg for different kernels on two different boards

> I'm guessing it might be the e820 stuff in amd64-agp.c

Maybe.  I am not sure if this will help, but attached is a diff
between dmesg for a kernel which still boots on SK8V,
i.e. 2.6.21-1.3228.fc7, and dmesg for 2.6.22.1-33.fc7 on another
x86_64 board (ASUS A8V Deluxe) where the latter happens to work.

That diff continues only to a section with "agpgart: Detected AGP bridge 0".
Later stuff does a disk detection and other things which are
quite radically different in both cases.

Once again, with 2.6.22.1-33.fc7 on SK8V a boot just sits there
after printing a line "agpgart: Detected AGP bridge 0".
Rawhide kernels are fine.

Comment 17 Martin Ebourne 2007-07-29 14:07:19 UTC
I'm getting the same hang after detecting the AGP bridge. Motherboard is a
Gigabyte GA-K8VT800M, VIA chipset and a Matrox G550 video card. Same kernel
versions apply. agp=off and it boots ok.

Comment 18 Michal Jaegermann 2007-08-01 02:35:49 UTC
kernel-2.6.22.1-41.fc7 fails to boot the same way as 2.6.22.1-27.fc7
and 2.6.22.1-33.fc7.

Comment 19 Chuck Ebbert 2007-08-09 18:54:02 UTC
*** Bug 251555 has been marked as a duplicate of this bug. ***

Comment 20 Michal Jaegermann 2007-08-11 21:34:05 UTC
Kernel kernel-2.6.22.1-32.fc6 is affected by the same disease.
Screenshot attached to bug 251555 shows a tail of an oops but
the whole thing stretches for something between two and three
screenfuls (if it does not lock up silently).

Comment 21 Michal Jaegermann 2007-08-11 22:51:29 UTC
Created attachment 161126
first screen from oops with 2.6.22.1-32.fc6

With the help of 'boot_delay=200' I took pictures of both exception
screens for 2.6.22.1-32.fc6.  It says "invalid opcode" at the beginning.

Comment 22 Michal Jaegermann 2007-08-11 22:52:17 UTC
Created attachment 161127
second screen from oops with 2.6.22.1-32.fc6

Comment 23 Tomas Kopecek 2007-08-13 16:16:22 UTC
Same situation:
Last working kernel: 2.6.21-1.3228.fc7
Current kernel: 2.6.22.1-41.fc7
The kernel hangs after printing the line (without the quiet parameter): agpgart: Detected AGP
bridge 0
What is interesting is that the debug version of the same kernel (2.6.22.1-41.fc7)
works without problems.

Comment 24 Michal Jaegermann 2007-08-13 18:51:03 UTC
I tried rebooting with kernel-2.6.22.2-52.fc7 (in koji at this moment).
SK8V used in testing is a single core and is recognized by this kernel
as such.

With 'boot_delay=200' I can see "agpgart: Detected AGP bridge 0"
line followed by an immediate reboot.  Without 'boot_delay' parameter
I stare at BIOS boot screens in an instant.  After such a boot attempt
the machine may, on some unpredictable occasions, hang in a reboot and
require a power-down.

kernel-2.6.23-0.101.rc2.git5.fc8 does boot on a test machine.

Comment 25 Chuck Ebbert 2007-08-13 19:12:56 UTC
Something changed in the e820 map. Does kernel option "numa=off" make a difference?

Comment 26 Tomas Kopecek 2007-08-13 19:35:35 UTC
(In reply to comment #25)
> Something changed in the e820 map. Does kernel option "numa=off" make a
> difference?

No difference for me, but I have a single-processor machine, so I think numa=off
should not have any impact at all.

Comment 27 Michal Jaegermann 2007-08-13 19:46:11 UTC
> Does kernel option "numa=off" make a difference?

Not the slightest one for me.  Not that surprising.  I already
mentioned that SK8V is a single core.

Does anybody see this on multi-core x86_64 machines?

Comment 28 Chuck Ebbert 2007-08-13 20:20:51 UTC
(In reply to comment #22)
> Created an attachment (id=161127) [edit]
> second screen from oops with 2.6.22.1-32.fc6

The running kernel's code is corrupted. Somehow it has been overwritten during
the AGP init phase, apparently.

From the dump:
Code: ff ff ff ff 00 00 00 00 00 00 00 00 90 e2 47 0f 00 81 ff

But the kernel should have, at that address:
pci_read():
/usr/src/debug/kernel-2.6.22/linux-2.6.22.x86_64/arch/i386/pci/common.c:32
ffffffff811f6388:       48 8b 05 19 7f 35 00    mov    3505945(%rip),%rax        # ffffffff8154e2a8 <raw_pci_ops>
/usr/src/debug/kernel-2.6.22/linux-2.6.22.x86_64/arch/i386/pci/common.c:31
ffffffff811f638f:       41 89 f2                mov    %esi,%r10d
/usr/src/debug/kernel-2.6.22/linux-2.6.22.x86_64/arch/i386/pci/common.c:32
ffffffff811f6392:       0f b6 b7 98 00 00 00    movzbl 0x98(%rdi),%esi
ffffffff811f6399:       4d 89 c1                mov    %r8,%r9
ffffffff811f639c:       31 ff                   xor    %edi,%edi
ffffffff811f639e:       41 89 c8                mov    %ecx,%r8d
ffffffff811f63a1:       89 d1                   mov    %edx,%ecx
ffffffff811f63a3:       44 89 d2                mov    %r10d,%edx
ffffffff811f63a6:       4c 8b 18                mov    (%rax),%r11
ffffffff811f63a9:       41 ff e3                jmpq   *%r11
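
The mismatch between the dump and the expected code can be checked mechanically. This is a minimal sketch in Python, assuming both byte strings are copied by hand from this comment (the oops "Code:" line and the opcode column of the listing); the variable names are illustrative only, not part of any kernel tool.

```python
# Bytes from the oops "Code:" line quoted in this comment.
dumped = bytes.fromhex("ff ff ff ff 00 00 00 00 00 00 00 00 90 e2 47 0f 00 81 ff")

# Opcode bytes of pci_read() copied from the listing, one instruction per line.
expected = bytes.fromhex(
    "48 8b 05 19 7f 35 00"   # mov    3505945(%rip),%rax
    " 41 89 f2"              # mov    %esi,%r10d
    " 0f b6 b7 98 00 00 00"  # movzbl 0x98(%rdi),%esi
    " 4d 89 c1"              # mov    %r8,%r9
    " 31 ff"                 # xor    %edi,%edi
    " 41 89 c8"              # mov    %ecx,%r8d
    " 89 d1"                 # mov    %edx,%ecx
    " 44 89 d2"              # mov    %r10d,%edx
    " 4c 8b 18"              # mov    (%rax),%r11
    " 41 ff e3"              # jmpq   *%r11
)

# Function length implied by the listing: the last instruction starts at
# ...63a9 and is 3 bytes long; the first one starts at ...6388.
listing_len = (0xffffffff811f63a9 + 3) - 0xffffffff811f6388

# Does the dumped sequence occur anywhere inside the expected code?
found = dumped in expected

print(f"expected bytes: {len(expected)}, listing length: {listing_len}")
print(f"dumped sequence found in pci_read(): {found}")
```

The check only says that the dumped bytes are not a slice of the listed function; it says nothing about what overwrote the memory or when.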


Comment 29 Chuck Ebbert 2007-08-13 20:23:30 UTC
(In reply to comment #27)
> > Does kernel option "numa=off" make a difference?
> 
> Not the slightest one for me.  Not that surprising.  I already
> mentioned that SK8V is a single core.

It is doing "fake" numa for single-CPU machines. There are some e820 patches in
2.6.23 for bugs in the e820 code that make it use invalid addresses for the fake
numa tables.

Comment 30 Michal Jaegermann 2007-08-13 23:27:03 UTC
Comparing sources for 2.6.22.1-32.fc6, which fails to boot,
and booting 2.6.23-0.101.rc2.git5.fc8 the only difference in
drivers/char/agp/amd64-agp.c is that in the first case there
is a call 'pci_read_config_byte(pdev, PCI_REVISION_ID, &rev_id);'
to get rev_id of u8 type and in the second one pdev->revision
is used instead of rev_id (in two places).  That difference is
due to patch-2.6.23-rc2.bz2.

How significant that is I do not know; possibly not very, as
earlier rawhide kernels based on 2.6.22 usually booted
just fine.

Comment 31 Michal Jaegermann 2007-08-13 23:45:19 UTC
Re comment 28: a picture posted by John Morris as
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=161005
from a panic on F7 (bug 251555) shows a saner looking code line.
Unfortunately the preceding screen is not there.

Comment 32 Callum Lerwick 2007-08-14 05:41:33 UTC
kernel-2.6.22.1-41.fc7.x86_64 fails in the same way on my m6805.

So it appears to be a problem with VIA chipsets. One wonders, do ANY x86_64
machines with VIA chipsets work?

My desktop machine with a Gigabyte GA-K8U motherboard, ULi M1689 chipset, boots
these kernels just fine. These two are the only x86_64 machines I have though.

I'll try numa=off.

Comment 33 Callum Lerwick 2007-08-14 06:42:42 UTC
Nope, numa=off does not appear to help at all on my m6805.

Comment 34 Michal Jaegermann 2007-08-15 22:19:27 UTC
I tried kernel-2.6.22.2-57.fc7.x86_64 from testing.  With a plain boot,
or with 'initcall_debug', the machine goes straight back to BIOS rebooting.
With 'initcall_debug boot_delay=150' the last two lines on screen are

Calling initcall 0xffffffff814291d6: pci_iommu_init+0x0/0x17()
agpgart: Detected AGP bridge 0

and after that it sits there completely frozen.

'agp=off' allows it to boot, as expected.  The next line after
the missing fragment about AGP is in this case:

ACPI: RTC can wake from S4

Comment 35 Chuck Ebbert 2007-08-15 22:26:15 UTC
(In reply to comment #31)
> Re comment 28: a picture posted by John Morris as
> https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=161005
> from a panic on F7 (bug 251555) shows a saner looking code line.

That's not code, it's ASCII text!



Comment 36 Michal Jaegermann 2007-08-15 23:13:47 UTC
> That's not code, it's ASCII text!

Hm, indeed.  You are right!

LOC: ERR: %10u
ar

Not even reversed as it happens with little-endian.

Maybe not so much "overwritten", but something reads from somewhere
other than where it expects to read?

Comment 37 Chuck Ebbert 2007-08-21 16:23:12 UTC
Kernel with a possible fix (e820 hole mapping) is in Koji:

http://koji.fedoraproject.org/koji/buildinfo?buildID=13938


Comment 38 Tomas Kopecek 2007-08-21 18:39:20 UTC
(In reply to comment #37)
> Kernel with a possible fix (e820 hole mapping) is in Koji:
> 
> http://koji.fedoraproject.org/koji/buildinfo?buildID=13938
> 

This kernel works for me. If you have any questions about configuration or
something else, feel free to ask.

Comment 39 Michal Jaegermann 2007-08-21 19:21:24 UTC
> Kernel with a possible fix (e820 hole mapping) is in Koji

Sorry!  It dies for me the same way as before; i.e. it prints
"agpgart: Detected AGP bridge 0" and nothing happens after that.

Here is the top of dmesg output after booting this kernel with
agp=off:

Linux version 2.6.22.3-61.fc7 (kojibuilder.redhat.com) (gcc
version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Thu Aug 16 13:23:49 EDT 2007
Command line: ro root=/dev/Vols/Vol04 agp=off 3
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001ff30000 (usable)
 BIOS-e820: 000000001ff30000 - 000000001ff40000 (ACPI data)
 BIOS-e820: 000000001ff40000 - 000000001fff0000 (ACPI NVS)
 BIOS-e820: 000000001fff0000 - 0000000020000000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 3200 used
Entering add_active_range(0, 256, 130864) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FA870, 0021 (r2 ACPIAM)
ACPI: XSDT 1FF30100, 003C (r1 A M I  OEMXSDT   9000323 MSFT       97)
ACPI: FACP 1FF30290, 00F4 (r3 A M I  OEMFACP   9000323 MSFT       97)
ACPI: DSDT 1FF303E0, 35D8 (r1  SK8V_ SK8V_013       13 MSFT  100000D)
ACPI: FACS 1FF40000, 0040
ACPI: APIC 1FF30390, 004A (r1 A M I  OEMAPIC   9000323 MSFT       97)
ACPI: OEMB 1FF40040, 003F (r1 A M I  OEMBIOS   9000323 MSFT       97)
Scanning NUMA topology in Northbridge 24
No NUMA configuration found

Any other information which could help?
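
The two add_active_range() lines in that dmesg can be cross-checked against the BIOS-e820 map. This is a rough sketch in Python, assuming 4 KiB pages and assuming that a usable range is converted to page frames by rounding its start up and its end down; that rounding is an assumption chosen because it reproduces the logged values, not a statement about the kernel source. The function name is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumed x86_64 base page size

# The e820 map exactly as printed in the dmesg above.
E820_LINES = """\
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001ff30000 (usable)
 BIOS-e820: 000000001ff30000 - 000000001ff40000 (ACPI data)
 BIOS-e820: 000000001ff40000 - 000000001fff0000 (ACPI NVS)
 BIOS-e820: 000000001fff0000 - 0000000020000000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
"""

PATTERN = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")


def usable_pfn_ranges(text):
    """Return (first_pfn, end_pfn) for every range marked usable."""
    ranges = []
    for match in PATTERN.finditer(text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if match.group(3) != "usable":
            continue
        first_pfn = -(-start // PAGE_SIZE)  # round the start up to a page
        end_pfn = end // PAGE_SIZE          # round the end down to a page
        ranges.append((first_pfn, end_pfn))
    return ranges


for first_pfn, end_pfn in usable_pfn_ranges(E820_LINES):
    print(f"usable page frames: {first_pfn} - {end_pfn}")
```

Under those assumptions the two usable ranges come out as page frames 0 to 159 and 256 to 130864, the same numbers as in the add_active_range() lines.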

Comment 40 Michal Jaegermann 2007-08-21 19:32:58 UTC
I tried, just in case, 2.6.22.3-61.fc7 with 'numa=off'.  This touches
AGP and I am immediately back to BIOS rebooting; even with
'boot_delay=200'.  Without 'boot_delay=...' and with 'numa=off'
I cannot even see what really happened.

Comment 41 Callum Lerwick 2007-08-23 12:26:30 UTC
Nope, no luck with 2.6.22.3-61.fc7 on the m6805 either.

Comment 42 Michal Jaegermann 2007-08-23 21:28:59 UTC
Checked kernel-2.6.22.4-66.fc7 which I found on koji.  That one
immediately reboots back to BIOS without or with various "regular" option
combinations I tried.  That with one exception; after 'boot_delay=200'
it got to "agpgart: Detected AGP bridge 0" and it sits there.

OTOH if I reduce the amount of memory available (like "mem=64M") then
pretty consistently I get panics which look like what is
in the attachment (id=161127) (comment #22) picture.  AFAICS the results
are the same with various values for 'mem=...' (all below 512M, which
happens to be the amount of memory in my test box).  Only this time
RIP is pci_read+0x3/0x24 and the "Code" line always comes out as

37 71 3d 33 27 57 f3 75 73 35 3d 27 0b 35 f3 57 77 35 33 3f

Not like that code from comment #28 but not an ASCII text either.
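
Whether a "Code" line is plain ASCII can be checked byte by byte. This is a small sketch in Python, assuming the twenty bytes are copied from the line above and treating 0x20 through 0x7e as the printable ASCII range; the helper name is illustrative.

```python
# Bytes from the "Code" line quoted in this comment.
code = bytes.fromhex("37 71 3d 33 27 57 f3 75 73 35 3d 27 0b 35 f3 57 77 35 33 3f")


def is_printable(byte):
    """True for printable 7-bit ASCII (space through tilde)."""
    return 0x20 <= byte <= 0x7e


# Render printable bytes as characters and everything else as a dot.
rendered = "".join(chr(b) if is_printable(b) else "." for b in code)

# Collect the bytes that rule out plain ASCII text.
non_printable = [f"{b:02x}" for b in code if not is_printable(b)]

print(rendered)
print("non-printable bytes:", non_printable)
```

Most of the bytes fall in the printable range, but 0xf3 (twice) and 0x0b do not, which fits the remark that this is not ASCII text either.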

As an extra attraction, boot parameters such as 'mem=64M boot_delay=150'
mean an instant reboot the moment AGP is touched.

Comment 43 Callum Lerwick 2007-08-27 19:28:42 UTC
Sigh, still no luck with 2.6.22.4-65.fc7

Comment 44 Callum Lerwick 2007-08-27 19:56:35 UTC
And no luck with 2.6.22.5-71.fc7 either.

Comment 45 Michal Jaegermann 2007-08-27 21:41:37 UTC
Created attachment 174641
dmesg from 'mem=510M' boot with 2.6.22.4-45.fc6

I get "Kernel panic - not syncing ..." right away with
kernel-2.6.22.4-45.fc6.  A "Code" line gets even more curious:

00 01 10 00 00 00 00 00 00 02 20 00 00 00 00 00 00 00 00 00

OTOH I can boot that kernel if I specify 'mem=510M' or some
smaller amount ('mem=64M' still boots fine).  With 'mem=511M' I get
a panic, although the "Code" line does not resemble anything seen
earlier.  Higher than 511M is not accepted (I have 512M on board)
and for results with no 'mem=xxx' - see above.

dmesg from 'mem=510M' boot is attached.

Keep in mind comment #42 where I tried similar trickery with
kernel-2.6.22.4-66.fc7 and still was unable to boot.

Comment 46 Michal Jaegermann 2007-08-27 22:07:34 UTC
I tried the same 'mem=510M' with kernel-2.6.22.5-71.fc7 grabbed
from koji and that boots for me as well. 'mem=511M', or no such
parameter,  and with 2.6.22.5-71.fc7 I have an instant reboot.

Does that mean something scribbles over memory it should not
touch?



Comment 47 Chuck Ebbert 2007-08-28 19:57:24 UTC
(In reply to comment #46)
> I tried the same 'mem=510M' with kernel-2.6.22.5-71.fc7 grabbed
> from koji and that boots for me as well. 'mem=511M', or no such
> parameter,  and with 2.6.22.5-71.fc7 I have an instant reboot.
> 

Using the boot_delay option, or serial console, can you get the first 30 lines
of output when booting without a "mem=" parameter (from the e820 info to the
"bootmem setup node 0..."?)

Comment 48 Michal Jaegermann 2007-08-28 22:19:18 UTC
> Using the boot_delay option, or serial console, can you get the
> first 30 lines of output ....

I am afraid that this turns out to be impossible.  That SK8V board
does not have a serial connector at all (or I would not bother
with pictures), BIOS does not have an option to change a number
of text lines on a screen and boot_delay is ineffective.  By the
latter I mean that the moment the first screen output shows up
on the monitor we are way past the fragment you would want to see,
so I cannot even take pictures.  Adding options like
'earlyprintk=vga' and/or 'initcall_debug' does not help.

All I can tell is that with 2.6.22.5-71.fc7 and with 'boot_delay=...'
the moment the code reaches AGP there is an immediate reboot.

Maybe somebody else has hardware which would allow catching
more and can reproduce something similar to my results?

BTW - dmesg produced by 2.6.22.4-45.fc6, as attached to comment
#45, and by 2.6.22.5-71.fc7, both with 'mem=510M', do not differ
that much where it counts.  In particular e820 map is the same.

That map does not differ also from the one for 2.6.21-1.3228.fc7
which happens to be the last F7 kernel which boots on that board
without any "heroic efforts".  There are some differences in an
initial setup though.  Here:

--- dmesg.2.6.21-1.3228.fc7     2007-08-28 15:53:01.000000000 -0600
+++ dmesg.2.6.22.5-71.fc7       2007-08-27 16:08:42.000000000 -0600
@@ -21,26 +21,24 @@
 ACPI: APIC 1FF30390, 004A (r1 A M I  OEMAPIC   9000323 MSFT       97)
 ACPI: OEMB 1FF40040, 003F (r1 A M I  OEMBIOS   9000323 MSFT       97)
 Scanning NUMA topology in Northbridge 24
-Number of nodes 1
-Node 0 MemBase 0000000000000000 Limit 000000001ff30000
+No NUMA configuration found
+Faking a node at 0000000000000000-000000001fe00000
 Entering add_active_range(0, 0, 159) 0 entries of 3200 used
-Entering add_active_range(0, 256, 130864) 1 entries of 3200 used
-NUMA: Using 63 for the hash shift.
-Using node hash shift of 63
-Bootmem setup node 0 0000000000000000-000000001ff30000
+Entering add_active_range(0, 256, 130560) 1 entries of 3200 used
+Bootmem setup node 0 0000000000000000-000000001fe00000
.....

Current rawhide kernels, which boot, also have
"No NUMA configuration found", but with:

Faking a node at 0000000000000000-000000001ff30000
....
Bootmem setup node 0 0000000000000000-000000001ff30000

and these happen to be the same addresses as in 2.6.21-1.3228.fc7
and not in 2.6.22.5-71.fc7.  Still, looking at the results of
booting 2.6.23-0.142.rc3.git10.fc8 with and without 'mem=510M',
that difference is really a result of this option.

dmesg for 2.6.21-1.3228.fc7 was already attached to comment #14.
Different board, but it looks very similar to what I see.
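
The 000000001fe00000 limit in the 2.6.22.5-71.fc7 log lines up with the 'mem=510M' boot parameter. This is a short sketch in Python of that arithmetic, assuming 4 KiB pages and assuming that 'mem=' simply caps the top of usable memory at the given number of MiB; the real kernel logic may differ, and the names below are invented for illustration.

```python
PAGE_SIZE = 4096            # assumed x86_64 base page size
TOP_OF_USABLE = 0x1ff30000  # end of the last usable e820 range on this board


def clamp_end(mem_mib):
    """End address of usable memory after a hypothetical mem=<N>M cap."""
    limit = mem_mib << 20  # MiB to bytes
    return min(limit, TOP_OF_USABLE)


for mib in (510, 511, 512):
    end = clamp_end(mib)
    print(f"mem={mib}M -> end {end:#018x}, end pfn {end // PAGE_SIZE}")
```

With 510 the cap lands on 0x1fe00000, page frame 130560, the values in the "Faking a node" and add_active_range() lines of the diff. With no cap the end stays at 0x1ff30000, page frame 130864, as in the 2.6.21 log.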


Comment 49 Michal Jaegermann 2007-09-24 16:54:56 UTC
No changes with 2.6.22.5-49.fc6 and 2.6.22.5-76.fc7. That means
that I can boot if I use 'mem=510M' in boot parameters;
otherwise it bombs if 'agp=off' is not there.

Comment 50 Callum Lerwick 2007-10-01 01:26:26 UTC
Okay, with recent kernels I seem to get a backtrace before it reboots. By
setting mem=256M it will lock up instead, so I was able to get a picture of it.
vga=6 got the entire backtrace on screen.

mem=510M seems to work however! (The machine has 512MB in it.) Hurray, I can once
again get stable wireless and DRI at the same time...

Comment 51 Callum Lerwick 2007-10-01 01:28:33 UTC
Created attachment 211951
Screenshot of kernel-2.6.22.9-91.fc7.x86_64 backtrace

Comment 52 Michal Jaegermann 2007-10-01 02:48:53 UTC
In comment #50 by Callum Lerwick:
> By setting mem=256M it will lock up instead
Did you try mem=254M or somewhat less?

Comment 53 Michal Jaegermann 2007-10-22 19:56:50 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=336281#c6 for a possible
workaround, which may give DRI, if you have to boot with 'agp=off'.

Comment 54 Chuck Ebbert 2007-10-24 22:43:43 UTC
Does kernel option "numa=fake=1" make any difference?


Comment 55 Michal Jaegermann 2007-10-24 23:19:00 UTC
> Does kernel option "numa=fake=1" make any difference?
When trying with kernel-2.6.23.1-31.fc8, which for me consistently
gets stuck after showing "agpgart: Detected AGP bridge 0", the
first test with "numa=fake=1" caused an instant reboot.  But attempts
to repeat that later with extra options like "boot_delay=150"
were just getting stuck again in the same place.

Comment 56 Martin Ebourne 2007-10-28 18:45:08 UTC
This is the broken commit that causes this bug:

commit 2e1c49db4c640b35df13889b86b9d62215ade4b6
Author: Zou Nan hai <nanhai.zou>
Date:   Fri Jun 1 00:46:28 2007 -0700

    x86_64: allocate sparsemem memmap above 4G
    
    On systems with huge amount of physical memory, VFS cache and memory memmap
    may eat all available system memory under 4G, then the system may fail to
    allocate swiotlb bounce buffer.
    
    There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose
    not cover sparsemem model.
    
    This patch add fix to sparsemem model by first try to allocate memmap above
    4G.
    
    Signed-off-by: Zou Nan hai <nanhai.zou>
    Acked-by: Suresh Siddha <suresh.b.siddha>
    Cc: Andi Kleen <ak>
    Cc: <stable>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>


Comment 57 Callum Lerwick 2007-10-28 20:26:40 UTC
I got a chance to experiment:

512M = "Error 28: Selected item cannot fit into memory"
511M = no boot
510M = boot
509M = no boot
508M = no boot
507M = no boot
506M = boot

256M = no boot
255M = boot

128M = no boot
127M = boot

96M = no boot
95M = boot

64M = no boot
63M = boot

32M = no boot
31M = boots, but the OOM killer kills the initrd. :)
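
The results above can be tabulated to see where booting flips. This is only an illustrative sketch: the dictionary copies the mem= values reported in this comment (512M is left out because GRUB rejected it before the kernel ran), and the variable names are made up.

```python
# Sketch: tabulate the mem=<N>M experiment reported in this comment.
# True means the kernel booted, False means it did not.
# 31M is recorded as booting even though the OOM killer then killed the initrd.
results = {
    511: False, 510: True, 509: False, 508: False, 507: False, 506: True,
    256: False, 255: True,
    128: False, 127: True,
    96: False, 95: True,
    64: False, 63: True,
    32: False, 31: True,
}

# Pairs (fails, boots) where dropping the limit by 1 MiB made the kernel boot.
flips = [
    (m, m - 1)
    for m in sorted(results)
    if not results[m] and results.get(m - 1) is True
]

for fails, boots in flips:
    print(f"mem={fails}M fails, mem={boots}M boots")
```

In this data every tested multiple of 32 MiB fails while the value 1 MiB below it boots; the values near 512 MiB (506 to 511) do not follow such a simple pattern.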

Comment 58 Martin Ebourne 2007-10-28 20:54:13 UTC
Created attachment 240991 [details]
Patch to revert broken commit

This is tested against 2.6.23.1.

Comment 59 Martin Ebourne 2007-10-28 21:02:42 UTC
I've rebuilt the latest Fedora 8 kernel with the above patch included. Works
just great here.

http://mebourne.fedorapeople.org/kernel-2.6.23.1-37.bz249174.src.rpm
http://mebourne.fedorapeople.org/kernel-2.6.23.1-37.bz249174.x86_64.rpm

Of course, none of these machines have huge amounts of RAM, mine only has 512MB.
I suspect that the original 'fix' breaks Fedora for more people than it fixes
anything for. (I for one have been unable to upgrade two of my machines from FC6
to Fedora 7, and certainly don't want to miss out on Fedora 8 as well.)

Comment 60 Michal Jaegermann 2007-10-28 21:52:47 UTC
I can confirm that my test machine boots fine with
http://mebourne.fedorapeople.org/kernel-2.6.23.1-37.bz249174.x86_64.rpm
kernel and it will get stuck, unless agp=off is used, with
2.6.23.1-37.fc8.  kernel-2.6.23.1-37.bz249174 was configured with
debugging off, right?

Looking at the patch in question, it seems to me that it is pure
dumb luck that various x86_64 boxes can boot with this patch when some
workarounds are used (manipulating memory amounts, agp=off).  OTOH
just reverting the patch will break what it was supposed to fix
in the first place ("... VFS cache and memory memmap may eat all
available system memory under 4G, then the system may fail to
allocate swiotlb bounce buffer").



Comment 61 Rahul Sundaram 2007-10-28 22:17:26 UTC
We have a confirmed working patch. Adding it as a Fedora 8 blocker for review.

Comment 62 Rahul Sundaram 2007-10-29 01:22:34 UTC
*** Bug 338551 has been marked as a duplicate of this bug. ***

Comment 63 Callum Lerwick 2007-10-29 04:02:23 UTC
*** Bug 336281 has been marked as a duplicate of this bug. ***

Comment 64 Callum Lerwick 2007-10-29 04:09:03 UTC
So, is there an F7 update I can test? :)

336281 is an AMD chipset. So it seems this isn't VIA only. I wonder if the
reason my ULi M1689 based desktop works is because it has 2.25GB RAM.

Comment 65 Martin Ebourne 2007-10-29 09:11:52 UTC
Try the F8 kernel above, should work on F7.

Comment 66 Mike A. Harris 2007-10-29 18:23:23 UTC
My problem (bug #336281 / Fedora 7) was on an AMD Solo motherboard with AMD chipset.

Is there an official Fedora project built test kernel with the fixes mentioned
in comment #59 available?  If so, I'd be happy to test it with Fedora 7 if there
are no F8 userland deps.

TIA

Comment 67 Jeremy Katz 2007-10-29 19:43:04 UTC
Patch added and will be building shortly...

Comment 68 Michal Jaegermann 2007-10-30 02:47:01 UTC
kernel-2.6.23.1-41.fc8 from koji boots rawhide on my test machine without any
extra options.  Also Xorg works, and it is using DRI, without a need to force
the bus to PCI.

I should note that the same kernel also works as above for an F7
installation.  Not that surprising, as this is the same hardware with only
different disk partitions, but I checked that just to be sure.

Comment 69 Michal Jaegermann 2007-11-05 20:49:29 UTC
kernel-2.6.22.11-68.fc6 ('updates-testing' at this moment) boots
for me as expected.