Bug 578108

Summary: kernel 2.6.32 cannot switch to graphic mode (nVidia with nouveau)
Product: [Fedora] Fedora
Reporter: Éric Brunet <eric.brunet>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 12
CC: anton, dougsland, ermanaricus, francis.x.dolan.iv, gansalmon, guillaume.lelaurain, itamar, jonathan, kernel-maint
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2010-06-17 07:43:39 EDT
Bug Blocks: 507684    
Attachments:
  Description: dmesg of faulty kernel
  Flags: none

Description Éric Brunet 2010-03-30 05:05:12 EDT
Starting with kernel 2.6.32 (up to 2.6.32.10-90), the system can no longer start X.
After the normal boot sequence, at the point where X should start, the computer restarts (as if the reset button had been pressed) and the BIOS displays its splash screen. The BIOS is then so confused that it cannot complete POST and boot; instead it loops: while (1) { after a few seconds of splash screen, the screen blinks and the splash screen reappears; } It looks like the GPU is left in such a poor state that even the BIOS can no longer reset it. I have to shut down the computer completely before I can boot again.

In /var/log, the files Xorg.0.log and dmesg are created but are empty. The file
messages contains the lines

Mar 30 10:33:52 localhost kernel: imklog 4.4.2, log source = /proc/kmsg started.
Mar 30 10:33:52 localhost rsyslogd: [origin software="rsyslogd" swVersion="4.4.2" x-pid="1218" x-info="http://www.rsyslog.com"] (re)start
Mar 30 10:33:52 localhost kernel: Initializing cgroup subsys cpuset
Mar 30 10:33:52 localhost kernel: Initializing cgroup subsys cpu
Mar 30 10:33:52 localhost kernel: Linux version 2.6.32.10-90.fc12.x86_64 (mockbuild@x86-04.phx2.fedoraproject.org) (gcc version 4.4.3

(Notice how the last line is truncated.)

Everything works well up to kernel-2.6.31.12-174.2.22.

The system is an HP Z400 in x86_64 mode, with an up-to-date F12.

xorg-x11-drv-nouveau-0.0.15-21.20091105gite1c2efd.fc12.x86_64
xorg-x11-server-Xorg-1.7.6-1.fc12.x86_64

0f:00.0 VGA compatible controller: nVidia Corporation Device 0659 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: nVidia Corporation Device 063a
        Physical Slot: 2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at e2000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at e0000000 (64-bit, non-prefetchable) [size=32M]
        Region 5: I/O ports at e000 [size=128]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <4us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntrySize=0
                Arb:    Fixed- WRR32- WRR64- WRR128- 100ns- - - onfig- TableOffset=0
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Fixed- RR32-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nouveau
        Kernel modules: nouveau, nvidiafb


This report might be related to bugs 571741, 571058 and 577431, but I am not sure, as the symptoms are not the same: I never reach a login screen. (By the way, I use kdm.)
Comment 1 Éric Brunet 2010-04-12 12:44:37 EDT
Created attachment 406011 [details]
dmesg of faulty kernel

I just want to confirm that the bug is still present in kernel-2.6.32.11-99.fc12.x86_64, with exactly the same symptoms: the system works correctly in runlevel 3 (see attached dmesg) but reboots violently when X is started. When that happens, the BIOS is unable to restart the machine until there is a cold boot.

Is there anything I can do to help track down this bug?

Thanks,
Comment 2 Éric Brunet 2010-05-06 08:26:51 EDT
Today I tried to upgrade my F12 installation with the following packages from F13:

kernel-2.6.33.1-24.fc13.x86_64.rpm
libdrm-2.4.19-1.fc13.x86_64.rpm
udev-151-7.fc13.x86_64.rpm
xorg-x11-drv-nouveau-0.0.16-2.20100218git2964702.fc13.x86_64.rpm
xorg-x11-server-common-1.7.99.902-2.20100319.fc13.x86_64.rpm
xorg-x11-server-Xorg-1.7.99.902-2.20100319.fc13.x86_64.rpm
xorg-x11-drv-fbdev-0.4.1-3.fc13.x86_64.rpm
linux-firmware-20100106-4.fc14.noarch
grubby-7.0.15-1.fc14.x86_64

(The last two are fc14 because I also tried rawhide.)

The result is the same as above: the computer reboots violently into a sorry state when X is started.

I also tried a vanilla 2.6.33.3 kernel with the F12 packages: same problem.

Does that mean I am stuck forever with F12 and kernel 2.6.31?

What should I do?
Comment 3 Éric Brunet 2010-05-17 09:23:48 EDT
I have just tried to boot the LiveCD x86_64 (KDE) for Fedora 13 beta.

The same problem occurs. When starting X, the computer reboots into a state where the BIOS itself is helpless.

I'd like to stress that HP has been, for four years now, the official supplier of computers for France's most important research organization (CNRS), and it would really be a pity if Fedora could not be installed on their computers. What happened between kernels 2.6.31 and 2.6.32?

As the problem is present with Fedora 13 beta, I'll try to add it to the Fedora 13 target.
Comment 4 Éric Brunet 2010-05-21 07:44:15 EDT
I upgraded my F12 today to kernel-2.6.32.12-115.fc12.x86_64, and the problem is still here. Back to 2.6.31.

So, to sum up: the bug appeared in an F12 update and is still present in F13 beta.

I have no idea how to debug this.
Comment 5 Guillaume Lelaurain 2010-05-21 10:43:38 EDT
(In reply to comment #4)
> I upgraded my f12 today to kernel-2.6.32.12-115.fc12.x86_64, and the problem is
> still here. Back to 2.6.31.
> 
> So to sum up; the bug appeared in a F12 update, and is still present in
> F13beta.
> 
> I have no idea how to debug this.    

I have the same problem (and bought my Z400 through the same procurement contract: French EPST).

When I force the kernel option intel_iommu=off, the workstation works in graphical mode.

This holds with either the nouveau or the kmod-nvidia driver on kernel-2.6.32.12-115.fc12.x86_64.

Regards,
Guillaume
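
To check whether a given boot actually carries the workaround, one option is to look at the kernel command line. The following is only a minimal sketch: the helper names (kernel_params, intel_iommu_disabled, read_cmdline) are made up for illustration, and it assumes the running kernel exposes its boot parameters in /proc/cmdline.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' map to None. Quoted values that contain
    spaces are not handled by this simple sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def intel_iommu_disabled(cmdline: str) -> bool:
    """Return True if the command line carries intel_iommu=off."""
    return kernel_params(cmdline).get("intel_iommu") == "off"


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the parameters the running kernel was booted with."""
    return Path(path).read_text()
```

On an affected machine, intel_iommu_disabled(read_cmdline()) should return True only when the kernel was booted with the option.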
Comment 6 Éric Brunet 2010-05-25 09:10:52 EDT
I confirm that with the intel_iommu=off switch I can boot and enjoy kernel 2.6.32.12-115.

For information, dmesg shows the following with both kernels (2.6.31 without the switch and 2.6.32 with the switch):

DMAR: Host address width 39
DMAR: DRHD base: 0x000000fed90000 flags: 0x1
IOMMU fed90000: ver 1:0 cap c90780106f0462 ecap f02076
DMAR: RMRR base: 0x000000cefd0000 end: 0x000000cefd0fff
DMAR: RMRR base: 0x000000cefd1000 end: 0x000000cefd1fff
DMAR: RMRR base: 0x000000cefd2000 end: 0x000000cefd2fff
DMAR: RMRR base: 0x000000cefd3000 end: 0x000000cefd3fff
DMAR: RMRR base: 0x000000cefd4000 end: 0x000000cefd4fff
DMAR: RMRR base: 0x000000cefd5000 end: 0x000000cefd5fff
DMAR: RMRR base: 0x000000cefd6000 end: 0x000000cefd6fff
DMAR: RMRR base: 0x000000cefd7000 end: 0x000000cefd7fff
DMAR: ATSR flags: 0x0

With the old kernel (2.6.31) without the switch, I also have:

IOMMU 0xfed90000: using Queued invalidation
IOMMU: hardware identity mapping for device 0000:0f:00.0
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:1a.2 [0xcefd7000 -0xcefd8000]
IOMMU: Setting identity map for device 0000:00:1a.1 [0xcefd6000 - 0xcefd7000]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xcefd5000 - 0xcefd6000]
IOMMU: Setting identity map for device 0000:00:1d.2 [0xcefd4000 - 0xcefd5000]
IOMMU: Setting identity map for device 0000:00:1d.1 [0xcefd3000 - 0xcefd4000]
IOMMU: Setting identity map for device 0000:00:1d.0 [0xcefd2000 - 0xcefd3000]
IOMMU: Setting identity map for device 0000:00:1a.7 [0xcefd1000 - 0xcefd2000]
IOMMU: Setting identity map for device 0000:00:1d.7 [0xcefd0000 - 0xcefd1000]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O

I am not sure what all of this means. I note that the end addresses differ by one between the two sections. The troublesome video card is 0f:00.0.
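
The off-by-one remark can be checked mechanically. The sketch below parses a few of the lines quoted above and, for every base address that appears in both sections, computes the identity-map end minus the RMRR end. The function names and the SAMPLE excerpt are illustrative only. A constant difference of 1 would be consistent with one section printing the last address of the region (inclusive) and the other printing the first address past it (exclusive), but that reading is an assumption, not something confirmed in this report.

```python
import re

RMRR_RE = re.compile(r"DMAR: RMRR base: 0x([0-9a-fA-F]+) end: 0x([0-9a-fA-F]+)")
# Tolerates the missing space after the dash seen in one of the quoted lines.
IDMAP_RE = re.compile(
    r"IOMMU: Setting identity map for device (\S+) "
    r"\[0x([0-9a-fA-F]+)\s*-\s*0x([0-9a-fA-F]+)\]"
)


def parse_rmrr(log: str) -> dict:
    """Map each RMRR base address to its reported end address."""
    return {int(b, 16): int(e, 16) for b, e in RMRR_RE.findall(log)}


def parse_identity_maps(log: str) -> dict:
    """Map each identity-map start address to (device, end address)."""
    return {int(s, 16): (dev, int(e, 16)) for dev, s, e in IDMAP_RE.findall(log)}


def end_offsets(log: str) -> dict:
    """For each base present in both sections: identity-map end minus RMRR end."""
    rmrr = parse_rmrr(log)
    idmap = parse_identity_maps(log)
    return {
        hex(base): idmap[base][1] - end
        for base, end in rmrr.items()
        if base in idmap
    }


# Excerpt of the lines quoted in this comment.
SAMPLE = """\
DMAR: RMRR base: 0x000000cefd0000 end: 0x000000cefd0fff
DMAR: RMRR base: 0x000000cefd7000 end: 0x000000cefd7fff
IOMMU: Setting identity map for device 0000:00:1a.2 [0xcefd7000 -0xcefd8000]
IOMMU: Setting identity map for device 0000:00:1d.7 [0xcefd0000 - 0xcefd1000]
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]
"""

print(end_offsets(SAMPLE))  # {'0xcefd0000': 1, '0xcefd7000': 1}
```

The LPC mapping for 0000:00:1f.0 has no matching RMRR entry, so it is left out of the comparison.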
Comment 7 Francis Dolan 2010-06-14 15:00:58 EDT
I'm seeing it on an HP Z600, though I'm not using the nouveau driver.
I had been installing updates but had not rebooted in a while.
On the first reboot I hit this problem. I tried booting without
the GUI and had no problems. When I manually start X, I get a segfault and a reboot.


The kernel version is 2.6.32.12-115.fc12.x86_64.
In dmesg I see:

nvidia: module license 'NVIDIA' taints kernel.
Disabling lock debugging due to kernel taint

A few lines down:

nvidia 0000:0f:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
nvidia 0000:0f:0.0: setting latency timer to 64
vgaarb: device changed decodes: PCI:0000:0f:0.0,olddecodes=io+mem,decodes=none: owns=io+mem

A few more lines down:

NVRM: loading NVIDIA UNIX x86_64 Kernel Module 195.36.24 Thu Apr 22 19:10:14 PDT 2010
nvidia-config-d[1429]: segfault at 7f8f2c000000 ip 00000039a9a79d3c sp 00007fff72a0b3c8 error 4 in libc-2.11.2.so[39a9a00000+170000]


I tried intel_iommu=off and am still getting the nvidia segfault, but the system now comes up in init 5.


I also tried installing from the Fedora-13-x86_64-DVD, and the machine rebooted when the installer tried to bring up the GUI (the DVD image was downloaded on 6-9-2010). I didn't try it
with intel_iommu=off (I didn't want to upgrade unless I had to).
Comment 8 Éric Brunet 2010-06-17 07:43:39 EDT
The bug has disappeared with 2.6.33.5-124.fc13.x86_64;
I can now boot without intel_iommu=off.

I suspect that this bug is another manifestation of bug 561267, for which a fix went into the aforementioned kernel.