Bug 572517

Summary:	kernel 2.6.32: X does not start on GeForce 9200M GS with Nvidia proprietary driver
Product:	[Fedora] Fedora	Reporter:	Serguei Miridonov <mirsev>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	low
Version:	12	CC:	anton, bohacpetr, dougsland, fkoliver2, gansalmon, gwu, itamar, jesse.brandeburg, jonathan, kernel-maint, kmcmartin, neumann, poncho, rick.hendricksen, robin, rwahl, smithj4
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-12-03 17:33:25 UTC	Type:	---

Description Serguei Miridonov 2010-03-11 12:43:03 UTC

Description of problem:

After upgrading to kernel-2.6.32.9-67 or kernel-2.6.32.9-70, X cannot start.

Version-Release number of selected component (if applicable):

Hardware: HP Pavilion dv5. More info is available at Smolts web page:

http://www.smolts.org/client/show/pub_185f33b5-e602-4fae-8714-fbc22a26e63f

Kernels:

kernel-2.6.32.9-67.fc12.i686 or kernel-2.6.32.9-70.fc12.i686
akmod-nvidia-195.36.08-1.fc12.i686
xorg-x11-drv-nvidia-libs-195.36.08-1.fc12.i686
xorg-x11-drv-nvidia-195.36.08-1.fc12.i686
xorg-x11-server-Xorg-1.7.5.901-1.fc12.i686

How reproducible: Always

Steps to Reproduce:

1. Install 2.6.32 kernel from Fedora update
2. Reboot
3. Wait for X to try to start
  
Actual results: Black screen, non-responsive keyboard.

Expected results: kdm login screen

Additional info:

/var/log/messages is flooded with

DRHD: handling fault status reg 2
DMAR:[DMA Read] Request device [01:00.0] fault addr 337b4000 
DMAR:[fault reason 01] Present bit in root entry is clear
NVRM: Xid (0001:00): 54, CMDre 00000000 00000000 00000000 00000001 00000001

Also a strange string appears right at the beginning of kernel boot:

ehci_hcd 0000:00:1d.7: dma_pool_free ehci_qh, c112c060/fffff060 (bad dma)

That did not happen before with 2.6.31 kernels.

Kernel command line:

rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us nouveau.modeset=0 rdblacklist=nouveau

I tried to add iommu=soft to this line for kernel 2.6.32 but that does not help.
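
To confirm that an option such as iommu=soft actually reached the running kernel, the command line can be read back from /proc/cmdline and checked. The following is a minimal sketch; the helper names `parse_cmdline` and `has_option` are made up for illustration and are not part of any Fedora tool.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def has_option(cmdline, key, value):
    """Return True if `key=value` is present on the given command line."""
    return parse_cmdline(cmdline).get(key) == value


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the currently running kernel.
    with open("/proc/cmdline") as f:
        running = f.read()
    print("iommu=soft active:", has_option(running, "iommu", "soft"))
```

If the option does not show up there, the boot loader entry that was edited is probably not the one that was booted.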

Comment 1 Ronald Wahl 2010-03-11 18:36:48 UTC

Probably only large displays (e.g. larger than 1280x1024) are affected.

Comment 2 Serguei Miridonov 2010-03-11 18:58:59 UTC

Well, mine is 1680x1050...

Comment 3 Kyle McMartin 2010-03-12 02:18:08 UTC

Try intel_iommu=off.

Comment 4 Walter Neumann 2010-03-13 08:04:17 UTC

I have an identical problem with kernels 2.6.32.9-67.fc12.x86_64 and 2.6.32.9-70.fc12.x86_64. "intel_iommu=off" does not help.

As soon as X starts, the screen goes black, wireless stops working, and /var/log/messages fills with

Mar  7 20:58:22 lcn kernel: DMAR:[DMA Read] Request device [01:00.0] fault addr 119db6000 
Mar  7 20:58:22 lcn kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Mar  7 20:58:22 lcn kernel: DRHD: handling fault status reg 2
Mar  7 20:58:22 lcn kernel: DMAR:[DMA Read] Request device [01:00.0] fault addr 119db6000 
Mar  7 20:58:22 lcn kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Mar  7 20:58:22 lcn kernel: DRHD: handling fault status reg 2

This continues for about 6 minutes before a total crash.

Here is the start of the same crash when I was next in the same location as my computer:

Mar 12 23:02:24 lcn abrtd: Non-processed crash in /var/cache/abrt/kerneloops-1268013723-3, saving into database
Mar 12 23:02:24 lcn abrtd: RunApp('/var/cache/abrt/kerneloops-1268013723-3','test x"`cat component`" = x"xorg-x11-server-Xorg" && cp /var/log/Xorg.0.log .')
Mar 12 23:02:24 lcn abrtd: Getting local universal unique identification
Mar 12 23:02:25 lcn smbd[2240]: [2010/03/12 23:02:25,  0] smbd/server.c:457(smbd_open_one_socket)
Mar 12 23:02:25 lcn smbd[2240]:   smbd_open_once_socket: open_socket_in: Address already in use
Mar 12 23:02:25 lcn smbd[2240]: [2010/03/12 23:02:25,  0] smbd/server.c:457(smbd_open_one_socket)
Mar 12 23:02:25 lcn smbd[2240]:   smbd_open_once_socket: open_socket_in: Address already in use
Mar 12 23:02:25 lcn abrtd: Crash is in database already (dup of /var/cache/abrt/kerneloops-1268013723-4)
Mar 12 23:02:25 lcn abrtd: Done checking for unsaved crashes
Mar 12 23:02:25 lcn abrtd: Init complete, entering main loop
Mar 12 23:02:29 lcn kernel: CE: hpet increasing min_delta_ns to 15000 nsec
Mar 12 23:02:29 lcn kernel: CE: hpet increasing min_delta_ns to 22500 nsec
Mar 12 23:02:31 lcn kernel: DRHD: handling fault status reg 3
Mar 12 23:02:31 lcn kernel: DMAR:[DMA Read] Request device [01:00.0] fault addr 110199000 
Mar 12 23:02:31 lcn kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Mar 12 23:02:32 lcn kernel: CE: hpet increasing min_delta_ns to 33750 nsec
Mar 12 23:02:36 lcn kernel: DRHD: handling fault status reg 2
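
Because the log floods with near-identical lines, it can help to reduce it to counts per device and per fault reason before comparing machines. The following is a rough sketch whose regular expressions are modeled only on the lines quoted in this report; other kernel versions may format these messages differently, and the function name `summarize_dmar_faults` is made up for illustration.

```python
import re
from collections import Counter

# Patterns modeled on the DMAR lines quoted in this report (an assumption,
# not a guaranteed kernel log format).
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device "
    r"\[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr ([0-9a-f]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (\d+)\] (.+)")


def summarize_dmar_faults(lines):
    """Count DMAR faults per (device, access type) and per (reason number, text)."""
    by_device = Counter()
    by_reason = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            access, device, _addr = m.groups()
            by_device[(device, access)] += 1
            continue
        m = REASON_RE.search(line)
        if m:
            by_reason[(int(m.group(1)), m.group(2).strip())] += 1
    return by_device, by_reason
```

It can be fed an open file object, for example `summarize_dmar_faults(open("/var/log/messages"))`, which usually requires root privileges.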

Comment 5 Serguei Miridonov 2010-03-14 18:26:00 UTC

(In reply to comment #3)
> Try intel_iommu=off.    

OK, on my notebook this option works. Thank you. Could somebody comment on what changed in the kernel between 2.6.31 and 2.6.32? Is there anything to change in the BIOS or somewhere else to fix it completely? Is it a problem with the CPU, or some other hardware issue?

Comment 6 Ronald Wahl 2010-03-15 19:23:31 UTC

The intel_iommu=off option works here too. My observation that only displays larger than 1280x1024 are affected proved wrong; I did some tests on more machines today. So it looks like support for I/O virtualization is the difference that explains why it does not work for some people. So the interesting question is: is this a bug in the kernel, in X11, or in both?

Comment 7 Kyle McMartin 2010-03-16 16:27:09 UTC

It's likely a bug in the nvidia kernel driver, if you're using it... It's responsible for setting up the DMA mappings, and it appears to be using them incorrectly resulting in the IOMMU catching an illegal access. (It's like your hardware looking up a null pointer.)

I'm going to mark this NOTABUG, and we can release note turning off the IOMMU if you want to install the nvidia driver.
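
The "null pointer" analogy can be illustrated with a toy model of the lookup an IOMMU performs for each device access. This is a heavily simplified sketch and not how the kernel or the hardware is implemented: the class and method names are invented, the tables are plain dictionaries, and the text for fault reason 02 is an assumption, since only reasons 01 and 06 are quoted in this report.

```python
class DmaFault(Exception):
    """Raised when the toy IOMMU refuses to translate a device access."""

    def __init__(self, reason, text):
        super().__init__(f"fault reason {reason:02d}: {text}")
        self.reason = reason


class ToyIommu:
    """Toy model: bus -> devfn -> {I/O page number: (physical page, readable)}."""

    PAGE_SHIFT = 12  # 4 KiB pages

    def __init__(self):
        # A missing bus key stands for a root entry whose present bit is clear.
        self.root = {}

    def map_page(self, bus, devfn, io_addr, phys_addr, readable=True):
        """Stand-in for what a driver's DMA-mapping call asks the kernel to set up."""
        pages = self.root.setdefault(bus, {}).setdefault(devfn, {})
        pages[io_addr >> self.PAGE_SHIFT] = (phys_addr >> self.PAGE_SHIFT, readable)

    def dma_read(self, bus, devfn, io_addr):
        """Translate a device read, or raise DmaFault for an unmapped access."""
        if bus not in self.root:
            raise DmaFault(1, "Present bit in root entry is clear")
        if devfn not in self.root[bus]:
            # Reason 02 text is assumed; it does not appear in this report.
            raise DmaFault(2, "Present bit in context entry is clear")
        entry = self.root[bus][devfn].get(io_addr >> self.PAGE_SHIFT)
        if entry is None or not entry[1]:
            raise DmaFault(6, "PTE Read access is not set")
        phys_page, _readable = entry
        offset = io_addr & ((1 << self.PAGE_SHIFT) - 1)
        return (phys_page << self.PAGE_SHIFT) | offset
```

In this model, a read from device 01:00.0 before anything has been mapped for bus 01 raises fault reason 01, which matches the message in the logs above. Once `map_page` has been called for that address, the same read translates normally.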

Comment 8 Serguei Miridonov 2010-03-16 17:04:30 UTC

(In reply to comment #7)
> It's likely a bug in the nvidia kernel driver, if you're using it... It's
> responsible for setting up the DMA mappings, and it appears to be using them
> incorrectly resulting in the IOMMU catching an illegal access. (It's like your
> hardware looking up a null pointer.)
> 
> I'm going to mark this NOTABUG, and we can release note turning off the IOMMU
> if you want to install the nvidia driver.  

Yes, I use nvidia driver because I need VDPAU and 3D.  

But, please, wait. There is also a strange string that appears at the moment when the kernel starts to boot:

ehci_hcd 0000:00:1d.7: dma_pool_free ehci_qh, c112c060/fffff060 (bad dma)

It is not from nvidia driver.

This string appears neither with 2.6.31 kernels nor with 2.6.32 with option intel_iommu=off.

Also, if it is an nvidia driver bug, then why doesn't it show up with earlier kernels?

Comment 9 Kyle McMartin 2010-03-16 17:12:45 UTC

It's a new feature in recent kernels. Nvidia needs to update their drivers to correctly do DMA.

Comment 10 Garrison Wu 2010-04-12 16:12:37 UTC

NVIDIA believes it unlikely that the problem is the result
of a bug in the NVIDIA kernel module. Both GT200GL and
G71GL, the GPUs on which the Quadro FX4800 and Quadro FX3500
are based on, and which we understand the problem has
been reproduced with, are capable of addressing any page
allocated on their behalf on PC hardware.

However, even if neither GPU was capable of addressing a
given page, the NVIDIA driver would only attempt to
remap it if the kernel it was built against did not define
the GFP_DMA32 zone. Else the IOMMU support code is not
built into the NVIDIA kernel module.  Note, also, that this
only applies to Linux/x86 kernels: the NVIDIA driver
only allocates from low memory, i.e. never specifies the
__GFP_HIGHMEM flag, and is therefore never built with
IOMMU support on x86.

Comment 11 Serguei Miridonov 2010-04-12 16:39:31 UTC

Dear Garrison Wu, thank you for your comment. I think that the bug must be reopened. Just a note for Kyle McMartin: you have not answered my question about this suspicious line, which appears right after grub:

ehci_hcd 0000:00:1d.7: dma_pool_free ehci_qh, c112c060/fffff060 (bad dma)

What does it mean?

Comment 12 Garrison Wu 2010-04-12 17:02:49 UTC

Serguei, no problem

and just to avoid confusion, my comment above should have read Linux/x86-64

Comment 13 Jesse Brandeburg 2010-04-12 17:42:05 UTC

@garrison: Thank you for your comments Garrison, but you really do need to include iommu support in the nvidia kernel module in x86_64, even for addresses < 4GB.  For instance I'm running vmware vmplayer on my laptop to run my windows xp image, and the nvidia driver to run my graphics.  vmware (and xen, and kvm) all use the Intel VT-d that is in my laptop (ICH9 based) which is an IOMMU.  The IOMMU means my VM can't mess up my host, and that the host can't mess up the VM.  I assume by iommu support you mean the pci_*map* calls or dma_*map* calls right?

@serguei: the ehci_hcd message is from the usb driver, and means there is probably a bug in that driver.

Right now with 2.6.32+ the only way I can get the nvidia driver loaded is with the kernel boot option iommu=soft or intel_iommu=off. Otherwise I just get a black screen as my log fills with errors from the nvidia driver.

Comment 14 Robin Rainton 2010-04-16 22:20:33 UTC

Same here with Kernel 2.6.32.11-99.fc12.i686.PAE Nvidia module kmod-nvidia-173xx-2.6.32.11-99.fc12.i686.PAE-173.14.25-1.fc12.2.i686

Adding intel_iommu=off as an argument in /etc/grub.conf appears to fix this. Thank you for the suggestion.
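
For anyone applying the same workaround to several machines, the edit can be scripted. The following is a minimal sketch that assumes a GRUB legacy style file in which each boot entry has a line beginning with the word `kernel`; the function name `add_kernel_arg` is made up for illustration, and the sketch only transforms text, it does not write the file.

```python
def add_kernel_arg(grub_conf_text, arg="intel_iommu=off"):
    """Append `arg` to every `kernel` line that does not already carry it."""
    out = []
    for line in grub_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "kernel" and arg not in fields[1:]:
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"
```

Running it twice gives the same result as running it once, so it is safe to reapply after a kernel update adds a new entry. Keep a backup of the original file before writing the result back.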

Before this I was seeing a log (/var/log/messages) full of the following, repeating constantly:

Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 2
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 102
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 202
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 302
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 402
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 502
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 602
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
Apr 17 08:07:13 hsem kernel: DRHD: handling fault status reg 702
Apr 17 08:07:13 hsem kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 32006000
Apr 17 08:07:13 hsem kernel: DMAR:[fault reason 01] Present bit in root entry is clear
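
The changing "fault status reg" values (2, 102, 202, ... 702) can be decoded under two assumptions that are not stated anywhere in this report: that the kernel prints the register in hexadecimal, and that the register follows the Intel VT-d Fault Status Register layout (bit 0 fault overflow, bit 1 pending fault, bits 15:8 fault record index). The function name `decode_fault_status` is made up for illustration.

```python
def decode_fault_status(text):
    """Decode the value printed after 'handling fault status reg' (assumed hex)."""
    value = int(text, 16)
    return {
        "primary_fault_overflow": bool(value & 0x1),  # bit 0, assumed layout
        "primary_pending_fault": bool(value & 0x2),   # bit 1, assumed layout
        "fault_record_index": (value >> 8) & 0xFF,    # bits 15:8, assumed layout
    }


if __name__ == "__main__":
    for reg in ("2", "102", "702"):
        print(reg, decode_fault_status(reg))
```

Read this way, the sequence above would be the same pending fault reported from successive fault record slots rather than eight different kinds of error, but that reading depends on the assumptions stated here.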

Comment 15 rick 2010-04-21 08:32:30 UTC

I get these same messages using the nouveau driver, using
01:00.0 VGA compatible controller: nVidia Corporation Quadro FX 770M (rev a1)
these messages are constantly repeating (very similar to what you all have): 

Apr 21 10:31:06 localhost kernel: DRHD: handling fault status reg 3
Apr 21 10:31:06 localhost kernel: DMAR:[DMA Read] Request device [01:00.0] fault addr 0 
Apr 21 10:31:06 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set

Comment 16 Jason Smith 2010-06-09 14:22:58 UTC

I just upgraded to Fedora 13 and am having the exact same problem.  I have to boot with the "intel_iommu=off" kernel option so X does not segfault.

Comment 17 Poncho 2010-06-09 17:50:45 UTC

There is a patch for the nvidia drivers here: http://www.nvnews.net/vbulletin/showthread.php?s=5508b00020d562e14c1c1f33787f815d&t=151791

With the patched drivers I don't need the "intel_iommu=off" kernel option any more.

kernel: 2.6.33.5-112.fc13.x86_64

Comment 18 Bug Zapper 2010-11-03 20:07:32 UTC

This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 19 Bug Zapper 2010-12-03 17:33:25 UTC

Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.