Bug 561267

Summary:

Nouveau does DMA from invalid addresses

Product:

[Fedora] Fedora

Reporter:

Michal Hlavinka <mhlavink>

Component:

xorg-x11-drv-nouveau

Assignee:

Ben Skeggs <bskeggs>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

low

Version:

CC:

airlied, ajax, amluto, anton, awilliam, Bert.Deknuydt, bskeggs, chemobejk, corsac, cra, dougsland, dwmw2, eric.brunet, fkoliver2, gansalmon, itamar, jmoskovc, jonathan, kernel-maint, llg, manisandro, maristgeek, markjx, martin, maurizio.antillon, mbreuer, mcepl, mhlavink, mishu, mpope, murraysj, p.a.crook, selinux, stefanrin, tbzatek, tomek, xgl-maint, zeekec, zing

Target Milestone:

---

Keywords:

Reopened, Triaged

Target Release:

---

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Clone Of:

538163

Environment:

Last Closed:

2010-09-20 15:22:10 UTC

Attachments:

Description	Flags
/var/log/dmesg	none
All nouveau traces from /var/log/messages+/var/log/Xorg.0.log	none

Description Michal Hlavinka 2010-02-03 08:33:00 UTC

Created attachment 388475 [details]
/var/log/dmesg

+++ This bug was initially created as a clone of Bug #538163 +++

Description of problem:
I installed and booted 2.6.32.* kernels on my system:

# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 5400 Chipset Memory Controller Hub [8086:4001] (rev 20)
00:01.0 PCI bridge [0604]: Intel Corporation 5400 Chipset PCI Express Port 1 [8086:4021] (rev 20)
00:05.0 PCI bridge [0604]: Intel Corporation 5400 Chipset PCI Express Port 5 [8086:4025] (rev 20)
00:09.0 PCI bridge [0604]: Intel Corporation 5400 Chipset PCI Express Port 9 [8086:4029] (rev 20)
00:10.0 Host bridge [0600]: Intel Corporation 5400 Chipset FSB Registers [8086:4030] (rev 20)
00:10.1 Host bridge [0600]: Intel Corporation 5400 Chipset FSB Registers [8086:4030] (rev 20)
00:10.2 Host bridge [0600]: Intel Corporation 5400 Chipset FSB Registers [8086:4030] (rev 20)
00:10.3 Host bridge [0600]: Intel Corporation 5400 Chipset FSB Registers [8086:4030] (rev 20)
00:10.4 Host bridge [0600]: Intel Corporation 5400 Chipset FSB Registers [8086:4030] (rev 20)
00:11.0 Host bridge [0600]: Intel Corporation 5400 Chipset CE/SF Registers [8086:4031] (rev 20)
00:15.0 Host bridge [0600]: Intel Corporation 5400 Chipset FBD Registers [8086:4035] (rev 20)
00:15.1 Host bridge [0600]: Intel Corporation 5400 Chipset FBD Registers [8086:4035] (rev 20)
00:16.0 Host bridge [0600]: Intel Corporation 5400 Chipset FBD Registers [8086:4036] (rev 20)
00:16.1 Host bridge [0600]: Intel Corporation 5400 Chipset FBD Registers [8086:4036] (rev 20)
00:1b.0 Audio device [0403]: Intel Corporation 631xESB/632xESB High Definition Audio Controller [8086:269a] (rev 09)
00:1c.0 PCI bridge [0604]: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 [8086:2690] (rev 09)
00:1d.0 USB Controller [0c03]: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 [8086:2688] (rev 09)
00:1d.1 USB Controller [0c03]: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 [8086:2689] (rev 09)
00:1d.2 USB Controller [0c03]: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 [8086:268a] (rev 09)
00:1d.3 USB Controller [0c03]: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 [8086:268b] (rev 09)
00:1d.7 USB Controller [0c03]: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller [8086:268c] (rev 09)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev d9)
00:1f.0 ISA bridge [0601]: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller [8086:2670] (rev 09)
00:1f.1 IDE interface [0101]: Intel Corporation 631xESB/632xESB IDE Controller [8086:269e] (rev 09)
00:1f.2 SATA controller [0106]: Intel Corporation 631xESB/632xESB SATA AHCI Controller [8086:2681] (rev 09)
00:1f.3 SMBus [0c05]: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller [8086:269b] (rev 09)
02:00.0 VGA compatible controller [0300]: nVidia Corporation Quadro NVS 290 [10de:042f] (rev a1)
03:00.0 PCI bridge [0604]: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port [8086:3500] (rev 01)
03:00.3 PCI bridge [0604]: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge [8086:350c] (rev 01)
04:00.0 PCI bridge [0604]: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 [8086:3510] (rev 01)
04:01.0 PCI bridge [0604]: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 [8086:3514] (rev 01)
08:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express [14e4:167a] (rev 02)


System boots up just fine, and X/kdm comes up as expected.

However, this kernel produces tons of DMAR error messages (about 2000 lines per minute):

Feb  3 09:27:28 krles kernel: DRHD: handling fault status reg 2
Feb  3 09:27:28 krles kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 
Feb  3 09:27:28 krles kernel: DMAR:[fault reason 06] PTE Read access is not set

Apart from these error messages, the system seems to be working fine.


Version-Release number of selected component (if applicable):
kernel-2.6.32.7-40.fc12.x86_64

How reproducible:
Every boot

Steps to Reproduce:
1. Boot
2. Look at the logs
  
Actual results:
DMAR error messages

Expected results:
no DMAR error messages, at least not 2000 lines per minute

Comment 1 David Woodhouse 2010-02-03 08:48:33 UTC

Looks like the graphics device is attempting to do DMA from address zero.
Is the address _always_ zero?

Perhaps the driver needs to allocate (and dma-map) a scratch page and point 'unused' pointers to that page instead of assuming that address zero will be valid?
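
One way to answer the "is the address always zero?" question without reading the log by eye is to tally the faults per device and address. The following is only a rough sketch: it assumes the log lines have the shape shown in the description, that the fault address is printed in hexadecimal, and the helper name `summarize_faults` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0
FAULT_RE = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)


def summarize_faults(lines):
    """Count DMAR faults per (device, operation, address) triple."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            # Assumption: the kernel prints the fault address in hex.
            key = (m.group("dev"), m.group("op"), int(m.group("addr"), 16))
            counts[key] += 1
    return counts


# Sample lines copied from the description above.
sample = [
    "Feb  3 09:27:28 krles kernel: DRHD: handling fault status reg 2",
    "Feb  3 09:27:28 krles kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 ",
    "Feb  3 09:27:28 krles kernel: DMAR:[fault reason 06] PTE Read access is not set",
]

summary = summarize_faults(sample)
for (dev, op, addr), n in summary.items():
    print(f"{dev} {op} addr=0x{addr:x} count={n}")
```

On a real system the same function could be fed the lines of /var/log/messages; a single key with address 0 would confirm that the device always reads from address zero.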

Comment 2 Michal Hlavinka 2010-02-03 09:03:40 UTC

(In reply to comment #1)
> Looks like the graphics device is attempting to do DMA from address zero.
> Is the address _always_ zero?

yes

Comment 3 David Woodhouse 2010-02-03 10:07:19 UTC

Hm, it _does_ use a scratch page. To start with, can you try something like this?
Although I don't see how it could ever trigger; we do seem to be initialising the whole page table with the scratch page at startup... unless we calculate the size of the table incorrectly in nouveau_sgdma_init()? 

diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
index 4c7f1e4..74ab2ce 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
@@ -103,6 +103,11 @@ nouveau_sgdma_bind(struct ttm_backend *be, struct ttm_mem_reg *mem)
 		uint32_t offset_l = lower_32_bits(dma_offset);
 		uint32_t offset_h = upper_32_bits(dma_offset);
 
+		if (WARN_ON_ONCE(!offset_l && !offset_h)) {
+			dma_offset = dev_priv->gart_info.sg_dummy_bus;
+			offset_l = lower_32_bits(dma_offset);
+			offset_h = upper_32_bits(dma_offset);
+		}
 		for (j = 0; j < PAGE_SIZE / NV_CTXDMA_PAGE_SIZE; j++) {
 			if (dev_priv->card_type < NV_50)
 				nv_wo32(dev, gpuobj, pte++, offset_l | 3);
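
For readers not used to kernel code, the guard in this patch can be modelled in userspace. This is a sketch of the logic only, not driver code: `pte_address` is a hypothetical helper, and the scratch address below is an invented value standing in for `dev_priv->gart_info.sg_dummy_bus`.

```python
MASK32 = 0xFFFFFFFF


def lower_32_bits(value):
    """Low 32 bits of a 64-bit bus address."""
    return value & MASK32


def upper_32_bits(value):
    """High 32 bits of a 64-bit bus address."""
    return (value >> 32) & MASK32


def pte_address(dma_offset, scratch_bus_addr):
    """Return (offset_l, offset_h), substituting the scratch page for address 0."""
    if dma_offset == 0:
        # Mirrors the WARN_ON_ONCE branch in the patch: both halves are zero,
        # so point the entry at the scratch page instead.
        dma_offset = scratch_bus_addr
    return lower_32_bits(dma_offset), upper_32_bits(dma_offset)


# Hypothetical bus address of the dummy page, for illustration only.
SCRATCH = 0x1_2345_6000

print([hex(v) for v in pte_address(0, SCRATCH)])
print([hex(v) for v in pte_address(0x2_0000_1000, SCRATCH)])
```

If the warning in the real patch never fires, as suspected in this comment, the zero address is not coming from this code path.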

Comment 4 Michal Hlavinka 2010-02-03 12:49:09 UTC

I've built a new kernel with this patch and rebooted. What should I look for? The usual kernel oops trace, or just a one-line message in the log?

Comment 5 David Woodhouse 2010-02-03 14:04:02 UTC

You'd get a warning, which looks very much like an oops.

Comment 6 Adam Williamson 2010-02-03 22:59:16 UTC

It's not a good idea to create a bug as a clone of another bug, generally; it adds a lot of stuff you don't necessarily want (like, everyone who is CCed on the other bug was CCed on this one, and this one depends on that one, which it shouldn't). I've fixed it up now. In future just file a new bug, not a clone :) thanks!

Comment 7 Stefan Becker 2010-03-11 19:13:40 UTC

*** Bug 570142 has been marked as a duplicate of this bug. ***

Comment 8 Paul Crook 2010-03-27 00:14:51 UTC

I've got what I think is exactly the same problem on a Dell quad-core Xeon box; see the snippet from log/messages below (taken from gigabytes of messages).

The problem started when I turned on virtualisation in the BIOS, specifically the "Intel VT I/O option" (approximating Dell's phrasing).  I turned it on to run virtualised guests, so I don't want to turn it off again or set intel_iommu=off.

The problem occurs with both the proprietary nvidia and the open source nouveau drivers.  The machine has an nVidia Quadro 295.  As my best workaround at the moment is to use a low-resolution vesa driver, I'm keen to help find a solution.

Let me know what I can do to help.

---

The box has Fedora 12 installed, with this kernel (if needed, I'm willing to upgrade it):

2.6.32.9-70.fc12.x86_64 #1 SMP Wed Mar 3 04:40:41 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz

---

Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 302
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 402
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 502
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 602
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 702
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 2
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 102
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 202
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set
Mar 27 00:11:00 localhost kernel: DRHD: handling fault status reg 302
Mar 27 00:11:00 localhost kernel: DMAR:[DMA Read] Request device [03:00.0] fault addr 0 
Mar 27 00:11:00 localhost kernel: DMAR:[fault reason 06] PTE Read access is not set

Comment 9 Paul Crook 2010-04-06 14:55:40 UTC

FYI problem still exists with the latest "git" snapshot from nouveau, i.e. Linus's kernel 2.6.34-rc2 git and nouveau git.

Comment 10 Michal Hlavinka 2010-04-15 13:30:55 UTC

This also affects F13-Beta.

Comment 11 Andy Lutomirski 2010-04-30 22:09:42 UTC

The IOMMU fault messages should probably be ratelimited as well -- bogus hardware shouldn't be able to flood the logs that easily.
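
To illustrate what rate limiting would buy here, below is a minimal interval-and-burst limiter. It is a sketch of the general technique, not the kernel's implementation; the class name, the 5-second interval, and the burst of 10 are all assumptions chosen for the example. The input rate of 2000 messages per minute is the figure from the description.

```python
class RateLimiter:
    """Allow at most `burst` events per `interval` (same time unit as `now`)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.suppressed = 0

    def allow(self, now):
        # Open a new window once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


# Simulate one minute of faults at 2000 per minute: one every 30 ms.
limiter = RateLimiter(interval=5000, burst=10)  # times in milliseconds
allowed = sum(limiter.allow(i * 30) for i in range(2000))
print(f"allowed={allowed} suppressed={limiter.suppressed}")
```

With these assumed settings only a small fraction of the 2000 messages would reach the log, while the suppressed counter still records how many faults occurred.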

Comment 12 David Woodhouse 2010-05-01 10:58:14 UTC

On modern systems we should be able to completely disable broken hardware which is causing such faults. Better to fix the driver though, I suspect.

Comment 13 Michal Hlavinka 2010-05-24 08:06:52 UTC

jftr: I've tried this on an F-13 machine with an xorg-x11-drv-nouveau repository snapshot (20100519) and a rebuilt rawhide kernel (2.6.34-3), and the bug is still there.

Comment 14 David Woodhouse 2010-05-24 09:49:39 UTC

That doesn't surprise me; there's been precisely zero activity on the upstream fd.o bug and our own nouveau developers don't seem to have looked at this bug either.

Ben?

Comment 15 Ben Skeggs 2010-05-24 22:50:11 UTC

I've looked at it a bit, and I'm completely confused as to where these are actually coming from.  On the machine I had access to, we seemed to get one every second or so.

All our page tables are cleared with scratch pages etc, so in all likelihood it's some part of the GPU we don't know anything about that's doing the DMA.  But, no real ideas from here.

Comment 16 Michal Hlavinka 2010-05-25 05:35:08 UTC

(In reply to comment #15)
> I've looked at it a bit, and completely confused as to where these are coming
> from actually.  On the machine I had access too we seemed to get one every
> second or so.

I don't know if it matters, but I have 2 LCDs and I'm getting far more than one message every second: hundreds. After two weeks the system was unusable; I found out that /var/log/messages had grown to 5.6 GB, which exhausted all free space on the root partition.
 
> All our page tables are cleared with scratch pages etc, so it's some part of
> the GPU we don't know anything about that's doing the DMA in all likelihood. 
> But, not real ideas from here.    

Well, there were probably a lot of changes, but maybe looking into the changes between 2.6.31.* (which was working fine) and 2.6.32 (first broken) could help.

Comment 17 Ben Skeggs 2010-05-25 05:46:31 UTC

(In reply to comment #16)
> (In reply to comment #15)
> > I've looked at it a bit, and completely confused as to where these are coming
> > from actually.  On the machine I had access too we seemed to get one every
> > second or so.
> 
> I don't know if it does matter, but I have 2 lcds and I'm getting more than one
> message every second - hundreds. After two weeks system was unusable - I found
> out /var/log/messages was 5.6 GB big which exhausted all free space on root
> partition
> 
> > All our page tables are cleared with scratch pages etc, so it's some part of
> > the GPU we don't know anything about that's doing the DMA in all likelihood. 
> > But, not real ideas from here.    
> 
> well, there were probably a lot of changes, but maybe looking into changes
> between 2.6.31,* (which was working fine) and 2.6.32 (first broken) could help    
My guess is that 2.6.32 is where VT-d either got added, or turned on, and that it's always been broken.  But regardless, I've double-checked every place we know of that can reference system memory on the GPU, and we're all fine there.  It's still a mystery.

To "hide" the issue for now, you can disable VT-d in your BIOS setup.

Does the binary nvidia driver work for you by the way?  On the machine I was using, it caused a massive VT-d flood which essentially made the machine appear to be hung.

Comment 18 Michal Hlavinka 2010-05-25 08:56:16 UTC

(In reply to comment #17)
> My guess is that 2.6.32 is where VT-d either got added, or turned on, and that
> it's always been broken.  But regardless, I've double-checked every place we
> know of that can reference system memory on the GPU, and we're all fine there. 
> It's still a mystery.
> 
> To "hide" the issue for now, you can disable VT-d in your BIOS setup.

is it possible to disable VT-d just for nouveau?

> Does the binary nvidia driver work for you by the way?  On the machine I was
> using, it caused a massive VT-d flood which essentially made the machine appear
> to be hung.    

I've tried nvidia, nouveau both with and without VT-d enabled

nvidia without VT-d works fine
nvidia with VT-d produces similar DMAR error messages:

DRHD: handling fault status reg 2
DMAR:[DMA Read] Request device [02:00.0] fault addr 128785000
DMAR:[fault reason 01] Present bit in root entry is clear

addr is different every time I restart X server

also server won't start, in xorg.log everything seems ok but last line:
(EE) NVIDIA(0): WAIT: (E, 0, 0x827d, 0)

nouveau with VT-d = this bug
nouveau without VT-d seems working fine so far (I'm using it right now)

Comment 19 Michael Breuer 2010-05-25 19:43:19 UTC

These disappeared for me somewhere around kernel.org 2.6.33 rc5. They're still gone for me in 2.6.34.

Unfortunately, with 2.6.34 & rawhide nouveau drm updates I can't log in using Gnome or KDE (dead keyboard after telinit 5... but OK after chvt away from X). Probably something I did, so no bug report for that yet.

-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 20 Paul Crook 2010-05-26 11:23:05 UTC

No such luck here.  I've been trying rawhide for a while now and the problem has never gone away.  Can see it now even with 2.6.34-2.

Typically 3-4 DMA errors per second, though it possibly depends on how much on-screen activity there is.  I had to delete some 9G of message logs the other day when hard drive space dried up.

When I first came across this problem I was using the nvidia driver, which produced the same DMA errors (and X failed to start), so it looks like the problem is common to both nvidia and nouveau.

Comment 21 Erik Zeek 2010-05-26 15:22:01 UTC

Not really a comment on the bug, but to help with the log file size.

I've added the following lines to my rsyslog.conf (before the other rules) to stop the log spam.  Dmesg is still useless, but my log files don't grow out of control.

# Coverup nouveau log spam.                    
:msg, contains, "DMAR" ~
:msg, contains, "DRHD" ~
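
As a rough model of what the two rules above do, the sketch below drops every line containing either substring and keeps the rest. It is only an illustration in Python (the function name is invented), and it shows the main caveat of the workaround: the match is on the text alone, so DMAR faults from any other device are discarded too. Newer rsyslog versions reportedly prefer `stop` to the `~` discard action, so check the documentation for your version.

```python
DROP_SUBSTRINGS = ("DMAR", "DRHD")


def filter_log(lines, drop=DROP_SUBSTRINGS):
    """Return (kept_lines, dropped_count), discarding lines that contain any marker."""
    kept, dropped = [], 0
    for line in lines:
        if any(marker in line for marker in drop):
            dropped += 1
        else:
            kept.append(line)
    return kept, dropped


# Two lines from this report plus one unrelated, made-up line.
sample_log = [
    "kernel: DRHD: handling fault status reg 2",
    "kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 ",
    "kernel: usb 1-1: new high speed USB device using ehci_hcd",
]

kept, dropped = filter_log(sample_log)
print(f"kept={len(kept)} dropped={dropped}")
```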

Comment 22 Mark Jx 2010-06-05 14:29:45 UTC

Just wanted to add that disabling VT-d worked for me.

System configuration:
* Asus P6T Deluxe Motherboard
* Intel i7 920 CPU
* nVidia Corporation GT216 [GeForce GT 220]
* Fedora 13
* Kernel 2.6.33.5-112.fc13.x86_64
* xorg-x11-drv-nouveau-0.0.16-6.20100423git13c1043.fc13.x86_64

Oh, and I have other machines with i7 Extreme Editions on P6T (Supercomputer and P6TD Deluxe) motherboards.  The Intel dual port 10 Gig Ethernet cards didn't work until I disabled VT-d.

Comment 23 Charles R. Anderson 2010-06-05 14:50:03 UTC

I did iommu=soft which also works.

Comment 24 Ben Skeggs 2010-06-10 00:01:17 UTC

It appears the reporter and others are using at least F13 now, so re-targeting.

This should be fixed in kernel-2.6.33.5-120.fc13 (http://koji.fedoraproject.org/koji/buildinfo?buildID=177442).

Comment 25 Michal Hlavinka 2010-06-10 08:47:59 UTC

I've installed latest kernel from koji: kernel-2.6.33.5-122.fc13 and even after re-enabling VT-d in the bios I no longer get those DMAR/DRHD error messages in the log. So it's fixed, at least for me.

Comment 26 Mike Pope 2010-06-11 00:08:25 UTC

Yet another `me too'.  I have kernel-2.6.33.5-112.fc13.x86_64, xorg-x11-drv-nouveau-0.0.16-6.20100423git13c1043.fc13.x86_64, on Intel Core 2 Duo / NV50.
If I boot with Vt-d enabled, I get the DMAR-spew in messages and a hard lock after about 0-5 minutes.  If Vt-d is disabled... no sign yet of either.

Comment 27 Ben Skeggs 2010-06-11 00:21:35 UTC

Mike, if you read the last few comments here, the issue is fixed in 2.6.33.5-120.fc13

Comment 28 Stefan Becker 2010-06-11 10:18:33 UTC

Installed kernel-2.6.33.5-120.fc13.x86_64 and removed the kernel option "intel_iommu=off". No error messages in the log yet...

Comment 29 Mike Pope 2010-06-15 04:43:32 UTC

Re #27, apologies Ben, I misread the kernel number.  -120 does work for me, but alas only keeps the machine up long enough for #566987 to bite.

Comment 30 Mike Pope 2010-06-16 03:52:34 UTC

Actually now I am not so sure this is fully fixed.  Since re-enabling Vt-d yesterday I have had two hard locks with -120 (rebooted afterward as single user and checked Xorg.0.log but no sign of the #566987 signature), although there were no more spewed DMAR messages.  -124 is downloading now.

Comment 31 Stefan Becker 2010-06-16 04:10:24 UTC

(In reply to comment #30)
> Actually now I am not so sure this is fully fixed.  Since re-enabling Vt-d
> yesterday I have had two hard locks with -120 (rebooted afterward as single
> user and checked Xorg.0.log but no sign of the #566987 signature), although
> there were no more spewed DMAR messages.  -124 is downloading now.    

I had the same experience on my work desktop yesterday. Can you please check /var/log/messages? Do you see there:

kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP - Ch 2/5 Class 0x8297 Mthd 0x0f04 Data 0x00000000:0x00000000
kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - VM: Trapped read at 00412a2000 status 00000560 00000000 channel 2
kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - 00000000 00000000 00000000 00000000 00000000 00000000 00000000

On my system X was frozen, with the X server eating 100% CPU on one core. Other functionality was still OK, i.e. I was able to login remotely and initiate a shutdown command. Unfortunately the shutdown got stuck, probably because it couldn't kill the X server.

You might also want to check Bug #566987 if your system supports PCI-E ASPM. My work desktop doesn't have this feature, so pcie_aspm=off won't help

Comment 32 Mike Pope 2010-06-16 05:08:13 UTC

The hard locks I am seeing leave nothing in /var/log/messages.  AFAICT nouveau initializes happily, and runs well, emitting no messages after post-boot settle down.

I am pretty confident this is not #566987, which this machine also does suffer from.  Here that one is always associated with a PFIFO message, leaves the mouse alive, and the machine still accepts ssh, none of which is the case for the lockups.  The box does have PCI-E, and I set pcie_aspm=off this morning, following the first hard lockup but before the second.

The only thing I changed before the lockups started was re-enabling Vt-d.
It's off again now while I try to get some work done, but I will turn it on again later.

Comment 33 Éric Brunet 2010-06-16 12:11:18 UTC

Could this be the same bug as 578108, about a problem with nvidia hardware, related to iommu, which appeared with kernel 2.6.32 and which is fixed with intel_iommu=off ?

Comment 34 Stefan Becker 2010-06-22 12:30:26 UTC

(In reply to comment #31)
> kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP - Ch 2/5 Class 0x8297 Mthd
> 0x0f04 Data 0x00000000:0x00000000
> kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - VM: Trapped read
> at 00412a2000 status 00000560 00000000 channel 2
> kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - 00000000
> 00000000 00000000 00000000 00000000 00000000 00000000

The same X freeze happened with -124 just 10 minutes ago. I was able to run "telinit 3", kill X and shut down the machine cleanly, so only X is affected by this. I guess if this happens again I'll go back to intel_iommu=off.

Comment 35 Éric Brunet 2010-06-22 12:47:43 UTC

Coincidence: I've also just had a freeze of my computer (no response from mouse or keyboard, not even the CapsLock LED working) with 2.6.33.5-124.fc13.x86_64. I didn't try to log in remotely and just pressed the reset button.

The last three lines of /var/log/messages are
kernel: [drm] nouveau 0000:0f:00.0: PFIFO_DMA_PUSHER - Ch 2
kernel: [drm] nouveau 0000:0f:00.0: PFIFO_CACHE_ERROR - Ch 2/0 Mthd 0x0000 Data 0x03000000
kernel: [drm] nouveau 0000:0f:00.0: PFIFO_DMA_PUSHER - Ch 2

So, what do we lose by going back to intel_iommu=off? Is there a reason why it shouldn't be the default? What does this stuff do?

(Remember, I am coming from bug 578108, which I suspect has the same origin as this bug, but I had different symptoms in previous kernels than everybody here.)

Comment 36 Stefan Becker 2010-06-22 12:56:23 UTC

(In reply to comment #35)
> So, what do we loose with going back to intel_iommu=off ? Is there a reason why
> it shouldn't be the default ? What does this stuff do ?

It disables an Intel virtualization feature (same as disabling VT-d in the BIOS). If you don't use virtualization you shouldn't lose anything.
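
Several workarounds in this thread are kernel boot options (intel_iommu=off, iommu=soft, pcie_aspm=off), and it is easy to lose track of which ones are active. The sketch below parses a kernel command line and reports just those options. The sample command line is invented for illustration; on a running system the string would come from /proc/cmdline. Quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def iommu_settings(cmdline):
    """Return only the options discussed in this bug, if present."""
    params = parse_cmdline(cmdline)
    wanted = ("intel_iommu", "iommu", "pcie_aspm")
    return {name: params[name] for name in wanted if name in params}


# Hypothetical command line; read /proc/cmdline for the real one.
example = "ro root=/dev/mapper/vg-root rhgb quiet pcie_aspm=off intel_iommu=off"
print(iommu_settings(example))
```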

Comment 37 Mike Pope 2010-06-23 00:10:05 UTC

The PFIFO_DMA_PUSHER one sounds like #566987.  I see this bug (#561267) quite quickly if I enable Vt-d in the BIOS, but if I disable it I can run for some time before eventually seeing #566987.  If I also add intel_iommu=off, I have not seen either and now have an uptime of 5 days.  (All with 2.6.33.5-124.fc13.x86_64)

Comment 38 Paul Crook 2010-06-23 13:44:23 UTC

Just updated to the 2.6.34-45.fc14.x86_64 kernel on a machine that is using the rawhide repositories (actually Fedora 14, but I don't think there's a difference as yet).

Looks like the problem might have gone away.  Not sure about stability though; so far it's been up for a couple of hours and nothing has gone wrong.

Comment 39 Paul Crook 2010-06-25 10:03:14 UTC

Up 1 day, 18 hours and counting.  No more DMAR log messages.  VT-d enabled and running virtualised machines.  This is the same machine as I mentioned in comment#8.  Looks like this is fixed in kernel 2.6.34-45.

Comment 40 Paul Crook 2010-06-25 16:34:18 UTC

Spoke too soon.

Machine just locked up after 2 days running.  X locked up, ssh not working, pings not being answered.  Nothing in /var/log/messages.  I noticed a similar problem with the previous 2.6.34 version (not sure, but possibly 2.6.34-40).  Obviously this is different from the original problem; however, a completely different machine without an nvidia card, but using the same kernel and with VT-d turned on, hasn't locked up, which suggests there is possibly still some link to nvidia.

Comment 41 Paul Crook 2010-06-25 17:02:09 UTC

Now trying kernel 2.6.35-0.2.rc3.git0.fc14.x86_64. No DMAR messages. Okay, let's see if this lasts more than 2 days before freezing.

Comment 42 Matěj Cepl 2010-06-25 21:22:50 UTC

Michal, do you agree with the comment 41? Is it fixed in 2.6.34-45?

Thank you

Comment 43 Ben Skeggs 2010-06-26 00:49:30 UTC

Just to clarify this bug, it's referring to nouveau triggering IOMMU faults.  This bug *is* fixed.  Any lockups (and there are known issues on NV86 and NVA3/NVA5/NVA8) are different bugs.

Comment 44 Mike Pope 2010-06-28 00:31:45 UTC

Fair enough Ben, but where then do you want such reports to go?  I have now seen another lockup with 2.6.33.5-124.fc13.x86_64 + intel_iommu=off + NV92.  Is there a bug open for mysterious-NV-lockups-which-used-to-DMAR-spew?

Comment 45 Ben Skeggs 2010-06-28 00:47:46 UTC

Have you seen the bug *with* intel_iommu=off or without?  I don't think there's any link between the two bugs; the DMAR spew would happen on *any* NVIDIA chipset that's plugged into a board supporting VT-d.

Comment 46 Mike Pope 2010-06-28 01:18:57 UTC

Yes, with intel_iommu=off.  When I first saw this it was a DMAR-spew followed quickly by lockup.  Now it's just a rare lockup.  It certainly could be different problems, but there is nothing in /var/log/messages or Xorg.0.log to characterize it better.

Comment 47 Ben Skeggs 2010-06-28 01:23:27 UTC

Okay, then I'm even more convinced now they're completely separate issues :)  A new bug report for the lockup would be great, though as with all these random lockups, they're quite hard to track down.

Comment 48 Michal Hlavinka 2010-06-28 07:22:48 UTC

(In reply to comment #42)
> Michal, do you agree with the comment 41? Is it fixed in 2.6.34-45?

I didn't try this kernel; it's not available for F-13. But Ben fixed it in another build (see comment #24) and it's working for me (comment #25), so I agree that this bug is fixed.

Comment 49 Stefan Ring 2010-06-29 14:09:25 UTC

Just had a hard lockup + reset as well, with kernel 2.6.33.5-124.fc13.x86_64.

Hardware: http://www.smolts.org/client/show/pub_dbe294f3-62e0-40f9-b141-547eeb979466

Until a few weeks ago, before the DMA spewing was fixed, the X server would just hang, and I would ssh in and reboot the box. This time, however, it crashed by itself, too quickly for me to bring up the other machine and log in.

There is nothing in /var/log/messages. I'm using pcie_aspm=off intel_iommu=off.

Comment 50 Mike Pope 2010-06-30 00:04:49 UTC

Ben, if you would like a distinct bug report for just the hard lockup I am happy to open one.  I just wish there was more useful detail to put in it.  All I have right now is `machine <description> hard locks occasionally without warning, did not happen in F11, nothing in logs, pcie_aspm setting does not matter, intel_iommu setting does not matter, Vt-d off in BIOS' (actually I need to confirm the latter, it's not explicit in my notes).

That is not giving you much signal.  Is there some extra logging you can recommend I turn on to increase the log verbosity?

Comment 51 Paul Crook 2010-06-30 09:21:27 UTC

I was getting similar hard locks with a couple of 2.6.34 kernels that I tried (from rawhide/F14), see comment #38 - comment #41.  Now running 2.6.35-0.2.rc3.git0.fc14.x86_64 and so far so good.  Uptime so far 4 days 16 hours.  I've got Vt-d enabled and I'm *not* using pcie_aspm=off intel_iommu=off.

Comment 52 Ben Skeggs 2010-07-01 00:57:16 UTC

(In reply to comment #50)
> Ben, if you would like a distinct bug report for just the hard lockup I am
> happy to open one.  I just wish there was more useful detail to put in it.  All
> I have right now is `machine <description> hard locks occasionally without
> warning, did not happen in F11, nothing in logs, pcie_aspm setting does not
> matter, intel_iommu setting does not matter, Vt-d off in BIOS' (actually I need
> to confirm the latter, its not explicit in my notes).
> 
> That is not giving you much signal.  Is there some extra logging you can
> recommend I turn on to increase the log verbosity?   
What *exact* chipset is your card?  dmesg or X log will be useful to know this.

Comment 53 Mike Pope 2010-07-01 01:35:25 UTC

Created attachment 428144 [details]
All nouveau traces from /var/log/messages+/var/log/Xorg.0.log

The kernel says: Detected an NV50 generation card (0x092a00a2)
The X server says: NOUVEAU(0): Chipset: "NVIDIA NV92"
See attached.

Comment 54 Ben Skeggs 2010-07-01 01:55:32 UTC

Okay, thanks.  Then yeah, definitely file a new bug, that hang isn't a known one.

Comment 55 Mike Pope 2010-07-01 03:14:47 UTC

Done. #609764.

Comment 56 Stefan Becker 2010-08-26 14:28:43 UTC

IMHO this should be closed. The (hard) lock issues are already reported in several other bug reports.

Does the original reporter concur?

Comment 57 Michal Hlavinka 2010-08-26 17:04:22 UTC

Yes, I agree; see comment #48.

Comment 58 Stefan Becker 2010-08-27 14:30:22 UTC

(In reply to comment #57)
> yes, I agree see comment #48

Then let's close this one. Could you or bskeggs please close it? I don't have the rights to do it.

Comment 59 Jiri Moskovcak 2010-09-15 21:46:54 UTC

I'm sorry to say it, but the bug doesn't seem to be fixed, at least on my HW: I still get those DMAR messages and /var/log/messages is filling my hard drive. Disabling VT-d in the BIOS or using intel_iommu=off helps, so I guess it's the same bug. I'm running kernel-2.6.34.6-54.fc13.x86_64.

Comment 60 Chuck Ebbert 2010-09-20 04:01:34 UTC

(In reply to comment #59)
> I'm sorry to say that, but the bug doesn't seem to be fixed at least on my HW,
> still get those DMAR messages and /var/log/messages is filling my harddrive.
> Disabling VT'd in bios or using intel_iommu=off helps, so I guess it's the same
> bug. I'm running kernel-2.6.34.6-54.fc13.x86_64.

When you look at the error message similar to this:

  kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 

does the device number (in this case 02:00.0 but yours may be different) match up to an nvidia video adapter in the output of lspci, or is it some other device? In the original reporter's machine, device 02:00.0 is this:

  02:00.0 VGA compatible controller [0300]: nVidia Corporation Quadro NVS 290 [10de:042f] (rev a1)
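
The check described here (take the device address from the DMAR line and look it up in the lspci output) can be scripted. The following is a minimal sketch: the function names are made up, the sample lines are copied from this report, and it assumes the plain `lspci` layout where each line starts with the bus:device.function address.

```python
import re

# Captures the bus:device.function address from a DMAR fault line.
FAULT_DEV_RE = re.compile(r"Request device \[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\]")


def parse_lspci(text):
    """Map 'bb:dd.f' addresses to the rest of each lspci line."""
    devices = {}
    for line in text.splitlines():
        slot, _, desc = line.strip().partition(" ")
        if slot and desc:
            devices[slot] = desc
    return devices


def faulting_devices(log_lines, lspci_text):
    """Return {address: lspci description} for every device named in a fault."""
    devices = parse_lspci(lspci_text)
    found = {}
    for line in log_lines:
        m = FAULT_DEV_RE.search(line)
        if m:
            slot = m.group(1)
            found[slot] = devices.get(slot, "unknown device")
    return found


# Sample data copied from this bug report.
LSPCI = """\
02:00.0 VGA compatible controller [0300]: nVidia Corporation Quadro NVS 290 [10de:042f] (rev a1)
0d:00.0 SD Host controller: Ricoh Co Ltd Device e822 (rev 01)
"""
LOG = [
    "kernel: DMAR:[DMA Read] Request device [02:00.0] fault addr 0 ",
    "kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff000 ",
]

for slot, desc in faulting_devices(LOG, LSPCI).items():
    print(slot, "->", desc)
```

A fault that maps to anything other than the NVIDIA adapter points to a different driver and therefore a different bug.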

Comment 61 Jiri Moskovcak 2010-09-20 13:38:39 UTC

in my case it's:
Sep 20 15:09:13 dhcp-25-200 kernel: DRHD: handling fault status reg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device reg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff000 reg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffffreg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff000 
Sep 20 15:09:13 dhcp-25-200 kernel: DMreg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Requece [0d:00.0] fault addr ffreg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff000 
Sep 20 15:09:13 dhcp-25-200 kernel: <reg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff0reg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Rereg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Reace [0d:00.0] fault addr ffreg 2
Sep 20 15:09:13 dhcp-25-200 kernel: DMAR:[DMA Read] Request device [0d:00.0] fault addr fffff000 

and the device seems to be:
0d:00.0 SD Host controller: Ricoh Co Ltd Device e822 (rev 01)

So it's probably a different component, sorry for the noise then..

Comment 62 Chuck Ebbert 2010-09-20 15:22:10 UTC

(In reply to comment #61)
> 
> and the device seems to be:
> 0d:00.0 SD Host controller: Ricoh Co Ltd Device e822 (rev 01)
> 
> So it's probably a different component, sorry for the noise then..

That's bug 605888.

Comment 63 Yves-Alexis Perez 2010-12-16 15:10:46 UTC

(In reply to comment #24)
> It appears reporter and others using at least F13 now, so re-targeting.
> 
> This should be fixed in kernel-2.6.33.5-120.fc13
> (http://koji.fedoraproject.org/koji/buildinfo?buildID=177442).

Out of curiosity, which upstream commit fixes this bug?

Comment 64 Yves-Alexis Perez 2010-12-16 15:21:30 UTC

Fwiw, it's 4eb3033c72099fab3536ed8ac54a5dc99f0832d7

Comment 65 Stephen Murray 2011-06-02 19:28:26 UTC

The problem has resurfaced in Fedora 15. My log is full of these messages:

Jun  1 11:36:03 murraysj kernel: [    3.084032] DRHD: handling fault status reg 2
Jun  1 11:36:03 murraysj kernel: [    3.084037] DMAR:[DMA Read] Request device [02:00.0] fault addr 0
Jun  1 11:36:03 murraysj kernel: [    3.084037] DMAR:[fault reason 06] PTE Read access is not set

The X session is sluggish and sometimes fails to respond completely.

I run VT-d with Windows guests.

Machine is a Dell Precision T3500 QuadCore.

[root@murraysj ~]# lspci -v | grep VGA
02:00.0 VGA compatible controller: nVidia Corporation NV43GL [Quadro FX 550] (rev a2) (prog-if 00 [VGA controller])
[root@murraysj ~]# lspci -vn | grep VGA
02:00.0 0300: 10de:014d (rev a2) (prog-if 00 [VGA controller])

Problem did not occur on this machine with Fedora 13 or Fedora 14.

When running with nouveau the gnome3 graphics worked correctly (when they worked).

I am now running the nVidia driver from rpmfusion, the error has vanished but the 3D graphics don't work, I'm stuck in gnome3 fallback mode.

FWIW, I am running Fedora 15 on a Dell Precision 390 with similar graphics hardware, it does not have the problem.

Is it possible to reopen this bugzilla or should a new one be started ?

Comment 66 Stephen Murray 2011-06-05 22:58:25 UTC

I disabled VT-d in the BIOS as suggested by an earlier post, now the nouveau error has gone away. The Windows guest is also performing correctly. 

As I stated, the error in nouveau did not occur under Fedora 13 or 14, or even 12 I seem to recall. The computer hardware has not changed, just the level of Fedora. The nouveau driver in Fedora 15 appears to have interaction problems with VT-d that the earlier versions did not.