569074 – nouveau NULL pointer after resume

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	13
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Ben Skeggs
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-02-27 22:35 UTC by Mike Snitzer
Modified:	2010-08-30 18:23 UTC
CC List:	8 users
Fixed In Version:	kernel-2.6.34.6-47.fc13
Clone Of:
Clones:	601002
Environment:
Last Closed:	2010-08-30 18:23:42 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg from y550 that shows NULL pointer after resume (72.90 KB, text/plain) 2010-02-27 22:42 UTC, Mike Snitzer	no flags
kernel log showing same NULL pointer with 2.6.33.4-95.fc13.x86_64 (8.83 KB, text/plain) 2010-05-26 02:07 UTC, Mike Snitzer	no flags

Description Mike Snitzer 2010-02-27 22:35:34 UTC

Description of problem:
Resuming after suspend fails on this Ideapad Y550, which has an NV50 (NVIDIA GeForce GT 130M).  I haven't seen resume work on this system yet (this includes F12), but I decided to try Fedora 13 (even in its bleeding-edge form) because nouveau has seen updates upstream.

Version-Release number of selected component (if applicable):
2.6.33-1.fc13.x86_64

How reproducible:
always

Steps to Reproduce:
1. suspend
2. resume
3. screen stays black (NULL pointer occurs in nouveau driver)
  
Actual results:
[drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
usb 2-3: reset high speed USB device using ehci_hcd and address 2
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa005f835>] ttm_bo_pci_offset+0x5b/0x7b [ttm]
PGD 137471067 PUD 136d9e067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/boot_vga
CPU 1 
Pid: 1365, comm: Xorg Not tainted 2.6.33-1.fc13.x86_64 #1 KIWB1/4186                
RIP: 0010:[<ffffffffa005f835>]  [<ffffffffa005f835>] ttm_bo_pci_offset+0x5b/0x7b [ttm]
RSP: 0018:ffff8801274e18e8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8801274e1b78 RCX: ffff8801274e1910
RDX: ffff8801274e1900 RSI: ffff8801274e1b78 RDI: ffff8801366800f0
RBP: ffff8801274e18e8 R08: ffff8801274e1908 R09: 0000000000000180
R10: 0000000000000002 R11: 0000000000000000 R12: ffff8801274e19f8
R13: ffff8801366800f0 R14: 0000000000000002 R15: 0000000000000000
FS:  00007f543a55d7a0(0000) GS:ffff880006600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000136d7b000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process Xorg (pid: 1365, threadinfo ffff8801274e0000, task ffff880137f3c920)
Stack:
 ffff8801274e1948 ffffffffa006261d ffffffff81478dbe 00000000d0000000
<0> 0000000000000000 0000000000680000 0000000000000018 ffff8801274e1b78
<0> ffff880115281120 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffffa006261d>] ttm_mem_reg_ioremap+0x3b/0xaf [ttm]
 [<ffffffff81478dbe>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
 [<ffffffffa0062ab1>] ttm_bo_move_memcpy+0x79/0x3d7 [ttm]
 [<ffffffffa007e05e>] ? nouveau_fence_wait+0x54/0xb4 [nouveau]
 [<ffffffffa007de62>] ? nouveau_fence_unref+0x29/0x34 [nouveau]
 [<ffffffffa007d2cc>] ? nouveau_bo_move_m2mf+0x354/0x366 [nouveau]
 [<ffffffffa007d66e>] nouveau_bo_move+0x390/0x41d [nouveau]
 [<ffffffff810f34db>] ? unmap_mapping_range+0x10e/0x11d
 [<ffffffffa006030b>] ttm_bo_handle_move_mem+0x1ad/0x2b5 [ttm]
 [<ffffffffa0061ec7>] ttm_bo_move_buffer+0xbc/0x10c [ttm]
 [<ffffffffa006006a>] ? ttm_bo_reserve+0x38/0xf8 [ttm]
 [<ffffffffa0061fc5>] ttm_bo_validate+0xae/0xf7 [ttm]
 [<ffffffffa007e775>] validate_list+0x157/0x288 [nouveau]
 [<ffffffffa007fa7a>] nouveau_gem_ioctl_pushbuf+0xca9/0xcdf [nouveau]
 [<ffffffffa0019418>] drm_ioctl+0x28f/0x373 [drm]
 [<ffffffffa007edd1>] ? nouveau_gem_ioctl_pushbuf+0x0/0xcdf [nouveau]
 [<ffffffff8111e922>] ? do_sync_read+0xc4/0x101
 [<ffffffff8112bcfc>] vfs_ioctl+0x32/0xa6
 [<ffffffff8112c27c>] do_vfs_ioctl+0x490/0x4d6
 [<ffffffff8112c318>] sys_ioctl+0x56/0x79
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: 69 c0 c0 00 00 00 8b 44 07 64 a8 01 75 13 45 85 d2 74 0a a8 08 75 06 f6 46 22 01 74 04 31 c0 c9 c3 48 8b 06 4d 69 c9 c0 00 00 00 <48> 8b 40 28 48 c1 e0 0c 48 89 01 48 8b 46 10 48 c1 e0 0c 49 89 
RIP  [<ffffffffa005f835>] ttm_bo_pci_offset+0x5b/0x7b [ttm]
 RSP <ffff8801274e18e8>
CR2: 0000000000000028
---[ end trace 7f2055f83ed9bd03 ]---

Expected results:
Avoid NULL pointer and successfully resume after suspend.

Additional Info:
I'll be attaching the full dmesg.
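
A rough way to read the oops above mechanically, offered as a sketch rather than a diagnosis: the fault address (CR2) is 0x28 and RAX is 0, and the bytes at the marked position in the Code: line (`<48> 8b 40 28`) appear to be a load from offset 0x28 off RAX, so the crash looks like a field read through a NULL pointer inside ttm_bo_pci_offset. Call-trace entries marked with "?" are ones the kernel could not verify, so the sketch drops them. The helper name `summarize_oops` and the abbreviated `OOPS` excerpt are illustrative only; every line in the excerpt is copied from the trace above.

```python
import re

# Excerpt of the oops above; nothing here is new data.
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa005f835>] ttm_bo_pci_offset+0x5b/0x7b [ttm]
RAX: 0000000000000000 RBX: ffff8801274e1b78 RCX: ffff8801274e1910
Call Trace:
 [<ffffffffa006261d>] ttm_mem_reg_ioremap+0x3b/0xaf [ttm]
 [<ffffffff81478dbe>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
 [<ffffffffa0062ab1>] ttm_bo_move_memcpy+0x79/0x3d7 [ttm]
 [<ffffffffa007e05e>] ? nouveau_fence_wait+0x54/0xb4 [nouveau]
 [<ffffffffa007d66e>] nouveau_bo_move+0x390/0x41d [nouveau]
"""


def summarize_oops(text):
    """Pull the fault address, faulting symbol, and verified frames out of an oops."""
    fault = int(re.search(r"dereference at ([0-9a-f]+)", text).group(1), 16)
    ip = re.search(
        r"^IP: \[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]",
        text,
        re.M,
    )
    rax = int(re.search(r"\bRAX: ([0-9a-f]+)", text).group(1), 16)
    # Each frame line is " [<addr>] [? ]symbol+0x..."; group 1 is the "?" marker.
    frames = re.findall(r"^ \[<[0-9a-f]+>\] (\?)? ?(\w+)\+0x", text, re.M)
    return {
        "fault_address": fault,
        "function": ip.group(1),
        "module": ip.group(2),
        # With RAX == 0 this equals the fault address, i.e. the field offset read.
        "offset_from_rax": fault - rax,
        "reliable_frames": [name for marker, name in frames if marker != "?"],
    }


summary = summarize_oops(OOPS)
print(hex(summary["fault_address"]), summary["function"], summary["reliable_frames"])
```

On the excerpt this leaves ttm_mem_reg_ioremap, ttm_bo_move_memcpy and nouveau_bo_move as the verified callers, which matches the buffer-move path visible in the full trace.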

Comment 1 Mike Snitzer 2010-02-27 22:42:45 UTC

Created attachment 396803
dmesg from y550 that shows NULL pointer after resume

Comment 2 Mike Snitzer 2010-05-26 02:07:12 UTC

Created attachment 416601
kernel log showing same NULL pointer with 2.6.33.4-95.fc13.x86_64

Still happens with latest F13 kernel: 2.6.33.4-95.fc13.x86_64

Comment 4 Mike Snitzer 2010-06-05 02:38:53 UTC

Happens with 2.6.33.5-112.fc13.x86_64 too:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa006f71e>] ttm_bo_pci_offset+0x5d/0x7d [ttm]
...

Other bug reports for this same issue:
https://bugs.freedesktop.org/show_bug.cgi?id=26521
https://bugs.freedesktop.org/show_bug.cgi?id=27574
https://bugzilla.kernel.org/show_bug.cgi?id=15120

Comment 5 Mike Snitzer 2010-06-05 12:48:27 UTC

Cc'ing Jerome given his recent ttm rework that is now upstream in 2.6.35.  The following commits remove ttm_bo_pci_offset entirely:
http://git.kernel.org/linus/0c321c7962718
http://git.kernel.org/linus/82c5da6bf8b55

I'll see if I can test these bits (via rawhide or applying by hand) relative to this bug.

Comment 6 Mike Snitzer 2010-06-05 14:21:37 UTC

(In reply to comment #5)
> Cc'ing Jerome given his recent ttm rework that is now upstream in 2.6.35.  The
> following commits remove ttm_bo_pci_offset entirely:
> http://git.kernel.org/linus/0c321c7962718
> http://git.kernel.org/linus/82c5da6bf8b55
> 
> I'll see if I can test these bits (via rawhide or applying by hand) relative to
> this bug.    

Good news is rawhide's 2.6.34-20.fc14.x86_64 (which I understand has 2.6.35-rc1's drm) no longer hits a NULL pointer in TTM.

Bad news is the display still stays blank after resume.  Here are the details from the log:

 kernel: PM: Syncing filesystems ... done.
 kernel: Freezing user space processes ... (elapsed 0.01 seconds) done.
 kernel: Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
 kernel: Suspending console(s) (use no_console_suspend to debug)
 kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
 kernel: sd 0:0:0:0: [sda] Stopping disk
 kernel: [drm] nouveau 0000:01:00.0: Disabling fbcon acceleration...
 kernel: [drm] nouveau 0000:01:00.0: Unpinning framebuffer(s)...
 kernel: [drm] nouveau 0000:01:00.0: Evicting buffers...
 kernel: pci 0000:00:1f.3: PCI INT C disabled
 kernel: ehci_hcd 0000:00:1d.7: PCI INT A disabled
 kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
 kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
 kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
 kernel: ehci_hcd 0000:00:1a.7: PCI INT C disabled
 kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
 kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
 kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
 kernel: [drm] nouveau 0000:01:00.0: Idling channels...
 kernel: [drm] nouveau 0000:01:00.0: Suspending GPU objects...
 kernel: HDA Intel 0000:00:1b.0: PCI INT A disabled
 kernel: HDA Intel 0000:00:1b.0: power state changed by ACPI to D3
 kernel: HDA Intel 0000:01:00.1: PCI INT A disabled
 kernel: [drm] nouveau 0000:01:00.0: And we're gone!
 kernel: nouveau 0000:01:00.0: PCI INT A disabled
 kernel: nouveau 0000:01:00.0: power state changed by ACPI to D3
 kernel: PM: suspend of devices complete after 854.583 msecs
 kernel: PM: late suspend of devices complete after 22.012 msecs
 ...
 kernel: nouveau 0000:01:00.0: power state changed by ACPI to D0
 kernel: nouveau 0000:01:00.0: power state changed by ACPI to D0
 kernel: nouveau 0000:01:00.0: power state changed by ACPI to D0
 kernel: nouveau 0000:01:00.0: power state changed by ACPI to D0
 kernel: nouveau 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
 kernel: [drm] nouveau 0000:01:00.0: POSTing device...
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 0 at offset 0xD359
 kernel: sd 0:0:0:0: [sda] Starting disk
 kernel: [drm] nouveau 0000:01:00.0: 0xD62F: Failed parsing init table opcode: INIT_ZM_I2C_BYTE -6
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 1 at offset 0xD8A4
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 2 at offset 0xE3D3
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 3 at offset 0xE408
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 4 at offset 0xE59C
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table at offset 0xE601
 kernel: [drm] nouveau 0000:01:00.0: 0xBC89: parsing output script 0
 kernel: [drm] nouveau 0000:01:00.0: 0xC078: parsing output script 0
 kernel: [drm] nouveau 0000:01:00.0: Reinitialising engines...
 kernel: [drm] nouveau 0000:01:00.0: Restoring GPU objects...
 kernel: usb 2-3: reset high speed USB device using ehci_hcd and address 2
 kernel: ata5: SATA link down (SStatus 0 SControl 300)
 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 kernel: ata2.00: configured for UDMA/133
 kernel: [drm] nouveau 0000:01:00.0: Restoring mode...
 kernel: [drm] nouveau 0000:01:00.0: 0xBC8D: parsing output script 1
 kernel: [drm] nouveau 0000:01:00.0: 0xBB00: parsing clock script 0
 kernel: [drm] nouveau 0000:01:00.0: 0xBC84: parsing clock script 1
 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 kernel: ata1.00: configured for UDMA/133
 kernel: PM: resume of devices complete after 3247.071 msecs
 kernel: Restarting tasks ... done.
 kernel: video LNXVIDEO:00: Restoring backlight state
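
The only error visible in the resume log above is the failed init table opcode. As a minimal sketch, the following scans such a log for those failures and guesses which init table each one falls in. Two assumptions are baked in and are not stated anywhere in the log: that the tables are laid out in increasing offset order, so a failing offset belongs to the last table that starts at or before it, and that the trailing -6 is a negated errno value. The names `find_init_failures`, `TABLE` and `FAILED` are made up for this example; the `LOG` lines are copied from the log above.

```python
import errno
import re

# Lines copied from the resume log above.
LOG = """\
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 0 at offset 0xD359
 kernel: sd 0:0:0:0: [sda] Starting disk
 kernel: [drm] nouveau 0000:01:00.0: 0xD62F: Failed parsing init table opcode: INIT_ZM_I2C_BYTE -6
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 1 at offset 0xD8A4
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 2 at offset 0xE3D3
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 3 at offset 0xE408
 kernel: [drm] nouveau 0000:01:00.0: Parsing VBIOS init table 4 at offset 0xE59C
"""

TABLE = re.compile(r"Parsing VBIOS init table (\d+) at offset (0x[0-9A-Fa-f]+)")
FAILED = re.compile(
    r"(0x[0-9A-Fa-f]+): Failed parsing init table opcode: (\w+) (-\d+)"
)


def find_init_failures(log):
    """Return one dict per failed opcode, with the table it most likely belongs to."""
    tables = [(int(off, 16), int(idx)) for idx, off in TABLE.findall(log)]
    failures = []
    for m in FAILED.finditer(log):
        offset, code = int(m.group(1), 16), int(m.group(3))
        # Assumption: the failing opcode sits in the last table starting at or
        # before its offset.
        started = [t for t in tables if t[0] <= offset]
        failures.append({
            "offset": offset,
            "opcode": m.group(2),
            "code": code,
            # Assumption: the code is a negated errno value.
            "errno_name": errno.errorcode.get(-code, "unknown"),
            "table": max(started)[1] if started else None,
        })
    return failures


failures = find_init_failures(LOG)
for f in failures:
    print(hex(f["offset"]), f["opcode"], f["code"], f["errno_name"], "table", f["table"])
```

Under those assumptions the single failure at 0xD62F lands in init table 0, between that table's start at 0xD359 and the start of table 1 at 0xD8A4.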

Comment 7 Ben Skeggs 2010-06-06 23:03:02 UTC

(In reply to comment #6)
> (In reply to comment #5)
> 
> Bad news is the display still stays blank after resume.  Here are the details
> from the log:
> 
>  kernel: [drm] nouveau 0000:01:00.0: 0xD62F: Failed parsing init table opcode:
> INIT_ZM_I2C_BYTE -6
This is likely the reason why.  I'd say there's more setup for your card in that init table that we're skipping because INIT_ZM_I2C_BYTE fails.  Can you please file a new bug report against rawhide to track this issue?

It'd be great if you could include your dmesg output after a suspend/resume with "drm.debug=14 log_buf_len=1M" appended to your boot options, as well as a vbios image (I may want vbtracetool traces later too, but we'll discuss that when you open a new bug).

Thanks!
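
Before capturing the requested dmesg, it may be worth confirming that both boot options actually took effect. A small sketch, assuming the running kernel's command line is available as text (on a live system it can be read from /proc/cmdline): the helper name `missing_boot_options` and the sample command line are hypothetical, and only the two option strings come from the request above.

```python
# The two options requested above.
WANTED = ("drm.debug=14", "log_buf_len=1M")


def missing_boot_options(cmdline, wanted=WANTED):
    """Return the requested options that do not appear in the kernel command line."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]


# Hypothetical command line for illustration; on a real system use:
#     cmdline = open("/proc/cmdline").read()
example = "ro root=/dev/mapper/vg-root rhgb quiet drm.debug=14"
still_missing = missing_boot_options(example)
print(still_missing)
```

An empty list means both options are present and the dmesg is worth attaching.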

Comment 8 Mike Snitzer 2010-06-07 02:03:43 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Cc'ing Jerome given his recent ttm rework that is now upstream in 2.6.35.  The
> > following commits remove ttm_bo_pci_offset entirely:
> > http://git.kernel.org/linus/0c321c7962718
> > http://git.kernel.org/linus/82c5da6bf8b55
> > 
> > I'll see if I can test these bits (via rawhide or applying by hand) relative to
> > this bug.    
> 
> Good news is rawhide's 2.6.34-20.fc14.x86_64 (which I understand has
> 2.6.35-rc1's drm) no longer hits a NULL pointer in TTM.

Unfortunately, it seems a comparable NULL pointer still exists w/ 2.6.34-20.fc14.x86_64, see:
https://bugzilla.redhat.com/show_bug.cgi?id=601002#c4

But it appears harder to hit: rather than always hitting it on resume, as with F13's kernel(s), 1 out of 3 resumes hit it with rawhide's 2.6.34-20.fc14.x86_64.

Comment 9 Ben Skeggs 2010-07-14 04:55:04 UTC

Mike, can you give the kernel at http://koji.fedoraproject.org/koji/buildinfo?buildID=183346 a try and see if there's any change?

Comment 10 Fedora Update System 2010-08-07 05:00:55 UTC

kernel-2.6.34.2-34.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/kernel-2.6.34.2-34.fc13

Comment 11 Fedora Update System 2010-08-07 23:28:53 UTC

kernel-2.6.34.2-34.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/kernel-2.6.34.2-34.fc13

Comment 12 Fedora Update System 2010-08-10 23:53:55 UTC

kernel-2.6.34.3-37.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/kernel-2.6.34.3-37.fc13

Comment 13 Fedora Update System 2010-08-11 07:26:15 UTC

kernel-2.6.34.3-37.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/kernel-2.6.34.3-37.fc13

Comment 14 Chuck Ebbert 2010-08-18 09:48:51 UTC

2.6.34 kernel has been withdrawn.

Comment 15 Fedora Update System 2010-08-27 11:23:35 UTC

kernel-2.6.34.6-47.fc13 has been submitted as an update for Fedora 13.
https://admin.fedoraproject.org/updates/kernel-2.6.34.6-47.fc13

Comment 16 Fedora Update System 2010-08-30 18:22:21 UTC

kernel-2.6.34.6-47.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.
