Bug 609764
Summary: | Unpredictable hang | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Mike Pope <mpope> | ||||
Component: | xorg-x11-drv-nouveau | Assignee: | Ben Skeggs <bskeggs> | ||||
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | low | ||||||
Version: | 13 | CC: | airlied, bloch, bskeggs, bulk, chemobejk, elad, gokcen.eraslan, jim, jkrupka, ossman, rnovacek, tbzatek | ||||
Target Milestone: | --- | Keywords: | Triaged | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | [cat:lockup] | ||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Last Closed: | 2011-06-27 18:02:49 UTC | Type: | --- | ||||
Attachments: |
|
I've just updated to kernel 2.6.33.6-147.fc13.x86_64 and the problem got *MUCH* worse. Basically the machine has become unusable, i.e. within seconds or minutes from login into my desktop session the X server freezes up.

/var/log/messages entries from the last 3 freezes:

Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2
Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - Ch 2/5 Class 0x8297 Mthd 0x1288 Data 0x00000000:0x00042050
Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - INVALID_VALUE
Jul 9 10:39:51 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2
Jul 9 10:39:51 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - Ch 2/5 Class 0x8297 Mthd 0x1458 Data 0x00000000:0x00104280
Jul 9 10:39:51 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - unknown value 0x00000003
Jul 9 10:47:47 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2

lspci:

02:00.0 VGA compatible controller [0300]: nVidia Corporation G96 [Quadro FX 580] [10de:0659] (rev a1)

Kernel command line:

ro root=/dev/mapper/VolGroup00-LogVol00 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=fi intel_iommu=off

i.e. I've not disabled VT-d in the BIOS, just in the kernel. This HW doesn't support PCIe ASPM, so I didn't disable it:

# dmesg | fgrep -i aspm
ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

Package versions:

xorg-x11-server-Xorg-1.8.2-1.fc13.x86_64
xorg-x11-drv-nouveau-0.0.16-7.20100423git13c1043.fc13.x86_64
mesa-libGL-7.8.1-6.fc13.x86_64
libdrm-2.4.21-2.fc13.x86_64
kernel-2.6.33.6-147.fc13.x86_64

Not sure if any of this is relevant:

- Display is a 30" Dell 2560x1600 LCD, connected via Dual-Link DVI
- I'm using KDE4 desktop with XRender desktop effects enabled in kwin4 (OpenGL doesn't work yet).

The last 3 freezes happened when some windowing operation was ongoing, i.e. menu appearing.
I'm now going back to kernel-2.6.33.5-124.fc13.x86_64, as this was at least stable for more than 2 weeks....

Got at least once the same hang on my other nVidia system (CLEVO laptop):

Jul 10 15:57:12 localhost kernel: [drm] nouveau 0000:01:00.0: PFIFO_DMA_PUSHER - Ch 2

lspci:

01:00.0 VGA compatible controller [0300]: nVidia Corporation G94 [GeForce 9800M GTS] [10de:062c] (rev a1)

Kernel command line:

ro root=/dev/mapper/VolGroup00-root rhgb quiet fastboot SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=de-latin1 pcie_aspm=off

This system supports PCIe ASPM, which I disabled due to bug #566987. It doesn't support VT-d, so intel_iommu=off won't help. The system was up about 20 hours, so the problem is not as bad as on the other system. If it happens again, I'll drop back to -124 on this system too.

Just got the second crash on the CLEVO laptop after about 2 hours of usage. Back to -124 on that system too...

Bahh, now I got the same crash with 2.6.33.5-124.fc13.x86_64 too on the CLEVO laptop :-(

Bug reporters: Please don't vote down kernel updates that fix other people's bugs just because they don't fix yours. 2.6.33.6-147.2.4.fc13 is not expected to fix this bug, but that's no reason to block that minor update from being pushed to stable. Also, has anyone tried the 2.6.34.1-29.fc13 kernel from koji?

Hi Chuck, I tried the first of the 2.6.34.1 and I had no issues.

Mixed results with kernel-2.6.34.1-29.fc13.x86_64:

- laptop: seems to work, but to be sure I have to wait 2 weeks to see if the bug hits or not.
- desktop: kernel boot fails. With "rhgb quiet" removed from the kernel boot command line I see that the kernel startup stops after KMS, USB devices and the USB card reader devices have been initialized. After a few seconds the message "Boot has failed, sleeping forever." appears on the console (reported as bug #620313).

OK, after finding a workaround for bug #620313 I'm able to run kernel-2.6.34.1-29.fc13.x86_64 on the desktop machine.
It froze less than 10 minutes into my X session with the usual errors:

Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP - Ch 2/5 Class 0x8297 Mthd 0x15e0 Data 0x00000000:0x00000000
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - VM: Trapped read at 0000000100 status 00000500 00000000 channel 2
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 000100 warp 8, opcode 00000000 00000000
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 000100 warp 1, opcode 00000000 00000000
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_MP - TP1: Unhandled ustatus 0x00020000
Aug 3 10:31:18 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2

Please try the new kernel generated for bug #602956.

This seems to work fine for me on my Dell T3500.

.... and froze up 1 minute later :-(

But with the new kernel there is now a new message filling up the log after the freeze:

Aug 6 14:37:21 salit23 kernel: Linux version 2.6.34.2-34.fc13.x86_64 (mockbuild@x86-13.phx2.fedoraproject.org) (gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) ) #1 SMP Thu Aug 5 22:43:35 UTC 2010
....
Aug 6 14:53:21 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2
Aug 6 14:53:21 salit23 kernel: [drm] nouveau 0000:02:00.0: nv50cal_space: -16
Aug 6 14:53:21 salit23 kernel: [drm] nouveau 0000:02:00.0: nv50cal_space: -16
Aug 6 14:53:21 salit23 kernel: [drm] nouveau 0000:02:00.0: nv50cal_space: -16
....

My Xorg.0.log doesn't show a crash like bug #602956 does.

Just got a hang with kernel-2.6.34.1-29.fc13.x86_64 on the laptop after roughly a week of usage. Going to switch to the new F13 .34 kernel that has been pushed to testing.
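
For anyone comparing freezes across kernels, the log excerpts quoted in this report can be tallied mechanically. The following is only a rough sketch, not part of the original report: the function name `summarize_nouveau` and the regular expression are assumptions made for illustration, and the pattern is fitted to the syslog line layout seen in the comments above, so it may need adjusting for other syslog configurations.

```python
import re
from collections import Counter

# Fitted to lines like the ones quoted in this report, e.g.
#   Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2
# The layout is an assumption based on those excerpts only.
NOUVEAU_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[drm\]\s+nouveau\s+(?P<pci>[0-9a-f:.]+):\s+(?P<event>[A-Za-z0-9_]+)"
)


def summarize_nouveau(lines):
    """Count nouveau events by name and collect PFIFO_DMA_PUSHER timestamps.

    Returns (counts, pusher_times): a Counter keyed by event name and a list
    of distinct timestamps, in first-seen order, at which PFIFO_DMA_PUSHER
    was logged.
    """
    counts = Counter()
    pusher_times = []
    for line in lines:
        match = NOUVEAU_RE.match(line.strip())
        if match is None:
            continue  # not a nouveau [drm] line in the expected layout
        event = match.group("event")
        counts[event] += 1
        stamp = match.group("stamp")
        if event == "PFIFO_DMA_PUSHER" and stamp not in pusher_times:
            pusher_times.append(stamp)
    return counts, pusher_times


if __name__ == "__main__":
    sample = [
        "Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2",
        "Jul 9 10:17:26 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - INVALID_VALUE",
        "Aug 6 14:53:21 salit23 kernel: [drm] nouveau 0000:02:00.0: nv50cal_space: -16",
    ]
    counts, pusher_times = summarize_nouveau(sample)
    print(dict(counts))
    print(pusher_times)
```

Feeding it the lines of /var/log/messages (for example via `open(...)`) would give a per-event count and the moments a PFIFO_DMA_PUSHER error was logged, which is the message that accompanied every hang reported here.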
As pointed out in this bug report:

<https://bugzilla.redhat.com/show_bug.cgi?id=602956#c14>

GNOME and KDE will exercise the driver code in different ways. Maybe this is relevant for this bug report too. I'm running on both systems KDE4 with kwin4, using XRender composite. I have mesa-dri-drivers-experimental installed, but the OpenGL offered by that driver isn't good enough to be accepted by kwin4 yet. As far as I can tell all my lockups happened with popup windows (window menus, taskbar menus).

Just another "me too" comment, running gnome and compiz with packages:

xorg-x11-server-Xorg-1.8.2-3.fc13.i686
xorg-x11-drv-nouveau-0.0.16-7.20100423git13c1043.fc13.i686
mesa-libGL-7.8.1-8.fc13.i686
libdrm-2.4.21-2.fc13.i686
kernel-PAE-2.6.33.6-147.fc13.i686
mesa-dri-drivers-experimental-7.8.1-8.fc13.i686

and the error message that showed up was:

kernel: [drm] nouveau 0000:01:00.0: PFIFO_DMA_PUSHER - Ch 2

Any news on this bug? Are the 2.6.34 kernels working any better?

Just had another lock-up with the same error message in /var/log/messages:

kernel: [drm] nouveau 0000:01:00.0: PFIFO_DMA_PUSHER - Ch 2

packages:

kernel: 2.6.34.6-54.fc13.i686.PAE
xorg-x11-drv-nouveau-0.0.16-8.20100423git13c1043.fc13.i686
xorg-x11-server-Xorg-1.8.2-3.fc13.i686
mesa-dri-drivers-experimental-7.8.1-8.fc13.i686

Just tried kernel-2.6.34.7-59.fc13.x86_64.rpm from Koji as it promises a fix for a race condition and better error handling.
No luck, got stuck within a minute after logging into my KDE session:

Ch 2/1 Mthd 0x0000 Data 0x017a0a62
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_CACHE_ERROR - Ch 2/1 Mthd 0x0000 Data 0x0004b5e0
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_CACHE_ERROR - Ch 2/1 Mthd 0x0000 Data 0x00044110
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_CACHE_ERROR - Ch 2/1 Mthd 0x0000 Data 0x00042050
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2 Get 0x0020023cb4 Put 0x0020023cb8 IbGet 0x00000f05 IbPut 0x0000022f State 0x8000af04 Push 0x00406040
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2 Get 0x0020023cbc Put 0x0020023cc0 IbGet 0x00000f07 IbPut 0x00000231 State 0x8000b35c Push 0x00406040
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2 Get 0x0020051778 Put 0x0020053f34 IbGet 0x00000f08 IbPut 0x00000233 State 0x80000000 Push 0x00406040
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 2 Get 0x0020023ccc Put 0x0020023cd0 IbGet 0x00000f0b IbPut 0x00000233 State 0x80004244 Push 0x00406040
Oct 1 14:05:07 salit23 kernel: [drm] nouveau 0000:02:00.0: PGRAPH_TRAP_CCACHE_FAULT - 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Back to the old .33 kernel. That means that I'm stuck on F13 for this machine :-(

(In reply to comment #15)
> Back to the old .33 kernel. That means that I'm stuck on F13 for this machine
> :-(

Which is the last known kernel to work for you, and the first known to fail?

It started with F13 and the DMAR problem fix (bug #561267). I don't remember any hangs before that.

kernel-2.6.33.5-124.fc13.x86_64 is fairly stable on my desktop machine (Dell T3500, G96? PCI 10de:0659), i.e. I can use it for two weeks without running into the problem. As I usually drag in security updates during this time and reboot the machine this is acceptable.
FYI: on that machine I use intel_iommu=off and it doesn't have PCIe ASPM.

Every newer kernel than the above freezes up reliably within 30 minutes after login into my KDE desktop. Usually it happens on window or menu open/close operations.

I have seen the same problem on my laptop (CLEVO M860TU, G94 [GeForce 9800M GTS] [10de:062c]). But a lockup is very rare, so I don't really care. This machine I have already upgraded to F14Beta with kernel 2.6.35.

Stefan,

Can you "mount -t debugfs debugfs /sys/kernel/debug" and then "cat /sys/kernel/debug/dri/0/channel/1" a few times in a row. Do you see cur/put (and friends) changing regularly? When you do that, make sure you're sitting in X (and not at a vt) and your desktop is otherwise idle.

If you see those values increasing, can you try booting with "nouveau.nofbaccel=1", and see if your random X hangs disappear? The above option will disable fbcon acceleration, but leave X accelerated.

I checked /sys/kernel/debug/dri/0/channel/ 1 & 2 & 3 while inside my idle X session a few times while keeping the mouse cursor inside the same window. Only channel 2 shows any changes at all. Channel 1 & 3 don't change at all (as long as I don't use any other window).

I'll try nouveau.nofbaccel=1 anyway, just to be sure.

kernel-2.6.34.7-59.fc13.x86_64 still freezes up within 10-20 seconds after the KDE splash screen with nouveau.nofbaccel=1. So it doesn't help.

Just out of curiosity: FIFO channel 3 is for OpenGL/Mesa? If yes, would maybe removing mesa-dri-drivers-experimental from the system, i.e. nouveau_dri.so, make sure that there is no traffic on channel 3? *THAT* is one of the differences between F12 -> F13.
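
The "cat the channel file a few times in a row" check requested above can be automated. This is a minimal sketch under stated assumptions: the helper name `channel_changes` and its parameters are hypothetical, it assumes debugfs is already mounted at /sys/kernel/debug and that the script runs with enough privileges to read it, and it only compares raw file contents because the layout of the channel file is not described in this report.

```python
import time
from pathlib import Path


def channel_changes(path, samples=5, interval=0.5, read=None):
    """Read `path` several times and report whether its contents changed.

    `read` is an optional callable taking the path and returning text; it
    exists only so the check can be tried without a real debugfs file.
    Returns True if at least two of the snapshots differ.
    """
    reader = read if read is not None else (lambda p: Path(p).read_text())
    snapshots = []
    for i in range(samples):
        snapshots.append(reader(path))
        if i + 1 < samples:
            time.sleep(interval)  # give the channel time to advance
    return len(set(snapshots)) > 1


if __name__ == "__main__":
    # Channel numbers 1-3 are the ones mentioned in this report.
    for channel in (1, 2, 3):
        path = f"/sys/kernel/debug/dri/0/channel/{channel}"
        try:
            state = "changing" if channel_changes(path) else "static"
        except OSError as exc:
            state = f"unreadable ({exc})"
        print(f"channel {channel}: {state}")
```

As with the manual check, it should be run from inside an otherwise idle X session rather than from a vt, since activity in other windows would show up as changes too.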
To summarize my tests I did with kernel-2.6.34.7-59.fc13.x86_64 today:

a) nouveau.nofbaccel=1

Didn't help, froze up after 10-20 seconds into the KDE session

b) (a) + removal of mesa-dri-drivers-experimental

Didn't help, froze up after about 30 minutes (probably just luck that it didn't freeze up earlier)

c) (b) + disabling KDE desktop effects completely, i.e. disabling composite using XRender

No hangs for several hours. Looks like it is rock-solid

Only problem is that without composite X gets real slow on a 2560x1600 screen and long window/CPU operations can block the UI. I have also seen some minor drawing artifacts.

I'll stay on the new kernel for now. On the next reboot I'll remove nofbaccel and reinstall the nouveau mesa drivers again.

The one thing I wanted to mention too was that at least in the last lockups there was always an active video stream (mplayer with XV or OpenGL as renderer). Maybe that interferes with XRender composite?

(In reply to comment #21)
> c) (b) + disabling KDE desktop effects completely, i.e. disabling composite
> using XRender
>
> No hangs for several hours.
> Looks like it is rock-solid

In this configuration nouveau doesn't seem to hang, but it now can crash :-(

[ 99098.456] (II) NOUVEAU(0): EDID for output DP-2
[255347.318] Backtrace:
[255347.444] 0: /usr/bin/Xorg (xorg_backtrace+0x28) [0x460d18]
[255347.455] 1: /usr/bin/Xorg (0x400000+0x63509) [0x463509]
[255347.455] 2: /lib64/libc.so.6 (0x7fb5c9c3c000+0x32a20) [0x7fb5c9c6ea20]
[255347.455] 3: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7fb5c7e55000+0x9cdc) [0x7fb5c7e5ecdc]
[255347.455] 4: /usr/bin/Xorg (0x400000+0x117210) [0x517210]
[255347.455] 5: /usr/lib64/xorg/modules/extensions/libextmod.so (0x7fb5c8f2f000+0x12d4d) [0x7fb5c8f41d4d]
[255347.455] 6: /usr/bin/Xorg (0x400000+0x2dbdc) [0x42dbdc]
[255347.455] 7: /usr/bin/Xorg (0x400000+0x2189a) [0x42189a]
[255347.455] 8: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x7fb5c9c5ac5d]
[255347.455] 9: /usr/bin/Xorg (0x400000+0x21449) [0x421449]
[255347.465] Segmentation fault at address 0x7fb5c1fec000
[255347.465] Fatal server error:
[255347.465] Caught signal 11 (Segmentation fault). Server aborting
[255347.465]
[255347.466] Please consult the Fedora Project support at http://bodhi.fedoraproject.org/ for help.
[255347.466] Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[255347.466]
[255347.514] (II) Power Button: Close

Once again, during this crash mplayer was showing a video using XV.

Upgraded the Dell T3500 to F14 today, enabled KDE Desktop Effects using XRender and it froze up after a short while :-( So this bug should be moved from F13 to F14.

On the other hand: the new mesa 7.9 seems to have enough improvements on the OpenGL side that KDEs kwin now accepts it for its Desktop Effects. I have seen some strange drawing errors (once again probably related to XV or Video-via-OpenGL) but no lockups yet. Let's see if it is usable enough in the long term.
(In reply to comment #23)
> Upgraded the Dell T3500 to F14 today, enabled KDE Desktop Effects using XRender
> and it froze up after a short while :-( So this bug should be moved from F13 to
> F14.
>
> On the other hand: the new mesa 7.9 seems to have enough improvements on the
> OpenGL side that KDEs kwin now accepts it for its Desktop Effects. I have seen
> some strange drawing errors (once again probably related to XV or
> Video-via-OpenGL) but no lockups yet. Let's see if it is usable enough in the
> long term.

I suspect the drawing errors very probably occur where you used to see hangs, you should see DMA_PUSHER error messages in dmesg after these corruptions occur if this is the case.

A little summary:

kernel-2.6.35.6-48.fc14.x86_64
mesa-dri-drivers-experimental-7.9-1.fc14.x86_64
xorg-x11-drv-nouveau-0.0.16-11.20100826git065576d.fc14.x86_64

After continuously running F14 mesa + KDE desktop effects on OpenGL (the whole shebang: desktop cube, wobbly windows, transparency, animations, etc.) for about 10 days:

- ZERO(!) hangs
- ZERO crashes
- zero PFIFO_DMA_PUSHER messages in /var/log/messages
- smaller problems with stuff disappearing on redraw. Nothing major, usually appears again when clicking on window (it is after all called mesa-dri-drivers-experimental :-)
- No problems when using "mplayer -vo xv"
- App with the most annoying redraw artefacts: XEmacs editor (but maybe that's just because I use it extensively on this machine. On the other hand it might be a good test case, maybe it does something nasty with X11)

I have not seen the drawing errors from my initial attempt, which were probably caused by using "mplayer -vo gl2". Maybe nouveau OpenGL support isn't good enough yet to handle video surface drawing.

As both non-Composite and OpenGL-Composite work without problems on this box I would conclude that the hangs are related to XRender acceleration/composite.

This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I can no longer reproduce this bug due to the hardware required dying a few months ago. My general impression though is that the bug was no longer happening for me as of late 2010. Therefore I have no objection to this report being closed.

I still have the hardware, but I haven't seen any of these lockups in some time. I've been on F14 and not F13 for a while, but with the last couple of F14 kernel releases the problem seems to have gone away.

Closing then.

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates.
As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed. |
Created attachment 428154 [details]
All nouveau messages since boot + Xorg.0.log

Description of problem:
Machine hangs (hard locked, power cycle required) after unpredictable amounts of time, from about an hour to 5 days.

Version-Release number of selected component (if applicable):
xorg-x11-drv-nouveau-0.0.16-6.20100423git13c1043.fc13.x86_64 + kernel-2.6.33.5-124.fc13.x86_64 and all predecessors in F13. Now testing with xorg-x11-drv-nouveau-0.0.16-7.20100423git13c1043.

How reproducible:
Always, unless another bug (the machine also suffers from #566987) requires a reboot first.

Steps to Reproduce:
1. Boot machine
2. Wait

Actual results:
Machine hangs.

Expected results:
Machine does not hang.

Additional info:
Kernel options pcie_aspm=off and/or intel_iommu=off have no effect. VT-d may be implicated, currently testing with it enabled in BIOS.