Bug 1096989
Description
Morgan Leijström
2014-05-12 22:03:02 UTC
Created attachment 894896 [details]
lspci -v from R61
We can't do anything about issues using the nvidia driver. Perhaps the nouveau maintainers will have some insight into the issues with that one.
As the behaviour is so similar on both nouveau and nvidia, I guess it is the same cause. If the problem in both cases could be fixed by some change in the kernel or some setting anywhere, that would be great. If we find a problem in nouveau, also great, and I hope nvidia can fix the same issue in theirs.
I will switch the thinkpad R61 back to nouveau. Tell me if you want me to test something or grab a log, whatever.
Created attachment 895052 [details]
lsmod from R61 running nvidia
Any idea how to track this, how to get any usable log from something? kdump does not work when it actually happens, only when I test manually.
I get some messages in /var/log/messages, but do not know if they are related. It is also strange that two computers report a "hardware error", so I presume it is rather the kernel or driver that does not suit this hardware... or many computers have a design error.
2014-05-20 10.09.45 bamse mcelog Error overflow
2014-05-20 10.09.45 bamse mcelog Uncorrected error
2014-05-20 10.09.45 bamse mcelog Error enabled
2014-05-20 10.09.45 bamse mcelog MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
2014-05-20 10.09.45 bamse mcelog BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
2014-05-20 10.09.45 bamse mcelog failure that caused IERR
2014-05-20 10.09.45 bamse mcelog Hardware event. This is not a software error.
2014-05-20 10.09.45 bamse mcelog Error overflow
2014-05-20 10.09.45 bamse mcelog Uncorrected error
2014-05-20 10.09.45 bamse mcelog Error enabled
2014-05-20 10.09.45 bamse mcelog MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
2014-05-20 10.09.45 bamse mcelog BQ_DCU_READ_TYPE BQ_ERR_AERR2_TYPE BQ_ERR_AERR2_TYPE
2014-05-20 10.09.45 bamse mcelog received parity error on response transaction
2014-05-20 10.09.45 bamse mcelog Hardware event. This is not a software error.
2014-05-20 10.09.45 bamse rpcbind Cannot open '/var/lib/rpcbind/rpcbind.xdr' file for reading, errno 2 (No such file or directory)
2014-05-20 10.09.45 bamse rpcbind Cannot open '/var/lib/rpcbind/portmap.xdr' file for reading, errno 2 (No such file or directory)
2014-05-20 10.09.45 bamse mcelog Error overflow
2014-05-20 10.09.45 bamse mcelog Uncorrected error
2014-05-20 10.09.45 bamse mcelog Error enabled
2014-05-20 10.09.45 bamse mcelog MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
2014-05-20 10.09.45 bamse mcelog BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
2014-05-20 10.09.45 bamse mcelog received parity error on response transaction
Created attachment 898239 [details]
photo of crash screen saying: nouveau PBUS MMIO write fault
Before comment #5 I did a full system reinstall (keeping home), so there are no traces of nvidia. Only nouveau.
As before, kdump fails to work when this bug happens (which deserves its own bug).
However, now I sometimes get a text screen after reboot, and it stays up a minute or two before it automatically reboots. I attach a photo; slightly below the middle you see two lines like
nvidia E[ PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00fd94
Strangely, while I am writing this I see that a nouveau cache error has popped up in /var/log/messages. I do not know if it is related, and I perceive no problem:
May 22 08:58:52 bamse kernel: [ 2103.113955] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201
May 22 08:58:52 bamse kernel: nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201
I also have Bug 1093959 that is possibly related: abrt report kernel crashed.
I finally got kdump to grab a vmcore on a resume crash :)
Because of the size it is on dropbox: https://dl.dropboxusercontent.com/u/35922960/Morgan/fc20/kdump/127.0.0.1-2014.05.22-12%3A32%3A01/vmcore
Note there is also a zero length file "vmcore-dmesg-incomplete.txt"
uname -a
Linux bamse 3.14.4-200.fc20.x86_64 #1 SMP Tue May 13 13:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Tell me if there is anything more I can log or check to help.
I found a tip for making suspend work for fc16 on a sister machine: http://www.thinkwiki.org/wiki/Installing_Fedora_8_on_a_T61#Suspend_to_RAM :
Create /etc/pm/config.d/unload_modules , containing:
SUSPEND_MODULES="ehci_hcd ohci_hcd"
But now I had a crash while using the computer, and kdump failed to get a vmcore :( It may be unrelated, as we had one crash before during use (not resuming).
I will switch from KDE to cinnamon for a while just to try...
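To judge whether the PBUS and PFIFO messages quoted above are frequent or rare, they could be tallied per GPU unit. The following is only a minimal sketch: the line layout is assumed from the three messages quoted in this comment, and the function names `parse_gpu_error` and `count_by_unit` are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the kernel messages quoted above:
#   "<driver> E[ <UNIT>][<pci address>] <message>"
ERR_RE = re.compile(r"(\w+) E\[\s*(\w+)\]\[([0-9a-fA-F:.]+)\] (.*)$")

def parse_gpu_error(line):
    """Split one error line into driver, unit, PCI address and message."""
    m = ERR_RE.search(line.rstrip())
    if not m:
        return None
    driver, unit, pci, message = m.groups()
    return {"driver": driver, "unit": unit, "pci": pci, "message": message}

def count_by_unit(lines):
    """Count matching error lines per GPU unit (PBUS, PFIFO, ...)."""
    counts = Counter()
    for line in lines:
        rec = parse_gpu_error(line)
        if rec:
            counts[rec["unit"]] += 1
    return counts

# The three lines quoted in the comment above, copied as written there.
sample = [
    "May 22 08:58:52 bamse kernel: [ 2103.113955] nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201",
    "May 22 08:58:52 bamse kernel: nouveau E[ PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201",
    "nvidia E[ PBUS][0000:01:00.0] MMIO write of 0x00000000 FAULT at 0x00fd94",
]

for unit, n in count_by_unit(sample).items():
    print(unit, n)
```

On a real system the lines would come from /var/log/messages or from the output of journalctl -k instead of the sample list.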
Tried other DEs without noticeable change:
cinnamon
cinnamon-software rendering
mate
For some reason kdump does not catch anything, although I see it is active before the system crashes:
$ systemctl status kdump
kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; disabled)
Active: active (exited) since tor 2014-05-22 22:10:20 CEST; 8min ago
Process: 914 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
Main PID: 914 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/kdump.service
I have done many tests and it seems to crash also with not so many programs loaded; possibly it just crashes more easily when more programs are loaded before suspend-resume. Running firefox with facebook seems to make it crash more easily, but not always.
Unfortunately we cannot use hibernate as an alternative, as that crashes: Bug 1100978
It does not seem to be related in other ways than that it is the same machine...
Thank you, whoever / whatever changed. We have not had a resume crash in days since some updates.
No idea what update fixed it though, as it did not crash every resume (so it may also be that we were just fortunate the last days...).
I'll attach lists of installed packages and yum history if anyone wants to dig.
I have tested both before and now on both thinkpad T61 and R61 (by swapping disks).
Another test I did was to install fedora 16 on another disk, and there suspend, hibernate and resuming from them all work perfectly after a default install without any setup, in both gnome and KDE. I used a separate ext4 /boot, then an LVM with /, /home, swap. No encryption.
So resuming from suspend was a regression (hopefully fixed now). Hibernation is still a regression; I will update that bug 1100978.
Created attachment 901429 [details]
installed packages; output of rpm -qa
Created attachment 901430 [details]
recent update history from yum
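On the kdump question raised above: besides systemctl status, the kernel itself reports whether a crash kernel is loaded, through /sys/kernel/kexec_crash_loaded (contains 1 when loaded). Below is a minimal sketch of that check plus a check for a crashkernel= option on a kernel command line string. The function names are made up for illustration, and whether either check explains the missing vmcore on these machines is not known.

```python
from pathlib import Path

def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """Return True/False from the kernel's own flag, or None if unreadable.

    None means the file is missing or not readable, for example on a
    kernel without kexec support or on a non-Linux system.
    """
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        return None

def has_crashkernel_param(cmdline):
    """True if a kernel command line string contains a crashkernel= option."""
    return any(word.startswith("crashkernel=") for word in cmdline.split())

print("crash kernel loaded:", crash_kernel_loaded())
```

The command line string to pass to `has_crashkernel_param` would normally be the content of /proc/cmdline. Note also that the status output quoted earlier shows the unit as "disabled", which means it is not set to start automatically at boot, so it may be worth repeating the check right before a suspend test.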
Unfortunately I celebrated too soon: of about a dozen suspends in two days of casual use, I got two crashes on resume = still a real problem.
I found messages on the net saying fedora 16 can suspend and hibernate, so I started installing and testing fc16, 17, 18, 19, 20. Here is the result:
__________TEST RESULTS
__fedora 16
suspend-resume: perfect
hibernate: perfect
__fedora 17, 18, 19, work identically in this regard:
suspend-resume: perfect.
At suspend and hibernate, I see a glimpse of a text screen with nouveau PBUS MMIO write fault lines, similar to fc20, so it seems like that is not a problem. (On fedora 17 I saw something about an interrupt not handled too.)
hibernate: half broken: saves, then the moon lamp starts blinking, power remains on. << major problem but not critical
Need manual power off. At power on it restores OK; work saved.
HOWEVER, (only tested on 20) when /home and swap are in an encrypted LVM, it is Completely broken: At power on it reboots like normal, not using the saved state = Work lost! <<<<<<<<< Critical!
__________TEST SETUP
All are fresh installs to SSD using the x86_64 DVD iso (selecting KDE + office, tools, applications).
Partitioning: a separate /boot, then an LVM with /, /home, swap.
Reboot, full update, reboot. No tinkering.
Install gkrellm, firefox, flash.
Launch gkrellm, gimp, scribus, dolphin, konqueror, firefox with some pages incl flash, all calligra{author,flow,sheets,stage,words} and libreoffice{calc,draw,impress,writer}
...Then try suspend and hibernation.
16 and 20 are more tested than the others: more suspend cycles, and also gnome and cinnamon.
On 20 I also tested other partitioning, including plain partitions (no LVM nor LUKS), and /home and swap in encrypted LVM (this breaks restoring from hibernation).
On 20 kdump is installed, and once grabbed a dump, see bug.
Machines:
Tested on laptop Lenovo thinkpad T61, for 16 & 20 also thinkpad R61. Both are dual core, with nvidia quadro graphics, running the default nouveau driver.
Tests were done on an SSD drive reserved for testing purposes during this period. Normally other drives, for normal use with real user production data, are in the T61 (mageia 4, same problem as fc20), and the R61 uses another SSD with fc20 (was mageia 4). On those systems there is an encrypted LVM containing /home and swap; same problems.
Created attachment 902334 [details]
xsession-errors_vesa-display-froze-after-resume_system-is-alive
I believe the graphics driver is not to blame: resume fails with nvidia, nouveau, and now also with vesa, which I tested as well - vesa fails *every* resume.
Using vesa, the system does not reboot, but the screen is frozen, showing the image of the login dialog (that is prepared during suspend). However, the mouse pointer moves; I guess it is "hardware accelerated".
I see by the disk LED that the session reacts to keys (I tried Alt-F4).
Also, I could switch to a text console (Ctrl-Alt-Fx) and grab .xsession-errors and Xorg.0.log. However, I do not find anything interesting there nor in journalctl.
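When looking through Xorg.0.log by eye turns up nothing, one option is to filter it for the standard Xorg markers (EE) for errors and (WW) for warnings. The following is a minimal sketch; the function name `xorg_problems` and the sample lines are made up for illustration and are not taken from the attached log.

```python
import re

# Xorg tags each log line with a marker: (EE) error, (WW) warning.
# The legend line near the top of the log lists the markers too,
# so that line also matches and can be ignored.
MARKER_RE = re.compile(r"\((EE|WW)\)")

def xorg_problems(lines):
    """Return (marker, line) pairs for lines carrying (EE) or (WW)."""
    found = []
    for line in lines:
        m = MARKER_RE.search(line)
        if m:
            found.append((m.group(1), line.rstrip()))
    return found

# Made-up sample lines, only to show the shape of the input.
sample = [
    "[    21.512] (II) made-up informational line",
    "[    21.600] (WW) made-up warning line",
    "[    21.700] (EE) made-up error line",
]

for marker, text in xorg_problems(sample):
    print(marker, text)
```

For the real file, the lines could be read with open("/var/log/Xorg.0.log") and passed to the same function.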
Created attachment 902336 [details]
Xorg.0.log_when-vesa-display-froze-after-resume_system-is-alive
So all three tried drivers fail on resume, differently.
Nvidia: out of memory, reboot
nouveau: write fault, reboot
vesa: screen freezes and the system seems unaware anything is wrong
I believe it is not a driver fault; I think the resume process is broken.
Who shall we cc on resume issues?
Created attachment 902346 [details]
journalctl -k after failing to resume the session, after rebooting after the last attachments.
?!? nouveau seems to still be around, but not like before, neither in the Xorg log nor in the result...
In order to remove nouveau I uninstalled the "nouveau" package and rebooted, and observed in the Xorg log that it could not find nouveau. So I thought that was OK.
But now I observe that journalctl -k reports hundreds of errors from nouveau?!
Can someone point me how to remove nouveau cleanly?
(I am a beginner on fedora - mainly using mageia, which has GUI tools for this, but with about the same problem: resuming sometimes crashes on the same machines since a year or so)
Finally I have a crash dump of this :) (I think. It is not entirely clear how it crashed; I sat down at my wife's computer and realised it asked for the encryption key, and once that was given, it said kdump saved a vmcore.)
Too large to attach, so it is here: https://dl.dropboxusercontent.com/u/35922960/Morgan/fc20/kdump/127.0.0.1-2014.06.20-08%3A37%3A51/vmcore
kernel 3.14.8-200.fc20.x86_64
> Can someone point me how to remove nouveau cleanly?
Not sure what the real problem is. To remove nouveau you can add it to the blacklist:
add a file /etc/modprobe.d/nouveau.conf with content "blacklist nouveau"
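After adding such a file and rebooting, it may be worth confirming both that the blacklist line is really in place and that the module is no longer loaded, since kernel messages from nouveau were still seen after removing the package. A minimal sketch follows, assuming the usual /proc/modules layout (module name as the first field) and modprobe.d syntax (# starts a comment); the function names and sample strings are made up for illustration.

```python
def module_is_loaded(name, proc_modules_text):
    """True if `name` is the first field of any line of /proc/modules text."""
    return any(line.split()[:1] == [name]
               for line in proc_modules_text.splitlines())

def is_blacklisted(name, conf_text):
    """True if modprobe.d-style text has an active 'blacklist <name>' line."""
    for line in conf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments
        if words[:2] == ["blacklist", name]:
            return True
    return False

# Made-up sample content, only to show the expected shape.
sample_modules = (
    "nouveau 943445 2 - Live 0xffffffffa0000000\n"
    "ttm 83948 1 nouveau, Live 0xffffffffa0100000\n"
)
sample_conf = "# keep the open driver out\nblacklist nouveau\n"

print("loaded:", module_is_loaded("nouveau", sample_modules))
print("blacklisted:", is_blacklisted("nouveau", sample_conf))
```

On the real machine the two inputs would be the content of /proc/modules and of /etc/modprobe.d/nouveau.conf. If the module still shows as loaded although the blacklist line is present, it is being loaded by some other route than automatic alias matching, which is all a blacklist line suppresses.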
OK thanks.
Using vesa instead of nouveau makes a difference: it does not crash, but often the screen is completely off instead (and I have to reboot it by doing commands blind, ctrl alt del or the power button).
(May be related: bug 1100978 : system hibernates to disk correctly when using vesa, but hangs instead of powering off when using nouveau)
This message is a reminder that Fedora 20 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 20. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '20'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 20 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.