Bug 1096989

Summary:

X crash or system crash on resume - user loses work

Product:

[Fedora] Fedora

Reporter:

Morgan Leijström <fri>

Component:

xorg-x11-drv-nouveau

Assignee:

Ben Skeggs <bskeggs>

Status:

CLOSED EOL

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

CC:

airlied, ajax, bskeggs, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab, ruyang

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2015-06-29 20:35:31 UTC

Type:

Bug

Attachments:

Description	Flags
Xorg.0.log.old - nvidia out of memory	none
lspci -v from R61	none
lsmod from R61 running nvidia	none
photo of crash screen saying: nouveau PBUS MMIO write fault	none
installed packages; output of rpm -qa	none
recent update history from yum	none
xsession-errors_vesa-display-froze-after-resume_system-is-alive	none
Xorg.0.log_when-vesa-display-froze-after-resume_system-is-alive	none
journalctl -k after fail of resuming the session after rebooting after last attachments.	none

Description Morgan Leijström 2014-05-12 22:03:02 UTC

Created attachment 894884 [details]
Xorg.0.log.old - nvidia out of memory

Description of problem:
The user loses the session on resume from suspend if more than a few programs were open.

  How reproducible: Always:
§ on my thinkpad R61 with fc20 and nouveau
§ on my thinkpad R61 with fc20 and akmod-nvidia
§ on same thinkpad R61 with mageia4 and nvidia
§ on same thinkpad R61 with mageia4 and nouveau
§ on my thinkpad T61 with mageia4 and nvidia
§ on my thinkpad T61 with mageia4 and nouveau


  Steps to Reproduce:

1. Fresh install of fc20 (or mageia 4), 64 bit, KDE4,
on a thinkpad T61 or R61 (nvidia quadro NVS140 GPU)

2. Load a lot of programs (with only a few loaded, resume works)

3. Suspend

4. Resume, and:

a) if it succeeds, start more programs and go to step 3

b) on fc20 with nvidia, or mga4 with either driver, what happens most often is that "only" the X session crashes: you see the nvidia splash, then get to the KDE login. Found in the xorg log:
(EE) NVIDIA(0): Failed to allocate primary buffer: out of memory.

c) on fc20 with nouveau: the system usually reboots immediately.
I have installed kdump and it works when I trigger it manually, but when the resume problem happens the system just reboots immediately anyway, and I get no crash log.
I have spotted this line in the logs; maybe it is why kdump fails, but what causes it?
[kdumpctl] cat: write error: Broken pipe

  The following may or may not be related:
d) it has also happened (seldom) that at login the system just shows the background and needs to be powered off manually.
e) fc20 has crashed a few times while "alive" (in use, not resuming), with the nouveau driver.
f) brightness control is broken when using nvidia (both mga4 and fc20)


There are also similar problems resuming from hibernation, but I will wait with testing that further until there is progress on resuming from suspend.


Here is the corresponding bug i put on mageia:
https://bugs.mageia.org/show_bug.cgi?id=12712

Currently we run fc20 on the R61, and mga4 on T61.

Comment 1 Morgan Leijström 2014-05-12 22:10:32 UTC

Created attachment 894896 [details]
lspci -v from R61

Comment 2 Josh Boyer 2014-05-12 23:36:09 UTC

We can't do anything about issues using the nvidia driver.  Perhaps the nouveau maintainers will have some insight into the issues with that one.

Comment 3 Morgan Leijström 2014-05-13 08:32:52 UTC

As the behaviour is so similar with both nouveau and nvidia, I guess it has the same cause.  If the problem in both cases could be fixed by some change in the kernel or some setting somewhere, that would be great.  If we find a problem in nouveau, that is also great, and I hope nvidia can fix the same issue in theirs.

I will switch the thinkpad R61 back to nouveau.  Tell me if you want me to test something or grab a log, whatever.

Comment 4 Morgan Leijström 2014-05-13 08:48:36 UTC

Created attachment 895052 [details]
lsmod from R61 running nvidia

Comment 5 Morgan Leijström 2014-05-20 09:21:14 UTC

Any idea how to track this down, or how to get a usable log from something?
kdump does not work when it actually happens, only when I test manually.

I get some messages in /var/log/messages, but do not know if they are related. It is also strange that two computers report a "hardware error", so I presume it is rather that the kernel or driver does not suit this hardware... or that many computers share a design error.

2014-05-20 10.09.45	bamse	mcelog	Error overflow
2014-05-20 10.09.45	bamse	mcelog	Uncorrected error
2014-05-20 10.09.45	bamse	mcelog	Error enabled
2014-05-20 10.09.45	bamse	mcelog	MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
2014-05-20 10.09.45	bamse	mcelog	BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
2014-05-20 10.09.45	bamse	mcelog	failure that caused IERR
2014-05-20 10.09.45	bamse	mcelog	Hardware event. This is not a software error.
2014-05-20 10.09.45	bamse	mcelog	Error overflow
2014-05-20 10.09.45	bamse	mcelog	Uncorrected error
2014-05-20 10.09.45	bamse	mcelog	Error enabled
2014-05-20 10.09.45	bamse	mcelog	MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
2014-05-20 10.09.45	bamse	mcelog	BQ_DCU_READ_TYPE BQ_ERR_AERR2_TYPE BQ_ERR_AERR2_TYPE
2014-05-20 10.09.45	bamse	mcelog	received parity error on response transaction
2014-05-20 10.09.45	bamse	mcelog	Hardware event. This is not a software error.
2014-05-20 10.09.45	bamse	rpcbind	Cannot open '/var/lib/rpcbind/rpcbind.xdr' file for reading, errno 2 (No such file or directory)
2014-05-20 10.09.45	bamse	rpcbind	Cannot open '/var/lib/rpcbind/portmap.xdr' file for reading, errno 2 (No such file or directory)
2014-05-20 10.09.45	bamse	mcelog	Error overflow
2014-05-20 10.09.45	bamse	mcelog	Uncorrected error
2014-05-20 10.09.45	bamse	mcelog	Error enabled
2014-05-20 10.09.45	bamse	mcelog	MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
2014-05-20 10.09.45	bamse	mcelog	BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
2014-05-20 10.09.45	bamse	mcelog	received parity error on response transaction

Comment 6 Morgan Leijström 2014-05-22 07:04:22 UTC

Created attachment 898239 [details]
photo of crash screen saying: nouveau PBUS MMIO write fault

Comment 7 Morgan Leijström 2014-05-22 07:07:07 UTC

Before comment #5 I did a full system reinstall (keeping home), so there are no traces of nvidia.  Only nouveau.

As before, kdump fails to work when this bug happens (which deserves a bug of its own).

However, now I sometimes get a text screen after reboot, and it stays up a minute or two before the system automatically reboots.
I attach a photo; slightly below the middle you see two lines like
nouveau E[   PBUS][0000:01:00.0]  MMIO write of 0x00000000 FAULT at 0x00fd94


Strangely, while I am writing this I see that a nouveau cache error has popped up in /var/log/messages. I do not know if it is related, and I perceive no problem:
May 22 08:58:52 bamse kernel: [ 2103.113955] nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201
May 22 08:58:52 bamse kernel: nouveau E[   PFIFO][0000:01:00.0] CACHE_ERROR - ch 4 [kwin[1589]] subc 0 mthd 0x0060 data 0xbeef0201
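
For anyone trying to reproduce this, a minimal sketch of how nouveau error lines like the ones above could be pulled out of a kernel log and counted per engine. The function names are hypothetical; the only thing taken from this report is the `nouveau E[ ENGINE]` prefix seen in the quoted lines.

```shell
# Print only nouveau error lines from a kernel log read on stdin.
# nouveau tags errors as "nouveau E[ ENGINE]", as in the lines quoted above.
nouveau_errors() {
    grep -E 'nouveau +E\['
}

# Count nouveau errors per engine (PBUS, PFIFO, ...), most frequent first.
nouveau_error_summary() {
    nouveau_errors |
        sed -E 's/.*nouveau +E\[ *([A-Za-z0-9_]+)\].*/\1/' |
        sort | uniq -c | sort -rn
}
```

Possible usage: `journalctl -k -b -1 | nouveau_error_summary` for the kernel messages of the previous boot (this assumes the journal is kept across reboots), or `nouveau_error_summary < /var/log/messages`.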

Comment 8 Morgan Leijström 2014-05-22 10:47:58 UTC

I also have Bug 1093959, which is possibly related: abrt reports that the kernel crashed.

Comment 9 Morgan Leijström 2014-05-22 11:20:05 UTC

I finally got kdump to grab a vmcore on a resume crash :)
Because of the size it is on dropbox:
https://dl.dropboxusercontent.com/u/35922960/Morgan/fc20/kdump/127.0.0.1-2014.05.22-12%3A32%3A01/vmcore
Note there is also a zero length file "vmcore-dmesg-incomplete.txt"

uname -a
Linux bamse 3.14.4-200.fc20.x86_64 #1 SMP Tue May 13 13:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Tell me if there is anything more i can log or check to help.

Comment 10 Morgan Leijström 2014-05-22 12:20:10 UTC

I found a tip for making suspend work for fc16 on a sister machine: http://www.thinkwiki.org/wiki/Installing_Fedora_8_on_a_T61#Suspend_to_RAM :
Create /etc/pm/config.d/unload_modules , containing:

SUSPEND_MODULES="ehci_hcd ohci_hcd"

But now I had a crash while using the computer, and kdump failed to get a vmcore :(  It may be unrelated, as we had one crash before during use (not while resuming).

I will switch from KDE to cinnamon for a while, just to try...

Comment 11 Morgan Leijström 2014-05-22 20:22:49 UTC

Tried other DEs without noticeable change:
 cinnamon
 cinnamon-software rendering
 mate

For some reason kdump does not catch anything, even though I can see it is active before the system crashes:

$ systemctl status kdump
kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; disabled)
   Active: active (exited) since tor 2014-05-22 22:10:20 CEST; 8min ago
  Process: 914 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 914 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump.service
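
An "active (exited)" service does not by itself prove that a capture kernel is armed. A minimal sketch of a more direct check, assuming the standard kernel interfaces `/sys/kernel/kexec_crash_loaded` and `/proc/cmdline`; the function name and its two path arguments are hypothetical and exist only so the check can be tried against sample files.

```shell
# Report whether a crash kernel is actually armed.
# Arguments default to the real kernel interfaces; they are parameters only
# so the check can be exercised against sample files.
kdump_armed() {
    loaded_file=${1:-/sys/kernel/kexec_crash_loaded}
    cmdline_file=${2:-/proc/cmdline}

    # crashkernel= must be on the kernel command line to reserve memory.
    if ! grep -q 'crashkernel=' "$cmdline_file"; then
        echo "no crashkernel= parameter on the kernel command line"
        return 1
    fi
    # kexec_crash_loaded reads 1 once the capture kernel is loaded.
    if [ "$(cat "$loaded_file")" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "crash kernel loaded"
}
```

Running `kdump_armed` right before suspending shows whether a dump could be taken at all. Separately, the `disabled` in the `Loaded:` line above means the unit is not enabled for automatic start; `systemctl is-enabled kdump` reports the same thing.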


I have done many tests, and it seems to crash even with not so many programs loaded; possibly it just crashes more easily when more programs are loaded before suspend-resume.  Running firefox with facebook seems to make it crash more easily, but not always.

Comment 12 Morgan Leijström 2014-05-24 22:36:06 UTC

Unfortunately we cannot use hibernate as an alternative, as that crashes too: Bug 1100978.  It does not seem related in any other way than that it is the same machine...

Comment 13 Morgan Leijström 2014-06-02 12:08:02 UTC

Thank you, whoever or whatever changed.
We have not had a resume crash in days, since some updates.
I have no idea which update fixed it though, and since it did not crash on every resume
it may also be that we have just been fortunate these last days...

I'll attach lists of installed packages and the yum history if anyone wants to dig.

I have tested both before and now on both thinkpad T61 and R61 (by swapping disks).

Another test I did was to install fedora 16 on another disk, and there both suspend and hibernate, and resuming from them, work perfectly after a default install without any setup, in both gnome and KDE.  I used a separate ext4 /boot, then an LVM with /, /home, swap.  No encryption.

So resuming from suspend was a regression (hopefully fixed now).
Hibernation is still a regression; I will update bug 1100978.

Comment 14 Morgan Leijström 2014-06-02 12:21:06 UTC

Created attachment 901429 [details]
installed packages; output of rpm -qa

Comment 15 Morgan Leijström 2014-06-02 12:22:19 UTC

Created attachment 901430 [details]
recent update history from yum

Comment 16 Morgan Leijström 2014-06-04 13:46:58 UTC

Unfortunately I celebrated too soon: out of about a dozen suspends in two days of casual use, I got two crashes on resume, so it is still a real problem.

I found messages on the net saying fedora 16 can suspend and hibernate, so I started installing and testing fc16, 17, 18, 19, 20. Here is the result:

__________TEST RESULTS

__fedora 16
suspend-resume: perfect
hibernate: perfect

__fedora 17, 18, 19 behave identically in this regard:
suspend-resume: perfect.
At suspend and hibernate I see a glimpse of a text screen with nouveau PBUS MMIO write fault lines,
- similar to fc20, so it seems that is not a problem.
(On fedora 17 I also saw something about an interrupt not being handled.)
hibernate: half broken: it saves, then the moon lamp starts blinking and
power remains on. << major problem but not critical
Needs manual power off. At power on it restores OK; work is saved.

HOWEVER (only tested on 20), when /home and swap are in an encrypted LVM,
it is completely broken: at power on it boots like normal,
- not using the saved state = Work lost! <<<<<<<<< Critical!

__________TEST SETUP

All are fresh install to SSD using the x86_64 DVD iso,
(selecting KDE + office, tools, applications)
partitioning: a separate /boot, then a LVM with /, /home, swap
reboot, full update, reboot. No tinkering.
Install gkrellm, firefox, flash.
Launch gkrellm, gimp, scribus, dolphin, konqueror,
firefox with some pages incl flash,
all calligra{author,flow,sheets,stage,words}
and libreoffice{calc,draw,impress,writer}
...Then try suspend and hibernation.

16 and 20 are more tested than the others: more suspend cycles, and also gnome and cinnamon.
On 20 i also tested other partitioning, including plain partitions (no LVM nor LUKS),
and /home and swap in encrypted LVM (this breaks restoring from hibernation)
On 20 kdump is installed, and once grabbed a dump, see bug.

Machines: tested on a Lenovo thinkpad T61 laptop; for 16 & 20 also on a thinkpad R61.
Both are dual core with nvidia quadro graphics, running the default nouveau driver.

Tests were done on a SSD drive reserved for testing purposes during this period.

Normally, other drives with real user production data are in these machines:
the T61 uses mageia 4 (same problem as fc20), and the R61 uses another SSD with fc20 (was mageia4).
On those systems there is an encrypted LVM containing /home and swap; same problems.
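
Since the crash does not happen on every resume, an automated loop could give a more comparable count per release than manual suspends. This is a minimal sketch and not what was used for the results above: it assumes `rtcwake` from util-linux is available and run as root, and the function name, the log path and the `SUSPEND_CMD` / `SETTLE_SECONDS` variables are hypothetical.

```shell
# Run N suspend/resume cycles and log each one, so that a crash shows up as a
# "suspending" entry without a matching "resumed" entry after the reboot.
# SUSPEND_CMD is an assumption of this sketch: by default it uses rtcwake
# (util-linux) to suspend to RAM and wake up after 30 seconds; needs root.
suspend_cycles() {
    cycles=${1:-10}
    log=${2:-$HOME/suspend-test.log}
    cmd=${SUSPEND_CMD:-rtcwake -m mem -s 30}
    i=1
    while [ "$i" -le "$cycles" ]; do
        echo "cycle $i: suspending at $(date '+%F %T')" >> "$log"
        sync    # flush the log to disk before the machine goes down
        if $cmd; then
            echo "cycle $i: resumed" >> "$log"
        else
            echo "cycle $i: suspend command failed" >> "$log"
            return 1
        fi
        # Give the desktop some time to settle before the next cycle.
        sleep "${SETTLE_SECONDS:-20}"
        i=$((i + 1))
    done
}
```

Possible usage, with the test programs already open: `suspend_cycles 12`. Comparing the number of "suspending" and "resumed" lines in the log afterwards gives the number of failed resumes.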

Comment 17 Morgan Leijström 2014-06-04 20:46:50 UTC

Created attachment 902334 [details]
xsession-errors_vesa-display-froze-after-resume_system-is-alive

I believe the graphics driver is not to blame: resume fails with nvidia, nouveau, and now also vesa, which I have just tested. vesa fails on *every* resume.

Using vesa, the system does not reboot, but the screen is frozen, showing the image of the login dialog (which is prepared during suspend). However, the mouse pointer moves; I guess it is "hardware accelerated".

I can see from the disk LED that the session reacts to keys (I tried Alt-F4).

Also, I could switch to a text console (Ctrl-Alt-Fx) and grab .xsession-errors and Xorg.0.log.  However, I did not find anything interesting there, nor in journalctl.
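
A minimal sketch of how such an Xorg log could be narrowed down to its warning and error lines from the text console. The function name is hypothetical; the `(EE)` and `(WW)` markers are the ones Xorg itself writes, as in the "(EE) NVIDIA(0)" line from the description.

```shell
# Print warning and error lines from an Xorg log.
# The second grep drops the legend line near the top of the log
# that explains the (WW)/(EE) markers.
xorg_problems() {
    grep -E '\((EE|WW)\)' "${1:-/var/log/Xorg.0.log}" |
        grep -v '(WW) warning, (EE) error'
}
```

Possible usage: `xorg_problems` for the current log, or `xorg_problems /var/log/Xorg.0.log.old` for the log of the previous X session.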

Comment 18 Morgan Leijström 2014-06-04 20:53:10 UTC

Created attachment 902336 [details]
Xorg.0.log_when-vesa-display-froze-after-resume_system-is-alive

So all three drivers tried fail on resume, each differently:
Nvidia: out of memory, reboot
nouveau: write fault, reboot
vesa: screen freezes and the system seems unaware that anything is wrong

I believe it is not the drivers' fault; I think the resume process is broken.

Who shall we cc on resume issues?

Comment 19 Morgan Leijström 2014-06-04 21:15:51 UTC

Created attachment 902346 [details]
journalctl -k after fail of resuming the session after rebooting after last attachments.

?1? nouveau still seems to be around, but not like before, neither in the xorg log nor in the result...

In order to remove nouveau I uninstalled the "nouveau" package and rebooted, and observed in the Xorg log that it could not find nouveau.  So I thought that was OK.

But now I observe that journalctl -k reports hundreds of errors from nouveau?!

Can someone point me how to remove nouveau cleanly?

(I am a beginner on fedora, mainly using mageia, which has GUI tools for this. But about the same problem, resume sometimes crashing, has been present on the same machines for a year or so.)

Comment 20 Morgan Leijström 2014-06-20 07:44:02 UTC

Finally i have a crash dump of this :)

(I think. It is not entirely clear how it crashed: I sat down at my wife's computer and realised it asked for the encryption key; once that was given, it said kdump saved a vmcore.)

Too large to attach, so it is here:
https://dl.dropboxusercontent.com/u/35922960/Morgan/fc20/kdump/127.0.0.1-2014.06.20-08%3A37%3A51/vmcore

kernel 3.14.8-200.fc20.x86_64

Comment 21 Dave Young 2014-06-20 08:49:21 UTC

> Can someone point me how to remove nouveau cleanly?

I am not sure what the real problem is. To remove nouveau you can add it to the blacklist:
add a file /etc/modprobe.d/nouveau.conf with the content "blacklist nouveau"
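
Some background on why nouveau messages can remain in `journalctl -k` after the X driver package is removed: the nouveau kernel module ships with the kernel package, so removing the Xorg driver does not stop the kernel from loading it. A minimal sketch for checking the blacklist and the loaded modules; both function names are hypothetical.

```shell
# Check whether a module is blacklisted in a modprobe.d directory.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+$module[[:space:]]*$" "$dir"/*.conf
}

# Check whether a module is currently loaded, reading lsmod-style text
# (module name in the first column) from stdin.
is_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

Possible usage after a reboot: `is_blacklisted nouveau && echo blacklisted` and `lsmod | is_loaded nouveau && echo "still loaded"`. If the module is still loaded, it may be coming from the initramfs; rebuilding it with `dracut --force`, or adding `rd.driver.blacklist=nouveau` to the kernel command line, are the usual ways to cover that case.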

Comment 22 Morgan Leijström 2014-07-01 18:56:50 UTC

OK thanks.
Using vesa instead of nouveau makes a difference: it does not crash, but often the screen is completely off instead (and I have to reboot blind, using ctrl alt del or the power button).

(May be related: bug 1100978: the system hibernates to disk correctly when using vesa, but hangs instead of powering off when using nouveau)

Comment 23 Fedora End Of Life 2015-05-29 11:49:45 UTC

This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 24 Fedora End Of Life 2015-06-29 20:35:31 UTC

Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.