Bug 607077

Summary: [mi] EQ overflowing. The server is probably stuck in an infinite loop
Product: Red Hat Enterprise Linux 6
Reporter: Aleksandar Mihajlov <aleksandar.mihajlov>
Component: xorg-x11-drv-nouveau
Assignee: Ben Skeggs <bskeggs>
Status: CLOSED WONTFIX
QA Contact: Desktop QE <desktop-qa-list>
Severity: high
Docs Contact:
Priority: low
Version: 6.0
CC: ajschult784, matti.aarnio, rockowitz, vengmd
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Flags: bskeggs: needinfo? (aleksandar.mihajlov)
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-06 11:29:48 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description                  Flags
X log                        none
/var/log/messages            none
output of dmesg              none
output of dmesg 29.06.       none
/var/log/messages 29.06.     none
xorg.conf                    none
Xorg log 29.06.              none
output of strace             none
output of top command        none
file descriptors of X        none

Description Aleksandar Mihajlov 2010-06-23 07:11:55 UTC
Created attachment 426178 [details]
X log

Description of problem:

After several days of normal work, X stops working. It doesn't respond to Ctrl+Alt+Backspace. The only way to recover the machine is to reboot.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 RHEL Product and Program Management 2010-06-23 07:23:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Matěj Cepl 2010-06-23 08:53:18 UTC
Backtrace:
0: /usr/bin/Xorg (xorg_backtrace+0x28) [0x46e898]
1: /usr/bin/Xorg (mieqEnqueue+0x1f4) [0x45ee24]
2: /usr/bin/Xorg (xf86PostMotionEventP+0xce) [0x4840be]
3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f78e9b11000+0x516f) [0x7f78e9b1616f]
4: /usr/bin/Xorg (0x400000+0x7b227) [0x47b227]
5: /usr/bin/Xorg (0x400000+0x10d163) [0x50d163]
6: /lib64/libpthread.so.0 (0x34f0000000+0xf0f0) [0x34f000f0f0]
7: /lib64/libc.so.6 (ioctl+0x7) [0x34ef4d69d7]
8: /usr/lib64/libdrm.so.2 (drmIoctl+0x23) [0x3500c03383]
9: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x3500c0360b]
10: /usr/lib64/libdrm_nouveau.so.1 (0x7f78ed1e4000+0x2f1d) [0x7f78ed1e6f1d]
11: /usr/lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfc) [0x7f78ed1e711c]
12: /usr/lib64/libdrm_nouveau.so.1 (0x7f78ed1e4000+0x2106) [0x7f78ed1e6106]
13: /usr/lib64/libdrm_nouveau.so.1 (nouveau_pushbuf_flush+0x29c) [0x7f78ed1e649c]
14: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7f78ed408000+0x3abdd) [0x7f78ed442bdd]
15: /usr/lib64/xorg/modules/libexa.so (0x7f78eafa5000+0xd196) [0x7f78eafb2196]
16: /usr/lib64/xorg/modules/libexa.so (0x7f78eafa5000+0xe072) [0x7f78eafb3072]
17: /usr/bin/Xorg (0x400000+0xcadc0) [0x4cadc0]
18: /usr/bin/Xorg (0x400000+0xc12de) [0x4c12de]
19: /usr/bin/Xorg (0x400000+0x421fc) [0x4421fc]
20: /usr/bin/Xorg (0x400000+0x21d8a) [0x421d8a]
21: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x34ef41eb1d]
22: /usr/bin/Xorg (0x400000+0x21949) [0x421949]

Comment 4 Matěj Cepl 2010-06-23 12:33:12 UTC
Thanks for the bug report.  We have reviewed the information you have provided above, and there is some additional information we require that will be helpful in our diagnosis of this issue.

Please add drm.debug=0x04 to the kernel command line, restart the computer, wait until Xorg crashes, switch to a console (Ctrl-Alt-F2), then collect and attach

* your X server config file (/etc/X11/xorg.conf, if available),
* output of the dmesg command, and
* system log (/var/log/messages)

to the bug report as individual uncompressed file attachments using the bugzilla file attachment link above.
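
Before waiting for the next hang, it may be worth confirming that the parameter actually reached the kernel. The following is a minimal Python sketch, assuming the running kernel's command line is readable from /proc/cmdline; the helper names kernel_param and read_cmdline are hypothetical and not part of any tool mentioned in this report.

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of name=value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val  # in this sketch the last occurrence wins
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Example with a made-up command line; on a live system use
# kernel_param(read_cmdline(), "drm.debug") instead.
example = "ro root=/dev/sda1 drm.debug=0x04 quiet"
print(kernel_param(example, "drm.debug"))
```

If the function returns None after the reboot, the boot loader entry was most likely not the one that was edited.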

We will review this issue again once you've had a chance to attach this information.

Thanks in advance.

Comment 5 Aleksandar Mihajlov 2010-06-28 06:46:46 UTC
Created attachment 427316 [details]
/var/log/messages

Comment 6 Aleksandar Mihajlov 2010-06-28 06:49:29 UTC
Created attachment 427317 [details]
output of dmesg

Comment 7 Aleksandar Mihajlov 2010-06-28 06:53:34 UTC
It happened again, but this time I couldn't switch to a console with Ctrl+Alt+F2 and had to reboot the machine. Maybe it isn't just an X problem?
I could ping the machine, but I couldn't access it with ssh.

The first messages after the reboot start at Jun 28 10:23.

I also attached the output of dmesg, but it was captured after the reboot.

I don't know if this is useful, but it is all I have.

Comment 8 Aleksandar Mihajlov 2010-06-29 06:06:45 UTC
Created attachment 427576 [details]
output of dmesg 29.06.

Comment 9 Aleksandar Mihajlov 2010-06-29 06:09:11 UTC
Created attachment 427577 [details]
/var/log/messages 29.06.

Comment 10 Aleksandar Mihajlov 2010-06-29 06:09:45 UTC
Created attachment 427578 [details]
xorg.conf

Comment 11 Aleksandar Mihajlov 2010-06-29 06:10:27 UTC
Created attachment 427579 [details]
Xorg log 29.06.

Comment 12 Aleksandar Mihajlov 2010-06-29 06:10:51 UTC
Created attachment 427580 [details]
output of strace

Comment 13 Aleksandar Mihajlov 2010-06-29 06:11:40 UTC
Created attachment 427581 [details]
output of top command

Comment 14 Aleksandar Mihajlov 2010-06-29 06:12:37 UTC
Created attachment 427582 [details]
file descriptors of X

Comment 15 Aleksandar Mihajlov 2010-06-29 06:20:21 UTC
Ok, this time I have more useful data.

I could access the machine even though X was frozen, so I collected more data.

You can find:

output of dmesg
/var/log/messages
xorg.conf
Xorg.0.log
output of strace (strace -p <PID of Xorg>)
output of top command (X is taking from 95% to 100% of CPU)

list of Xorg file descriptors (ls -l /proc/<PID>/fd)
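
The descriptor listing can also be gathered programmatically, which makes it easy to look up a single descriptor number seen in strace. This is a minimal Python sketch of what `ls -l /proc/<PID>/fd` shows, assuming a Linux /proc filesystem and permission to read the target process's entries (usually root for Xorg); the function name list_fds and its proc_root parameter are hypothetical.

```python
import os


def list_fds(pid: int, proc_root: str = "/proc") -> dict:
    """Map each open file descriptor number of pid to its link target."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

For example, `list_fds(xorg_pid).get(11)` would return the path behind descriptor 11, where xorg_pid stands for the PID of the Xorg process.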

As I can see from strace, Xorg is stuck in:
.....
ioctl(11, 0x40086485, 0x7fff94dcac70)   = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe)                       = -1 EINTR (Interrupted system call)
ioctl(11, 0x40086485, 0x7fff94dcac70)   = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe)                       = -1 EINTR (Interrupted system call)
ioctl(11, 0x40086485, 0x7fff94dcac70)   = ? ERESTARTSYS (To be restarted)
......

where file descriptor 11 is:

/dev/dri/card0
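
The request number in the strace output can be split into its fields to see what kind of call keeps being restarted. This is a minimal Python sketch, assuming the generic Linux ioctl number layout used on x86 (bits 0-7 command number, 8-15 type, 16-29 argument size, 30-31 direction) and the DRM convention that driver-private commands start at 0x40; the function name decode_ioctl is hypothetical.

```python
def decode_ioctl(request: int) -> dict:
    """Split a Linux ioctl request number into its _IOC fields."""
    directions = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "nr": request & 0xFF,               # command number
        "type": chr((request >> 8) & 0xFF),  # subsystem "magic" character
        "size": (request >> 16) & 0x3FFF,    # argument size in bytes
        "dir": directions[(request >> 30) & 0x3],
    }


fields = decode_ioctl(0x40086485)
# DRM driver-private commands are numbered from DRM_COMMAND_BASE (0x40).
driver_cmd = fields["nr"] - 0x40
print(fields, hex(driver_cmd))
```

For this request the type is 'd' (the DRM ioctl family), the argument is 8 bytes passed from user space to the kernel, and the driver-private command number is 0x45. That is consistent with the drmCommandWrite frame in the backtrace from Comment 3. Which nouveau operation 0x45 names depends on the nouveau ABI version of the running kernel and is not decoded here.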


I hope this is more useful than the previous logs.

Comment 16 RHEL Product and Program Management 2010-07-15 14:56:01 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 18 RHEL Product and Program Management 2011-01-07 04:33:35 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 19 Suzanne Yeghiayan 2011-01-07 16:04:26 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 20 RHEL Product and Program Management 2011-02-01 06:04:42 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 21 RHEL Product and Program Management 2011-02-01 18:25:48 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 22 RHEL Product and Program Management 2011-04-04 02:33:03 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 23 Ben Skeggs 2011-08-24 04:54:04 UTC
Did you still see this issue in 6.1?

Comment 24 RHEL Product and Program Management 2011-10-07 16:15:48 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 25 matti aarnio 2014-03-19 21:48:34 UTC
I found a way to easily reproduce this error on Fedora 20.
I always boot to init state 3 (text console), log in, and then start X explicitly with the command:

   $ startx > x.log 2>&1 &

This enables me to show you the log extract below.

xorg-x11-drv-nouveau-1.0.9-2.fc20.x86_64
xorg-x11-server-Xorg-1.14.4-7.fc20.x86_64

The way I trigger it is simple:

 0) Have a PC with "GeForce GTX 550 Ti" video card, 16 GB RAM, 4+ x86-64 cores.
 1) Have a text terminal open in X   (desktop suite does not matter)
 2) Go to a directory with around 30 .doc files
 3) Run command:  ooffice *.doc
 4) Wait about a minute (15-20 docs to open) and X server should crash.
    Screen goes black and non-standard mouse cursor appears.

The keyboard drops off USB; the mouse (also USB) still works.
I didn't test whether unplugging and re-plugging the keyboard recovers it.

Log in to the machine over the network from another machine and run "init 6" to reboot it. Otherwise the x.log file may be incomplete: just pressing the RESET button may not leave the X server's error output fully written to disk.

The important thing is to have ooffice open many documents in quick succession; that seems to provoke some sort of timing-dependent deadlock.


A representative sample of 'file' output on these documents that have 3-5 pages of text, no pictures:

B-CR-TS102204-14-Errata v1.doc: Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.0, Code page: 1252, Title: CR template v1.5.0, Author: xxxxx, MCC, Keywords: CR, template, Template: 3gpp_70.dot, Last Saved By: xxxxx, Revision Number: 7, Name of Creating Application: Microsoft Word 8.0, Total Editing Time: 09:00, Last Printed: Fri Feb 13 14:58:00 2004, Create Time/Date: Wed Nov 24 13:34:00 2004, Last Saved Time/Date: Fri Dec  3 00:09:00 2004, Number of Pages: 1, Number of Words: 665, Number of Characters: 3791, Security: 0

---------------------------------
(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/X (?+0x33) [0x583373]
(EE) 1: /usr/bin/X (?+0x33) [0x451453]
(EE) 2: /usr/lib64/xorg/modules/input/evdev_drv.so (_init+0x2d1d) [0x7f2212d07c8d]
(EE) 3: /usr/bin/X (?+0x2d1d) [0x48dfed]
(EE) 4: /usr/bin/X (?+0x2d1d) [0x4b73bd]
(EE) 5: /lib64/libpthread.so.0 (__restore_rt+0x0) [0x381fa0f74f]
(EE) 6: /lib64/libc.so.6 (ioctl+0x7) [0x381eeec067]
(EE) 7: /lib64/libdrm.so.2 (drmIoctl+0x34) [0x3825a036e4]
(EE) 8: /lib64/libdrm.so.2 (drmCommandWrite+0x1e) [0x3825a05fce]
(EE) 9: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7f2215b196e9]
(EE) 10: /lib64/libdrm_nouveau.so.2 (nouveau_pushbuf_space+0xd1) [0x7f2215b1a9e1]
(EE) 11: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (_init+0x1edf9) [0x7f2215d60e29]
(EE) 12: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x91e) [0x7f22156e84fe]
(EE) 13: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x120c) [0x7f22156e9c3c]
(EE) 14: /usr/lib64/xorg/modules/libexa.so (exaMoveOutPixmap+0x7fa5) [0x7f22156ee2b5]
(EE) 15: /usr/bin/X (?+0x7fa5) [0x533875]
(EE) 16: /usr/bin/X (?+0x7fa5) [0x52ce35]
(EE) 17: /usr/bin/X (?+0x7fa5) [0x441f25]
(EE) 18: /usr/bin/X (?+0x7fa5) [0x4304a5]
(EE) 19: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x381ee21d65]
(EE) 20: /usr/bin/X (?+0xf5) [0x428d01]
(EE) 21: ? (?+0xf5) [0xf5]
(EE) 
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
(EE) [mi] EQ overflow continuing.  100 events have been dropped.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/X (?+0x1) [0x451421]
(EE) 1: /usr/lib64/xorg/modules/input/evdev_drv.so (_init+0x2d1d) [0x7f2212d07c8d]
(EE) 2: /usr/bin/X (?+0x2d1d) [0x48dfed]
(EE) 3: /usr/bin/X (?+0x2d1d) [0x4b73bd]
(EE) 4: /lib64/libpthread.so.0 (__restore_rt+0x0) [0x381fa0f74f]
(EE) 5: /lib64/libc.so.6 (ioctl+0x7) [0x381eeec067]
(EE) 6: /lib64/libdrm.so.2 (drmIoctl+0x34) [0x3825a036e4]
(EE) 7: /lib64/libdrm.so.2 (drmCommandWrite+0x1e) [0x3825a05fce]
(EE) 8: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7f2215b196e9]
(EE) 9: /lib64/libdrm_nouveau.so.2 (nouveau_pushbuf_space+0xd1) [0x7f2215b1a9e1]
(EE) 10: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (_init+0x1edf9) [0x7f2215d60e29]
(EE) 11: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x91e) [0x7f22156e84fe]
(EE) 12: /usr/lib64/xorg/modules/libexa.so (exaEnableDisableFBAccess+0x120c) [0x7f22156e9c3c]
(EE) 13: /usr/lib64/xorg/modules/libexa.so (exaMoveOutPixmap+0x7fa5) [0x7f22156ee2b5]
(EE) 14: /usr/bin/X (?+0x7fa5) [0x533875]
(EE) 15: /usr/bin/X (?+0x7fa5) [0x52ce35]
(EE) 16: /usr/bin/X (?+0x7fa5) [0x441f25]
(EE) 17: /usr/bin/X (?+0x7fa5) [0x4304a5]
(EE) 18: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x381ee21d65]
(EE) 19: /usr/bin/X (?+0xf5) [0x428d01]
(EE) 20: ? (?+0xf5) [0xf5]
(EE) 
(EE) [mi] EQ overflow continuing.  200 events have been dropped.
---------------------------------
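
As the log itself says, mieqEnqueue is only the victim; the interesting frames are the ones below the signal handler, which show what the server was doing when the input event arrived. This is a minimal Python sketch that pulls those frames out of one backtrace, assuming the "(EE) N: module (symbol+offset) [address]" line format shown above and assuming the libpthread frame marks the signal trampoline; the function name interrupted_frames is hypothetical.

```python
import re

# One backtrace line: "(EE) 9: /path/lib.so (symbol+0x99) [0x7f22...]"
FRAME = re.compile(r"^\(EE\)\s+(\d+):\s+(\S+)\s+\((.*?)\)\s+\[(0x[0-9a-fA-F]+)\]")


def interrupted_frames(log_lines):
    """Return (module, symbol) pairs found after the libpthread frame.

    Expects the lines of a single backtrace; other lines are ignored.
    """
    frames = []
    seen_trampoline = False
    for line in log_lines:
        m = FRAME.match(line.strip())
        if not m:
            continue
        module, symbol = m.group(2), m.group(3)
        if seen_trampoline:
            frames.append((module, symbol))
        elif "libpthread" in module:
            seen_trampoline = True
    return frames


sample = """\
(EE) 4: /usr/bin/X (?+0x2d1d) [0x4b73bd]
(EE) 5: /lib64/libpthread.so.0 (__restore_rt+0x0) [0x381fa0f74f]
(EE) 6: /lib64/libc.so.6 (ioctl+0x7) [0x381eeec067]
(EE) 7: /lib64/libdrm.so.2 (drmIoctl+0x34) [0x3825a036e4]
(EE) 8: /lib64/libdrm.so.2 (drmCommandWrite+0x1e) [0x3825a05fce]
(EE) 9: /lib64/libdrm_nouveau.so.2 (nouveau_bo_wait+0x99) [0x7f2215b196e9]
""".splitlines()

for module, symbol in interrupted_frames(sample):
    print(module, symbol)
```

On the excerpt above this lists the ioctl, drmIoctl, drmCommandWrite and nouveau_bo_wait frames, i.e. the server was inside a DRM ioctl issued by libdrm_nouveau when the event queue overflowed.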

Comment 26 matti aarnio 2014-03-19 22:14:06 UTC
dmesg showed the following (typical extracts only, not the thousands of repeats of the same pairs). Out of some 16 000 log lines, the majority are repeats like these four:


kernel: [467685.150774] nouveau E[  PGRAPH][0000:02:00.0] TRAP ch 2 [0x023fc00000 X[2079]]

kernel: [467685.150782] nouveau E[  PGRAPH][0000:02:00.0] SHADER 0xa0040a0e

kernel: [467685.150799] nouveau E[  PGRAPH][0000:02:00.0] TRAP ch 2 [0x023fc00000 X[2079]]

kernel: [467685.150803] nouveau E[  PGRAPH][0000:02:00.0] SHADER 0xa0040a0e


Removing those, the remainder of kernel messages is:

kernel: [467685.155755] nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC0/MP trap: INVALID_OPCODE

kernel: [467685.155763] nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC2/MP trap: INVALID_OPCODE

kernel: [467685.155769] nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC3/MP trap: INVALID_OPCODE

kernel: [467685.155793] nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC1/MP trap: INVALID_OPCODE

kernel: nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC0/MP trap: INVALID_OPCODE

kernel: nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC2/MP trap: INVALID_OPCODE

kernel: nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC3/MP trap: INVALID_OPCODE

kernel: nouveau E[  PGRAPH][0000:02:00.0] GPC0/TPC1/MP trap: INVALID_OPCODE

kernel: [467685.938554] nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0030b40000 [INVALID_STORAGE_TYPE] from PGRAPH/GPC0/(unknown enum 0x00000007) on channel 0x023fc00000 [X[2079]]

kernel: nouveau E[   PFIFO][0000:02:00.0] read fault at 0x0030b40000 [INVALID_STORAGE_TYPE] from PGRAPH/GPC0/(unknown enum 0x00000007) on channel 0x023fc00000 [X[2079]]
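
Reducing some 16 000 lines to the distinct messages can be done by dropping the kernel timestamp and counting what remains. This is a minimal Python sketch, assuming each line carries at most one "[seconds.microseconds]" stamp as in the extracts above; the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Kernel timestamp such as "[467685.150774] "
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]\s*")


def summarize(lines):
    """Count kernel log messages after removing the timestamp."""
    counts = Counter()
    for line in lines:
        text = TIMESTAMP.sub("", line.strip(), count=1)
        if text:
            counts[text] += 1
    return counts


sample = [
    "kernel: [467685.150774] nouveau E[  PGRAPH][0000:02:00.0] TRAP ch 2 [0x023fc00000 X[2079]]",
    "kernel: [467685.150782] nouveau E[  PGRAPH][0000:02:00.0] SHADER 0xa0040a0e",
    "kernel: [467685.150799] nouveau E[  PGRAPH][0000:02:00.0] TRAP ch 2 [0x023fc00000 X[2079]]",
    "kernel: [467685.150803] nouveau E[  PGRAPH][0000:02:00.0] SHADER 0xa0040a0e",
]

for message, count in summarize(sample).most_common():
    print(count, message)
```

Messages logged both with and without a timestamp, as some are above, collapse into the same entry.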

Comment 27 Jan Kurik 2017-12-06 11:29:48 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/