Bug 1471986 - Hard lock-up on Wayland GNOME login using nouveau (GTX 680MX)
Status: NEW
Product: Fedora
Classification: Fedora
Component: wayland
Version: 26
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Adam Jackson
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-07-17 17:03 EDT by T.J. Rowe
Modified: 2017-08-10 14:04 EDT (History)
Type: Bug
Description T.J. Rowe 2017-07-17 17:03:57 EDT
Description of problem:

After upgrading to Fedora 26 on a 2012 27" iMac with the GTX 680MX GPU (default nouveau driver), logging in to GNOME under Wayland causes the screen to fill with artifacts and become unresponsive.

Note that I can still SSH in and reboot, but I cannot recover the console.

Falling back to GNOME under Xorg works just fine.  Wayland was working okay under Fedora 25.  There are no special GNOME extensions enabled to my knowledge, and the desktop is pretty standard (this is just a file server where the desktop is seldom used).


Version-Release number of selected component (if applicable):

Fedora 26

How reproducible:

Completely reproducible as of the date of this filing.

Steps to Reproduce:
1.  Boot to GDM on an iMac 27" 2012 with a GTX 680MX
2.  Use the default GNOME (with Wayland) session
3.  Screen hard-locks with artifacts

Actual results:

Hard-lock on console with artifacts on the screen.

Expected results:

Normal GNOME session as with Fedora 25.

Additional info:

Here is a log extract (note that the log fills up extremely fast with these messages):

Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: Xwayland[1901]: channel 19 killed!
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: read fault at 0000011000 engine 07 [HOST0] client 06 [HOST] reason 0c [UNSUPPORTED_KIND] on channel 25 [007e198000 Xwayland[1901]]
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: channel 25: killed
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: Xwayland[1901]: channel 25 killed!
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: read fault at 0000011000 engine 07 [HOST0] client 06 [HOST] reason 0c [UNSUPPORTED_KIND] on channel 27 [007df9a000 Xwayland[1901]]
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: channel 27: killed
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: Xwayland[1901]: channel 27 killed!
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80000000 [SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 7 mthd 3ffc data ffffffff
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80000000 [SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 7 mthd 3ffc data ffffffff
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80046000 [GPFIFO GPPTR PBENTRY SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 7 mthd 3ffc data ffffffff
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80046000 [GPFIFO GPPTR PBENTRY SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 7 mthd 3ffc data ffffffff
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80046000 [GPFIFO GPPTR PBENTRY SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000
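
Because the log fills up so quickly, a tally by message type can make an extract like the one above easier to compare between reports. The following is a minimal sketch, assuming the message shapes seen in this bug; the category names (`read_fault`, `channel_killed`, `pbdma`) and the `summarize` helper are invented for illustration and are not kernel terminology.

```python
import re
from collections import Counter

# Patterns for the nouveau fifo messages quoted in this report.
# The category names are labels made up for this sketch.
PATTERNS = {
    "read_fault": re.compile(r"fifo: read fault at [0-9a-f]+ .*reason [0-9a-f]+ \[(\w+)\]"),
    "channel_killed": re.compile(r"fifo: channel (\d+): killed"),
    "pbdma": re.compile(r"fifo: PBDMA\d+: ([0-9a-f]{8}) \["),
}


def summarize(lines):
    """Count nouveau fifo messages, keyed by (category, detail)."""
    counts = Counter()
    for line in lines:
        if "nouveau" not in line:
            continue
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # detail is the fault reason, channel id, or status word
                counts[(name, match.group(1))] += 1
                break
    return counts


# A few lines copied from the logs in this report.
SAMPLE = [
    "Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: read fault at 0000011000 engine 07 [HOST0] client 06 [HOST] reason 0c [UNSUPPORTED_KIND] on channel 25 [007e198000 Xwayland[1901]]",
    "[ 2464.655507] nouveau 0000:01:00.0: fifo: channel 24: killed",
    "Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80046000 [GPFIFO GPPTR PBENTRY SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 7 mthd 3ffc data ffffffff",
    "Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000",
    "Jul 13 17:34:46 hostname kernel: nouveau 0000:01:00.0: fifo: PBDMA0: 80006000 [GPFIFO GPPTR SIGNATURE] ch 28 [007dbd3000 Xwayland[1901]] subc 0 mthd 0000 data 00000000",
]

if __name__ == "__main__":
    for (category, detail), n in summarize(SAMPLE).most_common():
        print(n, category, detail)
```

Feeding it the output of `journalctl -k` (or a saved log file, one line per list element) should show at a glance whether the flood is dominated by a single repeated PBDMA status word.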
Comment 1 Charles Haithcock 2017-07-22 00:55:47 EDT
Hey all, 

I am hitting this too. Falling back to GNOME on X works fine, but the system locks up under Wayland. I did some vmcore analysis, but I am extremely unfamiliar with the nvidia/nouveau side of the code, so the following may or may not be useful. As best I can tell from my limited knowledge of how the driver talks to the card, Xwayland is attempting a method on an object bound to a specific subchannel, and the error is reported from an interrupt handler. My kernel ring buffer is inundated with these messages, and the values stop changing after the first few, which implies the same method is being reported over and over. However, when switching from X to Wayland by logging out and back in, I also see a couple of errors about read faults, channels being killed, and disp reporting an unknown condition. This leads me to believe a vmcore will not suffice: it shows the aftermath of something gone awry rather than catching it in the act.

I can confirm the card worked with multiple monitors before the update from Fedora 25 to 26; the problem appeared only afterwards. Unfortunately, no other installed kernel works with Wayland now, as they all exhibit this issue. The nvidia modules are not, and have never been, installed on any kernel.

I am up for troubleshooting if someone can give guidance. I've seen 'nomodeset' be thrown around on the internet as well as a few other options, but if something specific is useful here, I can certainly try. I've also thought about throwing F27 beta on to see if the issue is reproduced. I am not super keen on this method however, but am open to installing a few beta packages that may be useful to try. I can provide my vmcore as well if needed. I have no info that needs scrubbing from it. 




Below is my analysis. 



          KERNEL: /usr/lib/debug/lib/modules/4.11.10-300.fc26.x86_64/vmlinux
        DUMPFILE: /var/crash/127.0.0.1-2017-07-21-20:15:51/vmcore  [PARTIAL DUMP]
            CPUS: 4
            DATE: Fri Jul 21 20:15:15 2017
          UPTIME: 00:41:05
    LOAD AVERAGE: 0.63, 0.27, 0.21
           TASKS: 726
        NODENAME: eden
         RELEASE: 4.11.10-300.fc26.x86_64
         VERSION: #1 SMP Wed Jul 12 17:05:39 UTC 2017
         MACHINE: x86_64  (3297 Mhz)
          MEMORY: 16 GB
           PANIC: "sysrq: SysRq : Trigger a crash"
             PID: 1105
         COMMAND: "gnome-shell"
            TASK: ffff9e9e5d6c0000  [THREAD_INFO: ffff9e9e5d6c0000]
             CPU: 0
           STATE: TASK_RUNNING (SYSRQ)

    crash> sys -i | grep -i -e bios -e board
            DMI_BIOS_VENDOR: American Megatrends Inc.
           DMI_BIOS_VERSION: 2205
              DMI_BIOS_DATE: 02/12/2015
           DMI_BOARD_VENDOR: ASUSTeK COMPUTER INC.
             DMI_BOARD_NAME: Z97-A
          DMI_BOARD_VERSION: Rev 1.xx

    crash> mod -t
    no tainted modules

    > lspci | grep VGA
    01:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)


[I.0] Hardware overview: 
 - Physical with Z97-A motherboard
 - NVIDIA GeForce GTX 960
 - System was up for a minute or so before the crash. I was jumping around to ensure the system would crash
   before initiating the crash


    crash> log 
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    0.258760] pci 0000:01:00.0: vgaarb: setting as boot VGA device
    [    0.258912] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
    [    0.259171] pci 0000:01:00.0: vgaarb: bridge control possible
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    0.276778] system 00:01: [io  0x0800-0x087f] has been reserved
    [    0.276931] system 00:01: Plug and Play ACPI device, IDs PNP0c02 (active)
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    0.287193] pci_bus 0000:01: resource 0 [io  0xe000-0xefff]
    [    0.287194] pci_bus 0000:01: resource 1 [mem 0xde000000-0xdf0fffff]
    [    0.287195] pci_bus 0000:01: resource 2 [mem 0xc0000000-0xd1ffffff 64bit pref]
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    1.324484] nouveau 0000:01:00.0: bios: version 84.06.32.00.27
    [    1.324803] nouveau 0000:01:00.0: disp: dcb 15 type 8 unknown
    [    1.325491] nouveau 0000:01:00.0: fb: 2048 MiB GDDR5
    [    1.325662] nouveau 0000:01:00.0: bus: MMIO write of 800000f0 FAULT at 10eb14 [ IBUS ]
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    1.333635] nouveau 0000:01:00.0: DRM: VRAM: 2048 MiB
    [    1.333838] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
    [    1.333997] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
    [    1.334148] nouveau 0000:01:00.0: DRM: DCB version 4.1
    [    1.334308] nouveau 0000:01:00.0: DRM: DCB outp 00: 01000f02 00020030
    [    1.334460] nouveau 0000:01:00.0: DRM: DCB outp 01: 02000f00 00000000
    [    1.334611] nouveau 0000:01:00.0: DRM: DCB outp 02: 02811f76 04400020
    [    1.334767] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011f72 00020020
    [    1.334926] nouveau 0000:01:00.0: DRM: DCB outp 04: 04822f86 04400010
    [    1.335084] nouveau 0000:01:00.0: DRM: DCB outp 05: 04022f82 00020010
    [    1.335242] nouveau 0000:01:00.0: DRM: DCB outp 06: 04833f96 04400020
    [    1.335401] nouveau 0000:01:00.0: DRM: DCB outp 07: 04033f92 00020020
    [    1.338050] nouveau 0000:01:00.0: DRM: DCB outp 08: 02044f62 00020010
    [    1.338208] nouveau 0000:01:00.0: DRM: DCB outp 15: 01df5ff8 00000000
    [    1.338360] nouveau 0000:01:00.0: DRM: DCB conn 00: 00001030
    [    1.338511] nouveau 0000:01:00.0: DRM: DCB conn 01: 00020146
    [    1.338667] nouveau 0000:01:00.0: DRM: DCB conn 02: 01000246
    [    1.338824] nouveau 0000:01:00.0: DRM: DCB conn 03: 02000346
    [    1.338982] nouveau 0000:01:00.0: DRM: DCB conn 04: 00010461
    [    1.339140] nouveau 0000:01:00.0: DRM: DCB conn 05: 00000570
    [    1.339298] nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
    [    1.407607] nouveau 0000:01:00.0: DRM: unknown connector type 70
    [    1.407795] nouveau 0000:01:00.0: DRM: failed to create encoder 1/8/0: -19
    [    1.407958] nouveau 0000:01:00.0: DRM: Unknown-1 has no encoders, removing
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    1.578921] nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
    [    1.648874] nouveau 0000:01:00.0: priv: GPC0: 419df4 00000000 (1840820e)
    [    1.649041] nouveau 0000:01:00.0: priv: GPC1: 419df4 00000000 (1840820e)
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    1.757805] nouveau 0000:01:00.0: DRM: allocated 2560x1440 fb: 0x60000, bo ffff9e9e66d70000
    [    1.858899] nouveau 0000:01:00.0: disp: 0x5c73[0]: INIT_GENERIC_CONDITON: unknown 0x07
    [    2.028270] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    3.159573] ACPI Warning: SystemIO range 0x0000000000001828-0x000000000000182F conflicts with OpRegion 0x0000000000001800-0x000000000000187F (\PMIO) (20170119/utaddress-247)
    [    3.159587] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
    [    3.159593] ACPI Warning: SystemIO range 0x0000000000001C40-0x0000000000001C4F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20170119/utaddress-247)
    [    3.159599] ACPI Warning: SystemIO range 0x0000000000001C40-0x0000000000001C4F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20170119/utaddress-247)
    [    3.159608] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
    [    3.159613] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C3F (\GPRL) (20170119/utaddress-247)
    [    3.159619] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20170119/utaddress-247)
    [    3.159624] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20170119/utaddress-247)
    [    3.159630] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
    [    3.159633] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C3F (\GPRL) (20170119/utaddress-247)
    [    3.159639] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20170119/utaddress-247)
    [    3.159644] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20170119/utaddress-247)
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [    8.662027] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
    [    8.662067] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
    [   21.024303] systemd-journald[572]: File /var/log/journal/fe0cee7971a44dbb824922b171e63813/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
    [   25.373328] rfkill: input handler disabled
    [   26.635583] ISO 9660 Extensions: Microsoft Joliet Level 3
    [   26.641173] ISOFS: changing to secondary root
    [ 2449.110087] rfkill: input handler enabled
    [ 2449.398957] nouveau 0000:01:00.0: disp: 0x5c73[0]: INIT_GENERIC_CONDITON: unknown 0x07
    [ 2449.448983] nouveau 0000:01:00.0: disp: 0x5c73[0]: INIT_GENERIC_CONDITON: unknown 0x07
    [ 2464.344588] rfkill: input handler disabled
    [ 2464.655497] nouveau 0000:01:00.0: fifo: read fault at 0000011000 engine 07 [HOST0] client 06 [HOST] reason 0c [UNSUPPORTED_KIND] on channel 24 [007c994000 Xwayland[15818]]
    [ 2464.655507] nouveau 0000:01:00.0: fifo: channel 24: killed
    [ 2464.655509] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
    [ 2464.655515] nouveau 0000:01:00.0: Xwayland[15818]: channel 24 killed!
    [ 2464.688477] nouveau 0000:01:00.0: fifo: read fault at 0000011000 engine 07 [HOST0] client 06 [HOST] reason 02 [PTE] on channel 25 [007c2ef000 Xwayland[15818]]
    [ 2464.688487] nouveau 0000:01:00.0: fifo: channel 25: killed
    [ 2464.688490] nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
    [ 2464.725441] nouveau 0000:01:00.0: Xwayland[15818]: channel 25 killed!
    [ 2464.725476] nouveau 0000:01:00.0: fifo: read fault at 5555555000 engine 1f [] client 07 [HOST_CPU] reason 0d [REGION_VIOLATION] on channel -1 [0000000000 unknown]
    [ 2464.725486] nouveau 0000:01:00.0: fifo: PBDMA0: 80044000 [GPPTR PBENTRY SIGNATURE] ch 26 [007c2de000 Xwayland[15818]] subc 5 mthd 1554 data 55555555
    [ 2464.725506] nouveau 0000:01:00.0: fifo: PBDMA0: 00044000 [GPPTR PBENTRY] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725524] nouveau 0000:01:00.0: fifo: PBDMA0: 00044000 [GPPTR PBENTRY] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725543] nouveau 0000:01:00.0: fifo: PBDMA0: 00044000 [GPPTR PBENTRY] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725562] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725582] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725601] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2464.725620] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
    [ 2465.317314] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2465.317326] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2465.317339] nouveau 0000:01:00.0: fifo: PBDMA0: 00004000 [GPPTR] ch 26 [007c2de000 Xwayland[15818]] subc 0 mthd 0000 data 00000000
    [ 2465.317354] sysrq: SysRq : Trigger a crash


[I.1] There is a fair bit of logging from nouveau that I cannot interpret, plus some ACPI SystemIO range conflicts


    PID: 1105   TASK: ffff9e9e5d6c0000  CPU: 0   COMMAND: "gnome-shell"         __handle_sysrq -> machine_kexec
    bt: cannot transition from IRQ stack to current process stack:
            IRQ stack pointer: ffff9e9e7dc036f8
        process stack pointer: ffffffff9c878afe
           current stack base: ffffc31c82c84000
    PID: 16280  TASK: ffff9e9e24d44b00  CPU: 1   COMMAND: "tracker-extract"     <userspace> -> do_nmi -> crash_nmi_callback
    PID: 16336  TASK: ffff9e9e68ed2580  CPU: 2   COMMAND: "pkla-check-auth"     <userspace> -> do_nmi -> crash_nmi_callback
    PID: 572    TASK: ffff9e9e66e24b00  CPU: 3   COMMAND: "systemd-journal"     do_filp_open -> do_dentry_open -> ext4_xattr_security_get -> __bpf_prog_run+2070 -> do_nmi -> crash_nmi_callback

    crash> runq
    CPU 0 RUNQUEUE: ffff9e9e7dc195c0
      CURRENT: PID: 1105   TASK: ffff9e9e5d6c0000  COMMAND: "gnome-shell"
      RT PRIO_ARRAY: ffff9e9e7dc19770
         [no tasks queued]
      CFS RB_ROOT: ffff9e9e7dc19658
         [120] PID: 16334  TASK: ffff9e9cb4afcb00  COMMAND: "pool"
         [120] PID: 1      TASK: ffff9e9e6b64a580  COMMAND: "systemd"
         [120] PID: 2545   TASK: ffff9e9e5d724b00  COMMAND: "gdbus"
         [120] PID: 16268  TASK: ffff9e9cb3004b00  COMMAND: "systemd-localed"
         [120] PID: 16275  TASK: ffff9e9d9f2da580  COMMAND: "tracker-miner-a"

    CPU 1 RUNQUEUE: ffff9e9e7dc995c0
      CURRENT: PID: 16280  TASK: ffff9e9e24d44b00  COMMAND: "tracker-extract"
      RT PRIO_ARRAY: ffff9e9e7dc99770
         [no tasks queued]
      CFS RB_ROOT: ffff9e9e7dc99658
         [no tasks queued]

    CPU 2 RUNQUEUE: ffff9e9e7dd195c0
      CURRENT: PID: 16336  TASK: ffff9e9e68ed2580  COMMAND: "pkla-check-auth"
      RT PRIO_ARRAY: ffff9e9e7dd19770
         [no tasks queued]
      CFS RB_ROOT: ffff9e9e7dd19658
         [120] PID: 15715  TASK: ffff9e9e46a0a580  COMMAND: "gnome-session-b"
         [120] PID: 16274  TASK: ffff9e9d9f2d8000  COMMAND: "gdbus"
         [120] PID: 16308  TASK: ffff9e9e25fa4b00  COMMAND: "gdbus"
         [120] PID: 16333  TASK: ffff9e9e68ed4b00  COMMAND: "xdg-user-dirs-g"

    CPU 3 RUNQUEUE: ffff9e9e7dd995c0
      CURRENT: PID: 572    TASK: ffff9e9e66e24b00  COMMAND: "systemd-journal"
      RT PRIO_ARRAY: ffff9e9e7dd99770
         [no tasks queued]
      CFS RB_ROOT: ffff9e9e7dd99658
         [no tasks queued]


[I.2] CPU 0 handled the sysrq interrupt, CPUs 1/2 are in userspace and received the nmi to crash, and
      CPU 3 was attempting to open a directory on an ext4 fs and entered an eBPF program

      None of the runqueues are saturated with processes and it looks like Xwayland is not running 
      or queued to run. 


    crash> ps -S
      RU: 17
      IN: 708
      ZO: 1

    crash> ps -m | grep RU | grep -v swapper
    [0 00:00:00.000] [RU]  PID: 16336  TASK: ffff9e9e68ed2580  CPU: 2   COMMAND: "pkla-check-auth"
    [0 00:00:00.000] [RU]  PID: 16280  TASK: ffff9e9e24d44b00  CPU: 1   COMMAND: "tracker-extract"
    [0 00:00:00.001] [RU]  PID: 16333  TASK: ffff9e9e68ed4b00  CPU: 2   COMMAND: "xdg-user-dirs-g"
    [0 00:00:00.001] [RU]  PID: 16274  TASK: ffff9e9d9f2d8000  CPU: 2   COMMAND: "gdbus"
    [0 00:00:00.001] [RU]  PID: 16308  TASK: ffff9e9e25fa4b00  CPU: 2   COMMAND: "gdbus"
    [0 00:00:00.004] [RU]  PID: 572    TASK: ffff9e9e66e24b00  CPU: 3   COMMAND: "systemd-journal"
    [0 00:00:00.003] [RU]  PID: 1105   TASK: ffff9e9e5d6c0000  CPU: 0   COMMAND: "gnome-shell"
    [0 00:00:00.003] [RU]  PID: 16334  TASK: ffff9e9cb4afcb00  CPU: 0   COMMAND: "pool"
    [0 00:00:00.003] [RU]  PID: 16275  TASK: ffff9e9d9f2da580  CPU: 0   COMMAND: "tracker-miner-a"
    [0 00:00:00.006] [RU]  PID: 15715  TASK: ffff9e9e46a0a580  CPU: 2   COMMAND: "gnome-session-b"
    [0 00:00:00.005] [RU]  PID: 16268  TASK: ffff9e9cb3004b00  CPU: 0   COMMAND: "systemd-localed"
    [0 00:00:00.009] [RU]  PID: 2545   TASK: ffff9e9e5d724b00  CPU: 0   COMMAND: "gdbus"
    [0 00:00:00.012] [RU]  PID: 1      TASK: ffff9e9e6b64a580  CPU: 0   COMMAND: "systemd"


    crash> ps -m | grep -i wayland
    [0 00:00:00.024] [IN]  PID: 15818  TASK: ffff9e9e258aa580  CPU: 2   COMMAND: "Xwayland"
    [0 00:00:04.455] [IN]  PID: 1180   TASK: ffff9e9e5e1da580  CPU: 1   COMMAND: "Xwayland"
    [0 00:00:04.915] [IN]  PID: 15709  TASK: ffff9e9e66a58000  CPU: 0   COMMAND: "gdm-wayland-ses"
    [0 00:40:59.654] [IN]  PID: 1076   TASK: ffff9e9e5ce82580  CPU: 0   COMMAND: "gdm-wayland-ses"


    crash> bt 15818
    PID: 15818  TASK: ffff9e9e258aa580  CPU: 2   COMMAND: "Xwayland"
     #0 [ffffc31c8250fd08] __schedule at ffffffff9c870064
     #1 [ffffc31c8250fda0] schedule at ffffffff9c870736
     #2 [ffffc31c8250fdb8] schedule_hrtimeout_range_clock at ffffffff9c874b49
     #3 [ffffc31c8250fe48] schedule_hrtimeout_range at ffffffff9c874c33
     #4 [ffffc31c8250fe58] ep_poll at ffffffff9c2b825a
     #5 [ffffc31c8250ff10] sys_epoll_wait at ffffffff9c2b9dfe
     #6 [ffffc31c8250ff50] entry_SYSCALL_64_fastpath at ffffffff9c875ff7
        RIP: 00007f2d302960f3  RSP: 00007ffd967500b0  RFLAGS: 00000293
        RAX: ffffffffffffffda  RBX: 0000000002a2fde0  RCX: 00007f2d302960f3
        RDX: 0000000000000100  RSI: 00007ffd967500c0  RDI: 0000000000000007
        RBP: 0000000002688d80   R8: 0000000000000002   R9: 0000000000000000
        R10: 0000000000091649  R11: 0000000000000293  R12: 00000000025f9eb0
        R13: 0000000000000000  R14: 000000000082b060  R15: 0000000002a2fde0
        ORIG_RAX: 00000000000000e8  CS: 0033  SS: 002b


[I.3] Ah it is asleep waiting on events. 

      Let's check where the error message is printed and what conditions it could be printed in


    > grep -ir 'subc.*mthd'
    (1)
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:			if (nvkm_sw_mthd(device->sw, chid, subc, mthd, data))
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:				   "subc %d mthd %04x data %08x\n",
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:			   subc, mthd, data);
    (2)
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:			if (nvkm_sw_mthd(device->sw, chid, subc, mthd, data))
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:				   "subc %d mthd %04x data %08x\n",
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:			   subc, mthd, data);
    (3)
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c:			handled = nvkm_sw_mthd(sw, chid, subc, mthd, data);
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c:			   "ch %d [%s] subc %d mthd %04x data %08x\n",

    > grep -ir -e GPPTR -e PBENTRY .
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:	{ 0x00004000, "GPPTR" },
    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:	{ 0x00040000, "PBENTRY" },


[I.4] We are almost certainly looking at gk104.c since the "msg" includes GPPTR and PBENTRY
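
The status words in the log can be cross-checked against the table entries found by the grep. This is a rough sketch: only the `GPPTR` and `PBENTRY` values come from the grep output above, while the `GPFIFO` and `SIGNATURE` bit values are assumptions inferred from the logged pairs (for example `80006000 [GPFIFO GPPTR SIGNATURE]` and `00004000 [GPPTR]`), not read from the source. The `decode_status` helper is hypothetical.

```python
# Bit names for the PBDMA interrupt status word printed in the log.
PBDMA_INTR_0_BITS = [
    (0x00002000, "GPFIFO"),     # assumption: inferred from log lines
    (0x00004000, "GPPTR"),      # from the gk104.c grep output
    (0x00040000, "PBENTRY"),    # from the gk104.c grep output
    (0x80000000, "SIGNATURE"),  # assumption: inferred from log lines
]


def decode_status(value):
    """Return (names of known set bits, leftover bits not in the table)."""
    names = [name for bit, name in PBDMA_INTR_0_BITS if value & bit]
    known = sum(bit for bit, _ in PBDMA_INTR_0_BITS)
    leftover = value & ~known
    return names, leftover


if __name__ == "__main__":
    for word in (0x80044000, 0x00044000, 0x00004000, 0x80046000):
        names, leftover = decode_status(word)
        print(f"{word:08x} [{' '.join(names)}] leftover={leftover:#x}")
```

Every status word quoted in this bug decodes with no leftover bits under these four names, which fits the table in gk104.c being the one in use.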

      Now walk the chain up to see what calls us


    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:
    static void
    gk104_fifo_intr_pbdma_0(struct gk104_fifo *fifo, int unit)
    {
            struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
            struct nvkm_device *device = subdev->device;
            u32 mask = nvkm_rd32(device, 0x04010c + (unit * 0x2000));
            u32 stat = nvkm_rd32(device, 0x040108 + (unit * 0x2000)) & mask;
            u32 addr = nvkm_rd32(device, 0x0400c0 + (unit * 0x2000));
            u32 data = nvkm_rd32(device, 0x0400c4 + (unit * 0x2000));
            u32 chid = nvkm_rd32(device, 0x040120 + (unit * 0x2000)) & 0xfff;
            u32 subc = (addr & 0x00070000) >> 16;
            u32 mthd = (addr & 0x00003ffc);
            u32 show = stat;
            struct nvkm_fifo_chan *chan;
            unsigned long flags;
            char msg[128];

            if (stat & 0x00800000) {
                    if (device->sw) {
                            if (nvkm_sw_mthd(device->sw, chid, subc, mthd, data))
                                    show &= ~0x00800000;
                    }
            }

            nvkm_wr32(device, 0x0400c0 + (unit * 0x2000), 0x80600008);

            if (show) {
                    nvkm_snprintbf(msg, sizeof(msg), gk104_fifo_pbdma_intr_0, show);
                    chan = nvkm_fifo_chan_chid(&fifo->base, chid, &flags);
                    nvkm_error(subdev, "PBDMA%d: %08x [%s] ch %d [%010llx %s] "     | Error 
                                       "subc %d mthd %04x data %08x\n",             | is 
                               unit, show, msg, chid, chan ? chan->inst->addr : 0,  | printed
                               chan ? chan->object.client->name : "unknown",        | here
                               subc, mthd, data);                                   |
                    nvkm_fifo_chan_put(&fifo->base, flags, &chan);
            }
            
            nvkm_wr32(device, 0x040108 + (unit * 0x2000), stat);
    }
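
The masks in the handler above can be applied by hand to see what register contents would produce the logged `subc`/`mthd` pairs. A small sketch (the `decode_addr` name is made up; the masks are copied from `gk104_fifo_intr_pbdma_0`):

```python
def decode_addr(addr):
    """Split a PBDMA address word the way gk104_fifo_intr_pbdma_0 does."""
    subc = (addr & 0x00070000) >> 16
    mthd = addr & 0x00003FFC
    return subc, mthd


if __name__ == "__main__":
    for addr in (0x55555555, 0xFFFFFFFF, 0x00000000):
        subc, mthd = decode_addr(addr)
        print(f"addr {addr:08x} -> subc {subc} mthd {mthd:04x}")
```

The logged "subc 5 mthd 1554 data 55555555", "subc 7 mthd 3ffc data ffffffff", and "subc 0 mthd 0000 data 00000000" are each consistent with the address register reading the same pattern as the data register (0x55555555, 0xffffffff, and 0 respectively). That is only an observation about the numbers, not a diagnosis, but it suggests the values may be fill patterns rather than a real method call.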


    ./drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:
    static void
    gk104_fifo_intr(struct nvkm_fifo *base)
    {
            struct gk104_fifo *fifo = gk104_fifo(base);
            struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
            struct nvkm_device *device = subdev->device;
            u32 mask = nvkm_rd32(device, 0x002140);
            u32 stat = nvkm_rd32(device, 0x002100) & mask;
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
            if (stat & 0x20000000) {
                    u32 mask = nvkm_rd32(device, 0x0025a0);
                    while (mask) {
                            u32 unit = __ffs(mask);
                            gk104_fifo_intr_pbdma_0(fifo, unit);            <--- this is the only place 
                            gk104_fifo_intr_pbdma_1(fifo, unit);                 gk104_fifo_intr_pbdma_0 is called
                            nvkm_wr32(device, 0x0025a0, (1 << unit));
                            mask &= ~(1 << unit);
                    }
                    stat &= ~0x20000000;
            }
    - - - - - - - - - - - [SNIP] - - - - - - - - - - - 
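
For readers unfamiliar with the `__ffs(mask)` idiom in the loop above: it visits each pending PBDMA unit once, lowest bit first, clearing each bit as it goes. A Python sketch of the same iteration (the `iter_units` name is made up for illustration):

```python
def iter_units(mask):
    """Yield the index of each set bit in mask, lowest first."""
    while mask:
        # Isolate the lowest set bit and convert it to an index,
        # which is what the kernel's __ffs() returns.
        unit = (mask & -mask).bit_length() - 1
        yield unit
        mask &= ~(1 << unit)


if __name__ == "__main__":
    print(list(iter_units(0b1011)))
```

With only PBDMA0 appearing in the logs, the mask read from 0x0025a0 presumably had just bit 0 set each time the handler ran.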


    static const struct nvkm_fifo_func
    gk104_fifo_ = {
            .dtor = gk104_fifo_dtor,
            .oneinit = gk104_fifo_oneinit,
            .init = gk104_fifo_init,
            .fini = gk104_fifo_fini,
            .intr = gk104_fifo_intr,                <--- passed in as a control block. Interrupt handler? Maybe this is the command pushed to the fifo?
            .uevent_init = gk104_fifo_uevent_init,
            .uevent_fini = gk104_fifo_uevent_fini,
            .recover_chan = gk104_fifo_recover_chan,
            .class_get = gk104_fifo_class_get,


    And the only place gk104_fifo_ is used is the following:


    int
    gk104_fifo_new_(const struct gk104_fifo_func *func, struct nvkm_device *device,
                    int index, int nr, struct nvkm_fifo **pfifo)
    {
            struct gk104_fifo *fifo;

            if (!(fifo = kzalloc(sizeof(*fifo), GFP_KERNEL)))
                    return -ENOMEM;
            fifo->func = func;
            INIT_WORK(&fifo->recover.work, gk104_fifo_recover_work);
            *pfifo = &fifo->base;

            return nvkm_fifo_ctor(&gk104_fifo_, device, index, nr, &fifo->base);
    }


    Not sure where to go from here.
Comment 2 Charles Haithcock 2017-07-22 00:58:52 EDT
I forgot to add:

It appears the upgrade downgraded the xorg-x11-drv-nouveau package:


> grep nouv /var/log/dnf.log-20170721 
Jul 21 17:38:03 DEBUG ---> Package xorg-x11-drv-nouveau.x86_64 1:1.0.15-2.fc25 will be downgraded
Jul 21 17:38:03 DEBUG ---> Package xorg-x11-drv-nouveau.x86_64 1:1.0.15-1.fc26 will be a downgrade
 xorg-x11-drv-nouveau                x86_64 1:1.0.15-1.fc26                    fedora         100 k
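
The reason dnf labels this a downgrade can be seen by comparing the two epoch:version-release strings. Below is a simplified sketch of that comparison; it is not the real rpm algorithm (it ignores `~`, `^`, and other special cases), and the helper names are invented for illustration.

```python
import re


def parse_evr(evr):
    """Split 'epoch:version-release' into (epoch, version, release)."""
    epoch, _, rest = evr.rpartition(":")
    version, _, release = rest.partition("-")
    return int(epoch or 0), version, release


def _segments(text):
    # Runs of digits or letters; separators such as '.' are dropped.
    return re.findall(r"\d+|[A-Za-z]+", text)


def _cmp_part(a, b):
    """Simplified segment-wise comparison; returns -1, 0, or 1."""
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        if x.isdigit() and y.isdigit():
            x, y = int(x), int(y)
        elif x.isdigit() != y.isdigit():
            # A numeric segment is treated as newer than an alphabetic one.
            return 1 if x.isdigit() else -1
        if x != y:
            return 1 if x > y else -1
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def compare_evr(a, b):
    ea, va, ra = parse_evr(a)
    eb, vb, rb = parse_evr(b)
    if ea != eb:
        return 1 if ea > eb else -1
    return _cmp_part(va, vb) or _cmp_part(ra, rb)


if __name__ == "__main__":
    print(compare_evr("1:1.0.15-1.fc26", "1:1.0.15-2.fc25"))
```

Epoch and version are identical here (1 and 1.0.15), so the result is decided by the release field, where 1.fc26 sorts below 2.fc25. In other words, the F26 package is the same upstream 1.0.15 with a lower build number, so the "downgrade" may be a packaging artifact rather than older driver code.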
Comment 3 Charles Haithcock 2017-07-22 12:16:31 EDT
Interestingly, I logged into Sway, a window manager that uses Wayland, and for some reason it works fine.
Comment 5 Charles Haithcock 2017-08-07 09:23:31 EDT
(In reply to Johnny B. Goode from comment #4)
> This is known problem but without good answer


Ah, then should we close this as DUPLICATE?
Comment 6 Johnny B. Goode 2017-08-07 12:16:13 EDT
This is a question to Adam or Ben.

New patches are here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b5477d9dabd96ded4c5ef7a5f08b00188fc1dec
but some of these are already in kernel-4.12.5, so we will probably find out pretty quickly.
Comment 7 Charles Haithcock 2017-08-10 14:04:57 EDT
(In reply to Johnny B. Goode from comment #6)
> This is a question to Adam or Ben.
> 
> New patches are here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=0b5477d9dabd96ded4c5ef7a5f08b00188fc1dec
> but some of these are in kernel-4.12.5 so we will see probably pretty
> quickly.

Awesome, I can attempt an update and switch over to Wayland when it is released, then report whether the patch helped. If you would like, I don't mind trying a test kernel if you can build one with the patches in place.
