Bug 1463157

Summary: [GK106] GTX 660 freeze computer shortly after login
Product: Red Hat Enterprise Linux 7 Reporter: Tomas Pelka <tpelka>
Component: xorg-x11-drv-nouveau Assignee: Ben Skeggs <bskeggs>
Status: CLOSED WONTFIX QA Contact: Desktop QE <desktop-qa-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.4 CC: bskeggs, jan.public, kherbst, tpelka
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-11 21:47:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1547138    

Description Tomas Pelka 2017-06-20 09:10:29 UTC
Description of problem:
I can see the following in the kernel log:

Jun 20 11:03:05 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:03:15 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:03:25 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out
Jun 20 11:03:35 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:03:45 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:03:55 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out
Jun 20 11:04:05 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:04:15 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:04:25 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out
Jun 20 11:04:35 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:04:45 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:04:55 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out
Jun 20 11:05:05 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:05:15 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out
Jun 20 11:05:25 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out
Jun 20 11:05:31 localhost.localdomain kernel: INFO: task kworker/u16:3:339 blocked for more than 120 seconds.
Jun 20 11:05:31 localhost.localdomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
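
For triage, the repeating pattern in the log above can be tallied from saved journal output. The following is a minimal sketch, not part of any existing tool: the function name `count_drm_timeouts` and the regular expression are assumptions derived only from the message format shown in this report.

```python
import re
from collections import Counter

# Assumed message shape, taken from the excerpt in this report:
#   [drm:<helper> [drm_kms_helper]] *ERROR* [CRTC:<id>:<head>] <event> timed out
TIMEOUT_RE = re.compile(
    r"\[drm:(?P<func>\w+) \[drm_kms_helper\]\] \*ERROR\* "
    r"\[CRTC:(?P<crtc>\d+):(?P<head>[\w-]+)\] (?P<event>\w+) timed out"
)


def count_drm_timeouts(lines):
    """Count (helper function, event) pairs among DRM timeout messages."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("func"), match.group("event"))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the description; a real run would read a log file.
    sample = [
        "Jun 20 11:03:05 localhost.localdomain kernel: [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out",
        "Jun 20 11:03:15 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] hw_done timed out",
        "Jun 20 11:03:25 localhost.localdomain kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:40:head-1] flip_done timed out",
    ]
    for (func, event), n in sorted(count_drm_timeouts(sample).items()):
        print(f"{func}: {event} x{n}")
```

Grouping by helper function and event makes it easy to check whether a given boot shows the same three-message cycle (one `swap_state` hw_done, one `wait_for_dependencies` hw_done, one `wait_for_dependencies` flip_done) seen here.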

Version-Release number of selected component (if applicable):
kernel-3.10.0-680.el7.x86_64
xorg-x11-server-Xorg-1.19.3-7.el7.x86_64


How reproducible:
60%

Steps to Reproduce:
1. boot computer
2.
3.

Actual results:
see above

Expected results:
no freeze

Additional info:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK106 [GeForce GTX 660] [10de:11c0] (rev a1)

Comment 1 Tomas Pelka 2017-06-20 09:11:26 UTC
This freeze is actually also followed by a crash:

Jun 20 11:05:31 localhost.localdomain kernel: kworker/u16:3   D 0000000000000246     0   339      2 0x00000000
Jun 20 11:05:31 localhost.localdomain kernel: Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
Jun 20 11:05:31 localhost.localdomain kernel:  ffff880506acfc00 0000000000000046 ffff880506ad0000 ffff880506acffd8
Jun 20 11:05:31 localhost.localdomain kernel:  ffff880506acffd8 ffff880506acffd8 ffff880506ad0000 0000000000000000
Jun 20 11:05:31 localhost.localdomain kernel:  ffff880506ad0000 7fffffffffffffff ffff8804eeafe540 0000000000000246
Jun 20 11:05:31 localhost.localdomain kernel: Call Trace:
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff816a6f09>] schedule+0x29/0x70
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff816a4a19>] schedule_timeout+0x239/0x2c0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff811de381>] ? __slab_free+0x81/0x2f0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff8145ec9f>] dma_fence_default_wait+0x1cf/0x230
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff8145e9a0>] ? dma_fence_free+0x20/0x20
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff8145e889>] dma_fence_wait_timeout+0x39/0xd0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffffc018cc0d>] drm_atomic_helper_wait_for_fences+0x7d/0x100 [drm_kms_helper]
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffffc028e095>] nv50_disp_atomic_commit_tail+0x55/0x1180 [nouveau]
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffffc028f1d2>] nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810a87fa>] process_one_work+0x17a/0x440
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810a94c6>] worker_thread+0x126/0x3c0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810a93a0>] ? manage_workers.isra.24+0x2a0/0x2a0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810b096f>] kthread+0xcf/0xe0
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810b08a0>] ? insert_kthread_work+0x40/0x40
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff816b2958>] ret_from_fork+0x58/0x90
Jun 20 11:05:31 localhost.localdomain kernel:  [<ffffffff810b08a0>] ? insert_kthread_work+0x40/0x40
Jun 2

Comment 2 Tomas Pelka 2017-06-20 09:15:42 UTC
One more thing: it seems I can reproduce this 100% of the time by logging in to gnome-session and playing a video (Big Buck Bunny trailer, ogv) in totem.

Kernel shows: 
nouveau 0000:01:00.0: gr: TRAP ch 2 [023fad6000 X[1330]]
Jun 20 11:14:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: GPC0/PROP trap: 00000080 [ZETA_STORAGE_TYPE_MISMATCH] x = 80, y = 96, format = 0, storage type = fe
Jun 20 11:14:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: TRAP ch 2 [023fad6000 X[1330]]
Jun 20 11:14:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: GPC0/PROP trap: 00000080 [ZETA_STORAGE_TYPE_MISMATCH] x = 160, y = 320, format = 0, storage type = fe
Jun 20 11:14:04 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Jun 20 11:14:04 localhost.localdomain kernel: nouveau 0000:01:00.0: fifo: gr engine fault on channel 4, recovering...


and the desktop freezes.
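
To compare trap reports across reproductions, the fields of the `GPC0/PROP trap` lines above can be pulled out mechanically. This is a minimal sketch under stated assumptions: `parse_prop_trap` is a hypothetical helper, the pattern is inferred only from the two lines in this comment, and the coordinate, format and storage-type values are kept as raw strings because the log does not say which base they are printed in.

```python
import re

# Pattern inferred from the trap lines in this comment (an assumption,
# not taken from the nouveau source).
TRAP_RE = re.compile(
    r"GPC(?P<gpc>\d+)/PROP trap: (?P<code>[0-9a-f]+) \[(?P<name>\w+)\] "
    r"x = (?P<x>\w+), y = (?P<y>\w+), "
    r"format = (?P<format>\w+), storage type = (?P<storage_type>\w+)"
)


def parse_prop_trap(line):
    """Return the trap fields as a dict, or None if the line is not a PROP trap."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The trap code is zero-padded hex in the log ("00000080").
    fields["code"] = int(fields["code"], 16)
    fields["gpc"] = int(fields["gpc"])
    return fields


if __name__ == "__main__":
    line = (
        "Jun 20 11:14:00 localhost.localdomain kernel: nouveau 0000:01:00.0: gr: "
        "GPC0/PROP trap: 00000080 [ZETA_STORAGE_TYPE_MISMATCH] "
        "x = 80, y = 96, format = 0, storage type = fe"
    )
    print(parse_prop_trap(line))
```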

Comment 3 Tomas Pelka 2017-06-20 09:23:12 UTC
I was also able to trigger this issue with LibreOffice presentation mode.

Comment 4 Tomas Pelka 2017-06-20 14:52:52 UTC
I can reproduce on 

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK110 [GeForce GTX 780] [10de:1004] (rev a1)

too

Comment 7 Ben Skeggs 2018-04-13 10:55:53 UTC
Tomas,

Can you reproduce this on 7.5?

Thanks,
Ben.

Comment 8 Tomas Pelka 2018-04-13 11:12:09 UTC
(In reply to Ben Skeggs from comment #7)
> Tomas,
> 
> Can you reproduce this on 7.5?
> 
> Thanks,
> Ben.

Tomas please have a look.

Thanks
-Tom

Comment 9 Tomas Hudziec 2018-04-17 14:10:06 UTC
I can reproduce it on 7.5 with kernel-3.10.0-862.el7.x86_64. The desktop froze when playing a video, installing libreoffice-impress, and moving the lo-impress window.

Kernel call trace from journalctl:
Apr 17 14:04:38 localhost.localdomain kernel: INFO: task kworker/u16:5:343 blocked for more than 120 seconds.
Apr 17 14:04:38 localhost.localdomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 17 14:04:38 localhost.localdomain kernel: kworker/u16:5   D ffff943407281fa0     0   343      2 0x00000000
Apr 17 14:04:38 localhost.localdomain kernel: Workqueue: events_unbound nv50_disp_atomic_commit_work [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel: Call Trace:
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc02fd307>] ? nvkm_client_notify_get+0x27/0x40 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc02feb5a>] ? nvkm_ioctl_ntfy_get+0x6a/0xc0 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff86512f49>] schedule+0x29/0x70
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff865108b9>] schedule_timeout+0x239/0x2c0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03af912>] ? nvkm_client_ioctl+0x12/0x20 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc02fc048>] ? nvif_object_ioctl+0x48/0x60 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03b266c>] ? nouveau_bo_rd32+0x2c/0x30 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03cea2e>] ? nv84_fence_read+0x2e/0x30 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03ccbfc>] ? nouveau_fence_no_signaling+0x2c/0x90 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff86295adc>] dma_fence_default_wait+0x1cc/0x220
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff862956a0>] ? dma_fence_release+0xa0/0xa0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff862954df>] dma_fence_wait_timeout+0x3f/0xe0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc02dc869>] drm_atomic_helper_wait_for_fences+0x69/0xe0 [drm_kms_helper]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03c27b5>] nv50_disp_atomic_commit_tail+0x55/0x1200 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff8651291c>] ? __schedule+0x41c/0xa20
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03c3972>] nv50_disp_atomic_commit_work+0x12/0x20 [nouveau]
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85eb2dff>] process_one_work+0x17f/0x440
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85eb3ac6>] worker_thread+0x126/0x3c0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85eb39a0>] ? manage_workers.isra.24+0x2a0/0x2a0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85ebae31>] kthread+0xd1/0xe0
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85ebad60>] ? insert_kthread_work+0x40/0x40
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff8651f637>] ret_from_fork_nospec_begin+0x21/0x21
Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff85ebad60>] ? insert_kthread_work+0x40/0x40
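
When comparing this trace with the one from comment 1, it helps to drop the frames marked with `?`, which the kernel prints for addresses found on the stack that it could not confirm as part of the call chain. The following is a minimal sketch of that filtering; `reliable_frames` is a hypothetical helper and the pattern is an assumption based on the trace format shown in this report.

```python
import re

# Frame shape assumed from the traces in this report:
#   [<address>] [? ]symbol+0xOFFSET/0xSIZE [module]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(lines):
    """Return symbol names of call-trace frames that are not marked with '?'."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and not match.group("guess"):
            frames.append(match.group("sym"))
    return frames


if __name__ == "__main__":
    # Three frames copied from the trace above.
    sample = [
        "Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc02fd307>] ? nvkm_client_notify_get+0x27/0x40 [nouveau]",
        "Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffff86512f49>] schedule+0x29/0x70",
        "Apr 17 14:04:38 localhost.localdomain kernel:  [<ffffffffc03c27b5>] nv50_disp_atomic_commit_tail+0x55/0x1200 [nouveau]",
    ]
    print(reliable_frames(sample))
```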

Comment 16 Chris Williams 2020-11-11 21:47:24 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2 and will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7