Bug 1251331
Summary: | Lenovo W541 Xorg freezes when mini display port cable is plugged in - 3.10.0-267.el7 WARNING: at drivers/gpu/drm/drm_dp_mst_topology.c:1272 process_single_tx_qlock+0x4b6/0x540 [drm_kms_helper]() | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Dave Wysochanski <dwysocha> | |
Component: | kernel | Assignee: | Dave Airlie <airlied> | |
kernel sub component: | Graphics | QA Contact: | Desktop QE <desktop-qa-list> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | unspecified | CC: | airlied, dwysocha, ian, lmiksik, rclark, tpelka, zlynx | |
Version: | 7.1 | Keywords: | OtherQA, Regression | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | kernel-3.10.0-323.el7 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1274157 (view as bug list) | Environment: | ||
Last Closed: | 2015-11-19 23:04:55 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1274157 | |||
Attachments: |
Description
Dave Wysochanski
2015-08-07 02:07:05 UTC
Hey Dave A - can you take a look at this oops and let me know if you want me to try any further experiments on other kernels or enable debug, etc? I'm not familiar with this area of the kernel so need some pointers. I'd like to get this working since it's my home setup and the new laptop seems to lockup every time I use my home monitor. Thanks. Based on my tests, and that fact that the warning I'm getting is inside code which was checked in between the 123 and the 229 kernel, this is a regression from the RHEL7.0 GA kernel 3.10.0-123.el7 Test A Test B .123 PASS PASS .229 FAIL FAIL .267 FAIL FAIL Laptop used for all tests is the Lenovo W541 Test A: With laptop powered down, plug in mini display port cable, then power machine on Test B: With laptop powered on, log in to gnome desktop. Pull up system settings / displays. Then plug in mini display port cable. Haven't got a vmcore yet or further debug, but I'll work on that next. Wanted to confirm if a prior kernel worked since I can use that as a workaround in the interim. Ok got a vmcore via sysrq-c after plugging in the cable (Test B above). It looks to me like Xorg can deadlock on itself while trying to obtain a mutex it already has. Interestingly, the automated analyzer actually went into its own loop as a result of this! In any case, the Xorg pid 1774 deserves a deeper dive as the bug must be somewhere in here. PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 The kworker pid is the oldest blocked pid 406, but only slightly. This pid is waiting on a mutex held by Xorg pid 1774. But Xorg pid 1774 too is waiting on a mutex. However, the mutex it's waiting on is the one which it owns. KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-229.el7.x86_64/vmlinux DUMPFILE: /cores/retrace/tasks/843210390/crash/vmcore [PARTIAL DUMP] CPUS: 8 DATE: Sun Aug 9 00:37:30 2015 UPTIME: 00:03:54 LOAD AVERAGE: 1.86, 0.91, 0.36 TASKS: 428 ... RELEASE: 3.10.0-229.el7.x86_64 VERSION: #1 SMP Thu Jan 29 18:37:38 EST 2015 MACHINE: x86_64 (2793 Mhz) MEMORY: 15.6 GB PANIC: "SysRq : Trigger a crash" sys output <end> Checking log - Did not find DMI string Looking for tainted kernel with 'sys -t' crash> sys -t TAINTED_MASK: 200 W crash> Looking for tainted modules with: 'mod -t' Found 'no tainted modules' - NOTE: If this is a kernel crash it almost certain a kernel bug!!! Detecting panic type from 'sys' output - Found panic string: PANIC: "SysRq : Trigger a crash" - Found sysrq triggered crash dump Process triggering the sysrq crash dump was crash> files -R /proc/sysrq-trigger crash> Checking log output for RIP symbol rip_string = [ 234.525438] RIP: 0010:[<ffffffff81397486>] [<ffffffff81397486>] sysrq_handle_crash+0x16/0x20 rip_symbol = sysrq_handle_crash ... END: Basic vmcore analysis BEGIN: Basic hung vmcore analysis Analyzing LOAD AVERAGE from 'sys' - Found load avgs: 1 (1 min) 0 (5 min) 0 (15 min) - Found 8 tasks in 'RU' state - Found 2 tasks in 'UN' state Showing oldest blocked pid (NOTE: some processes such as gablogd filtered out) Oldest blocked pid: 406 Here are the oldest 2 blocked (D-state) processes and how long they've been blocked Blocked processes <begin> [0 00:02:19.801] [UN] PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" [0 00:02:19.614] [UN] PID: 406 TASK: ffff880463ebf1c0 CPU: 3 COMMAND: "kworker/u16:5" crash> Blocked processes <end> Backtrace of oldest blocked task or task triggering hung_task panic backtrace <begin> PID: 406 TASK: ffff880463ebf1c0 CPU: 3 COMMAND: "kworker/u16:5" #0 [ffff88046236bbe0] __schedule at ffffffff81608cfd #1 [ffff88046236bc48] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff88046236bc58] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff88046236bcb0] mutex_lock at ffffffff816075ff #4 [ffff88046236bcc8] drm_fb_helper_hotplug_event at ffffffffa00ea9fa [drm_kms_helper] #5 [ffff88046236bcf8] intel_fbdev_output_poll_changed at ffffffffa026199e [i915] #6 [ffff88046236bd08] drm_kms_helper_hotplug_event at ffffffffa00e25b7 [drm_kms_helper] #7 [ffff88046236bd20] intel_dp_check_mst_status at ffffffffa026f931 [i915] #8 [ffff88046236bd90] intel_dp_hpd_pulse at ffffffffa02703e0 [i915] #9 [ffff88046236bdd0] i915_digport_work_func at ffffffffa02292ed [i915] #10 [ffff88046236be20] process_one_work at ffffffff8108f0ab #11 [ffff88046236be68] worker_thread at ffffffff8108fe8b #12 [ffff88046236bec8] kthread at ffffffff8109726f #13 [ffff88046236bf50] ret_from_fork at ffffffff81613cfc crash> backtrace <end> Checking for specific hung task signatures - Found pid 406 waiting on mutex. Next steps: check holder of mutex Attempting to pull mutex off stack - get_value_from_stack: pid == 406, frame = mutex_lock, pushes = 6 - INFO: mutex == 0xffff880463cb8a08 - INFO: mutex_owner == 0xffff8804674f38e0 - INFO: mutex.wait_list == 0xffff880463cb8a10 The mutex wait_list has the following number of waters: crash> list -H 0xffff880463cb8a10 | wc -l 3 crash> You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' - BEGIN backtrace crash> bt 0xffff8804674f38e0 PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 #20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9 RIP: 00007f87b345f257 RSP: 00007fff5ca45ec0 RFLAGS: 00013206 RAX: 0000000000000010 RBX: ffffffff81613da9 RCX: 0000000000000001 RDX: 00007fff5ca45f70 RSI: 00000000c05064a7 RDI: 0000000000000009 RBP: 00007fff5ca45f70 R8: 000000000000ffff R9: 0000000000000003 R10: fffffffffffff5e3 R11: 0000000000003246 R12: 0000000000000000 R13: 00007f87b5de4040 R14: 0000000000000009 R15: 00000000c05064a7 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b crash> - END backtrace Now analyzing mutex_owner backtrace - Found pid 0xffff8804674f38e0 waiting on mutex. Next steps: check holder of mutex Attempting to pull mutex off stack - get_value_from_stack: pid == 0xffff8804674f38e0, frame = mutex_lock, pushes = 6 - INFO: mutex == 0xffff880463cb8a08 - INFO: mutex_owner == 0xffff8804674f38e0 - INFO: mutex.wait_list == 0xffff880463cb8a10 The mutex wait_list has the following number of waters: crash> list -H 0xffff880463cb8a10 | wc -l 3 crash> You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' - BEGIN backtrace crash> bt 0xffff8804674f38e0 PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 #20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9 RIP: 00007f87b345f257 RSP: 00007fff5ca45ec0 RFLAGS: 00013206 RAX: 0000000000000010 RBX: ffffffff81613da9 RCX: 0000000000000001 RDX: 00007fff5ca45f70 RSI: 00000000c05064a7 RDI: 0000000000000009 RBP: 00007fff5ca45f70 R8: 000000000000ffff R9: 0000000000000003 R10: fffffffffffff5e3 R11: 0000000000003246 R12: 0000000000000000 R13: 00007f87b5de4040 R14: 0000000000000009 R15: 00000000c05064a7 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b crash> - END backtrace Now analyzing mutex_owner backtrace - Found pid 0xffff8804674f38e0 waiting on mutex. Next steps: check holder of mutex Attempting to pull mutex off stack - get_value_from_stack: pid == 0xffff8804674f38e0, frame = mutex_lock, pushes = 6 - INFO: mutex == 0xffff880463cb8a08 - INFO: mutex_owner == 0xffff8804674f38e0 - INFO: mutex.wait_list == 0xffff880463cb8a10 The mutex wait_list has the following number of waters: crash> list -H 0xffff880463cb8a10 | wc -l 3 crash> You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' - BEGIN backtrace crash> bt 0xffff8804674f38e0 PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 #20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9 RIP: 00007f87b345f257 RSP: 00007fff5ca45ec0 RFLAGS: 00013206 RAX: 0000000000000010 RBX: ffffffff81613da9 RCX: 0000000000000001 RDX: 00007fff5ca45f70 RSI: 00000000c05064a7 RDI: 0000000000000009 RBP: 00007fff5ca45f70 R8: 000000000000ffff R9: 0000000000000003 R10: fffffffffffff5e3 R11: 0000000000003246 R12: 0000000000000000 .... Proof that the Xorg process owns the mutex it is trying to obtain. crash> set 1774 PID: 1774 COMMAND: "Xorg" TASK: ffff8804674f38e0 [THREAD_INFO: ffff8804516fc000] CPU: 6 STATE: TASK_UNINTERRUPTIBLE crash> bt -f | grep "mutex_lock at" #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff crash> px (0xffff8804516ff998^C crash> bt PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 #20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9 RIP: 00007f87b345f257 RSP: 00007fff5ca45ec0 RFLAGS: 00013206 RAX: 0000000000000010 RBX: ffffffff81613da9 RCX: 0000000000000001 RDX: 00007fff5ca45f70 RSI: 00000000c05064a7 RDI: 0000000000000009 RBP: 00007fff5ca45f70 R8: 000000000000ffff R9: 0000000000000003 R10: fffffffffffff5e3 R11: 0000000000003246 R12: 0000000000000000 R13: 00007f87b5de4040 R14: 0000000000000009 R15: 00000000c05064a7 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b crash> dis -r ffffffff816075ff | tail 0xffffffff816075e5 <mutex_lock+5>: push %rbp 0xffffffff816075e6 <mutex_lock+6>: mov %rsp,%rbp 0xffffffff816075e9 <mutex_lock+9>: push %rbx 0xffffffff816075ea <mutex_lock+10>: mov %rdi,%rbx 0xffffffff816075ed <mutex_lock+13>: callq 0xffffffff816095f0 <_cond_resched> 0xffffffff816075f2 <mutex_lock+18>: mov %rbx,%rdi <---- rdx == rdi == struct mutex * 0xffffffff816075f5 <mutex_lock+21>: lock decl (%rdi) 0xffffffff816075f8 <mutex_lock+24>: jns 0xffffffff816075ff <mutex_lock+31> 0xffffffff816075fa <mutex_lock+26>: callq 0xffffffff816080d0 <__mutex_lock_slowpath> 0xffffffff816075ff <mutex_lock+31>: mov %gs:0xb7c0,%rax crash> dis __mutex_lock_slowpath | head 0xffffffff816080d0 <__mutex_lock_slowpath>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffff816080d5 <__mutex_lock_slowpath+5>: push %rbp 0xffffffff816080d6 <__mutex_lock_slowpath+6>: mov %rsp,%rbp 0xffffffff816080d9 <__mutex_lock_slowpath+9>: push %r15 0xffffffff816080db <__mutex_lock_slowpath+11>: push %r14 0xffffffff816080dd <__mutex_lock_slowpath+13>: push %r13 0xffffffff816080df <__mutex_lock_slowpath+15>: mov %gs:0xb7c0,%r13 0xffffffff816080e8 <__mutex_lock_slowpath+24>: push %r12 0xffffffff816080ea <__mutex_lock_slowpath+26>: push %rbx <---- 6th push 0xffffffff816080eb <__mutex_lock_slowpath+27>: mov %rdi,%rbx crash> bt -f | grep "mutex_lock at" #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff crash> px (0xffff8804516ff998 - 6*8) $1 = 0xffff8804516ff968 crash> rd 0xffff8804516ff968 ffff8804516ff968: ffff880463cb8a08 ...c.... crash> mutex.owner ffff880463cb8a08 owner = 0xffff8804674f38e0 crash> bt 0xffff8804674f38e0 PID: 1774 TASK: ffff8804674f38e0 CPU: 6 COMMAND: "Xorg" #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper] #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper] #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper] #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core] #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core] #10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm] #11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm] #12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper] #13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915] #14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper] #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] #18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75 #19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1 #20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9 RIP: 00007f87b345f257 RSP: 00007fff5ca45ec0 RFLAGS: 00013206 RAX: 0000000000000010 RBX: ffffffff81613da9 RCX: 0000000000000001 RDX: 00007fff5ca45f70 RSI: 00000000c05064a7 RDI: 0000000000000009 RBP: 00007fff5ca45f70 R8: 000000000000ffff R9: 0000000000000003 R10: fffffffffffff5e3 R11: 0000000000003246 R12: 0000000000000000 R13: 00007f87b5de4040 R14: 0000000000000009 R15: 00000000c05064a7 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b Ok so the bug appears to be this. The dev->mode_config.mutex is first obtained inside what appears to be generic code in drm_mode_getconnector, before calling fill_modes == drm_helper_probe_single_connector_modes. ~~~ drivers/gpu/drm/drm_crtc.c 1831 int drm_mode_getconnector(struct drm_device *dev, void *data, 1832 struct drm_file *file_priv) 1833 { ... 1853 1854 DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id); 1855 1856 mutex_lock(&dev->mode_config.mutex); ... 1872 if (out_resp->count_modes == 0) { 1873--> connector->funcs->fill_modes(connector, 1874 dev->mode_config.max_width, 1875 dev->mode_config.max_height); ~~~ Ultimately later we call down into the i915 driver and get inside drivers/gpu/drm/i915/intel_dp_mst.c line 427 function intel_dp_destroy_mst_connector where the driver tries to obtain the same mutex a second time, and deadlocks on itself. ~~~ drivers/gpu/drm/i915/intel_dp_mst.c 420 static void intel_dp_destroy_mst_connector(struct drm_dp_mst_topology_mgr *mgr, 421 struct drm_connector *connector) 422 { 423 struct intel_connector *intel_connector = to_intel_connector(connector); 424 struct drm_device *dev = connector->dev; 425 struct drm_i915_private *dev_priv = dev->dev_private; 426 /* need to nuke the connector */ 427--> mutex_lock(&dev->mode_config.mutex); 428 intel_connector_dpms(connector, DRM_MODE_DPMS_OFF); ~~~ Next steps is to look for some obvious patch though it may be nearly impossible depending on how much the code has changed upstream vs RHEL7.1 Analysis details ================ Which mutex has Xorg pid 1774 tried to obtain twice? ~~~ #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915] crash> dis -lr ffffffffa0271da9 | tail 0xffffffffa0271d93 <intel_dp_destroy_mst_connector+19>: mov %rsi,%rbx /usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 427 0xffffffffa0271d96 <intel_dp_destroy_mst_connector+22>: lea 0x208(%r13),%r12 /usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 425 0xffffffffa0271d9d <intel_dp_destroy_mst_connector+29>: mov 0x28(%r13),%r14 /usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 427 0xffffffffa0271da1 <intel_dp_destroy_mst_connector+33>: mov %r12,%rdi 0xffffffffa0271da4 <intel_dp_destroy_mst_connector+36>: callq 0xffffffff816075e0 <mutex_lock> /usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 428 0xffffffffa0271da9 <intel_dp_destroy_mst_connector+41>: mov $0x3,%esi drivers/gpu/drm/i915/intel_dp_mst.c 420 static void intel_dp_destroy_mst_connector(struct drm_dp_mst_topology_mgr *mgr, 421 struct drm_connector *connector) 422 { 423 struct intel_connector *intel_connector = to_intel_connector(connector); 424 struct drm_device *dev = connector->dev; 425 struct drm_i915_private *dev_priv = dev->dev_private; 426 /* need to nuke the connector */ 427--> mutex_lock(&dev->mode_config.mutex); 428 intel_connector_dpms(connector, DRM_MODE_DPMS_OFF); 429 mutex_unlock(&dev->mode_config.mutex); 430 431 intel_connector->unregister(intel_connector); 432 433 mutex_lock(&dev->mode_config.mutex); 434 drm_fb_helper_remove_one_connector(&dev_priv->fbdev->helper, connector); 435 drm_connector_cleanup(connector); 436 mutex_unlock(&dev->mode_config.mutex); ~~~ Looks like `drm_device.mode_config.mutex` That was simple, but where did we obtain it first? Where else is it locked? ~~~ $ grep -n mode_config.mutex drivers/gpu/drm/i915/*.[ch] | grep mutex_lock drivers/gpu/drm/i915/i915_debugfs.c:1680: int ret = mutex_lock_interruptible(&dev->mode_config.mutex); drivers/gpu/drm/i915/i915_debugfs.c:1726: ret = mutex_lock_interruptible(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp.c:4411: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp_mst.c:413: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp_mst.c:427: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp_mst.c:433: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_lvds.c:1003: mutex_lock(&dev->mode_config.mutex); ~~~ We can rule out /intel_dp_mst.c:427 and 433 since we see them above. Examine the following ones ~~~ drivers/gpu/drm/i915/i915_debugfs.c:1680: int ret = mutex_lock_interruptible(&dev->mode_config.mutex); drivers/gpu/drm/i915/i915_debugfs.c:1726: ret = mutex_lock_interruptible(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp.c:4411: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_dp_mst.c:413: mutex_lock(&dev->mode_config.mutex); drivers/gpu/drm/i915/intel_lvds.c:1003: mutex_lock(&dev->mode_config.mutex); ~~~ There's a handful there so let's go back to the backtrace and see if we can spot something that fits ~~~ #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] crash> dis -lr ffffffffa00788ff | tail 0xffffffffa00788d5 <drm_mode_getconnector+981>: cmp $0xc,%rdx 0xffffffffa00788d9 <drm_mode_getconnector+985>: jne 0xffffffffa00788b2 <drm_mode_getconnector+946> 0xffffffffa00788db <drm_mode_getconnector+987>: jmpq 0xffffffffa007884e <drm_mode_getconnector+846> /usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/drm_crtc.c: 1873 0xffffffffa00788e0 <drm_mode_getconnector+992>: mov 0xa0(%r13),%rax 0xffffffffa00788e7 <drm_mode_getconnector+999>: mov 0x39c(%r15),%edx 0xffffffffa00788ee <drm_mode_getconnector+1006>: mov 0x398(%r15),%esi 0xffffffffa00788f5 <drm_mode_getconnector+1013>: mov -0xa8(%rbp),%rdi 0xffffffffa00788fc <drm_mode_getconnector+1020>: callq *0x28(%rax) 0xffffffffa00788ff <drm_mode_getconnector+1023>: jmpq 0xffffffffa00785fe <drm_mode_getconnector+254> drivers/gpu/drm/drm_crtc.c 1831 int drm_mode_getconnector(struct drm_device *dev, void *data, 1832 struct drm_file *file_priv) 1833 { ... 1853 1854 DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id); 1855 1856 mutex_lock(&dev->mode_config.mutex); 1857 1858 connector = drm_connector_find(dev, out_resp->connector_id); 1859 if (!connector) { 1860 ret = -ENOENT; 1861 goto out; 1862 } 1863 ... 1871 1872 if (out_resp->count_modes == 0) { 1873--> connector->funcs->fill_modes(connector, 1874 dev->mode_config.max_width, 1875 dev->mode_config.max_height); 1876 } ~~~ Looks like we may have gotten lucky. Assuming 'dev->mode_config.mutex' is the same on line 1856 of drm_mode_getconnector, then I'd say this is where this Xorg process first obtained it. Note that the next function matches fill_modes even though it is i915 specific Now it does seem like i915 calls drm_helper_probe_single_connector_modes ~~~ #15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper] #16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm] n gpu/drm/i915/intel_crt.c 768 .fill_modes = drm_helper_probe_single_connector_modes, o gpu/drm/i915/intel_dp.c 3898 .fill_modes = drm_helper_probe_single_connector_modes, p gpu/drm/i915/intel_dp_mst.c 321 .fill_modes = drm_helper_probe_single_connector_modes, q gpu/drm/i915/intel_dsi.c 658 .fill_modes = drm_helper_probe_single_connector_modes, r gpu/drm/i915/intel_dvo.c 387 .fill_modes = drm_helper_probe_single_connector_modes, s gpu/drm/i915/intel_hdmi.c 1403 .fill_modes = drm_helper_probe_single_connector_modes, t gpu/drm/i915/intel_lvds.c 526 .fill_modes = drm_helper_probe_single_connector_modes, u gpu/drm/i915/intel_sdvo.c 2235 .fill_modes = drm_helper_probe_single_connector_modes, v gpu/drm/i915/intel_tv.c 1516 .fill_modes = drm_helper_probe_single_connector_modes, ~~~ So let's check the mutex here. Let's get the first arg 'dev' from the stack here and see if the mutex address matches up. ~~~ crash> dis -r ffffffffa0069b6c | tail 0xffffffffa0069b42 <drm_ioctl+450>: js 0xffffffffa0069e10 <drm_ioctl+1168> 0xffffffffa0069b48 <drm_ioctl+456>: testb $0x10,0x4(%r10) 0xffffffffa0069b4d <drm_ioctl+461>: mov %r11,-0xe0(%rbp) 0xffffffffa0069b54 <drm_ioctl+468>: je 0xffffffffa0069c80 <drm_ioctl+768> 0xffffffffa0069b5a <drm_ioctl+474>: mov %r10,-0xd8(%rbp) 0xffffffffa0069b61 <drm_ioctl+481>: mov %r15,%rdx 0xffffffffa0069b64 <drm_ioctl+484>: mov %rbx,%rsi 0xffffffffa0069b67 <drm_ioctl+487>: mov %r14,%rdi 0xffffffffa0069b6a <drm_ioctl+490>: callq *%rcx 0xffffffffa0069b6c <drm_ioctl+492>: mov -0xd8(%rbp),%r10 crash> dis drm_mode_getconnector | head 0xffffffffa0078500 <drm_mode_getconnector>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffffa0078505 <drm_mode_getconnector+5>: push %rbp 0xffffffffa0078506 <drm_mode_getconnector+6>: mov %rsp,%rbp 0xffffffffa0078509 <drm_mode_getconnector+9>: push %r15 0xffffffffa007850b <drm_mode_getconnector+11>: mov %rdi,%r15 0xffffffffa007850e <drm_mode_getconnector+14>: push %r14 <--- 3rd push crash> bt -f | grep drm_ioctl #17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm] crash> px (0xffff8804516ffdb8 - 3*8) $1 = 0xffff8804516ffda0 crash> rd 0xffff8804516ffda0 ffff8804516ffda0: ffff880463cb8800 ...c.... crash> p &((struct drm_device *)0xffff880463cb8800)->mode_config.mutex $2 = (struct mutex *) 0xffff880463cb8a08 ~~~ Recall our previous mutex address we obtained: ~~~ crash> mutex.owner ffff880463cb8a08 ~~~ Indeed this is the same mutex. So the bug is that inside drivers/gpu/drm/drm_crtc.c:1856 drm_mode_getconnector() we first obtain the mutex, then we call into the i915 driver intel_dp_destroy_mst_connector where we try to obtain it again and so we deadlock ourselves. ~~~ 1831 int drm_mode_getconnector(struct drm_device *dev, void *data, 1832 struct drm_file *file_priv) 1833 { ... 1853 1854 DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id); 1855 1856 mutex_lock(&dev->mode_config.mutex); ... 1891 if (connector->encoder) 1892 out_resp->encoder_id = connector->encoder->base.id; 1893--> else 1894 out_resp->encoder_id = 0; ~~~ this should be fixed by the stable backport from upstream with the patches 6b8eeca65b18ae77e175cc2b6571731f0ee413bf drm/dp/mst: close deadlock in connector destruction. 4772ff03df8094fd99d28de5fcf5df3a3e9c68bb drm/dp/mst: Remove port after removing connector. (In reply to Dave Airlie from comment #9) > this should be fixed by the stable backport from upstream with the patches > > 6b8eeca65b18ae77e175cc2b6571731f0ee413bf drm/dp/mst: close deadlock in > connector destruction. > 4772ff03df8094fd99d28de5fcf5df3a3e9c68bb drm/dp/mst: Remove port after > removing connector. Thanks! When you backport it and/or get a build for test let me know. It reliably fails for me so should be an easy test. Actually those are clean cherry-picks so I can do a quick build / test on 3.10.0-267.el7 + those two patches. Tried the patches, and still doesn't work. But I've got another vmcore and it looks slightly different. Don't have much analysis yet but will try to work on it next week. so, from the backtrace splats in #c15, it still looks like there is quite a lot different in the -314 kernel vs what you are running? Would it be possible to try again w/ the -314 kernel (which has the two patches that David mentioned in #c9 plus a bunch of other stable backport fixes)? (In reply to Rob Clark from comment #16) > so, from the backtrace splats in #c15, it still looks like there is quite a > lot different in the -314 kernel vs what you are running? Would it be > possible to try again w/ the -314 kernel (which has the two patches that > David mentioned in #c9 plus a bunch of other stable backport fixes)? Can you point at any specific patches in -314 which you feel will fix the deadlock shown in my last vmcore? Testing -314 kernel is not a simple "install the -314 kernel" test as there is a dependency on dracut apparently. I'm not sure I want to risk updating that and bricking my main system unless there's some evidence which points at specific patches fixing this problem. ab236e1ef798e2d5105207134cc3e3263feabb04 [drm] dp-mst: Remove debug WARN_ON will at least clean up the warn's you see, can you point me where the backtrace for the deadlock is that isn't in the WARN_ON? the warn on is just noise from debugging, and can happen depending on the mst receiver. I've sent a few patches upstream to fix some stuff in this area, (In reply to Dave Airlie from comment #18) > ab236e1ef798e2d5105207134cc3e3263feabb04 [drm] dp-mst: Remove debug WARN_ON > > will at least clean up the warn's you see, > Ok, yeah I was not sure if the WARN was important to the deadlock or not. > can you point me where the backtrace for the deadlock is that isn't in the > WARN_ON? the warn on is just noise from debugging, and can happen depending > on the mst receiver. > I've not delved into the latest vmcore I took on 3.10.0-267.el7.bz1251331.1.x86_64 but will see what I can come up with. > I've sent a few patches upstream to fix some stuff in this area, Ok. Created attachment 1075607 [details] vmcore analysis of hang on 3.10.0-267.el7.bz1251331.1.x86_64 Comment on attachment 1075607 [details] vmcore analysis of hang on 3.10.0-267.el7.bz1251331.1.x86_64 In summary, there appears to be another deadlock between Xorg pid 2967, who is blocked the longest, and a kworker pid 290, which appears to be doing the work which Xorg is waiting to be cancelled. The problem is similar to the first in that the same ioctl is being done by Xorg, but he gets blocked a different way this time. Now Xorg is waiting here, but I'm not quite sure I fully understand where the corresponding queue_work was called from, or if indeed it should be waiting here. Assuming it is correct though, this process does appear to be tied to a kworker thread, which is the next oldest blocked task. ~~~ crash> bt 0xffff8804404b9700 PID: 2967 TASK: ffff8804404b9700 CPU: 2 COMMAND: "Xorg" #0 [ffff8800359f37c0] __schedule at ffffffff81624655 #1 [ffff8800359f3828] schedule at ffffffff81624cd9 #2 [ffff8800359f3838] schedule_timeout at ffffffff81622979 #3 [ffff8800359f38e0] wait_for_completion at ffffffff816250a6 #4 [ffff8800359f3940] flush_work at ffffffff810946dc #5 [ffff8800359f39b8] __cancel_work_timer at ffffffff81094828 #6 [ffff8800359f39f0] cancel_work_sync at ffffffff810948d0 #7 [ffff8800359f3a00] drm_dp_destroy_mst_branch_device at ffffffffa02d5b87 [drm_kms_helper] #8 [ffff8800359f3a30] drm_dp_mst_i2c_xfer at ffffffffa02d87c1 [drm_kms_helper] #9 [ffff8800359f3af0] __i2c_transfer at ffffffffa00551fc [i2c_core] #10 [ffff8800359f3b28] i2c_transfer at ffffffffa0055c2c [i2c_core] <----- obtains rt_mutex waited on by pid 290 #11 [ffff8800359f3b50] drm_do_probe_ddc_edid at ffffffffa00cec55 [drm] #12 [ffff8800359f3be8] drm_get_edid at ffffffffa00d03de [drm] #13 [ffff8800359f3c48] drm_dp_mst_detect_port at ffffffffa02d600a [drm_kms_helper] #14 [ffff8800359f3c78] intel_dp_mst_detect at ffffffffa0386b63 [i915] #15 [ffff8800359f3c88] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa02d2d40 [drm_kms_helper] #16 [ffff8800359f3ce0] drm_helper_probe_single_connector_modes at ffffffffa02d2f13 [drm_kms_helper] #17 [ffff8800359f3cf0] drm_mode_getconnector at ffffffffa00ca8a4 [drm] #18 [ffff8800359f3db8] drm_ioctl at ffffffffa00bbcf9 [drm] #19 [ffff8800359f3eb8] do_vfs_ioctl at ffffffff811e4d95 #20 [ffff8800359f3f30] sys_ioctl at ffffffff811e5011 #21 [ffff8800359f3f80] system_call_fastpath at ffffffff8162fd49 ~~~ The kworker involved is blocked here, and it's waiting on an rt_mutex which is held by Xorg (obtained i2c_transfer) ~~~ PID: 290 TASK: ffff88046749f300 CPU: 4 COMMAND: "kworker/4:1" #0 [ffff880463c6b978] __schedule at ffffffff81624655 #1 [ffff880463c6b9e0] schedule at ffffffff81624cd9 #2 [ffff880463c6b9f0] __rt_mutex_slowlock at ffffffff8162607b #3 [ffff880463c6ba40] rt_mutex_slowlock at ffffffff8162621e #4 [ffff880463c6baf0] rt_mutex_lock at ffffffff8162639c #5 [ffff880463c6bb08] i2c_lock_adapter at ffffffffa00555a5 [i2c_core] #6 [ffff880463c6bb18] i2c_transfer at ffffffffa0055c4d [i2c_core] #7 [ffff880463c6bb40] drm_do_probe_ddc_edid at ffffffffa00cec55 [drm] #8 [ffff880463c6bc30] drm_dp_add_port at ffffffffa02d85c9 [drm_kms_helper] #9 [ffff880463c6bd80] drm_dp_send_link_address at ffffffffa02d80fc [drm_kms_helper] #10 [ffff880463c6bde0] drm_dp_check_and_send_link_address at ffffffffa02d86dd [drm_kms_helper] #11 [ffff880463c6be10] drm_dp_mst_link_probe_work at ffffffffa02d870c [drm_kms_helper] #12 [ffff880463c6be20] process_one_work at ffffffff8109352b #13 [ffff880463c6be68] worker_thread at ffffffff810942fb #14 [ffff880463c6bec8] kthread at ffffffff8109b7bf #15 [ffff880463c6bf50] ret_from_fork at ffffffff8162fc98 ~~~ So it seems like the Xorg is waiting for the completion of the work function, which is being executed by kworker. But kworker can't proceed because he is waiting for the rt_mutex held by Xorg. I can't say I fully understand this though. This is all I know at this point. This may be enough to identify the problem and/or patches which may fix this. Or it may require another pass at the vmcore. If I take another pass at the vmcore, I would probably study where the queue_work was called for the Xorg function which is blocked, and make sure that was fully understood / valid. Full gory details of the vmcore analysis is attached. What do you think of https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c21 and the analysis so far? Does any of this sound familiar to you? Hi Dave, this looks good, https://patchwork.kernel.org/patch/7192371/ I think this patch will stop this from occuring, as it stops X.org from using the device until things are setup further. Dave. Created attachment 1076065 [details] backport of https://patchwork.kernel.org/patch/7192371/ to 3.10.0-317.el7 Comment on attachment 1076065 [details] backport of https://patchwork.kernel.org/patch/7192371/ to 3.10.0-317.el7 Simpleton backport of https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c24 to 3.10.0-317.el7 though this did not backport cleanly. Do we need other patches? Not sure about radeon_dp_register_mst_connector and intel_dp_register_mst_connector. Original patch has only drm_modeset_lock_all @@ -462,11 +462,17 @@ static struct drm_connector *intel_dp_add_mst_connector(struct drm_dp_mst_topolo drm_object_attach_property(&connector->base, dev->mode_config.tile_property, 0); drm_mode_connector_set_path_property(connector, pathprop); + return connector; +} + +static void intel_dp_register_mst_connector(struct drm_connector *connector) +{ + struct intel_connector *intel_connector = to_intel_connector(connector); + struct drm_device *dev = connector->dev; drm_modeset_lock_all(dev); intel_connector_add_to_fbdev(intel_connector); drm_modeset_unlock_all(dev); drm_connector_register(&intel_connector->base); - return connector; } RHEL7 has drm_reinit_primary_mode_group and mutex_lock(&dev->mode_config.mutex) / mutex_unlock +static void intel_dp_register_mst_connector(struct drm_connector *connector) +{ + struct intel_connector *intel_connector = to_intel_connector(connector); + struct drm_device *dev = connector->dev; + drm_reinit_primary_mode_group(dev); mutex_lock(&dev->mode_config.mutex); intel_connector_add_to_fbdev(intel_connector); mutex_unlock(&dev->mode_config.mutex); drm_connector_register(&intel_connector->base); - return connector; } Hi, I believe I've hit this bug as well, but with a different setup. I have the same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian testing instead. The display(s) I am trying to use is a EVGA DP MST hub with 2 Dell U2412Mb displays attached to it. I tried out the above patch, but I didn't use the back ported patch. Instead, I just went ahead and checked out drm-fixes-mst directly from git://people.freedesktop.org/~airlied/linux and compiled from there. This actually did stop the freezing for me, and nothing suspicious showed up in dmesg. However none of the displays showed up even after scanning or connecting them individually at the hub. Previously they would occasionally show up in xrandr if the machine hadn't frozen, but they weren't usable in any case. Sorry in advance if this is just noise due to the other variables (distribution, Xorg version, hardware) :) But if not, I'm happy to help in testing anything. Thanks Dave A - I will try to test all the patches you posted and verify. (In reply to Ian Maxon from comment #28) > Hi, > > I believe I've hit this bug as well, but with a different setup. I have the > same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian > testing instead. The display(s) I am trying to use is a EVGA DP MST hub with > 2 Dell U2412Mb displays attached to it. > > I tried out the above patch, but I didn't use the back ported patch. > Instead, I just went ahead and checked out drm-fixes-mst directly from > git://people.freedesktop.org/~airlied/linux and compiled from there. This > actually did stop the freezing for me, and nothing suspicious showed up in > dmesg. However none of the displays showed up even after scanning or > connecting them individually at the hub. Previously they would occasionally > show up in xrandr if the machine hadn't frozen, but they weren't usable in > any case. > Thanks for the info. I just had the same high-level outcome as well with the latest patches. Basically, no more freezes, but the monitor is not recognized - it's like it's not connected. Here's what I did to build my kernel(s). 1. Start with 3.10.0-267.el7, add all drivers/drm patches up to 3.10.0-321.el7. There was 56 patches total. This kernel was built as 3.10.0-267.el7.bz1251331.2.x86_64 2. As a second test, add on the 6 patches just posted on Oct 1. This kernel was built as 3.10.0-267.el7.bz1251331.3.x86_64 Both kernels seemed to behave the same - no more freezes like on the earlier kernels, but no external display either (the laptop panel works fine). I did not update anything else such as systemd or other X packages. I'm not sure if this is required or not. > Sorry in advance if this is just noise due to the other variables > (distribution, Xorg version, hardware) :) But if not, I'm happy to help in > testing anything. Not at all. Your setup may be different but it looks like we may be getting the same result though perhaps until we go deeper into diagnostics / logs we won't know for sure. It could look the same at the high level but could be totally different. Did your setup ever work? I do know my setup still works on a 7.1 install plus 7.0 kernel 3.10.0-123.el7, which is why this is marked as a Regression. If no more patches come for this bug before 7.2, I may try a clean install of latest 7.2 nightly to some other drive or partition as a final definitive test as to whether this is fixed or not. Another approach could be to bisect based on the original PASS / FAIL: https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c3 (In reply to Dave Wysochanski from comment #30) > (In reply to Ian Maxon from comment #28) > > Hi, > > > > I believe I've hit this bug as well, but with a different setup. I have the > > same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian > > testing instead. The display(s) I am trying to use is a EVGA DP MST hub with > > 2 Dell U2412Mb displays attached to it. > > > > I tried out the above patch, but I didn't use the back ported patch. > > Instead, I just went ahead and checked out drm-fixes-mst directly from > > git://people.freedesktop.org/~airlied/linux and compiled from there. This > > actually did stop the freezing for me, and nothing suspicious showed up in > > dmesg. However none of the displays showed up even after scanning or > > connecting them individually at the hub. Previously they would occasionally > > show up in xrandr if the machine hadn't frozen, but they weren't usable in > > any case. > > > > Thanks for the info. I just had the same high-level outcome as well with > the latest patches. Basically, no more freezes, but the monitor is not > recognized - it's like it's not connected. > > Here's what I did to build my kernel(s). > > 1. Start with 3.10.0-267.el7, add all drivers/drm patches up to > 3.10.0-321.el7. There was 56 patches total. > This kernel was built as 3.10.0-267.el7.bz1251331.2.x86_64 > > 2. As a second test, add on the 6 patches just posted on Oct 1. > This kernel was built as 3.10.0-267.el7.bz1251331.3.x86_64 > > Both kernels seemed to behave the same - no more freezes like on the earlier > kernels, but no external display either (the laptop panel works fine). I > did not update anything else such as systemd or other X packages. I'm not > sure if this is required or not. > > > Sorry in advance if this is just noise due to the other variables > > (distribution, Xorg version, hardware) :) But if not, I'm happy to help in > > testing anything. > > Not at all. Your setup may be different but it looks like we may be getting > the same result though perhaps until we go deeper into diagnostics / logs we > won't know for sure. It could look the same at the high level but could be > totally different. > > Did your setup ever work? I do know my setup still works on a 7.1 install > plus 7.0 kernel 3.10.0-123.el7, which is why this is marked as a Regression. > > If no more patches come for this bug before 7.2, I may try a clean install > of latest 7.2 nightly to some other drive or partition as a final definitive > test as to whether this is fixed or not. > > Another approach could be to bisect based on the original PASS / FAIL: > https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c3 So I have a Dell MST display here.. not sure the exact model # but it fails in the same way. airlied's patches do clean up the various problems that are triggered by this failure mode, but ofc not the root issue of not getting a picture on the display. The problem is that the display gets into a mode where it sends DP_AUX_NATIVE_REPLY_DEFER for every aux ch msg we send it, and eventually after 32 tries we give up. I've tried boosting that limit (which we've had to raise in the past), but it does not seem to do any good. I'm not entirely sure what windows does differently (if, presumably it works there). But this seems to be a bug in the monitor. The question, I think, is how to work around it in the kernel. I'm not sure bisecting from 7.0 kernel will do much good. Honestly I'm not even sure if we had MST support in the 7.0 kernel, so it may have simply been working via some fallback path. Maybe Dave A. remembers better? Any progress here? Still considering for 7.2? (In reply to Tomas Pelka from comment #33) > Any progress here? Still considering for 7.2? Well, the issue w/ the Dell monitor is unlikely to be fixed for 7.2.. but maybe a new bug report should be filed for that issue, since the original issue mentioned in this bz has been solved and patches POSTed. Patch(es) available on kernel-3.10.0-323.el7 (In reply to Rob Clark from comment #34) > (In reply to Tomas Pelka from comment #33) > > Any progress here? Still considering for 7.2? > > Well, the issue w/ the Dell monitor is unlikely to be fixed for 7.2.. but > maybe a new bug report should be filed for that issue, since the original > issue mentioned in this bz has been solved and patches POSTed. FWIW, I did build a DVD of the latest nightly and booted to it, but got the same result - monitor was as though it was not there. I did not install but my guess is it won't matter based on previous tests. For now I'll continue booting to the RHEL7.0 GA kernel if I need the external monitor via the display port cable. I guess I will file a new regression bug for the 7.2 GA kernel when it comes out and I've updated to that. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2152.html This deadlock was not fixed in RHEL7.2. The patch which fixes the deadlock is Rob's patch "drm/dp: move hw_mutex up the call stack" which was posted on the new bug: https://bugzilla.redhat.com/show_bug.cgi?id=1274157#c23 FWIW, I'll attach my backport to RHEL7.1 kernel 3.10.0-229.22.1.el7 to this bug. Created attachment 1104378 [details]
Rob Clark's patch patch which fixes this bug - backported from upstream posting
*** Bug 1313120 has been marked as a duplicate of this bug. *** |