Bug 1251331

Summary:

Lenovo W541 Xorg freezes when mini display port cable is plugged in - 3.10.0-267.el7 WARNING: at drivers/gpu/drm/drm_dp_mst_topology.c:1272 process_single_tx_qlock+0x4b6/0x540 [drm_kms_helper]()

Product:

Red Hat Enterprise Linux 7

Reporter:

Dave Wysochanski <dwysocha>

Component:

kernel

Assignee:

Dave Airlie <airlied>

kernel sub component:

Graphics

QA Contact:

Desktop QE <desktop-qa-list>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

airlied, dwysocha, ian, lmiksik, rclark, tpelka, zlynx

Version:

7.1

Keywords:

OtherQA, Regression

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

kernel-3.10.0-323.el7

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1274157 (view as bug list)

Environment:

Last Closed:

2015-11-19 23:04:55 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1274157

Attachments:

Description	Flags
vmcore analysis of hang on 3.10.0-267.el7.bz1251331.1.x86_64	none
backport of https://patchwork.kernel.org/patch/7192371/ to 3.10.0-317.el7	none
Rob Clark's patch patch which fixes this bug - backported from upstream posting	none

Description Dave Wysochanski 2015-08-07 02:07:05 UTC

Description of problem:
Lenovo W541 freezes when mini display port cable is plugged in, which is connected to Dell 2815Q.
I get the following abrt oops

WARNING: at drivers/gpu/drm/drm_dp_mst_topology.c:1272 process_single_tx_qlock+0x4b6/0x540 [drm_kms_helper]()
fail
Modules linked in: i915 crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper ghash_clmulni_intel ahci aesni_intel drm libahci glue_helper e1000e lrw libata gf128mul ablk_helper cryptd ptp pps_core i2c_core video dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 188 Comm: kworker/0:3 Not tainted 3.10.0-267.el7.x86_64 #1
Hardware name: LENOVO 20EGS0R600/20EGS0R600, BIOS GNET71WW (2.19 ) 02/05/2015
Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
 ffff8804638bf8c8 000000002a9533a0 ffff8804638bf880 ffffffff8161eb7e
 ffff8804638bf8b8 ffffffff81071540 000000000000000c 0000000000001000
 000000000000000c 0000000000000000 ffff880463d35618 ffff8804638bf920
Call Trace:
 [<ffffffff8161eb7e>] dump_stack+0x19/0x1b
 [<ffffffff81071540>] warn_slowpath_common+0x70/0xb0
 [<ffffffff810715dc>] warn_slowpath_fmt+0x5c/0x80
 [<ffffffffa019f206>] process_single_tx_qlock+0x4b6/0x540 [drm_kms_helper]
 [<ffffffff810a4be8>] ? __wake_up_common+0x58/0x90
 [<ffffffffa019f2e0>] process_single_down_tx_qlock+0x50/0x110 [drm_kms_helper]
 [<ffffffffa019f3ee>] drm_dp_queue_down_tx+0x4e/0x60 [drm_kms_helper]
 [<ffffffffa01a2728>] drm_dp_mst_i2c_xfer+0x198/0x220 [drm_kms_helper]
 [<ffffffff81270150>] ? TSS_checkhmac2.constprop.6+0x2a0/0x320
 [<ffffffff8127d910>] ? inode_doinit_with_dentry+0x210/0x680
 [<ffffffff811b4655>] ? kmem_cache_alloc+0x35/0x1d0
 [<ffffffffa00561fc>] __i2c_transfer+0x5c/0x70 [i2c_core]
 [<ffffffffa0056c2c>] i2c_transfer+0x5c/0xc0 [i2c_core]
 [<ffffffffa0127c95>] drm_do_probe_ddc_edid+0xe5/0x160 [drm]
 [<ffffffffa012941e>] drm_get_edid+0x3e/0x410 [drm]
 [<ffffffffa01a2449>] drm_dp_add_port+0x399/0x420 [drm_kms_helper]
 [<ffffffff81080038>] ? ptrace_traceme+0x68/0x80
 [<ffffffff81621d3d>] ? schedule_timeout+0x17d/0x2d0
 [<ffffffff81082470>] ? internal_add_timer+0x70/0x70
 [<ffffffff8109c506>] ? finish_wait+0x56/0x70
 [<ffffffff81622312>] ? mutex_lock+0x12/0x2f
 [<ffffffffa019e4ce>] ? drm_dp_mst_wait_tx_reply+0xfe/0x1f0 [drm_kms_helper]
 [<ffffffff8109c680>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffa01a1f7c>] drm_dp_send_link_address+0xdc/0x210 [drm_kms_helper]
 [<ffffffff810135c6>] ? __switch_to+0x136/0x4a0
 [<ffffffffa01a255d>] drm_dp_check_and_send_link_address+0x8d/0xa0 [drm_kms_helper]
 [<ffffffffa01a258c>] drm_dp_mst_link_probe_work+0x1c/0x20 [drm_kms_helper]
 [<ffffffff810933fb>] process_one_work+0x17b/0x470
 [<ffffffff810941cb>] worker_thread+0x11b/0x400
 [<ffffffff810940b0>] ? rescuer_thread+0x400/0x400
 [<ffffffff8109b68f>] kthread+0xcf/0xe0
 [<ffffffff8109b5c0>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff8162f018>] ret_from_fork+0x58/0x90
 [<ffffffff8109b5c0>] ? kthread_create_on_node+0x140/0x140                                                                                                                 


Version-Release number of selected component (if applicable):
3.10.0-267.el7

How reproducible:
Every time

Steps to Reproduce:
1. Boot RHEL7.1 laptop with Red Hat CSB (corporate standard build) installed (Red Hat Enterprise Linux Workstation release 7.1 +  3.10.0-267.el7.x86_64)
2. Once logged in and gnome desktop loaded, plug in mini display port cable to laptop - cable connected to Dell 28" Monitor | P2815Q

Actual results:
Laptop freezes and is unusable - have to power cycle via the power button

Expected results:
Laptop doesn't hang

Additional info:

This may be the same problem as this Fedora22 bug reported:
https://bugzilla.redhat.com/show_bug.cgi?id=1226345

This is easily reproducible for me on the W541

I have a T530 running RHEL7.0 and kernel 3.10.0-123.el7 and it works just fine with the identical monitor and cable and in the same method above (boot laptop, then after logged in, plug in cable).

This may be a regression in RHEL7.1 even though I'm seeing it on 3.10.0-267.el7.  I've not tested RHEL7.1 or RHEL7.0 kernels yet but I can probably tomorrow.
I notice the code which contains the WARN was added with a large display port patchset for this RHEL7.1 bz: https://bugzilla.redhat.com/show_bug.cgi?id=1140440
It was patch 7/21 which added the code which is being executed here when the WARN pops.

I don't see anything from 3.10.0-267.el7 to the latest RHEL7 build which touches the files in that area.

FWIW, here is the code the WARN actually hits.  We're actually inside drm_dp_send_sideband_msg and it looks like if we turned on debug statements we would see 'failed to dpcd write ..." whatever that means!

drivers/gpu/drm/drm_dp_mst_topology.c
1249 static int drm_dp_send_sideband_msg(struct drm_dp_mst_topology_mgr *mgr,
1250 				    bool up, u8 *msg, int len)
1251 {
1252 	int ret;
1253 	int regbase = up ? DP_SIDEBAND_MSG_UP_REP_BASE : DP_SIDEBAND_MSG_DOWN_REQ_BASE;
1254 	int tosend, total, offset;
1255 	int retries = 0;
1256 
1257 retry:
1258 	total = len;
1259 	offset = 0;
1260 	do {
1261 		tosend = min3(mgr->max_dpcd_transaction_bytes, 16, total);
1262 
1263 		ret = drm_dp_dpcd_write(mgr->aux, regbase + offset,
1264 					&msg[offset],
1265 					tosend);
1266 		if (ret != tosend) {
1267 			if (ret == -EIO && retries < 5) {
1268 				retries++;
1269 				goto retry;
1270 			}
1271 			DRM_DEBUG_KMS("failed to dpcd write %d %d\n", tosend, ret);
1272-->			WARN(1, "fail\n");
1273 
1274 			return -EIO;
1275 		}
1276 		offset += tosend;
1277 		total -= tosend;
1278 	} while (total > 0);
1279 	return 0;
1280 }

Comment 2 Dave Wysochanski 2015-08-07 13:51:51 UTC

Hey Dave A - can you take a look at this oops and let me know if you want me to try any further experiments on other kernels or enable debug, etc?  I'm not familiar with this area of the kernel so need some pointers.

I'd like to get this working since it's my home setup and the new laptop seems to lockup every time I use my home monitor.

Thanks.

Comment 3 Dave Wysochanski 2015-08-09 04:21:45 UTC

Based on my tests, and that fact that the warning I'm getting is inside code which was checked in between the 123 and the 229 kernel, this is a regression from the RHEL7.0 GA kernel 3.10.0-123.el7
      Test A    Test B      
.123  PASS      PASS
.229  FAIL      FAIL
.267  FAIL      FAIL

Laptop used for all tests is the Lenovo W541
Test A: With laptop powered down, plug in mini display port cable, then power machine on
Test B: With laptop powered on, log in to gnome desktop.  Pull up system settings / displays.  Then plug in mini display port cable.

Haven't got a vmcore yet or further debug, but I'll work on that next.
Wanted to confirm if a prior kernel worked since I can use that as a workaround in the interim.

Comment 5 Dave Wysochanski 2015-08-09 05:08:33 UTC

Ok got a vmcore via sysrq-c after plugging in the cable (Test B above).

It looks to me like Xorg can deadlock on itself while trying to obtain a mutex it already has.
Interestingly, the automated analyzer actually went into its own loop as a result of this!
In any case, the Xorg pid 1774 deserves a deeper dive as the bug must be somewhere in here.

PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1



The kworker pid is the oldest blocked pid 406, but only slightly.  This pid is waiting on a mutex held by Xorg pid 1774.
But Xorg pid 1774 too is waiting on a mutex.  However, the mutex it's waiting on is the one which it owns.



      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-229.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/843210390/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Sun Aug  9 00:37:30 2015
      UPTIME: 00:03:54
LOAD AVERAGE: 1.86, 0.91, 0.36
       TASKS: 428
...
     RELEASE: 3.10.0-229.el7.x86_64
     VERSION: #1 SMP Thu Jan 29 18:37:38 EST 2015
     MACHINE: x86_64  (2793 Mhz)
      MEMORY: 15.6 GB
       PANIC: "SysRq : Trigger a crash"
 sys output <end>
Checking log
- Did not find DMI string
Looking for tainted kernel with 'sys -t'
crash>  sys -t
 
TAINTED_MASK: 200  W
crash> 
Looking for tainted modules with: 'mod -t'
Found 'no tainted modules' - NOTE: If this is a kernel crash it almost certain a kernel bug!!!
Detecting panic type from 'sys' output
- Found panic string: PANIC: "SysRq : Trigger a crash"
- Found sysrq triggered crash dump
Process triggering the sysrq crash dump was
crash>  files -R /proc/sysrq-trigger
 
crash> 
Checking log output for RIP symbol
rip_string = [  234.525438] RIP: 0010:[<ffffffff81397486>]  [<ffffffff81397486>] sysrq_handle_crash+0x16/0x20
rip_symbol = sysrq_handle_crash
...
END: Basic vmcore analysis
BEGIN: Basic hung vmcore analysis
Analyzing LOAD AVERAGE from 'sys'
- Found load avgs: 1 (1 min)  0 (5 min)  0 (15 min)
- Found 8 tasks in 'RU' state
- Found 2 tasks in 'UN' state
Showing oldest blocked pid (NOTE: some processes such as gablogd filtered out)
Oldest blocked pid: 406
Here are the oldest 2 blocked (D-state) processes and how long they've been blocked
Blocked processes <begin> 
[0 00:02:19.801] [UN]  PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
[0 00:02:19.614] [UN]  PID: 406    TASK: ffff880463ebf1c0  CPU: 3   COMMAND: "kworker/u16:5"
crash>  Blocked processes <end>
Backtrace of oldest blocked task or task triggering hung_task panic
backtrace <begin> 
PID: 406    TASK: ffff880463ebf1c0  CPU: 3   COMMAND: "kworker/u16:5"
 #0 [ffff88046236bbe0] __schedule at ffffffff81608cfd
 #1 [ffff88046236bc48] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff88046236bc58] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff88046236bcb0] mutex_lock at ffffffff816075ff
 #4 [ffff88046236bcc8] drm_fb_helper_hotplug_event at ffffffffa00ea9fa [drm_kms_helper]
 #5 [ffff88046236bcf8] intel_fbdev_output_poll_changed at ffffffffa026199e [i915]
 #6 [ffff88046236bd08] drm_kms_helper_hotplug_event at ffffffffa00e25b7 [drm_kms_helper]
 #7 [ffff88046236bd20] intel_dp_check_mst_status at ffffffffa026f931 [i915]
 #8 [ffff88046236bd90] intel_dp_hpd_pulse at ffffffffa02703e0 [i915]
 #9 [ffff88046236bdd0] i915_digport_work_func at ffffffffa02292ed [i915]
#10 [ffff88046236be20] process_one_work at ffffffff8108f0ab
#11 [ffff88046236be68] worker_thread at ffffffff8108fe8b
#12 [ffff88046236bec8] kthread at ffffffff8109726f
#13 [ffff88046236bf50] ret_from_fork at ffffffff81613cfc
crash>  backtrace <end>
Checking for specific hung task signatures
- Found pid 406 waiting on mutex.  Next steps: check holder of mutex
Attempting to pull mutex off stack
- get_value_from_stack: pid == 406, frame = mutex_lock, pushes = 6
- INFO: mutex == 0xffff880463cb8a08
- INFO: mutex_owner == 0xffff8804674f38e0
- INFO: mutex.wait_list == 0xffff880463cb8a10
The mutex wait_list has the following number of waters:
crash>  list -H 0xffff880463cb8a10 | wc -l
 
3
crash> 
You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' 
- BEGIN backtrace
crash>  bt 0xffff8804674f38e0
 
PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1
#20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f87b345f257  RSP: 00007fff5ca45ec0  RFLAGS: 00013206
    RAX: 0000000000000010  RBX: ffffffff81613da9  RCX: 0000000000000001
    RDX: 00007fff5ca45f70  RSI: 00000000c05064a7  RDI: 0000000000000009
    RBP: 00007fff5ca45f70   R8: 000000000000ffff   R9: 0000000000000003
    R10: fffffffffffff5e3  R11: 0000000000003246  R12: 0000000000000000
    R13: 00007f87b5de4040  R14: 0000000000000009  R15: 00000000c05064a7
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash> 
- END backtrace
Now analyzing mutex_owner backtrace
- Found pid 0xffff8804674f38e0 waiting on mutex.  Next steps: check holder of mutex
Attempting to pull mutex off stack
- get_value_from_stack: pid == 0xffff8804674f38e0, frame = mutex_lock, pushes = 6
- INFO: mutex == 0xffff880463cb8a08
- INFO: mutex_owner == 0xffff8804674f38e0
- INFO: mutex.wait_list == 0xffff880463cb8a10
The mutex wait_list has the following number of waters:
crash>  list -H 0xffff880463cb8a10 | wc -l
 
3
crash> 
You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' 
- BEGIN backtrace
crash>  bt 0xffff8804674f38e0
 
PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1
#20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f87b345f257  RSP: 00007fff5ca45ec0  RFLAGS: 00013206
    RAX: 0000000000000010  RBX: ffffffff81613da9  RCX: 0000000000000001
    RDX: 00007fff5ca45f70  RSI: 00000000c05064a7  RDI: 0000000000000009
    RBP: 00007fff5ca45f70   R8: 000000000000ffff   R9: 0000000000000003
    R10: fffffffffffff5e3  R11: 0000000000003246  R12: 0000000000000000
    R13: 00007f87b5de4040  R14: 0000000000000009  R15: 00000000c05064a7
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash> 
- END backtrace
Now analyzing mutex_owner backtrace
- Found pid 0xffff8804674f38e0 waiting on mutex.  Next steps: check holder of mutex
Attempting to pull mutex off stack
- get_value_from_stack: pid == 0xffff8804674f38e0, frame = mutex_lock, pushes = 6
- INFO: mutex == 0xffff880463cb8a08
- INFO: mutex_owner == 0xffff8804674f38e0
- INFO: mutex.wait_list == 0xffff880463cb8a10
The mutex wait_list has the following number of waters:
crash>  list -H 0xffff880463cb8a10 | wc -l
 
3
crash> 
You may want to list the backtraces of the waiters with: list -H 0xffff880463cb8a10 -o mutex_waiter.list -s mutex_waiter.task | awk '/task =/ { print "bt "$3 }' 
- BEGIN backtrace
crash>  bt 0xffff8804674f38e0
 
PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1
#20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f87b345f257  RSP: 00007fff5ca45ec0  RFLAGS: 00013206
    RAX: 0000000000000010  RBX: ffffffff81613da9  RCX: 0000000000000001
    RDX: 00007fff5ca45f70  RSI: 00000000c05064a7  RDI: 0000000000000009
    RBP: 00007fff5ca45f70   R8: 000000000000ffff   R9: 0000000000000003
    R10: fffffffffffff5e3  R11: 0000000000003246  R12: 0000000000000000
....

Comment 7 Dave Wysochanski 2015-08-09 12:19:59 UTC

Proof that the Xorg process owns the mutex it is trying to obtain.

crash> set 1774
    PID: 1774
COMMAND: "Xorg"
   TASK: ffff8804674f38e0  [THREAD_INFO: ffff8804516fc000]
    CPU: 6
  STATE: TASK_UNINTERRUPTIBLE
crash> bt -f | grep "mutex_lock at"
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
crash> px (0xffff8804516ff998^C
crash> bt
PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1
#20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f87b345f257  RSP: 00007fff5ca45ec0  RFLAGS: 00013206
    RAX: 0000000000000010  RBX: ffffffff81613da9  RCX: 0000000000000001
    RDX: 00007fff5ca45f70  RSI: 00000000c05064a7  RDI: 0000000000000009
    RBP: 00007fff5ca45f70   R8: 000000000000ffff   R9: 0000000000000003
    R10: fffffffffffff5e3  R11: 0000000000003246  R12: 0000000000000000
    R13: 00007f87b5de4040  R14: 0000000000000009  R15: 00000000c05064a7
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash> dis -r ffffffff816075ff | tail
0xffffffff816075e5 <mutex_lock+5>:      push   %rbp
0xffffffff816075e6 <mutex_lock+6>:      mov    %rsp,%rbp
0xffffffff816075e9 <mutex_lock+9>:      push   %rbx
0xffffffff816075ea <mutex_lock+10>:     mov    %rdi,%rbx
0xffffffff816075ed <mutex_lock+13>:     callq  0xffffffff816095f0 <_cond_resched>
0xffffffff816075f2 <mutex_lock+18>:     mov    %rbx,%rdi  <---- rdx == rdi == struct mutex *
0xffffffff816075f5 <mutex_lock+21>:     lock decl (%rdi)
0xffffffff816075f8 <mutex_lock+24>:     jns    0xffffffff816075ff <mutex_lock+31>
0xffffffff816075fa <mutex_lock+26>:     callq  0xffffffff816080d0 <__mutex_lock_slowpath>
0xffffffff816075ff <mutex_lock+31>:     mov    %gs:0xb7c0,%rax
crash> dis __mutex_lock_slowpath | head
0xffffffff816080d0 <__mutex_lock_slowpath>:     nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff816080d5 <__mutex_lock_slowpath+5>:   push   %rbp
0xffffffff816080d6 <__mutex_lock_slowpath+6>:   mov    %rsp,%rbp
0xffffffff816080d9 <__mutex_lock_slowpath+9>:   push   %r15
0xffffffff816080db <__mutex_lock_slowpath+11>:  push   %r14
0xffffffff816080dd <__mutex_lock_slowpath+13>:  push   %r13
0xffffffff816080df <__mutex_lock_slowpath+15>:  mov    %gs:0xb7c0,%r13
0xffffffff816080e8 <__mutex_lock_slowpath+24>:  push   %r12
0xffffffff816080ea <__mutex_lock_slowpath+26>:  push   %rbx  <---- 6th push
0xffffffff816080eb <__mutex_lock_slowpath+27>:  mov    %rdi,%rbx
crash> bt -f | grep "mutex_lock at"
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
crash> px (0xffff8804516ff998 - 6*8)
$1 = 0xffff8804516ff968
crash> rd 0xffff8804516ff968
ffff8804516ff968:  ffff880463cb8a08                    ...c....
crash> mutex.owner ffff880463cb8a08
  owner = 0xffff8804674f38e0
crash> bt 0xffff8804674f38e0
PID: 1774   TASK: ffff8804674f38e0  CPU: 6   COMMAND: "Xorg"
 #0 [ffff8804516ff8c0] __schedule at ffffffff81608cfd
 #1 [ffff8804516ff928] schedule_preempt_disabled at ffffffff8160a449
 #2 [ffff8804516ff938] __mutex_lock_slowpath at ffffffff81608195
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]
 #5 [ffff8804516ff9e0] drm_dp_destroy_port at ffffffffa00e52bd [drm_kms_helper]
 #6 [ffff8804516ffa00] drm_dp_destroy_mst_branch_device at ffffffffa00e535e [drm_kms_helper]
 #7 [ffff8804516ffa30] drm_dp_mst_i2c_xfer at ffffffffa00e7e41 [drm_kms_helper]
 #8 [ffff8804516ffaf0] __i2c_transfer at ffffffffa00381fc [i2c_core]
 #9 [ffff8804516ffb28] i2c_transfer at ffffffffa0038c2c [i2c_core]
#10 [ffff8804516ffb50] drm_do_probe_ddc_edid at ffffffffa007d295 [drm]
#11 [ffff8804516ffbe8] drm_get_edid at ffffffffa007d3a4 [drm]
#12 [ffff8804516ffc50] drm_dp_mst_get_edid at ffffffffa00e5682 [drm_kms_helper]
#13 [ffff8804516ffc78] intel_dp_mst_get_modes at ffffffffa02717a9 [i915]
#14 [ffff8804516ffc98] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa00e2ad6 [drm_kms_helper]
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
#18 [ffff8804516ffeb8] do_vfs_ioctl at ffffffff811d9a75
#19 [ffff8804516fff30] sys_ioctl at ffffffff811d9cf1
#20 [ffff8804516fff80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f87b345f257  RSP: 00007fff5ca45ec0  RFLAGS: 00013206
    RAX: 0000000000000010  RBX: ffffffff81613da9  RCX: 0000000000000001
    RDX: 00007fff5ca45f70  RSI: 00000000c05064a7  RDI: 0000000000000009
    RBP: 00007fff5ca45f70   R8: 000000000000ffff   R9: 0000000000000003
    R10: fffffffffffff5e3  R11: 0000000000003246  R12: 0000000000000000
    R13: 00007f87b5de4040  R14: 0000000000000009  R15: 00000000c05064a7
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

Comment 8 Dave Wysochanski 2015-08-12 01:58:52 UTC

Ok so the bug appears to be this.

The dev->mode_config.mutex is first obtained inside what appears to be generic code in drm_mode_getconnector, before calling fill_modes == drm_helper_probe_single_connector_modes.
~~~
drivers/gpu/drm/drm_crtc.c
1831 int drm_mode_getconnector(struct drm_device *dev, void *data,
1832 			  struct drm_file *file_priv)
1833 {
...
1853 
1854 	DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id);
1855 
1856 	mutex_lock(&dev->mode_config.mutex);
...
1872 	if (out_resp->count_modes == 0) {
1873-->		connector->funcs->fill_modes(connector,
1874 					     dev->mode_config.max_width,
1875 					     dev->mode_config.max_height);
~~~

Ultimately later we call down into the i915 driver and get inside drivers/gpu/drm/i915/intel_dp_mst.c line 427 function intel_dp_destroy_mst_connector where the driver tries to obtain the same mutex a second time, and deadlocks on itself.
~~~
drivers/gpu/drm/i915/intel_dp_mst.c
420 static void intel_dp_destroy_mst_connector(struct drm_dp_mst_topology_mgr *mgr,
421 					   struct drm_connector *connector)
422 {
423 	struct intel_connector *intel_connector = to_intel_connector(connector);
424 	struct drm_device *dev = connector->dev;
425 	struct drm_i915_private *dev_priv = dev->dev_private;
426 	/* need to nuke the connector */
427-->	mutex_lock(&dev->mode_config.mutex);
428 	intel_connector_dpms(connector, DRM_MODE_DPMS_OFF);
~~~

Next steps is to look for some obvious patch though it may be nearly impossible depending on how much the code has changed upstream vs RHEL7.1


Analysis details
================


Which mutex has Xorg pid 1774 tried to obtain twice?
~~~
 #3 [ffff8804516ff998] mutex_lock at ffffffff816075ff
 #4 [ffff8804516ff9b0] intel_dp_destroy_mst_connector at ffffffffa0271da9 [i915]

crash> dis -lr ffffffffa0271da9 | tail
0xffffffffa0271d93 <intel_dp_destroy_mst_connector+19>: mov    %rsi,%rbx
/usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 427
0xffffffffa0271d96 <intel_dp_destroy_mst_connector+22>: lea    0x208(%r13),%r12
/usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 425
0xffffffffa0271d9d <intel_dp_destroy_mst_connector+29>: mov    0x28(%r13),%r14
/usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 427
0xffffffffa0271da1 <intel_dp_destroy_mst_connector+33>: mov    %r12,%rdi
0xffffffffa0271da4 <intel_dp_destroy_mst_connector+36>: callq  0xffffffff816075e0 <mutex_lock>
/usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/i915/intel_dp_mst.c: 428
0xffffffffa0271da9 <intel_dp_destroy_mst_connector+41>: mov    $0x3,%esi

drivers/gpu/drm/i915/intel_dp_mst.c
420 static void intel_dp_destroy_mst_connector(struct drm_dp_mst_topology_mgr *mgr,
421 					   struct drm_connector *connector)
422 {
423 	struct intel_connector *intel_connector = to_intel_connector(connector);
424 	struct drm_device *dev = connector->dev;
425 	struct drm_i915_private *dev_priv = dev->dev_private;
426 	/* need to nuke the connector */
427-->	mutex_lock(&dev->mode_config.mutex);
428 	intel_connector_dpms(connector, DRM_MODE_DPMS_OFF);
429 	mutex_unlock(&dev->mode_config.mutex);
430 
431 	intel_connector->unregister(intel_connector);
432 
433 	mutex_lock(&dev->mode_config.mutex);
434 	drm_fb_helper_remove_one_connector(&dev_priv->fbdev->helper, connector);
435 	drm_connector_cleanup(connector);
436 	mutex_unlock(&dev->mode_config.mutex);
~~~

Looks like `drm_device.mode_config.mutex`

That was simple, but where did we obtain it first?  Where else is it locked?
~~~
$ grep -n mode_config.mutex drivers/gpu/drm/i915/*.[ch] | grep mutex_lock
drivers/gpu/drm/i915/i915_debugfs.c:1680:	int ret = mutex_lock_interruptible(&dev->mode_config.mutex);
drivers/gpu/drm/i915/i915_debugfs.c:1726:	ret = mutex_lock_interruptible(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp.c:4411:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp_mst.c:413:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp_mst.c:427:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp_mst.c:433:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_lvds.c:1003:	mutex_lock(&dev->mode_config.mutex);
~~~

We can rule out /intel_dp_mst.c:427 and 433 since we see them above.

Examine the following ones
~~~
drivers/gpu/drm/i915/i915_debugfs.c:1680:	int ret = mutex_lock_interruptible(&dev->mode_config.mutex);
drivers/gpu/drm/i915/i915_debugfs.c:1726:	ret = mutex_lock_interruptible(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp.c:4411:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_dp_mst.c:413:	mutex_lock(&dev->mode_config.mutex);
drivers/gpu/drm/i915/intel_lvds.c:1003:	mutex_lock(&dev->mode_config.mutex);
~~~

There's a handful there so let's go back to the backtrace and see if we can spot something that fits
~~~
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]

crash> dis -lr  ffffffffa00788ff | tail
0xffffffffa00788d5 <drm_mode_getconnector+981>: cmp    $0xc,%rdx
0xffffffffa00788d9 <drm_mode_getconnector+985>: jne    0xffffffffa00788b2 <drm_mode_getconnector+946>
0xffffffffa00788db <drm_mode_getconnector+987>: jmpq   0xffffffffa007884e <drm_mode_getconnector+846>
/usr/src/debug/kernel-3.10.0-229.el7/linux-3.10.0-229.el7.x86_64/drivers/gpu/drm/drm_crtc.c: 1873
0xffffffffa00788e0 <drm_mode_getconnector+992>: mov    0xa0(%r13),%rax
0xffffffffa00788e7 <drm_mode_getconnector+999>: mov    0x39c(%r15),%edx
0xffffffffa00788ee <drm_mode_getconnector+1006>:        mov    0x398(%r15),%esi
0xffffffffa00788f5 <drm_mode_getconnector+1013>:        mov    -0xa8(%rbp),%rdi
0xffffffffa00788fc <drm_mode_getconnector+1020>:        callq  *0x28(%rax)
0xffffffffa00788ff <drm_mode_getconnector+1023>:        jmpq   0xffffffffa00785fe <drm_mode_getconnector+254>

drivers/gpu/drm/drm_crtc.c
1831 int drm_mode_getconnector(struct drm_device *dev, void *data,
1832 			  struct drm_file *file_priv)
1833 {
...
1853 
1854 	DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id);
1855 
1856 	mutex_lock(&dev->mode_config.mutex);
1857 
1858 	connector = drm_connector_find(dev, out_resp->connector_id);
1859 	if (!connector) {
1860 		ret = -ENOENT;
1861 		goto out;
1862 	}
1863 
...
1871 
1872 	if (out_resp->count_modes == 0) {
1873-->		connector->funcs->fill_modes(connector,
1874 					     dev->mode_config.max_width,
1875 					     dev->mode_config.max_height);
1876 	}
~~~

Looks like we may have gotten lucky.  Assuming 'dev->mode_config.mutex' is the same on line 1856 of drm_mode_getconnector, then I'd say this is where this Xorg process first obtained it.

Note that the next function matches fill_modes even though it is i915 specific
Now it does seem like i915 calls drm_helper_probe_single_connector_modes
~~~
#15 [ffff8804516ffce0] drm_helper_probe_single_connector_modes at ffffffffa00e2be3 [drm_kms_helper]
#16 [ffff8804516ffcf0] drm_mode_getconnector at ffffffffa00788ff [drm]

n gpu/drm/i915/intel_crt.c                    768 .fill_modes = drm_helper_probe_single_connector_modes,
o gpu/drm/i915/intel_dp.c                    3898 .fill_modes = drm_helper_probe_single_connector_modes,
p gpu/drm/i915/intel_dp_mst.c                 321 .fill_modes = drm_helper_probe_single_connector_modes,
q gpu/drm/i915/intel_dsi.c                    658 .fill_modes = drm_helper_probe_single_connector_modes,
r gpu/drm/i915/intel_dvo.c                    387 .fill_modes = drm_helper_probe_single_connector_modes,
s gpu/drm/i915/intel_hdmi.c                  1403 .fill_modes = drm_helper_probe_single_connector_modes,
t gpu/drm/i915/intel_lvds.c                   526 .fill_modes = drm_helper_probe_single_connector_modes,
u gpu/drm/i915/intel_sdvo.c                  2235 .fill_modes = drm_helper_probe_single_connector_modes,
v gpu/drm/i915/intel_tv.c                    1516 .fill_modes = drm_helper_probe_single_connector_modes,
~~~

So let's check the mutex here.
Let's get the first arg 'dev' from the stack here and see if the mutex address matches up.
~~~
crash> dis -r ffffffffa0069b6c | tail
0xffffffffa0069b42 <drm_ioctl+450>:     js     0xffffffffa0069e10 <drm_ioctl+1168>
0xffffffffa0069b48 <drm_ioctl+456>:     testb  $0x10,0x4(%r10)
0xffffffffa0069b4d <drm_ioctl+461>:     mov    %r11,-0xe0(%rbp)
0xffffffffa0069b54 <drm_ioctl+468>:     je     0xffffffffa0069c80 <drm_ioctl+768>
0xffffffffa0069b5a <drm_ioctl+474>:     mov    %r10,-0xd8(%rbp)
0xffffffffa0069b61 <drm_ioctl+481>:     mov    %r15,%rdx
0xffffffffa0069b64 <drm_ioctl+484>:     mov    %rbx,%rsi
0xffffffffa0069b67 <drm_ioctl+487>:     mov    %r14,%rdi
0xffffffffa0069b6a <drm_ioctl+490>:     callq  *%rcx
0xffffffffa0069b6c <drm_ioctl+492>:     mov    -0xd8(%rbp),%r10
crash> dis drm_mode_getconnector | head
0xffffffffa0078500 <drm_mode_getconnector>:     nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa0078505 <drm_mode_getconnector+5>:   push   %rbp
0xffffffffa0078506 <drm_mode_getconnector+6>:   mov    %rsp,%rbp
0xffffffffa0078509 <drm_mode_getconnector+9>:   push   %r15
0xffffffffa007850b <drm_mode_getconnector+11>:  mov    %rdi,%r15
0xffffffffa007850e <drm_mode_getconnector+14>:  push   %r14  <--- 3rd push

crash> bt -f | grep drm_ioctl
#17 [ffff8804516ffdb8] drm_ioctl at ffffffffa0069b6c [drm]
crash> px (0xffff8804516ffdb8 - 3*8)
$1 = 0xffff8804516ffda0
crash> rd 0xffff8804516ffda0
ffff8804516ffda0:  ffff880463cb8800                    ...c....
crash> p &((struct drm_device *)0xffff880463cb8800)->mode_config.mutex
$2 = (struct mutex *) 0xffff880463cb8a08

~~~

Recall our previous mutex address we obtained:
~~~
crash> mutex.owner ffff880463cb8a08
~~~

Indeed this is the same mutex.

So the bug is that inside drivers/gpu/drm/drm_crtc.c:1856   drm_mode_getconnector() we first obtain the mutex, then we call into the i915 driver intel_dp_destroy_mst_connector  where we try to obtain it again and so we deadlock ourselves.
~~~
1831 int drm_mode_getconnector(struct drm_device *dev, void *data,
1832 			  struct drm_file *file_priv)
1833 {
...
1853 
1854 	DRM_DEBUG_KMS("[CONNECTOR:%d:?]\n", out_resp->connector_id);
1855 
1856 	mutex_lock(&dev->mode_config.mutex);
...
1891 	if (connector->encoder)
1892 		out_resp->encoder_id = connector->encoder->base.id;
1893-->	else
1894 		out_resp->encoder_id = 0;

~~~

Comment 9 Dave Airlie 2015-08-25 02:00:00 UTC

this should be fixed by the stable backport from upstream with the patches

6b8eeca65b18ae77e175cc2b6571731f0ee413bf  drm/dp/mst: close deadlock in connector destruction.
4772ff03df8094fd99d28de5fcf5df3a3e9c68bb drm/dp/mst: Remove port after removing connector.

Comment 10 Dave Wysochanski 2015-08-27 20:10:55 UTC

(In reply to Dave Airlie from comment #9)
> this should be fixed by the stable backport from upstream with the patches
> 
> 6b8eeca65b18ae77e175cc2b6571731f0ee413bf  drm/dp/mst: close deadlock in
> connector destruction.
> 4772ff03df8094fd99d28de5fcf5df3a3e9c68bb drm/dp/mst: Remove port after
> removing connector.

Thanks!  When you backport it and/or get a build for test let me know.  It reliably fails for me so should be an easy test.

Comment 11 Dave Wysochanski 2015-08-27 20:31:11 UTC

Actually those are clean cherry-picks so I can do a quick build / test on 3.10.0-267.el7 + those two patches.

Comment 14 Dave Wysochanski 2015-08-28 22:20:53 UTC

Tried the patches, and still doesn't work.  But I've got another vmcore and it looks slightly different.
Don't have much analysis yet but will try to work on it next week.

Comment 16 Rob Clark 2015-09-15 19:44:09 UTC

so, from the backtrace splats in #c15, it still looks like there is quite a lot different in the -314 kernel vs what you are running?  Would it be possible to try again w/ the -314 kernel (which has the two patches that David mentioned in #c9 plus a bunch of other stable backport fixes)?

Comment 17 Dave Wysochanski 2015-09-16 23:46:34 UTC

(In reply to Rob Clark from comment #16)
> so, from the backtrace splats in #c15, it still looks like there is quite a
> lot different in the -314 kernel vs what you are running?  Would it be
> possible to try again w/ the -314 kernel (which has the two patches that
> David mentioned in #c9 plus a bunch of other stable backport fixes)?

Can you point at any specific patches in -314 which you feel will fix the deadlock shown in my last vmcore?

Testing -314 kernel is not a simple "install the -314 kernel" test as there is a dependency on dracut apparently.  I'm not sure I want to risk updating that and bricking my main system unless there's some evidence which points at specific patches fixing this problem.

Comment 18 Dave Airlie 2015-09-17 00:04:44 UTC

ab236e1ef798e2d5105207134cc3e3263feabb04 [drm] dp-mst: Remove debug WARN_ON

will at least clean up the warn's you see,

can you point me where the backtrace for the deadlock is that isn't in the WARN_ON? the warn on is just noise from debugging, and can happen depending on the mst receiver.

I've sent a few patches upstream to fix some stuff in this area,

Comment 19 Dave Wysochanski 2015-09-21 18:07:42 UTC

(In reply to Dave Airlie from comment #18)
> ab236e1ef798e2d5105207134cc3e3263feabb04 [drm] dp-mst: Remove debug WARN_ON
> 
> will at least clean up the warn's you see,
> 
Ok, yeah I was not sure if the WARN was important to the deadlock or not.

> can you point me where the backtrace for the deadlock is that isn't in the
> WARN_ON? the warn on is just noise from debugging, and can happen depending
> on the mst receiver.
> 
I've not delved into the latest vmcore I took on 3.10.0-267.el7.bz1251331.1.x86_64 but will see what I can come up with.

> I've sent a few patches upstream to fix some stuff in this area,
Ok.

Comment 20 Dave Wysochanski 2015-09-21 21:26:09 UTC

Created attachment 1075607 [details]
vmcore analysis of hang on 3.10.0-267.el7.bz1251331.1.x86_64

Comment 21 Dave Wysochanski 2015-09-21 21:27:16 UTC

Comment on attachment 1075607 [details]
vmcore analysis of hang on 3.10.0-267.el7.bz1251331.1.x86_64

In summary, there appears to be another deadlock between Xorg pid 2967, who is blocked the longest, and a kworker pid 290, which appears to be doing the work which Xorg is waiting to be cancelled.  The problem is similar to the first in that the same ioctl is being done by Xorg, but he gets blocked a different way this time.

Now Xorg is waiting here, but I'm not quite sure I fully understand where the corresponding queue_work was called from, or if indeed it should be waiting here.  Assuming it is correct though, this process does appear to be tied to a kworker thread, which is the next oldest blocked task.
~~~
crash> bt 0xffff8804404b9700
PID: 2967   TASK: ffff8804404b9700  CPU: 2   COMMAND: "Xorg"
 #0 [ffff8800359f37c0] __schedule at ffffffff81624655
 #1 [ffff8800359f3828] schedule at ffffffff81624cd9
 #2 [ffff8800359f3838] schedule_timeout at ffffffff81622979
 #3 [ffff8800359f38e0] wait_for_completion at ffffffff816250a6
 #4 [ffff8800359f3940] flush_work at ffffffff810946dc
 #5 [ffff8800359f39b8] __cancel_work_timer at ffffffff81094828
 #6 [ffff8800359f39f0] cancel_work_sync at ffffffff810948d0
 #7 [ffff8800359f3a00] drm_dp_destroy_mst_branch_device at ffffffffa02d5b87 [drm_kms_helper]
 #8 [ffff8800359f3a30] drm_dp_mst_i2c_xfer at ffffffffa02d87c1 [drm_kms_helper]
 #9 [ffff8800359f3af0] __i2c_transfer at ffffffffa00551fc [i2c_core]
#10 [ffff8800359f3b28] i2c_transfer at ffffffffa0055c2c [i2c_core]  <----- obtains rt_mutex waited on by pid 290
#11 [ffff8800359f3b50] drm_do_probe_ddc_edid at ffffffffa00cec55 [drm]
#12 [ffff8800359f3be8] drm_get_edid at ffffffffa00d03de [drm]
#13 [ffff8800359f3c48] drm_dp_mst_detect_port at ffffffffa02d600a [drm_kms_helper]
#14 [ffff8800359f3c78] intel_dp_mst_detect at ffffffffa0386b63 [i915]
#15 [ffff8800359f3c88] drm_helper_probe_single_connector_modes_merge_bits at ffffffffa02d2d40 [drm_kms_helper]
#16 [ffff8800359f3ce0] drm_helper_probe_single_connector_modes at ffffffffa02d2f13 [drm_kms_helper]
#17 [ffff8800359f3cf0] drm_mode_getconnector at ffffffffa00ca8a4 [drm]
#18 [ffff8800359f3db8] drm_ioctl at ffffffffa00bbcf9 [drm]
#19 [ffff8800359f3eb8] do_vfs_ioctl at ffffffff811e4d95
#20 [ffff8800359f3f30] sys_ioctl at ffffffff811e5011
#21 [ffff8800359f3f80] system_call_fastpath at ffffffff8162fd49
~~~

The kworker involved is blocked here, and it's waiting on an rt_mutex which is held by Xorg (obtained i2c_transfer)
~~~
PID: 290    TASK: ffff88046749f300  CPU: 4   COMMAND: "kworker/4:1"
 #0 [ffff880463c6b978] __schedule at ffffffff81624655
 #1 [ffff880463c6b9e0] schedule at ffffffff81624cd9
 #2 [ffff880463c6b9f0] __rt_mutex_slowlock at ffffffff8162607b
 #3 [ffff880463c6ba40] rt_mutex_slowlock at ffffffff8162621e
 #4 [ffff880463c6baf0] rt_mutex_lock at ffffffff8162639c
 #5 [ffff880463c6bb08] i2c_lock_adapter at ffffffffa00555a5 [i2c_core]
 #6 [ffff880463c6bb18] i2c_transfer at ffffffffa0055c4d [i2c_core]
 #7 [ffff880463c6bb40] drm_do_probe_ddc_edid at ffffffffa00cec55 [drm]
 #8 [ffff880463c6bc30] drm_dp_add_port at ffffffffa02d85c9 [drm_kms_helper]
 #9 [ffff880463c6bd80] drm_dp_send_link_address at ffffffffa02d80fc [drm_kms_helper]
#10 [ffff880463c6bde0] drm_dp_check_and_send_link_address at ffffffffa02d86dd [drm_kms_helper]
#11 [ffff880463c6be10] drm_dp_mst_link_probe_work at ffffffffa02d870c [drm_kms_helper]
#12 [ffff880463c6be20] process_one_work at ffffffff8109352b
#13 [ffff880463c6be68] worker_thread at ffffffff810942fb
#14 [ffff880463c6bec8] kthread at ffffffff8109b7bf
#15 [ffff880463c6bf50] ret_from_fork at ffffffff8162fc98
~~~


So it seems like the Xorg is waiting for the completion of the work function, which is being executed by kworker.  But kworker can't proceed because he is waiting for the rt_mutex held by Xorg.  I can't say I fully understand this though.

This is all I know at this point.  This may be enough to identify the problem and/or patches which may fix this.  Or it may require another pass at the vmcore.

If I take another pass at the vmcore, I would probably study where the queue_work was called for the Xorg function which is blocked, and make sure that was fully understood / valid.

Full gory details of the vmcore analysis is attached.

Comment 22 Dave Wysochanski 2015-09-21 21:29:21 UTC

What do you think of https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c21 and the analysis so far?  Does any of this sound familiar to you?

Comment 23 Dave Airlie 2015-09-21 22:00:48 UTC

Hi Dave,

this looks good,

https://patchwork.kernel.org/patch/7192371/

I think this patch will stop this from occuring, as it stops X.org from using the device until things are setup further.

Dave.

Comment 24 Dave Wysochanski 2015-09-23 01:56:26 UTC

Created attachment 1076065 [details]
backport of https://patchwork.kernel.org/patch/7192371/ to 3.10.0-317.el7

Comment 25 Dave Wysochanski 2015-09-23 02:02:06 UTC

Comment on attachment 1076065 [details]
backport of https://patchwork.kernel.org/patch/7192371/ to 3.10.0-317.el7

Simpleton backport of https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c24 to 3.10.0-317.el7 though this did not backport cleanly.  Do we need other patches?

Not sure about radeon_dp_register_mst_connector and intel_dp_register_mst_connector.

Original patch has only drm_modeset_lock_all
@@ -462,11 +462,17 @@  static struct drm_connector *intel_dp_add_mst_connector(struct drm_dp_mst_topolo
 	drm_object_attach_property(&connector->base, dev->mode_config.tile_property, 0);
 
 	drm_mode_connector_set_path_property(connector, pathprop);
+	return connector;
+}
+
+static void intel_dp_register_mst_connector(struct drm_connector *connector)
+{
+	struct intel_connector *intel_connector = to_intel_connector(connector);
+	struct drm_device *dev = connector->dev;
 	drm_modeset_lock_all(dev);
 	intel_connector_add_to_fbdev(intel_connector);
 	drm_modeset_unlock_all(dev);
 	drm_connector_register(&intel_connector->base);
-	return connector;
 }
 


RHEL7 has drm_reinit_primary_mode_group and mutex_lock(&dev->mode_config.mutex) / mutex_unlock
+static void intel_dp_register_mst_connector(struct drm_connector *connector)
+{
+       struct intel_connector *intel_connector = to_intel_connector(connector);
+       struct drm_device *dev = connector->dev;
+
        drm_reinit_primary_mode_group(dev);
        mutex_lock(&dev->mode_config.mutex);
        intel_connector_add_to_fbdev(intel_connector);
        mutex_unlock(&dev->mode_config.mutex);
        drm_connector_register(&intel_connector->base);
-       return connector;
 }

Comment 28 Ian Maxon 2015-09-30 18:25:01 UTC

Hi,

I believe I've hit this bug as well, but with a different setup. I have the same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian testing instead. The display(s) I am trying to use is a EVGA DP MST hub with 2 Dell U2412Mb displays attached to it. 

I tried out the above patch, but I didn't use the back ported patch. Instead, I just went ahead and checked out drm-fixes-mst directly from git://people.freedesktop.org/~airlied/linux and compiled from there. This actually did stop the freezing for me, and nothing suspicious showed up in dmesg. However none of the displays showed up even after scanning or connecting them individually at the hub. Previously they would occasionally show up in xrandr if the machine hadn't frozen, but they weren't usable in any case. 

Sorry in advance if this is just noise due to the other variables (distribution, Xorg version, hardware) :) But if not, I'm happy to help in testing anything.

Comment 29 Dave Wysochanski 2015-10-02 22:23:57 UTC

Thanks Dave A - I will try to test all the patches you posted and verify.

Comment 30 Dave Wysochanski 2015-10-03 15:15:55 UTC

(In reply to Ian Maxon from comment #28)
> Hi,
> 
> I believe I've hit this bug as well, but with a different setup. I have the
> same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian
> testing instead. The display(s) I am trying to use is a EVGA DP MST hub with
> 2 Dell U2412Mb displays attached to it. 
> 
> I tried out the above patch, but I didn't use the back ported patch.
> Instead, I just went ahead and checked out drm-fixes-mst directly from
> git://people.freedesktop.org/~airlied/linux and compiled from there. This
> actually did stop the freezing for me, and nothing suspicious showed up in
> dmesg. However none of the displays showed up even after scanning or
> connecting them individually at the hub. Previously they would occasionally
> show up in xrandr if the machine hadn't frozen, but they weren't usable in
> any case. 
> 

Thanks for the info.  I just had the same high-level outcome as well with the latest patches.  Basically, no more freezes, but the monitor is not recognized - it's like it's not connected.

Here's what I did to build my kernel(s).

1. Start with 3.10.0-267.el7, add all drivers/drm patches up to 3.10.0-321.el7.  There was 56 patches total.
This kernel was built as 3.10.0-267.el7.bz1251331.2.x86_64

2. As a second test, add on the 6 patches just posted on Oct 1.
This kernel was built as 3.10.0-267.el7.bz1251331.3.x86_64

Both kernels seemed to behave the same - no more freezes like on the earlier kernels, but no external display either (the laptop panel works fine).  I did not update anything else such as systemd or other X packages.  I'm not sure if this is required or not.

> Sorry in advance if this is just noise due to the other variables
> (distribution, Xorg version, hardware) :) But if not, I'm happy to help in
> testing anything.

Not at all.  Your setup may be different but it looks like we may be getting the same result though perhaps until we go deeper into diagnostics / logs we won't know for sure.  It could look the same at the high level but could be totally different.

Did your setup ever work?  I do know my setup still works on a 7.1 install plus 7.0 kernel 3.10.0-123.el7, which is why this is marked as a Regression.

If no more patches come for this bug before 7.2, I may try a clean install of latest 7.2 nightly to some other drive or partition as a final definitive test as to whether this is fixed or not.

Another approach could be to bisect based on the original PASS / FAIL:
https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c3

Comment 32 Rob Clark 2015-10-05 22:15:00 UTC

(In reply to Dave Wysochanski from comment #30)
> (In reply to Ian Maxon from comment #28)
> > Hi,
> > 
> > I believe I've hit this bug as well, but with a different setup. I have the
> > same laptop (W541), however I am not using a Dell 2815Q, and I am on Debian
> > testing instead. The display(s) I am trying to use is a EVGA DP MST hub with
> > 2 Dell U2412Mb displays attached to it. 
> > 
> > I tried out the above patch, but I didn't use the back ported patch.
> > Instead, I just went ahead and checked out drm-fixes-mst directly from
> > git://people.freedesktop.org/~airlied/linux and compiled from there. This
> > actually did stop the freezing for me, and nothing suspicious showed up in
> > dmesg. However none of the displays showed up even after scanning or
> > connecting them individually at the hub. Previously they would occasionally
> > show up in xrandr if the machine hadn't frozen, but they weren't usable in
> > any case. 
> > 
> 
> Thanks for the info.  I just had the same high-level outcome as well with
> the latest patches.  Basically, no more freezes, but the monitor is not
> recognized - it's like it's not connected.
> 
> Here's what I did to build my kernel(s).
> 
> 1. Start with 3.10.0-267.el7, add all drivers/drm patches up to
> 3.10.0-321.el7.  There was 56 patches total.
> This kernel was built as 3.10.0-267.el7.bz1251331.2.x86_64
> 
> 2. As a second test, add on the 6 patches just posted on Oct 1.
> This kernel was built as 3.10.0-267.el7.bz1251331.3.x86_64
> 
> Both kernels seemed to behave the same - no more freezes like on the earlier
> kernels, but no external display either (the laptop panel works fine).  I
> did not update anything else such as systemd or other X packages.  I'm not
> sure if this is required or not.
> 
> > Sorry in advance if this is just noise due to the other variables
> > (distribution, Xorg version, hardware) :) But if not, I'm happy to help in
> > testing anything.
> 
> Not at all.  Your setup may be different but it looks like we may be getting
> the same result though perhaps until we go deeper into diagnostics / logs we
> won't know for sure.  It could look the same at the high level but could be
> totally different.
> 
> Did your setup ever work?  I do know my setup still works on a 7.1 install
> plus 7.0 kernel 3.10.0-123.el7, which is why this is marked as a Regression.
> 
> If no more patches come for this bug before 7.2, I may try a clean install
> of latest 7.2 nightly to some other drive or partition as a final definitive
> test as to whether this is fixed or not.
> 
> Another approach could be to bisect based on the original PASS / FAIL:
> https://bugzilla.redhat.com/show_bug.cgi?id=1251331#c3

So I have a Dell MST display here.. not sure the exact model # but it fails in the same way.

airlied's patches do clean up the various problems that are triggered by this failure mode, but ofc not the root issue of not getting a picture on the display.  The problem is that the display gets into a mode where it sends DP_AUX_NATIVE_REPLY_DEFER for every aux ch msg we send it, and eventually after 32 tries we give up.  I've tried boosting that limit (which we've had to raise in the past), but it does not seem to do any good.

I'm not entirely sure what windows does differently (if, presumably it works there).  But this seems to be a bug in the monitor.  The question, I think, is how to work around it in the kernel.

I'm not sure bisecting from 7.0 kernel will do much good.  Honestly I'm not even sure if we had MST support in the 7.0 kernel, so it may have simply been working via some fallback path.  Maybe Dave A. remembers better?

Comment 33 Tomas Pelka 2015-10-07 07:14:10 UTC

Any progress here? Still considering for 7.2?

Comment 34 Rob Clark 2015-10-07 13:38:21 UTC

(In reply to Tomas Pelka from comment #33)
> Any progress here? Still considering for 7.2?

Well, the issue w/ the Dell monitor is unlikely to be fixed for 7.2.. but maybe a new bug report should be filed for that issue, since the original issue mentioned in this bz has been solved and patches POSTed.

Comment 35 Rafael Aquini 2015-10-13 12:14:56 UTC

Patch(es) available on kernel-3.10.0-323.el7

Comment 39 Dave Wysochanski 2015-10-21 20:51:20 UTC

(In reply to Rob Clark from comment #34)
> (In reply to Tomas Pelka from comment #33)
> > Any progress here? Still considering for 7.2?
> 
> Well, the issue w/ the Dell monitor is unlikely to be fixed for 7.2.. but
> maybe a new bug report should be filed for that issue, since the original
> issue mentioned in this bz has been solved and patches POSTed.

FWIW, I did build a DVD of the latest nightly and booted to it, but got the same result - monitor was as though it was not there.  I did not install but my guess is it won't matter based on previous tests.

For now I'll continue booting to the RHEL7.0 GA kernel if I need the external monitor via the display port cable.

I guess I will file a new regression bug for the 7.2 GA kernel when it comes out and I've updated to that.

Comment 43 errata-xmlrpc 2015-11-19 23:04:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html

Comment 44 Dave Wysochanski 2015-12-10 14:57:31 UTC

This deadlock was not fixed in RHEL7.2.

The patch which fixes the deadlock is Rob's patch "drm/dp: move hw_mutex up the call stack" which was posted on the new bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1274157#c23

FWIW, I'll attach my backport to RHEL7.1 kernel 3.10.0-229.22.1.el7 to this bug.

Comment 45 Dave Wysochanski 2015-12-10 14:59:10 UTC

Created attachment 1104378 [details]
Rob Clark's patch patch which fixes this bug - backported from upstream posting

Comment 47 Lyude 2016-04-28 22:14:18 UTC

*** Bug 1313120 has been marked as a duplicate of this bug. ***