Bug 1292481 - device mapper hung tasks on an openshift/docker system

Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assigned To: Mike Snitzer
QA Contact: Bruno Goncalves
Keywords: ZStream
Duplicates: 1267966
Blocks: 1296566
Reported: 2015-12-17 09:48 EST by Jeremy Eder
Modified: 2016-11-03 10:44 EDT
CC: 26 users
Fixed In Version: kernel-3.10.0-337.el7
Doc Type: Bug Fix
Clones: 1296566
Last Closed: 2016-11-03 10:44:51 EDT
Type: Bug
Attachments: None
Description Jeremy Eder 2015-12-17 09:48:43 EST
Description of problem:

Several recent reports of hung tasks when running many containers (several dozen, up to 100). This appears to be a new issue; we have been running at that scale for a long time.

http://post-office.corp.redhat.com/archives/aos-devel/2015-December/msg00571.html

The relevant components are:

RHEL7.2 kernel-3.10.0-327
docker-1.8.2-10
thin-lvm via docker-storage-setup
Comment 1 Jeremy Eder 2015-12-17 09:51:12 EST
Here is the last attempted container start before the issue occurred:

Dec 15 07:17:09 ose3-master.example.com systemd[1]: Started docker container 74850fb6e89c5e021eab904829d8539973389cac3683c79bde5e3f34793b1bfc.
Dec 15 07:17:09 ose3-master.example.com systemd[1]: Starting docker container 74850fb6e89c5e021eab904829d8539973389cac3683c79bde5e3f34793b1bfc.
Dec 15 07:17:09 ose3-master.example.com kernel: XFS (dm-77): Unmounting Filesystem
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev devpts, type devpts), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com docker[7714]: time="2015-12-15T07:17:09.413642640-05:00" level=error msg="Handler for GET /containers/{name:.*}/json returned error: Unknown device 713377cde7341cb5536ec15b9ada9a21bc41ad8e4a18d75b500f63f63c74c4c9"
Dec 15 07:17:09 ose3-master.example.com docker[7714]: time="2015-12-15T07:17:09.413678122-05:00" level=error msg="HTTP Error" err="Unknown device 713377cde7341cb5536ec15b9ada9a21bc41ad8e4a18d75b500f63f63c74c4c9" statusCode=500
Dec 15 07:17:09 ose3-master.example.com kernel: XFS (dm-80): Unmounting Filesystem

We don't see the "Unknown device" error above on any normal/healthy system.
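When docker reports "Unknown device <id>", one cross-check is whether device-mapper still has a thin device for that container id (docker's devicemapper backend names devices `docker-<major>:<minor>-<pool inode>-<container id>`). A hedged sketch: the captured listing below is invented for illustration (reusing ids from the log above); on the affected host you would pipe `dmsetup ls` directly instead of reading a file.

```shell
# Illustrative snapshot of `dmsetup ls` output (file and contents are hypothetical,
# modeled on the container ids seen in the journal above).
cat > /tmp/dmsetup-ls.txt <<'EOF'
docker-253:1-1835222-74850fb6e89c5e021eab904829d8539973389cac3683c79bde5e3f34793b1bfc (253:80)
docker-253:1-1835222-pool (253:2)
EOF

# The id docker complained about with "Unknown device":
id=713377cde7341cb5536ec15b9ada9a21bc41ad8e4a18d75b500f63f63c74c4c9

# If docker's metadata references the id but dm has no such device,
# docker's view and kernel device-mapper state have diverged.
if grep -q "$id" /tmp/dmsetup-ls.txt; then
    echo "dm still has a device for $id"
else
    echo "no dm device for $id -- docker metadata and dm state disagree"
fi
```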

A few minutes later, we see the following tracebacks in dmesg:

[ 1922.389502] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 1922.396371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1922.404206] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 1922.411293] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 1922.416963]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 1922.424387]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 1922.431809]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 1922.439230] Call Trace:
[ 1922.441676]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 1922.448120]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 1922.454290]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 1922.459423]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 1922.465595]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 1922.471509]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 1922.478203]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 1922.484806]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 1922.491321]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 1922.497143]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 1922.502703]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 1922.508528]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 1922.513439]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 1922.520018]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 1922.525427]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 1922.532040] INFO: task docker:77805 blocked for more than 120 seconds.
[ 1922.538610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1922.546456] docker          D 0000000000000000     0 77805      1 0x00000080
[ 1922.553635]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 1922.561164]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 1922.568636]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 1922.576138] Call Trace:
[ 1922.578608]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 1922.583563]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 1922.589386]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 1922.594774]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 1922.600858]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 1922.606430]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 1922.612459]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 1922.617728]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 1922.623479]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 1922.629157]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 1922.635346]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 1922.641269]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 1922.647560]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 1922.653743]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 1922.660025]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 1922.666055]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 1922.671720]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1922.677827]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 1922.683333]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 1922.688909]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 1922.693950]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[The same pair of tracebacks (kworker/u66:0 blocked in do_deferred_remove -> __dm_destroy, and docker:77805 blocked in dm_suspend -> synchronize_srcu) repeats verbatim at roughly 2042s, 2163s, 2283s, and 2403s; the duplicate dumps are omitted here.]
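A quick way to triage watchdog output like the above is to count how often each task is reported; the same two tasks recurring every ~120 seconds suggests a deadlock rather than transiently slow I/O. A minimal sketch (the sample file and path are ours, seeded with lines from this report; on the affected host you would feed the real dmesg capture):

```shell
# Hypothetical saved copy of the watchdog lines shown above.
cat > /tmp/dmesg-sample.txt <<'EOF'
[ 1922.389502] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 1922.532040] INFO: task docker:77805 blocked for more than 120 seconds.
[ 2042.704387] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 2042.846690] INFO: task docker:77805 blocked for more than 120 seconds.
EOF

# Pull out the task name following "INFO: task" and count repeats per task.
awk '{for (i = 1; i < NF; i++) if ($i == "task" && $(i-1) == "INFO:") print $(i+1)}' \
    /tmp/dmesg-sample.txt | sort | uniq -c | sort -rn
```

On the sample above this shows both kworker/u66:0:81 and docker:77805 reported twice, matching the repeating pattern in the full log.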
Comment 2 Mike Snitzer 2015-12-17 12:01:55 EST
We need this fix (already upstream): http://git.kernel.org/linus/2a708cff93f1
Comment 5 Mike Snitzer 2015-12-18 14:38:18 EST
*** Bug 1267966 has been marked as a duplicate of this bug. ***
Comment 10 Jeremy Eder 2015-12-22 06:53:55 EST
Test kernel seems to have resolved our deadlock issue.  Cluster is up and stable with the same container count that we had problems with last week (500 on 4 nodes).

3.10.0-327.4.4.el7.bz1292481.x86_64


** PM:  We would like to nominate this for 7.2.z
Comment 12 Jeremy Eder 2016-01-04 08:32:03 EST
Can this be evaluated for 7.2.z please?
Comment 17 Rafael Aquini 2016-01-08 09:12:36 EST
Patch(es) available on kernel-3.10.0-337.el7
Comment 25 Gerard Ryan 2016-02-18 06:34:38 EST
Hi, is there a known date when we can expect this to be pushed to stable repos?
Comment 26 Jeremy Eder 2016-02-18 06:57:19 EST
Hi Gerard -- we shipped this fix on Tuesday Feb 16.

Do a yum update and make sure you get kernel-3.10.0-327.10.1.el7 or later.

There was a different BZ tracking the RHEL7.2.z kernel release:
https://bugzilla.redhat.com/show_bug.cgi?id=1296566

This BZ is to track the fix into RHEL7.3.
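The "kernel-3.10.0-327.10.1.el7 or later" check can be scripted. A minimal sketch, with caveats: the helper name is ours, it only understands RHEL 7's `3.10.0-<build>` kernel strings, and a real deployment should use rpm's proper version comparison rather than this ad-hoc parsing. It treats build 327.10.x+ (the 7.2.z fix, per comment 26) and build 337+ (the 7.3 fix, per comment 17) as patched.

```shell
# Hypothetical helper: does this kernel release string carry the fix?
kernel_has_fix() {
    rel="${1#3.10.0-}"                       # e.g. "327.10.1.el7.x86_64" or "327.el7.x86_64"
    b1=$(echo "$rel" | cut -d. -f1)          # main build number, e.g. 327
    b2=$(echo "$rel" | cut -d. -f2)          # first z-stream digit; "el7" on GA kernels
    case "$b2" in *[!0-9]*|"") b2=0 ;; esac  # non-numeric means no z-stream suffix
    # Fixed in 337.el7 (7.3) and in the 327.10.1.el7 z-stream build.
    [ "$b1" -ge 337 ] || { [ "$b1" -eq 327 ] && [ "$b2" -ge 10 ]; }
}

# Example: check the kernel this shell is running on.
if kernel_has_fix "$(uname -r)"; then
    echo "running kernel includes the fix"
else
    echo "running kernel predates the fix -- run: yum update kernel"
fi
```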
Comment 30 Bruno Goncalves 2016-07-04 04:17:48 EDT
Tested kernel-3.10.0-445.el7; it passed our regression tests and the associated tickets are closed.

Setting SanityOnly, as we are not able to reproduce the problem in our environment.
Comment 32 errata-xmlrpc 2016-11-03 10:44:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html
