Bug 1292481

Summary: device mapper hung tasks on an openshift/docker system
Product: Red Hat Enterprise Linux 7 Reporter: Jeremy Eder <jeder>
Component: kernelAssignee: Mike Snitzer <msnitzer>
kernel sub component: Thin Provisioning QA Contact: Bruno Goncalves <bgoncalv>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: agk, akokshar, aquini, bdonahue, bgoncalv, dhoward, dshaw, dwalsh, ederevea, fedora, hklein, javier.ramirez, jeder, jmagrini, mpatocka, msnitzer, nmurray, pep, pm-rhel, sdodson, stalexan, thornber, tlavigne, uobergfe, vgoyal, xhe
Version: 7.3Keywords: ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-3.10.0-337.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1296566 (view as bug list) Environment:
Last Closed: 2016-11-03 14:44:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1296566    

Description Jeremy Eder 2015-12-17 14:48:43 UTC
Description of problem:

Several recent reports of hung tasks when running many containers (several dozen, up to 100).  This seems to be a new issue, as we've been doing that sort of scale for a long time.

http://post-office.corp.redhat.com/archives/aos-devel/2015-December/msg00571.html

The relevant components are:

RHEL7.2 kernel-3.10.0-327
docker-1.8.2-10
thin-lvm via docker-storage-setup

Comment 1 Jeremy Eder 2015-12-17 14:51:12 UTC
Here is the last container that was attempted to be started before the issue occurred:

Dec 15 07:17:09 ose3-master.example.com systemd[1]: Started docker container 74850fb6e89c5e021eab904829d8539973389cac3683c79bde5e3f34793b1bfc.
Dec 15 07:17:09 ose3-master.example.com systemd[1]: Starting docker container 74850fb6e89c5e021eab904829d8539973389cac3683c79bde5e3f34793b1bfc.
Dec 15 07:17:09 ose3-master.example.com kernel: XFS (dm-77): Unmounting Filesystem
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev devpts, type devpts), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses mountpoint labeling
Dec 15 07:17:09 ose3-master.example.com docker[7714]: time="2015-12-15T07:17:09.413642640-05:00" level=error msg="Handler for GET /containers/{name:.*}/json returned error: Unknown device 713377cde7341cb5536ec15b9ada9a21bc41ad8e4a18d75b500f63f63c74c4c9"
Dec 15 07:17:09 ose3-master.example.com docker[7714]: time="2015-12-15T07:17:09.413678122-05:00" level=error msg="HTTP Error" err="Unknown device 713377cde7341cb5536ec15b9ada9a21bc41ad8e4a18d75b500f63f63c74c4c9" statusCode=500
Dec 15 07:17:09 ose3-master.example.com kernel: XFS (dm-80): Unmounting Filesystem

We don't see the "Unknown device" error above on any normal/healthy system.

A few minutes later, we see the following tracebacks in dmesg:

[ 1922.389502] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 1922.396371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1922.404206] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 1922.411293] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 1922.416963]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 1922.424387]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 1922.431809]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 1922.439230] Call Trace:
[ 1922.441676]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 1922.448120]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 1922.454290]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 1922.459423]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 1922.465595]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 1922.471509]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 1922.478203]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 1922.484806]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 1922.491321]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 1922.497143]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 1922.502703]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 1922.508528]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 1922.513439]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 1922.520018]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 1922.525427]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 1922.532040] INFO: task docker:77805 blocked for more than 120 seconds.
[ 1922.538610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1922.546456] docker          D 0000000000000000     0 77805      1 0x00000080
[ 1922.553635]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 1922.561164]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 1922.568636]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 1922.576138] Call Trace:
[ 1922.578608]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 1922.583563]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 1922.589386]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 1922.594774]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 1922.600858]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 1922.606430]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 1922.612459]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 1922.617728]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 1922.623479]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 1922.629157]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 1922.635346]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 1922.641269]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 1922.647560]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 1922.653743]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 1922.660025]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 1922.666055]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 1922.671720]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1922.677827]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 1922.683333]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 1922.688909]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 1922.693950]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 2042.704387] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 2042.711258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2042.719072] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 2042.726146] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 2042.731799]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 2042.739218]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 2042.746635]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 2042.754052] Call Trace:
[ 2042.756496]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 2042.762923]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 2042.769092]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 2042.774223]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 2042.780401]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 2042.786317]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 2042.793009]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 2042.799613]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 2042.806129]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 2042.811950]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 2042.817511]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 2042.823333]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 2042.828200]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2042.834717]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 2042.840105]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2042.846690] INFO: task docker:77805 blocked for more than 120 seconds.
[ 2042.853203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2042.861019] docker          D 0000000000000000     0 77805      1 0x00000080
[ 2042.868084]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 2042.875502]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 2042.882922]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 2042.890344] Call Trace:
[ 2042.892786]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 2042.897741]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 2042.903567]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 2042.908961]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 2042.915047]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 2042.920610]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 2042.926604]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 2042.931820]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 2042.937555]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 2042.943222]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 2042.949395]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 2042.955320]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2042.961596]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 2042.967782]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2042.974049]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 2042.980057]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 2042.985726]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 2042.991817]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 2042.997304]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 2043.002875]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 2043.007930]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 2163.018203] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 2163.025074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2163.032890] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 2163.039964] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 2163.045617]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 2163.053035]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 2163.060453]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 2163.067870] Call Trace:
[ 2163.070318]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 2163.076747]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 2163.082915]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 2163.088045]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 2163.094266]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 2163.100209]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 2163.106948]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 2163.113553]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 2163.120071]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 2163.125892]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 2163.131462]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 2163.137291]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 2163.142169]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2163.148693]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 2163.154085]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2163.160681] INFO: task docker:77805 blocked for more than 120 seconds.
[ 2163.167200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2163.175017] docker          D 0000000000000000     0 77805      1 0x00000080
[ 2163.182118]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 2163.189533]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 2163.196987]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 2163.204468] Call Trace:
[ 2163.206941]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 2163.211896]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 2163.217749]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 2163.223136]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 2163.229251]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 2163.234816]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 2163.240840]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 2163.246060]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 2163.251825]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 2163.257523]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 2163.263736]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 2163.269691]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2163.275982]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 2163.282158]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2163.288456]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 2163.294461]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 2163.300149]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 2163.306265]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 2163.311772]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 2163.317363]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 2163.322435]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 2283.332984] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 2283.339855] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2283.347669] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 2283.354741] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 2283.360396]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 2283.367816]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 2283.375240]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 2283.382659] Call Trace:
[ 2283.385105]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 2283.391534]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 2283.397704]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 2283.402838]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 2283.409013]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 2283.414925]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 2283.421617]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 2283.428220]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 2283.434736]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 2283.440557]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 2283.446116]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 2283.451938]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 2283.456806]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2283.463324]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 2283.468713]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2283.475295] INFO: task docker:77805 blocked for more than 120 seconds.
[ 2283.481811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2283.489625] docker          D 0000000000000000     0 77805      1 0x00000080
[ 2283.496681]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 2283.504208]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 2283.511781]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 2283.519340] Call Trace:
[ 2283.521815]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 2283.526805]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 2283.532663]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 2283.538091]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 2283.544212]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 2283.549809]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 2283.555841]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 2283.561092]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 2283.566877]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 2283.572535]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 2283.578729]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 2283.584683]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2283.590986]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 2283.597200]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2283.603502]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 2283.609546]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 2283.615248]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 2283.621372]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 2283.626891]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 2283.632495]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 2283.637570]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 2403.647866] INFO: task kworker/u66:0:81 blocked for more than 120 seconds.
[ 2403.654736] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2403.662549] kworker/u66:0   D ffff8805bd5c28e0     0    81      2 0x00000000
[ 2403.669622] Workqueue: kdmremove do_deferred_remove [dm_mod]
[ 2403.675276]  ffff88060e937cf0 0000000000000046 ffff88060e917300 ffff88060e937fd8
[ 2403.682696]  ffff88060e937fd8 ffff88060e937fd8 ffff88060e917300 ffff8805bd5c28d8
[ 2403.690138]  ffff8805bd5c28dc ffff88060e917300 00000000ffffffff ffff8805bd5c28e0
[ 2403.697722] Call Trace:
[ 2403.700209]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[ 2403.706637]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 2403.712806]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[ 2403.717942]  [<ffffffffa0002e9d>] __dm_destroy+0xad/0x340 [dm_mod]
[ 2403.724114]  [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 2403.730029]  [<ffffffffa0008d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
[ 2403.736722]  [<ffffffffa000b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
[ 2403.743367]  [<ffffffffa0000dae>] do_deferred_remove+0xe/0x10 [dm_mod]
[ 2403.749926]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[ 2403.755747]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[ 2403.761354]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[ 2403.767217]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 2403.772128]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2403.778658]  [<ffffffff81645858>] ret_from_fork+0x58/0x90
[ 2403.784083]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 2403.790707] INFO: task docker:77805 blocked for more than 120 seconds.
[ 2403.797220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2403.805035] docker          D 0000000000000000     0 77805      1 0x00000080
[ 2403.812203]  ffff8805bf213ab0 0000000000000086 ffff880bbb116780 ffff8805bf213fd8
[ 2403.819787]  ffff8805bf213fd8 ffff8805bf213fd8 ffff880bbb116780 ffff8805bf213bf0
[ 2403.827237]  ffff8805bf213bf8 7fffffffffffffff ffff880bbb116780 0000000000000000
[ 2403.834721] Call Trace:
[ 2403.837213]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 2403.842209]  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[ 2403.848034]  [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[ 2403.853421]  [<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[ 2403.859546]  [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[ 2403.865150]  [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[ 2403.871144]  [<ffffffff810ab190>] ? call_srcu+0x70/0x70
[ 2403.876368]  [<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[ 2403.882112]  [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[ 2403.887817]  [<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[ 2403.894033]  [<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[ 2403.899957]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2403.906260]  [<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[ 2403.912481]  [<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[ 2403.918790]  [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 2403.924861]  [<ffffffff8112483d>] ? call_rcu_sched+0x1d/0x20
[ 2403.930577]  [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 2403.936682]  [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 2403.942160]  [<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[ 2403.947724]  [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 2403.952806]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b

Comment 2 Mike Snitzer 2015-12-17 17:01:55 UTC
We need this fix (already upstream): http://git.kernel.org/linus/2a708cff93f1

Comment 5 Mike Snitzer 2015-12-18 19:38:18 UTC
*** Bug 1267966 has been marked as a duplicate of this bug. ***

Comment 10 Jeremy Eder 2015-12-22 11:53:55 UTC
Test kernel seems to have resolved our deadlock issue.  Cluster is up and stable with the same container count that we had problems with last week (500 on 4 nodes).

3.10.0-327.4.4.el7.bz1292481.x86_64


** PM:  We would like to nominate this for 7.2.z

Comment 12 Jeremy Eder 2016-01-04 13:32:03 UTC
Can this be evaluated for 7.2.z please?

Comment 17 Rafael Aquini 2016-01-08 14:12:36 UTC
Patch(es) available on kernel-3.10.0-337.el7

Comment 25 Gerard Ryan 2016-02-18 11:34:38 UTC
Hi, is there a known date when we can expect this to be pushed to stable repos?

Comment 26 Jeremy Eder 2016-02-18 11:57:19 UTC
Hi Gerard -- we shipped this fix on Tuesday Feb 16.

Do a yum update and make sure you get kernel-3.10.0-327.10.1.el7 or later.

There was a different BZ tracking the RHEL7.2.z kernel release:
https://bugzilla.redhat.com/show_bug.cgi?id=1296566

This BZ is to track the fix into RHEL7.3.

Comment 30 Bruno Goncalves 2016-07-04 08:17:48 UTC
Tested kernel-3.10.0-445.el7 and it did pass on our Regression test and the tickets are closed.

Setting SanityOnly as we are not able to reproduce the problem in our environment.

Comment 32 errata-xmlrpc 2016-11-03 14:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html