Bug 595883

Summary: kdmflush hung task timeout warnings
Product: Red Hat Enterprise Linux 6 Reporter: Mike Gahagan <mgahagan>
Component: kernelAssignee: Mikuláš Patočka <mpatocka>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: high    
Version: 6.0CC: agk, coughlan, prajnoha
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-11-26 13:00:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 846704    

Description Mike Gahagan 2010-05-25 20:32:18 UTC
Description of problem:
During run of the rhts test /kernel/distribution/ctcs/1.3.0 I'm seeing a number of hung task warnings for kdmflush while the system is under heavy load. I don't recall seeing them on anything other than ppc.


INFO: task kdmflush:324 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kdmflush      D 0000000000000000     0   324      2 0x00008000
Call Trace:
[c00000008448b700] [c00000000006dbe0] .default_scale_smt_power+0x20/0x50 (unreliable)
[c00000008448b8d0] [c000000000014120] .__switch_to+0x100/0x1d0
[c00000008448b960] [c000000000572fe8] .schedule+0x438/0xd50
[c00000008448baf0] [c000000000573990] .io_schedule+0x90/0x110
[c00000008448bb80] [d000000000ae2c78] .dm_wait_for_completion+0xb8/0x160 [dm_mod]
[c00000008448bc60] [d000000000ae4474] .dm_flush+0x24/0x70 [dm_mod]
[c00000008448bcf0] [d000000000ae45b4] .dm_wq_work+0xf4/0x270 [dm_mod]
[c00000008448bdb0] [c0000000000aa4bc] .worker_thread+0x1dc/0x3c0
[c00000008448bed0] [c0000000000b05bc] .kthread+0xbc/0xd0
[c00000008448bf90] [c00000000002f104] .kernel_thread+0x54/0x70
INFO: task crashme:9395 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crashme       D 00000fffafe1bc08     0  9395   5163 0x00008080
Call Trace:
[c0000000843cf1d0] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c0000000843cf3a0] [c000000000014120] .__switch_to+0x100/0x1d0
[c0000000843cf430] [c000000000572fe8] .schedule+0x438/0xd50
[c0000000843cf5c0] [c0000000005740a8] .schedule_timeout+0x248/0x350
[c0000000843cf6e0] [c000000000573cb8] .wait_for_common+0x198/0x210
[c0000000843cf7c0] [c0000000002ae69c] .blkdev_issue_flush+0xcc/0x140
[c0000000843cf880] [d000000001c92d4c] .ext4_sync_file+0x19c/0x250 [ext4]
[c0000000843cf930] [c0000000001d5ee4] .vfs_fsync_range+0xd4/0x130
[c0000000843cf9e0] [c00000000013d1cc] .generic_file_aio_write+0xdc/0x100
[c0000000843cfaa0] [d000000001c92b14] .ext4_file_write+0x44/0xe0 [ext4]
[c0000000843cfb40] [c0000000001a3724] .do_sync_write+0xd4/0x160
[c0000000843cfce0] [c0000000001a466c] .vfs_write+0xec/0x1f0
[c0000000843cfd80] [c0000000001a4898] .SyS_write+0x58/0xb0
[c0000000843cfe30] [c00000000000852c] syscall_exit+0x0/0x40
INFO: task cat:9415 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cat           D 00000fff9eab3674     0  9415   9414 0x00008080
Call Trace:
[c000000086ce32f0] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c000000086ce34c0] [c000000000014120] .__switch_to+0x100/0x1d0
[c000000086ce3550] [c000000000572fe8] .schedule+0x438/0xd50
[c000000086ce36e0] [c000000000573990] .io_schedule+0x90/0x110
[c000000086ce3770] [c00000000013b1f0] .sync_page+0x70/0xa0
[c000000086ce37f0] [c0000000005742b4] .__wait_on_bit_lock+0xd4/0x1b0
[c000000086ce38b0] [c00000000013b134] .__lock_page+0x54/0x70
[c000000086ce3970] [c00000000013b3f8] .find_lock_page+0x98/0xc0
[c000000086ce3a10] [c00000000013bef0] .filemap_fault+0x1e0/0x500
[c000000086ce3ae0] [c000000000167b88] .__do_fault+0xa8/0x640
[c000000086ce3be0] [c000000000168ffc] .handle_pte_fault+0x14c/0xa90
[c000000086ce3cf0] [c000000000578f90] .do_page_fault+0x390/0x720
[c000000086ce3e30] [c00000000000518c] handle_page_fault+0x20/0x74
INFO: task awk:9416 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
awk           D 00000fff84df3674     0  9416   9414 0x00008080
Call Trace:
[c0000000848c7390] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c0000000848c7560] [c000000000014120] .__switch_to+0x100/0x1d0
[c0000000848c75f0] [c000000000572fe8] .schedule+0x438/0xd50
[c0000000848c7780] [c000000000573990] .io_schedule+0x90/0x110
[c0000000848c7810] [c00000000013b1f0] .sync_page+0x70/0xa0
[c0000000848c7890] [c0000000005742b4] .__wait_on_bit_lock+0xd4/0x1b0
[c0000000848c7950] [c00000000013b134] .__lock_page+0x54/0x70
[c0000000848c7a10] [c00000000013c0b0] .filemap_fault+0x3a0/0x500
[c0000000848c7ae0] [c000000000167b88] .__do_fault+0xa8/0x640
[c0000000848c7be0] [c000000000168ffc] .handle_pte_fault+0x14c/0xa90
[c0000000848c7cf0] [c000000000578f90] .do_page_fault+0x390/0x720
[c0000000848c7e30] [c00000000000518c] handle_page_fault+0x20/0x74
INFO: task kdmflush:324 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kdmflush      D 0000000000000000     0   324      2 0x00008000
Call Trace:
[c00000008448b700] [c00000000006dbe0] .default_scale_smt_power+0x20/0x50 (unreliable)
[c00000008448b8d0] [c000000000014120] .__switch_to+0x100/0x1d0
[c00000008448b960] [c000000000572fe8] .schedule+0x438/0xd50
[c00000008448baf0] [c000000000573990] .io_schedule+0x90/0x110
[c00000008448bb80] [d000000000ae2c78] .dm_wait_for_completion+0xb8/0x160 [dm_mod]
[c00000008448bc60] [d000000000ae4474] .dm_flush+0x24/0x70 [dm_mod]
[c00000008448bcf0] [d000000000ae45b4] .dm_wq_work+0xf4/0x270 [dm_mod]
[c00000008448bdb0] [c0000000000aa4bc] .worker_thread+0x1dc/0x3c0
[c00000008448bed0] [c0000000000b05bc] .kthread+0xbc/0xd0
[c00000008448bf90] [c00000000002f104] .kernel_thread+0x54/0x70
INFO: task jbd2/dm-0-8:346 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8   D 0000000000000000     0   346      2 0x00008000
Call Trace:
[c0000000848ff440] [c00000000006dbe0] .default_scale_smt_power+0x20/0x50 (unreliable)
[c0000000848ff610] [c000000000014120] .__switch_to+0x100/0x1d0
[c0000000848ff6a0] [c000000000572fe8] .schedule+0x438/0xd50
[c0000000848ff830] [c000000000573990] .io_schedule+0x90/0x110
[c0000000848ff8c0] [c00000000013b1f0] .sync_page+0x70/0xa0
[c0000000848ff940] [c0000000005744d4] .__wait_on_bit+0xb4/0x130
[c0000000848ff9f0] [c00000000013b50c] .wait_on_page_bit+0x7c/0xa0
[c0000000848ffab0] [c00000000013c4cc] .wait_on_page_writeback_range+0x10c/0x1d0
[c0000000848ffc10] [d000000001b44710] .jbd2_journal_commit_transaction+0x940/0x1ba0 [jbd2]
[c0000000848ffdd0] [d000000001b4e870] .kjournald2+0x110/0x470 [jbd2]
[c0000000848ffed0] [c0000000000b05bc] .kthread+0xbc/0xd0
[c0000000848fff90] [c00000000002f104] .kernel_thread+0x54/0x70
INFO: task rs:main Q:Reg:1112 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rs:main Q:Reg D 00000fff961c0d34     0  1112      1 0x00008080
Call Trace:
[c0000000869832f0] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c0000000869834c0] [c000000000014120] .__switch_to+0x100/0x1d0
[c000000086983550] [c000000000572fe8] .schedule+0x438/0xd50
[c0000000869836e0] [c000000000573990] .io_schedule+0x90/0x110
[c000000086983770] [c00000000013b1f0] .sync_page+0x70/0xa0
[c0000000869837f0] [c0000000005742b4] .__wait_on_bit_lock+0xd4/0x1b0
[c0000000869838b0] [c00000000013b134] .__lock_page+0x54/0x70
[c000000086983970] [c00000000013b3f8] .find_lock_page+0x98/0xc0
[c000000086983a10] [c00000000013bef0] .filemap_fault+0x1e0/0x500
[c000000086983ae0] [c000000000167b88] .__do_fault+0xa8/0x640
[c000000086983be0] [c000000000168ffc] .handle_pte_fault+0x14c/0xa90
[c000000086983cf0] [c000000000578f90] .do_page_fault+0x390/0x720
[c000000086983e30] [c00000000000518c] handle_page_fault+0x20/0x74
INFO: task flush-253:0:1124 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:0   D 0000000000000000     0  1124      2 0x00008080
Call Trace:
[c000000085fee990] [c000000085feea30] 0xc000000085feea30 (unreliable)
[c000000085feeb60] [c000000000014120] .__switch_to+0x100/0x1d0
[c000000085feebf0] [c000000000572fe8] .schedule+0x438/0xd50
[c000000085feed80] [c000000000573990] .io_schedule+0x90/0x110
[c000000085feee10] [c0000000001da2b8] .sync_buffer+0x68/0x80
[c000000085feee80] [c0000000005744d4] .__wait_on_bit+0xb4/0x130
[c000000085feef30] [c0000000005745c0] .out_of_line_wait_on_bit+0x70/0x90
[c000000085fef010] [c0000000001da170] .__wait_on_buffer+0x30/0x50
[c000000085fef090] [d000000001cc8674] .ext4_mb_init_cache+0x2d4/0x960 [ext4]
[c000000085fef1c0] [d000000001ccb0d0] .ext4_mb_load_buddy+0x330/0x400 [ext4]
[c000000085fef270] [d000000001cce64c] .ext4_mb_regular_allocator+0x27c/0x620 [ext4]
[c000000085fef3c0] [d000000001cceea0] .ext4_mb_new_blocks+0x4b0/0x670 [ext4]
[c000000085fef4a0] [d000000001cc437c] .ext4_ext_get_blocks+0x4fc/0x890 [ext4]
[c000000085fef5f0] [d000000001c9b5fc] .ext4_get_blocks+0x2dc/0x570 [ext4]
[c000000085fef6c0] [d000000001c9bec0] .mpage_da_map_blocks+0xb0/0x3e0 [ext4]
[c000000085fef7f0] [d000000001c9c504] .ext4_da_writepages+0x314/0x630 [ext4]
[c000000085fef970] [c000000000153154] .do_writepages+0x44/0x70
[c000000085fef9e0] [c0000000001cfc34] .writeback_single_inode+0xc4/0x2d0
[c000000085fefa90] [c0000000001d0dc0] .writeback_inodes_wb+0x390/0x5b0
[c000000085fefba0] [c0000000001d1148] .wb_writeback+0x168/0x290
[c000000085fefcd0] [c0000000001d1654] .wb_do_writeback+0x274/0x310
[c000000085fefda0] [c0000000001d174c] .bdi_writeback_task+0x5c/0x140
[c000000085fefe40] [c000000000165824] .bdi_start_fn+0xa4/0x150
[c000000085fefed0] [c0000000000b05bc] .kthread+0xbc/0xd0
[c000000085feff90] [c00000000002f104] .kernel_thread+0x54/0x70
INFO: task ntpd:3793 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ntpd          D 00000fff8d00b7b0     0  3793      1 0x00008080
Call Trace:
[c0000000870532f0] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c0000000870534c0] [c000000000014120] .__switch_to+0x100/0x1d0
[c000000087053550] [c000000000572fe8] .schedule+0x438/0xd50
[c0000000870536e0] [c000000000573990] .io_schedule+0x90/0x110
[c000000087053770] [c00000000013b1f0] .sync_page+0x70/0xa0
[c0000000870537f0] [c0000000005742b4] .__wait_on_bit_lock+0xd4/0x1b0
[c0000000870538b0] [c00000000013b134] .__lock_page+0x54/0x70
[c000000087053970] [c00000000013b3f8] .find_lock_page+0x98/0xc0
[c000000087053a10] [c00000000013bef0] .filemap_fault+0x1e0/0x500
[c000000087053ae0] [c000000000167b88] .__do_fault+0xa8/0x640
[c000000087053be0] [c000000000168ffc] .handle_pte_fault+0x14c/0xa90
[c000000087053cf0] [c000000000578f90] .do_page_fault+0x390/0x720
[c000000087053e30] [c00000000000518c] handle_page_fault+0x20/0x74
INFO: task master:3871 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 00000fff794bbc08     0  3871      1 0x00008080
Call Trace:
[c000000085b1f080] [c00000000010bb18] .acct_update_integrals+0x108/0x130 (unreliable)
[c000000085b1f250] [c000000000014120] .__switch_to+0x100/0x1d0
[c000000085b1f2e0] [c000000000572fe8] .schedule+0x438/0xd50
[c000000085b1f470] [d000000001b42454] .do_get_write_access+0x4a4/0x650 [jbd2]
[c000000085b1f5a0] [d000000001b42638] .jbd2_journal_get_write_access+0x38/0x70 [jbd2]
[c000000085b1f630] [d000000001cc4ae4] .__ext4_journal_get_write_access+0x44/0xc0 [ext4]
[c000000085b1f6d0] [d000000001c97f98] .ext4_reserve_inode_write+0xa8/0xe0 [ext4]
[c000000085b1f770] [d000000001c9802c] .ext4_mark_inode_dirty+0x5c/0x210 [ext4]
[c000000085b1f850] [d000000001c98388] .ext4_dirty_inode+0x48/0x80 [ext4]
[c000000085b1f8e0] [c0000000001d050c] .__mark_inode_dirty+0x5c/0x1b0
[c000000085b1f970] [c0000000001c0d5c] .file_update_time+0x10c/0x1c0
[c000000085b1fa30] [c0000000001af348] .pipe_write+0x248/0x5c0
[c000000085b1fb40] [c0000000001a3724] .do_sync_write+0xd4/0x160
[c000000085b1fce0] [c0000000001a466c] .vfs_write+0xec/0x1f0
[c000000085b1fd80] [c0000000001a4898] .SyS_write+0x58/0xb0
[c000000085b1fe30] [c00000000000852c] syscall_exit+0x0/0x40

Version-Release number of selected component (if applicable):
0520 tree, 2.6.32-27.el6 kernel

How reproducible:
frequent

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:
no warnings as on other arch's

Additional info:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=14310730

Comment 1 RHEL Program Management 2010-05-25 20:46:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 4 Mikuláš Patočka 2010-05-26 15:59:57 UTC
I suspect either a bug in device mapper barrier support or a bug in the underlying device driver (it can't process barrier requests). Another possibility is broken disk firmware (some disks crash their firmware if too many/too long commands are being sent to them).

Questions:
* What kind of device do you run the tests on? What md, dm and other layers are stacked and how? What is the physical device under this?

* Are the tasks hung forever? Or do they eventually unwedge?

You say that it is frequently reproducible, so try the same test using different host adapters and different disks and try to pinpoint conditions of the bug.

Comment 5 Mike Gahagan 2010-06-01 20:42:20 UTC
The storage device should have been provided by IBM's virtual scsi driver, I am not 100% sure of what the underlying storage was. The install was using LVM per anaconda's/Beaker's defaults.

The tasks did appear to recover eventually.

I'll try and grab some differently configured ppc's this week and do some more testing.

Comment 6 Mikuláš Patočka 2010-07-08 20:12:35 UTC
If the tasks recover, it suggests that it is not a device mapper problem (there are no timers in device mapper). I would look at the underlying device, why is it causing i/o delays.

Comment 7 RHEL Program Management 2010-07-15 15:15:20 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 9 RHEL Program Management 2011-01-07 04:48:31 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 10 Suzanne Logcher 2011-01-07 16:19:51 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 11 RHEL Program Management 2011-02-01 06:18:51 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 12 RHEL Program Management 2011-02-01 18:22:54 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 13 RHEL Program Management 2011-04-04 02:41:58 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 14 RHEL Program Management 2011-10-07 15:04:35 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 15 Alasdair Kergon 2012-11-26 13:00:34 UTC
No update for over 2 years so I'm closing this one.

If further data is found that points to something devel needs to investigate, then please reopen it.