Bug 652262 - Slow writes to ext4 partition - INFO: task flush-253:7:2137 blocked for more than 120 seconds.
Slow writes to ext4 partition - INFO: task flush-253:7:2137 blocked for more...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.0
x86_64 Unspecified
low Severity medium
: rc
: ---
Assigned To: Eric Sandeen
Eryu Guan
: Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-11-11 08:43 EST by steve.drew
Modified: 2014-03-27 00:38 EDT (History)
9 users (show)

See Also:
Fixed In Version: kernel-2.6.32-164.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 07:35:15 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
df -h, df-i and tune2fs -l from volume (2.04 KB, text/plain)
2011-01-26 05:55 EST, steve.drew
no flags Details
copy of syreq-w (87.83 KB, text/plain)
2011-01-29 20:40 EST, steve.drew
no flags Details
copy of syreq-w for comment 12 (63.93 KB, text/plain)
2011-01-31 11:49 EST, steve.drew
no flags Details

  None (edit)
Description steve.drew 2010-11-11 08:43:25 EST
Description of problem:

On a ext4 parition (iscsi target) which is 6.2Tb of 8Tb full, when every a file is attempted to be written to the disk (using rsync or cp) the command hangs, and then IOWAIT hits 100% on of of the 2 cores. 
We have fsck'd the filesystem. When we are waiting for the file to be written (shortly after the below entry in messages) we can read OK from the filesystem.

Other iscsi mounts to the same SAN are functioning correctly. Any advive of where to look next?


INFO: task flush-253:4:19598 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:4   D ffff88023fc23240     0 19598      2 0x00000080
 ffff880120387900 0000000000000046 0000000000000000 00000000008000a0
 ffff880120387930 ffffffff8123c024 0000000000000000 00000001016a021b
 ffff88023085fab0 ffff880120387fd8 00000000000103b8 ffff88023085fab0
Call Trace:
 [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
 [<ffffffff8110c100>] ? sync_page+0x0/0x50
 [<ffffffff814d8723>] io_schedule+0x73/0xc0
 [<ffffffff8110c13d>] sync_page+0x3d/0x50
 [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8110c0d7>] __lock_page+0x67/0x70
 [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
 [<ffffffffa00e363f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
 [<ffffffffa00e7288>] ext4_da_writepages+0x538/0x600 [ext4]
 [<ffffffffa00e3921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
 [<ffffffffa00e43dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
 [<ffffffffa00e941c>] ? ext4_write_inode+0x8c/0x130 [ext4]
 [<ffffffff8111eaf1>] do_writepages+0x21/0x40
 [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
 [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
 [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
 [<ffffffff81191829>] wb_do_writeback+0x189/0x200
 [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
INFO: task flush-253:4:19598 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:4   D ffff88023fc23240     0 19598      2 0x00000080
 ffff880120387900 0000000000000046 0000000000000000 00000000008000a0
 ffff880120387930 ffffffff8123c024 0000000000000000 00000001016a021b
 ffff88023085fab0 ffff880120387fd8 00000000000103b8 ffff88023085fab0
Call Trace:
 [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
 [<ffffffff8110c100>] ? sync_page+0x0/0x50
 [<ffffffff814d8723>] io_schedule+0x73/0xc0
 [<ffffffff8110c13d>] sync_page+0x3d/0x50
 [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8110c0d7>] __lock_page+0x67/0x70
 [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
 [<ffffffffa00e363f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
 [<ffffffffa00e7288>] ext4_da_writepages+0x538/0x600 [ext4]
 [<ffffffffa00e3921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
 [<ffffffffa00e43dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
 [<ffffffffa00e941c>] ? ext4_write_inode+0x8c/0x130 [ext4]
 [<ffffffff8111eaf1>] do_writepages+0x21/0x40
 [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
 [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
 [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
 [<ffffffff81191829>] wb_do_writeback+0x189/0x200
 [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20


After 6 mins IOWAIT dropped to 0 and could unmount the partition.

---
Server rebooted, and attempt to write a file to the disk again

Nov 11 05:31:11 ldn-espfs2 kernel: INFO: task flush-253:7:2137 blocked for more than 120 seconds.
Nov 11 05:31:11 ldn-espfs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 05:31:11 ldn-espfs2 kernel: flush-253:7   D ffff88033fc23240     0  2137      2 0x00000080
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff9900 0000000000000046 0000000000000000 0000000000800080
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff9930 ffffffff8123c024 0000000000000000 00000000fffe3578
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff0670 ffff880322ff9fd8 00000000000103b8 ffff880322ff0670
Nov 11 05:31:11 ldn-espfs2 kernel: Call Trace:
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c100>] ? sync_page+0x0/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff814d8723>] io_schedule+0x73/0xc0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c13d>] sync_page+0x3d/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c0d7>] __lock_page+0x67/0x70
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00db63f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00df288>] ext4_da_writepages+0x538/0x600 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00db921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00dc3dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00e141c>] ? ext4_write_inode+0x8c/0x130 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8111eaf1>] do_writepages+0x21/0x40
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81191829>] wb_do_writeback+0x189/0x200
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810909e6>] kthread+0x96/0xa0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81090950>] ? kthread+0x0/0xa0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Nov 11 05:33:11 ldn-espfs2 kernel: INFO: task flush-253:7:2137 blocked for more than 120 seconds.
Nov 11 05:33:11 ldn-espfs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 05:33:11 ldn-espfs2 kernel: flush-253:7   D ffff88033fc23240     0  2137      2 0x00000080
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff9900 0000000000000046 0000000000000000 0000000000800080
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff9930 ffffffff8123c024 0000000000000000 00000000fffe3578
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff0670 ffff880322ff9fd8 00000000000103b8 ffff880322ff0670
Nov 11 05:33:11 ldn-espfs2 kernel: Call Trace:
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c100>] ? sync_page+0x0/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff814d8723>] io_schedule+0x73/0xc0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c13d>] sync_page+0x3d/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c0d7>] __lock_page+0x67/0x70
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00db63f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00df288>] ext4_da_writepages+0x538/0x600 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00db921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00dc3dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00e141c>] ? ext4_write_inode+0x8c/0x130 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8111eaf1>] do_writepages+0x21/0x40
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8119163b>] wb_writeback+0x25b/0x2
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81191829>] wb_do_writeback+0x189/0x200
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810909e6>] kthread+0x96/0xa0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81090950>] ? kthread+0x0/0xa0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20

Wait 4 mins or so, and can remount disk..


Version-Release number of selected component (if applicable):
 2.6.32-44.2.el6.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Write data to partition
2.
3.
  
Actual results:

2-6 minutes to write a 1mb file do disk due to wait

---

Expected results:

Files written to disk at normal speed.

Additional info:
Comment 2 Eric Sandeen 2010-11-11 12:11:55 EST
At first I thought this is likely a dup of bug #648632, but the bug I fixed there is more related to ext4_num_dirty_pages() scanning too far, and in this case you've only got 1MB files (?).  Or perhaps there are other much larger files which are dirty at the time the 1MB file is copied?

Can you tell me a little more about what's going on with the system; how much memory it has, perhaps, and a rough description of what sort of IO is going on in the filesystem when this happens?

When you say:

> After 6 mins IOWAIT dropped to 0 and could unmount the partition.

> ---
> Server rebooted, and attempt to write a file to the disk again

Is this the first write to the freshly mounted fs or is other IO going on?  How big was the file in question this time?

Honestly since the kernel version in this bug is a beta2 refresh kernel, it'd be best to first test this on the released RHEL6 kernel to verify that we're not chasing a bug that was fixed in RHEL6 GA already.

Please let me know if the bug persists in the RHEL6 released kernel, and we'll pursue this further.

Thanks,
-Eric
Comment 3 RHEL Product and Program Management 2011-01-06 23:22:26 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 4 Suzanne Yeghiayan 2011-01-07 11:09:37 EST
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.
Comment 5 Ric Wheeler 2011-01-07 13:31:13 EST
Closing since we have not had an update or seen it internally.

Steve, if you can you verify that this issue exists in the released kernel, we can re-open the bz.
Comment 6 steve.drew 2011-01-26 05:53:39 EST
Hi Guys,

Ok apologies for the delay, thought it was best to give an update.
We have now upgraded to 2.6.32-71.7.1 (have also testing with the 2 patches referenced in #648632) but the problems are still intermittently persisting.

This file contains around 7 million images, 95% of files are in the 500k-3mb range.

We are seeing problems with high iowait when we are updating old file (for example replacing a 500k file with a 5mb file). Approx 2% of the updates are causing the high io wait (where it takes 5-10minutes to replace the file, compared to < 1s at other times). This did happen when the partition was first mounted, and is now happening at random intervals). There is read IO going on at this partition, but volume is very low (<100mb/minute)
We've seen update file goes on at 15mb and they are fine, and others at 4-5mb that cause problems. 

The server has 12Gb of RAM, 2cpu. 

I've been digging around https://bugzilla.kernel.org/show_bug.cgi?id=20902#c19 -< seems to be similar?

We were thinking of remounting the partition with nodelalloc to remove delayed allocation and the multiblock allocator, as file fragmentation would not be so much of an issue on this volume.

From auditing the code that does the update, its just writes the newfile into the same filename/path as the old image. Not sure if that makes a potential difference. 

Is there any extra tools we can run for debugging, or any more info that we should try and capture?
Comment 7 steve.drew 2011-01-26 05:55:19 EST
Created attachment 475364 [details]
df -h, df-i and tune2fs -l from volume
Comment 8 Ric Wheeler 2011-01-26 08:00:20 EST
I will reopen this one for this issue - thanks for the testing (and we should get a new kernel out soon with the latest patches).
Comment 9 steve.drew 2011-01-26 08:34:35 EST
On the first write on boot we got the following after a file update: 

INFO: task jbd2/dm-9-8:19777 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-9-8   D ffff88032c9d78e0     0 19777      2 0x00000080
 ffff8802f847fd10 0000000000000046 ffff880028210fa0 0000000000000008
 0000000000016980 0000000000016980 ffff8800282112c0 0000000000016980
 ffff88032cbfdb18 ffff8802f847ffd8 0000000000010518 ffff88032cbfdb18
Call Trace:
 [<ffffffff8109202e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]
 [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
 [<ffffffff8107d64c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac0b8>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac000>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
INFO: task sync:21266 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync          D ffff88033fc23240     0 21266  21247 0x00000088
 ffff8802e59e3d38 0000000000000086 ffff8802e59e3d00 ffff8802e59e3cfc
 0000000000000000 ffff88033fc23240 ffff880028216980 000000010009df80
 ffff8803203b7068 ffff8802e59e3fd8 0000000000010518 ffff8803203b7068
Call Trace:
 [<ffffffff814c9375>] schedule_timeout+0x225/0x2f0
 [<ffffffff814c8fe3>] wait_for_common+0x123/0x180
 [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
 [<ffffffff814c90fd>] wait_for_completion+0x1d/0x20
 [<ffffffff81195831>] writeback_inodes_sb+0xb1/0xe0
 [<ffffffff8119b17e>] __sync_filesystem+0x3e/0x90
 [<ffffffff8119b2c8>] sync_filesystems+0xf8/0x130
 [<ffffffff8119b357>] sys_sync+0x17/0x40
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

This message has not been logged since (and other slow updates have happened)
Comment 10 Eric Sandeen 2011-01-26 17:47:14 EST
This could also be an iscsi issue, I suppose.

The patchset for bug #638349 might also be worth testing too, although it's related to livelocks while syncing a lot of data, and if you're hitting this on the first write after mount, that doesn't make much sense.

Looks like that last stacktrace, at:

 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]

is waiting on a spinlock:

    284b:       e8 00 00 00 00          callq  2850 <jbd2_journal_commit_transaction+0x1c0>
                        spin_lock(&journal->j_state_lock);
    2850:       48 8b bd 40 ff ff ff    mov    -0xc0(%rbp),%rdi
    2857:       e8 00 00 00 00          callq  285c <jbd2_journal_commit_transaction+0x1cc>
                        spin_lock(&commit_transaction->t_handle_lock);

(2850 should be jbd2_journal_commit_transaction+0x1c0)

(though I'm not sure which kernel you're reporting against, so that might be off)

Doing a sysrq-w when things are hung up would show us all the sleeping threads...

-Eric
Comment 11 steve.drew 2011-01-29 20:37:37 EST
Ok, the partition was mounted with the nodelalloc option (to see if that would make a difference)

On a file update (same filename replaced with larger file) on at the first attempt, the operation took 7 minutes. I'll attatch the sysrq-w from when update was going on.

Subsequent file updates/write to the partitions have been fine. I will monitor over the next few days (previously we have had a slow (5min+) file update after a remount, but then intermittantly afterwards, roughly every 150 updates.
Comment 12 steve.drew 2011-01-29 20:40:13 EST
Created attachment 475977 [details]
copy of syreq-w
Comment 13 steve.drew 2011-01-31 11:48:46 EST
Ok, now 2 days since the remount (when the partition was mounted with nodelalloc as above comments).

We've just had: (after a 500k file replaced with a 3mb file)

From a sysreq-w while file update appeared hung:

Call Trace:
 [<ffffffffa00fabcc>] ? release_blocks_on_commit+0xdc/0x2f0 [ext4]
 [<ffffffff8109202e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]
 [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
 [<ffffffff8107d64c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac0b8>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac000>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
nfsd          D 0000000000000002     0 19868      2 0x00000080
 ffff880322834fc0 0000000000000046 ffff880322834f80 ffffffffa000471c
 ffff88032b9fb408 ffff8802b3ab2200 0000000000000000 ffff8802aa10c740
 ffff880323277ad8 ffff880322835fd8 0000000000010518 ffff880323277ad8
Call Trace:
 [<ffffffffa000471c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
 [<ffffffff8109ba49>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8119e1b0>] ? sync_buffer+0x0/0x50
 [<ffffffff814c8dd3>] io_schedule+0x73/0xc0
 [<ffffffff8119e1f0>] sync_buffer+0x40/0x50
 [<ffffffff814c964f>] __wait_on_bit+0x5f/0x90
 [<ffffffff8119e1b0>] ? sync_buffer+0x0/0x50
 [<ffffffff814c96f8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81091d80>] ? wake_bit_function+0x0/0x50
 [<ffffffff8119e1a6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa00f91fd>] ext4_mb_init_cache+0x34d/0x970 [ext4]
 [<ffffffff8110c519>] ? add_to_page_cache_locked+0xc9/0x140
 [<ffffffffa00f9dba>] ext4_mb_load_buddy+0x33a/0x3d0 [ext4]
 [<ffffffffa00fcc06>] ext4_mb_regular_allocator+0x1e6/0x550 [ext4]
 [<ffffffffa00fd2fd>] ext4_mb_new_blocks+0x38d/0x560 [ext4]
 [<ffffffffa00d02f3>] ext4_alloc_branch+0x4a3/0x5a0 [ext4]
 [<ffffffffa00ce2ce>] ? ext4_get_branch+0xfe/0x130 [ext4]
 [<ffffffffa00d1fed>] ext4_ind_get_blocks+0x1dd/0x610 [ext4]
 [<ffffffffa00d2519>] ext4_get_blocks+0xf9/0x2a0 [ext4]
 [<ffffffffa00d2f8d>] ext4_get_block+0xbd/0x120 [ext4]
 [<ffffffff8119ff83>] __block_prepare_write+0x1e3/0x580
 [<ffffffffa00d2ed0>] ? ext4_get_block+0x0/0x120 [ext4]
 [<ffffffff811a04c4>] block_write_begin+0x64/0x100
 [<ffffffffa00d54d8>] ext4_write_begin+0x218/0x2d0 [ext4]
 [<ffffffffa00d2ed0>] ? ext4_get_block+0x0/0x120 [ext4]
 [<ffffffff8110c9fe>] generic_file_buffered_write+0x10e/0x2a0
 [<ffffffffa00d17af>] ? ext4_dirty_inode+0x4f/0x60 [ext4]
 [<ffffffff8110e350>] __generic_file_aio_write+0x250/0x480
 [<ffffffffa03e500b>] ? svc_tcp_data_ready+0x3b/0x90 [sunrpc]
 [<ffffffff8110e5ef>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00cb130>] ? ext4_file_write+0x0/0xb0 [ext4]
 [<ffffffffa00cb169>] ext4_file_write+0x39/0xb0 [ext4]
 [<ffffffff8116c89b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c28b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ff6f6>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d95f>] do_readv_writev+0xcf/0x1f0
 [<ffffffff811ffa56>] ? security_task_setgroups+0x16/0x20
 [<ffffffff81099b95>] ? set_groups+0x25/0x1a0
 [<ffffffff8116dac6>] vfs_writev+0x46/0x60
 [<ffffffffa04772d7>] nfsd_vfs_write+0x107/0x430 [nfsd]
 [<ffffffff810dd7be>] ? rcu_start_gp+0x1be/0x230
 [<ffffffffa04757c2>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffffa047a019>] nfsd_write+0x99/0x100 [nfsd]
 [<ffffffffa0484290>] nfsd4_write+0x100/0x130 [nfsd]
 [<ffffffffa0484a11>] nfsd4_proc_compound+0x401/0x500 [nfsd]
 [<ffffffffa047243e>] nfsd_dispatch+0xfe/0x250 [nfsd]
 [<ffffffffa03e3544>] svc_process_common+0x344/0x610 [sunrpc]
 [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
 [<ffffffffa03e3b50>] svc_process+0x110/0x150 [sunrpc]
 [<ffffffffa0472b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa0472aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Comment 14 steve.drew 2011-01-31 11:49:57 EST
Created attachment 476230 [details]
copy of syreq-w for comment 12
Comment 15 Eric Sandeen 2011-01-31 14:25:07 EST
Ok, seeing ext4_mb_init_cache() in the stuck traces is a big clue I think, will look into this more.  Both times that was on the nfsd thread, and many other nfsd threads are blocked as well.  How is NFS in use here, is it related to the same filesystem?

Thanks,
-Eric
Comment 16 steve.drew 2011-01-31 14:40:24 EST
NFS is used to export this filesystem (as well as another 12 or so) to several foreign servers (only 1 writes to this filesystem)

We have seen the issue when updating the file on the local system (to rule out NFS).

If you can think of any extra debug/testing to run let me know and I'll try and find a quiet time to go through it.

Thanks
steve
Comment 17 Eric Sandeen 2011-01-31 17:09:34 EST
There are many nfsd processes writing, all trying to read in the metadata in ext4_mb_init_cache, it seems.

I see that the fs is mostly full...

This may be a case where the allocator is trying really hard to find new contiguous space by searching further and further from the inode; replacing existing smaller files with larger ones may have this effect.

It might be interesting to see what the e2freefrag tool says about the filesystem, to see what fragmentation looks like.

One other thing that might be interesting is to run blktrace against the device; mount it and do that first (7-minute) write, then stop blktrace.  We can see exactly what IO was going on during that time from the blktrace files, but I'm pretty sure we'd just see it loading up those buddy bitmaps, looking for free contiguous space...

The other way we could verify that is by tracing these tracepoints:

# mount -t debugfs none /sys/kernel/debug
# echo 1 > /sys/kernel/debug/tracing/tracing_enabled
# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load/enable 
# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_mb_buddy_bitmap_load/enable

<run your slow write>

# cat /sys/kernel/debug/tracing/trace
Comment 18 RHEL Product and Program Management 2011-02-01 00:32:09 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 19 steve.drew 2011-02-01 05:53:57 EST
Unfortunatly th event (/sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load/enable) isnt available. Does it need to be specifically enabled? (kernel: 2.6.32-71.7.1_ext4p.el6.x86_64

ls -l /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load
ls: cannot access /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load: No such file or directory

Would enabling the events via /sys/kernel/debug/tracing/events/ext4/enable be useful?
 
Is there any way to mount the filesystem so that it will allocate in non-contigous free space? We would be happy to accept file fragmentation in this scenario. 

We had tried mounting with nodelalloc, but we are still experience the occasional slow updates with it. Is there another combination of mount options that could be tried?  

People have reported that this patch can improve performance:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8a57d9d61a6e361c7bb159dda797672c1df1a691

Thanks
steve
Comment 20 Ric Wheeler 2011-02-01 08:17:24 EST
Steve, do you have an official subscription and access to Red Hat support?

RH bugzilla is not meant to help development work with our support organization and our customers. RH support can help reproduce and gather information, so it would be great for you to get them involved.

If you do not have a RH subscription, we should work this through other channels (Fedora testing or upstream email lists, kernel.org bugzilla, etc).

Thanks!
Comment 21 Ric Wheeler 2011-02-02 13:50:45 EST
Steve, just a note for you - I see that you have subscriptions. If you could open a ticket with RH support, it helps us in the process while we juggle which issues get fixed when. In addition to helping debug the situation, RH support advocates for customer issues in major releases :)
Comment 22 Eric Sandeen 2011-02-02 16:47:38 EST
Steve, based on the backtraces you've gathered I do expect that patch to help.  As Ric asked, if you can get this opened up with support channels it will help us guide it into a release.

Thanks,
-Eric
Comment 23 Eric Sandeen 2011-02-02 17:05:32 EST
Also, regarding those tracepoints, they aren't in rhel6 yet, they got added just after the commit you referred to, sorry for the bad advice there, I was looking at an upstream kernel.
Comment 24 Eric Sandeen 2011-02-03 16:37:09 EST
I created an 8T loopback filesystem, and "filled" it in a fragmented way by preallocating files which were one block shy of a block group size.

Then, a write of a larger 512M file into the filesystem shows delay, though not as badly as your testcase.

The patch in question reduces the time to write from about 3 minutes to about 2 minutes; not great in either case, but better.

In the stock case we're spending time in:

samples  %        image name  app name            symbol name
1748304  30.3997  ext4.ko     ext4.ko             ext4_mb_regular_allocator
1026635  17.8513  vmlinux     vmlinux             find_get_page
549747    9.5591  ext4.ko     ext4.ko             ext4_mb_load_buddy


patched, we're spending time elsewhere:

samples  %        image name   app name           symbol name
2067466  73.5126  ext4.ko      ext4.ko            ext4_mb_init_cache
104695    3.7226  ext4.ko      ext4.ko            ext4_mb_find_by_goal
29041     1.0326  vmlinux      vmlinux            __set_page_dirty_nobuffers

(higher % of runtime, but fewer samples overall, compared to ext4_mb_regular_allocator + find_get_page before..)

So at this point we're avoiding the buddy bitmap page loading, but on an initial mount it still requires initializing via ext4_mb_init_cache, which is expensive primarily due to the memset it contains, it seems.  It's 50% better but not great.

Without a central location for freespace info, this may be the best it can do.  Bear in mind that some of this is due to heavily fragmented freespace on a very full filesystem, there may be changes you can make to the application to alleviate some of it, depending on the io you expect.
Comment 25 Eric Sandeen 2011-06-09 12:53:37 EDT
I'll go ahead & schedule the commit in comment #19 as an update; I don't know for sure that it will resolve Steve's problem, but it helped in my testing.

Steve, please do get your support folks involved in this too for proper tracking.

Thanks,
-Eric
Comment 26 RHEL Product and Program Management 2011-06-09 13:00:30 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 28 Kyle McMartin 2011-06-29 17:49:42 EDT
Patch(es) available on kernel-2.6.32-164.el6
Comment 32 errata-xmlrpc 2011-12-06 07:35:15 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

Note You need to log in before you can comment on or make changes to this bug.