Description of problem:

On an ext4 partition (iSCSI target) which is 6.2 TB full of 8 TB, whenever a file is written to the disk (using rsync or cp) the command hangs, and IOWAIT then hits 100% on one of the 2 cores. We have fsck'd the filesystem. While we are waiting for the file to be written (shortly after the entry below appears in messages) we can still read from the filesystem OK. Other iSCSI mounts to the same SAN are functioning correctly. Any advice on where to look next?

INFO: task flush-253:4:19598 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:4   D ffff88023fc23240     0 19598      2 0x00000080
 ffff880120387900 0000000000000046 0000000000000000 00000000008000a0
 ffff880120387930 ffffffff8123c024 0000000000000000 00000001016a021b
 ffff88023085fab0 ffff880120387fd8 00000000000103b8 ffff88023085fab0
Call Trace:
 [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
 [<ffffffff8110c100>] ? sync_page+0x0/0x50
 [<ffffffff814d8723>] io_schedule+0x73/0xc0
 [<ffffffff8110c13d>] sync_page+0x3d/0x50
 [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8110c0d7>] __lock_page+0x67/0x70
 [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
 [<ffffffffa00e363f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
 [<ffffffffa00e7288>] ext4_da_writepages+0x538/0x600 [ext4]
 [<ffffffffa00e3921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
 [<ffffffffa00e43dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
 [<ffffffffa00e941c>] ? ext4_write_inode+0x8c/0x130 [ext4]
 [<ffffffff8111eaf1>] do_writepages+0x21/0x40
 [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
 [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
 [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
 [<ffffffff81191829>] wb_do_writeback+0x189/0x200
 [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
INFO: task flush-253:4:19598 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:4   D ffff88023fc23240     0 19598      2 0x00000080
 ffff880120387900 0000000000000046 0000000000000000 00000000008000a0
 ffff880120387930 ffffffff8123c024 0000000000000000 00000001016a021b
 ffff88023085fab0 ffff880120387fd8 00000000000103b8 ffff88023085fab0
Call Trace:
 [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
 [<ffffffff8110c100>] ? sync_page+0x0/0x50
 [<ffffffff814d8723>] io_schedule+0x73/0xc0
 [<ffffffff8110c13d>] sync_page+0x3d/0x50
 [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8110c0d7>] __lock_page+0x67/0x70
 [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
 [<ffffffffa00e363f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
 [<ffffffffa00e7288>] ext4_da_writepages+0x538/0x600 [ext4]
 [<ffffffffa00e3921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
 [<ffffffffa00e43dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
 [<ffffffffa00e941c>] ? ext4_write_inode+0x8c/0x130 [ext4]
 [<ffffffff8111eaf1>] do_writepages+0x21/0x40
 [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
 [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
 [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
 [<ffffffff81191829>] wb_do_writeback+0x189/0x200
 [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
 [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20

After 6 mins IOWAIT dropped to 0 and could unmount the partition.

---

Server rebooted, and attempted to write a file to the disk again:

Nov 11 05:31:11 ldn-espfs2 kernel: INFO: task flush-253:7:2137 blocked for more than 120 seconds.
Nov 11 05:31:11 ldn-espfs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 05:31:11 ldn-espfs2 kernel: flush-253:7 D ffff88033fc23240 0 2137 2 0x00000080
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff9900 0000000000000046 0000000000000000 0000000000800080
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff9930 ffffffff8123c024 0000000000000000 00000000fffe3578
Nov 11 05:31:11 ldn-espfs2 kernel: ffff880322ff0670 ffff880322ff9fd8 00000000000103b8 ffff880322ff0670
Nov 11 05:31:11 ldn-espfs2 kernel: Call Trace:
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c100>] ? sync_page+0x0/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff814d8723>] io_schedule+0x73/0xc0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c13d>] sync_page+0x3d/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8110c0d7>] __lock_page+0x67/0x70
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00db63f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00df288>] ext4_da_writepages+0x538/0x600 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00db921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00dc3dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffffa00e141c>] ? ext4_write_inode+0x8c/0x130 [ext4]
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8111eaf1>] do_writepages+0x21/0x40
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81191829>] wb_do_writeback+0x189/0x200
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810909e6>] kthread+0x96/0xa0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff81090950>] ? kthread+0x0/0xa0
Nov 11 05:31:11 ldn-espfs2 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20
Nov 11 05:33:11 ldn-espfs2 kernel: INFO: task flush-253:7:2137 blocked for more than 120 seconds.
Nov 11 05:33:11 ldn-espfs2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 05:33:11 ldn-espfs2 kernel: flush-253:7 D ffff88033fc23240 0 2137 2 0x00000080
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff9900 0000000000000046 0000000000000000 0000000000800080
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff9930 ffffffff8123c024 0000000000000000 00000000fffe3578
Nov 11 05:33:11 ldn-espfs2 kernel: ffff880322ff0670 ffff880322ff9fd8 00000000000103b8 ffff880322ff0670
Nov 11 05:33:11 ldn-espfs2 kernel: Call Trace:
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8123c024>] ? generic_make_request+0x1b4/0x4f0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c100>] ? sync_page+0x0/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff814d8723>] io_schedule+0x73/0xc0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c13d>] sync_page+0x3d/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff814d8e4a>] __wait_on_bit_lock+0x5a/0xc0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8110c0d7>] __lock_page+0x67/0x70
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81090d90>] ? wake_bit_function+0x0/0x50
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8111f915>] ? pagevec_lookup_tag+0x25/0x40
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81198f80>] ? end_buffer_async_write+0x0/0x190
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00db63f>] ext4_num_dirty_pages+0x23f/0x250 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00df288>] ext4_da_writepages+0x538/0x600 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00db921>] ? __ext4_get_inode_loc+0xf1/0x3b0 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00dc3dc>] ? ext4_get_inode_loc+0x1c/0x20 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffffa00e141c>] ? ext4_write_inode+0x8c/0x130 [ext4]
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8111eaf1>] do_writepages+0x21/0x40
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81190d8d>] writeback_single_inode+0xdd/0x2c0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8119118e>] writeback_sb_inodes+0xce/0x180
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff811912e3>] writeback_inodes_wb+0xa3/0x1a0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8119163b>] wb_writeback+0x25b/0x2c0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8107d0f2>] ? del_timer_sync+0x22/0x30
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81191829>] wb_do_writeback+0x189/0x200
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff811918f3>] bdi_writeback_task+0x53/0xf0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c9e6>] bdi_start_fn+0x86/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff8112c960>] ? bdi_start_fn+0x0/0x100
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810909e6>] kthread+0x96/0xa0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff81090950>] ? kthread+0x0/0xa0
Nov 11 05:33:11 ldn-espfs2 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20

Wait 4 mins or so, and can remount the disk.

Version-Release number of selected component (if applicable):
2.6.32-44.2.el6.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Write data to the partition

Actual results:
2-6 minutes to write a 1 MB file to disk due to iowait

Expected results:
Files written to disk at normal speed.

Additional info:
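For reference, the sort of thing we run to watch this while the copy hangs; a minimal sketch only, assuming the sysstat package is installed for mpstat:

  # per-CPU utilisation every 5 seconds - %iowait pins near 100% on one core
  mpstat -P ALL 5

  # list tasks stuck in uninterruptible sleep (D state) during the hang
  ps -eo state,pid,comm | awk '$1 == "D"'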
At first I thought this is likely a dup of bug #648632, but the bug I fixed there is more related to ext4_num_dirty_pages() scanning too far, and in this case you've only got 1 MB files (?). Or perhaps there are other much larger files which are dirty at the time the 1 MB file is copied?

Can you tell me a little more about what's going on with the system; how much memory it has, perhaps, and a rough description of what sort of IO is going on in the filesystem when this happens?

When you say:

> After 6 mins IOWAIT dropped to 0 and could unmount the partition.
> ---
> Server rebooted, and attempted to write a file to the disk again

Is this the first write to the freshly mounted fs or is other IO going on? How big was the file in question this time?

Honestly, since the kernel version in this bug is a beta2 refresh kernel, it'd be best to first test this on the released RHEL6 kernel to verify that we're not chasing a bug that was fixed in RHEL6 GA already. Please let me know if the bug persists in the RHEL6 released kernel, and we'll pursue this further.

Thanks,
-Eric
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
Closing since we have not had an update or seen it internally. Steve, if you can verify that this issue exists in the released kernel, we can re-open the bz.
Hi guys,

Apologies for the delay; I thought it was best to give an update. We have now upgraded to 2.6.32-71.7.1 (and have also tested with the 2 patches referenced in bug #648632) but the problems are still intermittently persisting.

This filesystem contains around 7 million images; 95% of the files are in the 500 KB-3 MB range. We are seeing problems with high iowait when we are updating old files (for example replacing a 500 KB file with a 5 MB file). Approximately 2% of the updates cause the high iowait (where it takes 5-10 minutes to replace the file, compared to < 1s at other times). This did happen when the partition was first mounted, and is now happening at random intervals. There is read IO going on on this partition, but the volume is very low (< 100 MB/minute). We've seen file updates of 15 MB go through fine, and others of 4-5 MB that cause problems. The server has 12 GB of RAM and 2 CPUs.

I've been digging around https://bugzilla.kernel.org/show_bug.cgi?id=20902#c19 -- seems to be similar?

We were thinking of remounting the partition with nodelalloc to remove delayed allocation and the multiblock allocator, as file fragmentation would not be so much of an issue on this volume (a sketch of what we had in mind is below). From auditing the code that does the update, it just writes the new file to the same filename/path as the old image. Not sure if that makes a potential difference.

Are there any extra tools we can run for debugging, or any more info that we should try and capture?
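Roughly what we were planning to try; nodelalloc is a standard ext4 mount option, but the device and mount point names below are placeholders, not our real ones:

  # remount the volume without delayed allocation
  umount /mnt/images
  mount -o nodelalloc /dev/mapper/vg_san-images /mnt/images

  # confirm the option took effect
  grep /mnt/images /proc/mounts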
Created attachment 475364 [details]
df -h, df -i and tune2fs -l from the volume
I will reopen this one for this issue - thanks for the testing (and we should get a new kernel out soon with the latest patches).
On the first write after boot, we got the following after a file update:

INFO: task jbd2/dm-9-8:19777 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-9-8   D ffff88032c9d78e0     0 19777      2 0x00000080
 ffff8802f847fd10 0000000000000046 ffff880028210fa0 0000000000000008
 0000000000016980 0000000000016980 ffff8800282112c0 0000000000016980
 ffff88032cbfdb18 ffff8802f847ffd8 0000000000010518 ffff88032cbfdb18
Call Trace:
 [<ffffffff8109202e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]
 [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
 [<ffffffff8107d64c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac0b8>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac000>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
INFO: task sync:21266 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync          D ffff88033fc23240     0 21266  21247 0x00000088
 ffff8802e59e3d38 0000000000000086 ffff8802e59e3d00 ffff8802e59e3cfc
 0000000000000000 ffff88033fc23240 ffff880028216980 000000010009df80
 ffff8803203b7068 ffff8802e59e3fd8 0000000000010518 ffff8803203b7068
Call Trace:
 [<ffffffff814c9375>] schedule_timeout+0x225/0x2f0
 [<ffffffff814c8fe3>] wait_for_common+0x123/0x180
 [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
 [<ffffffff814c90fd>] wait_for_completion+0x1d/0x20
 [<ffffffff81195831>] writeback_inodes_sb+0xb1/0xe0
 [<ffffffff8119b17e>] __sync_filesystem+0x3e/0x90
 [<ffffffff8119b2c8>] sync_filesystems+0xf8/0x130
 [<ffffffff8119b357>] sys_sync+0x17/0x40
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

This message has not been logged since (and other slow updates have happened).
This could also be an iscsi issue, I suppose.

The patchset for bug #638349 might be worth testing too, although it's related to livelocks while syncing a lot of data, and if you're hitting this on the first write after mount, that doesn't make much sense.

Looks like that last stacktrace, at:

 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]

is waiting on a spinlock:

    284b:  e8 00 00 00 00          callq  2850 <jbd2_journal_commit_transaction+0x1c0>
        spin_lock(&journal->j_state_lock);
    2850:  48 8b bd 40 ff ff ff    mov    -0xc0(%rbp),%rdi
    2857:  e8 00 00 00 00          callq  285c <jbd2_journal_commit_transaction+0x1cc>
        spin_lock(&commit_transaction->t_handle_lock);

(2850 should be jbd2_journal_commit_transaction+0x1c0)

(though I'm not sure which kernel you're reporting against, so that might be off)

Doing a sysrq-w when things are hung up would show us all the sleeping threads (one way to capture that is sketched below)...

-Eric
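For reference, a minimal way to capture that, assuming the magic sysrq interface isn't already enabled on the box:

  # allow all sysrq functions, then ask the kernel to dump blocked (D-state) tasks
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger

  # the backtraces land in the kernel log
  dmesg | tail -n 300 > /tmp/sysrq-w.txt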
Ok, the partition was mounted with the nodelalloc option (to see if that would make a difference). On a file update (same filename replaced with a larger file) at the first attempt, the operation took 7 minutes. I'll attach the sysrq-w from when the update was going on. Subsequent file updates/writes to the partition have been fine. I will monitor over the next few days (previously we have had a slow (5 min+) file update after a remount, but then intermittently afterwards, roughly every 150 updates).
Created attachment 475977 [details]
copy of sysrq-w
Ok, it is now 2 days since the remount (when the partition was mounted with nodelalloc, as per the comments above). We've just had another one (after a 500 KB file was replaced with a 3 MB file). From a sysrq-w while the file update appeared hung:

Call Trace:
 [<ffffffffa00fabcc>] ? release_blocks_on_commit+0xdc/0x2f0 [ext4]
 [<ffffffff8109202e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa00a58f0>] jbd2_journal_commit_transaction+0x1c0/0x14e0 [jbd2]
 [<ffffffff810116e0>] ? __switch_to+0xd0/0x320
 [<ffffffff8107d64c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac0b8>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00ac000>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
nfsd          D 0000000000000002     0 19868      2 0x00000080
 ffff880322834fc0 0000000000000046 ffff880322834f80 ffffffffa000471c
 ffff88032b9fb408 ffff8802b3ab2200 0000000000000000 ffff8802aa10c740
 ffff880323277ad8 ffff880322835fd8 0000000000010518 ffff880323277ad8
Call Trace:
 [<ffffffffa000471c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
 [<ffffffff8109ba49>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8119e1b0>] ? sync_buffer+0x0/0x50
 [<ffffffff814c8dd3>] io_schedule+0x73/0xc0
 [<ffffffff8119e1f0>] sync_buffer+0x40/0x50
 [<ffffffff814c964f>] __wait_on_bit+0x5f/0x90
 [<ffffffff8119e1b0>] ? sync_buffer+0x0/0x50
 [<ffffffff814c96f8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81091d80>] ? wake_bit_function+0x0/0x50
 [<ffffffff8119e1a6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa00f91fd>] ext4_mb_init_cache+0x34d/0x970 [ext4]
 [<ffffffff8110c519>] ? add_to_page_cache_locked+0xc9/0x140
 [<ffffffffa00f9dba>] ext4_mb_load_buddy+0x33a/0x3d0 [ext4]
 [<ffffffffa00fcc06>] ext4_mb_regular_allocator+0x1e6/0x550 [ext4]
 [<ffffffffa00fd2fd>] ext4_mb_new_blocks+0x38d/0x560 [ext4]
 [<ffffffffa00d02f3>] ext4_alloc_branch+0x4a3/0x5a0 [ext4]
 [<ffffffffa00ce2ce>] ? ext4_get_branch+0xfe/0x130 [ext4]
 [<ffffffffa00d1fed>] ext4_ind_get_blocks+0x1dd/0x610 [ext4]
 [<ffffffffa00d2519>] ext4_get_blocks+0xf9/0x2a0 [ext4]
 [<ffffffffa00d2f8d>] ext4_get_block+0xbd/0x120 [ext4]
 [<ffffffff8119ff83>] __block_prepare_write+0x1e3/0x580
 [<ffffffffa00d2ed0>] ? ext4_get_block+0x0/0x120 [ext4]
 [<ffffffff811a04c4>] block_write_begin+0x64/0x100
 [<ffffffffa00d54d8>] ext4_write_begin+0x218/0x2d0 [ext4]
 [<ffffffffa00d2ed0>] ? ext4_get_block+0x0/0x120 [ext4]
 [<ffffffff8110c9fe>] generic_file_buffered_write+0x10e/0x2a0
 [<ffffffffa00d17af>] ? ext4_dirty_inode+0x4f/0x60 [ext4]
 [<ffffffff8110e350>] __generic_file_aio_write+0x250/0x480
 [<ffffffffa03e500b>] ? svc_tcp_data_ready+0x3b/0x90 [sunrpc]
 [<ffffffff8110e5ef>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00cb130>] ? ext4_file_write+0x0/0xb0 [ext4]
 [<ffffffffa00cb169>] ext4_file_write+0x39/0xb0 [ext4]
 [<ffffffff8116c89b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffff81091d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c28b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811ff6f6>] ? security_file_permission+0x16/0x20
 [<ffffffff8116d95f>] do_readv_writev+0xcf/0x1f0
 [<ffffffff811ffa56>] ? security_task_setgroups+0x16/0x20
 [<ffffffff81099b95>] ? set_groups+0x25/0x1a0
 [<ffffffff8116dac6>] vfs_writev+0x46/0x60
 [<ffffffffa04772d7>] nfsd_vfs_write+0x107/0x430 [nfsd]
 [<ffffffff810dd7be>] ? rcu_start_gp+0x1be/0x230
 [<ffffffffa04757c2>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffffa047a019>] nfsd_write+0x99/0x100 [nfsd]
 [<ffffffffa0484290>] nfsd4_write+0x100/0x130 [nfsd]
 [<ffffffffa0484a11>] nfsd4_proc_compound+0x401/0x500 [nfsd]
 [<ffffffffa047243e>] nfsd_dispatch+0xfe/0x250 [nfsd]
 [<ffffffffa03e3544>] svc_process_common+0x344/0x610 [sunrpc]
 [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
 [<ffffffffa03e3b50>] svc_process+0x110/0x150 [sunrpc]
 [<ffffffffa0472b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa0472aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff810919d6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81091940>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Created attachment 476230 [details]
copy of sysrq-w for comment 12
Ok, seeing ext4_mb_init_cache() in the stuck traces is a big clue I think, will look into this more. Both times that was on the nfsd thread, and many other nfsd threads are blocked as well. How is NFS in use here, is it related to the same filesystem? Thanks, -Eric
NFS is used to export this filesystem (as well as another 12 or so) to several foreign servers (only 1 writes to this filesystem). We have seen the issue when updating files on the local system (to rule out NFS).

If you can think of any extra debug/testing to run, let me know and I'll try and find a quiet time to go through it.

Thanks,
Steve
There are many nfsd processes writing, all trying to read in the metadata in ext4_mb_init_cache, it seems. I see that the fs is mostly full... This may be a case where the allocator is trying really hard to find new contiguous space by searching further and further from the inode; replacing existing smaller files with larger ones may have this effect.

It might be interesting to see what the e2freefrag tool says about the filesystem, to see what fragmentation looks like.

One other thing that might be interesting is to run blktrace against the device; mount it and do that first (7-minute) write, then stop blktrace. We can see exactly what IO was going on during that time from the blktrace files, but I'm pretty sure we'd just see it loading up those buddy bitmaps, looking for free contiguous space...

The other way we could verify that is by tracing these tracepoints:

# mount -t debugfs none /sys/kernel/debug
# echo 1 > /sys/kernel/debug/tracing/tracing_enabled
# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load/enable
# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_mb_buddy_bitmap_load/enable
<run your slow write>
# cat /sys/kernel/debug/tracing/trace
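For what it's worth, a rough sketch of how the e2freefrag / blktrace runs could look; the device name and paths are placeholders for the real volume:

  # free-space fragmentation report for the volume
  e2freefrag /dev/mapper/vg_san-images

  # trace block IO around the first (slow) write after a fresh mount
  mount /dev/mapper/vg_san-images /mnt/images
  blktrace -d /dev/mapper/vg_san-images -o slow-write &
  cp /tmp/new-image.jpg /mnt/images/path/to/existing-image.jpg   # the slow update
  kill %1          # stop blktrace; it leaves slow-write.blktrace.* files behind

  # summarize what IO happened during the stall
  blkparse -i slow-write | less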
Unfortunately the event (/sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load/enable) isn't available. Does it need to be specifically enabled? (kernel: 2.6.32-71.7.1_ext4p.el6.x86_64)

ls -l /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load
ls: cannot access /sys/kernel/debug/tracing/events/ext4/ext4_mb_bitmap_load: No such file or directory

Would enabling the events via /sys/kernel/debug/tracing/events/ext4/enable be useful?

Is there any way to mount the filesystem so that it will allocate in non-contiguous free space? We would be happy to accept file fragmentation in this scenario. We had tried mounting with nodelalloc, but we are still experiencing the occasional slow updates with it. Is there another combination of mount options that could be tried?

People have reported that this patch can improve performance:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8a57d9d61a6e361c7bb159dda797672c1df1a691

Thanks,
Steve
Steve, do you have an official subscription and access to Red Hat support? RH bugzilla is not meant to help development work with our support organization and our customers. RH support can help reproduce and gather information, so it would be great for you to get them involved. If you do not have a RH subscription, we should work this through other channels (Fedora testing or upstream email lists, kernel.org bugzilla, etc). Thanks!
Steve, just a note for you - I see that you have subscriptions. If you could open a ticket with RH support, it helps us in the process while we juggle which issues get fixed when. In addition to helping debug the situation, RH support advocates for customer issues in major releases :)
Steve, based on the backtraces you've gathered I do expect that patch to help. As Ric asked, if you can get this opened up with support channels it will help us guide it into a release. Thanks, -Eric
Also, regarding those tracepoints: they aren't in RHEL6 yet; they were added just after the commit you referred to. Sorry for the bad advice there, I was looking at an upstream kernel.
I created an 8T loopback filesystem, and "filled" it in a fragmented way by preallocating files which were one block shy of a block group in size. Then, a write of a larger 512M file into the filesystem shows the delay, though not as badly as your testcase. The patch in question reduces the time to write from about 3 minutes to about 2 minutes; not great in either case, but better.

In the stock case we're spending time in:

samples  %        image name  app name  symbol name
1748304  30.3997  ext4.ko     ext4.ko   ext4_mb_regular_allocator
1026635  17.8513  vmlinux     vmlinux   find_get_page
 549747   9.5591  ext4.ko     ext4.ko   ext4_mb_load_buddy

Patched, we're spending time elsewhere:

samples  %        image name  app name  symbol name
2067466  73.5126  ext4.ko     ext4.ko   ext4_mb_init_cache
 104695   3.7226  ext4.ko     ext4.ko   ext4_mb_find_by_goal
  29041   1.0326  vmlinux     vmlinux   __set_page_dirty_nobuffers

(higher % of runtime, but fewer samples overall, compared to ext4_mb_regular_allocator + find_get_page before.)

So at this point we're avoiding the buddy bitmap page loading, but on an initial mount it still requires initializing via ext4_mb_init_cache, which is expensive primarily due to the memset it contains, it seems. It's 50% better but not great. Without a central location for freespace info, this may be the best it can do.

Bear in mind that some of this is due to heavily fragmented freespace on a very full filesystem; there may be changes you can make to the application to alleviate some of it, depending on the IO you expect.
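For anyone wanting to reproduce something similar, this is only a sketch of the kind of setup described above; it assumes 4k blocks (so a 128 MiB block group of 32768 blocks), and the image path and mount point are arbitrary:

  # build a sparse 8T image and put ext4 on it
  truncate -s 8T /var/tmp/ext4-8t.img
  mkfs.ext4 -F /var/tmp/ext4-8t.img
  mkdir -p /mnt/ext4test
  mount -o loop /var/tmp/ext4-8t.img /mnt/ext4test

  # preallocate files one 4k block shy of a block group each, until the fs
  # fills up, leaving small free holes scattered across the disk
  sz=$(( (32768 - 1) * 4096 ))
  i=0
  while fallocate -l "$sz" /mnt/ext4test/filler.$i; do
      i=$((i + 1))
  done

  # now time a larger write into the badly fragmented free space
  time dd if=/dev/zero of=/mnt/ext4test/big bs=1M count=512 conv=fsync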
I'll go ahead & schedule the commit in comment #19 as an update; I don't know for sure that it will resolve Steve's problem, but it helped in my testing. Steve, please do get your support folks involved in this too for proper tracking. Thanks, -Eric
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available on kernel-2.6.32-164.el6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2011-1530.html