Bug 813137
| Field | Value |
|---|---|
| Summary | [xfs/xfstests 273] heavy cp workload hang |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | Eryu Guan <eguan> |
| Component | kernel |
| Assignee | Brian Foster <bfoster> |
| Status | CLOSED ERRATA |
| QA Contact | Boris Ranto <branto> |
| Severity | high |
| Docs Contact | |
| Priority | urgent |
| Version | 6.3 |
| CC | akrherz, bfoster, branto, dchinner, dhoward, esandeen, fhrbata, garima_kochhar, gireesha_us, jcpunk, jwest, kernel-eus-qe, leo, narebeestjes, pasteur, rhinduja, rwheeler, torel |
| Target Milestone | rc |
| Keywords | ZStream |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | kernel-2.6.32-288.el6 |
| Doc Type | Bug Fix |
| Doc Text | Various race conditions that led to indefinite log reservation hangs due to xfsaild "idle" mode occurred in the XFS file system. This could lead to certain tasks becoming unresponsive; for example, the cp utility could become unresponsive under heavy workload. This update improves the Active Item List (AIL) pushing logic in xfsaild. Also, the log reservation algorithm and its interactions with xfsaild have been improved. As a result, the aforementioned problems no longer occur in this scenario. |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2013-02-21 06:07:57 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 860787, 969150, 972578 |
| Attachments | Full console log (577872); Patch against upstream XFS to cause xfsaild idling (581261) |
Description
Eryu Guan
2012-04-17 03:28:57 UTC
Created attachment 577872 [details]
Full console log
Reproduced on 6.2 GA kernel, so it's not a regression.

Can you see if it is reproducible on the 6.1 kernel? That will tell me if it was introduced in 6.2 or not, and hence whether it is a known problem or something new....

I'm running 273 on the 6.1 kernel (2.6.32-131.0.15.el6); it seems to take much longer (5397s) to finish one iteration than it takes on the 6.2 or 6.3 kernel (295s). I'll leave it running the whole weekend and update results next Monday.

273 takes less than 2 hours (about 6000s) to finish one iteration on rhel6.1, and it doesn't finish 100 iterations within two days (the whole weekend). So far I see no hung tasks, only poor performance.

FWIW, I was messing around with this and I've reproduced 2/2 times on a local rhel6.3 VM (2.6.32-264.el6, x86_64, 8 cpu, 20GB vd). Looking through the associated area of code, I stuck an xlog_grant_push_ail() call in xlog_grant_log_space() before the first sleep and got through 100 iterations without a hang. I'm trying to learn some of this code, so it isn't totally clear if this is appropriate, fixing a symptom vs. a problem, otherwise wildly broken, etc. One thing I did notice is that xlog_grant_log_space() seems to follow a similar pattern as xlog_regrant_write_log_space(), and the latter function includes a bit of logic to wake up any queued tasks and potentially push the ail before the first sleep (though the purpose of xlog_regrant_write_log_space() is not clear to me). I could try adding similar logic to xlog_grant_log_space() if that sounds reasonable. (FWIW, it seems that upstream is somewhat different, but does include looping ail push logic down in this codepath via xlog_grant_head_wait().)

(In reply to comment #6)
> 273 takes less than 2 hours (about 6000s) to finish one iteration on rhel6.1,
> it doesn't finish 100 iterations within two days (the whole weekend). So far I
> see no hung tasks but only poor performance.

Ok, that is kind of what I expected. It's related to the speed-ups in the code introduced in the 6.2 kernel.

(In reply to comment #7)
> FWIW, I was messing around with this and I've reproduced 2/2 times on a local
> rhel6.3 VM (2.6.32-264.el6, x86_64, 8 cpu, 20GB vd). Looking through the
> associated area of code, I stuck a xlog_grant_push_ail() call in
> xlog_grant_log_space() before the first sleep and got through 100 iterations
> without a hang.

That just papers over the issue by adding more push calls that can break the deadlock. It doesn't avoid the deadlock at all, and from previous experience it isn't a reliable workaround for such log space hangs, either.

> I'm trying to learn some of this code so it isn't totally clear if this is
> appropriate, fixing a symptom vs. a problem, otherwise wildly broken,

It's a symptom, unfortunately. There are other problems in the code that have been recently fixed, but I think this one will be relatively easy to fix with a backport.

> etc. One thing I did notice is that xlog_grant_log_space() seems to follow a
> similar pattern as xlog_regrant_write_log_space(), and the latter function
> includes a bit of logic to wake up any queued tasks and potentially push the
> ail before the first sleep (though the purpose of xlog_regrant_write_log_space()
> is not clear to me..).

It's for permanent (rolling) transactions that already have a multiple transaction log space reservation through xlog_grant_log_space(). The regrant code is just moving the reservation for the next transaction in the series from the reserve head to the write head so it can take place normally.
> I could try adding similar logic to xlog_grant_log_space() if
> that sounds reasonable. (FWIW, it seems that upstream is somewhat different,
> but does include looping ail push logic down in this codepath via
> xlog_grant_head_wait().)

The problem is much more likely the one fixed by this patch set that was added to the 3.0.x stable kernel series to solve the similar log space hangs that people were reporting upstream:

http://oss.sgi.com/archives/xfs/2011-10/msg00235.html

The "revert to kthread" patch is not necessary, because RHEL6 doesn't have the workqueue infrastructure that this patch is reverting from and hence still has a thread for xfsaild. So I'd start by backporting this series and see if that solves the problem for you.

Cheers,

Dave.

(In reply to comment #9)
> It's for permanent (rolling) transactions that already have a multiple
> transaction log space reservation through xlog_grant_log_space(). The regrant
> code is just moving the reservation for the next transaction in the series
> from the reserve head to the write head so it can take place normally.

Thanks for the comments and explanation.

> The problem is much more likely the one fixed by this patch set that was added
> to the 3.0.x stable kernel series to solve the similar log space hangs that
> people were reporting upstream.
> ...
> So I'd start by backporting this series and see if that solves the problem
> for you.

I did a quick backport of the core bits here (minus the kthread patch) to the latest rhel6 tree and ran a 100-loop test clean overnight. I'm running it once more just to be sure. If it passes I'll pull in the rest (the tracepoints, mainly), run through a full xfstests and post a patch. Thanks again for pointing me in the right direction here!

Sure enough, not long after I post and start the second test, I reproduce the hang. ;) This patchset seems to make it harder to reproduce, anyway. I think I'll give this test a whirl on an upstream kernel...

I ran this 100-loop sequence against 3.4.0-rc4+ 3 or 4 times over the weekend and couldn't reproduce a hang. After reading/comparing some of this code with upstream, I noticed one obvious difference: upstream xfsaild never appears to sleep. My obvious question here: is this intentional?

Specifically, the tout return of xfsaild_push() has changed as this code has gone from kthread->workqueue->kthread, and the current upstream code returns 50 in the !count case (as opposed to 0 as in rhel6).

I'll attach a patch against upstream that makes this behave more like rhel for reference...
Created attachment 581261 [details]
Patch against upstream XFS to cause xfsaild idling

This is in reference to comment 12, particularly to illustrate the difference in xfsaild behavior between rhel6 and upstream (3.4.0-rc5+).

(In reply to comment #12)
> I ran this 100 loop sequence against 3.4.0-rc4+ 3 or 4 times over the weekend
> and couldn't reproduce a hang. After reading/comparing some of this code with
> upstream, I noticed one obvious difference: upstream xfsaild never appears to
> sleep. My obvious question here.. is this intentional?

The idling was removed when reverting the workqueue code back to a kthread. Yes, it was intentional, and I forgot about that little detail of the revert. Well done on spotting it. :)

> Specifically... the tout return of xfsaild_push() has changed as this code has
> gone from kthread->workqueue->kthread, and the current upstream code returns 50
> in the !count case (as opposed to 0 as in rhel6).
>
> I'll attach a patch against upstream that makes this behave more like rhel for
> reference...

Did idling cause upstream to fail?

As it is, I forgot about another, more recent upstream patch that is probably relevant here:

9f9c19e xfs: fix the logspace waiting algorithm

Can you backport that one to your series as well, and see if that fixes the rarer hang you are now seeing?

(In reply to comment #14)
> The idling was removed when reverting the workqueue code back to a kthread.
> Yes, it was intentional, and I forgot about that little detail of the revert.
> Well done on spotting it. :)
Thanks. :) Out of curiosity, what is the particular purpose for that behavior?

> Did idling cause upstream to fail?

Interestingly, it did not. I ran that test overnight and even commented out one or two of the "extra" (as compared to rhel6) ail push calls and didn't reproduce a hang. I suppose that's good in that upstream isn't apparently depending on the non-idle behavior to survive this test.

> As it is, I forgot about another, more recent upstream patch that is probably
> relevant here:
>
> 9f9c19e xfs: fix the logspace waiting algorithm
>
> Can you backport that one to your series as well, and see if that fixes the
> rarer hang you are now seeing?

Ah... thanks. This looks like the origin of the largest difference I see between rhel6 and upstream (i.e., the waiting algorithm and the cleanup to provide some commonality between log grant/regrant). This looks rather promising from what I've seen so far. I'll give it a shot... thanks again.

(In reply to comment #15)
> Thanks. :) Out of curiosity, what is the particular purpose for that behavior?

To prevent excessive wakeups preventing long CPU idle sleep times, i.e. to allow the CPU to spend more time in deep power saving modes.

> Interestingly, it did not.

That's good to know - I've been considering reintroducing the idling upstream given all the fixes we've done recently...

(In reply to comment #16)
> To prevent excessive wakeups preventing long CPU idle sleep times, i.e. to
> allow the CPU to spend more time in deep power saving modes.

This is why the idle functionality exists, no? I could have phrased that question better, but I'm wondering why the no-idle behavior was introduced...

As an update, I backported 9f9c19e to rhel6 and was still able to reproduce the hang. Doing a side-by-side walkthrough of the wait/wake code, the logic after this change seems to be relatively equivalent between upstream and rhel6. I started looking at the xfsaild_push() logic and came across 670ce93f. I included that change as well, got through two 100-loop runs without an issue and eventually reproduced the hang in a 3rd run. :(

I may try the no-idle change at least for reference, or perhaps adding some tracepoints to get more intuition on what's happening. In the meantime let me know if you think of anything else to test...

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

(In reply to comment #17)
> This is why the idle functionality exists, no? I could have phrased that
> question better, but I'm wondering why the no-idle behavior was introduced...

The idling code was relatively new, and we reverted it because we weren't sure whether it was having a watchdog effect on AIL pushing.
> As an update, I backported 9f9c19e to rhel6 and was still able to reproduce the
> hang. Doing a side-by-side walk through of the wait/wake code, the logic after
> this change seems to be relatively equivalent between upstream and rhel6. I
> started looking at the xfsaild_push() logic and came across 670ce93f. I
> included that change as well, got through two 100-loop runs without an issue
> and eventually reproduced the hang in a 3rd run. :(

So, that means you have backported these patches?

9f9c19e xfs: fix the logspace waiting algorithm
670ce93 xfs: reduce the number of log forces from tail pushing
17b3847 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
bc6e588 xfs: do not update xa_last_pushed_lsn for locked items

Which means that the fix might be in these 4:

e44f411 xfs: set cursor in xfs_ail_splice() even when AIL was empty
af3e402 xfs: convert AIL cursors to use struct list_head
16b5902 xfs: remove confusing ail cursor wrapper
1d8c95a xfs: use a cursor for bulk AIL insertion

e44f411 is a fix for a real bug (and could possibly be the cause of the problem), while the other 3 are fixes for excessive CPU usage during log IO completion when inserting items into the AIL. That could be a contributing factor, but I never saw that issue cause a hang....

> I may try the no-idle change at least for reference, or perhaps adding some
> tracepoints to get more intuition on what's happening. In the meantime let me
> know if you think of anything else to test...

Backporting:

9e4c109 xfs: add AIL pushing tracepoints

will give you some tracepoints...

(In reply to comment #19)
> So, that means you have backported these patches?
>
> 9f9c19e xfs: fix the logspace waiting algorithm
> 670ce93 xfs: reduce the number of log forces from tail pushing
> 17b3847 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
> bc6e588 xfs: do not update xa_last_pushed_lsn for locked items

Correct. I've been doing a little debugging that I'll describe below with 670ce93 reverted, because it doesn't seem to fix the problem but makes it harder to reproduce...

> Which means that the fix might be in these 4:
>
> e44f411 xfs: set cursor in xfs_ail_splice() even when AIL was empty
> af3e402 xfs: convert AIL cursors to use struct list_head
> 16b5902 xfs: remove confusing ail cursor wrapper
> 1d8c95a xfs: use a cursor for bulk AIL insertion

Thanks, I'll give these a shot...

> Backporting:
>
> 9e4c109 xfs: add AIL pushing tracepoints
>
> will give you some tracepoints...

Thanks again. I've hacked in a couple tracepoints to help me understand the paths I've been digging into, and so far I can use them to try and describe 2 hangs that I'm reproducing. The tracepoints are at the end of xfsaild_push(), so I can see the end result of a particular push, and in xfs_trans_ail_push(), so I can see when we do or do not actually execute a wake. This is with the backported patches listed above (save 670ce93).
1.) The first hang I combined with a dump of the in-memory ailp->xa_ail log item list using crash to see what was going on. What I see is a sequence like the following:

- xfsaild() is working with a target of 21474840310, hits the stuck > 100 heuristic and reschedules, setting the last_lsn pointer to 21474837942.
- Concurrently, the xa_target value is updated to 21474840950.
- xfsaild restarts with the updated target, handles 0 items and goes into idle mode, never to return.
- Using crash, I see a list with li_lsn entries larger than the last shown target (expected) preceded by a bunch of entries with li_lsn == 17179889014, which goes a bit further back in my tracepoint log.

I was thinking that perhaps the push mechanism skipped past these inadvertently across an xa_target update, causing it to think that even though it hit a target, everything previous was complete (???). TBH, I don't have a complete understanding of the log mechanism at the moment... so perhaps this is more likely related to your proposed ail list fixes..?

In any event, I'm able to avoid this hang by adding a little hack to xfsaild_push() to never enter idle mode on an xfsaild_push() sequence that doesn't start with last_lsn == 0. IOW, if we hit the target and last_lsn != 0, set last_lsn = 0 and schedule one more time.

2.) With the above "fix," I hit a hang and ended up diagnosing it to a slightly different cause. What I see in my run is that although we issue quite a few xfs_trans_ail_push() calls through the updated wait/wake algorithm, this function only calls xfsaild_wakeup() when the corresponding threshold_lsn moves the target_lsn forward (which implies we expected xfsaild, once woken, to reschedule itself intelligently until complete). It looks like I neglected to save my trace log from this one, so the high-level sequence of events from memory is as follows:

- xfsaild() is running.
- xfsaild_wakeup() runs, updates ailp->xa_target and issues a wake_up_process(). Via tracepoint, wake_up_process() returns 0 to indicate that xfsaild is already active.
- xfsaild_push() exits into idle mode.
- xfs_trans_ail_push() calls continue to occur for a period of time. None make it to xfsaild_wakeup() because they do not push xa_target forward, and I end up hung.

I can avoid this hang by setting a flag (stored in ailp) based on the return value of wake_up_process() and using that to make sure that xfs_trans_ail_push() calls continue to result in wake_up_process() calls until the thread activation actually occurs (adding a reset of this flag in xfsaild when it pulls a new xa_target might be better to avoid spurious wakes, but I'm just hacking around at the moment)...

--- Summary

So far I've run two successful 100-loop sequences with both of these changes, and started a 3rd this morning. Previously I hadn't got through a single 100-loop sequence with this combination of backports, so I'm a bit more confident that I've at least captured the characteristics of the hangs here. If this 3rd run succeeds, I plan to undo the first "fix," verify I reproduce, and get into testing the AIL patches you've proposed.

With regard to hang 2... On inspection, it looks like upstream has the same logic to limit wakes in the case where xa_target is not moved forward, so I'm a little curious why I never reproduced that one. At the same time, upstream does include changes like 670ce93 which made this significantly harder for me to reproduce (and could technically be a different hang). If I get a chance, perhaps I'll try reverting that on an upstream kernel and see if the hang occurs...
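For readers following along, the lost-wakeup pattern in hang 2 is easier to see outside the kernel. Below is a small user-space analogue, not the RHEL6 or upstream XFS code: the worker thread stands in for xfsaild, push() for the xfs_trans_ail_push()/xfsaild_wakeup() path, and wake_pending for the flag stored in ailp that the comment describes. All names and timings here are illustrative only.

```c
/*
 * User-space analogue (NOT kernel code) of the lost-wakeup problem in
 * hang 2 above: a push request that arrives while the worker is still
 * busy must be remembered, so the worker never goes idle with work
 * pending.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool wake_pending;   /* set by push(), cleared by the worker */
static bool done;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        /*
         * Only go "idle" if no push request arrived while we were busy.
         * Without this predicate, a wakeup issued while the worker was
         * already running (wake_up_process() returning 0 in the trace
         * above) is simply lost and the worker sleeps forever.
         */
        while (!wake_pending && !done)
            pthread_cond_wait(&cond, &lock);
        if (done)
            break;
        wake_pending = false;           /* consume the request...   */
        pthread_mutex_unlock(&lock);
        printf("worker: doing one push pass\n");
        usleep(10000);                  /* ...then do one push pass */
        pthread_mutex_lock(&lock);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void push(void)
{
    pthread_mutex_lock(&lock);
    wake_pending = true;            /* remember the request even if...  */
    pthread_cond_signal(&cond);     /* ...the worker is already running */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    int i;

    pthread_create(&t, NULL, worker, NULL);
    for (i = 0; i < 5; i++) {
        push();
        usleep(5000);
    }
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```

Build with `gcc -pthread`. In this analogue, dropping the wake_pending predicate would mean any push issued while the worker is mid-pass is lost and the worker then waits indefinitely, which is the user-space equivalent of the hang described above.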
(In reply to comment #20)
> Thanks again. I've hacked in a couple tracepoints to help me understand the
> paths I've been digging into and so far I can use them to try and describe 2
> hangs that I'm reproducing. The tracepoints are at the end of xfsaild_push(),
> so I can see the end result of a particular push, and in xfs_trans_ail_push(),
> so I can see when we do or do not actually execute a wake. This is with the
> backported patches listed above (save 670ce93).

Ok.

> 1.) The first hang I combined with a dump of the in memory ailp->xa_ail log
> item list using crash to see what was going on. What I see is a sequence like
> the following:
>
> - xfsaild() is working with a target of 21474840310, hits the stuck > 100
> heuristic and reschedules setting the last_lsn pointer to 21474837942.
> - Concurrently, the xa_target value is updated to 21474840950.
> - xfsaild restarts with the updated target, handles 0 items and goes into idle
> mode never to return.

Ah, numbers. I love numbers - I can debug code with numbers. ;)

FWIW, log sequence numbers make a lot more sense when printed in hex, e.g. 21474837942 = 0x5000005B6, which tells me the cycle number (upper 32 bits) is 5, and the sector offset (lower 32 bits) is 0x5b6. When you then compare it with 21474840950 = 0x500001176, you can see that only the sector offset has changed, and by how much. XFS uses encodings like this everywhere, so printing stuff in hex is often much more insightful than printing in decimal....

Anyway, what this means is that xfs_trans_ail_cursor_first() never found an entry with a last_lsn >= 21474837942, so as far as it is concerned, the push is complete. It sets last_lsn = 0, returns 0 and hence idles.

The upstream code currently assumes that even when we reach our target, there is more to do. That's the first problem with the RHEL6 code - idle should only be entered when the AIL is empty, not when we have reached our push target.

> - Using crash, I see a list with li_lsn entries larger than the last shown
> target (expected) preceded by a bunch of entries with li_lsn == 17179889014,
> which goes a bit further back in my tracepoint log.

Yes, that is exactly what will happen if we hit this case.

> I was thinking that perhaps the push mechanism skipped past these inadvertently
> across an xa_target update, causing it to think even though it hit a target
> that everything previous was complete (???). TBH, I don't have a complete
> understanding of the log mechanism at the moment... so perhaps this is more
> likely related to your proposed ail list fixes..?

No, that's not fixed in any of the commits I listed - it was caught by the removal of the idling logic. IOWs, Christoph was right in his suspicion that the idling logic was contributing to the problems upstream - we were just never able to confirm it until right now.

> In any event, I'm able to avoid this hang by adding a little hack to
> xfsaild_push() to never enter idle mode on an xfsaild_push() sequence that
> doesn't start with last_lsn == 0. IOW, if we hit the target and last_lsn != 0,
> set last_lsn = 0 and schedule one more time.

Well, this is handled upstream simply by handling the !count/reached target/empty cursor cases all the same - sleep for 50ms, set last_lsn = 0 to restart the push. That is, there is no idle case at all.
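To make the hex decoding above easy to reproduce, here is a tiny stand-alone helper. It is an illustration rather than XFS code (the kernel's CYCLE_LSN()/BLOCK_LSN() macros perform the same split); the cycle_lsn()/block_lsn() names are made up for this example, and the input values are the LSNs quoted in this exchange.

```c
/*
 * Stand-alone illustration of the LSN encoding described above: the upper
 * 32 bits of an XFS log sequence number are the log cycle, the lower 32
 * bits the block offset within that cycle.
 */
#include <inttypes.h>
#include <stdio.h>

static uint32_t cycle_lsn(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t block_lsn(uint64_t lsn) { return (uint32_t)lsn; }

int main(void)
{
    /* The LSNs quoted in the hang-1 trace above. */
    uint64_t lsns[] = { 21474840310ULL,     /* push target                */
                        21474837942ULL,     /* last_lsn at reschedule     */
                        21474840950ULL,     /* updated xa_target          */
                        17179889014ULL };   /* stale li_lsn seen in crash */
    size_t i;

    for (i = 0; i < sizeof(lsns) / sizeof(lsns[0]); i++)
        printf("%llu = 0x%09" PRIx64 "  cycle %u  block 0x%x\n",
               (unsigned long long)lsns[i], lsns[i],
               cycle_lsn(lsns[i]), block_lsn(lsns[i]));
    return 0;
}
```

Running it shows that the first three values all sit in cycle 5 with only the block offset differing (0x5B6, 0xEF6, 0x1176), while the stale items are back in cycle 4, which matches the observation made above.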
Idling should only occur if the AIL is empty, so the idle logic in RHEL6 is completely broken and needs to be fixed.

> 2.) With the above "fix," I hit a hang and ended up diagnosing it to a slightly
> different cause. What I see in my run is that although we issue quite a few
> xfs_trans_ail_push() calls through the updated wait/wake algorithm, this
> function only calls xfsaild_wakeup() when the corresponding threshold_lsn moves
> the target_lsn forward (which implies we expected xfsaild, once woken, to
> reschedule itself intelligently until complete).
> ...
> - xfs_trans_ail_push() calls continue to occur for a period of time. None make
> it to xfsaild_wakeup() because they do not push xa_target forward and I end up
> hung.

Right, that is also fixed by not having an idle mode.

> I can avoid this hang by setting a flag (stored in ailp) based on the return
> value of wake_up_process() and using that to make sure that
> xfs_trans_ail_push() calls continue to result in wake_up_process() calls
> until the thread activation actually occurs (adding a reset of this flag in
> xfsaild when it pulls a new xa_target might be better to avoid spurious wakes,
> but I'm just hacking around at the moment)...

I think killing the idle mode is the first thing that is necessary. With just the basic backports you have, can you simply change the initial value of tout in xfsaild_push to 50, and see if that avoids the problems you have seen so far?

BTW, Brian - good work in debugging this :)

Cheers,

Dave.
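A minimal sketch of the policy difference being discussed, under the assumptions stated in this thread: with idle mode, "pushed nothing" maps to tout = 0 ("sleep until explicitly woken"), while the no-idle behaviour returns 50 ("recheck in 50ms"). This models only the decision, not the actual xfsaild_push() code; push_timeout() and the 20/10 values are illustrative, and only the 0/50 returns and the "stuck > 100" threshold come from the discussion above.

```c
/*
 * Model of the xfsaild sleep decision discussed in this thread.  NOT the
 * RHEL6 or upstream xfsaild_push() implementation, just the policy
 * difference: with idle mode, "pushed nothing" depends on a later wakeup;
 * without it, the thread always repolls.
 */
#include <stdbool.h>
#include <stdio.h>

/* Returned sleep time in ms; 0 means "go idle until explicitly woken". */
static int push_timeout(int items_pushed, int stuck, bool no_idle)
{
    if (stuck > 100)        /* the "stuck > 100" heuristic from the trace */
        return 20;          /* back off briefly and restart (illustrative) */
    if (items_pushed == 0)
        /*
         * RHEL6: nothing pushed => idle (0) and rely on a wakeup.
         * Upstream / suggested fix: return 50 and repoll, so a missed
         * or filtered wakeup cannot strand the thread.
         */
        return no_idle ? 50 : 0;
    return 10;              /* made progress: come back soon (illustrative) */
}

int main(void)
{
    /* Hang-1 shape: new target picked up, but 0 items handled. */
    printf("idle-mode policy: tout = %d (sleeps until a wakeup that may never arrive)\n",
           push_timeout(0, 0, false));
    printf("no-idle policy:   tout = %d ms (rechecks the AIL on its own)\n",
           push_timeout(0, 0, true));
    return 0;
}
```

Dave's point that idling should only ever happen with an empty AIL would amount to an extra ail-empty check in the "nothing pushed" branch; the series that eventually shipped for RHEL6 includes an "xfs: disable xfsaild idle mode" patch (see the changelog excerpt further down).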
(In reply to comment #21)
> FWIW, log sequence numbers make a lot more sense when printed in hex. e.g
> 21474837942 = 0x5000005B6 which tells me the cycle number (upper 32 bits) is 5,
> and the sector offset (lower 32 bits) is 0x5b6. When you then compare it with
> 21474840950 = 0x500001176, you can see that only the sector offset has changed,
> and by how much. XFS uses encodings like this everywhere, so printing stuff in
> hex is often much more insightful than printing in decimal....

Good to know, thanks...

> Anyway, what this means is that xfs_trans_ail_cursor_first() never found an
> entry with a last_lsn >= 21474837942, so as far as it is concerned, the push is
> complete. It sets last_lsn = 0, returns 0 and hence idles.
>
> The upstream code currently assumes that even when we reach our target, there
> is more to do. That's the first problem with the RHEL6 code - idle should only
> be entered when the AIL is empty, not when we have reached our push target.

Ok, that clears that up...

> I think killing the idle mode is the first thing that is necessary. With just
> the basic backports you have, can you simply change the initial value of tout
> in xfsaild_push to 50, and see if that avoids the problems you have seen so
> far?

Yep, I'll give that a shot. If that proves sufficient to resolve the hang, how much of the backport stuff would you recommend we include into rhel6? My debug branch is kind of botched with the various hacks I've been making, so I need to reconstruct it anyways.
(In reply to comment #22)
> I'll be initially testing the no-idle-mode change in a rhel6 branch that
> effectively includes the following:
>
> bc6e588a8971 xfs: do not update xa_last_pushed_lsn for locked items
> 17b38471c3c0 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
> 9e4c109ac822 xfs: add AIL pushing tracepoints
> 9e7004e741de xfs: exit AIL push work correctly when AIL is empty
>   ^^^ I think the no-idle-mode change is just a cherry pick from this one

Not really. That was a fix for an idling condition when the AIL pushing was done via a workqueue. The workqueue code had different idle cases, and it was protected by a flag which had a race condition in clearing it. When the code was reverted back to a kthread, that flag was removed and the original idling behaviour was not restored. Hence the idling logic was removed by selective code reversion, not due to any specific patch.

> 9f9c19ec1a59 xfs: fix the logspace waiting algorithm
>
> ... but it sounds like technically the no-idle change alone could address the
> problem. FWIW, the aforementioned backports were relatively straightforward and
> I have no issue including them if appropriate.

Given that they are in the 3.0.x stable kernel, and XFS in RHEL6 is only slightly behind 3.0.x, they probably should be backported....

> > BTW, Brian - good work in debugging this :)
>
> Thanks for pointing me in the right direction! :)
>
> BTW, I'm pretty sure I tested upstream with a change that re-enabled idle mode
> and never hit this problem through a weekend test run. Now that I have a bit
> more context, I think some of the performance related changes I've tried
> testing make this significantly more difficult to reproduce.

It would seem that way - upstream still appears to have at least one bug that results in a log space hang that I'm trying to understand at the moment.....

> Once this bug is resolved, is there any interest/value in pursuing upstream
> fixes to the idle mode logic such that it could be re-enabled in the future?

Yes, there is. Laptop users will get noticeably longer battery life if we don't wake up the CPU 20x/sec per filesystem....

Cheers,

Dave.

I've run through the 273 100-loop test probably 8-10 times now, through the end of last week and over the weekend, and can no longer reproduce hangs. I ran a full xfstests today and observed no regressions. The patchset has been posted for rhel6.4:

http://post-office.corp.redhat.com/archives/rhkernel-list/2012-May/msg00840.html

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Patch(es) available on kernel-2.6.32-288.el6

Clearing rh private, nothing secret here that I can see.

We're looking for a RHEL 6.3 z-stream fix for this bug.

Sorry to make noise, but the async 6.3 z-stream errata just released, kernel-2.6.32-279.14.1, does not mention this bug nor bugzilla #860787 in the rpm changelog? The erratum text mentions #860787, though. I am not clever enough to figure out if the patch is in the source or not, sorry. Did the patch make it?

Thank you
rpm -q --changelog -p kernel-2.6.32-279.14.1.el6.x86_64.rpm | grep 860787

- [fs] xfs: disable xfsaild idle mode (Brian Foster) [860787 813137]
- [fs] xfs: fix the logspace waiting algorithm (Brian Foster) [860787 813137]
- [fs] xfs: add AIL pushing tracepoints (Brian Foster) [860787 813137]
- [fs] xfs: force the log if we encounter pinned buffers in .iop_pushbuf (Brian Foster) [860787 813137]
- [fs] xfs: do not update xa_last_pushed_lsn for locked items (Brian Foster) [860787 813137]

Patches are in; the changelog also contains info about this fix.

Thank you, I was looking here: https://rhn.redhat.com/network/software/packages/change_log.pxt?pid=734552 and it stops at "[fs] udf: Avoid run away". I should not have assumed this was correct. Sorry for the noise...

Ref. Daryl's question. I started seeing this with the upgrade to 2.6.32-279.14.1. On an idle node loadavg = 1. Note I did not see this on 2.6.32-279.5.2.el6.x86_64.

top - 16:30:53 up 11 days, 18:03, 1 user, load average: 1.03, 1.02, 1.00
Tasks: 331 total, 1 running, 330 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49551592k total, 951012k used, 48600580k free, 9664k buffers
Swap: 15999992k total, 16556k used, 15983436k free, 101316k cached

PID  USER PR NI VIRT RES SHR S %CPU %MEM TIME+    COMMAND
3304 root 20 0  0    0   0   D 0.3  0.0  11:40.47 xfsaild/sda5

Is this fixed or not in 2.6.32-279.14.1?

Just to add a *me too* on 2.6.32-279.14.1: xfsaild goes into deadlock state. Unmounting the XFS filesystem causes it to return to "normal" and the load average to go back to zero, but a few hours after remounting, the issue is back. I can't tell if anything bad is happening though.

Tore and Daryl,

The load average issue has been reported and is known. It is in fact a side effect of this fix, but it is considered to be generally harmless. We do have bug 883905 filed to track and fix this in rhel going forward.

Daryl, you referred to xfsaild going into a "deadlocked" state. By that I assume you mean top shows xfsaild in D state (for IO wait), not that your filesystem has deadlocked. Is that an accurate assumption? In any event, if your issue is as described in bug 883905, I would appreciate it if we could take further comments over there. Thanks. :)

Hi, bug 883905 is private, so it is hard to say :) Yes, top shows the 'D' state as is indicated by comment 36. The filesystem appears to be a-ok.

(In reply to comment #39)
> Hi, bug 883905 is private, so it is hard to say :)

Oops! That should be fixed now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html

*** Bug 970111 has been marked as a duplicate of this bug. ***