Created attachment 315045 [details]
SysRq-t output from dash-01,2,3

Description of problem:
While trying to verify bug 458289 with a long run I found my three-node cluster unresponsive. I was only able to log into one of the three nodes, so I issued a sysrq-T and found that many processes were stuck waiting for GFS2 to free memory. For example:

sendmail      D ffffffff884f92c8     0 11020   8481                (NOTLB)
 ffff810031b5b948 0000000000000086 0000000000000018 ffffffff88589488
 0000000000000292 0000000000000008 ffff81003fb61860 ffffffff802e5ae0
 000013b85e598ebc 0000000000000d44 ffff81003fb61a48 000000008858adea
Call Trace:
 [<ffffffff88589488>] :dlm:request_lock+0x93/0xa0
 [<ffffffff884f92c8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff884f92d1>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff800639be>] __wait_on_bit+0x40/0x6e
 [<ffffffff884f92c8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009e107>] wake_bit_function+0x0/0x23
 [<ffffffff884f92c3>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884fc9bc>] :gfs2:gfs2_dinode_dealloc+0xfb/0x1a7
 [<ffffffff88507b0f>] :gfs2:gfs2_delete_inode+0xf1/0x18d
 [<ffffffff88507a64>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff88507a1e>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f4f4>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e2c75>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002ea85>] prune_dcache+0x10f/0x149
 [<ffffffff800e2cb5>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003f159>] shrink_slab+0xdc/0x153
 [<ffffffff800c5af4>] try_to_free_pages+0x1c2/0x2ba
 [<ffffffff8000f299>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff80012a0e>] __do_page_cache_readahead+0x95/0x1d9
 [<ffffffff8000f1b4>] __alloc_pages+0xe6/0x2ce
 [<ffffffff880f3d4d>] :dm_mod:dm_any_congested+0x38/0x3f
 [<ffffffff800133ad>] filemap_nopage+0x148/0x322
 [<ffffffff8000884b>] __handle_mm_fault+0x1f8/0xe5c
 [<ffffffff800a0084>] ktime_get_ts+0x1a/0x4e
 [<ffffffff8006686f>] do_page_fault+0x4cb/0x830
 [<ffffffff8009dec1>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8003357e>] do_setitimer+0xcf/0x4b7
 [<ffffffff8005dde9>] error_exit+0x0/0x84

Version-Release number of selected component (if applicable):
kernel-2.6.18-104.el5
gfs2-kmod-1.104-1.1.el5.abhi.1

How reproducible:
Unknown

Steps to Reproduce:
1. Run this load in dd_io: d_metaverify -i 0 -s creat,unlink,rename,mkdir,rmdir -d 10 -n 50 -w .

Actual results:
See above and attached.

Expected results:
GFS2 should be able to handle a long-running load.

Additional info:
Ok, I think we need to see the glocks here to make much of a judgement. I suspect that it's waiting for the rgrp glock, but there is no info here to say why that should be the case.
Hmm, so I can see what is wrong now.... when we take a page fault, we take the glock as the first thing. Then we call filemap_nopage(), which might allocate pages, and it uses the mapping's GFP setting, which will be GFP_KERNEL, resulting in it potentially trying to call into the inode freeing code, resulting in deadlock. So it might be possible to fix this by changing the mapping's GFP setting when we call filemap_nopage(). In the upstream code we avoid this by using the "normal" fault path, so this looks like a RHEL-only bug. Looking at the current code though, we might land up having to set GFP_NOFS on all GFS2 data mappings (we already do this for metadata mappings) when the inode is created. Not at all ideal, but it does get us out of this particular hole.
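The deadlock cycle described above can be modelled with a tiny userspace sketch (this is purely illustrative C, not kernel code; the flag values and the `reclaim()`/`fault_with_mapping_gfp()` names are invented stand-ins for the real allocator and fault paths):

```c
#include <assert.h>

/* Simplified stand-ins for the real GFP flags. */
#define __GFP_FS   0x1u         /* allocator may call back into filesystems */
#define GFP_KERNEL (__GFP_FS)
#define GFP_NOFS   0x0u

static int glock_held;          /* 1 while the faulting task holds the glock */

/* Models shrink_slab() -> prune_dcache() -> gfs2_delete_inode():
 * reclaim re-enters the fs and needs the glock the caller already holds. */
static int reclaim(unsigned int gfp)
{
    if (!(gfp & __GFP_FS))
        return 0;               /* GFP_NOFS: fs shrinkers skipped, no deadlock */
    if (glock_held)
        return -1;              /* would block forever on our own glock */
    return 0;
}

/* Models filemap_nopage() allocating with the mapping's GFP mask while
 * the page-fault path already holds the inode's glock. */
static int fault_with_mapping_gfp(unsigned int mapping_gfp)
{
    int ret;

    glock_held = 1;             /* glock taken first thing in the fault */
    ret = reclaim(mapping_gfp); /* the allocation may trigger reclaim */
    glock_held = 0;
    return ret;
}
```

With the mapping's mask left at GFP_KERNEL the model deadlocks; switching the mapping to GFP_NOFS breaks the cycle, which is the essence of the proposed fix.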
Created attachment 315777 [details]
Proposed patch

This is the simple "fix". Let's hope that there are no other issues caused by fixing things in this way.
I was able to reproduce this bug on a 4GB machine by introducing a memory hog. kernel-2.6.18-115.gfs2abhi.001

stress        D ffff810002376400     0  6681   6677   6680         (NOTLB)
 ffff8101148a7ad8 0000000000000086 0000000000000018 ffffffff884ba4bc
 0000000000000292 0000000000000007 ffff8101229fb7a0 ffffffff802f5ae0
 000005c5f4ced30d 000000000001e383 ffff8101229fb988 00000000884bbe1e
Call Trace:
 [<ffffffff884ba4bc>] :dlm:request_lock+0x93/0xa0
 [<ffffffff884e5138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff884e5141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063acd>] __wait_on_bit+0x40/0x6e
 [<ffffffff884e5138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b67>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009ea2a>] wake_bit_function+0x0/0x23
 [<ffffffff884e5133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884f4f9f>] :gfs2:gfs2_delete_inode+0x4e/0x18d
 [<ffffffff884f4f97>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff884f4f51>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f275>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e7027>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e72e>] prune_dcache+0x10f/0x149
 [<ffffffff800e7067>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef5e>] shrink_slab+0xdc/0x153
 [<ffffffff800c8e67>] try_to_free_pages+0x1c2/0x2ba
 [<ffffffff8000f25f>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff80008c60>] __handle_mm_fault+0x5fb/0xe5c
 [<ffffffff80066b9a>] do_page_fault+0x4cb/0x830
 [<ffffffff800630aa>] thread_return+0x62/0xf1
 [<ffffffff8000de52>] do_mmap_pgoff+0x66c/0x7d7
 [<ffffffff8005dde9>] error_exit+0x0/0x84
*** Bug 464932 has been marked as a duplicate of this bug. ***
I hit this panic while trying to reproduce this bug. I was running -118.gfs2abhi.001, which includes patches from bugs 463134, 437803, 462579, and 465876.

Kernel BUG at fs/gfs2/trans.c:34
invalid opcode: 0000 [1] SMP
last sysfs file: /fs/gfs2/dash:brawl0/lock_module/recover_done
CPU 0
Modules linked in: sctp lock_dlm gfs2 dlm configfs qla2xxx scsi_transport_fc ipd
Pid: 9749, comm: d_doio Not tainted 2.6.18-118.gfs2abhi.001 #1
RIP: 0010:[<ffffffff8861d013>]  [<ffffffff8861d013>] :gfs2:gfs2_do_trans_begin+4
RSP: 0018:ffff81002d2bd968  EFLAGS: 00010282
RAX: ffff81002d2860c0 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff81002cdc8000
RBP: ffff810026a60048 R08: ffff81002d2bc000 R09: ffff810026a7f060
R10: ffff81003dddf4c0 R11: 0000000000000246 R12: 0000000000008000
R13: ffff81002cdc8000 R14: ffff81002cdc8000 R15: 0000000000000001
FS:  00002ac529321020(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaacd2000 CR3: 000000002d2de000 CR4: 00000000000006e0
Process d_doio (pid: 9749, threadinfo ffff81002d2bc000, task ffff81002d2860c0)
Stack:
 ffff81002d2bd978 ffff810026a602c8 ffff810026a60048 0000000000000000
 ffff81002cdc8000 0000000000000000 0000000000000080 ffffffff88615dde
 ffff810026a7f1f8 ffff810026a7f1f8 ffff810026a7f1a8 0000000100002615
Call Trace:
 [<ffffffff88615dde>] :gfs2:gfs2_delete_inode+0x10a/0x191
 [<ffffffff88615d1a>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff88615cd4>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f2ae>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e6a8a>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e767>] prune_dcache+0x10f/0x149
 [<ffffffff800e6aca>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003efc0>] shrink_slab+0xdc/0x153
 [<ffffffff800c896b>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8861d0e9>] :gfs2:gfs2_do_trans_begin+0x102/0x144
 [<ffffffff8860f4d8>] :gfs2:gfs2_write_begin+0x19d/0x33f
 [<ffffffff88610ed3>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
 [<ffffffff88611309>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
 [<ffffffff8000c3d0>] __wake_up_bit+0x28/0x2d
 [<ffffffff886114ac>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
 [<ffffffff8009e526>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88611cca>] :gfs2:gfs2_open+0x0/0x13a
 [<ffffffff8001e698>] __dentry_open+0x101/0x1dc
 [<ffffffff80064d35>] lock_kernel+0x1b/0x32
 [<ffffffff886115fc>] :gfs2:gfs2_file_write+0x49/0xa7
 [<ffffffff8001671c>] vfs_write+0xce/0x174
 [<ffffffff80016fe9>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 0f 0b 68 0c 1c 62 88 c2 22 00 44 89 e0 09 d8 75 0a 0f 0b 68
RIP  [<ffffffff8861d013>] :gfs2:gfs2_do_trans_begin+0x2c/0x144
 RSP <ffff81002d2bd968>
<0>Kernel panic - not syncing: Fatal exception
There is only one memory allocation in gfs2_do_trans_begin(), namely:

tr = kzalloc(sizeof(struct gfs2_trans), GFP_NOFS);

so it's quite clear that this has the GFP_NOFS flag set. I've looked at the code in mm, and at the shrinker in dcache.c, and it seems that they do take the required __GFP_FS bit into account. So why this is happening is still a mystery to me at the moment.
The report in comment #7 seems to be from a kernel which is missing the patch attached to this bz. Maybe that's why it's still failing?
I reverted back to testing -115.gfs2abhi.001, which we're sure has the patch from comment #4. I constructed a test case as follows:

1. Run d_metaverify forever on a GFS2 file system.
2. In parallel, run a memory stress test which allocates 75% of memory, exits, then waits one minute before restarting.

I ran this with lock_nolock on one node for 24 hours without hitting any problems. I'm restarting the load with one node and lock_dlm to see if it can handle the load.
One node mounted with lock_dlm: PASS
Two nodes mounted with load on one: PASS
Two nodes mounted with load on two: FAIL

After reviewing the logs I see that dash-01 tried to reconnect DLM, probably for clvmd. There were no openais messages. On dash-02 I see that openais noticed that the token was lost and it started forming a new ring. Interspersed with the openais messages are dlm messages.

Oct 16 18:00:34 dash-02 openais[9105]: [TOTEM] Did not need to originate any messages in recovery.
Oct 16 18:00:35 dash-02 openais[9105]: [TOTEM] Sending initial ORF token
Oct 16 18:00:35 dash-02 kernel: dlm: closing connection to node 1
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] New Configuration:
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.170)
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] Members Left:
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] Members Joined:
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] New Configuration:
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.170)

dash-02 then fences dash-01.
Created attachment 321605 [details]
Further test patch

This needs to be combined with the earlier patch in order to build a kernel for testing. It includes the DLM fix that I sent upstream and also a few noinline annotations to try and narrow down the source of the problem further, should it happen again. Abhi, can you build a kernel from the latest RHEL with the two patches from this bug for testing?
Abhi has built a kernel, does that fix the problem?
I was able to reproduce the issue on GFS2 with Abhi's -120.gfs2abhi.001 build. It took 4:48:08 to reproduce while running the load on two nodes. I had top and slabtop running on all nodes when the fence action occurred.

Configuration:
- 3-node cluster: dash-01, dash-02, dash-03
- file system mounted on all nodes
- load running on dash-01 and dash-02

Results:
- dash-02 was fenced
- top shows gfs2_scand was using 0.07% CPU before the fence action
- dlm_send on dash-01 is hogging CPU time
- sysrq-t didn't show a stack for dlm_send, but I got something out of sysrq-w

SysRq : Show CPUs
CPU0:
 ffffffff80431dc8 0000000000000000 0000000000000001 0000000000000000
 ffffffff801a3b5d ffff81002f8f9d78 0000000000000006 ffffffff801a3b8c
 ffff81002f8f9d78 ffffffff80091877 ffffffff80321520 0000000000000077
Call Trace:
 <IRQ> [<ffffffff801a3b5d>] showacpu+0x0/0x3b
 [<ffffffff801a3b8c>] showacpu+0x2f/0x3b
 [<ffffffff80091877>] on_each_cpu+0x19/0x22
 [<ffffffff801a3a92>] __handle_sysrq+0x90/0x121
 [<ffffffff801b2718>] receive_chars+0x128/0x276
 [<ffffffff801b38ae>] serial8250_interrupt+0x7a/0xfe
 [<ffffffff80010bd1>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800b78be>] __do_IRQ+0xa4/0x103
 [<ffffffff8001214f>] __do_softirq+0x89/0x133
 [<ffffffff885e68c6>] :dlm:process_send_sockets+0x0/0x15d
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8009a8be>] queue_work+0x4e/0x57
 [<ffffffff8004d25b>] run_workqueue+0x94/0xe4
 [<ffffffff80049b14>] worker_thread+0x0/0x122
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049c04>] worker_thread+0xf0/0x122
 [<ffffffff8008a22e>] default_wake_function+0x0/0xe
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
After further investigation I found I'm starting to get processes stuck on memory allocations again.

ps            D ffff810001576400     0 15620  11465               (NOTLB)
 ffff8100045a5998 0000000000000086 0000000000000018 ffffffff885e2abe
 0000000000000292 0000000000000008 ffff81002f8dc820 ffffffff802f4ae0
 0000137ee8d393c6 00000000000205ef ffff81002f8dca08 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
top           D ffff8100301a07a0     0 11768  11723               (NOTLB)
 ffff810036cffbd8 0000000000000082 0000000000000018 ffffffff885e2abe
 0000000000000292 000000000000000a ffff81000d7e80c0 ffff8100301a07a0
 0000130d7bb09916 000000000007cbec ffff81000d7e82a8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff88687556>] :lock_dlm:gdlm_ast+0x0/0x311
 [<ffffffff886872c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
slabtop       D ffff810001576400     0 11770  11722               (NOTLB)
 ffff81003403b978 0000000000000082 0000000000000018 ffffffff885e2abe
 0000000000000296 0000000000000009 ffff810014e9f100 ffffffff802f4ae0
 000013aac6c63807 0000000000019f95 ffff810014e9f2e8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
gfs2_quotad   D ffff810001576400     0 11425     15  14610 11424  (L-TLB)
 ffff81002d109d50 0000000000000046 0000000000000018 ffffffff885e2abe
 0000000000000282 000000000000000a ffff81002d018080 ffffffff802f4ae0
 000012c24b40c9a1 000000000001e119 ffff81002d018268 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff88622f7c>] :gfs2:gfs2_statfs_sync+0x45/0x201
 [<ffffffff8004accb>] try_to_del_timer_sync+0x51/0x5a
 [<ffffffff88622f74>] :gfs2:gfs2_statfs_sync+0x3d/0x201
 [<ffffffff80063964>] schedule_timeout+0x92/0xad
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88607490>] :gfs2:gfs2_quotad+0x50/0x154
 [<ffffffff88607440>] :gfs2:gfs2_quotad+0x0/0x154
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

gfs2_logd     S ffff810001576400     0 11424     15  11425 11421  (L-TLB)
 ffff81002058de20 0000000000000046 ffff81002c92d048 ffff81002058de30
 ffffffff88607fa3 000000000000000a ffff810013a8f0c0 ffffffff802f4ae0
 000014a582097b64 000000000000160d ffff810013a8f2a8 000000002c92d048
Call Trace:
 [<ffffffff88607fa3>] :gfs2:gfs2_dirent_find+0x0/0x4e
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8006395c>] schedule_timeout+0x8a/0xad
 [<ffffffff80094d76>] process_timeout+0x0/0x5
 [<ffffffff88607424>] :gfs2:gfs2_logd+0x140/0x15c
 [<ffffffff886072e4>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

dlm_recoverd  S ffff8100267b5d80     0 10998     15  11405 10997  (L-TLB)
 ffff81002f90dea0 0000000000000046 ffff81003271e800 ffffffff885e9da1
 0000000000000000 000000000000000a ffff81002f8d9080 ffff81002f8d4040
 00001210455d920d 0000000000002eda ffff81002f8d9268 00000000885e8d07
Call Trace:
 [<ffffffff885e9da1>] :dlm:dlm_wait_function+0xf9/0x135
 [<ffffffff885e95e5>] :dlm:dlm_rcom_status+0xae/0x179
 [<ffffffff80063bb6>] mutex_lock+0xd/0x1d
 [<ffffffff8002feba>] __up_write+0x27/0xf2
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885ea78c>] :dlm:dlm_recoverd+0x56/0x47f
 [<ffffffff885ea736>] :dlm:dlm_recoverd+0x0/0x47f
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

dlm_send      D ffff810001576400     0 10997     15  10998 10996  (L-TLB)
 ffff81002f8f9850 0000000000000046 0000000000000018 ffffffff885e2abe
 0000000000000282 0000000000000001 ffff81002f8d97e0 ffffffff802f4ae0
 000012be3226bd3b 000000000001ceb9 ffff81002f8d99c8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8000f191>] __alloc_pages+0xe6/0x2ce
 [<ffffffff8001762b>] cache_grow+0x137/0x395
 [<ffffffff8005bda0>] cache_alloc_refill+0x136/0x186
 [<ffffffff8000a96e>] kmem_cache_alloc+0x6c/0x76
 [<ffffffff80043c1d>] sk_alloc+0x2e/0xf3
 [<ffffffff8005982b>] inet_create+0x137/0x267
 [<ffffffff8004cab1>] __sock_create+0x170/0x27c
 [<ffffffff885e68c6>] :dlm:process_send_sockets+0x0/0x15d
 [<ffffffff885e634c>] :dlm:tcp_connect_to_sock+0x70/0x1de
 [<ffffffff8001c8ff>] __mod_timer+0xb0/0xbe
 [<ffffffff80054ec6>] sk_reset_timer+0xf/0x19
 [<ffffffff80032dd0>] __tcp_push_pending_frames+0x7a0/0x87a
 [<ffffffff8002dfe0>] __alloc_skb+0x77/0x123
 [<ffffffff885e68e2>] :dlm:process_send_sockets+0x1c/0x15d
 [<ffffffff8004d25b>] run_workqueue+0x94/0xe4
I just flipped over to see what slabtop was doing on dash-01. This looks interesting:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
263555 263555 100%    1.50K  52711        5    421688K TCP
263748 263748 100%    0.62K  43958        6    175832K sock_inode_cache
 78800  76634  97%    0.78K  15760        5     63040K gfs2_inode
 76216  68488  89%    0.52K  10888        7     43552K radix_tree_node
152100 147624  97%    0.25K  10140       15     40560K size-256
Updating component to kernel; gfs2-kmod doesn't exist any more. Should we have a gfs2-kernel now, or is plain kernel OK? Having said that, this seems to be more dlm than gfs2, so maybe it should be dlm-kernel, but I'll leave it at plain kernel for now.
I was able to hit the issue again, but this time fencing got stuck looking for memory, so I was able to gather some more information. Same configuration as comment #14.

I found the system in a state where dash-01 was attempting to fence dash-02, but fence_apc was stuck trying to allocate memory with gfs2_delete_inode in its stack.

On dash-03 I see that dash-02 was removed from the cluster:

Oct 28 23:13:43 dash-03 openais[10896]: [TOTEM] The token was lost in the OPERATIONAL state.
...
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] New Configuration:
Oct 28 23:13:48 dash-03 kernel: dlm: closing connection to node 2
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.170)
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] Members Left:
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.169)
...
Oct 28 23:13:49 dash-03 kernel: dlm: connect from non cluster node
Oct 28 23:13:49 dash-03 kernel: dlm: connect from non cluster node
Oct 28 23:13:49 dash-03 fenced[10919]: fencing deferred to dash-01
Oct 28 23:13:54 dash-03 kernel: dlm: connect from non cluster node

On dash-02 at the same time I found dlm trying to reconnect, but no openais log messages:

Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 3
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 1
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 3
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 1

The stack trace from dash-02's aisexec shows that it is stuck looking for memory and caught in the gfs2/dlm mess.
aisexec       D ffff81002b46a100     0  9928      1   9929  9923  (NOTLB)
 ffff81002b3e1818 0000000000000086 0000000000000018 ffffffff885d5abe
 0000000000000296 0000000000000001 ffff81002dc35080 ffff81002b46a100
 00002155a9b007ed 00000000000166d8 ffff81002dc35268 00000000885d5be2
Call Trace:
 [<ffffffff885d5abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8867a556>] :lock_dlm:gdlm_ast+0x0/0x311
 [<ffffffff8867a2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff88600138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff88600141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff88600138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff88600133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8860357b>] :gfs2:gfs2_dinode_dealloc+0xfb/0x1a7
 [<ffffffff8860fde3>] :gfs2:gfs2_delete_inode+0xeb/0x191
 [<ffffffff8860fd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8860fcf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8024e7cd>] udp_push_pending_frames+0x21e/0x243
 [<ffffffff8003c19b>] __get_free_pages+0xe/0x71
 [<ffffffff8001e814>] __pollwait+0x58/0xe2
 [<ffffffff8002ff9d>] unix_poll+0x18/0x96
I don't see how GFP allocation flags have anything to do with this. In fact, I don't see how memory allocations at all have anything to do with this. No one is even blocked trying to allocate memory (much less blocked on an allocation that's in turn trying to flush gfs2 objects to free up memory.)

The obvious common factor in all these traces is that:

1. gfs2 inode free/delete/dealloc is in the kernel's general free-up-memory path, like inodes from other fs's
2. gfs2 inode free/delete/dealloc requires doing a lock operation

I say "free/delete/dealloc" because it's not clear to me how/why those are related in the traces.

The backtraces clearly begin with the system trying to free up memory, a normal kind of thing that's expected. This calls into gfs2 to free up inodes. But in gfs2 this appears to translate into *ondisk deallocation* of disk blocks, not just freeing up incore memory. That doesn't seem like a good thing to be doing. For one thing, the ondisk operation ends up taking dlm locks, and waiting on them. It looks very wrong to be doing all this in response to the system just wanting to free up memory from cached inodes.

I'm guessing the way gfs1's inodes are freed is designed to avoid these problems. I'm not sure at which level it solves the problem. At the top level perhaps gfs1 has its own way of freeing inodes (e.g. via its own thread or something) which avoids this. And/or I suspect that freeing inodes is not related to ondisk deallocation in gfs1. And at a minimum, I really doubt gfs1 requires taking dlm locks to free an inode. So my suspicion is that this is a gfs2 design issue.
To correct what I wrote above: there are memory allocations which everything gets blocked on, but they are not usually explicit allocations, much less allocations in our own code. The kernel just finds, at any random point in the context of any random process, that it needs to free up memory, which causes it to try to free up memory from things like incore inodes. GFS blocking on a long dlm lock operation in the context of any random process that happens to trigger kernel memory reclaim is simply not going to work.
gfs1 and the original gfs2 have nothing attached to the .delete_inode hook, so they don't do anything in the code paths shown above.
To try and answer some of the points above: there was an original bug (in gfs2) which started off this particular bugzilla, and that indeed was fixed and has (just) gone into RHEL. While doing tests to try and verify that fix, some other issues have come up. One of the problems in tracking down what those issues are is that they seem to be memory related and, when the problems occur, everything is locked up, so we are a bit short on data, but things are getting clearer as time goes on at least.

So the issue relating to memory allocations and the filesystem is that unless we specify GFP_NOFS, an allocation can trigger reclaim which calls back into the filesystem. The only cache which affects this situation is the dcache, since usually the dcache is holding the last reference to an inode. Assuming that the inode has a zero link count, then we try to delete it (deallocate it) when it's evicted from the inode cache. This is identical to any other filesystem, aside from gfs1, which did evict its vfs inodes this way but left the gfs2_inodes in cache.

Now it's a known problem with the vfs that when this activity happens it tends to result in rather inefficient I/O due to the "random" order of cache evictions. So there is certainly some room for improvement.

As for the taking of dlm locks at the point of deallocation, that's unavoidable since we have to lock the various objects (inode, resource groups, etc.) involved. So the only way to avoid that in the memory reclaim path would be to move the time at which inodes are removed from the inode cache to some other point in time by overriding the d_iput() function. The disadvantage of doing that is that we'd land up not really freeing much memory at all when the dcache was squeezed, and that might well give us other problems. We'd then have to work out when was a suitable time to actually dispose of those inodes, and hope that we had enough memory to do so given that we didn't dispose of them when the vfs/vm first requested it.
If we can't think of anything better by tomorrow, then we can certainly try that. Either way, the patch which I sent for the upstream dlm does fix real issues, and ones which might be hit from other points in the code, even if they haven't been so far.

Now this may well all be moot anyway, since the trace from comment #18 has again stopped in dlm:request_lock (or appears to have done) and, since I added a bunch of noinline annotations to the code in my recent test patch, if we had got stuck allocating memory then it should have shown up there, since dlm:request_lock doesn't allocate any memory directly itself. Also it may well be that the entries at the end of that stack trace aren't actually real, and that it might have stopped further up, my bet being gfs2:just_schedule, i.e. waiting for a lock from another node.
Yeah, the stuff above the schedule is "left over". gfs1 played tricks like you mention (e.g. artificial reference counts) to change normal kernel behavior and give itself control over when to do things like taking dlm locks or doing ondisk operations. And gfs not freeing as much memory as an ordinary fs might would definitely be one of the trade-offs (at least in the normal memory reclaim paths; gfs had other mechanisms to try to compensate, IIRC). I think you'll be forced to do something like that, because gfs can't block any random process (esp. something like aisexec) for some indefinite period of time to do a remote lock operation.
Even with Christine's patch to connect DLM early, I hit the bug. I don't think the problem is necessarily that DLM can't connect under memory pressure. It was connected and it got disconnected. I see in the logs:

dlm: closing connection to node 3
dlm: connect from non cluster node

which seems to indicate that perhaps the dlm disconnects because it got evicted from the cluster by a hung aisexec process. The logs seem to support that.

Oct 31 11:39:39 dash-02 openais[10097]: [TOTEM] entering GATHER state from 12.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering GATHER state from 11.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] Saving state aru 58 high seq received 58
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] Storing new sequence id for ring 98
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering COMMIT state.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering RECOVERY state.
...
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] New Configuration:
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 31 11:39:44 dash-02 kernel: dlm: closing connection to node 3
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 31 11:39:44 dash-02 kernel: dlm: connect from non cluster node
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] Members Left:
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.170)

The logs on dash-03 in this case show that aisexec is stuck down the gfs2_delete_inode path.
There are three possible paths I'm aware of which might land up calling into the DLM as a result of memory pressure:

1. shrinking the dcache, resulting in a final iput of an inode (as per above)
2. shrinking the icache, resulting in the drop of an unreferenced inode
3. shrinking the page cache, resulting in a transaction during writeback (jdata)

In the case of #1 and #2 this only happens if i_nlink is 0. I suspect that #2 is a rare case compared with #1, but there seems to be no reason to suppose that it cannot happen. I have the beginnings of a patch which might address cases #1 and #2. I don't yet have any idea how to deal with #3, and I don't think it's possible in gfs2 alone. There is also (in my current patch) the question of when to finally flush out the inodes whose deallocation has been deferred. In the case of umount, there are various bits of code which wait upon the destruction of inodes, so flushing them at that point will be critical, assuming that we don't want to hang forever. Also we'll need to export a couple of new functions from the VFS in order to allow us to defer the "delete inode" part of the state machine. I'm wondering whether it wouldn't be better to actually make this part of the VFS itself rather than specific to GFS2. I'm sure that OCFS2 will need the same thing, for example, since they have followed our lead in this area. Also, due to #2, my suggestion of using d_iput as a hook in comment #22 doesn't work, as we must ensure that we hook into the inode state machine directly.
I'd like to know which specific backtrace would be resolved by which specific part of this dlm allocation patch: https://www.redhat.com/archives/cluster-devel/2008-November/msg00032.html The only thing I've heard about the patch above is that it changes the overall behavior of the given test system, and the general effect is that the problem takes longer to appear. I think we need more solid facts if we're going to risk making a last-minute change.
Ok, so there are several issues here, but let's start with comment #15 and the dlm_send backtrace. At that point it's possible to see that we have a path from dlm_send back into the filesystem via a GFP_KERNEL allocation. At the moment we can do nothing about that particular allocation, since it's part of the socket code, but having looked at that problem in some detail, I'm intending to ask the net developers their opinion on allowing us to specify the GFP setting for socket creation. With Chrissie's patch, we proved that moving the socket creation to an earlier point in time reduced the likelihood of having to allocate a socket during a call into the DLM from the filesystem. I'm not going to suggest that we should use that patch, however, as I think the real fix for this is to allow GFP_NOFS allocation of sockets. Nevertheless, that patch was very helpful in allowing us to narrow down the problem. We now know that using GFP_KERNEL when calling down in the context of the filesystem is going to fail. There was a bug about this issue some time ago (whose bz number I have forgotten and can't find at the moment) which resulted in the addition of ls->ls_allocation in the first place. There seems little point in having that feature if it's ignored in some of the allocations, so the patch I recently posted to cluster-devel fixes all of the places which use GFP_KERNEL aside from two. One of those is the socket allocation, for the reasons previously stated, and the other is the kmap(), which, as mentioned in the patch comment, will never cause a problem since with the current code it will never be passed a highmem page anyway. So I fully realise that this patch will not actually fix the socket allocation problem, but it will fix all the other possible cases. I'm also quite happy with the decision that it's too invasive for RHEL at the moment; as long as it gets into upstream, I'm happy with that.
The point I was making at the meeting is that we need to decide very quickly now whether it will go into RHEL 5.3, even if that decision is "no". Also, with regard to why I'd like to fix things this way, see case #3 of comment #25, where it's possible for jdata writeback to trigger something similar. If I could find the details of the previous bug where ls->ls_allocation was added in the first place, I have a feeling that we used the same reasoning when that was fixed. In fact I look at this as the same bug; it's just that some cases were missed when the original patch was produced.
There's no doubt your patch is needed; as you say, the whole point of ls_allocation is to use it and avoid looping back into gfs. And I'm not really worried about regressions from the part of the patch that does that (I'd split and leave the kmap changes for later). My only question is whether adding this patch will fix any of the problems we have seen. If it will, then I'll post it (without kmap bits) for 5.3 with that justification.
I suspect that we won't hit the problems the current patch was designed to fix very often until we can fix the socket allocation issue. I suspect the socket allocation triggers first because we land up allocating two bits of memory in quick succession (the inode and the struct sock) which are both fairly large, whereas the other allocations are generally pretty small and fairly rare too. Since this does take some time to set up and hit, I think we'd have to do a lot of runs before we could say for certain that we'd never hit it. I'd tend to err on the side of caution and add it anyway. I guess you'll need to clone this bug for that, since we need to keep this one around for the remainder of the issues.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: There is a known issue relating to VM pressure when a GFS2 filesystem is mounted and there are cached GFS2 inodes which have been unlinked. This can only occur if the inodes have been used (accessed) on one node and then unlinked on a different node. Occasionally, a node may hang if these two conditions occur at the same time. The node will be fenced and recovered via the normal cluster recovery mechanism, so the impact is the unavailability of a single node for a period of time and does not, in general, require user intervention. A fix is scheduled for RHEL 5.4 (see bugzilla #460218). Users who suspect that they have this particular issue should look for the occurrence of gfs2_dinode_dealloc and shrink_dcache_memory in stack traces of stuck processes. Please also note that this issue does not affect single node GFS2 filesystems.
This bug is now documented in the RHEL 5.3 release notes. Please consult the following link to view the most recent build of this document: http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.3/html-single/Release_Notes/ This document is synced daily; as such, any recent additions are expected to show up within 24 hours or so.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:

@@ -1,5 +1,3 @@
-There is a known issue relating to VM pressure when a GFS2 filesystem is mounted and there are cached GFS2 inodes which have been unlinked. This can only occur if the inodes have been used (accessed) on one node and then unlinked on a different node. Occasionally, a node may hang if these two conditions occur at the same time. The node will be fenced and recovered via the normal cluster recovery mechanism, so that the impact is the unavailability of a single node for a period of time and does not, in general require user intervention.
+If your system has a GFS2 file system mounted, a node may hang if a cached inode is accessed in one node and unlinked on a different node. When this occurs, the hung node will be unavailable until you fence and recover it via the normal cluster recovery mechanism. The function calls gfs2_dinode_dealloc and shrink_dcache_memory will also appear in the stack traces of any processes stuck in the hung node.
-A fix is scheduled for RHEL 5.4 (see bugzilla #460218). Users who suspect that they have this particular issue should look for the occurrence of gfs2_dinode_dealloc and shrink_dcache_memory in stack traces of stuck processes.
+This issue does not affect single-node GFS2 file systems.
-
-Please also note that this issue does not affect single node GFS2 filesystems.
Created attachment 329073 [details] Patch to make the DLM connect as soon as a node joins a lockspace For reference, this is the patch mentioned in comment #24
Created attachment 329702 [details] The dlm connection patch Really attach the patch this time. The previous attachment seemed to be a single line file containing a copy of the filename.
The requested information was provided in comment #31.
Created attachment 329711 [details] Patch to use GFP_NOFS for dlm recovery

It occurred to me that we hadn't really gone through all the possible ways in which we might have landed up in the fs reclaim code. Since we did the original tests, a number of changes have gone into the upstream code:

a) The original patch in this bz has gone into the dlm
b) The write_begin function has gained its own GFP_NOFS allocation status, so we don't need to mark the whole address space as GFP_NOFS for "normal" inodes
c) Chrissie's "connect early" patch is very small and simple, and certainly a lot less invasive than the other option (GFP_NOFS allocation of sockets)
d) I've also now (in this attachment) been through all of the recovery code and checked each and every allocation to ensure that we will not be stuck waiting in a filesystem when we are asking for memory to recover a cluster.

That, I think, should close all possible windows for problems through the dlm itself. So if this doesn't work, then we'll have to look at the socket side of things again, which is what I hope we might otherwise avoid. Fingers crossed :-)
Perhaps we need two bz's for the two issues here:

1. dlm: dlm_send can create sockets
2. gfs2: can take dlm locks when the vm asks it to free inodes to free up mem

The argument against 2 is very simple:
. taking a dlm lock can block for an indefinite period of time
. an app may fail if the kernel blocks it for an indefinite period of time
. therefore, the kernel should not take a dlm lock in an application's context

More specifically:

1. app does something requiring kernel memory
2. in the app's context, the vm sees a memory shortage and tries to free mem
3. vm asks fs's on the system to drop cached inodes to free up mem
4. gfs2 is asked to free an inode, for which it attempts to acquire a dlm lock
5. dlm lock blocks for a long time
6. app is blocked for a long time
7. app fails

To fix this, gfs2 in step 4 should either decline to free any inodes for which it would need to acquire dlm locks, or defer steps like acquiring dlm locks to another thread (one of its own). Going back to the initial two issues: solving the dlm issue (1) will not in any way resolve or help the gfs2 issue (2). However, solving the gfs2 issue (2) may well make the dlm issue (1) a non-problem. If we want to further address the dlm/socket issue, let's do that in another bz. This bz should focus on the gfs2 issue.
To clarify: an app accessing a gfs file system should, of course, expect the kernel to acquire dlm locks in its context and possibly block for indefinite periods. It is random other apps not using gfs that could easily fail if forced to block for indefinite periods.
We should probably also take a look at OCFS2 to see what they do, but I suspect they suffer from the same issue. All apps have to block on memory allocation for indeterminate lengths of time - that's the mechanism which prevents them running the system totally out of memory. It should not be "forever", but depending on the system load it might be as close as makes no practical difference; we can only do our best for "reasonable" loads, so we do need to consider what is "reasonable" and what isn't. From the app's point of view it isn't any different to blocking while waiting for I/O over some other channel, whether network or some local bus such as SCSI. It still might be worth making some changes in the inode clear/delete space though. It would be nicer to have some kind of "batch" system where the fs is handed a bunch of inodes at the same time. That would allow sensible scheduling of I/O and also glocks. It's a large project though, so not something I'd want to do right now - and in any case it would be at least RHEL6 rather than RHEL5.4. It is a recognised issue that filesystems are not able to control writeback as much as they'd like to, and that ordering of I/O is not as good as it might be. When we get to the delete_inode code, we don't know if another node is still holding the inode open. If so, we can just forget about the inode rather than removing it. If there is another node holding the lock, then it's reasonable to assume that the inode must be in use, so we should only need a try lock at that stage. The reason that we can't stick a try flag on the inode lock is that we must ensure that we send a callback on the iopen lock to pass on the message that the inode's link count has hit zero. Potentially we could change that, I suppose, and do a try lock on the iopen whether or not we get the inode's own lock. I'm not sure it improves the situation though.
If the socket allocation needs to be made, I presume it does not matter much whether the lock is a try lock or not. If try locks were to fail to send callbacks in the event of a connection issue requiring the allocation of a new socket (I mean the _1CB variant of a try lock here), that would break the assumptions the unlinked inode handling is based upon. So although it's a rather nasty issue to solve, I don't see any other practical way at the moment.
apps will not block on memory allocations for long periods under normal or even stressful conditions. dlm locks can regularly be expected to block for long periods under normal conditions.
I suppose it depends on what you define as normal conditions... any app that is dirtying pages at a rate above that at which they can be written to storage will eventually have to block. The length of time it blocks for depends upon how early it is detected and the relative rates of writing and page dirtying. I don't think it makes any difference whether we are waiting for I/O over a congested network link or for a dlm lock over the same link: either will be slow. Likewise it doesn't matter whether we have a network block device server (I mean a generic one here, not nbd specifically) which is overloaded, or a heavy workload sharing the same node as the dlm lock master for a required lock. The result is the same in the end. So we can do our best to try and minimise these issues by such means as the recent work on I/O queuing priorities which was done to solve fsync() issues, but the possibility will always remain that it can happen from time to time. One might also reasonably argue that running a test load deliberately designed to place as much stress on the VM as possible, whilst at the same time killing random nodes in the cluster, is not "normal use". I don't think that blocking for a long time is really the issue here anyway. What we want to do is break the cycle which results in complete deadlock from time to time. If it just gets very slow from time to time, then that's ok, I think.
We're not talking about doing i/o; we're talking about a random process that might incidentally allocate memory at any point in time being blocked on a dlm lock for a long period. Just look at the backtraces: you have processes like "top", "sendmail" and "aisexec" doing gfs2 disk deallocation and blocking on dlm locks! If "top" goes out to lunch for a while it may not matter, but if you do that to "aisexec" or who knows what other customer application, things don't work.

> I don't think that blocking for a long time is really the issue here anyway.

Like I keep repeating, there are two separate issues here:

1. gfs2 blocking random processes on dlm locks if they happen to need memory at the wrong time
2. dlm_send creating sockets and deadlocking because of 1

If you want to discuss the dlm issue, let's open a separate bz and look at how we can isolate socket creation from the send path. But it doesn't fix the bigger gfs2 problem.
If you want to split off the dlm side of this into another bz, that's ok by me. I didn't know that it would be possible to isolate the socket creation, as there is always the risk of needing to connect to new nodes, but if it is possible, that would certainly solve that issue. With regard to:

> We're not talking about doing i/o,

I disagree - after all, that's why we are getting stuck: because the dlm is sending a message which requires a socket allocation. It doesn't really matter all that much whether the I/O is via the network (dlm) or via a block device (other local fs), or even both; it's all I/O and it can all block for long periods of time depending on circumstances. If we don't block when disposing of inodes, there is nothing to stop an app creating and unlinking them until it has run the node in question out of memory. So if we can't dispose of them fast enough, then we have to have some way to slow down the rate of inode creation until the two are in balance. Given some upstream changes we might well be able to make disposal of inodes a more efficient process, and that's something I'd very much like to see, but we'll still have to have some kind of rate limiting for memory allocations, and blocking in ->delete_inode() is not an unreasonable thing to do from time to time. As I mentioned in the penultimate paragraph of comment #41, changing the initial lock to a try lock might well be possible, and that should solve the issue of blocking for a long time due to contention on the lock, so that is certainly something we should add to the list. I'm certainly not trying to pretend that GFS2 is perfect in this area and things could be improved, but we need to break the cycle of dependencies first (i.e. item 2 in your list above), and then we can worry about improving the performance of inode deletion later. So I think we both agree that both tasks need doing. The only question is how to actually do them at this stage.
I think we can also be fairly sure that this won't hit 5.4, so we can take some time to look at the issues properly and come up with (I hope) reasonable solutions. If you are happy with that, I'll twiddle the flags on this bz accordingly.
You're talking about what happens when dlm_send creates a socket, allocates memory and deadlocks. I'm not. I'm talking about what happens when dlm_send is not creating a socket, not allocating memory, not deadlocking -- dlm_send is working just fine, and dlm messages are going out and coming in just fine. The problem I'm talking about is that a dlm lock is often not granted for quite a while even when everything is working properly. gfs2 waits for dlm locks in the context of processes that incidentally allocate memory, which triggers the vm to reclaim gfs2 inodes, at which point gfs2 waits for dlm locks. The dlm locks will eventually be granted, but the delay is long enough that the innocent process picked to do gfs2's work has timed out, or failed in some way because it was blocked for too long.
There is now a fix for this bug. Removing relnotes as they will not be required for 5.4.
Deleted Release Notes Contents. Old Contents: If your system has a GFS2 file system mounted, a node may hang if a cached inode is accessed in one node and unlinked on a different node. When this occurs, the hung node will be unavailable until you fence and recover it via the normal cluster recovery mechanism. The function calls gfs2_dinode_dealloc and shrink_dcache_memory will also appear in the stack traces of any processes stuck in the hung node. This issue does not affect single-node GFS2 file systems.
Created attachment 344187 [details] upstream dlm early connection patch Will be pushing this to linux-next for merging in the next cycle. It fixes a major problem in the earlier version of the connection patch that caused nodes to connect to themselves, leading to oopses.
Created attachment 344201 [details] upstream dlm allocation patch Will be pushing this to linux-next for merging in the next cycle.
Created attachment 344373 [details] Combined RHEL5 patch This is the combined RHEL5 patch from Dave's upstream patch above. There's one hunk in member.c that's missing in the latest RHEL5 kernel. Could you guys take a look before I post to rhkernel-list?
Abhi, please keep the patches separate in RHEL5 like they are upstream, and include the original description from each patch in the patch posted for rhel. Otherwise they look fine.
http://post-office.corp.redhat.com/archives/rhkernel-list/2009-May/msg00717.html http://post-office.corp.redhat.com/archives/rhkernel-list/2009-May/msg00718.html Posted the above patches to rhkernel-list
in kernel-2.6.18-150.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I ran the test case for about 18 hours and was not able to reproduce this hang. Marking VERIFIED based on kernel-2.6.18-154.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html