Created attachment 315045 [details]
SysRq-t output from dash-01,2,3

Description of problem:
While trying to verify bug 458289 with a long run I found my three-node cluster unresponsive. I was only able to log into one of the three nodes, so I issued a sysrq-T and found that many processes were stuck waiting for GFS2 to free memory. For example:

sendmail      D ffffffff884f92c8     0 11020   8481                (NOTLB)
 ffff810031b5b948 0000000000000086 0000000000000018 ffffffff88589488
 0000000000000292 0000000000000008 ffff81003fb61860 ffffffff802e5ae0
 000013b85e598ebc 0000000000000d44 ffff81003fb61a48 000000008858adea
Call Trace:
 [<ffffffff88589488>] :dlm:request_lock+0x93/0xa0
 [<ffffffff884f92c8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff884f92d1>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff800639be>] __wait_on_bit+0x40/0x6e
 [<ffffffff884f92c8>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009e107>] wake_bit_function+0x0/0x23
 [<ffffffff884f92c3>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884fc9bc>] :gfs2:gfs2_dinode_dealloc+0xfb/0x1a7
 [<ffffffff88507b0f>] :gfs2:gfs2_delete_inode+0xf1/0x18d
 [<ffffffff88507a64>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff88507a1e>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f4f4>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e2c75>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002ea85>] prune_dcache+0x10f/0x149
 [<ffffffff800e2cb5>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003f159>] shrink_slab+0xdc/0x153
 [<ffffffff800c5af4>] try_to_free_pages+0x1c2/0x2ba
 [<ffffffff8000f299>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff80012a0e>] __do_page_cache_readahead+0x95/0x1d9
 [<ffffffff8000f1b4>] __alloc_pages+0xe6/0x2ce
 [<ffffffff880f3d4d>] :dm_mod:dm_any_congested+0x38/0x3f
 [<ffffffff800133ad>] filemap_nopage+0x148/0x322
 [<ffffffff8000884b>] __handle_mm_fault+0x1f8/0xe5c
 [<ffffffff800a0084>] ktime_get_ts+0x1a/0x4e
 [<ffffffff8006686f>] do_page_fault+0x4cb/0x830
 [<ffffffff8009dec1>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80062f4b>] thread_return+0x0/0xdf
 [<ffffffff8003357e>] do_setitimer+0xcf/0x4b7
 [<ffffffff8005dde9>] error_exit+0x0/0x84

Version-Release number of selected component (if applicable):
kernel-2.6.18-104.el5
gfs2-kmod-1.104-1.1.el5.abhi.1

How reproducible:
Unknown

Steps to Reproduce:
1. Run this load in dd_io: d_metaverify -i 0 -s creat,unlink,rename,mkdir,rmdir -d 10 -n 50 -w .

Actual results:
See above and attached.

Expected results:
GFS2 should be able to handle a long-running load.

Additional info:
Ok, I think we need to see the glocks here to make much of a judgement. I suspect that it's waiting for the rgrp glock, but there is no info here to say why that should be the case.
Hmm, so I can see what is wrong now.... when we take a page fault, we take the glock as the first thing. Then we call filemap_nopage(), which might allocate pages, and it uses the mapping's GFP setting, which will be GFP_KERNEL, resulting in it potentially trying to call into the inode freeing code, resulting in deadlock. So it might be possible to fix this by changing the mapping's GFP setting when we call filemap_nopage(). In the upstream code we avoid this by using the "normal" fault path, so this looks like a RHEL-only bug. Looking at the current code though, we might land up having to set GFP_NOFS on all GFS2 data mappings (we already do this for metadata mappings) when the inode is created. Not at all ideal, but it does get us out of this particular hole.
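The deadlock cycle described above can be modelled with a tiny userspace sketch (this is purely illustrative C, not kernel code; the flag values and the `reclaim()`/`fault_with_mapping_gfp()` names are invented stand-ins for the real allocator and fault paths):

```c
#include <assert.h>

/* Simplified stand-ins for the real GFP flags. */
#define __GFP_FS   0x1u         /* allocator may call back into filesystems */
#define GFP_KERNEL (__GFP_FS)
#define GFP_NOFS   0x0u

static int glock_held;          /* 1 while the faulting task holds the glock */

/* Models shrink_slab() -> prune_dcache() -> gfs2_delete_inode():
 * reclaim re-enters the fs and needs the glock the caller already holds. */
static int reclaim(unsigned int gfp)
{
    if (!(gfp & __GFP_FS))
        return 0;               /* GFP_NOFS: fs shrinkers skipped, no deadlock */
    if (glock_held)
        return -1;              /* would block forever on our own glock */
    return 0;
}

/* Models filemap_nopage() allocating with the mapping's GFP mask while
 * the page-fault path already holds the inode's glock. */
static int fault_with_mapping_gfp(unsigned int mapping_gfp)
{
    int ret;

    glock_held = 1;             /* glock taken first thing in the fault */
    ret = reclaim(mapping_gfp); /* the allocation may trigger reclaim */
    glock_held = 0;
    return ret;
}
```

With the mapping's mask left at GFP_KERNEL the model deadlocks; switching the mapping to GFP_NOFS breaks the cycle, which is the essence of the proposed fix.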
Created attachment 315777 [details]
Proposed patch

This is the simple "fix". Let's hope that there are no other issues caused by fixing things in this way.
I was able to reproduce this bug on a 4GB machine by introducing a memory hog. kernel-2.6.18-115.gfs2abhi.001

stress        D ffff810002376400     0  6681   6677   6680         (NOTLB)
 ffff8101148a7ad8 0000000000000086 0000000000000018 ffffffff884ba4bc
 0000000000000292 0000000000000007 ffff8101229fb7a0 ffffffff802f5ae0
 000005c5f4ced30d 000000000001e383 ffff8101229fb988 00000000884bbe1e
Call Trace:
 [<ffffffff884ba4bc>] :dlm:request_lock+0x93/0xa0
 [<ffffffff884e5138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff884e5141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063acd>] __wait_on_bit+0x40/0x6e
 [<ffffffff884e5138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b67>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009ea2a>] wake_bit_function+0x0/0x23
 [<ffffffff884e5133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff884f4f9f>] :gfs2:gfs2_delete_inode+0x4e/0x18d
 [<ffffffff884f4f97>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff884f4f51>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f275>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e7027>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e72e>] prune_dcache+0x10f/0x149
 [<ffffffff800e7067>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef5e>] shrink_slab+0xdc/0x153
 [<ffffffff800c8e67>] try_to_free_pages+0x1c2/0x2ba
 [<ffffffff8000f25f>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff80008c60>] __handle_mm_fault+0x5fb/0xe5c
 [<ffffffff80066b9a>] do_page_fault+0x4cb/0x830
 [<ffffffff800630aa>] thread_return+0x62/0xf1
 [<ffffffff8000de52>] do_mmap_pgoff+0x66c/0x7d7
 [<ffffffff8005dde9>] error_exit+0x0/0x84
*** Bug 464932 has been marked as a duplicate of this bug. ***
I hit this panic while trying to reproduce this bug. I was running -118.gfs2abhi.001, which includes patches from bugs 463134, 437803, 462579, and 465876.

Kernel BUG at fs/gfs2/trans.c:34
invalid opcode: 0000 [1] SMP
last sysfs file: /fs/gfs2/dash:brawl0/lock_module/recover_done
CPU 0
Modules linked in: sctp lock_dlm gfs2 dlm configfs qla2xxx scsi_transport_fc ipd
Pid: 9749, comm: d_doio Not tainted 2.6.18-118.gfs2abhi.001 #1
RIP: 0010:[<ffffffff8861d013>]  [<ffffffff8861d013>] :gfs2:gfs2_do_trans_begin+4
RSP: 0018:ffff81002d2bd968  EFLAGS: 00010282
RAX: ffff81002d2860c0 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff81002cdc8000
RBP: ffff810026a60048 R08: ffff81002d2bc000 R09: ffff810026a7f060
R10: ffff81003dddf4c0 R11: 0000000000000246 R12: 0000000000008000
R13: ffff81002cdc8000 R14: ffff81002cdc8000 R15: 0000000000000001
FS:  00002ac529321020(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaacd2000 CR3: 000000002d2de000 CR4: 00000000000006e0
Process d_doio (pid: 9749, threadinfo ffff81002d2bc000, task ffff81002d2860c0)
Stack:
 ffff81002d2bd978 ffff810026a602c8 ffff810026a60048 0000000000000000
 ffff81002cdc8000 0000000000000000 0000000000000080 ffffffff88615dde
 ffff810026a7f1f8 ffff810026a7f1f8 ffff810026a7f1a8 0000000100002615
Call Trace:
 [<ffffffff88615dde>] :gfs2:gfs2_delete_inode+0x10a/0x191
 [<ffffffff88615d1a>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff88615cd4>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f2ae>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e6a8a>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e767>] prune_dcache+0x10f/0x149
 [<ffffffff800e6aca>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003efc0>] shrink_slab+0xdc/0x153
 [<ffffffff800c896b>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8861d0e9>] :gfs2:gfs2_do_trans_begin+0x102/0x144
 [<ffffffff8860f4d8>] :gfs2:gfs2_write_begin+0x19d/0x33f
 [<ffffffff88610ed3>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
 [<ffffffff88611309>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
 [<ffffffff8000c3d0>] __wake_up_bit+0x28/0x2d
 [<ffffffff886114ac>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
 [<ffffffff8009e526>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88611cca>] :gfs2:gfs2_open+0x0/0x13a
 [<ffffffff8001e698>] __dentry_open+0x101/0x1dc
 [<ffffffff80064d35>] lock_kernel+0x1b/0x32
 [<ffffffff886115fc>] :gfs2:gfs2_file_write+0x49/0xa7
 [<ffffffff8001671c>] vfs_write+0xce/0x174
 [<ffffffff80016fe9>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 0f 0b 68 0c 1c 62 88 c2 22 00 44 89 e0 09 d8 75 0a 0f 0b 68
RIP  [<ffffffff8861d013>] :gfs2:gfs2_do_trans_begin+0x2c/0x144
 RSP <ffff81002d2bd968>
<0>Kernel panic - not syncing: Fatal exception
There is only one memory allocation in gfs2_do_trans_begin(), namely:

tr = kzalloc(sizeof(struct gfs2_trans), GFP_NOFS);

so it's quite clear that this has the GFP_NOFS flag set. I've looked at the code in mm, and at the shrinker in dcache.c, and it seems that they do take the required __GFP_FS bit into account. So why this is happening is still a mystery to me at the moment.
The report in comment #7 seems to be from a kernel which is missing the patch attached to this bz. Maybe that's why it's still failing?
I reverted back to testing -115.gfs2abhi.001, which we're sure has the patch from comment #4. I constructed a test case as follows:

1. Run d_metaverify forever on a GFS2 file system.
2. In parallel, run a memory stress test which allocates 75% of memory, exits, then waits one minute before restarting.

I ran this with lock_nolock on one node for 24 hours without hitting any problems. I'm restarting the load with one node and lock_dlm to see if it can handle the load.
One node mounted with lock_dlm: PASS
Two nodes mounted with load on one: PASS
Two nodes mounted with load on two: FAIL

After reviewing the logs I see that dash-01 tried to reconnect DLM, probably for clvmd. There were no openais messages. On dash-02 I see that openais noticed that the token was lost and it started forming a new ring. Interspersed with the openais messages are dlm messages.

Oct 16 18:00:34 dash-02 openais[9105]: [TOTEM] Did not need to originate any messages in recovery.
Oct 16 18:00:35 dash-02 openais[9105]: [TOTEM] Sending initial ORF token
Oct 16 18:00:35 dash-02 kernel: dlm: closing connection to node 1
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] New Configuration:
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.170)
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] Members Left:
Oct 16 18:00:35 dash-02 kernel: dlm: connect from non cluster node
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] Members Joined:
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] New Configuration:
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 16 18:00:35 dash-02 openais[9105]: [CLM  ] 	r(0) ip(10.15.89.170)

dash-02 then fences dash-01.
Created attachment 321605 [details]
Further test patch

This needs to be combined with the earlier patch in order to build a kernel for testing. It includes the DLM fix that I sent upstream and also a few noinline annotations to try and narrow down the source of the problem further, should it happen again. Abhi, can you build a kernel from the latest RHEL with the two patches from this bug for testing?
Abhi has built a kernel, does that fix the problem?
I was able to reproduce the issue on GFS2 with Abhi's -120.gfs2abhi.001 build. It took 4:48:08 to reproduce while running the load on two nodes. I had top and slabtop running on all nodes when the fence action occurred.

Configuration:
- 3-node cluster: dash-01, dash-02, dash-03
- file system mounted on all nodes
- load running on dash-01 and dash-02

Results:
- dash-02 was fenced
- top shows gfs2_scand was using 0.07% CPU before the fence action
- dlm_send on dash-01 is hogging CPU time
- sysrq-t didn't show a stack for dlm_send, but I got something out of sysrq-w

SysRq : Show CPUs
CPU0:
 ffffffff80431dc8 0000000000000000 0000000000000001 0000000000000000
 ffffffff801a3b5d ffff81002f8f9d78 0000000000000006 ffffffff801a3b8c
 ffff81002f8f9d78 ffffffff80091877 ffffffff80321520 0000000000000077
Call Trace:
 <IRQ> [<ffffffff801a3b5d>] showacpu+0x0/0x3b
 [<ffffffff801a3b8c>] showacpu+0x2f/0x3b
 [<ffffffff80091877>] on_each_cpu+0x19/0x22
 [<ffffffff801a3a92>] __handle_sysrq+0x90/0x121
 [<ffffffff801b2718>] receive_chars+0x128/0x276
 [<ffffffff801b38ae>] serial8250_interrupt+0x7a/0xfe
 [<ffffffff80010bd1>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800b78be>] __do_IRQ+0xa4/0x103
 [<ffffffff8001214f>] __do_softirq+0x89/0x133
 [<ffffffff885e68c6>] :dlm:process_send_sockets+0x0/0x15d
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8009a8be>] queue_work+0x4e/0x57
 [<ffffffff8004d25b>] run_workqueue+0x94/0xe4
 [<ffffffff80049b14>] worker_thread+0x0/0x122
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049c04>] worker_thread+0xf0/0x122
 [<ffffffff8008a22e>] default_wake_function+0x0/0xe
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
After further investigation I found I'm starting to get processes stuck on memory allocations again.

ps            D ffff810001576400     0 15620  11465               (NOTLB)
 ffff8100045a5998 0000000000000086 0000000000000018 ffffffff885e2abe
 0000000000000292 0000000000000008 ffff81002f8dc820 ffffffff802f4ae0
 0000137ee8d393c6 00000000000205ef ffff81002f8dca08 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
top           D ffff8100301a07a0     0 11768  11723               (NOTLB)
 ffff810036cffbd8 0000000000000082 0000000000000018 ffffffff885e2abe
 0000000000000292 000000000000000a ffff81000d7e80c0 ffff8100301a07a0
 0000130d7bb09916 000000000007cbec ffff81000d7e82a8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff88687556>] :lock_dlm:gdlm_ast+0x0/0x311
 [<ffffffff886872c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
slabtop       D ffff810001576400     0 11770  11722               (NOTLB)
 ffff81003403b978 0000000000000082 0000000000000018 ffffffff885e2abe
 0000000000000296 0000000000000009 ffff810014e9f100 ffffffff802f4ae0
 000013aac6c63807 0000000000019f95 ffff810014e9f2e8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 ...
gfs2_quotad   D ffff810001576400     0 11425     15  14610 11424  (L-TLB)
 ffff81002d109d50 0000000000000046 0000000000000018 ffffffff885e2abe
 0000000000000282 000000000000000a ffff81002d018080 ffffffff802f4ae0
 000012c24b40c9a1 000000000001e119 ffff81002d018268 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff88622f7c>] :gfs2:gfs2_statfs_sync+0x45/0x201
 [<ffffffff8004accb>] try_to_del_timer_sync+0x51/0x5a
 [<ffffffff88622f74>] :gfs2:gfs2_statfs_sync+0x3d/0x201
 [<ffffffff80063964>] schedule_timeout+0x92/0xad
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88607490>] :gfs2:gfs2_quotad+0x50/0x154
 [<ffffffff88607440>] :gfs2:gfs2_quotad+0x0/0x154
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

gfs2_logd     S ffff810001576400     0 11424     15  11425 11421  (L-TLB)
 ffff81002058de20 0000000000000046 ffff81002c92d048 ffff81002058de30
 ffffffff88607fa3 000000000000000a ffff810013a8f0c0 ffffffff802f4ae0
 000014a582097b64 000000000000160d ffff810013a8f2a8 000000002c92d048
Call Trace:
 [<ffffffff88607fa3>] :gfs2:gfs2_dirent_find+0x0/0x4e
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8006395c>] schedule_timeout+0x8a/0xad
 [<ffffffff80094d76>] process_timeout+0x0/0x5
 [<ffffffff88607424>] :gfs2:gfs2_logd+0x140/0x15c
 [<ffffffff886072e4>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

dlm_recoverd  S ffff8100267b5d80     0 10998     15  11405 10997  (L-TLB)
 ffff81002f90dea0 0000000000000046 ffff81003271e800 ffffffff885e9da1
 0000000000000000 000000000000000a ffff81002f8d9080 ffff81002f8d4040
 00001210455d920d 0000000000002eda ffff81002f8d9268 00000000885e8d07
Call Trace:
 [<ffffffff885e9da1>] :dlm:dlm_wait_function+0xf9/0x135
 [<ffffffff885e95e5>] :dlm:dlm_rcom_status+0xae/0x179
 [<ffffffff80063bb6>] mutex_lock+0xd/0x1d
 [<ffffffff8002feba>] __up_write+0x27/0xf2
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885ea78c>] :dlm:dlm_recoverd+0x56/0x47f
 [<ffffffff885ea736>] :dlm:dlm_recoverd+0x0/0x47f
 [<ffffffff800324bd>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d6d5>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323bf>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

dlm_send      D ffff810001576400     0 10997     15  10998 10996  (L-TLB)
 ffff81002f8f9850 0000000000000046 0000000000000018 ffffffff885e2abe
 0000000000000282 0000000000000001 ffff81002f8d97e0 ffffffff802f4ae0
 000012be3226bd3b 000000000001ceb9 ffff81002f8d99c8 00000000885e2be2
Call Trace:
 [<ffffffff885e2abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff8860d141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff8860d138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff8860d133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8861cd46>] :gfs2:gfs2_delete_inode+0x4e/0x191
 [<ffffffff8861cd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8861ccf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8000f191>] __alloc_pages+0xe6/0x2ce
 [<ffffffff8001762b>] cache_grow+0x137/0x395
 [<ffffffff8005bda0>] cache_alloc_refill+0x136/0x186
 [<ffffffff8000a96e>] kmem_cache_alloc+0x6c/0x76
 [<ffffffff80043c1d>] sk_alloc+0x2e/0xf3
 [<ffffffff8005982b>] inet_create+0x137/0x267
 [<ffffffff8004cab1>] __sock_create+0x170/0x27c
 [<ffffffff885e68c6>] :dlm:process_send_sockets+0x0/0x15d
 [<ffffffff885e634c>] :dlm:tcp_connect_to_sock+0x70/0x1de
 [<ffffffff8001c8ff>] __mod_timer+0xb0/0xbe
 [<ffffffff80054ec6>] sk_reset_timer+0xf/0x19
 [<ffffffff80032dd0>] __tcp_push_pending_frames+0x7a0/0x87a
 [<ffffffff8002dfe0>] __alloc_skb+0x77/0x123
 [<ffffffff885e68e2>] :dlm:process_send_sockets+0x1c/0x15d
 [<ffffffff8004d25b>] run_workqueue+0x94/0xe4
I just flipped over to see what slabtop was doing on dash-01. This looks interesting:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
263555 263555 100%    1.50K  52711        5    421688K TCP
263748 263748 100%    0.62K  43958        6    175832K sock_inode_cache
 78800  76634  97%    0.78K  15760        5     63040K gfs2_inode
 76216  68488  89%    0.52K  10888        7     43552K radix_tree_node
152100 147624  97%    0.25K  10140       15     40560K size-256
Updating component to kernel; gfs2-kmod doesn't exist any more. Should we have a gfs2-kernel now, or is plain kernel OK? Having said that, this seems to be more dlm than gfs2, so maybe it should be dlm-kernel, but I'll leave it at plain kernel for now.
I was able to hit the issue again, but this time fencing got stuck looking for memory, so I was able to gather some more information. Same configuration as comment #14.

I found the system in a state where dash-01 was attempting to fence dash-02, but fence_apc was stuck trying to allocate memory with gfs2_delete_inode in its stack.

On dash-03 I see that dash-02 was removed from the cluster:

Oct 28 23:13:43 dash-03 openais[10896]: [TOTEM] The token was lost in the OPERATIONAL state.
...
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] New Configuration:
Oct 28 23:13:48 dash-03 kernel: dlm: closing connection to node 2
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.170)
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] Members Left:
Oct 28 23:13:48 dash-03 openais[10896]: [CLM  ] 	r(0) ip(10.15.89.169)
...
Oct 28 23:13:49 dash-03 kernel: dlm: connect from non cluster node
Oct 28 23:13:49 dash-03 kernel: dlm: connect from non cluster node
Oct 28 23:13:49 dash-03 fenced[10919]: fencing deferred to dash-01
Oct 28 23:13:54 dash-03 kernel: dlm: connect from non cluster node

On dash-02 at the same time I found dlm trying to reconnect, but no openais log messages:

Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 3
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 1
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 3
Oct 28 23:13:41 dash-02 kernel: dlm: connecting to 1

The stack trace from dash-02's aisexec shows that it is stuck looking for memory and caught in the gfs2/dlm mess.
aisexec       D ffff81002b46a100     0  9928      1   9929  9923  (NOTLB)
 ffff81002b3e1818 0000000000000086 0000000000000018 ffffffff885d5abe
 0000000000000296 0000000000000001 ffff81002dc35080 ffff81002b46a100
 00002155a9b007ed 00000000000166d8 ffff81002dc35268 00000000885d5be2
Call Trace:
 [<ffffffff885d5abe>] :dlm:request_lock+0x93/0xa0
 [<ffffffff8867a556>] :lock_dlm:gdlm_ast+0x0/0x311
 [<ffffffff8867a2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff88600138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff88600141>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff88600138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009d91b>] wake_bit_function+0x0/0x23
 [<ffffffff88600133>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff8860357b>] :gfs2:gfs2_dinode_dealloc+0xfb/0x1a7
 [<ffffffff8860fde3>] :gfs2:gfs2_delete_inode+0xeb/0x191
 [<ffffffff8860fd3e>] :gfs2:gfs2_delete_inode+0x46/0x191
 [<ffffffff8860fcf8>] :gfs2:gfs2_delete_inode+0x0/0x191
 [<ffffffff8002f270>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e5eef>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e729>] prune_dcache+0x10f/0x149
 [<ffffffff800e5f2f>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef82>] shrink_slab+0xdc/0x153
 [<ffffffff800c7da7>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f276>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8024e7cd>] udp_push_pending_frames+0x21e/0x243
 [<ffffffff8003c19b>] __get_free_pages+0xe/0x71
 [<ffffffff8001e814>] __pollwait+0x58/0xe2
 [<ffffffff8002ff9d>] unix_poll+0x18/0x96
I don't see how GFP allocation flags have anything to do with this. In fact, I don't see how memory allocations at all have anything to do with this. No one is even blocked trying to allocate memory (much less blocked on an allocation that's in turn trying to flush gfs2 objects to free up memory.)

The obvious common factor in all these traces is that:

1. gfs2 inode free/delete/dealloc is in the kernel's general free-up-memory path, like inodes from other fs's
2. gfs2 inode free/delete/dealloc requires doing a lock operation

I say "free/delete/dealloc" because it's not clear to me how/why those are related in the traces.

The backtraces clearly begin with the system trying to free up memory, a normal kind of thing that's expected. This calls into gfs2 to free up inodes. But in gfs2 this appears to translate into *ondisk deallocation* of disk blocks, not just freeing up incore memory. That doesn't seem like a good thing to be doing. For one thing, the ondisk operation ends up taking dlm locks, and waiting on them. It looks very wrong to be doing all this in response to the system just wanting to free up memory from cached inodes.

I'm guessing the way gfs1's inodes are freed is designed to avoid these problems. I'm not sure at which level it solves the problem. At the top level perhaps gfs1 has its own way of freeing inodes (e.g. via its own thread or something) which avoids this. And/or I suspect that freeing inodes is not related to ondisk deallocation in gfs1. And at a minimum, I really doubt gfs1 requires taking dlm locks to free an inode. So my suspicion is that this is a gfs2 design issue.
To correct what I wrote above: there are memory allocations which everything gets blocked on, but they are not usually explicit allocations, much less allocations in our own code. The kernel just finds, at any random point in the context of any random process, that it needs to free up memory, which causes it to try to free up memory from things like incore inodes. GFS blocking on a long dlm lock operation in the context of any random process that happens to trigger kernel memory reclaim is simply not going to work.
gfs1 and the original gfs2 have nothing attached to the .delete_inode hook, so they don't do anything in the code paths shown above.
To try and answer some of the points above: there was an original bug (in gfs2) which started off this particular bugzilla, and that indeed was fixed and has (just) gone into RHEL. While doing tests to try and verify that fix, some other issues have come up. One of the problems in tracking down what those issues are is that they seem to be memory related and, when the problems occur, everything is locked up, so we are a bit short on data, but things are getting clearer as time goes on at least.

So the issue relating to memory allocations and the filesystem is that unless we specify GFP_NOFS, an allocation can trigger reclaim which calls back into the filesystem. The only cache which affects this situation is the dcache, since usually the dcache is holding the last reference to an inode. Assuming that the inode has a zero link count, then we try to delete it (deallocate it) when it's evicted from the inode cache. This is identical to any other filesystem, aside from gfs1, which did evict its vfs inodes this way but left the gfs2_inodes in cache.

Now it's a known problem with the vfs that when this activity happens it tends to result in rather inefficient I/O due to the "random" order of cache evictions. So there is certainly some room for improvement.

As for the taking of dlm locks at the point of deallocation, that's unavoidable since we have to lock the various objects (inode, resource groups, etc.) involved. So the only way to avoid that in the memory reclaim path would be to move the time at which inodes are removed from the inode cache to some other point in time by overriding the d_iput() function. The disadvantage of doing that is that we'd land up not really freeing much memory at all when the dcache was squeezed, and that might well give us other problems. We'd then have to work out when was a suitable time to actually dispose of those inodes, and hope that we had enough memory to do so given that we didn't dispose of them when the vfs/vm first requested it.
If we can't think of anything better by tomorrow, then we can certainly try that. Either way, the patch which I sent for the upstream dlm does fix real issues, and ones which might be hit from other points in the code, even if they haven't been so far.

Now this may well all be moot anyway, since the trace from comment #18 has again stopped in dlm:request_lock (or appears to have done) and, since I added a bunch of noinline annotations to the code in my recent test patch, if we had got stuck allocating memory then it should have shown up there, since dlm:request_lock doesn't allocate any memory directly itself. Also it may well be that the entries at the end of that stack trace aren't actually real, and that it might have stopped further up, my bet being gfs2:just_schedule, i.e. waiting for a lock from another node.
Yeah, the stuff above the schedule is "left over". gfs1 played tricks like you mention (e.g. artificial reference counts) to change normal kernel behavior and give itself control over when to do things like taking dlm locks or doing ondisk operations. And gfs not freeing as much memory as an ordinary fs might would definitely be one of the trade-offs (at least in the normal memory reclaim paths; gfs had other mechanisms to try to compensate, IIRC). I think you'll be forced to do something like that, because gfs can't block any random process (esp. something like aisexec) for some indefinite period of time to do a remote lock operation.
Even with Christine's patch to connect DLM early, I hit the bug. I don't think the problem is necessarily that DLM can't connect under memory pressure. It was connected and it got disconnected. I see in the logs:

dlm: closing connection to node 3
dlm: connect from non cluster node

which seems to indicate that perhaps the dlm disconnects because it got evicted from the cluster by a hung aisexec process. The logs seem to support that.

Oct 31 11:39:39 dash-02 openais[10097]: [TOTEM] entering GATHER state from 12.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering GATHER state from 11.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] Saving state aru 58 high seq received 58
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] Storing new sequence id for ring 98
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering COMMIT state.
Oct 31 11:39:44 dash-02 openais[10097]: [TOTEM] entering RECOVERY state.
...
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] New Configuration:
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.168)
Oct 31 11:39:44 dash-02 kernel: dlm: closing connection to node 3
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.169)
Oct 31 11:39:44 dash-02 kernel: dlm: connect from non cluster node
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] Members Left:
Oct 31 11:39:44 dash-02 openais[10097]: [CLM  ] 	r(0) ip(10.15.89.170)

The logs on dash-03 in this case show that aisexec is stuck down the gfs2_delete_inode path.
There are three possible paths I'm aware of which might land up calling into the DLM as a result of memory pressure:

1. shrinking the dcache, resulting in a final iput of an inode (as per above)
2. shrinking the icache, resulting in the drop of an unreferenced inode
3. shrinking the page cache, resulting in a transaction during writeback (jdata)

In the case of #1 and #2 this only happens if i_nlink is 0. I suspect that #2 is a rare case compared with #1, but there seems to be no reason to suppose that it cannot happen. I have the beginnings of a patch which might address cases #1 and #2. I don't yet have any idea how to deal with #3, and I don't think it's possible in gfs2 alone. There is also (in my current patch) the question of when to finally flush out the inodes whose deallocation has been deferred. In the case of umount, there are various bits of code which wait upon the destruction of inodes, so flushing them at that point will be critical, assuming that we don't want to hang forever. Also we'll need to export a couple of new functions from the VFS in order to allow us to defer the "delete inode" part of the state machine. I'm wondering whether it wouldn't be better to actually make this part of the VFS itself rather than specific to GFS2. I'm sure that OCFS2 will need the same thing, for example, since they have followed our lead in this area. Also, due to #2, my suggestion of using d_iput as a hook in comment #22 doesn't work, as we must ensure that we hook into the inode state machine directly.
I'd like to know which specific backtrace would be resolved by which specific part of this dlm allocation patch: https://www.redhat.com/archives/cluster-devel/2008-November/msg00032.html The only thing I've heard about the patch above is that it changes the overall behavior of the given test system, and the general effect is that the problem takes longer to appear. I think we need more solid facts if we're going to risk making a last-minute change.
Ok, so there are several issues here, but let's start with comment #15 and the dlm_send backtrace. At that point it's possible to see that we have a path from dlm_send back into the filesystem via a GFP_KERNEL allocation. At the moment we can do nothing about that particular allocation, since it's part of the socket code, but having looked at that problem in some detail, I'm intending to ask the net developers their opinion on allowing us to specify the GFP setting for socket creation. With Chrissie's patch, we proved that moving the socket creation to an earlier point in time reduced the likelihood of having to allocate a socket during a call into the DLM from the filesystem. I'm not going to suggest that we should use that patch, however, as I think the real fix for this is to allow GFP_NOFS allocation of sockets. Nevertheless, that patch was very helpful in allowing us to narrow down the problem. We now know that using GFP_KERNEL when calling down in the context of the filesystem is going to fail. There was a bug about this issue some time ago (whose bz number I have forgotten and can't find at the moment) which resulted in the addition of ls->ls_allocation in the first place. There seems little point in having that feature if it's ignored in some of the allocations, so the patch I recently posted to cluster-devel fixes all of the places which use GFP_KERNEL aside from two. One of those is the socket allocation, for the reasons previously stated, and the other is the kmap(), which, as mentioned in the patch comment, will never cause a problem since with the current code it will never be passed a highmem page anyway. So I fully realise that this patch will not actually fix the socket allocation problem, but it will fix all the other possible cases. I'm also quite happy with the decision that it's too invasive for RHEL at the moment; as long as it gets into upstream, I'm happy with that.
The point I was making at the meeting is that we need to decide very quickly now whether it will go into RHEL 5.3, even if that decision is "no". Also, with regard to why I'd like to fix things this way, see case #3 of comment #25, where it's possible for jdata writeback to trigger something similar. If I could find the details of the previous bug where ls->ls_allocation was added in the first place, I have a feeling that we used the same reasoning when that was fixed. In fact I look at this as the same bug; it's just that some cases were missed when the original patch was produced.
There's no doubt your patch is needed; as you say, the whole point of ls_allocation is to use it and avoid looping back into gfs. And I'm not really worried about regressions from the part of the patch that does that (I'd split and leave the kmap changes for later). My only question is whether adding this patch will fix any of the problems we have seen. If it will, then I'll post it (without kmap bits) for 5.3 with that justification.
I suspect that we won't hit the problems the current patch was designed to fix very often until we can fix the socket allocation issue. I suspect the socket allocation triggers first because we land up allocating two bits of memory in quick succession (the inode and the struct sock) which are both fairly large, whereas the other allocations are generally pretty small and fairly rare too. Since this does take some time to set up and hit, I think we'd have to do a lot of runs before we could say for certain that we'd never hit it. I'd tend to err on the side of caution and add it anyway. I guess you'll need to clone this bug for that, since we need to keep this one around for the remainder of the issues.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: There is a known issue relating to VM pressure when a GFS2 filesystem is mounted and there are cached GFS2 inodes which have been unlinked. This can only occur if the inodes have been used (accessed) on one node and then unlinked on a different node. Occasionally, a node may hang if these two conditions occur at the same time. The node will be fenced and recovered via the normal cluster recovery mechanism, so the impact is the unavailability of a single node for a period of time and does not, in general, require user intervention. A fix is scheduled for RHEL 5.4 (see bugzilla #460218). Users who suspect that they have this particular issue should look for the occurrence of gfs2_dinode_dealloc and shrink_dcache_memory in stack traces of stuck processes. Please also note that this issue does not affect single node GFS2 filesystems.
This bug is now documented in the RHEL 5.3 release notes. Please consult the following link to view the most recent build of this document: http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.3/html-single/Release_Notes/ This document is synced daily; as such, any recent additions are expected to show up within 24 hours or so.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:

@@ -1,5 +1,3 @@
-There is a known issue relating to VM pressure when a GFS2 filesystem is mounted and there are cached GFS2 inodes which have been unlinked. This can only occur if the inodes have been used (accessed) on one node and then unlinked on a different node. Occasionally, a node may hang if these two conditions occur at the same time. The node will be fenced and recovered via the normal cluster recovery mechanism, so that the impact is the unavailability of a single node for a period of time and does not, in general require user intervention.
+If your system has a GFS2 file system mounted, a node may hang if a cached inode is accessed in one node and unlinked on a different node. When this occurs, the hung node will be unavailable until you fence and recover it via the normal cluster recovery mechanism. The function calls gfs2_dinode_dealloc and shrink_dcache_memory will also appear in the stack traces of any processes stuck in the hung node.
-A fix is scheduled for RHEL 5.4 (see bugzilla #460218). Users who suspect that they have this particular issue should look for the occurrence of gfs2_dinode_dealloc and shrink_dcache_memory in stack traces of stuck processes.
+This issue does not affect single-node GFS2 file systems.
-
-Please also note that this issue does not affect single node GFS2 filesystems.
Created attachment 329073 [details] Patch to make the DLM connect as soon as a node joins a lockspace For reference, this is the patch mentioned in comment #24
Created attachment 329702 [details] The dlm connection patch Really attach the patch this time. The previous attachment seemed to be a single line file containing a copy of the filename.
The requested information was provided in comment #31.
Created attachment 329711 [details] Patch to use GFP_NOFS for dlm recovery

It occurred to me that we hadn't really gone through all the possible ways in which we might have landed up in the fs reclaim code. Since we did the original tests, a number of changes have gone into the upstream code:

a) The original patch in this bz has gone into the dlm
b) The write_begin function has gained its own GFP_NOFS allocation status, so we don't need to mark the whole address space as GFP_NOFS for "normal" inodes
c) Chrissie's "connect early" patch is very small and simple, and certainly a lot less invasive than the other option (GFP_NOFS allocation of sockets)
d) I've also now (in this attachment) been through all of the recovery code and checked each and every allocation to ensure that we will not be stuck waiting in a filesystem when we are asking for memory to recover a cluster.

That, I think, should close all possible windows for problems through the dlm itself. So if this doesn't work, then we'll have to look at the socket side of things again, which is what I hope we might otherwise avoid. Fingers crossed :-)
Perhaps we need two bz's for the two issues here:

1. dlm: dlm_send can create sockets
2. gfs2: can take dlm locks when the vm asks it to free inodes to free up mem

The argument against 2 is very simple:
. taking a dlm lock can block for an indefinite period of time
. an app may fail if the kernel blocks it for an indefinite period of time
. therefore, the kernel should not take a dlm lock in an application's context

More specifically:

1. app does something requiring kernel memory
2. in the app's context, the vm sees a memory shortage and tries to free mem
3. vm asks fs's on the system to drop cached inodes to free up mem
4. gfs2 is asked to free an inode, for which it attempts to acquire a dlm lock
5. dlm lock blocks for a long time
6. app is blocked for a long time
7. app fails

To fix this, gfs2 in step 4 should either decline to free any inodes for which it would need to acquire dlm locks, or defer steps like acquiring dlm locks to another thread (one of its own). Going back to the initial two issues: solving the dlm issue (1) will not in any way resolve or help the gfs2 issue (2). However, solving the gfs2 issue (2) may well make the dlm issue (1) a non-problem. If we want to further address the dlm/socket issue, let's do that in another bz. This bz should focus on the gfs2 issue.
To clarify: an app accessing a gfs file system should, of course, expect the kernel to acquire dlm locks in its context and possibly block for indefinite periods. It is random other apps not using gfs that could easily fail if forced to block for indefinite periods.
We should probably also take a look at OCFS2 to see what they do, but I suspect they suffer from the same issue. All apps have to block on memory allocation for indeterminate lengths of time - that's the mechanism which prevents them running the system totally out of memory. It should not be "forever", but depending on the system load it might be as close as makes no practical difference; we can only do our best for "reasonable" loads, so we do need to consider what is "reasonable" and what isn't. From the app's point of view it isn't any different to blocking while waiting for I/O over some other channel, whether network or some local bus such as SCSI. It still might be worth making some changes in the inode clear/delete space though. It would be nicer to have some kind of "batch" system where the fs is handed a bunch of inodes at the same time. That would allow sensible scheduling of I/O and also glocks. It's a large project though, so not something I'd want to do right now - and in any case it would be at least RHEL6 rather than RHEL5.4. It is a recognised issue that filesystems are not able to control writeback as much as they'd like to, and that ordering of I/O is not as good as it might be. When we get to the delete_inode code, we don't know if another node is still holding the inode open. If so, we can just forget about the inode rather than removing it. If there is another node holding the lock, then it's reasonable to assume that the inode must be in use, so we should only need a try lock at that stage. The reason that we can't stick a try flag on the inode lock is that we must ensure that we send a callback on the iopen lock to pass on the message that the inode's link count has hit zero. Potentially we could change that, I suppose, and do a try lock on the iopen whether or not we get the inode's own lock. I'm not sure it improves the situation though.
If the socket allocation needs to be made, I presume it does not matter much whether the lock is a try lock or not. If try locks were to fail to send callbacks in the event of a connection issue requiring the allocation of a new socket (I mean the _1CB variant of a try lock here), that would break the assumptions the unlinked inode handling is based upon. So although it's a rather nasty issue to solve, I don't see any other practical way at the moment.
apps will not block on memory allocations for long periods under normal or even stressful conditions. dlm locks can regularly be expected to block for long periods under normal conditions.
I suppose it depends on what you define as normal conditions... any app that is dirtying pages at a rate above that at which they can be written to storage will eventually have to block. The length of time it blocks for depends upon how early it is detected and the relative rates of writing and page dirtying. I don't think it makes any difference whether we are waiting for I/O over a congested network link or for a dlm lock over the same link: either will be slow. Likewise it doesn't matter whether we have a network block device server (I mean a generic one here, not nbd specifically) which is overloaded, or a heavy workload sharing the same node as the dlm lock master for a required lock. The result is the same in the end. So we can do our best to try and minimise these issues by such means as the recent work on I/O queuing priorities which was done to solve fsync() issues, but the possibility will always remain that it can happen from time to time. One might also reasonably argue that running a test load deliberately designed to place as much stress on the VM as possible, whilst at the same time killing random nodes in the cluster, is not "normal use". I don't think that blocking for a long time is really the issue here anyway. What we want to do is break the cycle which results in complete deadlock from time to time. If it just gets very slow from time to time, then that's ok, I think.
We're not talking about doing i/o; we're talking about a random process that might incidentally allocate memory at any point in time being blocked on a dlm lock for a long period. Just look at the backtraces: you have processes like "top", "sendmail" and "aisexec" doing gfs2 disk deallocation and blocking on dlm locks! If "top" goes out to lunch for a while it may not matter, but if you do that to "aisexec" or who knows what other customer application, things don't work.

> I don't think that blocking for a long time is really the issue here anyway.

Like I keep repeating, there are two separate issues here:

1. gfs2 blocking random processes on dlm locks if they happen to need memory at the wrong time
2. dlm_send creating sockets and deadlocking because of 1

If you want to discuss the dlm issue, let's open a separate bz and look at how we can isolate socket creation from the send path. But it doesn't fix the bigger gfs2 problem.
If you want to split off the dlm side of this into another bz, that's ok by me. I didn't know that it would be possible to isolate the socket creation, as there is always the risk of needing to connect to new nodes, but if it is possible, that would certainly solve that issue. With regard to:

> We're not talking about doing i/o,

I disagree - after all, that's why we are getting stuck: because the dlm is sending a message which requires a socket allocation. It doesn't really matter all that much whether the I/O is via the network (dlm) or via a block device (other local fs), or even both; it's all I/O and it can all block for long periods of time depending on circumstances. If we don't block when disposing of inodes, there is nothing to stop an app creating and unlinking them until it has run the node in question out of memory. So if we can't dispose of them fast enough, then we have to have some way to slow down the rate of inode creation until the two are in balance. Given some upstream changes we might well be able to make disposal of inodes a more efficient process, and that's something I'd very much like to see, but we'll still have to have some kind of rate limiting for memory allocations, and blocking in ->delete_inode() is not an unreasonable thing to do from time to time. As I mentioned in the penultimate paragraph of comment #41, changing the initial lock to a try lock might well be possible, and that should solve the issue of blocking for a long time due to contention on the lock, so that is certainly something we should add to the list. I'm certainly not trying to pretend that GFS2 is perfect in this area and things could be improved, but we need to break the cycle of dependencies first (i.e. item 2 in your list above), and then we can worry about improving the performance of inode deletion later. So I think we both agree that both tasks need doing. The only question is how to actually do them at this stage.
I think we can also be fairly sure that this won't hit 5.4, so we can take some time to look at the issues properly and come up with (I hope) reasonable solutions. If you are happy with that, I'll twiddle the flags on this bz accordingly.
You're talking about what happens when dlm_send creates a socket, allocates memory and deadlocks. I'm not. I'm talking about what happens when dlm_send is not creating a socket, not allocating memory, not deadlocking -- dlm_send is working just fine, and dlm messages are going out and coming in just fine. The problem I'm talking about is that a dlm lock is often not granted for quite a while even when everything is working properly. gfs2 waits for dlm locks in the context of processes that incidentally allocate memory, which triggers the vm to reclaim gfs2 inodes, at which point gfs2 waits for dlm locks. The dlm locks will eventually be granted, but the delay is long enough that the innocent process picked to do gfs2's work has timed out, or failed in some way because it was blocked for too long.
There is now a fix for this bug. Removing relnotes as they will not be required for 5.4.
Deleted Release Notes Contents. Old Contents: If your system has a GFS2 file system mounted, a node may hang if a cached inode is accessed in one node and unlinked on a different node. When this occurs, the hung node will be unavailable until you fence and recover it via the normal cluster recovery mechanism. The function calls gfs2_dinode_dealloc and shrink_dcache_memory will also appear in the stack traces of any processes stuck in the hung node. This issue does not affect single-node GFS2 file systems.
Created attachment 344187 [details] upstream dlm early connection patch Will be pushing this to linux-next for merging in the next cycle. It fixes a major problem in the earlier version of the connection patch that caused nodes to connect to themselves, leading to oopses.
Created attachment 344201 [details] upstream dlm allocation patch Will be pushing this to linux-next for merging in the next cycle.
Created attachment 344373 [details] Combined RHEL5 patch This is the combined RHEL5 patch from Dave's upstream patch above. There's one hunk in member.c that's missing in the latest RHEL5 kernel. Could you guys take a look before I post to rhkernel-list?
Abhi, please keep the patches separate in RHEL5 like they are upstream, and include the original description from each patch in the patch posted for rhel. Otherwise they look fine.
http://post-office.corp.redhat.com/archives/rhkernel-list/2009-May/msg00717.html http://post-office.corp.redhat.com/archives/rhkernel-list/2009-May/msg00718.html Posted the above patches to rhkernel-list
in kernel-2.6.18-150.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I ran the test case for about 18 hours and was not able to reproduce this hang. Marking VERIFIED based on kernel-2.6.18-154.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html