Bug 464932

Summary: GFS2: Recursive glock detected, io_submit to shrink_slab to gfs2_dinode_dealloc
Product: Red Hat Enterprise Linux 5 Reporter: Nate Straz <nstraz>
Component: kernelAssignee: Steve Whitehouse <swhiteho>
Status: CLOSED DUPLICATE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3CC: cluster-maint, edamato
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-10-07 10:20:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Proposed patch none

Description Nate Straz 2008-10-01 14:41:57 UTC
Description of problem:

While running d_io with plocks on GFS2 with 32MB RGs I hit the following panic while laio_rwssync and laio_lock1 were running locally on dash-03.

original: gfs2_rindex_hold+0x32/0x153 [gfs2]
pid : 27035
lock type: 2 req lock state : 3
new: gfs2_rindex_hold+0x32/0x153 [gfs2]
pid: 27035
lock type: 2 req lock state : 3
 G:  s:SH n:2/18402 f:s t:SH d:EX/0 l:0 a:0 r:5
  H: s:SH f:H e:0 p:27035 [xdoio] gfs2_rindex_hold+0x32/0x153 [gfs2]
  H: s:SH f:H e:0 p:27027 [xdoio] gfs2_rindex_hold+0x32/0x153 [gfs2]
  I: n:19/99330 t:8 f:0x00000010
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...r/build/BUILD/gfs2-kmod-1.116/_kmod_build_/glock.c:957
invalid opcode: 0000 [1] SMP 
last sysfs file: /kernel/dlm/brawl0/id
CPU 0 
Modules linked in: sctp gnbd(U) lock_nolock gfs(U) lock_dlm gfs2(U) dlm configfs ipt_MASQUERADE iptable_n
at ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge a
utofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scs
i_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg shpc
hp tg3 libphy pcspkr qla2xxx scsi_transport_fc i3000_edac edac_mc serio_raw i2c_i801 i2c_core ide_cd cdro
m dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd 
ehci_hcd
Pid: 27035, comm: xdoio Tainted: G      2.6.18-116.gfs2abhi.001 #1
RIP: 0010:[<ffffffff886129fd>]  [<ffffffff886129fd>] :gfs2:gfs2_glock_nq+0x231/0x273
RSP: 0018:ffff81000d3e9918  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8100301b5518 RCX: ffff810008da33b8
RDX: 00000000ffffffff RSI: 0000000000000046 RDI: ffffffff802feb5c
RBP: ffff810030bc9318 R08: 000000000000000d R09: 000000000000003f
R10: ffffffff803e5520 R11: 0000000000000000 R12: ffff810007d921a8
R13: ffff810007d921a8 R14: ffff81002fcc8000 R15: ffff810030bc9200
FS:  00002b1465044ce0(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000018803c08 CR3: 00000000105de000 CR4: 00000000000006e0
Process xdoio (pid: 27035, threadinfo ffff81000d3e8000, task ffff810005dfd7e0)
Stack:  ffff81002fcc8000 ffff810017479cc8 ffff8100240e79a8 0000000000000000
 ffff810030bc9318 ffffffff88624fed ffff81000d3e99a8 0000000000000000
 ffff81000d3e99b8 ffffffff80063aed ffff810005dfd9d0 ffff810017479f70
Call Trace:
 [<ffffffff88624fed>] :gfs2:gfs2_rindex_hold+0x3a/0x153
 [<ffffffff80063aed>] __wait_on_bit+0x60/0x6e
 [<ffffffff88611138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b67>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff886147b9>] :gfs2:gfs2_dinode_dealloc+0x8e/0x1a7
 [<ffffffff8862102e>] :gfs2:gfs2_delete_inode+0xf1/0x18d
 [<ffffffff88620f83>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff88620f3d>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f2b6>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e6fe7>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e76f>] prune_dcache+0x10f/0x149
 [<ffffffff800e7027>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef9f>] shrink_slab+0xdc/0x153
 [<ffffffff800c8e27>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f265>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8862834d>] :gfs2:gfs2_do_trans_begin+0x102/0x144
 [<ffffffff8861a7ae>] :gfs2:gfs2_write_begin+0x1a0/0x342
 [<ffffffff8861c1ab>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
 [<ffffffff8861c5e1>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
 [<ffffffff8000f180>] __alloc_pages+0xe6/0x2ce
 [<ffffffff8861c67e>] :gfs2:gfs2_file_aio_write+0x65/0xc1
 [<ffffffff800ea4a4>] aio_pwrite+0x2c/0x75
 [<ffffffff800ead85>] aio_run_iocb+0xef/0x18a
 [<ffffffff800eb8ba>] io_submit_one+0x29d/0x2ea
 [<ffffffff800ebdf9>] sys_io_submit+0x9b/0x108
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Version-Release number of selected component (if applicable):
kernel-2.6.18-116.gfs2abhi.001

How reproducible:
Unknown
  
Actual results:


Expected results:


Additional info:

Tests running across the cluster at the time.

dash-01: stuffed_rwssync laio_rwrandirect
dash-02: mtfile_rwransync stuffed_lockseq
dash-03: laio_lock1 laio_rwssync

Comment 1 Nate Straz 2008-10-01 14:46:18 UTC
With luck I hit it again, this time on dash-02.

Tags running at the time dash-02 failed, genesis_reg and laio_rwransync.


original: gfs2_rindex_hold+0x32/0x153 [gfs2]
pid : 28542
lock type: 2 req lock state : 3
new: gfs2_rindex_hold+0x32/0x153 [gfs2]
pid: 28542
lock type: 2 req lock state : 3
 G:  s:SH n:2/18402 f:s t:SH d:EX/0 l:0 a:0 r:5
  H: s:SH f:H e:0 p:28542 [genesis] gfs2_rindex_hold+0x32/0x153 [gfs2]
  I: n:19/99330 t:8 f:0x00000010
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...r/build/BUILD/gfs2-kmod-1.116/_kmod_build_/glock.c:957
invalid opcode: 0000 [1] SMP 
last sysfs file: /kernel/dlm/brawl0/control
CPU 0 
Modules linked in: sctp gnbd(U) lock_nolock gfs(U) lock_dlm gfs2(U) dlm configfs ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_i801 i2c_core qla2xxx scsi_transport_fc shpchp tg3 libphy ide_cd i3000_edac pcspkr edac_mc cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 28542, comm: genesis Tainted: G      2.6.18-116.gfs2abhi.001 #1
RIP: 0010:[<ffffffff886059fd>]  [<ffffffff886059fd>] :gfs2:gfs2_glock_nq+0x231/0x273
RSP: 0018:ffff810005f39898  EFLAGS: 00010292
RAX: 0000000000000000 RBX: ffff810004933318 RCX: ffff81001201fb18
RDX: 00000000ffffffff RSI: 0000000000000046 RDI: ffffffff802feb5c
RBP: ffff810004933f18 R08: 000000000000000d R09: 000000000000003f
R10: ffffffff803e5520 R11: 0000000000000000 R12: ffff81001bf3e1a8
R13: ffff81001bf3e1a8 R14: ffff81000455e000 R15: ffff810004933e00
FS:  00002ad311b26220(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001c145068 CR3: 0000000025d3c000 CR4: 00000000000006e0
Process genesis (pid: 28542, threadinfo ffff810005f38000, task ffff81001201f860)
Stack:  ffff81000455e000 ffff81003cf46368 ffff81001b99c9a8 0000000000000000
 ffff810004933f18 ffffffff88617fed ffff810005f39928 0000000000000000
 ffff810005f39938 ffffffff80063aed ffff81001201fa50 ffff81003cf46610
Call Trace:
 [<ffffffff88617fed>] :gfs2:gfs2_rindex_hold+0x3a/0x153
 [<ffffffff80063aed>] __wait_on_bit+0x60/0x6e
 [<ffffffff88604138>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b67>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff886077b9>] :gfs2:gfs2_dinode_dealloc+0x8e/0x1a7
 [<ffffffff8861402e>] :gfs2:gfs2_delete_inode+0xf1/0x18d
 [<ffffffff88613f83>] :gfs2:gfs2_delete_inode+0x46/0x18d
 [<ffffffff8860bb15>] :gfs2:gfs2_meta_wait+0x18/0x83
 [<ffffffff88613f3d>] :gfs2:gfs2_delete_inode+0x0/0x18d
 [<ffffffff8002f2b6>] generic_delete_inode+0xc6/0x143
 [<ffffffff800e6fe7>] prune_one_dentry+0x4d/0x76
 [<ffffffff8002e76f>] prune_dcache+0x10f/0x149
 [<ffffffff800e7027>] shrink_dcache_memory+0x17/0x30
 [<ffffffff8003ef9f>] shrink_slab+0xdc/0x153
 [<ffffffff800c8e27>] try_to_free_pages+0x1c1/0x2b9
 [<ffffffff8000f265>] __alloc_pages+0x1cb/0x2ce
 [<ffffffff8861b34d>] :gfs2:gfs2_do_trans_begin+0x102/0x144
 [<ffffffff8860d7ae>] :gfs2:gfs2_write_begin+0x1a0/0x342
 [<ffffffff8860f1ab>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
 [<ffffffff8008ac42>] enqueue_task+0x41/0x56
 [<ffffffff8860f5e1>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
 [<ffffffff800e5c0d>] __posix_lock_file_conf+0x376/0x3be
 [<ffffffff8860f784>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
 [<ffffffff885e09f5>] :dlm:dlm_posix_lock+0x1ff/0x210
 [<ffffffff8009e9e2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8009e9e2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8860ffab>] :gfs2:gfs2_open+0x0/0x13a
 [<ffffffff8860f8d4>] :gfs2:gfs2_file_write+0x49/0xa7
 [<ffffffff80016723>] vfs_write+0xce/0x174
 [<ffffffff80016ff0>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 0f 0b 68 5d db 61 88 c2 bd 03 be 01 00 00 00 4c 89 ef e8 45 
RIP  [<ffffffff886059fd>] :gfs2:gfs2_glock_nq+0x231/0x273
 RSP <ffff810005f39898>
 <0>Kernel panic - not syncing: Fatal exception

Comment 2 Robert Peterson 2008-10-01 23:23:20 UTC
Created attachment 319178 [details]
Proposed patch

After analyzing the call trace above, it appears that the "bad"
sequence of events was something like this:

gfs2_write_begin
   gfs2_inplace_reserve
      gfs2_rindex_hold
   ...
   gfs2_trans_begin
      try_to_free_pages
         shrink_slab
            gfs2_delete_inode
               gfs2_dinode_dealloc
                  gfs2_rindex_hold

It's my fear that the call to shrink_slab can occur at unpredictable
places, but it will end with gfs2_dinode_dealloc getting called.
Therefore, it seems reasonable to me to say that when we get to function
gfs2_dinode_dealloc, we may sometimes already have the rindex locked.
Therefore, I concocted this patch that basically makes that function
first check to see if the rindex is already held.

The function goes on to grab an exclusive lock on the rgrp, which may
also be held.  This patch doesn't cover that case.  After that, the
function does a trans_begin which grabs an exclusive glock for the
transaction glock, then log reserve.  These are also potential problems.
So this patch is not a final solution, just a beginning.
Hopefully Steve Whitehouse will have some input on this matter.

Comment 3 Steve Whitehouse 2008-10-03 12:56:35 UTC
This looks like the bug might be elsewhere. The allocation in gfs2_do_trans_begin() very clearly has the GFP_NOFS flag set. So I rather think that maybe the memory subsystem is broken.... there is no way that we should be calling shrink slab for the dcache for such allocations, after all thats the whole point of GFP_NOFS.

It would certainly explain why we had the problem after the other patch was applied. So I wonder what recent patches there have been which affect the slab relaim code.

Comment 4 Nate Straz 2008-10-03 19:51:59 UTC
After further inspecting my configuration while trying to verify 462579, I discovered that I had an old kmod-gfs2 package installed and I was loading gfs2.ko built Sep 19 instead of the kmod from the 2.6.18-116.gfs2abhi.001 kernel built Sep 29.  I need to rerun these tests.

Comment 5 Nate Straz 2008-10-06 14:25:51 UTC
After re-running with kmod-gfs2 removed and confirming that GFS2 was built the same day as the kernel, my tests have all succeeded.

Comment 6 Steve Whitehouse 2008-10-07 10:20:41 UTC

*** This bug has been marked as a duplicate of bug 460218 ***