Bug 502977 - panic in posix_lock_file() with GFS2 over NFS
Summary: panic in posix_lock_file() with GFS2 over NFS
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 5.4
Assignee: Steve Whitehouse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 502912 533192
 
Reported: 2009-05-28 04:48 UTC by Lachlan McIlroy
Modified: 2018-10-27 15:17 UTC
CC List: 22 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-18 12:41:52 UTC
Target Upstream Version:
Embargoed:


Attachments
Program to reproduce panic (795 bytes, text/x-csrc), 2009-05-28 04:50 UTC, Lachlan McIlroy
debugging patch for reference (2.90 KB, text/plain), 2009-10-30 21:04 UTC, David Teigland
debugging patch (8.14 KB, text/plain), 2009-11-06 19:34 UTC, David Teigland
flock-loop test program (1.15 KB, text/plain), 2009-11-06 20:55 UTC, David Teigland
latest testing/debugging patch (13.16 KB, patch), 2009-11-12 23:19 UTC, David Teigland
example patch (3.68 KB, text/plain), 2010-02-26 16:33 UTC, David Teigland
updated patch (4.45 KB, text/plain), 2010-03-25 21:36 UTC, David Teigland
patch (4.51 KB, application/octet-stream), 2010-03-25 22:04 UTC, David Teigland
patch (4.51 KB, text/plain), 2010-03-26 19:03 UTC, David Teigland

Description Lachlan McIlroy 2009-05-28 04:48:32 UTC
Description of problem:

The following error messages were seen in the logs:

----------------
May  5 14:00:19 heim1 kernel: lockd: grant for unknown block
May  5 14:00:19 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:00:19 heim1 kernel:  
May  5 14:01:52 heim1 kernel: lockd: grant for unknown block
May  5 14:01:52 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:01:52 heim1 kernel:  
May  5 14:02:36 heim1 kernel: lockd: grant for unknown block
May  5 14:02:36 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:02:36 heim1 kernel:  
May  5 14:04:28 heim1 kernel: lockd: grant for unknown block
May  5 14:04:28 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:28 heim1 kernel:  
May  5 14:04:37 heim1 kernel: lockd: grant for unknown block
May  5 14:04:37 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:37 heim1 kernel:  
May  5 14:04:50 heim1 kernel: lockd: grant for unknown block
May  5 14:04:50 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:50 heim1 kernel:  
May  5 14:06:52 heim1 kernel: lockd: grant for unknown block
May  5 14:06:52 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:06:52 heim1 kernel:  
May  5 14:08:06 heim1 kernel: lockd: grant for unknown block
May  5 14:08:06 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:08:06 heim1 kernel:  
----------------

Kernel Oops on a cluster node:

This is node 1 of a two node cluster set up to NFS export home directories.  

Mar 17 11:02:50 heim1 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
Mar 17 11:02:50 heim1 kernel:  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel: PGD 221e6d067 PUD 22227d067 PMD 0
Mar 17 11:02:50 heim1 kernel: Oops: 0000 [1] SMP
Mar 17 11:02:50 heim1 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Mar 17 11:02:50 heim1 kernel: CPU 4
Mar 17 11:02:50 heim1 kernel: Modules linked in: ip_vs nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lock_dlm gfs2(U) dlm configfs bonding ipv6 xfrm_nalgo crypto_api dm_emc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ide_cd i5000_edac sg e1000e edac_mc bnx2 cdrom serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 17 11:02:50 heim1 kernel: Pid: 3982, comm: gfs_controld Tainted: G      2.6.18-128.1.1.el5 #1
Mar 17 11:02:50 heim1 kernel: RIP: 0010:[<ffffffff800e4e68>]  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel: RSP: 0018:ffff810221ee3ea0  EFLAGS: 00010246
Mar 17 11:02:50 heim1 kernel: RAX: 0000000000000000 RBX: ffff81012c695000 RCX: 0000000000000000
Mar 17 11:02:50 heim1 kernel: RDX: 0000000000000000 RSI: ffff81012c695070 RDI: ffff81022d47a380
Mar 17 11:02:50 heim1 kernel: RBP: ffff81012c695070 R08: 0000000000000000 R09: 7fffffffffffffff
Mar 17 11:02:50 heim1 kernel: R10: 000000000000000c R11: 000000000003a2b4 R12: ffff81022d47a380
Mar 17 11:02:50 heim1 kernel: R13: ffff81022f14e8e0 R14: ffffffff88677fdf R15: 000000000cae4450
Mar 17 11:02:50 heim1 kernel: FS:  00002b6752d64a10(0000) GS:ffff81022fc1ed40(0000) knlGS:0000000000000000
Mar 17 11:02:50 heim1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 17 11:02:50 heim1 kernel: CR2: 0000000000000010 CR3: 00000002243e9000 CR4: 00000000000006e0
Mar 17 11:02:50 heim1 kernel: Process gfs_controld (pid: 3982, threadinfo ffff810221ee2000, task ffff81022f489040)
Mar 17 11:02:50 heim1 kernel: Stack:  ffffffff885377f6 0000000100000001 0000000200000000 000000010000000c
Mar 17 11:02:50 heim1 kernel:  0007000200000000 000000000003a2b4 0000000000000000 7fffffffffffffff
Mar 17 11:02:50 heim1 kernel:  000000000000000c ffff81022fe658c0 0000000000000040 00007fff57cd43c0
Mar 17 11:02:50 heim1 kernel: Call Trace:
Mar 17 11:02:50 heim1 kernel:  [<ffffffff885377f6>] :dlm:dev_write+0x157/0x207
Mar 17 11:02:50 heim1 kernel:  [<ffffffff8001659e>] vfs_write+0xce/0x174
Mar 17 11:02:50 heim1 kernel:  [<ffffffff80016e6b>] sys_write+0x45/0x6e
Mar 17 11:02:50 heim1 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 17 11:02:50 heim1 kernel:
Mar 17 11:02:50 heim1 kernel:
Mar 17 11:02:50 heim1 kernel: Code: 48 8b 78 10 e9 fd fb ff ff 41 57 49 89 ff 41 56 41 55 41 54
Mar 17 11:02:50 heim1 kernel: RIP  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel:  RSP <ffff810221ee3ea0>
Mar 17 11:02:50 heim1 kernel: CR2: 0000000000000010
Mar 17 11:02:50 heim1 kernel:  <0>Kernel panic - not syncing: Fatal exception

The filesystem is GFS2 and the panic happens on the node that is exporting the GFS2 filesystem through NFS in the cluster configuration.


Version-Release number of selected component (if applicable):
This panic has been reproduced on 2.6.18-150.el5.

How reproducible:
100%

Steps to Reproduce:

This is how I reproduced it - it may not represent what the customer was doing.

1. Setup cluster with two nodes and a GFS2 filesystem.
2. Export the GFS2 filesystem via NFS from node A and mount on node B
3. Export the GFS2 filesystem via NFS from node B and mount on node A
4. On both NFS clients run this:

for i in `seq 1 1000`
do
./flock $i &
done

[see attached flock.c for flock program]
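
[The attached flock.c is not inlined in this report.  Purely as an illustration, a minimal sketch of this kind of lock-stress client follows; the file naming, the use of fcntl() whole-file locks rather than flock(2), and the iteration count are assumptions, not the contents of the actual attachment.]

/* sketch only: open a per-instance file on the NFS mount and
 * repeatedly take and drop a whole-file write lock */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char path[64];
        struct flock fl;
        int fd, i;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file-number>\n", argv[0]);
                return 1;
        }

        snprintf(path, sizeof(path), "file%s", argv[1]);
        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (i = 0; i < 10000; i++) {
                memset(&fl, 0, sizeof(fl));
                fl.l_type = F_WRLCK;
                fl.l_whence = SEEK_SET;         /* start 0, len 0 = whole file */
                if (fcntl(fd, F_SETLKW, &fl) < 0) {
                        perror("F_SETLKW");
                        return 1;
                }
                fl.l_type = F_UNLCK;
                if (fcntl(fd, F_SETLK, &fl) < 0) {
                        perror("F_UNLCK");
                        return 1;
                }
        }
        close(fd);
        return 0;
}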

One of the nodes will panic within 15 mins.


Additional info:

crash> bt
PID: 5229   TASK: ffff81022b1cc0c0  CPU: 1   COMMAND: "gfs_controld"
#0 [ffff81021d989c00] crash_kexec at ffffffff800aaa19
#1 [ffff81021d989cc0] __die at ffffffff8006520f
#2 [ffff81021d989d00] do_page_fault at ffffffff80066e1c
#3 [ffff81021d989df0] error_exit at ffffffff8005dde9
   [exception RIP: posix_lock_file+6]
   RIP: ffffffff800e4e68  RSP: ffff81021d989ea0  RFLAGS: 00010246
   RAX: 0000000000000000  RBX: ffff8101bd756200  RCX: 0000000000000000
   RDX: 0000000000000000  RSI: ffff8101bd756270  RDI: ffff8101bcb4a980
   RBP: ffff8101bd756270   R8: 0000000000000000   R9: 7fffffffffffffff
   R10: 0000000000000000  R11: 000000000003a2b4  R12: ffff8101bcb4a980
   R13: ffff81022e1e64e0  R14: ffffffff88675fdf  R15: 0000000006119280
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#4 [ffff81021d989ea0] dev_write at ffffffff885357f6
#5 [ffff81021d989f10] vfs_write at ffffffff8001659e
#6 [ffff81021d989f40] sys_write at ffffffff80016e6b
#7 [ffff81021d989f80] tracesys at ffffffff8005d28d (via system_call)
   RIP: 00000034130c56a0  RSP: 00007fffa5d97f88  RFLAGS: 00000246
   RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
   RDX: 0000000000000040  RSI: 00007fffa5d98080  RDI: 000000000000000a
   RBP: 0000000000000002   R8: 000000003a0698eb   R9: 00000000702bc85c
   R10: 0000000049d63f88  R11: 0000000000000246  R12: 0000000006119280
   R13: 00000000061192c0  R14: 0000000006119280  R15: 0000000006119280
   ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>
crash> dis posix_lock_file
0xffffffff800e4e62 <posix_lock_file>:   mov    0x10(%rdi),%rax
0xffffffff800e4e66 <posix_lock_file+4>: xor    %edx,%edx
0xffffffff800e4e68 <posix_lock_file+6>: mov    0x10(%rax),%rdi
0xffffffff800e4e6c <posix_lock_file+10>:        jmpq   0xffffffff800e4a6e <__posix_lock_file_conf>
crash> px *(struct file *)0xffff8101bcb4a980
$2 = {
 f_u = {
   fu_list = {
     next = 0x0,
     prev = 0xffffffff8003c0d9
   },
   fu_rcuhead = {
     next = 0x0,
     func = 0xffffffff8003c0d9 <file_free_rcu>
   }
 },
 f_dentry = 0x0,
 f_vfsmnt = 0x0,
 f_op = 0xffffffff88570a40,
 f_count = {
   counter = 0x0
 },
 f_flags = 0x8000,
 f_mode = 0xd,
 f_pos = 0x0,
 f_owner = {
   lock = {
     raw_lock = {
       lock = 0x1000000
     }
   },
   pid = 0x0,
   uid = 0x0,
   euid = 0x0,
   security = 0x0,
   signum = 0x0
 },
 f_uid = 0x96,
 f_gid = 0x96,
 f_ra = {
   start = 0x0,
   size = 0x0,
   flags = 0x0,
   cache_hit = 0x0,
   prev_page = 0xffffffffffffffff,
   ahead_start = 0x0,
   ahead_size = 0x0,
   ra_pages = 0x20,
   mmap_hit = 0x0,
   mmap_miss = 0x0
 },
 f_version = 0x0,
 f_security = 0x0,
 private_data = 0x0,
 f_ep_links = {
   next = 0xffff8101bcb4aa50,
   prev = 0xffff8101bcb4aa50
 },
 f_ep_lock = {
   raw_lock = {
     slock = 0x1
   }
 },
 f_mapping = 0xffff8101be425478
}

We've crashed because the struct file pointer passed to posix_lock_file() appears to have been freed (i.e. f_count is zero and other fields have been reset, in particular f_dentry, which we panicked trying to dereference).

I suspect it's a bug in dlm (i.e. it needs to take an additional reference on the struct file when it's saved off in dlm_posix_lock() to prevent it from being freed before gfs_controld gets to it).

This issue looks similar to:

https://bugzilla.redhat.com/show_bug.cgi?id=470074
and
https://bugzilla.redhat.com/show_bug.cgi?id=466677

but since the problems still exist in 2.6.18-150 they weren't fixed in those BZs.

Comment 1 Lachlan McIlroy 2009-05-28 04:50:00 UTC
Created attachment 345702 [details]
Program to reproduce panic

Comment 2 David Teigland 2009-06-01 17:51:49 UTC
also similar to
https://bugzilla.redhat.com/show_bug.cgi?id=471254

but that patch should be in the -150 kernel as well.

Comment 3 David Teigland 2009-06-10 19:11:44 UTC
I'm testing with 5.4 beta, 2.6.18-151.el5xen.

nodes xen1 and xen2 are exporting,
# cat /etc/exports 
/gfs           *(rw,insecure,no_root_squash)

node xen3 mounts from xen1, node xen4 mounts from xen2.

xen3 and xen4 have been running my own test as well as the flock test in comment 1, and they all seem to work fine.  Given that these are all VMs on one host, everything is very slow.

I've tried both gfs1 and gfs2 as the shared fs between xen1 and xen2.

I'll next try xen2 mounting xen1's export, and xen1 mounting xen2's export,
although I wouldn't be too surprised if that arrangement produced an odd problem somewhere (and I wouldn't be too concerned about it.)

Comment 4 David Teigland 2009-06-10 19:18:34 UTC
Initial results from xen2 mounting xen1's export and xen1 mounting xen2's export.
The flock test often stops and doesn't make any progress on either node; I don't know why, and I didn't notice this when using separate clients and servers.  I've seen a couple of "lockd: grant for unknown block" messages on each node after running for a few minutes.

Comment 6 David Teigland 2009-10-02 15:46:41 UTC
I've not been able to reproduce this.  Lachlan, could you try this again with separate nodes exporting and importing?  I was not able to reproduce in either case, but nodes both exporting and importing the same fs isn't a configuration we want to worry about.

Comment 7 Lachlan McIlroy 2009-10-05 07:21:06 UTC
Okay.  I have a 4 node cluster with node 1 exporting the GFS2 filesystem via NFS to node 3 and node 2 exporting to node 4 with the test running on the NFS clients on nodes 3 and 4.  I'm running 2.6.18-150 again on all nodes.  I still see the "grant for unknown block" and "dangling lock" messages but so far no panic.  I'll let it run overnight.  Separating the NFS client and servers onto different nodes may change the load/timing enough to avoid the problem but the bug will still be lurking.

Comment 8 Lachlan McIlroy 2009-10-06 00:28:11 UTC
It ran all night without panicking.  This morning I noticed the flock processes had been killed so I tried to unmount the NFS filesystems on nodes 3 and 4 and got EBUSY on both.  The flock processes had not terminated yet and still had references to the filesystem.  Slowly they terminated but not before node 1 panicked in dlm:dev_write() as above.

Comment 9 David Teigland 2009-10-06 14:23:27 UTC
OK, thanks, I'll get a cluster set up to try this again.

Comment 16 MSSL computing group 2009-10-30 14:50:36 UTC
This bug is manifesting under GFS1 as well as GFS2 (on a production system).

Comment 17 Steve Whitehouse 2009-10-30 14:55:46 UTC
Both GFS and GFS2 use the same code to deal with posix locks, so it's quite likely that any bugs in this area will be shared between the two code bases.

Comment 18 David Teigland 2009-10-30 19:17:52 UTC
Have all four normal test nodes back.  Testing with upstream kernel 2.6.32-rc5
because it's easier to debug and should be about the same code in this area.
Reproduced the same bug with much less load.

node1 and node2 have gfs mounted
node3 mounts node1:/gfs /gfs
node4 mounts node2:/gfs /gfs

node3 and node4 each run three instances of looping flock test all in
foreground (and modified to show output on each iteration), on files 1, 2, 3

flock-loop 1
flock-loop 2
flock-loop 3

This ran for several minutes; periodically one to three of the flock-loop
instances would block for up to a minute at a time before resuming, for
reasons unknown.

While running, node1 had a single "lockd: grant for unknown block" message, and
node2 had none.  Neither reported a "dlm_plock_callback: lock granted" message.

Eventually the original oops occurred on node2:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8110deee>] posix_lock_file+0x8/0x13
PGD 6cc05067 PUD 6cc06067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: nfsd nfs_acl auth_rpcgss exportfs gfs2 dlm configfs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp autofs4 lockd sunrpc ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg serio_raw button tg3 libphy i2c_nforce2 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod qla2xxx scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9630, comm: dlm_controld Not tainted 2.6.32-rc5 #2 ProLiant DL145 G2
RIP: 0010:[<ffffffff8110deee>]  [<ffffffff8110deee>] posix_lock_file+0x8/0x13
RSP: 0018:ffff88006cc63e88  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88005e90c270 RCX: ffff88006ce5cbb8
RDX: 0000000000000000 RSI: ffff88005e90c2e0 RDI: ffff88007ee752f8
RBP: ffff88006cc63e88 R08: ffffffffa02de53a R09: ffffffff8132af3c
R10: ffffffff810dd0e1 R11: 0000000000000206 R12: ffff88007eaac128
R13: ffff88007ee752f8 R14: ffffffffa02c2ca2 R15: 000000000133df60
FS:  00007fc9471286e0(0000) GS:ffff880083a00000(0000) knlGS:00000000f777a6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 000000006cc04000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_controld (pid: 9630, threadinfo ffff88006cc62000, task ffff88006ce5c4c0)
Stack:
 ffff88006cc63f08 ffffffffa02de495 ffff88006cc63ec8 ffffffff8115bee1
<0> 0000000100000001 0000000200000000 0000000200026817 6ce2339d00000000
<0> 00000000000308a8 0000000000000000 0000000000000000 0000000000026817
Call Trace:
 [<ffffffffa02de495>] dev_write+0x15e/0x221 [dlm]
 [<ffffffff8115bee1>] ? selinux_file_permission+0x5c/0x10f
 [<ffffffff810dc158>] vfs_write+0xae/0x14a
 [<ffffffff810dc6eb>] sys_write+0x47/0x6f
 [<ffffffff8100b9eb>] system_call_fastpath+0x16/0x1b
Code: f1 fe ff ff 41 bc db ff ff ff e8 93 d7 21 00 eb cd 48 83 c4 38 44 89 e0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 8b 47 18 48 89 e5 <48> 8b 78 50 e8 c4 fb ff ff c9 c3 48 8b 47 20 55 49 89 d0 48 89
RIP  [<ffffffff8110deee>] posix_lock_file+0x8/0x13
 RSP <ffff88006cc63e88>
CR2: 0000000000000050
---[ end trace 8a173e5ae5b588d3 ]---


dlm_controld  ? 0000000000000003     0  9630      1 0x00000080
 ffff88006cc63c08 0000000000000046 0000000000000000 ffff88006ce5c4c0
 ffff88006cc63bd8 ffff88006ce5c4c0 ffff88007f414400 ffff88006ce5c870
 00000001045b7c19 0000000000000046 ffffffff814f4018 ffffffff814f4000
Call Trace:
 [<ffffffff81047730>] do_exit+0x655/0x66e
 [<ffffffff8132c2fb>] oops_end+0xb2/0xba
 [<ffffffff810280b5>] no_context+0x1ec/0x1fb
 [<ffffffff810282ea>] __bad_area_nosemaphore+0x16c/0x18f
 [<ffffffff8102834c>] __bad_area+0x3f/0x48
 [<ffffffff81028373>] bad_area+0xe/0x10
 [<ffffffff8132d750>] do_page_fault+0x1fb/0x2db
 [<ffffffffa02c2ca2>] ? nlmsvc_grant_deferred+0x0/0x15a [lockd]
 [<ffffffff8132b80f>] page_fault+0x1f/0x30
 [<ffffffffa02c2ca2>] ? nlmsvc_grant_deferred+0x0/0x15a [lockd]
 [<ffffffff810dd0e1>] ? fget_light+0x4f/0xe9
 [<ffffffff8132af3c>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa02de53a>] ? dev_write+0x203/0x221 [dlm]
 [<ffffffff8110deee>] ? posix_lock_file+0x8/0x13
 [<ffffffffa02de495>] dev_write+0x15e/0x221 [dlm]
 [<ffffffff8115bee1>] ? selinux_file_permission+0x5c/0x10f
 [<ffffffff810dc158>] vfs_write+0xae/0x14a
 [<ffffffff810dc6eb>] sys_write+0x47/0x6f
 [<ffffffff8100b9eb>] system_call_fastpath+0x16/0x1b


Based on Lachlan's analysis, I'm going to investigate how the xop->file might
be getting freed after dlm_posix_lock() and before the callback, and what is,
in theory, supposed to prevent that (if anything).

Comment 19 David Teigland 2009-10-30 21:04:33 UTC
Created attachment 366868 [details]
debugging patch for reference

Collected the following info from printks in this patch, but not completely analyzed yet.

Oct 30 15:52:24 bull-01 kernel: lockd: grant for unknown block
Oct 30 15:52:24 bull-01 kernel: lockd: fl ffff88013f54b398 owner 7740398493674204011 start 7740398493674204011 end 7740398493674204011
Oct 30 15:53:28 bull-01 kernel: lockd: grant for unknown block
Oct 30 15:53:28 bull-01 kernel: lockd: fl ffff88003f9fedf8 owner 18446612137670821488 start 0 end 0
Oct 30 15:53:28 bull-01 kernel: dlm: dlm_plock_callback: 199873 fl ffff88003f9fedf8 lock granted after lock request failed; dangling lock!
Oct 30 15:53:28 bull-01 kernel: dlm: start 0 end 0

Oct 30 15:52:29 bull-02 kernel: dlm: dlm_plock_callback: 199884 fl ffff88013ed3d630 file ffff88012d369068 dentry (null)
Oct 30 15:52:29 bull-02 kernel: lockd: grant for unknown block
Oct 30 15:52:29 bull-02 kernel: dlm: dlm_plock_callback: 199884 fl ffff88013ed3d630 lock granted after lock request failed; dangling lock!
Oct 30 15:52:29 bull-02 kernel: dlm: start 7740398493674204011 end 7740398493674204011

Comment 21 David Teigland 2009-11-02 22:55:58 UTC
This problem first appears in bug 466677 which was probably never fully understood or fixed.

vfs_cancel_lock is being called constantly during these tests, which seems strange.  I'd like to understand the whole end-to-end picture of what the lock/unlock cycle is supposed to look like, how it's supposed to work, and what role cancel has in it.  gfs translates an F_CANCELLK into an F_UNLCK with a "Hack" comment.

Comment 22 David Teigland 2009-11-03 22:13:43 UTC
From comment 20 it sounds like the problem appears with a single client talking to a single server.

Comment 23 David Teigland 2009-11-03 22:28:33 UTC
I've spent the day studying lockd/svclock.c along with the dprintk output for a 
simple lock/unlock.  From the dlm side everything seems to be working correctly; it's only when the dlm calls back into lockd that there seems to be problems with the nlm_block structures.  My hunch is that it's cancel that somehow leads to the problems.  Overall, the comments in the svclock.c code are not encouraging; it sounds like there are plenty of gaps for things to go wrong even without the new async/DEFERRED behavior introduced by gfs.

Comment 24 David Teigland 2009-11-04 23:12:40 UTC
When lockd is operating correctly, one call to vfs_lock_file() should have one corresponding nlmsvc_grant_deferred() callback.  With a debugging patch, I'm setting a new B_IN_FS flag on a block when vfs_lock_file() is called, and then clearing it in nlmsvc_grant_deferred().  Before setting the flag I check that it's not already set, and before clearing the flag I check that it is set.

The first sign of problems is when lockd calls vfs_lock_file() on a block that already has B_IN_FS set.  After this, a lot of other similar errors quickly pile up indicating that the lockd block's are out of sync with the dlm locks.

I don't have any ideas why lockd may be calling vfs_lock_file() on a block that's currently busy in the fs, but I believe the root of the problems are in that direction.
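
A minimal sketch of the instrumentation described above (B_IN_FS is a flag introduced by the debugging patch, not an upstream lockd flag; the bit value and the exact call sites are assumptions based on this description, not the attached patch):

#define B_IN_FS         (1 << 4)        /* arbitrary unused bit in block->b_flags */

        /* before handing the block to the filesystem in nlmsvc_lock(): */
        if (block->b_flags & B_IN_FS)
                printk(KERN_ERR "lockd: block %p already busy in fs\n", block);
        block->b_flags |= B_IN_FS;
        error = vfs_lock_file(file->f_file, F_SETLK, &lock->fl, NULL);
        if (error != FILE_LOCK_DEFERRED)
                block->b_flags &= ~B_IN_FS;     /* no async callback will arrive */

        /* when the async reply arrives in nlmsvc_grant_deferred(): */
        if (!(block->b_flags & B_IN_FS))
                printk(KERN_ERR "lockd: block %p not B_IN_FS\n", block);
        block->b_flags &= ~B_IN_FS;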

Comment 25 Lachlan McIlroy 2009-11-05 02:12:57 UTC
My suspicion is that the file object is being freed because we don't take a reference on it before saving it away, and later we retrieve it after it has been freed.  I just ran this patch through a quick test and it hasn't crashed.

--- linux-2.6.18.x86_64/fs/dlm/plock.c.orig	2009-11-04 14:54:45.000000000 +1100
+++ linux-2.6.18.x86_64/fs/dlm/plock.c	2009-11-04 14:58:00.000000000 +1100
@@ -11,6 +11,7 @@
 #include <linux/poll.h>
 #include <linux/dlm.h>
 #include <linux/dlm_plock.h>
+#include <linux/file.h>
 
 #include "dlm_internal.h"
 #include "lockspace.h"
@@ -106,6 +107,7 @@ int dlm_posix_lock(dlm_lockspace_t *lock
 		locks_init_lock(&xop->flc);
 		locks_copy_lock(&xop->flc, fl);
 		xop->fl		= fl;
+		get_file(file);
 		xop->file	= file;
 	} else {
 		op->info.owner	= (__u64)(long) fl->fl_owner;
@@ -187,6 +189,7 @@ static int dlm_plock_callback(struct plo
 		log_print("dlm_plock_callback: vfs lock error %llx file %p fl %p",
 			  (unsigned long long)op->info.number, file, fl);
 	}
+	fput(file);
 
 	rv = notify(fl, NULL, 0);
 	if (rv) {

The "grant for unknown block" and "lock granted after lock request failed ..." messages are still appearing though so there may be more to fix.

Comment 26 Steve Whitehouse 2009-11-05 10:47:43 UTC
The patch in comment #25 doesn't look right. All locks are removed when the file is closed, so it should be impossible to have a closed file on which there are remaining locks.

Comment 27 David Teigland 2009-11-05 16:14:22 UTC
Comment #25 is a logical fix for the NULL-file oops, but I believe that it fixes a symptom and doesn't address the root cause.  (Fixing symptoms can be sensible to do, too, but I think our main goal right now is finding the root cause.)

Comment 28 Steve Whitehouse 2009-11-05 16:28:08 UTC
One thing that I couldn't see was any check for the FL_CLOSE flag, and I wonder whether that needs to be handled specifically or whether it doesn't matter.

Comment 29 David Teigland 2009-11-05 16:56:06 UTC
It does not appear to me that the problems are in the direction of the vfs, but rather in the direction of lockd.  Remember, there are a lot more structures than just files and file_locks involved here, and we don't have any known problems with local locking tests, only nlm locking tests.  I'm fairly confident in the struct lifetimes/references among the first three items in this list, but not among the last three items in the list:

- struct file
- struct file_lock
- struct plock_op
- struct nlm_file
- struct nlm_block

At the moment I don't have the impression this is even a struct lifetime or reference counting issue at the core.  As I said above, the first sign that things are off is when lockd seems to call vfs_lock_file() on a lock that's currently in the middle of vfs_lock_file().

Comment 30 David Teigland 2009-11-06 19:34:45 UTC
Created attachment 367874 [details]
debugging patch

This patch adds the B_IN_FS flag to a block that's busy in the fs as mentioned in the earlier comment.  It also takes it one step further in an attempt to fix (or at least avoid) the problems by *not* going ahead with another vfs_lock_file() if the block in question is busy with an earlier vfs_lock_file().

In my testing it has so far been successful in avoiding the problems (e.g. no "dangling locks" or "unknown blocks"), but it would be good to try the other tests that have shown problems.

My big remaining question is whether or not lockd is behaving correctly when it calls vfs_lock_file() on a block that's currently busy with a previous vfs_lock_file().  If not, then I'll pass this off to the lockd experts to debug why it's happening.  If it's legitimate or difficult to avoid, then we'll need to detect when it happens (with B_IN_FS or something equivalent) and abort it.

Comment 31 MSSL computing group 2009-11-06 19:56:03 UTC
We're getting multiple outages in production systems due to this bug (one machine goes down, a clustermate picks up the load and then crashes, etc) and users aren't happy.

Test RPMS would be handy...

Comment 32 MSSL computing group 2009-11-06 20:02:07 UTC
Not sure if related. Posted per Dave's request:

I'm seeing panics on machines when running local rsync between GFS1 and GFS2 filesystems. 

Neither FS is NFS exported, both were mounted as GFS local only (lock_nolock).

The filesystem content is 350Gb of Imap Mdir folders - approximately 3 million mostly tiny files. Some directories may contain 10k+ files but most hold far less than this. (usually a few hundred at most)

Comment 33 Steve Whitehouse 2009-11-06 20:10:03 UTC
Probably not related.  I had another report of something that sounds like this recently.  What the original reporter didn't say was that it was GFS1 -> GFS2, so when I tried to reproduce it I did GFS2 -> GFS2 and didn't manage to reproduce it.  I've not had a chance to try with GFS1 on the sending end so far.

If you have any log messages from that issue, I'd like to know, but it's probably not appropriate for this bz.

Comment 34 David Teigland 2009-11-06 20:17:07 UTC
Re comments 31 and 32, are the outages and panics due to the oops in
posix_lock_file?  Or are they other bugs not yet recorded in bugzilla?

The posix_lock_file bug we're working on here should not appear if you're using
lock_nolock, only lock_dlm; and it should also not appear unless gfs is
exported via nfs and clients are doing locking.

Comment 35 David Teigland 2009-11-06 20:55:48 UTC
Created attachment 367889 [details]
flock-loop test program

In my testing I run
flock-loop file1
flock-loop file2
flock-loop file3

on two nfs clients, each client mounting from a separate server.

Comment 36 David Teigland 2009-11-06 21:14:22 UTC
I have now seen a couple of errors even with the patch from comment 20:

lockd: grant for unknown block, result 0
grant fl t 2 p 14197 o ffff88013e1bbaa8 0-0
lockd: nlm_block list
b ffff88013eacad58 flags 0 file ffff88013eb06508 fl t 1 p 14199 o ffff88013e1bbaa8 0-0
b ffff88013eaca930 flags 8 file ffff88013eae2760 fl t 1 p 14198 o ffff88013e1bbaa8 0-0
dlm: dlm_plock_callback: 30ccc fl ffff88007e262598 lock granted after lock request failed; dangling lock!

This is a different kind of error from most of the "unknown block" / "dangling lock" cases I was seeing without the patch.  In this case the fl described in the callback has type 2 (F_UNLCK) which should never be the case in a callback.

And then separately,

nlmsvc_grant_deferred block ffff88007ee736d0 not B_IN_FS

Comment 37 David Teigland 2009-11-06 22:53:57 UTC
Another thing to try changing is nlmsvc_unlock() which does:

nlmsvc_cancel_blocked()
    vfs_cancel_lock()
vfs_lock_file(F_UNLCK)

Since gfs/dlm does not have the ability to cancel locks, it converts the vfs_cancel_lock() call into an ordinary unlock.  So, lockd ends up calling unlock twice back to back, first from vfs_cancel_lock() and second from vfs_lock_file(F_UNLCK).  I'll probably try removing the call to nlmsvc_cancel_blocked() altogether and see what happens.
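
For reference, vfs_cancel_lock() simply forwards the cancel to the filesystem's ->lock() operation; the body below is paraphrased from fs/locks.c of this era, and the gfs conversion it hits is the "Hack" snippet quoted later in this bz:

/* paraphrased from fs/locks.c; gfs/gfs2's ->lock() op then rewrites
 * F_CANCELLK into F_SETLK + F_UNLCK, which is why lockd ends up
 * unlocking twice back to back in the sequence above */
int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
{
        if (filp->f_op && filp->f_op->lock)
                return filp->f_op->lock(filp, F_CANCELLK, fl);
        return 0;
}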

Comment 39 David Teigland 2009-11-09 20:30:10 UTC
Removing the calls to nlmsvc_cancel_blocked() did seem to improve things, but I don't have any specific examples.  I was still seeing cases where a nlmsvc_grant_deferred() callback would occur with bogus fl data that would fail to match the fl in the block it should have matched.  This is similar to the bug we recently fixed to pass the pointer of the original fl into the callback instead of the flc (copy of the original fl), because the flc ranges are modified by the vfs.  Since it appears the original fl is being clobbered, causing it to not match the lock it's supposed to, I changed the dlm to make a second copy (flc2) of the original fl to pass back to nlmsvc_grant_deferred().  This appears to have fixed the problem of nlmsvc_grant_deferred() failing to find any matching blocks.  I'm still seeing an occasional occurrence of nlmsvc_lock() being called on a block that is currently busy in the dlm from a previous nlmsvc_lock() call.  I'm still dealing with that by aborting and returning in those cases.

Comment 40 David Teigland 2009-11-09 20:46:13 UTC
Summary of the functional changes I've made:

1. In dlm_plock_callback(), create a second copy of the original file_lock, and pass this copy into the fl_grant callback instead of the pointer to the original file_lock (a rough sketch of this change follows the list).

2. In lockd set B_IN_FS in block->b_flags before calling vfs_lock_file(), and clear it in nlmsvc_grant_deferred() (if DEFERRED was returned, otherwise clear it right after the vfs call)

3. In nlmsvc_lock(), check if B_IN_FS is set, and return without calling into the fs if it is.

4. In nlmsvc_unlock(), remove the call to nlmsvc_cancel_blocked().

5. In nlmsvc_cancel_blocked(), return immediately without doing anything.
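
A rough sketch of change 1, based on the description above (flc2 as a new field of struct plock_xop and the exact placement are assumptions, not the actual patch, which is attached separately):

        /* in dlm_posix_lock(), when the op is created, keep a second,
         * pristine copy of the caller's file_lock alongside flc: */
        locks_init_lock(&xop->flc2);
        locks_copy_lock(&xop->flc2, fl);

        /* in dlm_plock_callback(), hand lockd the pristine copy instead
         * of the original fl pointer, which may have been modified in
         * the meantime and so no longer match its nlm_block: */
        rv = notify(&xop->flc2, NULL, 0);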

Comment 41 MSSL computing group 2009-11-10 11:48:49 UTC
Response to #34:

#31 crashes are all posix lock related on nfs exported filesystems

#32 crashes are glocks on non-exported filesystems

Comment 42 David Teigland 2009-11-10 16:49:20 UTC
addition to comment 40,

6. In dlm_plock_callback(), check whether the saved struct file is still valid (file->f_path.dentry is non-NULL), and if it is not, don't call posix_lock_file() (which oopses if passed a bad struct file).
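
A minimal sketch of that check, assuming the upstream tree being tested here (the exact placement inside dlm_plock_callback() and the error handling are assumptions):

        file = xop->file;
        if (!file || !file->f_path.dentry) {
                /* lockd has already closed the file out from under us;
                 * calling posix_lock_file() on it would oops, so skip
                 * the vfs bookkeeping for this op */
                log_print("dlm_plock_callback: %llx invalid file %p",
                          (unsigned long long)op->info.number, file);
                goto out;
        }
        rv = posix_lock_file(file, fl, NULL);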

Comment 43 David Teigland 2009-11-10 18:12:39 UTC
Since most of the changes I've been trying are work-arounds anyway (I've mostly given up for now on finding root problems), I tried get_file/fput, but killing flock-loop on the clients triggered the BUG in fs/locks.c locks_remove_flock() from the fput().

Comment 44 David Teigland 2009-11-11 17:10:13 UTC
I'm relatively confident in the broad reasons behind the problems we're seeing.  I have the impression that the problems are structural/design ones, and not traceable to a specific root bug that can be fixed (although I'd sure like to be wrong on that.)

lockd sends off async plock ops to the fs.

lockd is not careful to collect the async reply from the fs before doing something else that may interfere with the op it has sent.

Instead, lockd can do a number of things after it sends off the op and before it has processed a reply for it:
- it can send off another lock op on the same file for the same holder
- it can send off an unlock op on the same file for the same holder
- it can close and free the file

The first two can confuse the dlm plock code, although I've added some additional checking to detect and ignore them (partially anyway).

The first two can also result in callback errors where lockd has forgotten about the op it fired off and can no longer match up the reply when it arrives,
i.e. lockd: grant for unknown block

The last one is especially troubling, because the dlm needs the struct file to be "valid" at the time the op completes so that it can do the vfs "bookkeeping".  If lockd has pulled the file out from under the dlm, it will result in oopses of various kinds (even if the dlm adds its own get_file reference) in the vfs locking code (fs/locks.c).

The dlm assumes that the plock caller is "well behaved", where that's defined as the behavior that it would see from a local process doing a plock operation.  lockd is not well behaved in this sense; it behaves differently in the three ways (at least) listed above.

The lockd changes that were made to accommodate async locks were minimal.  They assumed that lockd could work in largely the same way for sync or async implementations, and seemed to ignore the issue of things happening in the async window which could interfere with an incomplete call.

Comment 45 MSSL computing group 2009-11-11 20:53:01 UTC
It wouldn't surprise me if you're right about lockd. 

We already found that nfs utils aren't wonderfully coded and had to wrap all /usr/share/cluster/nfsclient.sh exportfs calls with flock statements in order to have multiple NFS services start/stop without tripping over each other.

Comment 46 MSSL computing group 2009-11-12 19:01:18 UTC
I'm guessing that switching to NFSv4 won't help matters?

Comment 47 David Teigland 2009-11-12 23:16:24 UTC
NFSv4 won't change the dlm/lockd interactions, but it may have some effect on the server/client interactions, which may have some indirect effects on dlm/lockd parts, I don't know.

Comment 48 David Teigland 2009-11-12 23:19:32 UTC
Created attachment 369346 [details]
latest testing/debugging patch

My recent attempts have been to approach this as much as possible from the dlm side and avoid lockd changes.  This current patch does seem to be holding up better than average, although there are still issues.

Comment 49 MSSL computing group 2009-11-13 14:24:52 UTC
FWIW: All the nfs exports here are sync, not async.

To be honest at this point I think that lockd/nfs work would be more productive overall, but I'm a big fan of belt+braces+safetypin approaches.

The Samba project's CTDB project (http://ctdb.samba.org/) has some NFS work included which might be helpful.

Comment 51 David Teigland 2009-11-13 19:44:27 UTC
I don't seem to have any of the locking problems at all when using nfs4!
nfs4 is not using the dubious async lock completions like lockd does, but ordinary synchronous calls like local processes do.  I'm not sure why we've not realized this before, and why we've spent so much time trying to make lockd work on gfs/dlm (many other bz's before this one) rather than simply limiting nfs+gfs file locking to nfs4 configurations.  I'm going to seriously consider removing the async plock code from the dlm altogether.

Can you verify that switching to nfs4 solves all your file locking problems?
If so we can close this bz.

Comment 53 MSSL computing group 2009-11-19 16:56:08 UTC
Unfortunately we can't completely remove nfs3 (or 2) from the servers - there are older OSes involved which don't have NFSv4.

I'm in the process of migrating the RH clients to NFS4 - which should help a lot, but it's already exposed that cluster.conf doesn't support "mount --bind"

Comment 54 Alan Brown 2009-11-20 00:44:42 UTC
After a few hours of testing: Superficially NFSv4 seems to work without causing GFS issues, but it's not been tested in anger yet.

Additionally there are other problems: 

 Client delegation doesn't work (syslogging lots of errors as a result)

 IMPORTANT: file locking doesn't work at all! We have /home on NFS and being unable to lock .Xauthority + KDE startup files means that GUI logins don't work.

Serverside is exported using rw,insecure,no_subtree_check,sync,nohide

Clientside has: rw,nosuid,nodev,_netfs,fsc,acregmin=10,acregmax=120,acdirmax=120,timeo=600,retry=1,lock,hard,intr

I realise this is a bit beyond the scope of the ticket, but getting this working would allow us to drop v3 on 100+ clients and fully test if it gets round the crash.

Comment 55 MSSL computing group 2009-11-20 11:38:42 UTC
Sorted the file locking issue - it was a broken kernel; 2.6.18-164.6.1.el5 or later is required.

Under heavy load NFSv4 is basically unusable. We had to revert to v3 so that people could work - so we're back in the crossfire of this posix locking issue :(

Comment 56 David Teigland 2009-11-20 19:28:23 UTC
Perhaps we should open a new bz to look at the nfs4 performance problems?
After two weeks working on nfs3 posix locks without a fix, I think we should spend a little time seeing if we can address the nfs4 performance.

Comment 57 David Teigland 2009-11-30 19:25:25 UTC
Is anyone looking into the nfs4 performance issues?

Comment 58 Alan Brown 2009-11-30 22:55:16 UTC
Not as such. Most of the performance issues turned out to be self-inflicted (dialled down the number of nfsd threads, unaware this also affects nfsd4, lack of documentation is a cow).

Even with 2.6.18-164.6.1.el5 there are still serious nfs4 issues apparent which make it effectively unusable in a cluster environment (not least of which is that with v4recovery on a cluster disk as recommended, clients are only able to mount shares from one server. If different subsets are on different servers the other shares are unreachable)


On the NFS3/GFS front, I've found it only takes ONE client doing sustained heavy nfs writes(*) on an otherwise idle disk to be able to knock a server over. Given that not all clients are NFS4 capable this is still a serious shortcoming which needs addressing.

(*) The client was running a 100Gb+ scp from an OSX box on the LAN to the nfs mounted disk.

Comment 59 Alan Brown 2009-12-07 13:08:39 UTC
While trawling bugzilla I ran across BZ 531493 and thought it was worth testing on nfs3+GFS - the server got very sluggish after about 16G had been xferred so the test got aborted. 

Is this related or am I chasing mirages?

Comment 60 Lachlan McIlroy 2009-12-08 01:40:36 UTC
Alan, I doubt BZ 531493 is related to this bug.  That bug results in the system hanging, whereas this bug results in a crash preceded by lockd/dlm error messages.

Comment 61 Alan Brown 2009-12-08 16:59:22 UTC
In our case the recent spate of crashes has been preceded by the servers running sluggishly, thanks to a user trying to put 20+Gb files on the gfs fs via nfs.

We have lock/dlm errors logged every couple of minutes under normal circumstances but only occasional crashes unless large file xfers are underway as well.

Comment 62 Perry Myers 2010-02-01 15:00:37 UTC
Just a few notes about using NFS+GFS in general.

Active/Active usage of NFS on GFS1/2 is not presently supported.  Active/Passive (i.e. failover) of NFS on GFS1/2 is supported and should work fine.

Here are the relevant docs on this:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_NFS_Over_GFS/NFS_GFS_Overview.html

"Note that this configuration is not a "high capacity" configuration in the sense that more than one server is providing NFS service. In this configuration, the floating IP moves about as needed, but only one server is active at a time."

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_NFS_Over_GFS/NFS_GFS_Caveats.html

"Do not use NFS locks in a failover situation such as the one described in this configuration, as this may cause a duplicate lock situation. The GFS file system correctly distributes the locks between nodes.
The Linux NFS server as used in Red Hat Enterprise Linux 5 is not cluster aware. NFS state information is held on a per machine basis and there is a slight but definite chance of file corruption should two NFS servers try to write to the same file at the same time, even in a GFS environment. Because of this the current recommendation is to only export any given file system on one NFS machine only."

So presently the only supported configuration is a single NFS server at a time using GFS1/2 filesystem.

Though it should be noted that if you completely disable NFS locking, multiple NFS servers should work fine.  But this is not officially a supported configuration.

Comment 64 Alan Brown 2010-02-01 15:36:14 UTC
(I don't think NFS and samba issues are relevant for this bugzilla except that they serve to drive loads up high enough to make the problem manifest.)

We're aware of the issues with active/active nfs sharing on any given filesystem and currently have serving of any given filesystem directed to one server only.

Nonetheless the bug manifested even with _all_ nfs activity coming off one physical machine in the cluster (3 machines) - the problem with this configuration is the load spikes to more nfs processes than one box can provide (there's a hard limit around 368 processes)

FWIW the same corruption risks apply to the EL5 version of samba, which is why the Samba project developed CTDB and have put a lot of effort into clustered NFS too. Bringing this work into RH clustering would solve a number of issues but probably not this one.

Comment 65 Perry Myers 2010-02-01 15:52:24 UTC
(In reply to comment #64)
> (I don't think NFS and samba issues are relevant for this bugzilla except that
> they serve to drive loads up high enough to make the problem manifest.)
> 
> We're aware of the issues with active/active nfs sharing on any given
> filesystem and currently have serving of any given filesystem directed to one
> server only.

Ok, reading through the bz comments I only saw mention of the active/active configuration.  But if this problem exists on a single NFS server configuration, then I agree with you.

Comment 67 David Teigland 2010-02-02 17:38:00 UTC
It has long been abundantly clear that this bug is due to incompatibility between lockd and gfs/dlm.  See comment 44 for the explanation of that.  That has nothing to do with active/active vs active/passive.

However, if you are doing active/passive, what's the point of sending the locks into gfs/dlm to "clusterize" them at all?  There is no point.  Don't do that; let the locks be managed only by nfs/lockd/vfs on the active node, just like a local fs.  Recovery (shifting from the failed active node to the passive one) should then also be handled like a local fs.

I believe that if you mount gfs with the "localflocks" option it should turn off the clusterization of all plock/flock calls and make them equivalent to plock/flock calls on a local fs.

Comment 69 Alan Brown 2010-02-03 16:08:05 UTC
comment #67 assumes that the GFS filesystem in question is ONLY being accessed via NFS (in this case why use GFS at all?)

The first thing that springs to mind requiring clusterisation of locks is other cluster nodes accessing the filesystem via GFS while running other tasks or another cluster node acting as a samba server.

Comment 70 Alan Brown 2010-02-26 16:15:36 UTC
*NUDGE*

Is anything being done on this problem? More specifically has any thought been given to sorting out the NFS suite?

Even if NFSv4 can be made to behave properly in a clustered environment, there will be v3-only clients for a long time to come.

Comment 72 David Teigland 2010-02-26 16:33:08 UTC
Created attachment 396587 [details]
example patch

This untested patch allows us to control how plocks from nfs are handled in gfs,
without resorting to localflocks.  If this new "nfslocks" mount option is used, then gfs will pass nfs plock requests on to the dlm to be clusterized (like gfs2 has done since 5.3).  Without the nfslocks mount option, nfs plocks are handled locally and not clusterized (like gfs always did prior to 5.3).

I expect gfs2 developers may want to tweak this patch according to their taste.
We should also test to verify it works as expected, of course.

Comment 73 David Teigland 2010-02-26 17:11:18 UTC
Steve, could you take a look at the patch in comment 72?

Comment 74 Steve Whitehouse 2010-03-01 13:51:37 UTC
The patch looks ok, but I'm not sure I follow the problem that we are trying to solve with it.  Why do we need to make nfs do something different to local fcntl locks?  I think it's likely to be rather confusing.

Comment 75 Alan Brown 2010-03-01 14:26:24 UTC
It's GFS, not ext4.  e.g. other cluster nodes may be accessing the filesystem (and in our case, they ARE).

Comment 76 Alan Brown 2010-03-02 17:08:20 UTC
How long before this makes its way into a test kernel?

Comment 78 David Teigland 2010-03-18 21:07:13 UTC
The issues related to the combination of posix locks, nfs, and gfs have
been very confusing to everyone.  This includes developers like myself who
haven't understood the extent of the technical issues until recently.
There may still be some problems in the mix that we don't know or
understand because the issues span several subsystems.  Given all the
confusion, different people and documents have been saying different
things about what works or what doesn't work or why.

Here's an attempt to clarify the current situation.

It's easiest to begin by describing things without nfs in the picture.
When normal programs/processes acquire posix locks (plocks) on gfs files
using fcntl(2), the plocks are "clustered", i.e. plocks between two
processes on different nodes work the same way as two processes on the
same node; locks from different nodes are aware of each other.

In this context it's helpful to understand what the "localflocks" gfs
mount option does.  When localflocks is used, any plocks (fcntl(2)) or
flocks (flock(2)) acquired on that fs are not clustered among nodes and
remain local to the node, and behave like plocks/flocks on say ext3.

Whether localflocks is acceptable depends, of course, on the specific
applications using gfs.

What changes when nfs is added to the picture?

1. Most fundamentally, the locks are no longer finally held by processes
on the cluster nodes themselves, but by processes on nfs client nodes.
This means that when a cluster node fails, the plocks of all processes on
that node cannot simply be discarded by other cluster nodes during
recovery.  Instead, the locks need to be recovered from the nfs client
nodes.  This recovery requires a significant amount of new design and
development across multiple subsystems owned by multiple groups.  This
includes defining public interfaces between different components which is
especially difficult and time consuming.  The only current development I
am aware of is related indirectly, in the area of pnfs.

2. Processes are no longer acquiring plocks from userspace via fcntl(2),
but the nfs kernel server (lockd for nfs3), are calling into gfs to
acquire the locks.  lockd behaves differently from userland processes and
does not follow the same conventions.  Perhaps the worst example of this
is that it will give up on locks after a certain amount of time and try to
cancel them.  This creates a race condition that will require serious
changes to fix.  Attempts at fixing this race have only reduced the
occurance, and suggest that a complete fix may well extend beyond the
boundaries of the isolated cluster-fs-specific code.  Changing general
lockd code for the sake of cluster file systems would be especially
challenging.

The only way to address issue 1 currently is to prevent nfs plocks from
being passed into gfs since the capability to recover them simply does not
exist in any form.  One way to prevent passing nfs plocks into gfs is to
use the localflocks option to make all plocks/flocks local.  A second way
is a patch like https://bugzilla.redhat.com/show_bug.cgi?id=502977#c72
which is more discriminating than localflocks.

The current solution to issue 1 obviates any patches to address issue 2.

Why all the recent confusion?

These issues have recently become prominent because of an unfortunate
kernel patch upstream and in RHEL 5.3 that changed the default behavior of
nfs plocks on gfs.  Prior to RHEL 5.3, nfs plocks had always been local to
the node, regardless of the underlying filesystem type (gfs, ext3, etc).
There was no code or mechanism to pass plocks from nfs into gfs.  This
changed in RHEL 5.3 with the introduction of interfaces (from the GPFS
group at IBM) to allow this passing of plock operations between nfs and an
underlying cluster fs like gfs (bz 196318).

Unfortunately, these interfaces were put to use by default on gfs, under
the mistaken assumption that nfs plocks could now be clustered just like
plocks used by local processes via fcntl(2).  This decision failed to
account for the fact that there is much more to be done in the area of
recovery coordination before nfs plocks can truly be clustered for gfs.

The added confusion of active/passive nfs on gfs.

The context thus far has been about the most "natural" way to export nfs
from gfs: all gfs nodes exporting the same file system at the same time.
However, some people are interested in an active/passive configuration
where only one of the gfs nodes does the nfs export at a time.  If the
exporting node fails, rgmanager is used to export the same fs from a
different node and move a virtual ip address.

In this configuration, we do not want nfs plocks to be passed into gfs
(there is no reason to do so); we want them to be handled and recovered in
the same way as a local fs like ext3.  If this is done (e.g. pre-5.3
behavior, localflocks or a patch that disables nfs locks being handled on
gfs), then the underlying fs is not a factor and nfs plocks should work
even if gfs is the underlying fs.

Comment 79 Alan Brown 2010-03-19 11:49:18 UTC
Dave says in #78

> The context thus far has been about the most "natural" way to export nfs
> from gfs: all gfs nodes exporting the same file system at the same time.

This is our preferred method.... 

> However, some people are interested in an active/passive configuration
> where only one of the gfs nodes does the nfs export at a time.

We are running this setup because RH have explicitly warned that NFS isn't cluster safe and any given filesystem must only be NFS exported on one node at a time or there is risk of file corruption due to nfs write locks not passing between nodes

(In other words: active:passive is a workaround, not the preferred configuration. I believe this is the general case across most entities running A:P configs)

(RH also warn customers that in an a:p setup, all nfs locks are lost in the event of the nfs service switching between nodes. This isn't generally a problem.)

HOWEVER:

In a multinode active:active configuration containing 3 or more nodes it's a fair bet that any given filesystem won't be NFS exported on all nodes at all times, even if that fs may be exported on multiple nodes.

In such a configuration, the odds are reasonably high that customers will want 
failover of a failed node's NFS service to another machine in the cluster.

In this case the same requirements as single export active:passive take effect.



There used to be (still is?) a userspace nfsd - development was stalled a long time ago in favour of kernelspace nfsd because of speed issues. Perhaps it's worth revisiting the userspace daemon for clustered purposes.

Comment 80 Steve Whitehouse 2010-03-19 14:43:09 UTC
I'm not sure that moving to a userspace nfs lockd would improve matters here. There is still the issue of how the locks are to be failed over between nodes.

Looking at the gfs2 code we have this:

        if (cmd == F_CANCELLK) {
                /* Hack: */
                cmd = F_SETLK;
                fl->fl_type = F_UNLCK;
        }

which looks to me like it is not performing lock cancellation at all, but instead queuing an unlock, unless the userland code has some other way to distinguish between these requests? I presume from the comment that this might not be the case.

I'm still looking into the NFS code to try and figure out what exactly is required of the fs in that case.

Comment 81 Alan Brown 2010-03-19 18:26:00 UTC
Dave, could we get some clarification please?

Without your patch, is the bug currently present in both GFS and GFS2 or is it only in GFS now? 

If fixed in GFS2, do you know which RH test/release kernel?

Comment 82 David Teigland 2010-03-19 18:41:04 UTC
The patch in comment 72 (adding the nfslocks option) has not been included in any build or release of gfs1 or gfs2. The gfs developers will need to do that if they approve of it.

Without that patch or something like it, the only way I know of to keep nfs plocks out of gfs is to mount gfs with localflocks.

Comment 83 Steve Whitehouse 2010-03-22 09:32:58 UTC
The question is, why is localflocks not enough? I'll put the patch in if we can justify it in some way, but I don't understand the reasons for not just using localflocks.

Comment 84 Alan Brown 2010-03-22 10:44:54 UTC
> The question is, why is localflocks not enough?

localflocks == no clustered filesystem.

No clustered filesystem == no point in running GFS.

Comment 85 Steve Whitehouse 2010-03-22 12:46:12 UTC
localflocks is independent from the filesystem's internal locking. The only reason that this other option would be needed is if there is a requirement for flock() and fcntl() locks to have different configs wrt cluster/single node.

If the application doesn't use flock() but only fcntl() locks then there is no difference between the proposed patch and the localflocks option.

Comment 86 David Teigland 2010-03-22 13:47:49 UTC
localflocks *is* enough; the point is that it may be too much for some people.
localflocks means the fs won't do any clustering of flocks or plocks, even for local processes.  nfslocks only stops clustering of nfs locks, but any plocks or flocks by other processes are still clustered.

The really critical issue, though, is that nfslocks is *off by default*, which returns us to the original pre-5.3 behavior of nfs locks being local.  We really need nfslocks to be off by default or people will continue to run into these problems (inconsistent plocks, oopses).

Comment 87 Alan Brown 2010-03-22 15:35:49 UTC
Red Hat Support advised us that filesystems mounted with localflocks should NOT be clustered and warned of file corruption risks.  For our setup, the advice was to run all services on one node ONLY with the others powered down - hardly a High Availability situation, to say the least.

We were sold GFS as a High Availability solution for clustered NFS/Samba operations and weren't even advised that NFS/Samba should only be operated on one node of a cluster until well AFTER we had the thing running (on RHEL4)

To get to the situation of clustered hardware being switched off in order to have safe NFS fileserving under RHEL5 defeats the whole purpose of using GFS.

Users and management will put up with issues for a _limited_ period, however the current situation is approaching the absolute limits of their patience.

If this issue isn't resolved in a reasonable timescale then we see little alternative but to remove GFS.

Comment 88 David Teigland 2010-03-22 15:50:36 UTC
Are you confusing lock_nolock with localflocks?  Those are very different things.

If you have any other technical questions or confusion I'd be happy to clarify them.

Comment 89 Alan Brown 2010-03-22 16:28:13 UTC
Not particularly. I had explicit warnings not to use localflocks on clustered systems whilst discussing lock_nolock

GFS man page says:

===
localflocks
    This flag tells GFS that it is running as a local (not clustered) filesystem, so it can allow the kernel VFS layer to do all flock and fcntl file locking. When running in cluster mode, these file locks require inter-node locks, and require the support of GFS. When running locally, better performance is achieved by letting VFS handle the whole job.

    This is turned on automatically by the lock_nolock module, but can be overridden by using the ignore_local_fs option.
=== 

On that basis I'd prefer not to play Russian roulette with a clustered production filesystem.

Comment 90 David Teigland 2010-03-22 18:36:07 UTC
I'm not necessarily suggesting you use localflocks.  A person needs to be aware of the applications' requirements with respect to file locks (flocks and posix locks) before knowing whether it is acceptable to use localflocks. I said this in comment 78.

But we've been led off on the localflocks tangent. localflocks is not the most pertinent question; the proposed nfslocks patch is.

If you are exporting nfs from a single gfs node:
- the nfslocks patch will prevent kernel oopses
- posix locks from nfs clients will be local to the single exporting node
- posix locks will "work" among all nfs clients
- posix locks will be clustered for processes using fcntl() on all gfs nodes

If you are exporting nfs from multiple gfs nodes:
- the nfslocks patch will prevent kernel oopses
- posix locks from nfs clients will be local to the server/node they mount from
- posix locks will "work" among nfs clients mounting from the same server/node,
  but not among nfs clients mounting from different servers
- posix locks will be clustered for processes using fcntl() on all gfs nodes
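
As a rough way to check which of the above cases actually conflict in a given setup, a small test program like the following (hypothetical, not one of the attached reproducers; the default path is an example only) can be run against the same file from two nfs clients, or from two gfs nodes. The second instance reports whether it sees the first instance's posix lock.

/* Hypothetical lock-conflict check.  Run it from two nodes/clients
 * against the same file; the second run reports a conflict only if
 * posix locks are actually shared between the two. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/gfs2/locktest";
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,    /* whole file */
    };

    /* non-blocking request, so a conflict shows up as an error */
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        if (errno == EAGAIN || errno == EACCES)
            printf("conflicting lock held elsewhere\n");
        else
            perror("fcntl F_SETLK");
        return 1;
    }

    printf("lock acquired; holding for 60s, try the other node/client now\n");
    sleep(60);
    return 0;
}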

Comment 91 Alan Brown 2010-03-22 21:03:55 UTC
Dave, I know you're not suggesting we use localflocks; however, others within RH are stating it is the only viable solution.

Given the current issues with multi-headed NFS exporting, we'll still be sticking to one NFS server per filesystem, but we _must_ have stable clustered operation with NFS exports active on at least one node.

Comment 92 David Teigland 2010-03-25 21:36:42 UTC
Created attachment 402691 [details]
updated patch

Updated patch: added "nonfslocks" in addition to "nfslocks"; it now matches the upstream version.  I have tested the upstream patch and it works as expected.

Comment 93 David Teigland 2010-03-25 22:04:43 UTC
Created attachment 402696 [details]
patch

fix patch conflicts

Comment 94 David Teigland 2010-03-26 19:03:30 UTC
Created attachment 402920 [details]
patch

fix

Comment 95 David Teigland 2010-03-26 19:23:52 UTC
I have tested and verified the patch in comment 94, in build:
https://brewweb.devel.redhat.com/taskinfo?taskID=2345796

Comment 97 David Teigland 2010-03-29 16:50:03 UTC
Hm, Steve has rejected this patch since he doesn't understand the problem.
Steve, this bug is now completely in your hands.

Comment 99 Alan Brown 2010-05-14 17:37:19 UTC
Steve, what is the current state of this bug?

Comment 100 Steve Whitehouse 2010-05-17 09:20:34 UTC
Alan, I don't think there is anything to fix here. The bz has unfortunately become rather confused compared with the original report. Several different issues have been reported along the way. Let me try to clarify the situation...

We do not support:
 o Mixed samba/nfs exports of GFS/GFS2 filesystems
 o Active/active nfs exports with nfs lockd support (active/active should work without locking, and with udp nfs only)
 o Mixed nfs and local applications on GFS/GFS2 filesystems
 o Mixed samba and local applications on GFS/GFS2 filesystems

Active/passive nfs exports should work with nfs lockd, but you must set the "localflocks" mount option on each GFS/GFS2 mount.

If there are any issues other than those relating to NFS locking, they should be reported in different bugzillas.

We would like to be able to support both samba and nfs mixed and also active/active with locking support in the future. We have, for example, bz #580863 open to track the upstream effort required to implement the features we need in order to do this. It will not be an easy thing to do, unfortunately.

If there has been some confusion in the information supplied by support and/or any other part of Red Hat then please accept my apologies for that. If you are still experiencing problems, then please drop support a line and they will do their best to assist.

As for this bug, the originally reported issue has been resolved, so I'm now intending to close it.

Comment 101 Alan Brown 2010-05-17 13:31:40 UTC
Firstly, the original problem has NOT been resolved. There are still panics.

Secondly, your comment about not supporting mixed Samba/NFS exports of GFS is at odds with the original sale - which was specifically for this purpose...

Comment 102 Steve Whitehouse 2010-05-17 13:54:31 UTC
Alan, I think your setup does not bear much relation to the original report for which this bz was opened. The details of that are:

 o A contrived example set up by our support team to debug another customer's problem
 o Active/active with nfs locking (not supported)
 o Using nfs on each of the two nodes to mount gfs2 from the other node (also not supported)
 o No use of samba at all

I apologise if someone at Red Hat has given you incorrect information. I would be very interested to know who gave you that information.

If you can drop our support team a line, then we'll try and work with you to come up with a solution for your situation.

The reason that mixed samba/nfs is a problem is basically down to locking. Samba has an internal cache of information which uses posix leases to keep itself up to date. GFS2 doesn't support leases when it is clustered, although it does when it is run single node (lock_nolock). NFS supports posix fcntl locks and so does GFS2 (if lockd support is enabled, i.e. when clustered and not using the localflocks option). There is, however, a problem with using that interface active/active: NFS doesn't have cluster recovery support, so the lock state cannot be recovered in the case of node failure.
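
For reference, the lease mechanism Samba relies on is requested with fcntl(F_SETLEASE). A minimal sketch (hypothetical; the path is an example only) looks like the following, and on a clustered GFS2 mount the request is expected to be refused, consistent with the above.

/* Hypothetical example of taking a read lease, the mechanism Samba's
 * cache depends on.  Expected to fail on clustered GFS2, which does
 * not support leases in cluster mode. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/leased-file", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Request a read lease; the holder is signalled (SIGIO by default)
     * when another opener conflicts, so it can flush its cache. */
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
        perror("fcntl F_SETLEASE");
    else
        printf("read lease granted\n");

    close(fd);
    return 0;
}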

The combination of these issues means that we cannot support mixed samba and nfs at the moment, even though we would certainly like to do so.

Comment 103 Steve Whitehouse 2010-05-18 12:41:52 UTC
Closing this bug on the basis that the original report was for an unsupported situation which has since been resolved. If there are other issues then they should be reported under a different/new bugzilla to avoid further confusion.

Upstream development work for active/active support of nfs lockd can be found in bz #580863

Comment 104 Alan Brown 2010-06-09 11:37:22 UTC
Please state exactly what is "unsupported" about NFS exporting from a single GFS cluster node, running by itself, that is also exporting the same filesystems via samba and/or running local operations on the filesystem.

Comment 105 Alan Brown 2010-06-09 12:35:18 UTC
The reason I ask this question is that the crash mode seen in the original report is exactly the same as the one seen on our systems in the configuration I describe.

We were referred to this ticket by Red Hat Support because our crash dumps matched. Opening a new BZ for the _same_ issue simply causes more confusion.


The crash happens more often if there are 2 GFS nodes, even if one is completely quiescent. It's disruptive because we experience 10-15 minutes of cluster downtime for every event _and_ the ~1TB filesystems eventually have to be taken down for a day in order to fsck.

This is not a contrived situation. It's an issue which occurs under normal network loads and which didn't manifest prior to RHEL5.2.

This is a very real problem, happening on supported configurations. It needs to be addressed properly, not shoved under the carpet.

Comment 106 Ric Wheeler 2010-06-09 13:36:29 UTC
Samba and NFS both maintain some state (lock state specifically) not in the kernel.

As things stand today, there is no coordination between Samba (a user space process) and the NFS server when exporting the same partition.

This is something that commercial NAS appliances provide, and something worth investigating/implementing in Linux, so we are interested in hearing from customers who would like this.  Please file a "FEAT" request if you are interested; this BZ is not intended to be a feature request.

Best regards,

Ric

Comment 108 Alfredo Moralejo 2010-07-13 15:58:20 UTC
As per some of the previous comments, would mixing NFS and local applications on GFS/GFS2 filesystems be supported when using NFSv4? If I've understood it right, it should work.

Comment 109 Steve Whitehouse 2010-07-13 16:05:31 UTC
I doubt it will work correctly if recovery takes place. It is not tested, and thus NFS is only supported on its own, not mixed with local applications.

