Description of problem:

Error messages seen in the logs as follows:
----------------
May  5 14:00:19 heim1 kernel: lockd: grant for unknown block
May  5 14:00:19 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:01:52 heim1 kernel: lockd: grant for unknown block
May  5 14:01:52 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:02:36 heim1 kernel: lockd: grant for unknown block
May  5 14:02:36 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:28 heim1 kernel: lockd: grant for unknown block
May  5 14:04:28 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:37 heim1 kernel: lockd: grant for unknown block
May  5 14:04:37 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:04:50 heim1 kernel: lockd: grant for unknown block
May  5 14:04:50 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:06:52 heim1 kernel: lockd: grant for unknown block
May  5 14:06:52 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
May  5 14:08:06 heim1 kernel: lockd: grant for unknown block
May  5 14:08:06 heim1 kernel: dlm: dlm_plock_callback: lock granted after lock request failed; dangling lock!
----------------

Kernel Oops on a cluster node: This is node 1 of a two-node cluster set up to NFS-export home directories.
Mar 17 11:02:50 heim1 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
Mar 17 11:02:50 heim1 kernel:  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel: PGD 221e6d067 PUD 22227d067 PMD 0
Mar 17 11:02:50 heim1 kernel: Oops: 0000 [1] SMP
Mar 17 11:02:50 heim1 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Mar 17 11:02:50 heim1 kernel: CPU 4
Mar 17 11:02:50 heim1 kernel: Modules linked in: ip_vs nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lock_dlm gfs2(U) dlm configfs bonding ipv6 xfrm_nalgo crypto_api dm_emc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ide_cd i5000_edac sg e1000e edac_mc bnx2 cdrom serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 17 11:02:50 heim1 kernel: Pid: 3982, comm: gfs_controld Tainted: G 2.6.18-128.1.1.el5 #1
Mar 17 11:02:50 heim1 kernel: RIP: 0010:[<ffffffff800e4e68>]  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel: RSP: 0018:ffff810221ee3ea0  EFLAGS: 00010246
Mar 17 11:02:50 heim1 kernel: RAX: 0000000000000000 RBX: ffff81012c695000 RCX: 0000000000000000
Mar 17 11:02:50 heim1 kernel: RDX: 0000000000000000 RSI: ffff81012c695070 RDI: ffff81022d47a380
Mar 17 11:02:50 heim1 kernel: RBP: ffff81012c695070 R08: 0000000000000000 R09: 7fffffffffffffff
Mar 17 11:02:50 heim1 kernel: R10: 000000000000000c R11: 000000000003a2b4 R12: ffff81022d47a380
Mar 17 11:02:50 heim1 kernel: R13: ffff81022f14e8e0 R14: ffffffff88677fdf R15: 000000000cae4450
Mar 17 11:02:50 heim1 kernel: FS:  00002b6752d64a10(0000) GS:ffff81022fc1ed40(0000) knlGS:0000000000000000
Mar 17 11:02:50 heim1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 17 11:02:50 heim1 kernel: CR2: 0000000000000010 CR3: 00000002243e9000 CR4: 00000000000006e0
Mar 17 11:02:50 heim1 kernel: Process gfs_controld (pid: 3982, threadinfo ffff810221ee2000, task ffff81022f489040)
Mar 17 11:02:50 heim1 kernel: Stack:  ffffffff885377f6 0000000100000001 0000000200000000 000000010000000c
Mar 17 11:02:50 heim1 kernel:  0007000200000000 000000000003a2b4 0000000000000000 7fffffffffffffff
Mar 17 11:02:50 heim1 kernel:  000000000000000c ffff81022fe658c0 0000000000000040 00007fff57cd43c0
Mar 17 11:02:50 heim1 kernel: Call Trace:
Mar 17 11:02:50 heim1 kernel:  [<ffffffff885377f6>] :dlm:dev_write+0x157/0x207
Mar 17 11:02:50 heim1 kernel:  [<ffffffff8001659e>] vfs_write+0xce/0x174
Mar 17 11:02:50 heim1 kernel:  [<ffffffff80016e6b>] sys_write+0x45/0x6e
Mar 17 11:02:50 heim1 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 17 11:02:50 heim1 kernel: Code: 48 8b 78 10 e9 fd fb ff ff 41 57 49 89 ff 41 56 41 55 41 54
Mar 17 11:02:50 heim1 kernel: RIP  [<ffffffff800e4e68>] posix_lock_file+0x6/0xf
Mar 17 11:02:50 heim1 kernel:  RSP <ffff810221ee3ea0>
Mar 17 11:02:50 heim1 kernel: CR2: 0000000000000010
Mar 17 11:02:50 heim1 kernel: <0>Kernel panic - not syncing: Fatal exception

The filesystem is GFS2 and the panic happens on the node which is exporting the GFS2 filesystem through NFS configured in cluster.

Version-Release number of selected component (if applicable):
This panic has been reproduced on 2.6.18-150.el5.

How reproducible:
100%

Steps to Reproduce:
This is how I reproduced it - may not represent what the customer was doing.
1. Set up a cluster with two nodes and a GFS2 filesystem.
2. Export the GFS2 filesystem via NFS from node A and mount on node B.
3. Export the GFS2 filesystem via NFS from node B and mount on node A.
4. On both NFS clients run this:

   for i in `seq 1 1000`; do ./flock $i & done

   [see attached flock.c for flock program]

One of the nodes will panic within 15 mins.
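The attached flock.c is not reproduced in this comment. As a rough sketch only of what such a reproducer looks like (not the actual attachment; the use of fcntl()-style POSIX locks rather than flock(2) is my assumption, since the errors are in the plock/NLM code paths):

```c
/* lock_stress.c - hypothetical sketch of a lock reproducer, NOT the
 * actual attachment. Takes and releases a whole-file POSIX write lock
 * on the given path once; the reproducer runs many of these in
 * parallel against an NFS mount. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* returns 0 on success, -1 on error */
int take_and_drop_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* whole-file write lock; F_SETLKW blocks until granted, which is
     * what exercises the deferred/async grant path on the NFS server */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("F_SETLKW");
        close(fd);
        return -1;
    }

    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        perror("F_UNLCK");
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Run concurrently from both NFS clients (e.g. from the `seq` loop in the steps above) to pile contended lock/unlock traffic onto lockd.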
Additional info:

crash> bt
PID: 5229  TASK: ffff81022b1cc0c0  CPU: 1  COMMAND: "gfs_controld"
 #0 [ffff81021d989c00] crash_kexec at ffffffff800aaa19
 #1 [ffff81021d989cc0] __die at ffffffff8006520f
 #2 [ffff81021d989d00] do_page_fault at ffffffff80066e1c
 #3 [ffff81021d989df0] error_exit at ffffffff8005dde9
    [exception RIP: posix_lock_file+6]
    RIP: ffffffff800e4e68  RSP: ffff81021d989ea0  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff8101bd756200  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff8101bd756270  RDI: ffff8101bcb4a980
    RBP: ffff8101bd756270   R8: 0000000000000000   R9: 7fffffffffffffff
    R10: 0000000000000000  R11: 000000000003a2b4  R12: ffff8101bcb4a980
    R13: ffff81022e1e64e0  R14: ffffffff88675fdf  R15: 0000000006119280
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81021d989ea0] dev_write at ffffffff885357f6
 #5 [ffff81021d989f10] vfs_write at ffffffff8001659e
 #6 [ffff81021d989f40] sys_write at ffffffff80016e6b
 #7 [ffff81021d989f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000034130c56a0  RSP: 00007fffa5d97f88  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000040  RSI: 00007fffa5d98080  RDI: 000000000000000a
    RBP: 0000000000000002   R8: 000000003a0698eb   R9: 00000000702bc85c
    R10: 0000000049d63f88  R11: 0000000000000246  R12: 0000000006119280
    R13: 00000000061192c0  R14: 0000000006119280  R15: 0000000006119280
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>

crash> dis posix_lock_file
0xffffffff800e4e62 <posix_lock_file>:     mov 0x10(%rdi),%rax
0xffffffff800e4e66 <posix_lock_file+4>:   xor %edx,%edx
0xffffffff800e4e68 <posix_lock_file+6>:   mov 0x10(%rax),%rdi
0xffffffff800e4e6c <posix_lock_file+10>:  jmpq 0xffffffff800e4a6e <__posix_lock_file_conf>

crash> px *(struct file *)0xffff8101bcb4a980
$2 = {
  f_u = {
    fu_list = {
      next = 0x0,
      prev = 0xffffffff8003c0d9
    },
    fu_rcuhead = {
      next = 0x0,
      func = 0xffffffff8003c0d9 <file_free_rcu>
    }
  },
  f_dentry = 0x0,
  f_vfsmnt = 0x0,
  f_op = 0xffffffff88570a40,
  f_count = {
    counter = 0x0
  },
  f_flags = 0x8000,
  f_mode = 0xd,
  f_pos = 0x0,
  f_owner = {
    lock = {
      raw_lock = {
        lock = 0x1000000
      }
    },
    pid = 0x0,
    uid = 0x0,
    euid = 0x0,
    security = 0x0,
    signum = 0x0
  },
  f_uid = 0x96,
  f_gid = 0x96,
  f_ra = {
    start = 0x0,
    size = 0x0,
    flags = 0x0,
    cache_hit = 0x0,
    prev_page = 0xffffffffffffffff,
    ahead_start = 0x0,
    ahead_size = 0x0,
    ra_pages = 0x20,
    mmap_hit = 0x0,
    mmap_miss = 0x0
  },
  f_version = 0x0,
  f_security = 0x0,
  private_data = 0x0,
  f_ep_links = {
    next = 0xffff8101bcb4aa50,
    prev = 0xffff8101bcb4aa50
  },
  f_ep_lock = {
    raw_lock = {
      slock = 0x1
    }
  },
  f_mapping = 0xffff8101be425478
}

We've crashed because the struct file pointer passed to posix_lock_file() appears to have been freed (i.e. f_count is zero and other fields have been reset - in particular f_dentry, which we panicked trying to dereference). I suspect it's a bug in dlm (i.e. it needs to take an additional reference on the struct file when it's saved off in dlm_posix_lock(), to prevent it from being freed before gfs_controld gets to it).

This issue looks similar to:
https://bugzilla.redhat.com/show_bug.cgi?id=470074 and
https://bugzilla.redhat.com/show_bug.cgi?id=466677
but since the problems still exist in 2.6.18-150, they weren't fixed in those BZs.
Created attachment 345702 [details]
Program to reproduce panic
also similar to https://bugzilla.redhat.com/show_bug.cgi?id=471254 but that patch should be in the -150 kernel as well.
I'm testing with 5.4 beta, 2.6.18-151.el5xen. Nodes xen1 and xen2 are exporting:

# cat /etc/exports
/gfs    *(rw,insecure,no_root_squash)

Node xen3 mounts from xen1, node xen4 mounts from xen2. xen3 and xen4 have been running my own test as well as the flock test in comment 1, and they all seem to work fine. Given that these are all vm's on one host, everything is very slow. I've tried both gfs1 and gfs2 as the shared fs between xen1 and xen2.

I'll next try xen2 mounting xen1's export, and xen1 mounting xen2's export, although I wouldn't be too surprised if that arrangement produced an odd problem somewhere (and I wouldn't be too concerned about it.)
Initial results from xen2 mounting xen1's export and xen1 mounting xen2's export. The flock test often stops and doesn't make any progress on either node, I don't know why, I didn't notice this using separate clients and servers. I've seen a couple "lockd: grant for unknown block" messages on each node after running for a few minutes.
I've not been able to reproduce this. Lachlan, could you try this again with separate nodes exporting and importing? I was not able to reproduce in either case, but nodes both exporting and importing the same fs isn't a configuration we want to worry about.
Okay. I have a 4 node cluster with node 1 exporting the GFS2 filesystem via NFS to node 3 and node 2 exporting to node 4 with the test running on the NFS clients on nodes 3 and 4. I'm running 2.6.18-150 again on all nodes. I still see the "grant for unknown block" and "dangling lock" messages but so far no panic. I'll let it run overnight. Separating the NFS client and servers onto different nodes may change the load/timing enough to avoid the problem but the bug will still be lurking.
It ran all night without panicking. This morning I noticed the flock processes had been killed so I tried to unmount the NFS filesystems on nodes 3 and 4 and got EBUSY on both. The flock processes had not terminated yet and still had references to the filesystem. Slowly they terminated but not before node 1 panicked in dlm:dev_write() as above.
OK, thanks, I'll get a cluster set up to try this again.
This bug is manifesting under GFS1 as well as GFS2 (on a production system)
Both GFS and GFS2 use the same code to deal with posix locks, so it's quite likely that any bugs in this area will be shared between the two code bases.
Have all four normal test nodes back. Testing with upstream kernel 2.6.32-rc5 because it's easier to debug, and it should be about the same code in this area. Reproduced the same bug with much less load.

node1 and node2 have gfs mounted
node3 mounts node1:/gfs /gfs
node4 mounts node2:/gfs /gfs

node3 and node4 each run three instances of the looping flock test, all in the foreground (and modified to show output on each iteration), on files 1, 2, 3:

flock-loop 1
flock-loop 2
flock-loop 3

This ran for several minutes; periodically one to three of the flock-loop instances would block for up to a minute at a time before resuming; reason unknown. While running, node1 had a single "lockd: grant for unknown block" message, and node2 had none. Neither reported a "dlm_plock_callback: lock granted" message. Eventually the original oops occurred on node2:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8110deee>] posix_lock_file+0x8/0x13
PGD 6cc05067 PUD 6cc06067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: nfsd nfs_acl auth_rpcgss exportfs gfs2 dlm configfs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp autofs4 lockd sunrpc ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg serio_raw button tg3 libphy i2c_nforce2 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod qla2xxx scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9630, comm: dlm_controld Not tainted 2.6.32-rc5 #2 ProLiant DL145 G2
RIP: 0010:[<ffffffff8110deee>]  [<ffffffff8110deee>] posix_lock_file+0x8/0x13
RSP: 0018:ffff88006cc63e88  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88005e90c270 RCX: ffff88006ce5cbb8
RDX: 0000000000000000 RSI: ffff88005e90c2e0 RDI: ffff88007ee752f8
RBP: ffff88006cc63e88 R08: ffffffffa02de53a R09: ffffffff8132af3c
R10: ffffffff810dd0e1 R11: 0000000000000206 R12: ffff88007eaac128
R13: ffff88007ee752f8 R14: ffffffffa02c2ca2 R15: 000000000133df60
FS:  00007fc9471286e0(0000) GS:ffff880083a00000(0000) knlGS:00000000f777a6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 000000006cc04000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_controld (pid: 9630, threadinfo ffff88006cc62000, task ffff88006ce5c4c0)
Stack:
 ffff88006cc63f08 ffffffffa02de495 ffff88006cc63ec8 ffffffff8115bee1
<0> 0000000100000001 0000000200000000 0000000200026817 6ce2339d00000000
<0> 00000000000308a8 0000000000000000 0000000000000000 0000000000026817
Call Trace:
 [<ffffffffa02de495>] dev_write+0x15e/0x221 [dlm]
 [<ffffffff8115bee1>] ? selinux_file_permission+0x5c/0x10f
 [<ffffffff810dc158>] vfs_write+0xae/0x14a
 [<ffffffff810dc6eb>] sys_write+0x47/0x6f
 [<ffffffff8100b9eb>] system_call_fastpath+0x16/0x1b
Code: f1 fe ff ff 41 bc db ff ff ff e8 93 d7 21 00 eb cd 48 83 c4 38 44 89 e0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 8b 47 18 48 89 e5 <48> 8b 78 50 e8 c4 fb ff ff c9 c3 48 8b 47 20 55 49 89 d0 48 89
RIP  [<ffffffff8110deee>] posix_lock_file+0x8/0x13
 RSP <ffff88006cc63e88>
CR2: 0000000000000050
---[ end trace 8a173e5ae5b588d3 ]---
dlm_controld  ?  0000000000000003     0  9630      1 0x00000080
 ffff88006cc63c08 0000000000000046 0000000000000000 ffff88006ce5c4c0
 ffff88006cc63bd8 ffff88006ce5c4c0 ffff88007f414400 ffff88006ce5c870
 00000001045b7c19 0000000000000046 ffffffff814f4018 ffffffff814f4000
Call Trace:
 [<ffffffff81047730>] do_exit+0x655/0x66e
 [<ffffffff8132c2fb>] oops_end+0xb2/0xba
 [<ffffffff810280b5>] no_context+0x1ec/0x1fb
 [<ffffffff810282ea>] __bad_area_nosemaphore+0x16c/0x18f
 [<ffffffff8102834c>] __bad_area+0x3f/0x48
 [<ffffffff81028373>] bad_area+0xe/0x10
 [<ffffffff8132d750>] do_page_fault+0x1fb/0x2db
 [<ffffffffa02c2ca2>] ? nlmsvc_grant_deferred+0x0/0x15a [lockd]
 [<ffffffff8132b80f>] page_fault+0x1f/0x30
 [<ffffffffa02c2ca2>] ? nlmsvc_grant_deferred+0x0/0x15a [lockd]
 [<ffffffff810dd0e1>] ? fget_light+0x4f/0xe9
 [<ffffffff8132af3c>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa02de53a>] ? dev_write+0x203/0x221 [dlm]
 [<ffffffff8110deee>] ? posix_lock_file+0x8/0x13
 [<ffffffffa02de495>] dev_write+0x15e/0x221 [dlm]
 [<ffffffff8115bee1>] ? selinux_file_permission+0x5c/0x10f
 [<ffffffff810dc158>] vfs_write+0xae/0x14a
 [<ffffffff810dc6eb>] sys_write+0x47/0x6f
 [<ffffffff8100b9eb>] system_call_fastpath+0x16/0x1b

Based on Lachlan's analysis, I'm going to investigate how the xop->file might be getting freed after dlm_posix_lock() and before the callback, and what is, in theory, supposed to prevent that (if anything).
Created attachment 366868 [details]
debugging patch for reference

Collected the following info from printks in this patch, but not completely analyzed yet.

Oct 30 15:52:24 bull-01 kernel: lockd: grant for unknown block
Oct 30 15:52:24 bull-01 kernel: lockd: fl ffff88013f54b398 owner 7740398493674204011 start 7740398493674204011 end 7740398493674204011
Oct 30 15:53:28 bull-01 kernel: lockd: grant for unknown block
Oct 30 15:53:28 bull-01 kernel: lockd: fl ffff88003f9fedf8 owner 18446612137670821488 start 0 end 0
Oct 30 15:53:28 bull-01 kernel: dlm: dlm_plock_callback: 199873 fl ffff88003f9fedf8 lock granted after lock request failed; dangling lock!
Oct 30 15:53:28 bull-01 kernel: dlm: start 0 end 0

Oct 30 15:52:29 bull-02 kernel: dlm: dlm_plock_callback: 199884 fl ffff88013ed3d630 file ffff88012d369068 dentry (null)
Oct 30 15:52:29 bull-02 kernel: lockd: grant for unknown block
Oct 30 15:52:29 bull-02 kernel: dlm: dlm_plock_callback: 199884 fl ffff88013ed3d630 lock granted after lock request failed; dangling lock!
Oct 30 15:52:29 bull-02 kernel: dlm: start 7740398493674204011 end 7740398493674204011
This problem first appears in bug 466677 which was probably never fully understood or fixed. vfs_cancel_lock is being called constantly during these tests which seems strange. I'd like to understand the whole end-to-end picture of what the lock/unlock cycle is supposed to look like, how it's supposed to work, and what role cancel has in it. gfs translates a CANCELLK into an UNLOCK with a "Hack" comment.
From comment 20 it sounds like the problem appears with a single client talking to a single server.
I've spent the day studying lockd/svclock.c along with the dprintk output for a simple lock/unlock. From the dlm side everything seems to be working correctly; it's only when the dlm calls back into lockd that there seem to be problems with the nlm_block structures. My hunch is that it's cancel that somehow leads to the problems. Overall, the comments in the svclock.c code are not encouraging; it sounds like there are plenty of gaps for things to go wrong, even without the new async/DEFERRED behavior introduced by gfs.
When lockd is operating correctly, one call to vfs_lock_file() should have one corresponding nlmsvc_grant_deferred() callback. With a debugging patch, I'm setting a new B_IN_FS flag on a block when vfs_lock_file() is called, and then clearing it in nlmsvc_grant_deferred(). Before setting the flag I check that it's not already set, and before clearing the flag I check that it is set. The first sign of problems is when lockd calls vfs_lock_file() on a block that already has B_IN_FS set. After this, a lot of other similar errors quickly pile up, indicating that the lockd blocks are out of sync with the dlm locks. I don't have any ideas why lockd may be calling vfs_lock_file() on a block that's currently busy in the fs, but I believe the root of the problems lies in that direction.
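As an illustration only, the invariant and the checks described above can be modeled in plain userspace C (the names are mine, not lockd's types):

```c
#include <stdbool.h>

/* Toy model of the B_IN_FS invariant: one vfs_lock_file() call per
 * block must be matched by exactly one grant callback before the
 * next call is allowed. */
struct block {
    bool in_fs;   /* stands in for the B_IN_FS bit on b_flags */
};

/* returns false if the block is already busy in the fs - the error
 * case the debugging patch detects */
static bool model_vfs_lock_file(struct block *b)
{
    if (b->in_fs)
        return false;      /* lockd re-entered with an op outstanding */
    b->in_fs = true;
    return true;
}

/* returns false if the callback arrives with no matching op */
static bool model_grant_deferred(struct block *b)
{
    if (!b->in_fs)
        return false;
    b->in_fs = false;
    return true;
}
```

The two failing return paths correspond to the two flag-check errors mentioned above (vfs_lock_file() on a busy block, and a grant for a block that isn't in the fs).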
My suspicion is that the file object is being freed because we don't take a reference on it before saving it away, and later on we retrieve it after it has been freed. I just ran this patch through a quick test and it hasn't crashed.

--- linux-2.6.18.x86_64/fs/dlm/plock.c.orig	2009-11-04 14:54:45.000000000 +1100
+++ linux-2.6.18.x86_64/fs/dlm/plock.c	2009-11-04 14:58:00.000000000 +1100
@@ -11,6 +11,7 @@
 #include <linux/poll.h>
 #include <linux/dlm.h>
 #include <linux/dlm_plock.h>
+#include <linux/file.h>
 
 #include "dlm_internal.h"
 #include "lockspace.h"
@@ -106,6 +107,7 @@ int dlm_posix_lock(dlm_lockspace_t *lock
 		locks_init_lock(&xop->flc);
 		locks_copy_lock(&xop->flc, fl);
 		xop->fl = fl;
+		get_file(file);
 		xop->file = file;
 	} else {
 		op->info.owner = (__u64)(long) fl->fl_owner;
@@ -187,6 +189,7 @@ static int dlm_plock_callback(struct plo
 		log_print("dlm_plock_callback: vfs lock error %llx file %p fl %p",
 			  (unsigned long long)op->info.number, file, fl);
 	}
+	fput(file);
 	rv = notify(fl, NULL, 0);
 	if (rv) {

The "grant for unknown block" and "lock granted after lock request failed ..." messages are still appearing though, so there may be more to fix.
The patch in comment #25 doesn't look right. All locks are removed when the file is closed, so it should be impossible to have a closed file on which there are remaining locks.
Comment #25 is a logical fix for the NULL-file oops, but I believe that it fixes a symptom and doesn't address the root cause. (Fixing symptoms can be sensible to do, too, but I think our main goal right now is finding the root cause.)
One thing that I couldn't see was any testing for the FL_CLOSE flag and I wonder if that needs to be handled specifically or whether it doesn't matter.
It does not appear to me that the problems are in the direction of the vfs, but rather in the direction of lockd. Remember, there are a lot more than files and file_locks involved here, and we don't have any known problems with local locking tests, only nlm locking tests. I'm fairly confident in the struct lifetimes/references among the first three items in this list, but not among the last three:

- struct file
- struct file_lock
- struct plock_op
- struct nlm_file
- struct nlm_block

At the moment I don't have the impression this is even a struct lifetime or reference counting issue at the core. As I said above, the first sign that things are off is when lockd seems to call vfs_lock_file() on a lock that's currently in the middle of vfs_lock_file().
Created attachment 367874 [details]
debugging patch

This patch adds the B_IN_FS flag to a block that's busy in the fs, as mentioned in the earlier comment. It also takes it one step further in an attempt to fix (or at least avoid) the problems by *not* going ahead with another vfs_lock_file() if the block in question is busy with an earlier vfs_lock_file(). In my testing it has so far been successful in avoiding the problems (e.g. no "dangling locks" or "unknown blocks"), but it would be good to try the other tests that have shown problems.

My big remaining question is whether or not lockd is behaving correctly when it calls vfs_lock_file() on a block that's currently busy with a previous vfs_lock_file(). If not, then I'll pass this off to the lockd experts to debug why it's happening. If it's legitimate or difficult to avoid, then we'll need to detect when it happens (with B_IN_FS or something equivalent) and abort it.
We're getting multiple outages in production systems due to this bug (one machine goes down, a clustermate picks up the load and then crashes, etc) and users aren't happy. Test RPMS would be handy...
Not sure if related. Posted per Dave's request: I'm seeing panics on machines when running local rsync between GFS1 and GFS2 filesystems. Neither FS is NFS exported, both were mounted as GFS local only (lock_nolock). The filesystem content is 350Gb of Imap Mdir folders - approximately 3 million mostly tiny files. Some directories may contain 10k+ files but most hold far less than this. (usually a few hundred at most)
Probably not related. I had another report of something that sounds like this recently. What the original reporter didn't say was that it was GFS1 -> GFS2, so when I tried to reproduce it I did GFS2 -> GFS2 and didn't manage to reproduce it. I've not had a chance to try with GFS1 on the sending end so far. If you have any log messages from that issue, I'd like to know, but it's probably not appropriate for this bz.
Re comments 31 and 32, are the outages and panics due to the oops in posix_lock_file? Or are they other bugs not yet recorded in bugzilla? The posix_lock_file bug we're working on here should not appear if you're using lock_nolock, only lock_dlm; and it should also not appear unless gfs is exported via nfs and clients are doing locking.
Created attachment 367889 [details]
flock-loop test program

In my testing I run

flock-loop file1
flock-loop file2
flock-loop file3

on two nfs clients, each client mounting from a separate server.
I have now seen a couple errors even with the patch from comment 20:

lockd: grant for unknown block, result 0
grant fl t 2 p 14197 o ffff88013e1bbaa8 0-0
lockd: nlm_block list
b ffff88013eacad58 flags 0 file ffff88013eb06508 fl t 1 p 14199 o ffff88013e1bbaa8 0-0
b ffff88013eaca930 flags 8 file ffff88013eae2760 fl t 1 p 14198 o ffff88013e1bbaa8 0-0
dlm: dlm_plock_callback: 30ccc fl ffff88007e262598 lock granted after lock request failed; dangling lock!

This is a different kind of error from most of the "unknown block" / "dangling lock" cases I was seeing without the patch. In this case the fl described in the callback has type 2 (F_UNLCK), which should never be the case in a callback. And then separately:

nlmsvc_grant_deferred block ffff88007ee736d0 not B_IN_FS
Another thing to try changing is nlmsvc_unlock(), which does:

  nlmsvc_cancel_blocked()
    vfs_cancel_lock()
  vfs_lock_file(F_UNLCK)

Since gfs/dlm does not have the ability to cancel locks, it converts the vfs_cancel_lock() call into an ordinary unlock. So lockd ends up calling unlock twice back to back: first from vfs_cancel_lock() and second from vfs_lock_file(F_UNLCK). I'll probably try removing the call to nlmsvc_cancel_blocked() altogether and see what happens.
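A toy userspace model of that back-to-back unlock (all names are mine, for illustration; this is not the kernel code):

```c
/* Toy model of the double unlock described above: gfs turns
 * vfs_cancel_lock() into an unlock, so nlmsvc_unlock() effectively
 * issues two unlock ops for a single held lock. */
struct plock_state {
    int held;            /* 1 if the posix lock is currently held */
    int spurious_ops;    /* unlock ops that found nothing to unlock */
};

static void do_unlock(struct plock_state *s)
{
    if (s->held)
        s->held = 0;
    else
        s->spurious_ops++;   /* the second unlock has nothing to do */
}

/* what nlmsvc_unlock() effectively does on gfs/dlm */
static void model_nlmsvc_unlock(struct plock_state *s)
{
    do_unlock(s);   /* vfs_cancel_lock(), converted to unlock by gfs */
    do_unlock(s);   /* the explicit vfs_lock_file(F_UNLCK) */
}
```

Each held lock thus generates one extra unlock op flowing through the plock code, which is consistent with vfs_cancel_lock being "called constantly" during these tests.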
Removing the calls to nlmsvc_cancel_blocked() did seem to improve things, but I don't have any specific examples. I was still seeing cases where a nlmsvc_grant_deferred() callback would occur with bogus fl data that would fail to match the fl in the block it should have matched. This is similar to the bug we recently fixed to pass the pointer of the original fl into the callback instead of the flc (copy of the original fl), because the flc ranges are modified by the vfs. Since it appears the original fl is being clobbered, causing it to not match the lock it's supposed to, I changed the dlm to make a second copy (flc2) of the original fl to pass back to nlmsvc_grant_deferred(). This appears to have fixed the problem of nlmsvc_grant_deferred() failing to find any matching blocks.

I'm still seeing an occasional occurrence of nlmsvc_lock() being called on a block that is currently busy in the dlm from a previous nlmsvc_lock() call. I'm still dealing with that by aborting and returning in those cases.
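The flc2 change can be illustrated with a toy userspace model (illustrative names, not the kernel's struct file_lock): the grant callback keeps matching its block only if it is handed a range that the vfs won't rewrite underneath it.

```c
/* Toy model (not kernel code): the vfs modifies the range in the
 * file_lock it is handed, so the grant callback must receive a
 * private copy ("flc2") or the stored block's range no longer
 * matches. */
struct model_fl { long start, end; };

static int ranges_match(const struct model_fl *a, const struct model_fl *b)
{
    return a->start == b->start && a->end == b->end;
}

/* stands in for the vfs rewriting the working lock's range */
static void model_vfs_clobber(struct model_fl *fl)
{
    fl->start = 0;
    fl->end = -1;
}

/* without the copy: the callback compares the clobbered working fl,
 * and block matching fails */
static int grant_without_copy(struct model_fl *working,
                              const struct model_fl *stored)
{
    model_vfs_clobber(working);
    return ranges_match(working, stored);
}

/* with the copy: a second copy taken before the vfs touches the
 * working fl still matches the stored range */
static int grant_with_copy(struct model_fl *working,
                           const struct model_fl *stored)
{
    struct model_fl flc2 = *working;   /* the "flc2" second copy */
    model_vfs_clobber(working);
    return ranges_match(&flc2, stored);
}
```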
Summary of the functional changes I've made:

1. In dlm_plock_callback(), create a second copy of the original file_lock, and pass this copy into the fl_grant callback instead of the pointer to the original file_lock.

2. In lockd, set B_IN_FS in block->b_flags before calling vfs_lock_file(), and clear it in nlmsvc_grant_deferred() (if DEFERRED was returned; otherwise clear it right after the vfs call).

3. In nlmsvc_lock(), check if B_IN_FS is set, and return without calling into the fs if it is.

4. In nlmsvc_unlock(), remove the call to nlmsvc_cancel_blocked().

5. In nlmsvc_cancel_blocked(), return immediately without doing anything.
Response to #34: #31 crashes are all posix lock related on nfs exported filesystems #32 crashes are glocks on non-exported filesystems
Addition to comment 40:

6. In dlm_plock_callback(), check whether the saved struct file is still valid (non-null file->f_path.dentry), and if it isn't, don't call posix_lock_file() (which oopses if passed a bad struct file).
Since most of the changes I've been trying are work-arounds anyway (I've mostly given up for now on finding root problems), I tried get_file/fput, but killing flock-loop on the clients triggered the fs/locks.c locks_remove_flock() BUG from the fput().
I'm relatively confident in the broad reasons behind the problems we're seeing. I have the impression that the problems are structural/design ones, and not traceable to a specific root bug that can be fixed (although I'd sure like to be wrong on that.)

lockd sends off async plock ops to the fs. lockd is not careful to collect the async reply from the fs before doing something else that may interfere with the op it has sent. Instead, lockd can do a number of things after it sends off the op and before it has processed a reply for it:

- it can send off another lock op on the same file for the same holder
- it can send off an unlock op on the same file for the same holder
- it can close and free the file

The first two can confuse the dlm plock code, although I've added some additional checking to detect and ignore them (partially, anyway). The first two can also result in callback errors where lockd has forgotten about the op it fired off and can no longer match up the reply when it arrives, i.e.

lockd: grant for unknown block

The last one is especially troubling, because the dlm needs the struct file to be "valid" at the time the op completes so that it can do the vfs "bookkeeping". If lockd has pulled the file out from under the dlm, it will result in oopses of various kinds (even if the dlm adds its own get_file reference) in the vfs locking code (fs/locks.c).

The dlm assumes that the plock caller is "well behaved", where that's defined as the behavior it would see from a local process doing a plock operation. lockd is not well behaved in this sense; it behaves differently in the three ways (at least) listed above.

The lockd changes that were made to accommodate async locks were minimal. They assumed that lockd could work in largely the same way for sync or async implementations, and seemed to ignore the issue of things happening in the async window which could interfere with an incomplete call.
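The third item (close/free racing the async op) reduces to plain reference counting, modeled here in userspace C. All names are mine; this only shows the lifetime mechanics - per the comment above, an extra reference alone does not make lockd well behaved.

```c
/* Toy model of the use-after-free window: lockd drops its reference
 * to the file while an async plock op still points at it. */
struct model_file {
    int count;    /* stands in for f_count */
    int freed;    /* 1 once the struct goes back to the allocator */
};

static void model_get_file(struct model_file *f) { f->count++; }

static void model_fput(struct model_file *f)
{
    if (--f->count == 0)
        f->freed = 1;
}

/* Returns what the async callback observes: 1 if the struct file was
 * already freed when the reply arrived. dlm_takes_ref models the
 * extra get_file() reference tried earlier in this bz. */
static int callback_sees_freed_file(int dlm_takes_ref)
{
    struct model_file f = { .count = 1 };   /* lockd's reference */
    if (dlm_takes_ref)
        model_get_file(&f);                 /* dlm pins the file */
    model_fput(&f);                         /* lockd closes the file
                                             * before the reply */
    int freed = f.freed;                    /* callback runs here */
    if (dlm_takes_ref)
        model_fput(&f);                     /* dlm drops its pin */
    return freed;
}
```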
It wouldn't surprise me if you're right about lockd. We already found that nfs utils aren't wonderfully coded and had to wrap all /usr/share/cluster/nfsclient.sh exportfs calls with flock statements in order to have multiple NFS services start/stop without tripping over each other.
I'm guessing that switching to NFSv4 won't help matters?
NFSv4 won't change the dlm/lockd interactions, but it may have some effect on the server/client interactions, which may have some indirect effects on dlm/lockd parts, I don't know.
Created attachment 369346 [details]
latest testing/debugging patch

My recent attempts have been to approach this as much as possible from the dlm side and avoid lockd changes. This current patch does seem to be holding up better than average, although there are still issues.
FWIW: All the nfs exports here are sync, not async. To be honest at this point I think that lockd/nfs work would be more productive overall, but I'm a big fan of belt+braces+safetypin approaches. The Samba project's CTDB project (http://ctdb.samba.org/) has some NFS work included which might be helpful.
I don't seem to have any of the locking problems at all when using nfs4! nfs4 is not using the dubious async lock completions like lockd does, but ordinary synchronous calls like local processes do. I'm not sure why we've not realized this before, and why we've spent so much time trying to make lockd work on gfs/dlm (many other bz's before this one) rather than simply limiting nfs+gfs file locking to nfs4 configurations. I'm going to seriously consider removing the async plock code from the dlm altogether. Can you verify that switching to nfs4 solves all your file locking problems? If so we can close this bz.
Unfortunately we can't completely remove nfs3 (or 2) from the servers - there are older OSes involved which don't have NFSv4. I'm in the process of migrating the RH clients to NFS4 - which should help a lot, but it's already exposed that cluster.conf doesn't support "mount --bind"
After a few hours of testing: Superficially NFSv4 seems to work without causing GFS issues, but it's not been tested in anger yet. Additionally there are other problems:

Client delegation doesn't work (syslogging lots of errors as a result).

IMPORTANT: file locking doesn't work at all! We have /home on NFS, and being unable to lock .Xauthority + KDE startup files means that GUI logins don't work.

Serverside is exported using:
rw,insecure,no_subtree_check,sync,nohide

Clientside has:
rw,nosuid,nodev,_netfs,fsc,acregmin=10,acregmax=120,acdirmax=120,timeo=600,retry=1,lock,hard,intr

I realise this is a bit beyond the scope of the ticket, but getting this working would allow us to drop v3 on 100+ clients and fully test if it gets round the crash.
Sorted the file locking issue - broken kernel requiring 2.6.18-164.6.1.el5 or later. Under heavy load NFSv4 is basically unusable. We had to revert to v3 so that people could work - so we're back in the crossfire of this posix locking issue :(
Perhaps we should open a new bz to look at the nfs4 performance problems? After two weeks working on nfs3 posix locks without a fix, I think we should spend a little time seeing if we can address the nfs4 performance.
Is anyone looking into the nfs4 performance issues?
Not as such. Most of the performance issues turned out to be self-inflicted (dialled down the number of nfsd threads, unaware this also affects nfsd4; lack of documentation is a cow).

Even with 2.6.18-164.6.1.el5 there are still serious nfs4 issues apparent which make it effectively unusable in a cluster environment (not least of which is that with v4recovery on a cluster disk as recommended, clients are only able to mount shares from one server. If different subsets are on different servers the other shares are unreachable).

On the NFS3/GFS front, I've found it only takes ONE client doing sustained heavy nfs writes(*) on an otherwise idle disk to be able to knock a server over. Given that not all clients are NFS4 capable this is still a serious shortcoming which needs addressing.

(*) Client was running a 100Gb+ scp from an OSX box on the LAN to an nfs mounted disk.
While trawling bugzilla I ran across BZ 531493 and thought it was worth testing on nfs3+GFS - the server got very sluggish after about 16G had been xferred so the test got aborted. Is this related or am I chasing mirages?
Alan, I doubt BZ 531493 is related to this bug. That bug results in the system hanging, whereas this bug results in a crash preceded by lockd/dlm error messages.
In our case the recent spate of crashes have been preceded by the servers running sluggishly thanks to a user trying to put 20+Gb files on the gfs fs via nfs. We have lockd/dlm errors logged every couple of minutes under normal circumstances, but only occasional crashes unless large file xfers are underway as well.
Just a few notes about using NFS+GFS in general.

Active/Active usage of NFS on GFS1/2 is not presently supported. Active/Passive (i.e. failover) of NFS on GFS1/2 is supported and should work fine. Here are the relevant docs on this:

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_NFS_Over_GFS/NFS_GFS_Overview.html
"Note that this configuration is not a "high capacity" configuration in the sense that more than one server is providing NFS service. In this configuration, the floating IP moves about as needed, but only one server is active at a time."

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_NFS_Over_GFS/NFS_GFS_Caveats.html
"Do not use NFS locks in a failover situation such as the one described in this configuration, as this may cause a duplicate lock situation. The GFS file system correctly distributes the locks between nodes. The Linux NFS server as used in Red Hat Enterprise Linux 5 is not cluster aware. NFS state information is held on a per machine basis and there is a slight but definite chance of file corruption should two NFS servers try to write to the same file at the same time, even in a GFS environment. Because of this the current recommendation is to only export any given file system on one NFS machine only."

So presently the only supported configuration is a single NFS server at a time using a GFS1/2 filesystem. Though it should be noted that if you completely disable NFS locking, multiple NFS servers should work fine. But this is not officially a supported configuration.
(I don't think NFS and samba issues are relevant for this bugzilla except that they serve to drive loads up high enough to make the problem manifest.) We're aware of the issues with active/active nfs sharing on any given filesystem and currently have serving of any given filesystem directed to one server only. Nonetheless the bug manifested even with _all_ nfs activity coming off one physical machine in the cluster (3 machines) - the problem with this configuration is the load spikes to more nfs processes than one box can provide (there's a hard limit around 368 processes) FWIW the same corruption risks apply to the EL5 version of samba, which is why the Samba project developed CTDB and have put a lot of effort into clustered NFS too. Bringing this work into RH clustering would solve a number of issues but probably not this one.
(In reply to comment #64) > (I don't think NFS and samba issues are relevant for this bugzilla except that > they serve to drive loads up high enough to make the problem manifest.) > > We're aware of the issues with active/active nfs sharing on any given > filesystem and currently have serving of any given filesystem directed to one > server only. Ok, reading through the bz comments I only saw mention of the active/active configuration. But if this problem exists on a single NFS server configuration, then I agree with you.
It has long been abundantly clear that this bug is due to incompatibility between lockd and gfs/dlm. See comment 44 for the explanation of that. That has nothing to do with active/active vs active/passive. However, if you are doing active/passive, what's the point of sending the locks into gfs/dlm to "clusterize" them at all? There is no point. Don't do that; let the locks be managed only by nfs/lockd/vfs on the active node, just like a local fs. Recovery (shifting from the failed active node to the passive one) should then also be handled like a local fs. I believe that if you mount gfs with the "localflocks" option it should turn off the clusterization of all plock/flock calls and make them equivalent to plock/flock calls on a local fs.
comment #67 assumes that the GFS filesystem in question is ONLY being accessed via NFS (in this case why use GFS at all?) The first thing that springs to mind requiring clusterisation of locks is other cluster nodes accessing the filesystem via GFS while running other tasks or another cluster node acting as a samba server.
*NUDGE* Is anything being done on this problem? More specifically, has any thought been given to sorting out the NFS suite? Even if NFSv4 can be made to behave properly in a clustered environment, there will be v3-only clients for a long time to come.
Created attachment 396587 [details] example patch This untested patch allows us to control how plocks from nfs are handled in gfs, without resorting to localflocks. If this new "nfslocks" mount option is used, then gfs will pass nfs plock requests on to the dlm to be clusterized (like gfs2 has done since 5.3). Without the nfslocks mount option, nfs plocks are handled locally and not clusterized (like gfs always did prior to 5.3). I expect gfs2 developers may want to tweak this patch according to their taste. We should also test to verify it works as expected, of course.
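To help picture the intended use: with the patch applied the behaviour would presumably be selected per mount, something like the following (the device and mount point names are made up; the option name comes from the patch description above):

```
# nfs plocks passed to the dlm and clusterized (the 5.3 behaviour):
mount -t gfs -o nfslocks /dev/cluster_vg/home /export/home

# default, without the option: nfs plocks handled locally (pre-5.3 behaviour):
mount -t gfs /dev/cluster_vg/home /export/home
```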
Steve, could you take a look at the patch in comment 72?
The patch looks ok, but I'm not sure I follow the problem that we are trying to solve with it. Why do we need to make nfs do something different to local fcntl locks? I think it's likely to be rather confusing.
It's GFS, not ext4. eg: Other cluster nodes may be accessing the filesystem. (and in our case, they ARE.)
How long before this makes its way into a test kernel?
The issues related to the combination of posix locks, nfs, and gfs have been very confusing to everyone. This includes developers like myself who haven't understood the extent of the technical issues until recently. There may still be some problems in the mix that we don't know or understand because the issues span several subsystems. Given all the confusion, different people and documents have been saying different things about what works, what doesn't work, and why. Here's an attempt to clarify the current situation.

It's easiest to begin by describing things without nfs in the picture. When normal programs/processes acquire posix locks (plocks) on gfs files using fcntl(2), the plocks are "clustered", i.e. plocks between two processes on different nodes work the same way as two processes on the same node; locks from different nodes are aware of each other.

In this context it's helpful to understand what the "localflocks" gfs mount option does. When localflocks is used, any plocks (fcntl(2)) or flocks (flock(2)) acquired on that fs are not clustered among nodes; they remain local to the node and behave like plocks/flocks on, say, ext3. Whether localflocks is acceptable depends, of course, on the specific applications using gfs.

What changes when nfs is added to the picture?

1. Most fundamentally, the locks are no longer finally held by processes on the cluster nodes themselves, but by processes on nfs client nodes. This means that when a cluster node fails, the plocks of all processes on that node cannot simply be discarded by other cluster nodes during recovery. Instead, the locks need to be recovered from the nfs client nodes. This recovery requires a significant amount of new design and development across multiple subsystems owned by multiple groups, including defining public interfaces between different components, which is especially difficult and time consuming. The only current development I am aware of is related indirectly, in the area of pnfs.

2. Processes are no longer acquiring plocks from userspace via fcntl(2); instead the nfs kernel server (lockd for nfs3) is calling into gfs to acquire the locks. lockd behaves differently from userland processes and does not follow the same conventions. Perhaps the worst example of this is that it will give up on locks after a certain amount of time and try to cancel them. This creates a race condition that will require serious changes to fix. Attempts at fixing this race have only reduced the occurrence, and suggest that a complete fix may well extend beyond the boundaries of the isolated cluster-fs-specific code. Changing general lockd code for the sake of cluster file systems would be especially challenging.

The only way to address issue 1 currently is to prevent nfs plocks from being passed into gfs, since the capability to recover them simply does not exist in any form. One way to prevent passing nfs plocks into gfs is to use the localflocks option to make all plocks/flocks local. A second way is a patch like https://bugzilla.redhat.com/show_bug.cgi?id=502977#c72 which is more discriminating than localflocks. The current solution to issue 1 obviates any patches to address issue 2.

Why all the recent confusion? These issues have recently become prominent because of an unfortunate kernel patch upstream and in RHEL 5.3 that changed the default behavior of nfs plocks on gfs. Prior to RHEL 5.3, nfs plocks had always been local to the node, regardless of the underlying filesystem type (gfs, ext3, etc). There was no code or mechanism to pass plocks from nfs into gfs. This changed in RHEL 5.3 with the introduction of interfaces (from the GPFS group at IBM) to allow this passing of plock operations between nfs and an underlying cluster fs like gfs (bz 196318). Unfortunately, these interfaces were put to use by default on gfs, under the mistaken assumption that nfs plocks could now be clustered just like plocks used by local processes via fcntl(2). This decision failed to account for the fact that there is much more to be done in the area of recovery coordination before nfs plocks can truly be clustered for gfs.

The added confusion of active/passive nfs on gfs: The context thus far has been about the most "natural" way to export nfs from gfs: all gfs nodes exporting the same file system at the same time. However, some people are interested in an active/passive configuration where only one of the gfs nodes does the nfs export at a time. If the exporting node fails, rgmanager is used to export the same fs from a different node and move a virtual ip address. In this configuration, we do not want nfs plocks to be passed into gfs (there is no reason to do so); we want them to be handled and recovered in the same way as on a local fs like ext3. If this is done (e.g. pre-5.3 behavior, localflocks, or a patch that disables nfs locks being handled on gfs), then the underlying fs is not a factor and nfs plocks should work even if gfs is the underlying fs.
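To make the plock/flock distinction above concrete from userspace, here are the two call paths that localflocks turns off, in a small Python sketch (the `take_locks` helper and scratch-file usage are made up for illustration; fcntl.lockf() drives fcntl(2) F_SETLK, fcntl.flock() drives flock(2)):

```python
import fcntl
import os

def take_locks(path):
    """Take both kinds of lock on `path` and report which were granted.
    On a clustered gfs mount both calls would be passed to the dlm; with
    localflocks both stay in the local VFS, as on ext3."""
    fd = os.open(path, os.O_RDWR)
    granted = []
    try:
        # posix plock via fcntl(2) F_SETLK (non-blocking)
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        granted.append("plock")
    except OSError:
        pass
    try:
        # BSD lock via flock(2) (non-blocking)
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        granted.append("flock")
    except OSError:
        pass
    os.close(fd)  # closing the fd drops both locks
    return granted
```

Note that POSIX locks and flock(2) locks are independent of each other (a single process can hold both on the same file), which is why gfs has to clusterize, or localize, each kind separately.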
Dave says in #78:

> The context thus far has been about the most "natural" way to export nfs
> from gfs: all gfs nodes exporting the same file system at the same time.

This is our preferred method....

> However, some people are interested in an active/passive configuration
> where only one of the gfs nodes does the nfs export at a time.

We are running this setup because RH have explicitly warned that NFS isn't cluster safe and any given filesystem must only be NFS exported on one node at a time, or there is risk of file corruption due to nfs write locks not passing between nodes. (In other words: active:passive is a workaround, not the preferred configuration. I believe this is the general case across most entities running A:P configs.)

(RH also warn customers that in an a:p setup, all nfs locks are lost in the event of the nfs service switching between nodes. This isn't generally a problem.)

HOWEVER: In a multinode active:active configuration containing 3 or more nodes it's a fair bet that any given filesystem won't be NFS exported on all nodes at all times, even if that fs may be exported on multiple nodes. In such a configuration, the odds are reasonably high that customers will want failover of a failed node's NFS service to another machine in the cluster. In this case the same requirements as single-export active:passive take effect.

There used to be (still is?) a userspace nfsd - development stalled a long time ago in favour of kernelspace nfsd because of speed issues. Perhaps it's worth revisiting the userspace daemon for clustered purposes.
I'm not sure that moving to a userspace nfs lockd would improve matters here. There is still the issue of how the locks are to be failed over between nodes. Looking at the gfs2 code we have this:

	if (cmd == F_CANCELLK) {
		/* Hack: */
		cmd = F_SETLK;
		fl->fl_type = F_UNLCK;
	}

which looks to me like it is not performing lock cancellation at all, but instead queuing an unlock, unless the userland code has some other way to distinguish between these requests? I presume from the comment that this might not be the case. I'm still looking into the NFS code to try and figure out what exactly is required of the fs in that case.
Dave, could we get some clarification please? Without your patch, is the bug currently present in both GFS and GFS2 or is it only in GFS now? If fixed in GFS2, do you know which RH test/release kernel?
The patch in comment 72 (adding the nfslocks option) has not been included in any build or release of gfs1 or gfs2. The gfs developers will need to do that if they approve of it. Without that patch or something like it, the only way I know of to keep nfs plocks out of gfs is to mount gfs with localflocks.
The question is, why is localflocks not enough? I'll put the patch in if we can justify it in some way, but I don't understand the reasons for not just using localflocks.
> The question is, why is localflocks not enough? localflocks == no clustered filesystem. No clustered filesystem == no point in running GFS.
localflocks is independent of the filesystem's internal locking. The only reason this other option would be needed is if there is a requirement for flock() and fcntl() locks to have different configs wrt cluster/single node. If the application doesn't use flock() but only fcntl() locks, then there is no difference between the proposed patch and the localflocks option.
localflocks *is* enough, the point is that it may be too much for some people. localflocks means the fs won't do any clustering of flocks or plocks, even for local processes. nfslocks only stops clustering of nfs locks, but any plocks or flocks by other processes are still clustered. The really critical issue, though, is that nfslocks is *off by default*, which returns us to the original pre-5.3 behavior of nfs locks being local. We really need nfslocks to be off by default or people will continue to run into these problems (inconsistent plocks, oopses).
Red Hat Support advised us that filesystems mounted with localflocks should NOT be clustered, and warned of file corruption risks. For our setup, the advice was to run all services on one node ONLY with the others powered down - hardly a High Availability situation, to say the least.

We were sold GFS as a High Availability solution for clustered NFS/Samba operations and weren't even advised that NFS/Samba should only be operated on one node of a cluster until well AFTER we had the thing running (on RHEL4). To get to the situation of clustered hardware being switched off in order to have safe NFS fileserving under RHEL5 defeats the whole purpose of using GFS.

Users and management will put up with issues for a _limited_ period; however the current situation is approaching the absolute limits of their patience. If this issue isn't resolved in a reasonable timescale then we see little alternative but to remove GFS.
Are you confusing lock_nolock with localflocks? Those are very different things. If you have any other technical questions or confusion I'd be happy to clarify them.
Not particularly. I had explicit warnings not to use localflocks on clustered systems whilst discussing lock_nolock. The GFS man page says:

===
localflocks
	This flag tells GFS that it is running as a local (not clustered) filesystem, so it can allow the kernel VFS layer to do all flock and fcntl file locking. When running in cluster mode, these file locks require inter-node locks, and require the support of GFS. When running locally, better performance is achieved by letting VFS handle the whole job. This is turned on automatically by the lock_nolock module, but can be overridden by using the ignore_local_fs option.
===

On that basis I'd prefer not to play russian roulette with a clustered production filesystem.
I'm not necessarily suggesting you use localflocks. A person needs to be aware of the applications' requirements with respect to file locks (flocks and posix locks) before knowing whether it is acceptable to use localflocks. I said this in comment 78. But we've been led off on the localflocks tangent; localflocks is not the most pertinent question, the proposed nfslocks patch is.

If you are exporting nfs from a single gfs node:
- the nfslocks patch will prevent kernel oopses
- posix locks from nfs clients will be local to the single exporting node
- posix locks will "work" among all nfs clients
- posix locks will be clustered for processes using fcntl() on all gfs nodes

If you are exporting nfs from multiple gfs nodes:
- the nfslocks patch will prevent kernel oopses
- posix locks from nfs clients will be local to the server/node they mount from
- posix locks will "work" among nfs clients mounting from the same server/node, but not among nfs clients mounting from different servers
- posix locks will be clustered for processes using fcntl() on all gfs nodes
Dave, I know you're not suggesting we use localflocks, however others within RH are stating it is the only viable solution. Given the current issues with multiheaded NFS exporting we'll still be sticking to one nfs server per filesystem, but we _must_ have stable clustered operation with NFS exports active on at least one node.
Created attachment 402691 [details] updated patch Updated patch, added "nonfslocks" in addition to "nfslocks", now matches upstream version. I have tested the upstream patch, it works as expected.
Created attachment 402696 [details] patch fix patch conflicts
Created attachment 402920 [details] patch fix
I have tested and verified the patch in comment 94, in build: https://brewweb.devel.redhat.com/taskinfo?taskID=2345796
http://post-office.corp.redhat.com/archives/rhkernel-list/2010-March/msg00882.html
Hm, Steve has rejected this patch since he doesn't understand the problem. Steve, this bug is now completely in your hands.
Steve, what is the current state of this bug?
Alan, I don't think there is anything to fix here. The bz has unfortunately got rather confused from the original report. Several different issues have been reported along the way. Let me try and clarify the situation....

We do not support:
o Mixed samba/nfs exports of GFS/GFS2 filesystems
o Active/active nfs exports with nfs lockd support (active/active should work without locking, and with udp nfs only)
o Mixed nfs and local applications on GFS/GFS2 filesystems
o Mixed samba and local applications on GFS/GFS2 filesystems

Active/passive nfs exports should work with nfs lockd, but you must set the "localflocks" mount option on each GFS/GFS2 mount.

If there are any issues other than those relating to NFS locking, they should be reported in different bugzillas.

We would like to be able to support both samba and nfs mixed, and also active/active with locking support, in the future. We have bz #580863, for example, open to track the upstream effort required to implement the features we need in order to do this. It will not be an easy thing to do, unfortunately.

If there has been some confusion in the information supplied by support and/or any other part of Red Hat then please accept my apologies for that. If you are still experiencing problems, then please drop support a line and they will do their best to assist.

As for this bug, the originally reported issue has been resolved, so I'm now intending to close it.
Firstly, the original problem has NOT been resolved. There are still panics. Secondly your comment about not supporting Mixed Samba/NFS exports of GFS is at odds with the original sale - which was specifically for this purpose...
Alan, I think your set up does not bear much relation to the original report for which this bz was opened. The details of that are:
o A contrived example set up by our support team to debug another customer's problem
o Active/active with nfs locking (not supported)
o Using nfs on each of the two nodes to mount gfs2 from the other node (also not supported)
o No use of samba at all

I apologise if someone at Red Hat has given you incorrect information. I would be very interested to know who gave you that information. If you can drop our support team a line, then we'll try and work with you to come up with a solution for your situation.

The reason that mixed samba/nfs is a problem is basically down to locking. Samba has an internal cache of information which uses posix leases to keep itself up to date. GFS2 doesn't support leases when it is clustered, although it does when it is run single node (lock_nolock). NFS supports posix fcntl locks and so does GFS2 (if lockd support is enabled, i.e. when clustered and not using the localflocks option). There is however a problem in trying to use that interface active/active, in that NFS doesn't have cluster recovery support, so the lock state cannot in that case be recovered on node failure. The combination of these issues means that we cannot support mixed samba and nfs at the moment, even though we would certainly like to do so.
Closing this bug on the basis that the original report was for an unsupported situation which has since been resolved. If there are other issues then they should be reported under a different/new bugzilla to avoid further confusion. Upstream development work for active/active support of nfs lockd can be found in bz #580863
Please state what exactly is "unsupported" about NFS exporting running on a single GFS cluster node running by itself that is also exporting the same filesystems via samba and/or running local operations on the filesystem.
The reason I ask this question is that the crash mode seen in the original report is exactly the same as seen on our systems in the configuration I describe. We were referred to this ticket by Red Hat Support because our crashdumps matched. Opening a new BZ for the _same_ issue simply causes more confusion.

The crash happens more often if there are 2 GFS nodes, even if one is completely quiescent. It's disruptive because we experience 10-15 minute cluster downtime for every event _and_ the ~1Tb filesystems eventually have to be taken down for a day in order to fsck.

This is not a contrived situation. It's an issue which occurs under normal network loadings and which didn't manifest prior to RHEL 5.2. This is a very real problem, happening on supported configurations. It needs to be addressed properly, not shoved under the carpet.
Samba and NFS both maintain some state (lock state specifically) not in the kernel. As things stand today, there is no coordination between Samba (a user space process) and the NFS server when exporting the same partition. This is something that commercial NAS appliances provide and something worth investigating/implementing in Linux so we are interested in hearing about customers that would like this. Please fill in a "FEAT" request if you are interested, this BZ is not intended to be a feature request. Best regards, Ric
As per some of the previous comments: would mixing nfs and local applications on GFS/GFS2 filesystems be supported when using NFSv4? If I've understood it right, it should work.
I doubt it will work correctly if recovery takes place. It is not tested and thus NFS is only supported on its own and not mixed with local applications.