---Problem Description--- As of 2.6.30 kernels, a new lock is defined in the file struct (f_lock) and this lock is being obtained (I believe unnecessarily) in the path taken for fcntl(F_UNLCK). In RHEL 5.4 (2.6.18-164.11.1.el5), there is a similar new lock call in this path, but using the existing f_ep_lock. In either case, this new use of "fl->fl_file->" to access the lock is causing problems for GPFS in this path: posix_lock_file __posix_lock_file locks_delete_lock fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); fasync_remove_entry spin_lock(&filp->f_lock); <----- fasync_add_entry spin_lock(&filp->f_lock); <----- The problem for GPFS (as a cluster filesystem) is that in handling posix locks we have to ensure no conflicting locks are held on other nodes. To accomplish this, we temporarily acquire the lock on remote nodes before allowing the requestor to obtain it. Since on this remote node, the lock call is not application driven, it may not have a normal file structure associated with it, and in some cases it is necessary that fl_file is NULL when such a lock is obtained. Note that there are places (such as lock_get_status...used during "cat /proc/locks") where this is allowed in the code: static void lock_get_status(struct seq_file *f, struct file_lock *fl, int id, char *pfx) { struct inode *inode = NULL; unsigned int fl_pid; if (fl->fl_nspid) fl_pid = pid_vnr(fl->fl_nspid); else fl_pid = fl->fl_pid; -----> if (fl->fl_file != NULL) inode = fl->fl_file->f_path.dentry->d_inode; I would like to avoid this new lock when the object is a POSIX lock (I don't think that fasync_helper even needs to be called in this case). The call could be made conditional on this being a lease: 573 static void locks_delete_lock(struct file_lock **thisfl_p) 574 { 575 struct file_lock *fl = *thisfl_p; 576 577 *thisfl_p = fl->fl_next; 578 fl->fl_next = NULL; 579 list_del_init(&fl->fl_link); 580 +++++ if (IS_LEASE(fl)) 581 fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); ---uname output--- Linux c1f2bc3n06 2.6.18-164.11.1.el5 #1 SMP Wed Jan 6 13:26:04 EST 2010 x86_64 x86_64 x86_64 GNU/Linux ---Steps to Reproduce--- run fcntl application across GPFS nodes using RHEL5.4 ---Kernel - Filesystem Component Data--- Stack trace output: Feb 6 03:01:51 c1f2bc3n06 kernel: Unable to handle kernel NULL pointer dereference at 00000000000000e0 RIP: Feb 6 03:01:51 c1f2bc3n06 kernel: [<ffffffff80064a85>] _spin_lock+0x0/0xa Feb 6 03:01:51 c1f2bc3n06 kernel: PGD 11ad09067 PUD 10b354067 PMD 0 Feb 6 03:01:51 c1f2bc3n06 kernel: Oops: 0002 [1] SMP Feb 6 03:01:51 c1f2bc3n06 kernel: CPU 3 Feb 6 03:01:51 c1f2bc3n06 kernel: Modules linked in: mmfs26(U) mmfslinux(U) tracedev(U) nfs fscache nfs_acl deflat e zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_ciph er testmgr crypto_blkcipher aes_x86_64 ah6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel 4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel ipcomp ipcomp6 xfrm6_tunnel tunnel6 af_key autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transpor t_iscsi2 scsi_transport_iscsi dm_round_robin scsi_dh_rdac dm_multipath scsi_dh video backlight sbs i2c_ec button ba ttery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm(U) i2c_amd756 i2c_core k8_edac amd_rng k8temp t g3 edac_mc pcspkr hwmon sg serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_m Feb 6 03:01:51 c1f2bc3n06 kernel: rror dm_log dm_mod qla2xxx scsi_transport_fc shpchp mptspi mptscsih mptbase scsi _transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Feb 6 03:01:51 c1f2bc3n06 kernel: Pid: 8993, comm: mmfsd Tainted: G 2.6.18-164.11.1.el5 #1 Feb 6 03:01:51 c1f2bc3n06 kernel: RIP: 0010:[<ffffffff80064a85>] [<ffffffff80064a85>] _spin_lock+0x0/0xa Feb 6 03:01:51 c1f2bc3n06 kernel: RSP: 0018:ffff8100dac47830 EFLAGS: 00010246 Feb 6 03:01:51 c1f2bc3n06 kernel: RAX: ffff81011ef106c0 RBX: ffff81011ef106b8 RCX: ffff81011ef10728 Feb 6 03:01:51 c1f2bc3n06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000e0 Feb 6 03:01:59 c1f2bc3n06 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff889c9f02 Feb 6 03:02:08 c1f2bc3n06 kernel: R10: ffff8100dac47b98 R11: 0000000000000001 R12: ffff81011ef10728 Feb 6 03:02:08 c1f2bc3n06 kernel: R13: 0000000000000000 R14: ffff81003e11f340 R15: 0000000000000001 Feb 6 03:02:09 c1f2bc3n06 kernel: FS: 0000000044311940(0063) GS:ffff81011fc786c0(0000) knlGS:0000000000000000 Feb 6 03:02:09 c1f2bc3n06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Feb 6 03:02:09 c1f2bc3n06 kernel: CR2: 00000000000000e0 CR3: 000000010b7db000 CR4: 00000000000006e0 Feb 6 03:02:09 c1f2bc3n06 kernel: Process mmfsd (pid: 8993, threadinfo ffff8100dac46000, task ffff8100d8f41860) Feb 6 03:02:09 c1f2bc3n06 kernel: Stack: ffffffff8003af28 ffffffff889c9f02 ffff81011ef106b8 ffff8100dac47b98 Feb 6 03:02:09 c1f2bc3n06 kernel: ffff8100dac47918 ffff81003e11f240 ffffffff8004e76c ffff81011ef106b8 Feb 6 03:02:09 c1f2bc3n06 kernel: ffffffff800e90e9 ffffffff889c9f02 ffff8100dac47918 ffff81011ef101e8 Feb 6 03:02:09 c1f2bc3n06 kernel: Call Trace: Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff8003af28>] fasync_helper+0x23/0x114 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff8004e76c>] locks_delete_lock+0x3d/0x87 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff800e90e9>] __posix_lock_file_conf+0x21b/0x3e0 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff889be7a3>] :mmfslinux:cxiFcntlLock+0x313/0x430 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff88a1c8c0>] :mmfs26:SharkRetry+0x0/0x180 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff889acf70>] :mmfslinux:igrabInodeFindActor+0x0/0x70 Feb 6 03:02:09 c1f2bc3n06 kernel: [<ffffffff88a1b4a8>] :mmfs26:kxCommonReclock+0x2e8/0x6c0 Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff88afa356>] :mmfs26:_Z8ss_ioctljm+0xee6/0x13b0 Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff889b8eca>] :mmfslinux:doSSIoctl+0x35a/0x3c0 Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff80021f1e>] __up_read+0x19/0x7f Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff80042182>] do_ioctl+0x21/0x6b Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff80030293>] vfs_ioctl+0x457/0x4b9 Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff8004c843>] sys_ioctl+0x59/0x78 Feb 6 03:02:10 c1f2bc3n06 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83 Feb 6 03:02:10 c1f2bc3n06 kernel: Feb 6 03:02:10 c1f2bc3n06 kernel: Feb 6 03:02:10 c1f2bc3n06 kernel: Code: f0 ff 0f 0f 88 6c 01 00 00 c3 f0 81 2f 00 00 00 01 74 05 e8 Feb 6 03:02:10 c1f2bc3n06 kernel: RIP [<ffffffff80064a85>] _spin_lock+0x0/0xa Feb 6 03:02:10 c1f2bc3n06 kernel: RSP <ffff8100dac47830> Oops output: Feb 6 03:01:51 c1f2bc3n06 kernel: Unable to handle kernel NULL pointer dereference at 00000000000000e0 RIP: Feb 6 03:01:51 c1f2bc3n06 kernel: [<ffffffff80064a85>] _spin_lock+0x0/0xa Feb 6 03:01:51 c1f2bc3n06 kernel: PGD 11ad09067 PUD 10b354067 PMD 0 Feb 6 03:01:51 c1f2bc3n06 kernel: Oops: 0002 [1] SMP
I don't quite understand how a file_lock can have a NULL fl_file pointer in this stack trace. For __posix_lock_file_conf() to call locks_delete_lock() it means that a lock request overlaps and/or replaces an existing one. The existing one is already attached to the inode and I'm pretty sure it should have a valid file pointer attached. When a file pointer is closed all posix locks that use that file pointer are released so we shouldn't have any posix locks hanging around for files that are not open.
This problem was exposed by patch: linux-2.6-fs-fasync-split-fasync_helper-into-separate-add-remove-functions.patch that went into 2.6.18-164.11.1.el5. Prior to this patch the fasync_helper() function didn't de-reference the fl_file pointer so a NULL value passed through without incident. But this function did (and still does) use the fl_file value to filter out which locks to remove and filtering on a NULL value rather than a valid file pointer may not be producing the correct results and in that case it would be leaking memory in the fasync_cache.
It appears the only way to queue a locked file onto the async list is via fcntl_setlease(): int fcntl_setlease(unsigned int fd, struct file *filp, long arg) { struct file_lock fl, *flp = &fl; struct dentry *dentry = filp->f_dentry; struct inode *inode = dentry->d_inode; int error; if (IS_NO_LEASES(inode)) return -EINVAL; locks_init_lock(&fl); error = lease_init(filp, arg, &fl); if (error) return error; lock_kernel(); error = setlease(filp, arg, &flp); if (error || arg == F_UNLCK) goto out_unlock; error = fasync_helper(fd, filp, 1, &flp->fl_fasync); ... Where filp is a valid file pointer. int fasync_helper(int fd, struct file *filp, int on, struct fasync_struct **fapp) { if (!on) return fasync_remove_entry(filp, fapp); return fasync_add_entry(fd, filp, fapp); } static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fapp) { struct fasync_struct *new, *fa, **fp; int result = 0; new = kmem_cache_alloc(fasync_cache, GFP_KERNEL); if (!new) return -ENOMEM; spin_lock(&filp->f_ep_lock); write_lock_irq(&fasync_lock); for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) { if (fa->fa_file != filp) continue; fa->fa_fd = fd; kmem_cache_free(fasync_cache, new); goto out; } new->magic = FASYNC_MAGIC; new->fa_file = filp; new->fa_fd = fd; new->fa_next = *fapp; *fapp = new; result = 1; filp->f_flags |= FASYNC; out: write_unlock_irq(&fasync_lock); spin_unlock(&filp->f_ep_lock); return result; } So the fasync_struct gets queue with a non-NULL fa_file field. When deleting locks, fl_file is apparently NULL: static void locks_delete_lock(struct file_lock **thisfl_p) { struct file_lock *fl = *thisfl_p; *thisfl_p = fl->fl_next; fl->fl_next = NULL; list_del_init(&fl->fl_link); fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); if (fl->fl_fasync != NULL) { printk(KERN_ERR "locks_delete_lock: fasync == %p\n", fl->fl_fasync); fl->fl_fasync = NULL; } ... So we have a NULL filp here: static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp) { struct fasync_struct *fa, **fp; int result = 0; spin_lock(&filp->f_ep_lock); write_lock_irq(&fasync_lock); for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) { if (fa->fa_file != filp) continue; *fp = fa->fa_next; kmem_cache_free(fasync_cache, fa); filp->f_flags &= ~FASYNC; result = 1; break; } write_unlock_irq(&fasync_lock); spin_unlock(&filp->f_ep_lock); return result; } And matching on a NULL filp will not find the locked files that were originally queued on the async list. So back in locks_delete_lock() we should be tripping this code: if (fl->fl_fasync != NULL) { printk(KERN_ERR "locks_delete_lock: fasync == %p\n", fl->fl_fasync); fl->fl_fasync = NULL; } And leaking memory from the fasync_cache. The same bug would have existed before the fasync_helper() patch was applied.
From the description from the customer it sounds like they are not using lease locks. Since we only queue the locked file for leases then there's no need to dequeue it when we are removing a non-lease lock. So their suggested change would certainly prevent the panics they are seeing. 573 static void locks_delete_lock(struct file_lock **thisfl_p) 574 { 575 struct file_lock *fl = *thisfl_p; 576 577 *thisfl_p = fl->fl_next; 578 fl->fl_next = NULL; 579 list_del_init(&fl->fl_link); 580 +++++ if (IS_LEASE(fl)) 581 fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync); While I think this is a valid optimisation for the RHEL code I'm not convinced it's the correct way to fix their problem because having a lock without a valid fl_file field is indicative of another bug.
A test kernel with version kernel-2.6.18-164.11.1.el5.bz567479.1 that includes the proposed fix to locks_delete_lock() will be available from: https://brewweb.devel.redhat.com/taskinfo?taskID=2275587
Created attachment 395654 [details] Patch requested by customer
(In reply to comment #4) > This problem was exposed by patch: > linux-2.6-fs-fasync-split-fasync_helper-into-separate-add-remove-functions.patch > that went into 2.6.18-164.11.1.el5. > > Prior to this patch the fasync_helper() function didn't de-reference the > fl_file pointer so a NULL value passed through without incident. Ok, interesting. Upstream, the old fasync_helper did this before the change: 625 int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fapp) 626 { ... 637 /* 638 * We need to take f_lock first since it's not an IRQ-safe 639 * lock. 640 */ 641 spin_lock(&filp->f_lock); but we didn't have that in RHEL5; the patch in question essentially added this new spinlock which now dereferences filp. Danny's patch essentially pulled in part of 76398425bb06b07cc3a3b1ce169c67dc9d6874ed as well, which moved the locking to this function. But we didn't get the rest of it, let me see what was going on there.
So whatever GPFS is doing to end up with the NULL filp, that's why we didn't oops before: 76398425bb06b07cc3a3b1ce169c67dc9d6874ed turned a big lock_kernel() into a spin_lock(filp-> ... ) and we didn't have any of that; now we do, -some- of it came in with the backport of 53281b6d3, which added a new deref of filp - and we still have the old lock_kernel as well. So -maybe- with that old lock_kernel still in place, removing the spin_lock() would eliminate the NULL deref and get back to the old behavior that "worked," but it still doesn't sound like we should be getting here with a NULL filp in any case.
Eric, I think you're onto something here. Since the lock_kernel() is still in place then the f_ep_lock seems totally redundant and can be removed. It's not even the correct lock to use. Removing the f_ep_lock instead of the proposed patch will mean we don't have to update mainline, we fix a bug in our own code and (as Al pointed out) we wont hurt ISVs that depend on fasync for non-lease locks.
Jeremy, a test kernel with version kernel-2.6.18-164.11.1.el5.bz567479.2 that includes a patch to remove the unneeded f_ep_lock will be available from: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2278674
Created attachment 395873 [details] Alternative patch to remove unnecessary use of f_ep_lock
------- Comment From sharyath.com 2010-02-23 01:30 EDT------- Brian You already seem to have solution for this problem. I went through the code and found that even in the later code (2.6.32) this check is not present. I would suggest you to mail it to lkml as the distro will be more willing to consider it for acceptance once the mainline acceptance is complete. Can you please mail it to lkml. And one more request please do CC me (sharyath.com) so that it will be easier to track the issue. Regards Sharyathi N
Event posted on 2010-02-24 10:40:44am EST by cambach Hi Jeremy, I installed the new kernel kernel-2.6.18-164.11.1.el5.bz567479.2 on a SONAS box and did a quick test. The error did not show up again as with the first test package. Regards, Christian This event sent from IssueTracker by jeder issue 545163
------- Comment From tpnoonan.com 2010-02-24 16:33 EDT------- HI Red Hat, Any outlook for when this code will go upstream? Thanks
it would appear that locks_mandatory_area() can queue a lock with a NULL filp.
um... Just to make it clear - NULL fl->fl_file is not valid. We are talking about workaround for RHEL and *only* that. For mainline GPFS will need to figure out how to live without that, the quoted piece of (misguided) defensive programming in the tree nonwithstanding.
Created attachment 396472 [details] Updated patch for removing f_ep_lock Fixed a comment in this patch to match the code changes.
------- Comment From sharyath.com 2010-03-09 05:31 EDT------- I am moving this bug to Fixed by Distro and to Submitted Re-open if you like to investigate further
in kernel-2.6.18-197.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
------- Comment From dixonbp.com 2010-05-13 11:40 EDT------- yes, this can be closed
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The Red Hat Enterprise Linux 5.5 kernel contained a fix for Bugzilla issue number 548657 which introduced a regression in file locking behavior that presented with the General Parallel File System (GPFS). This update removes the redundant locking code.
(In reply to comment #4) > This problem was exposed by patch: > linux-2.6-fs-fasync-split-fasync_helper-into-separate-add-remove-functions.patch > that went into 2.6.18-164.11.1.el5. > > Prior to this patch the fasync_helper() function didn't de-reference the > fl_file pointer so a NULL value passed through without incident. But this > function did (and still does) use the fl_file value to filter out which locks > to remove and filtering on a NULL value rather than a valid file pointer may > not be producing the correct results and in that case it would be leaking > memory in the fasync_cache. QE, while testing this bug, please also use the test case for CVE-2009-4141. Email me if you do not have access to the test case already. Thanks.
According to the patch, this has been verified by IBM's GPFS developer Brian Dixon. I have ran test case suggested in comment 37 and fsstress, no issue found. And confirmed patch is applied correctly in -233 kernel
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html