Bug 1379329
Summary: | Some of the files go into the .glusterfs/unlink folder after execution of the pynfs test suite. | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Shashank Raj <sraj> |
Component: | nfs-ganesha | Assignee: | Soumya Koduri <skoduri> |
Status: | CLOSED ERRATA | QA Contact: | Arthy Loganathan <aloganat> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | rhgs-3.1 | CC: | aloganat, bmohanra, jthottan, kkeithle, mzywusko, ndevos, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | RHGS 3.2.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | nfs-ganesha-2.4.0-2 | Doc Type: | Bug Fix |
Doc Text: |
Previously, there was an fd leak on files on which lock operations had been performed from an NFS mount of a volume exported via the NFS-Ganesha server. When such a file was deleted, it was not removed from the .glusterfs/unlink folder on the backend, consuming memory. With this fix, all files that are removed from the mount point are removed completely from the backend as well.
|
Story Points: | --- | | |
Clone Of: | | Environment: | |
Last Closed: | 2017-03-23 06:23:09 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1351528 | | |
Description
Shashank Raj
2016-09-26 11:45:48 UTC
With 3.1.3, the entries inside .glusterfs/unlink were present for almost all the files even after removal, but with 3.1.3 + hotfix the entries are seen for only 3 files. Also, when running the write tests individually (including the failed cases WRT18 and WRT19), I don't see this issue, which means those failed cases are not the cause of the entries inside .glusterfs/unlink. Maybe there is still some ref leak happening in a certain code path.

@Shashank, to narrow this down further, could you please run specific codes and flags and check which specific test leads to this leak? To list the possible codes and flags, you could try the commands below (perhaps in an automated test script):

./testserver.py --showflags
./testserver.py --showcodes

The tests LOCKCHGD and LOCKMRG under lock are causing this issue.

********************************************************************
./testserver.py 10.70.44.92:/testvolume -v --outfile ~/pynfs.run --maketree --showomit --rundep LOCKCHGD > /tmp/pynfs-hotfix-lock-specific.log

Results:
INIT     st_setclientid.testValid : RUNNING
INIT     st_setclientid.testValid : PASS
MKFILE   st_open.testOpen         : RUNNING
MKFILE   st_open.testOpen         : PASS
LOCK1    st_lock.testFile         : RUNNING
LOCK1    st_lock.testFile         : PASS
LOCKCHGD st_lock.testDowngrade    : RUNNING
LOCKCHGD st_lock.testDowngrade    : PASS
**************************************************
INIT     st_setclientid.testValid : PASS
LOCK1    st_lock.testFile         : PASS
LOCKCHGD st_lock.testDowngrade    : PASS
MKFILE   st_open.testOpen         : PASS
**************************************************
Command line asked for 4 of 665 tests

Files present in unlink path:
[root@dhcp47-149 unlink]# cd /bricks/brick2/b2/.glusterfs/unlink/
[root@dhcp47-149 unlink]# ls
caa03fa0-e855-4375-98fc-74e361a07cce
*******************************************************************

*******************************************************************
./testserver.py 10.70.44.92:/testvolume -v --outfile ~/pynfs.run --maketree --showomit --rundep LOCKMRG > /tmp/pynfs-hotfix-lock-specific.log

Results:
INIT    st_setclientid.testValid : RUNNING
INIT    st_setclientid.testValid : PASS
MKFILE  st_open.testOpen         : RUNNING
MKFILE  st_open.testOpen         : PASS
LOCK1   st_lock.testFile         : RUNNING
LOCK1   st_lock.testFile         : PASS
LKTOVER st_lockt.testOverlap     : RUNNING
LKTOVER st_lockt.testOverlap     : PASS
LOCKMRG st_lock.testOverlap      : RUNNING
LOCKMRG st_lock.testOverlap      : PASS
**************************************************
INIT    st_setclientid.testValid : PASS
LKTOVER st_lockt.testOverlap     : PASS
LOCK1   st_lock.testFile         : PASS
LOCKMRG st_lock.testOverlap      : PASS
MKFILE  st_open.testOpen         : PASS
**************************************************
Command line asked for 5 of 665 tests

Files present in unlink path:
[root@dhcp47-149 unlink]# ls
2e76b1b5-da65-4225-936b-5e943ec2f45e  caa03fa0-e855-4375-98fc-74e361a07cce
******************************************************************

[root@dhcp47-149 ~]# ls -ltr /bricks/brick2/b2/.glusterfs/unlink/bd1d4afa-6ecf-4e39-9b18-ccea0101c81a
-rwxr-xr-x. 1 root root 0 Sep 26 09:36 /bricks/brick2/b2/.glusterfs/unlink/bd1d4afa-6ecf-4e39-9b18-ccea0101c81a

[root@dhcp47-149 ~]# ps aux | grep brick2
root     22966  0.6  0.9 1893280 78536 ?   Ssl  Sep22  41:47 /usr/sbin/glusterfsd -s 10.70.47.149 --volfile-id testvolume.10.70.47.149.bricks-brick2-b2 -p /var/lib/glusterd/vols/testvolume/run/10.70.47.149-bricks-brick2-b2.pid -S /var/run/gluster/1701b36d8c9fed2c219d4cfac9e70b32.socket --brick-name /bricks/brick2/b2 -l /var/log/glusterfs/bricks/bricks-brick2-b2.log --xlator-option *-posix.glusterd-uuid=767feb78-378b-48c5-8f6b-a08aaae053af --brick-port 49159 --xlator-option testvolume-server.listen-port=49159
root     23585  0.0  0.0  112648   976 pts/1 S+   03:39   0:00 grep --color=auto brick2

[root@dhcp47-149 ~]# ls -ltr /proc/22966/fd | grep bd1d4afa-6ecf-4e39-9b18-ccea0101c81a
lrwx------. 1 root root 64 Sep 27 02:50 32 -> /bricks/brick2/b2/.glusterfs/unlink/bd1d4afa-6ecf-4e39-9b18-ccea0101c81a

There is still one open fd for each of these files under the unlink directory.

When checked via gdb (for test LOCKCHGD), found that the file "LOCKCHGD" is:
* opened (comes to FSAL_GLUSTER, which makes the gfapi call returning a glfd)
* a Write Lock is taken on that file
* a Read Lock is issued for the same offset
* the file is removed

But no close operation reached FSAL_GLUSTER, which resulted in a glfd leak for that file. On further debugging, found that:

Breakpoint 11, state_lock (entry=0x7fa5ac079e60, owner=owner@entry=0x7fa5cc11a1c0, state=0x7fa5cc1169c0,
    blocking=blocking@entry=STATE_NON_BLOCKING, block_data=block_data@entry=0x0, lock=lock@entry=0x7fa5f7ffd010,
    holder=holder@entry=0x7fa5f7ffcfb0, conflict=conflict@entry=0x7fa5f7ffd030)
    at /usr/src/debug/nfs-ganesha-2.3.1/src/SAL/state_lock.c:2495
2495    {
(gdb) p *lock
$38 = {lock_sle_type = FSAL_POSIX_LOCK, lock_type = FSAL_LOCK_R, lock_start = 25, lock_length = 75, lock_reclaim = false}
(gdb) c
Continuing.

Breakpoint 15, do_lock_op (entry=entry@entry=0x7fa5ac079e60, lock_op=lock_op@entry=FSAL_OP_LOCK,
    owner=owner@entry=0x7fa5cc11a1c0, lock=lock@entry=0x7fa5f7ffd010, holder=0x7fa5f7ffcfb0,
    conflict=0x7fa5f7ffd030, overlap=0, sle_type=sle_type@entry=FSAL_POSIX_LOCK)
    at /usr/src/debug/nfs-ganesha-2.3.1/src/SAL/state_lock.c:2280
2280    {
(gdb) c
Continuing.
Breakpoint 14, merge_lock_entry (entry=entry@entry=0x7fa5ac079e60, lock_entry=lock_entry@entry=0x7fa5dc06d480)
    at /usr/src/debug/nfs-ganesha-2.3.1/src/SAL/state_lock.c:795
795     {
(gdb) p *lock_entry
$39 = {sle_list = {next = 0x0, prev = 0x0}, sle_owner_locks = {next = 0x7fa5cc11a1c8, prev = 0x7fa5cc11a430},
  sle_client_locks = {next = 0x0, prev = 0x0}, sle_state_locks = {next = 0x7fa5cc116a30, prev = 0x7fa5cc11a450},
  sle_export_locks = {next = 0x7fa627c632b0, prev = 0x7fa5cc11a460}, sle_export = 0x7fa627c63268,
  sle_entry = 0x7fa5ac079e60, sle_block_data = 0x0, sle_owner = 0x7fa5cc11a1c0, sle_state = 0x7fa5cc1169c0,
  sle_blocked = STATE_NON_BLOCKING, sle_ref_count = 1,
  sle_lock = {lock_sle_type = FSAL_POSIX_LOCK, lock_type = FSAL_LOCK_R, lock_start = 25, lock_length = 75,
    lock_reclaim = false},
  sle_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}
(gdb) p entry->object.file.lock_list
$40 = {next = 0x7fa5cc11a420, prev = 0x7fa5cc11a420}
(gdb) p &entry->object.file.lock_list
$41 = (struct glist_head *) 0x7fa5ac079fd8
(gdb) n
805             glist_for_each_safe(glist, glistn, &entry->object.file.lock_list) {
(gdb)
809             if (check_entry == lock_entry)
(gdb)
813                     (check_entry->sle_owner, lock_entry->sle_owner))
(gdb)
812             if (different_owners
(gdb)
817             if (check_entry->sle_blocked != STATE_NON_BLOCKING)
(gdb)
820             check_entry_end = lock_end(&check_entry->sle_lock);
(gdb)
821             lock_entry_end = lock_end(&lock_entry->sle_lock);
(gdb)
823             if ((check_entry_end + 1) < lock_entry->sle_lock.lock_start)
(gdb)
827             if ((lock_entry_end + 1) < check_entry->sle_lock.lock_start)
(gdb)
836             if ((check_entry->sle_lock.lock_type !=
(gdb)
838                 && ((lock_entry_end < check_entry_end)
(gdb)
839                     || (check_entry->sle_lock.lock_start <
(gdb)
908                     lock_entry_end - lock_entry->sle_lock.lock_start + 1;
(gdb)
911             LogEntry("Merged", lock_entry);
(gdb)
912             LogEntry("Merging removing", check_entry);
(gdb)
913             remove_from_locklist(check_entry);
(gdb)

>>> In state_lock(), we take a pin_ref for the cache_entry, and this needs to be unref'ed at the end unless it is the first lock entry for that file:

2512 state_status_t state_lock(cache_entry_t *entry,
2513                           state_owner_t *owner,
2514                           state_t *state,
2515                           state_blocking_t blocking,
2516                           state_block_data_t *block_data,
2517                           fsal_lock_param_t *lock,
2518                           state_owner_t **holder,
2519                           fsal_lock_param_t *conflict)
2520 {
2521         bool allow = true, overlap = false;
2522         struct glist_head *glist;
....
2542         cache_status = cache_inode_inc_pin_ref(entry);
....
2804         if (allow
2805             || fsal_export->exp_ops.fs_supports(fsal_export,
2806                                                 fso_lock_support_async_block)) {
2807                 /* Prepare to make call to FSAL for this lock */
2808                 status = do_lock_op(entry,
2809                                     lock_op,
2810                                     owner,
2811                                     lock,
2812                                     allow ? holder : NULL,
2813                                     allow ? conflict : NULL,
2814                                     overlap,
2815                                     FSAL_POSIX_LOCK);
2816         } else
2817                 status = STATE_LOCK_BLOCKED;
2818
2819         if (status == STATE_SUCCESS) {
2820                 /* Merge any touching or overlapping locks into this one */
2821                 LogEntry("FSAL lock acquired, merging locks for",
2822                          found_entry);
2823
2824                 merge_lock_entry(entry, found_entry);
2825
2826                 /* Insert entry into lock list */
2827                 LogEntry("New lock", found_entry);
2828
2829                 if (glist_empty(&entry->object.file.lock_list)) {
2830                         /* List was empty so we must retain the pin reference
2831                          */
2832                         unpin = false;
2833                 }

>>>> i.e., only if it is the first lock entry should we set unpin to false.

....
2884  out:
2885
2886         if (unpin) {
2887                 /* We don't need to retain the pin and LRU ref we took at the
2888                  * top because the file was already pinned for locks or we
2889                  * did not add locks to the file.
2890                  */
2891                 cache_inode_dec_pin_ref(entry, false);
2892                 cache_inode_lru_unref(entry, LRU_FLAG_NONE);
2893         }

>> So for the second lock on the same file, unpin should ideally be set to true. But in the pynfs test case it looks like it got reset to false. The reason is that in merge_lock_entry() (called at line 2824), we merge the overlapping locks and remove the old lock entry. Since the list has thus become empty, unpin got set to false. This needs to be fixed: while removing the last lock entry we need to decrement pin_refcnt as well.

From 57b7c69cac74957b4751e99d03362c126f85cfc3 Mon Sep 17 00:00:00 2001
From: Soumya Koduri <skoduri>
Date: Tue, 27 Sep 2016 14:45:41 +0530
Subject: [PATCH] SAL: Ref leak in case of merge of lock entries

In state_lock(), we pin the cache_entry before processing the lock
operation. This shall be unref'ed at the end of the operation unless
it is the first lock entry for that cache entry.

But in case of overlapping locks, in merge_lock_entry(), once we merge
the locks, we remove the older one. This could lead to an empty lock
list for that cache_entry, but with the earlier ref not released,
resulting in a leak. The fix is to unpin the cache entry if the lock
entry removed was the last one in its lock list.
Change-Id: I9ed9e9ad8115d1e3c5dcc28bcc8e733b94b82de2
Signed-off-by: Soumya Koduri <skoduri>
---
 src/SAL/state_lock.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/SAL/state_lock.c b/src/SAL/state_lock.c
index bba2bf7..cb628a5 100644
--- a/src/SAL/state_lock.c
+++ b/src/SAL/state_lock.c
@@ -924,6 +924,13 @@ static void merge_lock_entry(cache_entry_t *entry,
 			LogEntry("Merged", lock_entry);
 			LogEntry("Merging removing", check_entry);
 			remove_from_locklist(check_entry);
+
+			/* if check_entry was the last lock entry, unpin
+			 * the cache entry */
+			if (glist_empty(&entry->object.file.lock_list)) {
+				cache_inode_dec_pin_ref(entry, false);
+				cache_inode_lru_unref(entry, LRU_FLAG_NONE);
+			}
 		}
 	}
--
2.5.0
[root@dhcp35-197 nfs-ganesha]#

The above patch seems to have fixed the issue, but we need to check whether it is applicable to the latest 2.4 nfs-ganesha code path as well, and get it reviewed upstream.

This fix doesn't seem to be applicable to the RHGS 3.2/nfs-ganesha 2.4 codebase.

The fix is available with the RHGS 3.2 nfs-ganesha rpms.

Verified this bug on RHGS 3.2 with the nfs-ganesha-2.4.1-1.el7rhgs.x86_64 build. Observed that some files are present under the .glusterfs/unlink folder once the pynfs test suite finishes, but when the data is removed from the mount point, these files are deleted from the .glusterfs/unlink path.

Edited the doc text for the errata.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html
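The pin-reference accounting behind this bug can be condensed into a small standalone model. This is a hypothetical sketch, not the real SAL code: plain counters stand in for the cache_inode pin refcount and for the length of entry->object.file.lock_list, and the `fixed` flag toggles the one change the patch above makes (dropping the pin when a merge empties the lock list).

```c
#include <stdbool.h>

/* Toy stand-in for a cache_entry: pin_refs models the cache_inode pin
 * refcount, lock_count models the length of the file's lock list. */
typedef struct {
    int pin_refs;
    int lock_count;
} cache_entry;

/* Model of merge_lock_entry(): the new lock absorbs an overlapping old
 * entry, which is removed from the lock list.  With fixed == true we
 * also release the pin the old list was holding when the list becomes
 * empty, mirroring the upstream patch. */
static void merge_lock_entry(cache_entry *e, bool fixed)
{
    e->lock_count--;                 /* remove_from_locklist(check_entry) */
    if (fixed && e->lock_count == 0)
        e->pin_refs--;               /* the patch: unpin on empty list */
}

/* Model of state_lock(): take a pin up front, merge if the new lock
 * overlaps an existing one, retain the pin only when the lock list is
 * empty at the glist_empty() check, then insert the entry. */
static void state_lock(cache_entry *e, bool overlaps, bool fixed)
{
    bool unpin = true;
    e->pin_refs++;                   /* cache_inode_inc_pin_ref(entry) */
    if (overlaps)
        merge_lock_entry(e, fixed);
    if (e->lock_count == 0)
        unpin = false;               /* list empty: retain the pin ref */
    e->lock_count++;                 /* insert the (merged) lock entry */
    if (unpin)
        e->pin_refs--;
}

/* Model of releasing the last lock: the list empties and the retained
 * pin is dropped, which is what lets the glfd finally close. */
static void state_unlock(cache_entry *e)
{
    e->lock_count--;
    if (e->lock_count == 0)
        e->pin_refs--;
}
```

Without the fix, the LOCKCHGD/LOCKMRG sequence (lock, overlapping lock that merges away the old entry, unlock) ends with one pin reference still held, which corresponds to the dangling open fd observed under .glusterfs/unlink; with the fix the counts balance back to zero.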