Bug 1326252

Summary: Failed refresh config generates a lot of "inode.c:1792:inode_table_destroy" messages in ganesha-gfapi.log
Product: Red Hat Gluster Storage
Reporter: Shashank Raj <sraj>
Component: nfs-ganesha
Assignee: Soumya Koduri <skoduri>
Status: CLOSED ERRATA
QA Contact: Shashank Raj <sraj>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: asrivast, jthottan, kkeithle, ndevos, nlevinki, rhinduja, sashinde, skoduri, sraj
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.9-3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-23 05:33:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1311817

Description Shashank Raj 2016-04-12 09:11:29 UTC
Description of problem:
Failed refresh config generates a lot of "inode.c:1792:inode_table_destroy" messages in ganesha-gfapi.log

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1
nfs-ganesha-gluster-2.3.1-2.el7rhgs.x86_64
nfs-ganesha-2.3.1-2.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and configure ganesha on it.
2. Create a volume and enable ganesha on it.
3. Perform a refresh config scenario where it fails on one of the nodes.
4. Observe that the following messages are continuously generated in ganesha-gfapi.log:

[2016-04-11 22:22:41.432307] W [inode.c:1792:inode_table_destroy] (-->/lib64/libgfapi.so.0(glfs_fini+0x3cf) [0x7f15d962907f] -->/lib64/libglusterfs.so.0(inode_table_destroy_all+0x51) [0x7f15d939db21] -->/lib64/libglusterfs.so.0(inode_table_destroy+0xf6) [0x7f15d939da16] ) 0-gfapi: Active inode(0x7f15bc007bb4) with refcount(0) found during cleanup

Actual results:

Expected results:

There should not be such a large number of messages logged in ganesha-gfapi.log.

Additional info:

Comment 2 Soumya Koduri 2016-04-13 09:04:33 UTC
Shashank,

I suspect the issue in this BZ is the same as bug 1325975. To confirm, could you list the exact steps performed to reproduce this issue (maybe the distaf logs), as you did in bug 1325975?

Comment 3 Shashank Raj 2016-04-13 17:03:54 UTC
Please follow the steps below:

Create a volume

gluster volume create newvolume replica 2     transport tcp  dhcp37-180.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick0 dhcp37-158.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick1 dhcp37-127.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick2 dhcp37-174.lab.eng.blr.redhat.com:/bricks/brick0/newvolume_brick3 dhcp37-180.lab.eng.blr.redhat.com:/bricks/brick2/newvolume_brick4 dhcp37-158.lab.eng.blr.redhat.com:/bricks/brick2/newvolume_brick5 dhcp37-127.lab.eng.blr.redhat.com:/bricks/brick2/newvolume_brick6 dhcp37-174.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick7 dhcp37-180.lab.eng.blr.redhat.com:/bricks/brick3/newvolume_brick8 dhcp37-158.lab.eng.blr.redhat.com:/bricks/brick3/newvolume_brick9 dhcp37-127.lab.eng.blr.redhat.com:/bricks/brick3/newvolume_brick10 dhcp37-174.lab.eng.blr.redhat.com:/bricks/brick2/newvolume_brick11 --mode=script force" on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is
 volume create: newvolume: success: please start the volume to access data

Start the volume

2016-04-12 12:38:45,478 INFO run "gluster volume start newvolume " on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is
 volume start: newvolume: success

Enable ganesha on the volume

2016-04-12 12:38:46,653 INFO run "gluster volume set newvolume ganesha.enable on --mode=script" on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is 
 volume set: success

Do a showmount

2016-04-12 12:38:57,561 INFO run "showmount -e localhost" on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is
 Export list for localhost:
/newvolume (everyone)


Mount the volume on one client and write a file on the mount point

2016-04-12 12:38:57,598 INFO run Executing mount -t nfs -o vers=4 10.70.36.217:/newvolume /mnt/nfs1460444907.67 on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:38:57,731 INFO run "mount -t nfs -o vers=4 10.70.36.217:/newvolume /mnt/nfs1460444907.67" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:57,731 INFO run Executing dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024 on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:38:57,929 INFO run "dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:57,929 ERROR run "dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024" on dhcp37-206.lab.eng.blr.redhat.com: STDERR is
 1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.124464 s, 8.4 MB/s


Mount the volume on other client and write a file

2016-04-12 12:38:57,949 INFO run Executing mount -t nfs -o vers=4 10.70.36.217:/newvolume /mnt/nfs1460444907.67 on dhcp46-206.lab.eng.blr.redhat.com
2016-04-12 12:38:58,048 INFO run "mount -t nfs -o vers=4 10.70.36.217:/newvolume /mnt/nfs1460444907.67" on dhcp46-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:58,048 INFO run Executing dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024 on dhcp46-206.lab.eng.blr.redhat.com
2016-04-12 12:38:58,223 INFO run "dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024" on dhcp46-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:58,223 ERROR run "dd if=/dev/zero of=/mnt/nfs1460444907.67/file.dd bs=1024 count=1024" on dhcp46-206.lab.eng.blr.redhat.com: STDERR is 
 1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.110896 s, 9.5 MB/s


Create a dir from the first client, change the squash parameter to Root_squash, and do a refresh config from the mounted node

2016-04-12 12:38:58,223 INFO run Executing mkdir /mnt/nfs1460444907.67/dir on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:38:58,260 INFO run "mkdir /mnt/nfs1460444907.67/dir" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:58,260 INFO run Executing chmod 777 /mnt/nfs1460444907.67/dir on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:38:58,281 INFO run "chmod 777 /mnt/nfs1460444907.67/dir" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:58,281 INFO run Executing sed -i  s/'Squash=.*'/'Squash="Root_squash";'/g /etc/ganesha/exports/export.newvolume.conf on dhcp37-180.lab.eng.blr.redhat.com
2016-04-12 12:38:58,299 INFO run "sed -i  s/'Squash=.*'/'Squash="Root_squash";'/g /etc/ganesha/exports/export.newvolume.conf" on dhcp37-180.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:38:58,299 INFO set_root_squash edited the export file newvolume successfully for rootsquash on
2016-04-12 12:38:58,299 INFO run Executing /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume on dhcp37-180.lab.eng.blr.redhat.com
2016-04-12 12:39:17,508 INFO run "/usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume" on dhcp37-180.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:17,509 INFO run "/usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume" on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is
 Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Refresh-config completed on dhcp37-174.
Success: refresh-config completed.
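The squash-parameter edit shown in the log above can be sketched against a throwaway copy of the export config. This is a minimal, hypothetical stand-in: the real file is /etc/ganesha/exports/export.newvolume.conf and the generated EXPORT block has more fields than shown here; GNU sed (as on RHEL) is assumed.

```shell
# Hypothetical minimal export fragment standing in for
# /etc/ganesha/exports/export.newvolume.conf
conf=$(mktemp)
cat > "$conf" <<'EOF'
EXPORT {
    Export_Id = 2;
    Path = "/newvolume";
    Squash="No_root_squash";
}
EOF

# The same substitution the test harness ran before refresh-config:
sed -i s/'Squash=.*'/'Squash="Root_squash";'/g "$conf"

# Confirm the parameter was flipped:
result=$(grep 'Squash=' "$conf")
echo "$result"
rm -f "$conf"
```

On a real node the same sed targets the export file in place, after which `/usr/libexec/ganesha/ganesha-ha.sh --refresh-config` re-exports the volume across the cluster.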


Create another directory inside the previously created dir, then create a file and perform a write and a read on that file, as below


2016-04-12 12:39:17,509 INFO refresh_config refresh-config script successfully ran for newvolume 
2016-04-12 12:39:22,514 INFO run Executing mkdir /mnt/nfs1460444907.67/dir/dir1 on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:39:22,558 INFO run "mkdir /mnt/nfs1460444907.67/dir/dir1" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:22,559 INFO run Executing touch /mnt/nfs1460444907.67/dir/dir1/file.1 on dhcp37-206.lab.eng.blr.redhat.com
2016-04-12 12:39:22,588 INFO run "touch /mnt/nfs1460444907.67/dir/dir1/file.1" on dhcp37-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:22,651 INFO RootSquashEnable342825 posix.stat_result(st_mode=33188, st_ino=-6078174508180817136, st_dev=39L, st_nlink=1, st_uid=4294967294, st_gid=4294967294, st_size=0, st_atime=1460406603, st_mtime=1460406603, st_ctime=1460406603)
2016-04-12 12:39:22,655 INFO RootSquashEnable342825 4294967294
2016-04-12 12:39:22,656 INFO RootSquashEnable342825 4294967294
2016-04-12 12:39:22,658 INFO RootSquashEnable342825 file.1 has got the correct uid/gid post root-squash is put to on
2016-04-12 12:39:22,659 INFO run Executing echo "Hello World" >> /mnt/nfs1460444907.67/dir/dir1/file.1 on dhcp46-206.lab.eng.blr.redhat.com
2016-04-12 12:39:22,692 INFO run "echo "Hello World" >> /mnt/nfs1460444907.67/dir/dir1/file.1" on dhcp46-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:22,693 INFO RootSquashEnable342825 write allowed
2016-04-12 12:39:22,693 INFO run Executing cat /mnt/nfs1460444907.67/dir/dir1/file.1 on dhcp46-206.lab.eng.blr.redhat.com
2016-04-12 12:39:22,706 INFO run "cat /mnt/nfs1460444907.67/dir/dir1/file.1" on dhcp46-206.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:22,706 INFO run "cat /mnt/nfs1460444907.67/dir/dir1/file.1" on dhcp46-206.lab.eng.blr.redhat.com: STDOUT is 
 Hello World


Change the squash parameter again to No_root_squash from the mounted node and do a refresh config

2016-04-12 12:39:22,706 INFO run Executing sed -i  s/'Squash=.*'/'Squash="No_root_squash";'/g /etc/ganesha/exports/export.newvolume.conf on dhcp37-180.lab.eng.blr.redhat.com
2016-04-12 12:39:22,722 INFO run "sed -i  s/'Squash=.*'/'Squash="No_root_squash";'/g /etc/ganesha/exports/export.newvolume.conf" on dhcp37-180.lab.eng.blr.redhat.com: RETCODE is 0
2016-04-12 12:39:22,723 INFO set_root_squash edited the export file newvolume successfully for rootsquash off
2016-04-12 12:39:22,723 INFO run Executing /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume on dhcp37-180.lab.eng.blr.redhat.com
2016-04-12 12:40:06,011 INFO run "/usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume" on dhcp37-180.lab.eng.blr.redhat.com: RETCODE is 1
2016-04-12 12:40:06,011 INFO run "/usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume" on dhcp37-180.lab.eng.blr.redhat.com: STDOUT is 
 Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Refresh-config completed on dhcp37-174.
Error: refresh-config failed on localhost.


Observe that the refresh config fails this time on localhost, and the following messages are continuously generated in ganesha-gfapi.log:


[2016-04-11 22:22:41.432307] W [inode.c:1792:inode_table_destroy] (-->/lib64/libgfapi.so.0(glfs_fini+0x3cf) [0x7f15d962907f] -->/lib64/libglusterfs.so.0(inode_table_destroy_all+0x51) [0x7f15d939db21] -->/lib64/libglusterfs.so.0(inode_table_destroy+0xf6) [0x7f15d939da16] ) 0-gfapi: Active inode(0x7f15bc007bb4) with refcount(0) found during cleanup

Comment 4 Soumya Koduri 2016-04-17 14:35:23 UTC
I could reproduce the issue. As suspected, it is similar to bug 1325975: the dbus thread enters an infinite loop while destroying inodes in inode_table_destroy.

The patch http://review.gluster.org/13987 (posted upstream) should fix this issue as well.

Comment 5 Soumya Koduri 2016-04-17 14:36:39 UTC
Shashank,

Could you please give qa ack?

Comment 7 Shashank Raj 2016-04-25 11:15:59 UTC
Removing the needinfo as qa ack has been provided.

Comment 8 Soumya Koduri 2016-04-30 16:47:36 UTC
The fix has been merged downstream: https://code.engineering.redhat.com/gerrit/73330

Comment 9 Shashank Raj 2016-05-04 09:39:13 UTC
Verified this bug with the latest glusterfs-3.7.9-3 build. Tried reproducing the issue with the following scenarios:

1) Stopped glusterd on one node, changed the ACL settings in the volume export file, and performed a refresh config from another node:

[root@dhcp37-180 exports]# /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume
Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Error: refresh-config failed on dhcp37-174.

It failed on the node where glusterd was stopped, and I did not observe any "inode_table_destroy" messages in ganesha-gfapi.log.

2) Stopped the ganesha service on one node, changed the ACL settings in the volume export file, and performed a refresh config from another node:

[root@dhcp37-180 exports]# /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume
Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Error: refresh-config failed on dhcp37-174.

It failed on the node where the ganesha service was stopped, and I did not observe any "inode_table_destroy" messages in ganesha-gfapi.log.

3) Stopped the ganesha service on one node, changed the ACL settings in the volume export file, and performed a refresh config from that node:

[root@dhcp37-180 exports]# /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume
Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Refresh-config completed on dhcp37-174.
Error: refresh-config failed on localhost.

It failed on the localhost, where the ganesha service was stopped, and I did not observe any "inode_table_destroy" messages in ganesha-gfapi.log.

4) Stopped glusterd on one node, changed the ACL settings in the volume export file, and performed a refresh config from that node:

[root@dhcp37-180 exports]# /usr/libexec/ganesha/ganesha-ha.sh --refresh-config /etc/ganesha/ newvolume
Refresh-config completed on dhcp37-127.
Refresh-config completed on dhcp37-158.
Refresh-config completed on dhcp37-174.
Error: refresh-config failed on localhost.

It failed on the localhost, where glusterd was stopped, and I did not observe any "inode_table_destroy" messages in ganesha-gfapi.log.

5) Also ran the automation case that originally reproduced this issue, and did not hit it.

Since I did not observe "inode_table_destroy" messages in any of the failed refresh config cases, marking this bug as Verified.
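The verification boils down to checking that ganesha-gfapi.log stays free of the warning after a failed refresh-config. A sketch of that check, with a temporary file standing in for the real log (the single log line below is fabricated for the demo, and the real log path on a node may vary by install):

```shell
# Stand-in for ganesha-gfapi.log containing only an unrelated line:
log=$(mktemp)
cat > "$log" <<'EOF'
[2016-05-04 09:10:00.000000] I 0-gfapi: unrelated informational message
EOF

# grep -c prints the match count; it exits non-zero when the count is 0,
# so || true keeps the check usable under `set -e`.
count=$(grep -c 'inode_table_destroy' "$log" || true)
echo "inode_table_destroy warnings: $count"

rm -f "$log"
```

On a fixed build the count should stay 0 even when refresh-config fails on a node.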

Comment 11 errata-xmlrpc 2016-06-23 05:33:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1247