Bug 1577325

Summary: [Ganesha] Ganesha process entered failed state while exporting/unexporting multiple volumes in loop
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Manisha Saini <msaini>
Component: nfs-ganeshaAssignee: Daniel Gryniewicz <dang>
Status: CLOSED ERRATA QA Contact: Manisha Saini <msaini>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.4CC: amukherj, chpai, dang, ffilz, grajoria, jthottan, kkeithle, rcyriac, rhs-bugs, sankarshan, storage-qa-internal, vdas
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: nfs-ganesha-2.5.5-9 Doc Type: Bug Fix
Doc Text:
While exporting/unexporting multiple volume in loop, NFS-Ganesha process goes into failed state and then restarts on its own. The export id cache entry is also updated by race. With this fix, while unexporting, refcount on entry is taken into account and the race is closed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 06:55:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1578089    
Bug Blocks: 1503137    

Description Manisha Saini 2018-05-11 17:37:03 UTC
Description of problem:

While doing validation of BZ 1492695, observed, while exporting/unexporting multiple volume in loop,Ganesha process goes into failed state and then restart on its own.

Consequences-

When ganesha process restart post entering failed state,1-2 volume exports fails  on that particular node.Now for those 2 volume unexport fails with "Dynamic export addition/deletion failed" (as export entries not same across nodes)



snippet 

ganesha.log

-----.

11/05/2018 22:10:41 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume ecvol2 exported at : '/'
11/05/2018 22:10:42 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded.  The LRU thread is unable to make progress in reclaiming FDs.  Disabling FD cache.
11/05/2018 22:10:42 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded.  The LRU thread is unable to make progress in reclaiming FDs.  Disabling FD cache.

 FDs.  Disabling FD cache.
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] lru_run_lane :INODE :FATAL :An entry (0x7f4584acec80) having an unmappable export_id (103) is unexpected
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=4
11/05/2018 22:10:44 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-15340[cache_lru] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload.  Dying ...
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7685[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 2.5.5
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
11/05/2018 22:10:47 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume Ganeshavol1 exported at : '/'
11/05/2018 22:10:49 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume Ganeshavol2 exported at : '/'
11/05/2018 22:10:51 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume ecvol1 exported at : '/'
11/05/2018 22:10:52 : epoch 1af30000 : moonshine.lab.eng.blr.redhat.com : ganesha.nfsd-7686[main] glusterfs_create_export :FSAL :EVENT :Volume ecvol2 exported at : '/'


--------------------

/var/log/messages

--------------
May 11 22:10:01 moonshine systemd: Starting Session 16760 of user root.
May 11 22:10:01 moonshine systemd: Started Session 16761 of user root.
May 11 22:10:01 moonshine systemd: Starting Session 16761 of user root.
May 11 22:10:44 moonshine systemd: nfs-ganesha.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 11 22:10:44 moonshine systemd: Unit nfs-ganesha.service entered failed state.
May 11 22:10:44 moonshine systemd: nfs-ganesha.service failed.
May 11 22:10:47 moonshine systemd: nfs-ganesha.service holdoff time over, scheduling restart.
May 11 22:10:47 moonshine systemd: Starting Process NFS-Ganesha configuration...
May 11 22:10:47 moonshine systemd: Started Process NFS-Ganesha configuration.
May 11 22:10:47 moonshine systemd: Starting NFS-Ganesha file server...
May 11 22:10:49 moonshine systemd: Started NFS-Ganesha file server.
May 11 22:10:53 moonshine rpc.statd[17012]: Received SM_UNMON_ALL request from moonshine.lab.eng.blr.redhat.com while not monitoring any hosts
May 11 22:11:02 moonshine systemd: Started Session 16762 of user root.
May 11 22:11:02 moonshine systemd: Starting Session 16762 of user root.
May 11 22:12:01 moonshine systemd: Started Session 16763 of user root.
May 11 22:12:01 moonshine systemd: Starting Session 16763 of user root.
May 11 22:13:01 moonshine systemd: Started Session 16764 of user root.

--------------
-

When ganesha process entered failed state,there is no core dumped by ganesha process.Unable to capture via gdb as well


-----------
[Thread 0x7f44301ba700 (LWP 18266) exited]
[Thread 0x7f4430179700 (LWP 18267) exited]
[Thread 0x7f4430138700 (LWP 18268) exited]
[Thread 0x7f44300f7700 (LWP 18269) exited]
[Thread 0x7f465a441100 (LWP 15340) exited]
[Inferior 1 (process 15340) exited with code 02]
(gdb) bt
No stack.
(gdb) bt
No stack.
---------------- 


Export entry for ecvol3 is missing in showmount -e on that particular node


While exporting cli shows success--

---------
# sh start_volume.sh
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success
volume set: success

---------------


# showmount -e
Export list for moonshine.lab.eng.blr.redhat.com:
/Ganeshavol1 (everyone)
/Ganeshavol2 (everyone)
/ecvol1      (everyone)
/ecvol2      (everyone)
/ecvol4      (everyone)
/ecvol5      (everyone)
/ecvol6      (everyone)
/ecvol7      (everyone)
/ecvol8      (everyone)

----------------

Unexport failed for ecvol3

# sh start_volume.sh 
volume set: success
volume set: success
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details
volume set: success
volume set: success




Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-6.el7rhgs.x86_64
nfs-ganesha-2.5.5-6.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-9.el7rhgs.x86_64


How reproducible:

Managed to hit it 3 times

Steps to Reproduce:

1.Create 8 node ganesha cluster
2.Create 2 distributed-disperse volume 6 x (4 + 2) (Ganeshavol1 and Ganeshavol2) and export it via ganesha
3.Mount the volume to 4 clients.Each volume mounted to 2 clients via v3/v4 simultaneously running bonnie and dbench
4.While IO's is in process create 98 other EC vols and start exporting/unexporting in loop

Actual results:

Ganesha service entered failed state while export/unexport is going on

Expected results:

Ganesha process should not enter fail state

Additional info:
Attaching sosreport and ganesha logs shortly

Comment 7 Daniel Gryniewicz 2018-05-21 14:13:43 UTC
This should close the race:

https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/411912

Comment 12 Daniel Gryniewicz 2018-07-13 14:57:14 UTC
Are there unexports happening here?  Or only exports?

Comment 13 Daniel Gryniewicz 2018-07-13 18:05:24 UTC
Okay, this is another race (and possibly a third race):

https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/419268

Comment 17 Manisha Saini 2018-08-02 05:37:08 UTC
Verified this with

# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-9.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-9.el7rhgs.x86_64


Exported/unexported 98 volumes in loop.No ganesha service failure was observed.Moving this BZ to verified state.

Comment 19 Daniel Gryniewicz 2018-08-31 13:23:16 UTC
I would suggest something like this:

"When exporting or unexporting on a live Ganesha, there were several races between the cleanup thread and the export ID cache.  Close these races with a combination of refcounting, locking, and code re-organization."

I'm not sure how detailed these need to be; if lots of detail is necessary, then the 2 commit messages are complete and detailed.

Comment 21 errata-xmlrpc 2018-09-04 06:55:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610