Description of problem:
-----------------------
After setting up Ganesha (i.e., after installing the latest rpms, pcs auth, ganesha enable, and export), nfs-ganesha crashed on 2 of 4 servers when I tried to restart the ganesha service. The process came back alive, so my guess is that it dumped core when the Ganesha process was stopped.

************* BT from crash *************

(gdb) bt
#0  0x00007fb6f39e780c in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb)

The signature of the BT looks similar to the one reported in BZ#1380619. client-io-threads was on during my testing. I'll update the BZ soon with results after setting it to off as well.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
[root@gqas013 tmp]# rpm -qa | grep ganesha
glusterfs-ganesha-3.8.4-3.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
[root@gqas013 tmp]#

How reproducible:
-----------------
2/4

Steps to Reproduce:
-------------------
> After a fresh install, perform the steps to set up Ganesha - install rpms, pcs auth, enable Ganesha, and export.
> Start the volume, then restart glusterd, rpcbind, and nfs-ganesha.

Actual results:
---------------
Ganesha crashed and dumped core on 2 of 4 servers. The process was alive afterwards, so the core was dumped when Ganesha was stopped during the restart.

Expected results:
-----------------
No crashes while restarting system services.
Additional info:
----------------
OS: RHEL 7.3

*Vol config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 7b413fd4-9775-44a2-bfa8-23d206db9dfe
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
features.cache-invalidation: off
ganesha.enable: on
cluster.enable-shared-storage: enable
nfs-ganesha: enable
[root@gqas013 tmp]#
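For reference, the setup and restart sequence from the report can be collected into one sketch. This is untested here and only illustrative: the node names and volume name come from the report above, the hacluster password placeholder and the `setup_and_restart` function name are my own.

```shell
# Sketch of the reproduction sequence from the report (RHGS 3.x nodes assumed).
setup_and_restart() {
    yum -y install glusterfs-ganesha nfs-ganesha
    # Authenticate the cluster nodes for pcs; '<password>' is a placeholder.
    pcs cluster auth gqas013 gqas005 gqas006 gqas011 -u hacluster -p '<password>'
    gluster nfs-ganesha enable                       # bring up the ganesha cluster
    gluster volume set testvol ganesha.enable on     # export the volume
    gluster volume start testvol
    systemctl restart glusterd rpcbind nfs-ganesha   # the step that triggered the crash
}
```

Note that in the report the export happens before the volume is started, which turns out to matter (see comments below).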
Ambarish, if you happen to reproduce the issue, please capture a core (using gdb) before running the service stop/restart, so we can compare the threads before and after the crash. Thanks!
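One possible way to capture that pre-restart state without stopping the daemon is sketched below. The `dump_threads` helper name and output paths are my own; `gcore` and the gdb batch run are standard gdb tooling.

```shell
# Hypothetical helper: snapshot a live process. gcore writes a core file
# (to "${out}.<pid>") without killing the process; the gdb batch run
# records every thread's backtrace for later comparison.
dump_threads() {
    pid="$1"; out="$2"
    gcore -o "${out}" "${pid}"
    gdb -p "${pid}" -batch -ex "thread apply all bt" > "${out}.bt.txt"
}

# Before restarting the service, e.g.:
#   dump_threads "$(pidof ganesha.nfsd)" /tmp/ganesha-pre-restart
```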
I tried it twice, but I could not reproduce the issue after setting client-io-threads to "off". The issue is a bit intermittent, though, so it's hard to say with certainty whether that is or is not the culprit.
Soumya, I tried the steps twice on fresh setups, this time keeping my volume in the "Started" state before setting up the Ganesha cluster and exporting the volume, and I could not reproduce the crash across multiple system service restarts.
Thanks Ambarish. That almost confirms the theory that this crash is hit only if a volume is exported via nfs-ganesha before it is even started. Since this is not a recommended configuration, lowering the priority of the bug for now. I suspect that when the volume is not started, the flow is glfs_init() -> xlator_init() of all the child subvols -> then the rpc connection to the brick, which fails. After that, glfs_fini() is called. Since glfs_init() itself failed, the graph would not have been fully set up and PARENT_DOWN may never have been sent to the io-threads xlator, leaving a dangling thread. This is just a theory off the top of my head; I'll look through the code a bit. CCing Pranith too.
I could not reproduce this crash on multiple tries.

gluster : glusterfs-3.8.4-10
ganesha : 2.4.1-3

Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html