Created attachment 1160991 [details]
ganesha-gfapi.log from mounted node

Description of problem:
Ganesha gets killed with segfault error while rebalance is in progress.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-5
nfs-ganesha-2.3.1-7

How reproducible:
Always

Steps to Reproduce:
1. Create a 4 node ganesha cluster.
2. Create a volume and enable ganesha on it.
3. Mount the volume using vers=3 or 4 and create nested directories on the mount point.
   from distaf logs:
   for i in {1..25}; do mkdir /mnt1464089502.83/a$i; for j in {1..50}; do mkdir /mnt1464089502.83/a$i/b$j; for k in {1..50}; do touch /mnt1464089502.83/a$i/b$j/c$k; done done done
4. Add bricks to the volume.
   gluster volume add-brick newvolume replica 2 dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick4/newvolume_brick12 dhcp37-220.lab.eng.blr.redhat.com:/bricks/brick4/newvolume_brick13
5. Start the rebalance process:
   gluster v rebalance newvolume start force
6. Observe that while rebalance is in progress, ganesha process on the mounted node gets killed with seg fault error:
   [73850.224747] ganesha.nfsd[6003]: segfault at 7fda62dfd8a4 ip 00007fc8c38e1210 sp 00007fc83ac7cf68 error 6 in libpthread-2.17.so[7fc8c38d5000+16000]

Below is the bt generated from gdb:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f9c60430280 (LWP 18923)]
0x00007f9c8fcd0210 in pthread_spin_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install libacl-2.2.51-12.el7.x86_64 openssl-libs-1.0.1e-51.el7_2.5.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x00007f9c8fcd0210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f9c7c60b63d in __gf_free (free_ptr=0x7f9c50000950) at mem-pool.c:316
#2  0x00007f9c7c89db0b in glfs_h_poll_cache_invalidation (
    fs=fs@entry=0x7f9c78007e30, up_arg=up_arg@entry=0x7f9c6042f0a0,
    upcall_data=upcall_data@entry=0x7f9c5e239e00) at glfs-handleops.c:1972
#3  0x00007f9c7c89de00 in pub_glfs_h_poll_upcall (fs=0x7f9c78007e30,
    up_arg=up_arg@entry=0x7f9c6042f0a0) at glfs-handleops.c:2066
#4  0x00007f9c7ccb5ed3 in GLUSTERFSAL_UP_Thread (Arg=0x7f9c78007d00)
    at /usr/src/debug/nfs-ganesha-2.3.1/src/FSAL/FSAL_GLUSTER/fsal_up.c:153
#5  0x00007f9c8fccbdc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f9c8f399ced in clone () from /lib64/libc.so.6

Actual results:
Ganesha gets killed with segfault error while rebalance is in progress.

Expected results:
ganesha process should not get killed.

Additional info:
Attached ganesha and ganesha-gfapi logs from the mounted node
Created attachment 1160992 [details] ganesha.log
The reason for the crash is that in 'glfs_h_poll_cache_invalidation', we used 'calloc' to allocate up_inode_arg. Memory allocated this way does not carry the memory-accounting data that GF_FREE (up_inode_arg) relies on, so the GF_FREE call made on the error path crashes. The fix is to use 'GF_CALLOC' when allocating memory for the up_inode_arg variable.
Since this is a change in glusterfs code-path, adjusting components accordingly.
Since no ganesha kills are acceptable in any scenario, raising a blocker flag for 3.1.3
Fix has been posted upstream for review - http://review.gluster.org/14521
Verified this bug with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds, and it is working as expected. The automated rebalance cases that earlier caused nfs-ganesha to crash now pass, and no ganesha crash or segfault error is observed. Based on the above observation, marking this bug as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240