Description of problem:
While the LTP test suite's fsstress test is running, nfs-ganesha segfaults. The volume in question is a data-tiered volume.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-7.el7rhgs.x86_64
nfs-ganesha-2.2.0-11.el7rhgs.x86_64

How reproducible:
Seen once so far.

Steps to Reproduce:
1. Create a volume of type data-tiering (a rough command sketch is included at the end of this report).
2. Set up nfs-ganesha.
3. Mount the volume with vers=4.
4. Start fs-sanity (the LTP test suite is part of it) and wait for the fsstress test to be executed.

Actual results:
# time bash /usr/libexec/ganesha/ganesha-ha.sh --status
Online: [ vm1 vm2 vm3 vm4 ]
vm1-cluster_ip-1 vm4
vm1-trigger_ip-1 vm4
vm2-cluster_ip-1 vm2
vm2-trigger_ip-1 vm2
vm3-cluster_ip-1 vm3
vm3-trigger_ip-1 vm3
vm4-cluster_ip-1 vm4
vm4-trigger_ip-1 vm4
vm1-dead_ip-1 vm1

Nov 27 20:05:55 vm1 kernel: ganesha.nfsd[6035]: segfault at 8 ip 00007fe44e247e39 sp 00007fe412f99930 error 4 in dht.so[7fe44e23b000+68000]
Nov 27 20:05:56 vm1 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Nov 27 20:05:56 vm1 systemd: Unit nfs-ganesha.service entered failed state.

Even when one ganesha.nfsd process segfaults, HA should fail over cleanly, but it does not work as expected. The cluster IP appears to have failed over from vm1 to vm4 (vm1-cluster_ip-1 is now hosted on vm4 and vm1-dead_ip-1 is set), yet subsequent I/O still reports "Stale filehandle" in ganesha-gfapi.log.

Expected results:
No segfault is expected, and HA failover should work properly.

Additional info:
Trying to get the coredump again.
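A rough command sketch of the steps above, assuming a 2x2 distributed-replicate cold tier with a replicated hot tier; host names, brick paths, the volume name and the mount VIP are placeholders, and the tier-attach and ganesha HA commands should be verified against the glusterfs 3.7 / RHGS documentation:

# 1. Create and start the base (cold) volume, then attach a hot tier.
gluster volume create tiervol replica 2 vm1:/bricks/b1/tv vm2:/bricks/b1/tv vm3:/bricks/b1/tv vm4:/bricks/b1/tv
gluster volume start tiervol
gluster volume attach-tier tiervol replica 2 vm1:/bricks/hot/tv vm2:/bricks/hot/tv

# 2. Enable nfs-ganesha HA (assumes /etc/ganesha/ganesha-ha.conf and the shared
#    storage volume are already configured) and export the volume.
gluster nfs-ganesha enable
gluster volume set tiervol ganesha.enable on

# 3. Mount over NFSv4 via one of the cluster VIPs.
mount -t nfs -o vers=4 <VIP>:/tiervol /mnt

# 4. Run fsstress with the same arguments as seen in the ps output below
#    (the -d target directory is a placeholder).
/opt/qa/tools/ltp-full-20091031/testcases/kernel/fs/fsstress/fsstress -d /mnt/run1 -l 22 -n 22 -p 22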
I tried to reproduce the problem again, and this time one of the fsstress processes is stuck in the "D" (uninterruptible sleep) state:

# ps -auxww | grep ltp
root     20611  0.0  0.0 113120  1528 pts/1   S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/run.sh -w /mnt -l /export/ltp-27nov.log -t ltp
root     20630  0.0  0.0 113120  1396 pts/1   S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/scripts/ltp/ltp.sh
root     20632  0.0  0.0 113260  1560 pts/1   S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/scripts/ltp/ltp_run.sh
root     20850  0.0  0.0   4324   588 pts/1   S+   15:13   0:00 /opt/qa/tools/ltp-full-20091031/testcases/kernel/fs//fsstress/fsstress -d /mnt/run20611/ -l 22 -n 22 -p 22
root     21318  0.0  0.0  69860   332 pts/1   D+   15:16   0:00 /opt/qa/tools/ltp-full-20091031/testcases/kernel/fs//fsstress/fsstress -d /mnt/run20611/ -l 22 -n 22 -p 22
root     21384  0.0  0.0 112640   928 pts/2   S+   15:55   0:00 grep --color=auto ltp

# strace -p 21318
Process 21318 attached
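A process in the "D" state cannot be traced further from userspace, which is consistent with strace printing nothing after "Process 21318 attached". A minimal way to see where the task is blocked on the kernel side, using the PID from the ps output above:

# Show the state and the kernel function the process is waiting in.
ps -o pid,stat,wchan:30,cmd -p 21318

# Full kernel stack of the blocked task (requires root).
cat /proc/21318/stack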
Finally I was able to get the coredump again:

#0  dht_layout_ref (this=0x7f0128010460, layout=layout@entry=0x0) at dht-layout.c:149
#1  0x00007f016a3df2fb in dht_selfheal_restore (frame=frame@entry=0x7f01401d5b60, dir_cbk=dir_cbk@entry=0x7f016a3e8150 <dht_rmdir_selfheal_cbk>, loc=loc@entry=0x7f0138f1f694, layout=0x0) at dht-selfheal.c:1914
#2  0x00007f016a3ed792 in dht_rmdir_hashed_subvol_cbk (frame=0x7f01401d5b60, cookie=0x7f01401d7efc, this=0x7f0128010460, op_ret=-1, op_errno=39, preparent=0x7f01387b8b20, postparent=0x7f01387b8b90, xdata=0x0) at dht-common.c:6849
#3  0x00007f016a63fa67 in afr_rmdir_unwind (frame=<optimized out>, this=<optimized out>) at afr-dir-write.c:1338
#4  0x00007f016a6413a9 in __afr_dir_write_cbk (frame=0x7f01401e3668, cookie=<optimized out>, this=0x7f012800f6d0, op_ret=<optimized out>, op_errno=<optimized out>, buf=buf@entry=0x0, preparent=0x7f012f164ff0, postparent=postparent@entry=0x7f012f165060, preparent2=preparent2@entry=0x0, postparent2=postparent2@entry=0x0, xdata=xdata@entry=0x0) at afr-dir-write.c:246
#5  0x00007f016a6415a6 in afr_rmdir_wind_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, preparent=<optimized out>, postparent=0x7f012f165060, xdata=0x0) at afr-dir-write.c:1350
#6  0x00007f016a8bd7d1 in client3_3_rmdir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f01401d6d84) at client-rpc-fops.c:729
#7  0x00007f017607db20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f012819ab30, pollin=pollin@entry=0x7f0124b29070) at rpc-clnt.c:766
#8  0x00007f017607dddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f012819ab60, event=<optimized out>, data=0x7f0124b29070) at rpc-clnt.c:907
#9  0x00007f0176079913 in rpc_transport_notify (this=this@entry=0x7f01281aa7b0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f0124b29070) at rpc-transport.c:545
#10 0x00007f016af614c6 in socket_event_poll_in (this=this@entry=0x7f01281aa7b0) at socket.c:2236
#11 0x00007f016af643b4 in socket_event_handler (fd=fd@entry=45, idx=idx@entry=9, data=0x7f01281aa7b0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#12 0x00007f017631089a in event_dispatch_epoll_handler (event=0x7f012f165540, event_pool=0x92ba10) at event-epoll.c:575
#13 event_dispatch_epoll_worker (data=0x7f01280c4300) at event-epoll.c:678
#14 0x00007f01786a1df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f0177fb11ad in clone () from /lib64/libc.so.6

Frame #1 shows dht_selfheal_restore() being invoked with layout=0x0 from the rmdir error path in frame #2 (op_ret=-1, op_errno=39), and frame #0 then crashes dereferencing that NULL layout in dht_layout_ref(), which matches the near-NULL "segfault at 8" reported in the kernel log above.

The coredump is copied at the location mentioned above.
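For anyone inspecting the core, a minimal gdb session to confirm the NULL layout argument (binary and core paths are placeholders; the matching glusterfs and nfs-ganesha debuginfo packages are assumed to be installed):

# gdb /usr/bin/ganesha.nfsd /path/to/core
(gdb) bt               # full backtrace, as pasted above
(gdb) frame 1          # dht_selfheal_restore (dht-selfheal.c:1914)
(gdb) info args        # shows layout = 0x0
(gdb) frame 2          # dht_rmdir_hashed_subvol_cbk (dht-common.c:6849)
(gdb) info args        # shows op_ret = -1, op_errno = 39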
Thank you for your bug report. We are not root-causing this bug further; as a result, it is being closed as WONTFIX. Please reopen if the problem is still observed after upgrading to the latest version.