Bug 1286022 - nfs-ganesha+data tiering: nfs-ganesha process segfault with vers=4 while executing ltp testsuite "fsstress" test [NEEDINFO]
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Hardware: x86_64 Linux
Version: unspecified
Severity: urgent
Assigned To: hari gowtham
Keywords: ZStream
Depends On: 1288403
Reported: 2015-11-27 04:19 EST by Saurabh
Modified: 2018-02-06 12:52 EST
CC List: 16 users

Doc Type: Bug Fix
Last Closed: 2018-02-06 12:52:11 EST
Type: Bug
Flags: sanandpa: needinfo? (bmekala)

Attachments: None
Description Saurabh 2015-11-27 04:19:36 EST
Description of problem:
While running the fsstress test from the LTP test suite, the nfs-ganesha process segfaults. The volume under test is a data-tiered volume.

Version-Release number of selected component (if applicable):

How reproducible:
Seen once so far.

Steps to Reproduce:
1. Create a data-tiered volume (attach a hot tier to a regular volume); see the command sketch below.
2. Set up nfs-ganesha and export the volume.
3. Mount the volume with vers=4.
4. Start fs-sanity (the LTP test suite is part of it) and wait for the fsstress test to run.
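
A minimal sketch of the corresponding commands, assuming RHGS 3.1-era gluster CLI syntax; the volume name, brick paths, and mount VIP below are placeholders, not taken from this setup:

# gluster volume create tiervol replica 2 vm1:/bricks/b1 vm2:/bricks/b2 vm3:/bricks/b3 vm4:/bricks/b4
# gluster volume start tiervol
# gluster volume attach-tier tiervol replica 2 vm1:/bricks/hot1 vm2:/bricks/hot2
# gluster nfs-ganesha enable
# gluster volume set tiervol ganesha.enable on
# mount -t nfs -o vers=4 <VIP>:/tiervol /mnt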

Actual results:
# time bash /usr/libexec/ganesha/ganesha-ha.sh --status
Online: [ vm1 vm2 vm3 vm4 ]

vm1-cluster_ip-1 vm4
vm1-trigger_ip-1 vm4
vm2-cluster_ip-1 vm2
vm2-trigger_ip-1 vm2
vm3-cluster_ip-1 vm3
vm3-trigger_ip-1 vm3
vm4-cluster_ip-1 vm4
vm4-trigger_ip-1 vm4
vm1-dead_ip-1 vm1

Nov 27 20:05:55 vm1 kernel: ganesha.nfsd[6035]: segfault at 8 ip 00007fe44e247e39 sp 00007fe412f99930 error 4 in dht.so[7fe44e23b000+68000]
Nov 27 20:05:56 vm1 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Nov 27 20:05:56 vm1 systemd: Unit nfs-ganesha.service entered failed state.
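
The kernel log gives enough to locate the faulting instruction: the offset into dht.so is 0x7fe44e247e39 - 0x7fe44e23b000 = 0xce39. With the matching glusterfs debuginfo installed, that offset can be resolved to a function and source line (a sketch; the xlator path and version directory are assumptions):

# addr2line -f -e /usr/lib64/glusterfs/<version>/xlator/cluster/dht.so 0xce39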

Even though the segfault killed only one ganesha.nfsd process, HA should have recovered cleanly, but it does not work as expected. The virtual IP did fail over from vm1 to vm4 (note vm1-dead_ip-1 above), yet subsequent I/O reports "Stale file handle" errors in ganesha-gfapi.log.

Expected results:
No segfault is expected, and HA failover should work properly.

Additional info:
Trying to get the coredump again.
Comment 3 Saurabh 2015-11-27 05:41:23 EST
I tried to reproduce the problem, and now I see that the fsstress process is stuck in the "D" (uninterruptible sleep) state.

# ps -auxww | grep ltp
root     20611  0.0  0.0 113120  1528 pts/1    S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/run.sh -w /mnt -l /export/ltp-27nov.log -t ltp
root     20630  0.0  0.0 113120  1396 pts/1    S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/scripts/ltp/ltp.sh
root     20632  0.0  0.0 113260  1560 pts/1    S+   15:05   0:00 /bin/bash /opt/qa/tools/system_light/scripts/ltp/ltp_run.sh
root     20850  0.0  0.0   4324   588 pts/1    S+   15:13   0:00 /opt/qa/tools/ltp-full-20091031/testcases/kernel/fs//fsstress/fsstress -d /mnt/run20611/ -l 22 -n 22 -p 22
root     21318  0.0  0.0  69860   332 pts/1    D+   15:16   0:00 /opt/qa/tools/ltp-full-20091031/testcases/kernel/fs//fsstress/fsstress -d /mnt/run20611/ -l 22 -n 22 -p 22
root     21384  0.0  0.0 112640   928 pts/2    S+   15:55   0:00 grep --color=auto ltp

# strace -p 21318
Process 21318 attached
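
strace attaches but prints nothing further, which is consistent with the process being blocked inside the kernel (the "D+" state above). A quick way to see where it is waiting, as a sketch (assumes root access and the same PID):

# cat /proc/21318/stack
# ps -o pid,stat,wchan:32,cmd -p 21318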
Comment 4 Saurabh 2015-11-27 06:52:25 EST
Finally I was able to get the coredump again. Backtrace of the crashing thread:
#0  dht_layout_ref (this=0x7f0128010460, layout=layout@entry=0x0) at dht-layout.c:149
#1  0x00007f016a3df2fb in dht_selfheal_restore (frame=frame@entry=0x7f01401d5b60, dir_cbk=dir_cbk@entry=0x7f016a3e8150 <dht_rmdir_selfheal_cbk>, loc=loc@entry=0x7f0138f1f694, layout=0x0)
    at dht-selfheal.c:1914
#2  0x00007f016a3ed792 in dht_rmdir_hashed_subvol_cbk (frame=0x7f01401d5b60, cookie=0x7f01401d7efc, this=0x7f0128010460, op_ret=-1, op_errno=39, preparent=0x7f01387b8b20, postparent=0x7f01387b8b90, 
    xdata=0x0) at dht-common.c:6849
#3  0x00007f016a63fa67 in afr_rmdir_unwind (frame=<optimized out>, this=<optimized out>) at afr-dir-write.c:1338
#4  0x00007f016a6413a9 in __afr_dir_write_cbk (frame=0x7f01401e3668, cookie=<optimized out>, this=0x7f012800f6d0, op_ret=<optimized out>, op_errno=<optimized out>, buf=buf@entry=0x0, 
    preparent=0x7f012f164ff0, postparent=postparent@entry=0x7f012f165060, preparent2=preparent2@entry=0x0, postparent2=postparent2@entry=0x0, xdata=xdata@entry=0x0) at afr-dir-write.c:246
#5  0x00007f016a6415a6 in afr_rmdir_wind_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, preparent=<optimized out>, 
    postparent=0x7f012f165060, xdata=0x0) at afr-dir-write.c:1350
#6  0x00007f016a8bd7d1 in client3_3_rmdir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f01401d6d84) at client-rpc-fops.c:729
#7  0x00007f017607db20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f012819ab30, pollin=pollin@entry=0x7f0124b29070) at rpc-clnt.c:766
#8  0x00007f017607dddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f012819ab60, event=<optimized out>, data=0x7f0124b29070) at rpc-clnt.c:907
#9  0x00007f0176079913 in rpc_transport_notify (this=this@entry=0x7f01281aa7b0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f0124b29070) at rpc-transport.c:545
#10 0x00007f016af614c6 in socket_event_poll_in (this=this@entry=0x7f01281aa7b0) at socket.c:2236
#11 0x00007f016af643b4 in socket_event_handler (fd=fd@entry=45, idx=idx@entry=9, data=0x7f01281aa7b0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#12 0x00007f017631089a in event_dispatch_epoll_handler (event=0x7f012f165540, event_pool=0x92ba10) at event-epoll.c:575
#13 event_dispatch_epoll_worker (data=0x7f01280c4300) at event-epoll.c:678
#14 0x00007f01786a1df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f0177fb11ad in clone () from /lib64/libc.so.6

The coredump is copied at the location mentioned above.
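
The backtrace shows dht_rmdir_hashed_subvol_cbk (op_errno=39, i.e. ENOTEMPTY) calling dht_selfheal_restore() with layout=0x0, which dht_layout_ref() then dereferences; a NULL base plus a small member offset is consistent with the "segfault at 8" in the kernel log. A minimal sketch for confirming this from the core (the binary and core file paths are assumptions):

# gdb /usr/bin/ganesha.nfsd /path/to/core
(gdb) bt            # full backtrace of the crashing thread
(gdb) frame 0       # dht_layout_ref at dht-layout.c:149
(gdb) print layout  # expect $1 = (dht_layout_t *) 0x0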
Comment 20 Shyamsundar 2018-02-06 12:52:11 EST
Thank you for your bug report.

We are not root-causing this bug further, so it is being closed as WONTFIX. Please reopen if the problem is still observed after upgrading to the latest version.
