Bug 1562770

Summary: [Ganesha] : Ganesha crashed in ec_notify().
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Xavi Hernandez <jahernan>
Status: CLOSED CURRENTRELEASE
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: aspandey, bturner, dang, ffilz, jahernan, jijoy, jthottan, kkeithle, mbenjamin, msaini, pasik, pkarampu, rhinduja, rhs-bugs, sheggodu, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1562951
Environment:
Last Closed: 2020-01-09 10:52:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1562951, 1563306
Bug Blocks:

Description Ambarish 2018-04-02 13:08:28 UTC
Description of problem:
------------------------

I have 100 EC volumes exported via Ganesha.

Two of them are active: butcher1 and butcher2.

Bonnie and dbench are running over NFSv3 and NFSv4 on these two exports.

The other 98 exports are passive.

I was exporting and unexporting these 98 passive volumes at random (via volume restarts and toggling ganesha.enable on/off).


Ganesha crashed on one of the nodes and dumped a core in the meantime.


BT:


Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.c'.
Program terminated with signal 11, Segmentation fault.
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
511	        for (idx = 0; idx < ec->nodes; idx++) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-7.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 gssproxy-0.7.0-17.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 lz4-1.7.5-2.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.0-19.el7.x86_64 systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
#1  0x00007feefabad3a9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ec.c:598
#2  0x00007fef07d2aa62 in xlator_notify (xl=0x7feb708d1340, event=event@entry=6, data=data@entry=0x7feb708c5d10) at xlator.c:566
#3  0x00007fef07dcacc4 in default_notify (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at defaults.c:3113
#4  0x00007feefae33e39 in client_notify_dispatch (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:90
#5  0x00007feefae33e9a in client_notify_dispatch_uniq (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:68
#6  0x00007feefae35207 in client_rpc_notify (rpc=0x7feb71365920, mydata=0x7feb708c5d10, event=<optimized out>, data=<optimized out>) at client.c:2303
#7  0x00007fef0c0ba50b in rpc_clnt_handle_disconnect (conn=0x7feb71365950, clnt=0x7feb71365920) at rpc-clnt.c:876
#8  rpc_clnt_notify (trans=<optimized out>, mydata=0x7feb71365950, event=<optimized out>, data=0x7feb71365af0) at rpc-clnt.c:939
#9  0x00007fef0c0b6473 in rpc_transport_notify (this=this@entry=0x7feb71365af0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7feb71365af0) at rpc-transport.c:538
#10 0x00007feefb502baf in socket_event_poll_err (idx=<optimized out>, gen=<optimized out>, this=0x7feb71365af0) at socket.c:1206
#11 socket_event_handler (fd=123, idx=<optimized out>, gen=<optimized out>, data=0x7feb71365af0, poll_in=<optimized out>, poll_out=4, poll_err=24) at socket.c:2476
#12 0x00007fef07d8a3d4 in event_dispatch_epoll_handler (event=0x7fe95cff9500, event_pool=0x7fef08278260) at event-epoll.c:583
#13 event_dispatch_epoll_worker (data=0x7feb711f0e10) at event-epoll.c:659
#14 0x00007fef9da9edd5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fef9d16ab3d in clone () from /lib64/libc.so.6
(gdb) 
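Frame #0 stops on the loop header at ec.c:511, whose only memory access is the read of `ec->nodes`. A SIGSEGV there therefore suggests, though this is unconfirmed, that `ec` (presumably the translator's private EC state) was NULL or already freed when the disconnect notification (event=6, delivered via `rpc_clnt_handle_disconnect`) reached `ec_notify()`. That would fit the volume restart/unexport churn described above. The self-contained C model below illustrates the pattern and one way to make it safe: teardown and notification are serialized by a mutex, and teardown clears the pointer before freeing it. All names (`model_xlator_t`, `model_ec_t`, `model_fini()`, `model_notify()`) are hypothetical; this is not GlusterFS code and is not taken from the actual fix for this bug.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an xlator and EC's private state. These are
 * NOT the GlusterFS definitions; they only model the access at ec.c:511,
 * "for (idx = 0; idx < ec->nodes; idx++)". */
typedef struct {
    int nodes;               /* number of child subvolumes */
} model_ec_t;

typedef struct {
    pthread_mutex_t lock;    /* serializes notify against teardown */
    model_ec_t *private;     /* set to NULL once teardown has run */
} model_xlator_t;

/* Teardown: detach the private data under the lock, then free it.
 * Any notify that obtained the pointer has finished with it before
 * it can acquire the lock here, so freeing afterwards is safe. */
static void model_fini(model_xlator_t *this)
{
    pthread_mutex_lock(&this->lock);
    model_ec_t *ec = this->private;
    this->private = NULL;
    pthread_mutex_unlock(&this->lock);
    free(ec);
}

/* Notification handler: returns the number of children visited, or -1
 * if the private data is already gone. Without the NULL check, the loop
 * condition would dereference a NULL pointer after teardown. */
static int model_notify(model_xlator_t *this, int event)
{
    int visited = -1;

    (void)event;             /* event code is unused in this model */
    pthread_mutex_lock(&this->lock);
    model_ec_t *ec = this->private;
    if (ec != NULL) {
        visited = 0;
        for (int idx = 0; idx < ec->nodes; idx++) {
            visited++;       /* placeholder for per-child bookkeeping */
        }
    }
    pthread_mutex_unlock(&this->lock);
    return visited;
}
```

In this model, a notification that arrives after `model_fini()` sees `private == NULL` and returns -1 instead of faulting. A NULL check alone, without the lock or some form of reference counting, would not be enough: a notification could load the pointer just before teardown frees it.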


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.12.2-6.el7rhgs.x86_64
nfs-ganesha-2.5.5-3.el7rhgs.x86_64


How reproducible:
-----------------

2/3

Comment 3 Ambarish 2018-04-02 13:13:13 UTC
The crash is fairly reproducible.

The core on gqas003 is slightly different:


(gdb) bt
#0  0x00007fbf68eae052 in ec_notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#1  0x00007fbf68eae3a9 in notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#2  0x00007fbf762b5a62 in xlator_notify () from /lib64/libglusterfs.so.0
#3  0x00007fbf76355cc4 in default_notify () from /lib64/libglusterfs.so.0
#4  0x00007fbf69134e39 in client_notify_dispatch () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#5  0x00007fbf69134e9a in client_notify_dispatch_uniq () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#6  0x00007fbf69136207 in client_rpc_notify () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#7  0x00007fbf7608050b in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#8  0x00007fbf7607c473 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#9  0x00007fbf6980fbaf in socket_event_handler () from /usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so
#10 0x00007fbf763153d4 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#11 0x00007fbf7b7acdd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fbf7ae78b3d in clone () from /lib64/libc.so.6
(gdb)

Comment 4 Ambarish 2018-04-02 13:16:17 UTC
CC'ing EC guys: Pranith, Xavi, Ashish.

(Since, at least superficially, it seems to come from EC.)

Comment 9 Kaleb KEITHLEY 2018-04-09 17:18:48 UTC
Maybe it's not. I misread the bt and thought it had crashed in gf_timer_call_cancel().

Comment 17 Red Hat Bugzilla 2023-09-14 04:26:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days