Description of problem:
------------------------

I have 100 EC volumes exported via Ganesha. Two of them are active: butcher1 and butcher2. Bonnie and dbench are running via v3 and v4 on these two exports.

The other 98 exports are passive. I was exporting/unexporting these 98 passive volumes at random (via vol restarts and ganesha.enable on/off).

In the meantime, Ganesha crashed on one of the nodes and dumped a core.

BT :

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.c'.
Program terminated with signal 11, Segmentation fault.
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
511	    for (idx = 0; idx < ec->nodes; idx++) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-7.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 gssproxy-0.7.0-17.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 lz4-1.7.5-2.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.0-19.el7.x86_64 systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
#1  0x00007feefabad3a9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ec.c:598
#2  0x00007fef07d2aa62 in xlator_notify (xl=0x7feb708d1340, event=event@entry=6, data=data@entry=0x7feb708c5d10) at xlator.c:566
#3  0x00007fef07dcacc4 in default_notify (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at defaults.c:3113
#4  0x00007feefae33e39 in client_notify_dispatch (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:90
#5  0x00007feefae33e9a in client_notify_dispatch_uniq (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:68
#6  0x00007feefae35207 in client_rpc_notify (rpc=0x7feb71365920, mydata=0x7feb708c5d10, event=<optimized out>, data=<optimized out>) at client.c:2303
#7  0x00007fef0c0ba50b in rpc_clnt_handle_disconnect (conn=0x7feb71365950, clnt=0x7feb71365920) at rpc-clnt.c:876
#8  rpc_clnt_notify (trans=<optimized out>, mydata=0x7feb71365950, event=<optimized out>, data=0x7feb71365af0) at rpc-clnt.c:939
#9  0x00007fef0c0b6473 in rpc_transport_notify (this=this@entry=0x7feb71365af0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7feb71365af0) at rpc-transport.c:538
#10 0x00007feefb502baf in socket_event_poll_err (idx=<optimized out>, gen=<optimized out>, this=0x7feb71365af0) at socket.c:1206
#11 socket_event_handler (fd=123, idx=<optimized out>, gen=<optimized out>, data=0x7feb71365af0, poll_in=<optimized out>, poll_out=4, poll_err=24) at socket.c:2476
#12 0x00007fef07d8a3d4 in event_dispatch_epoll_handler (event=0x7fe95cff9500, event_pool=0x7fef08278260) at event-epoll.c:583
#13 event_dispatch_epoll_worker (data=0x7feb711f0e10) at event-epoll.c:659
#14 0x00007fef9da9edd5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fef9d16ab3d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.12.2-6.el7rhgs.x86_64
nfs-ganesha-2.5.5-3.el7rhgs.x86_64

How reproducible:
-----------------

2/3
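For anyone trying to reproduce this, the random export/unexport churn on the passive volumes could be approximated with something like the script below. This is only a rough sketch, not the exact procedure used: the volume names passive1..passive98 are placeholders (only butcher1 and butcher2 are named in this report), the helper names are made up, and it assumes the gluster CLI is available on the node. By default it only prints the commands it would run.

```python
import random
import subprocess

# Placeholder names: the report only names the two active volumes.
PASSIVE_VOLUMES = [f"passive{i}" for i in range(1, 99)]


def toggle_commands(volume, rng):
    """Pick one random action and return it as a list of gluster argv lists."""
    # --mode=script suppresses the interactive confirmation on "volume stop".
    gluster = ["gluster", "--mode=script", "volume"]
    action = rng.choice(["restart", "unexport", "export"])
    if action == "restart":
        return [gluster + ["stop", volume], gluster + ["start", volume]]
    state = "on" if action == "export" else "off"
    return [gluster + ["set", volume, "ganesha.enable", state]]


def churn(volumes, rounds, seed=None, dry_run=True):
    """Apply `rounds` random actions to randomly chosen volumes."""
    rng = random.Random(seed)
    issued = []
    for _ in range(rounds):
        for argv in toggle_commands(rng.choice(volumes), rng):
            issued.append(argv)
            if not dry_run:
                # Failures are ignored on purpose, e.g. stopping a stopped volume.
                subprocess.run(argv, check=False)
    return issued


if __name__ == "__main__":
    for argv in churn(PASSIVE_VOLUMES, rounds=5, seed=0):
        print(" ".join(argv))
```

Calling churn() with dry_run=False actually issues the commands; keep Bonnie and dbench running against the active exports while it loops.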
The crash is fairly reproducible.

The core on gqas003 is slightly different :

(gdb) bt
#0  0x00007fbf68eae052 in ec_notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#1  0x00007fbf68eae3a9 in notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#2  0x00007fbf762b5a62 in xlator_notify () from /lib64/libglusterfs.so.0
#3  0x00007fbf76355cc4 in default_notify () from /lib64/libglusterfs.so.0
#4  0x00007fbf69134e39 in client_notify_dispatch () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#5  0x00007fbf69134e9a in client_notify_dispatch_uniq () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#6  0x00007fbf69136207 in client_rpc_notify () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#7  0x00007fbf7608050b in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#8  0x00007fbf7607c473 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#9  0x00007fbf6980fbaf in socket_event_handler () from /usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so
#10 0x00007fbf763153d4 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#11 0x00007fbf7b7acdd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fbf7ae78b3d in clone () from /lib64/libc.so.6
(gdb)
CC'ing the EC guys: Pranith, Xavi, Ashish (since, at least superficially, it seems to come from EC).
Maybe it's not. I misread the bt and thought it had crashed in gf_timer_call_cancel().
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days