1562770 – [Ganesha] : Ganesha crashed in ec_notify().

Summary: [Ganesha] : Ganesha crashed in ec_notify().

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.4
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Xavi Hernandez
QA Contact:	Manisha Saini
Docs Contact:
URL:
Whiteboard:
Depends On:	1562951 1563306
Blocks:

Reported:	2018-04-02 13:08 UTC by Ambarish
Modified:	2023-09-14 04:26 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1562951
Environment:
Last Closed:	2020-01-09 10:52:29 UTC
Embargoed:
Dependent Products:

Description Ambarish 2018-04-02 13:08:28 UTC

Description of problem:
------------------------

I have 100 EC volumes exported via Ganesha.

Two of them are active: butcher1 and butcher2.

Bonnie and dbench are running over NFSv3 and NFSv4 on these two exports.

The other 98 exports are passive.

I was exporting/unexporting these 98 passive volumes at random (via volume restarts and ganesha.enable on/off).


Ganesha crashed on one of the nodes and dumped a core in the meantime.
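
The export/unexport churn described above could be driven by something like the following. This is only a rough sketch, not the script actually used for this report: the passive volume names (passive1..passive98), the 50/50 split between the two actions, and the helper names churn_commands and churn are all assumptions made for illustration. The only gluster CLI calls it relies on are `volume set <vol> ganesha.enable on|off` and `volume stop|start <vol>`, run with `--mode=script` to skip the confirmation prompts.

```python
import random
import subprocess

# Hypothetical names: the report only names the two active volumes
# (butcher1 and butcher2), not the 98 passive ones.
PASSIVE_VOLUMES = [f"passive{i}" for i in range(1, 99)]


def churn_commands(volume, rng):
    """Return the gluster CLI calls for one randomly chosen churn action."""
    base = ["gluster", "--mode=script", "volume"]
    if rng.random() < 0.5:
        # Unexport and re-export the volume through the ganesha.enable option.
        return [
            base + ["set", volume, "ganesha.enable", "off"],
            base + ["set", volume, "ganesha.enable", "on"],
        ]
    # Otherwise restart the volume.
    return [base + ["stop", volume], base + ["start", volume]]


def churn(volumes, rounds, rng=None, run=subprocess.run):
    """Pick a random volume each round and apply one churn action to it."""
    if rng is None:
        rng = random.Random()
    for _ in range(rounds):
        volume = rng.choice(volumes)
        for cmd in churn_commands(volume, rng):
            # check=False: keep going even if a single CLI call fails.
            run(cmd, check=False)
```

Calling churn(PASSIVE_VOLUMES, rounds=100) on a cluster node while the Bonnie and dbench load runs against butcher1 and butcher2 would approximate the scenario. The `run` parameter is there so the command sequence can be inspected without touching a real cluster.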

Backtrace:


Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.c'.
Program terminated with signal 11, Segmentation fault.
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
511	        for (idx = 0; idx < ec->nodes; idx++) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-7.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 gssproxy-0.7.0-17.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 lz4-1.7.5-2.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.0-19.el7.x86_64 systemd-libs-219-57.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  ec_notify (this=0x7feb708d1340, event=6, data=0x7feb708c5d10, data2=0x7fef9dcae300 <__pthread_keys>) at ec.c:511
#1  0x00007feefabad3a9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ec.c:598
#2  0x00007fef07d2aa62 in xlator_notify (xl=0x7feb708d1340, event=event@entry=6, data=data@entry=0x7feb708c5d10) at xlator.c:566
#3  0x00007fef07dcacc4 in default_notify (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at defaults.c:3113
#4  0x00007feefae33e39 in client_notify_dispatch (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:90
#5  0x00007feefae33e9a in client_notify_dispatch_uniq (this=this@entry=0x7feb708c5d10, event=event@entry=6, data=data@entry=0x0) at client.c:68
#6  0x00007feefae35207 in client_rpc_notify (rpc=0x7feb71365920, mydata=0x7feb708c5d10, event=<optimized out>, data=<optimized out>) at client.c:2303
#7  0x00007fef0c0ba50b in rpc_clnt_handle_disconnect (conn=0x7feb71365950, clnt=0x7feb71365920) at rpc-clnt.c:876
#8  rpc_clnt_notify (trans=<optimized out>, mydata=0x7feb71365950, event=<optimized out>, data=0x7feb71365af0) at rpc-clnt.c:939
#9  0x00007fef0c0b6473 in rpc_transport_notify (this=this@entry=0x7feb71365af0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7feb71365af0) at rpc-transport.c:538
#10 0x00007feefb502baf in socket_event_poll_err (idx=<optimized out>, gen=<optimized out>, this=0x7feb71365af0) at socket.c:1206
#11 socket_event_handler (fd=123, idx=<optimized out>, gen=<optimized out>, data=0x7feb71365af0, poll_in=<optimized out>, poll_out=4, poll_err=24) at socket.c:2476
#12 0x00007fef07d8a3d4 in event_dispatch_epoll_handler (event=0x7fe95cff9500, event_pool=0x7fef08278260) at event-epoll.c:583
#13 event_dispatch_epoll_worker (data=0x7feb711f0e10) at event-epoll.c:659
#14 0x00007fef9da9edd5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fef9d16ab3d in clone () from /lib64/libc.so.6
(gdb) 


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.12.2-6.el7rhgs.x86_64
nfs-ganesha-2.5.5-3.el7rhgs.x86_64


How reproducible:
-----------------

2/3

Comment 3 Ambarish 2018-04-02 13:13:13 UTC

The crash is fairly reproducible.

The core on gqas003 is slightly different:


(gdb) bt
#0  0x00007fbf68eae052 in ec_notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#1  0x00007fbf68eae3a9 in notify () from /usr/lib64/glusterfs/3.12.2/xlator/cluster/disperse.so
#2  0x00007fbf762b5a62 in xlator_notify () from /lib64/libglusterfs.so.0
#3  0x00007fbf76355cc4 in default_notify () from /lib64/libglusterfs.so.0
#4  0x00007fbf69134e39 in client_notify_dispatch () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#5  0x00007fbf69134e9a in client_notify_dispatch_uniq () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#6  0x00007fbf69136207 in client_rpc_notify () from /usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so
#7  0x00007fbf7608050b in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#8  0x00007fbf7607c473 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#9  0x00007fbf6980fbaf in socket_event_handler () from /usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so
#10 0x00007fbf763153d4 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#11 0x00007fbf7b7acdd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fbf7ae78b3d in clone () from /lib64/libc.so.6
(gdb)

Comment 4 Ambarish 2018-04-02 13:16:17 UTC

CC'ing the EC guys: Pranith, Xavi, Ashish.

(Since, at least superficially, it seems to come from EC.)

Comment 9 Kaleb KEITHLEY 2018-04-09 17:18:48 UTC

Maybe it's not. I misread the bt and thought it had crashed in gf_timer_call_cancel().

Comment 17 Red Hat Bugzilla 2023-09-14 04:26:16 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
