Bug 1492695 - [Ganesha]: Ganesha crashed while exporting multiple volumes in a loop.
Summary: [Ganesha]: Ganesha crashed while exporting multiple volumes in a loop.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Soumya Koduri
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1503134
 
Reported: 2017-09-18 13:37 UTC by Ambarish
Modified: 2018-09-24 10:34 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:53:36 UTC
Embargoed:




Links:
Red Hat Product Errata RHEA-2018:2610 (Last Updated: 2018-09-04 06:54:41 UTC)

Description Ambarish 2017-09-18 13:37:12 UTC
Description of problem:
-----------------------

While attempting to get an in-house reproducer for https://bugzilla.redhat.com/show_bug.cgi?id=1481040, Ganesha crashed on 3 of the 6 nodes with the following backtrace:

(gdb) bt
#0  0x00007f2c28918020 in ?? ()
#1  0x00007f2e025aade9 in notify (this=<optimized out>, event=<optimized out>, data=0x7f2c2c002960) at glfs-master.c:109
#2  0x00007f2e131034c6 in xlator_notify (xl=0x7f2d139bb4c0, event=6, data=0x7f2c2c002960) at xlator.c:496
#3  0x00007f2e1319d39e in default_notify (this=0x7f2c2c00e720, event=6, data=<optimized out>) at defaults.c:3107
#4  0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c00e720, event=event@entry=6, data=data@entry=0x7f2c2c00d760) at xlator.c:496
#5  0x00007f2e1319d293 in default_notify (this=this@entry=0x7f2c2c00d760, event=6, data=data@entry=0x7f2c2c00c780) at defaults.c:3113
#6  0x00007f2de007925a in notify (this=0x7f2c2c00d760, event=<optimized out>, data=0x7f2c2c00c780) at io-stats.c:3966
#7  0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c00d760, event=event@entry=6, data=data@entry=0x7f2c2c00c780) at xlator.c:496
#8  0x00007f2e1319d293 in default_notify (this=this@entry=0x7f2c2c00c780, event=event@entry=6, data=data@entry=0x7f2c2c00b9a0) at defaults.c:3113
#9  0x00007f2de029314a in notify (this=0x7f2c2c00c780, event=6, data=0x7f2c2c00b9a0) at md-cache.c:2985
#10 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c00c780, event=event@entry=6, data=data@entry=0x7f2c2c00b9a0) at xlator.c:496
#11 0x00007f2e1319d293 in default_notify (this=0x7f2c2c00b9a0, event=6, data=<optimized out>) at defaults.c:3113
#12 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c00b9a0, event=event@entry=6, data=data@entry=0x7f2c2c00abc0) at xlator.c:496
#13 0x00007f2e1319d293 in default_notify (this=0x7f2c2c00abc0, event=6, data=<optimized out>) at defaults.c:3113
#14 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c00abc0, event=event@entry=6, data=data@entry=0x7f2c2c009e50) at xlator.c:496
#15 0x00007f2e1319d293 in default_notify (this=0x7f2c2c009e50, event=6, data=<optimized out>) at defaults.c:3113
#16 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c009e50, event=event@entry=6, data=data@entry=0x7f2c2c009000) at xlator.c:496
#17 0x00007f2e1319d293 in default_notify (this=0x7f2c2c009000, event=6, data=<optimized out>) at defaults.c:3113
#18 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c009000, event=event@entry=6, data=data@entry=0x7f2c2c008280) at xlator.c:496
#19 0x00007f2e1319d293 in default_notify (this=0x7f2c2c008280, event=6, data=<optimized out>) at defaults.c:3113
#20 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c008280, event=event@entry=6, data=data@entry=0x7f2c2c007440) at xlator.c:496
#21 0x00007f2e1319d293 in default_notify (this=0x7f2c2c007440, event=6, data=<optimized out>) at defaults.c:3113
#22 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c007440, event=event@entry=6, data=data@entry=0x7f2c2c006540) at xlator.c:496
#23 0x00007f2e1319d293 in default_notify (this=this@entry=0x7f2c2c006540, event=event@entry=6, data=<optimized out>) at defaults.c:3113
#24 0x00007f2de15c316f in dht_notify (this=0x7f2c2c006540, event=6, data=<optimized out>) at dht-common.c:9079
#25 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c006540, event=event@entry=6, data=data@entry=0x7f2c2c0056b0) at xlator.c:496
#26 0x00007f2e1319d293 in default_notify (this=this@entry=0x7f2c2c0056b0, event=event@entry=6, data=data@entry=0x7f2c2c004250) at defaults.c:3113
#27 0x00007f2df1ecea2a in afr_notify (this=<optimized out>, event=6, data=0x7f2c2c004250, data2=<optimized out>) at afr-common.c:4786
#28 0x00007f2df1edd2a9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at afr.c:43
#29 0x00007f2e131034c6 in xlator_notify (xl=0x7f2c2c0056b0, event=event@entry=6, data=data@entry=0x7f2c2c004250) at xlator.c:496
#30 0x00007f2e1319d293 in default_notify (this=this@entry=0x7f2c2c004250, event=event@entry=6, data=data@entry=0x0) at defaults.c:3113
#31 0x00007f2e096449f9 in client_notify_dispatch (this=this@entry=0x7f2c2c004250, event=event@entry=6, data=data@entry=0x0) at client.c:90
#32 0x00007f2e09644a5a in client_notify_dispatch_uniq (this=this@entry=0x7f2c2c004250, event=event@entry=6, data=data@entry=0x0) at client.c:68
#33 0x00007f2e09645d78 in client_rpc_notify (rpc=0x7f2c2d32dee0, mydata=0x7f2c2c004250, event=<optimized out>, data=<optimized out>) at client.c:2298
#34 0x00007f2e12ecfa13 in rpc_clnt_handle_disconnect (conn=0x7f2c2d32df10, clnt=0x7f2c2d32dee0) at rpc-clnt.c:892
#35 rpc_clnt_notify (trans=<optimized out>, mydata=0x7f2c2d32df10, event=<optimized out>, data=0x7f2c2d387470) at rpc-clnt.c:955
#36 0x00007f2e12ecb9f3 in rpc_transport_notify (this=this@entry=0x7f2c2d387470, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f2c2d387470) at rpc-transport.c:538
#37 0x00007f2e01398754 in socket_event_poll_err (this=0x7f2c2d387470) at socket.c:1184
#38 socket_event_handler (fd=<optimized out>, idx=1, data=0x7f2c2d387470, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2409
#39 0x00007f2e1315fa30 in event_dispatch_epoll_handler (event=0x7f2c1e7fb540, event_pool=0x7f2c2892ce00) at event-epoll.c:571
#40 event_dispatch_epoll_worker (data=0x7f2c2d318e20) at event-epoll.c:674
#41 0x00007f2ea2e8edc5 in start_thread (arg=0x7f2c1e7fc700) at pthread_create.c:308
#42 0x00007f2ea255d73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 


Would request you to refrain from restarting Ganesha. I have the setup preserved for https://bugzilla.redhat.com/show_bug.cgi?id=1481040 for Soumya.



This is what I did:


1) for i in {1..12};do gluster v create testvolC$i replica 2 gqas008.sbu.lab.eng.bos.redhat.com:/bricks$i/Anew1 gqas006.sbu.lab.eng.bos.redhat.com:/bricks$i/Anew1;done

2) for i in {1..12};do gluster v set testvolC$i cluster.lookup-optimize on;gluster v set testvolC$i server.event-threads 4;gluster v set testvolC$i client.event-threads 4;done

3) for i in {1..12};do gluster v set testvolC$i ganesha.enable on;sleep 10;done

The first time I exported my volumes via step (3), I did not add the sleep and ran into this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1455801


So I added the sleep and ran these steps for testvol$i, testvolB$i and testvolC$i.


These were the test steps during which I noticed a Ganesha crash on 3 of my nodes with the backtrace shown above.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.1-11.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-18.4.el7rhgs.x86_64


How reproducible:
-----------------

1/1

Comment 3 Ambarish 2017-09-18 13:42:15 UTC
Seeing repeated Export IDs across different exports (16, 7, etc.); a quick way to isolate the duplicates is sketched after the listing below:



[root@gqas013 exports]# cat * | grep "Export_Id"
      Export_Id= 14 ;
      Export_Id= 15 ;
      Export_Id= 16 ;
      Export_Id= 16 ;
      Export_Id= 6 ;
      Export_Id= 7 ;
      Export_Id= 8 ;
      Export_Id= 9 ;
      Export_Id= 10 ;
      Export_Id= 11 ;
      Export_Id= 12 ;
      Export_Id= 13 ;
      Export_Id=21;
      Export_Id=2;
      Export_Id=5;
      Export_Id = 2;
      Export_Id=17;
      Export_Id=18;
      Export_Id= 5 ;
      Export_Id=19;
      Export_Id=20;
      Export_Id= 9 ;
      Export_Id=3;
      Export_Id=4;
      Export_Id= 26 ;
      Export_Id= 27 ;
      Export_Id= 28 ;
      Export_Id= 17 ;
      Export_Id= 18 ;
      Export_Id= 19 ;
      Export_Id= 20 ;
      Export_Id= 21 ;
      Export_Id= 22 ;
      Export_Id= 23 ;
      Export_Id= 24 ;
      Export_Id= 25 ;
      Export_Id= 7 ;
      Export_Id=1;
[root@gqas013 exports]#
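
A hedged one-liner to isolate the duplicated IDs from the same exports directory (it assumes every export file uses the Export_Id key exactly as shown in the listing above):

cat * | grep "Export_Id" | tr -d ' ;' | sort | uniq -d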

Comment 4 Daniel Gryniewicz 2017-09-18 14:08:13 UTC
So, this crash is in GFAPI, not in Ganesha (it's in a pure GFAPI thread), so I'm a bit out of my depth here.  However, here's my analysis.

It's calling an invalid function pointer (fs->init_cbk).  You can see that the value of the pointer is, in fact, the top of the call stack:

(gdb) p fs->init_cbk
$3 = (glfs_init_cbk) 0x7f2c28918020

#0  0x00007f2c28918020 in ?? ()

fs appears to be valid (at least, it has a good name, and whatnot).  The only place I can find that sets this pointer is glfs_init_async(), which also appears to not be called anywhere.  fs is created with calloc(), so it should all be zeroed.  This means *something* wrote to that address, but possibly not legitimately.  So there may be memory corruption here.

Or I may just be missing something, since I'm unfamiliar with the gluster codebase.
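
One hedged follow-up in the core (standard gdb commands, nothing specific to this codebase) would be to dump the words around the corrupted field and the start of 'fs' to see whether neighbouring memory was scribbled as well; the exact offsets depend on the real struct layout:

(gdb) p &fs->init_cbk
(gdb) x/8gx &fs->init_cbk
(gdb) x/8gx fs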

Comment 5 Soumya Koduri 2017-09-20 04:13:04 UTC
Thread 253 (Thread 0x7f2e193ba700 (LWP 28299)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f2e1312b749 in event_dispatch_destroy (event_pool=0x7f2c2892ce00) at event.c:261
#2  0x00007f2e133d7971 in pub_glfs_fini (fs=fs@entry=0x7f2d139943a0) at glfs.c:1216
#3  0x00007f2e181a9d9b in glusterfs_create_export (fsal_hdl=<optimized out>, 
    parse_node=<optimized out>, err_type=<optimized out>, up_ops=0x7f2ea4c33800 <fsal_up_top>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/export.c:772
#4  0x00007f2ea49a98e0 in fsal_cfg_commit (node=0x7f2e170e42b0, link_mem=0x7f2d13994ee8, 
    self_struct=<optimized out>, err_type=0x7f2e193b91c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:751
#5  0x00007f2ea49e2848 in proc_block (node=<optimized out>, item=<optimized out>, 
    link_mem=<optimized out>, err_type=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1337
#6  0x00007f2ea49e1cc0 in do_block_load (err_type=<optimized out>, param_struct=<optimized out>, 
    relax=<optimized out>, params=<optimized out>, blk=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1195
#7  proc_block (node=<optimized out>, item=<optimized out>, link_mem=<optimized out>, 
    err_type=<optimized out>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1321
#8  0x00007f2ea49e2fa9 in load_config_from_node (tree_node=0x7f2d13c77c10, 
    conf_blk=0x7f2ea4c37240 <add_export_param>, param=param@entry=0x0, unique=unique@entry=false, 
    err_type=err_type@entry=0x7f2e193b91c0)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1836
#9  0x00007f2ea49b8fc7 in gsh_export_addexport (args=<optimized out>, reply=0x7f2ea58e6ab0, 
    error=0x7f2e193b92e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/export_mgr.c:967
#10 0x00007f2ea49ddf49 in dbus_message_entrypoint (conn=0x7f2ea58e6620, msg=msg@entry=0x7f2e14008050, 
    user_data=user_data@entry=0x7f2ea4c38ce0 <export_interfaces>)
    at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:512
#11 0x00007f2ea4277c76 in _dbus_object_tree_dispatch_and_unlock (tree=0x7f2ea58ec990, 
    message=message@entry=0x7f2e14008050, found_object=found_object@entry=0x7f2e193b9484)
    at dbus-object-tree.c:862
#12 0x00007f2ea4269e49 in dbus_connection_dispatch (connection=connection@entry=0x7f2ea58e6620)
    at dbus-connection.c:4672



There is another thread cleaning up (glfs_fini) the same 'fs' object. Maybe this caused the memory corruption which Dan mentioned above. Will look further into the core and update.
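
As a purely illustrative sketch (none of these names exist in gfapi or Ganesha), the suspected pattern is one thread tearing down a shared handle while another thread is still delivering a notification against it, so the callback pointer read by the notifier can point at freed or reused memory:

/* Illustrative only -- fake_fs, notifier and finalizer are made-up names;
 * this is not gfapi or Ganesha code. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_fs {
        void (*init_cbk)(struct fake_fs *fs, int ret);
};

static struct fake_fs *shared_fs;

/* Stands in for the epoll/notify thread delivering CHILD_DOWN events. */
static void *notifier(void *arg)
{
        for (int i = 0; i < 100000; i++) {
                struct fake_fs *fs = shared_fs;   /* unsynchronized read */
                if (fs && fs->init_cbk)
                        fs->init_cbk(fs, -1);     /* may call a stale pointer */
        }
        return arg;
}

/* Stands in for a concurrent glfs_fini()-style teardown. */
static void *finalizer(void *arg)
{
        usleep(100);
        struct fake_fs *fs = shared_fs;
        shared_fs = NULL;                         /* no lock/barrier: racy */
        free(fs);                                 /* heap may be reused and scribbled */
        return arg;
}

int main(void)
{
        pthread_t t1, t2;

        shared_fs = calloc(1, sizeof(*shared_fs));
        pthread_create(&t1, NULL, notifier, NULL);
        pthread_create(&t2, NULL, finalizer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

Built with -pthread and run under AddressSanitizer or valgrind, a pattern like this shows up as a heap use-after-free, which would line up with the corruption Dan suspects.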

@Ambarish,
I am unable to access those machines now. Could you please check if they are up? Thanks!

Comment 7 Soumya Koduri 2017-09-20 14:37:43 UTC
Thanks Ambarish. 

Finally, I could reproduce this issue even with the current upstream bits, but only when the event-thread count is increased to 4, not when the values are left at their defaults. The issue seems to happen when export creation fails at a later stage (due to a duplicate export ID, for example) and we try to clean up the glfs object, but strangely not during a regular unexport. The actual cause is yet to be analyzed.

Also, while working on this issue, I found a couple of other bugs with respect to cleanup during export failure upstream. I shall send patches for those.

@Ambarish,

Could you please check whether this issue and the one reported in bug 1492995 are reproducible with the event-threads values set to default? Thanks!
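
For reference, resetting those options back to their defaults before retesting could be done along these lines (a sketch; it assumes the testvolC naming from the original steps and the stock 'gluster volume reset' command):

for i in {1..12};do gluster v reset testvolC$i server.event-threads;gluster v reset testvolC$i client.event-threads;done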

Comment 8 Soumya Koduri 2017-09-21 08:34:20 UTC
Okay. The issue seems to be spurious, but it may not be related to the event-thread count. If I try to export a volume with an already-used export ID in a loop, I hit the crash reported in this bug or the one mentioned in bug 1492995. I also once saw the backtrace below:

(gdb) bt
#0  0x00007f7903f2da98 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f7903f2f69a in __GI_abort () at abort.c:89
#2  0x00007f7903f70e1a in __libc_message (do_abort=do_abort@entry=2, 
    fmt=fmt@entry=0x7f7904083a00 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f7903f7941a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, 
    str=0x7f790408133d "free(): invalid pointer", action=3) at malloc.c:5000
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:3861
#5  0x00007f7903f7cbcc in __GI___libc_free (mem=<optimized out>) at malloc.c:2962
#6  0x00000000004f94be in gsh_free (p=0x7f78ec04b4f0)
    at /home/guest/Documents/workspace/nfs-ganesha/src/include/abstract_mem.h:271
#7  0x00000000004fd2e0 in server_stats_free (statsp=0x7f78ec05dee0)
    at /home/guest/Documents/workspace/nfs-ganesha/src/support/server_stats.c:2106
#8  0x00000000004fe8ff in free_export (export=0x7f78ec05df28)
    at /home/guest/Documents/workspace/nfs-ganesha/src/support/export_mgr.c:260
#9  0x00000000004eaf56 in export_init (link_mem=0x0, self_struct=0x7f78ec05df28)
    at /home/guest/Documents/workspace/nfs-ganesha/src/support/exports.c:947
#10 0x000000000053a4c6 in proc_block (node=0x7f78ec04ab10, item=0x5bed08 <add_export_param+8>, 
    link_mem=0x0, err_type=0x7f78c2fec220)
    at /home/guest/Documents/workspace/nfs-ganesha/src/config_parsing/config_parsing.c:1359
#11 0x000000000053b1e6 in load_config_from_node (tree_node=0x7f78ec04ab10, 
    conf_blk=0x5bed00 <add_export_param>, param=0x0, unique=false, err_type=0x7f78c2fec220)
    at /home/guest/Documents/workspace/nfs-ganesha/src/config_parsing/config_parsing.c:1835
#12 0x0000000000500ad0 in gsh_export_addexport (args=0x7f78c2fec2c0, reply=0x6bd8d0, 
    error=0x7f78c2fec310) at /home/guest/Documents/workspace/nfs-ganesha/src/support/export_mgr.c:984
#13 0x0000000000535487 in dbus_message_entrypoint (conn=0x6bd4b0, msg=0x6bdab0, 
    user_data=0x5bf870 <export_interfaces>)
    at /home/guest/Documents/workspace/nfs-ganesha/src/dbus/dbus_server.c:511
#14 0x00007f790548c153 in _dbus_object_tree_dispatch_and_unlock (tree=0x6c2d90, 
    message=message@entry=0x6bdab0, found_object=found_object@entry=0x7f78c2fec488)
    at ../../dbus/dbus-object-tree.c:1020
#15 0x00007f790547d6e4 in dbus_connection_dispatch (connection=0x6bd4b0)
    at ../../dbus/dbus-connection.c:4744
#16 0x00007f790547d9ed in _dbus_connection_read_write_dispatch (connection=0x6bd4b0, 
    timeout_milliseconds=100, dispatch=<optimized out>) at ../../dbus/dbus-connection.c:3691
#17 0x0000000000535fef in gsh_dbus_thread (arg=0x0)
    at /home/guest/Documents/workspace/nfs-ganesha/src/dbus/dbus_server.c:738
#18 0x00007f79046e260a in start_thread (arg=0x7f78c2fed700) at pthread_create.c:334
#19 0x00007f7903ffbbbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) 


Maybe there is some issue in the Ganesha layer itself. Another thing to note is that I was running V2.6-dev.6 when I hit the above crashes. But I have now rebased to the current next branch (V2.6-dev.10) and somehow cannot reproduce either of these crashes.

Comment 9 Soumya Koduri 2017-09-21 14:36:21 UTC
If run in a loop, I could hit this issue even on the latest next branch. It seems to be a use-after-free in FSAL_GLUSTER. Posted a fix upstream:

https://review.gerrithub.io/#/c/379481/

Other fixes which need to be pulled into 3.4 (on top of V2.5.2) are below; a hedged sketch of the general cleanup-guard pattern follows the links:
- https://review.gerrithub.io/379430
- https://review.gerrithub.io/379431
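
Not the content of those patches (they are only linked above), but the general cleanup-guard pattern such a use-after-free fix typically applies looks roughly like this (all names here are made up for illustration):

/* Illustrative only -- gl_fs and gl_export are made-up stand-ins, not the
 * real FSAL_GLUSTER structures, and this is not the actual upstream patch. */
#include <stdlib.h>

struct gl_fs { int placeholder; };           /* stand-in for a glfs handle   */
struct gl_export { struct gl_fs *fs; };      /* stand-in for the export obj  */

static void gl_fs_destroy(struct gl_fs *fs)
{
        free(fs);
}

/* Reached from both the export-creation failure path and the normal release
 * path; clearing the pointer after the first teardown turns any later call
 * into a no-op instead of a double free / use-after-free. */
static void export_cleanup(struct gl_export *exp)
{
        if (exp->fs) {
                gl_fs_destroy(exp->fs);
                exp->fs = NULL;
        }
}

int main(void)
{
        struct gl_export exp = { .fs = calloc(1, sizeof(struct gl_fs)) };

        export_cleanup(&exp);   /* failure path */
        export_cleanup(&exp);   /* later release: safely a no-op */
        return 0;
}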

Comment 14 Manisha Saini 2018-08-02 07:33:50 UTC
Verified this with 

# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-9.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-9.el7rhgs.x86_64

Steps-

1. Created 50 Distributed-Replicate (replica 3) volumes in a loop:

for i in $(seq 1 50)
do 
gluster v create mani$i replica 3 moonshine.lab.eng.blr.redhat.com:/gluster/brick1/v$i tettnang.lab.eng.blr.redhat.com:/gluster/brick1/v$i zod.lab.eng.blr.redhat.com:/gluster/brick1/v$i yarrow.lab.eng.blr.redhat.com:/gluster/brick1/v$i rhs-gp-srv3.lab.eng.blr.redhat.com:/gluster/brick1/v$i rhs-hpc-srv4.lab.eng.blr.redhat.com:/gluster/brick1/v$i
sleep 2 
done

2. Started the 50 volumes:

for i in {1..50};do gluster v start mani$i;done

3. Set cluster.lookup-optimize, server.event-threads and client.event-threads on the volumes:

for i in {1..50};do gluster v set mani$i cluster.lookup-optimize on;gluster v set mani$i server.event-threads 4;gluster v set mani$i client.event-threads 4;done

4. Exported the volumes via Ganesha:

for i in {1..50};do gluster v set mani$i ganesha.enable on;sleep 10;done


No crashes were observed. All volumes got exported successfully.
Moving this BZ to the Verified state.

Comment 16 errata-xmlrpc 2018-09-04 06:53:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610

