Bug 1492695
| Summary: | [Ganesha] : Ganesha crashed while exporting multiple volumes in loop. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
| Component: | nfs-ganesha | Assignee: | Soumya Koduri <skoduri> |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, msaini, rhinduja, rhs-bugs, sheggodu, skoduri, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-04 06:53:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1503134 | | |
Description
Ambarish
2017-09-18 13:37:12 UTC
Seeing duplicate Export IDs across different exports (16, 7, etc.):
[root@gqas013 exports]# cat * | grep "Export_Id"
Export_Id= 14 ;
Export_Id= 15 ;
Export_Id= 16 ;
Export_Id= 16 ;
Export_Id= 6 ;
Export_Id= 7 ;
Export_Id= 8 ;
Export_Id= 9 ;
Export_Id= 10 ;
Export_Id= 11 ;
Export_Id= 12 ;
Export_Id= 13 ;
Export_Id=21;
Export_Id=2;
Export_Id=5;
Export_Id = 2;
Export_Id=17;
Export_Id=18;
Export_Id= 5 ;
Export_Id=19;
Export_Id=20;
Export_Id= 9 ;
Export_Id=3;
Export_Id=4;
Export_Id= 26 ;
Export_Id= 27 ;
Export_Id= 28 ;
Export_Id= 17 ;
Export_Id= 18 ;
Export_Id= 19 ;
Export_Id= 20 ;
Export_Id= 21 ;
Export_Id= 22 ;
Export_Id= 23 ;
Export_Id= 24 ;
Export_Id= 25 ;
Export_Id= 7 ;
Export_Id=1;
[root@gqas013 exports]#
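A quick way to spot the duplicates, given the inconsistent spacing in these files, is a sketch like the following (the exports directory path is an assumption based on the shell prompt above, not taken from the report):

# Normalize spacing and trailing semicolons, then print any Export_Id
# value that occurs more than once:
grep -h "Export_Id" /var/run/gluster/shared_storage/nfs-ganesha/exports/* | tr -d ' ;' | sort | uniq -d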
So, this crash is in GFAPI, not in Ganesha (it's in a pure GFAPI thread), so I'm a bit out of my depth here. However, here's my analysis.

It's calling an invalid function pointer (fs->init_cbk). You can see that the value of the pointer is, in fact, the top of the call stack:

(gdb) p fs->init_cbk
$3 = (glfs_init_cbk) 0x7f2c28918020
#0 0x00007f2c28918020 in ?? ()

fs appears to be valid (at least, it has a good name, and whatnot). The only place I can find that sets this pointer is glfs_init_async(), which does not appear to be called anywhere. fs is created with calloc(), so it should all be zeroed. This means *something* wrote to that address, but possibly not legitimately. So there may be memory corruption here. Or I may just be missing something, since I'm unfamiliar with the gluster codebase.

Thread 253 (Thread 0x7f2e193ba700 (LWP 28299)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f2e1312b749 in event_dispatch_destroy (event_pool=0x7f2c2892ce00) at event.c:261
#2 0x00007f2e133d7971 in pub_glfs_fini (fs=fs@entry=0x7f2d139943a0) at glfs.c:1216
#3 0x00007f2e181a9d9b in glusterfs_create_export (fsal_hdl=<optimized out>,
parse_node=<optimized out>, err_type=<optimized out>, up_ops=0x7f2ea4c33800 <fsal_up_top>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/FSAL/FSAL_GLUSTER/export.c:772
#4 0x00007f2ea49a98e0 in fsal_cfg_commit (node=0x7f2e170e42b0, link_mem=0x7f2d13994ee8,
self_struct=<optimized out>, err_type=0x7f2e193b91c0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/support/exports.c:751
#5 0x00007f2ea49e2848 in proc_block (node=<optimized out>, item=<optimized out>,
link_mem=<optimized out>, err_type=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1337
#6 0x00007f2ea49e1cc0 in do_block_load (err_type=<optimized out>, param_struct=<optimized out>,
relax=<optimized out>, params=<optimized out>, blk=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1195
#7 proc_block (node=<optimized out>, item=<optimized out>, link_mem=<optimized out>,
err_type=<optimized out>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1321
#8 0x00007f2ea49e2fa9 in load_config_from_node (tree_node=0x7f2d13c77c10,
conf_blk=0x7f2ea4c37240 <add_export_param>, param=param@entry=0x0, unique=unique@entry=false,
err_type=err_type@entry=0x7f2e193b91c0)
at /usr/src/debug/nfs-ganesha-2.4.1/src/config_parsing/config_parsing.c:1836
#9 0x00007f2ea49b8fc7 in gsh_export_addexport (args=<optimized out>, reply=0x7f2ea58e6ab0,
error=0x7f2e193b92e0) at /usr/src/debug/nfs-ganesha-2.4.1/src/support/export_mgr.c:967
#10 0x00007f2ea49ddf49 in dbus_message_entrypoint (conn=0x7f2ea58e6620, msg=msg@entry=0x7f2e14008050,
user_data=user_data@entry=0x7f2ea4c38ce0 <export_interfaces>)
at /usr/src/debug/nfs-ganesha-2.4.1/src/dbus/dbus_server.c:512
#11 0x00007f2ea4277c76 in _dbus_object_tree_dispatch_and_unlock (tree=0x7f2ea58ec990,
message=message@entry=0x7f2e14008050, found_object=found_object@entry=0x7f2e193b9484)
at dbus-object-tree.c:862
#12 0x00007f2ea4269e49 in dbus_connection_dispatch (connection=connection@entry=0x7f2ea58e6620)
at dbus-connection.c:4672
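One way to sanity-check the stray pointer described above is to ask gdb whether the address resolves to any mapped symbol. This is a hedged sketch; the binary path and core file name are assumptions, not from the report:

# If 0x7f2c28918020 were a legitimate callback, this would print a
# function name; "No symbol matches" supports the corruption theory.
gdb -batch -ex 'info symbol 0x7f2c28918020' /usr/bin/ganesha.nfsd /path/to/core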
There is another thread cleaning up (glfs_fini) the same 'fs' object. Maybe this caused the memory corruption which Dan has mentioned above. Will look further into the core and update.
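To confirm that two threads really are operating on the same 'fs' object, one can dump every thread's backtrace from the core and search for the cleanup and export paths. A sketch with assumed paths:

# Dump all thread backtraces, then look for a second thread sitting in
# glfs_fini or glusterfs_create_export at the same time:
gdb -batch -ex 'thread apply all bt' /usr/bin/ganesha.nfsd /path/to/core > threads.txt
grep -n -B2 'glfs_fini\|glusterfs_create_export' threads.txt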
@Ambarish,
I am unable to access those machines now. Could you please check whether they are up? Thanks!
Thanks Ambarish. Finally I could reproduce this issue even with current upstream bits, but only when the event_thread count is increased to '4', not when the values were set to default. The issue seems to happen when export creation fails at a later stage (due to a duplicate export id, for example) and we try to clean up the glfs object, but strangely not during a regular unexport. Yet to analyze the actual cause. Also, while working on this issue, I found a couple of other bugs with respect to cleanup during export failure in upstream. Shall send patches for those.

@Ambarish, could you please check if this issue and the one reported in bug 1492995 are reproducible with the event-threads value set to default. Thanks!

Okay. The issue seems to be spurious but may not be related to the event-thread count. If I try to export a volume with an already used export-id in a loop, I hit the crash reported in this bug or the one mentioned in bug 1492995. Also, I once saw the below backtrace:

(gdb) bt
#0 0x00007f7903f2da98 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f7903f2f69a in __GI_abort () at abort.c:89
#2 0x00007f7903f70e1a in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f7904083a00 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007f7903f7941a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7f790408133d "free(): invalid pointer", action=3) at malloc.c:5000
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:3861
#5 0x00007f7903f7cbcc in __GI___libc_free (mem=<optimized out>) at malloc.c:2962
#6 0x00000000004f94be in gsh_free (p=0x7f78ec04b4f0) at /home/guest/Documents/workspace/nfs-ganesha/src/include/abstract_mem.h:271
#7 0x00000000004fd2e0 in server_stats_free (statsp=0x7f78ec05dee0) at /home/guest/Documents/workspace/nfs-ganesha/src/support/server_stats.c:2106
#8 0x00000000004fe8ff in free_export (export=0x7f78ec05df28) at /home/guest/Documents/workspace/nfs-ganesha/src/support/export_mgr.c:260
#9 0x00000000004eaf56 in export_init (link_mem=0x0, self_struct=0x7f78ec05df28) at /home/guest/Documents/workspace/nfs-ganesha/src/support/exports.c:947
#10 0x000000000053a4c6 in proc_block (node=0x7f78ec04ab10, item=0x5bed08 <add_export_param+8>, link_mem=0x0, err_type=0x7f78c2fec220) at /home/guest/Documents/workspace/nfs-ganesha/src/config_parsing/config_parsing.c:1359
#11 0x000000000053b1e6 in load_config_from_node (tree_node=0x7f78ec04ab10, conf_blk=0x5bed00 <add_export_param>, param=0x0, unique=false, err_type=0x7f78c2fec220) at /home/guest/Documents/workspace/nfs-ganesha/src/config_parsing/config_parsing.c:1835
#12 0x0000000000500ad0 in gsh_export_addexport (args=0x7f78c2fec2c0, reply=0x6bd8d0, error=0x7f78c2fec310) at /home/guest/Documents/workspace/nfs-ganesha/src/support/export_mgr.c:984
#13 0x0000000000535487 in dbus_message_entrypoint (conn=0x6bd4b0, msg=0x6bdab0, user_data=0x5bf870 <export_interfaces>) at /home/guest/Documents/workspace/nfs-ganesha/src/dbus/dbus_server.c:511
#14 0x00007f790548c153 in _dbus_object_tree_dispatch_and_unlock (tree=0x6c2d90, message=message@entry=0x6bdab0, found_object=found_object@entry=0x7f78c2fec488) at ../../dbus/dbus-object-tree.c:1020
#15 0x00007f790547d6e4 in dbus_connection_dispatch (connection=0x6bd4b0) at ../../dbus/dbus-connection.c:4744
#16 0x00007f790547d9ed in _dbus_connection_read_write_dispatch (connection=0x6bd4b0, timeout_milliseconds=100, dispatch=<optimized out>) at ../../dbus/dbus-connection.c:3691
#17 0x0000000000535fef in gsh_dbus_thread (arg=0x0) at /home/guest/Documents/workspace/nfs-ganesha/src/dbus/dbus_server.c:738
#18 0x00007f79046e260a in start_thread (arg=0x7f78c2fed700) at pthread_create.c:334
#19 0x00007f7903ffbbbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

Maybe there is some issue with the ganesha layer itself. Another thing to note is that I was running V2.6-dev.6 when I hit the above crashes. But now I have rebased to the current next branch (V2.6-dev.10) and somehow cannot reproduce either of these crashes.

If run in a loop, I could hit this issue even on the latest next branch. It seems to be a use-after-free in FSAL_GLUSTER. Posted fix upstream - https://review.gerrithub.io/#/c/379481/

Other fixes which need to be pulled in for 3.4 (on top of V2.5.2) are:
- https://review.gerrithub.io/379430
- https://review.gerrithub.io/379431
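For reference, the failing path in both backtraces above is entered through ganesha's DBus AddExport method (gsh_export_addexport via dbus_message_entrypoint). A reproducer loop along the lines described could look like the following sketch; the config file path, volume path, and the assumption that the config reuses an already-taken Export_Id are placeholders, not from the report:

# /etc/ganesha/export.conf deliberately contains an Export_Id already in use.
for i in {1..50}
do
    dbus-send --system --print-reply --dest=org.ganesha.nfsd \
        /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.AddExport \
        string:/etc/ganesha/export.conf string:"EXPORT(Path=/testvol)"
done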
Verified this with
# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-9.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-9.el7rhgs.x86_64
Steps:
1. Created 50 distributed-replicated (replica 3) volumes in a loop:
for i in $(seq 1 50)
do
gluster v create mani$i replica 3 moonshine.lab.eng.blr.redhat.com:/gluster/brick1/v$i tettnang.lab.eng.blr.redhat.com:/gluster/brick1/v$i zod.lab.eng.blr.redhat.com:/gluster/brick1/v$i yarrow.lab.eng.blr.redhat.com:/gluster/brick1/v$i rhs-gp-srv3.lab.eng.blr.redhat.com:/gluster/brick1/v$i rhs-hpc-srv4.lab.eng.blr.redhat.com:/gluster/brick1/v$i
sleep 2
done
2. Started the 50 volumes:
for i in {1..50};do gluster v start mani$i;done
3. Set cluster.lookup-optimize, server.event-threads, and client.event-threads on each volume:
for i in {1..50};do gluster v set mani$i cluster.lookup-optimize on;gluster v set mani$i server.event-threads 4;gluster v set mani$i client.event-threads 4;done
4. Exported each volume via ganesha:
for i in {1..50};do gluster v set mani$i ganesha.enable on;sleep 10;done
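A quick post-export health check is sketched below; it assumes all 50 volumes should appear in the export list and that the service name is nfs-ganesha:

showmount -e localhost          # should list all 50 exported volumes
systemctl status nfs-ganesha    # service should still be active, i.e. no crash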
No crashes were observed. All volumes got exported successfully.
Moving this BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2610