Description of problem:
Snapd crashed on one of the storage nodes.

Version-Release number of selected component (if applicable):
[root@darkknightrises ~]# rpm -qa | grep glusterfs
glusterfs-libs-3.7.1-5.el6rhs.x86_64
glusterfs-api-3.7.1-5.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-7.el6rhs.x86_64
glusterfs-3.7.1-5.el6rhs.x86_64
glusterfs-fuse-3.7.1-5.el6rhs.x86_64
glusterfs-cli-3.7.1-5.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-5.el6rhs.x86_64
glusterfs-debuginfo-3.7.1-5.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-5.el6rhs.x86_64
glusterfs-server-3.7.1-5.el6rhs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume.
2. FUSE mount the volume and start I/O from the mount point.
3. Enable quota and set a usage limit.
4. Enable USS.
5. Set snap-activate-on-create to enable for snapshots.
6. Enable auto-delete for snapshots.
7. Set cluster.enable-shared-storage to enable.
8. Schedule a snapshot every 5 minutes:
[root@darkknightrises ~]# snap_scheduler.py list
JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME
--------------------------------------------------------------------
job1             */5 * * * *      Snapshot Create  vol0
9. Edit the scheduled time to every two minutes.
10. Restart the ntpd service on all the storage nodes.
11. Run gluster v status vol0 and observe the crash.

Actual results:
Snapd crashed.

Expected results:
Snapd should not crash.

Additional info:
[root@darkknightrises ~]# gluster v info vol0

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 553670d5-8f05-4d41-878e-921d59f117ae
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.33.214:/rhs/brick1/b01
Brick2: 10.70.33.219:/rhs/brick1/b02
Brick3: 10.70.33.225:/rhs/brick1/b03
Brick4: 10.70.44.13:/rhs/brick1/04
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
features.uss: enable
features.show-snapshot-directory: enable
features.barrier: enable
auto-delete: enable
snap-activate-on-create: enable
cluster.enable-shared-storage: enable

===============================================================

[root@darkknightrises /]# gluster v status vol0
Status of volume: vol0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.33.214:/rhs/brick1/b01          49152     0          Y       24731
Brick 10.70.33.219:/rhs/brick1/b02          49152     0          Y       16359
Brick 10.70.33.225:/rhs/brick1/b03          49152     0          Y       10149
Brick 10.70.44.13:/rhs/brick1/04            49152     0          Y       5512
Snapshot Daemon on localhost                N/A       N/A        N       25133
NFS Server on localhost                     2049      0          Y       25671
Self-heal Daemon on localhost               N/A       N/A        Y       25681
Quota Daemon on localhost                   N/A       N/A        Y       26666
Snapshot Daemon on 10.70.33.225             49153     0          Y       10525
NFS Server on 10.70.33.225                  2049      0          Y       10991
Self-heal Daemon on 10.70.33.225            N/A       N/A        Y       11001
Quota Daemon on 10.70.33.225                N/A       N/A        Y       11853
Snapshot Daemon on 10.70.44.13              49154     0          Y       5918
NFS Server on 10.70.44.13                   2049      0          Y       6396
Self-heal Daemon on 10.70.44.13             N/A       N/A        Y       6406
Quota Daemon on 10.70.44.13                 N/A       N/A        Y       7255
Snapshot Daemon on 10.70.33.219             49154     0          Y       16714
NFS Server on 10.70.33.219                  2049      0          Y       17201
Self-heal Daemon on 10.70.33.219            N/A       N/A        Y       17211
Quota Daemon on 10.70.33.219                N/A       N/A        Y       18168

bt (gdb session shared by krishnan_p)
========================================
#44214 0x00007f473c1f5def in client_dump_version_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f4024701c24) at client-handshake.c:1623
1623        gf_msg (frame->this->name, GF_LOG_WARNING, ENOTCONN,
#44215 0x00007f475360cb1e in saved_frames_unwind (saved_frames=0x7f45d80022d0) at rpc-clnt.c:366
366         trav->rpcreq->cbkfn (trav->rpcreq, &iov, 1, trav->frame);
#44216 0x00007f475360cc0e in saved_frames_destroy (frames=0x7f45d80022d0) at rpc-clnt.c:383
383         saved_frames_unwind (frames);
#44217 0x00007f475360ccdb in rpc_clnt_connection_cleanup (conn=0x7f45d0e86f40) at rpc-clnt.c:536
536         saved_frames_destroy (saved_frames);
#44218 0x00007f475360d29f in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7f45d0e86f40, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:843
843         rpc_clnt_connection_cleanup (conn);
#44219 0x00007f4753608928 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
543         ret = this->notify (this, this->mydata, event, data);
#44220 0x00007f4748370741 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
1205        rpc_transport_notify (this, RPC_TRANSPORT_DISCONNECT, this);
#44221 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket
----------------------------------------
and some frames at the top:
#0  0x00007f47521ee06a in vfprintf () from /lib64/libc.so.6
#1  0x00007f47522a9b10 in __vsnprintf_chk () from /lib64/libc.so.6
#2  0x00007f47522a9a4a in __snprintf_chk () from /lib64/libc.so.6
#3  0x00007f4753856a5d in snprintf (buf=0x7f4024f1bb98 "") at /usr/include/bits/stdio2.h:65
#4  gf_backtrace_append (buf=0x7f4024f1bb98 "") at common-utils.c:3547
#5  gf_backtrace_fillframes (buf=0x7f4024f1bb98 "") at common-utils.c:3593
#6  0x00007f4753856ac1 in gf_backtrace_save (buf=<value optimized out>) at common-utils.c:3631
#7  0x00007f475383cfc0 in _gf_log_callingfn (domain=0x7f47538add11 "", file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info", line=62, level=GF_LOG_ERROR, fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at logging.c:835
#8  0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120, alloc_ptr=0x7f401a76ffe8, size=437, type=39, typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:62
#9  0x00007f475386ddf6 in __gf_malloc (size=437, type=39, typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
#10 0x00007f475386df04 in gf_vasprintf (string_ptr=0x7f401a770228, format=0x7f47538adc22 "[%s] %s [%s:%d:%s] %s %d-%s: ", arg=<value optimized out>) at mem-pool.c:220
#11 0x00007f475386dfc8 in gf_asprintf (string_ptr=<value optimized out>, format=<value optimized out>) at mem-pool.c:238
#12 0x00007f475383d1fe in _gf_log_callingfn (domain=0x7f47538add11 "", file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info", line=62, level=GF_LOG_ERROR, fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at logging.c:869
#13 0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120, alloc_ptr=0x7f401a770488, size=437, type=39, typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:62
#14 0x00007f475386ddf6 in __gf_malloc (size=437, type=39, typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
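Note how the top frames alternate between gf_mem_set_acct_info() and _gf_log_callingfn(): once xl->mem_acct is NULL, logging the failed assertion allocates a buffer through the same accounting allocator, which trips the same assertion and logs again, so the allocator and the logger keep calling each other until the stack (frame numbers above 44000) is exhausted. The following is only a minimal sketch of that feedback loop, with hypothetical names (acct_malloc, log_error) rather than GlusterFS code, and with an artificial depth cap so it terminates instead of overflowing:

#include <stdio.h>
#include <stdlib.h>

static int depth;                 /* guard the real daemon does not have */

static void *acct_malloc(size_t size, int have_acct);

static void log_error(const char *msg) {
    /* Stand-in for _gf_log_callingfn(): builds the log line in a heap
     * buffer obtained from the accounting allocator (as gf_asprintf() does). */
    char *buf = acct_malloc(256, 0 /* accounting info is still missing */);
    if (buf) {
        snprintf(buf, 256, "E %s", msg);
        free(buf);
    }
}

static void *acct_malloc(size_t size, int have_acct) {
    if (!have_acct) {
        /* Stand-in for gf_mem_set_acct_info(): the failed assertion is
         * logged, which allocates, which fails the assertion again, ... */
        if (++depth > 20) {
            fprintf(stderr, "stopped after %d nested log calls\n", depth);
            exit(1);              /* the real process keeps recursing and overflows */
        }
        log_error("Assertion failed: xl->mem_acct != NULL");
    }
    return malloc(size);
}

int main(void) {
    /* One allocation on behalf of an xlator whose mem_acct was never set up. */
    acct_malloc(64, 0);
    return 0;
}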
I think once http://review.gluster.org/10824 gets backported to downstream, this issue won't be seen.
RCA
----
The stack overflow was seen when older snapshots were being deleted while new ones were being created concurrently. In the setup detailed above, the snapshot scheduler creates snapshots periodically and auto-delete of snapshots is enabled. When the number of snapshots of the volume in the system exceeds the configured soft-limit, snapshots are auto-deleted. The crash happened when a scheduled snapshot-create coincided with an auto-delete-triggered snapshot-delete operation.

Implementation detail
----------------------
The snapshot daemon uses the gfapi interface to serve user-serviceable snapshots. gfapi creates a new glfs object for every snapshot (volume) it services. This object is 'linked' with a global xlator object until the glfs object is fully initialized (i.e., the set-volume operation is complete). The global xlator object's ctx (glusterfs_ctx_t) is modified in a thread-unsafe manner and could refer to a destroyed ctx (one that belonged to the glfs representing a deleted snapshot).

Fix outline
------------
All initialization management operations (e.g., RPCs like DUMP_VERSION, SET_VOLUME, etc.) must refer to the corresponding translator objects in the glfs' graph; a rough sketch of the idea follows below.
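As a rough illustration of the fix outline above, the sketch below contrasts the two patterns. All names (glfs_like, fake_ctx, mgmt_cbk_*) are hypothetical, not actual gfapi/GlusterFS types; it only shows why a management callback that dereferences a process-wide "global" xlator can end up on the ctx of a different (possibly already destroyed) glfs instance, while a callback that holds the translator from its own glfs graph always answers for the instance it serves.

/* Illustrative sketch only -- not GlusterFS code. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; } fake_ctx;
typedef struct { fake_ctx *ctx; } fake_xlator;

/* Unsafe pattern: one global xlator shared by every in-flight glfs instance. */
static fake_xlator global_xl;

typedef struct {
    fake_ctx   *ctx;    /* owned by this instance (one per snapshot volume) */
    fake_xlator xl;     /* translator belonging to this instance's graph    */
} glfs_like;

static glfs_like *glfs_like_new(int id) {
    glfs_like *fs = calloc(1, sizeof(*fs));
    fs->ctx = calloc(1, sizeof(*fs->ctx));
    fs->ctx->id = id;
    fs->xl.ctx = fs->ctx;
    /* Unsafe: every init re-points the shared global xlator at the newest
     * ctx, so callbacks queued for an older instance may dereference a ctx
     * they never owned. */
    global_xl.ctx = fs->ctx;
    return fs;
}

static void glfs_like_fini(glfs_like *fs) {
    /* If this instance was the last to run init, global_xl.ctx now dangles. */
    free(fs->ctx);
    free(fs);
}

/* Callback for a management RPC (DUMP_VERSION / SET_VOLUME in the RCA). */
static void mgmt_cbk_unsafe(void) {
    /* May read a ctx belonging to a different (or deleted) snapshot's glfs. */
    printf("unsafe cbk sees ctx id %d\n", global_xl.ctx->id);
}

static void mgmt_cbk_fixed(glfs_like *fs) {
    /* Fix outline: use the xlator from this glfs' own graph instead. */
    printf("fixed cbk sees ctx id %d\n", fs->xl.ctx->id);
}

int main(void) {
    glfs_like *snap_a = glfs_like_new(1);   /* snapshot being deleted */
    glfs_like *snap_b = glfs_like_new(2);   /* snapshot being created */

    mgmt_cbk_unsafe();      /* answers with snap_b's ctx even if issued for snap_a */
    mgmt_cbk_fixed(snap_a); /* always answers with snap_a's own ctx               */

    glfs_like_fini(snap_a);
    glfs_like_fini(snap_b);
    return 0;
}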
Atin, the patch you refer to in comment #2 doesn't fix the issue. Please refer to the RCA in comment #3.
The doc text has been edited. Please sign off so it can be included in Known Issues.
Upstream patch on master branch - http://review.gluster.com/11436
https://code.engineering.redhat.com/gerrit/#/c/55164/1
Bug verified on build glusterfs-3.7.1-15.el7rhgs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1845.html