Description of problem:
glusterd crashed when it failed to create the geo-rep status file.

bt from core file
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
(gdb) bt
#0  __gf_free (free_ptr=0x7f06c4045c70) at mem-pool.c:252
#1  0x00007f06d8159cd4 in _local_gsyncd_start (this=<value optimized out>, key=<value optimized out>, value=<value optimized out>, data=0xda8110) at glusterd-utils.c:6958
#2  0x0000003723a18ce5 in dict_foreach (dict=0x7f06e028e308, fn=0x7f06d8159ad0 <_local_gsyncd_start>, data=0xda8110) at dict.c:1127
#3  0x00007f06d813e6af in glusterd_volume_restart_gsyncds (volinfo=0xda8110) at glusterd-utils.c:6984
#4  0x00007f06d813e718 in glusterd_restart_gsyncds (conf=0xd9e980) at glusterd-utils.c:6995
#5  0x00007f06d8156dd8 in glusterd_spawn_daemons (opaque=<value optimized out>) at glusterd-utils.c:3579
#6  0x0000003723a5b742 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#7  0x0000003e09c43bf0 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

bt in log file
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-14 08:41:24.424692] D [socket.c:499:__socket_rwv] 0-socket.management: EOF on socket
[2014-07-14 08:41:24.424732] D [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-07-14 08:41:24.529252] E [glusterd-geo-rep.c:1942:glusterd_create_status_file] 0-: Creating status file failed.
[2014-07-14 08:41:24.529300] D [glusterd-geo-rep.c:1949:glusterd_create_status_file] 0-: returning -1
[2014-07-14 08:41:24.529326] E [glusterd-utils.c:6967:_local_gsyncd_start] 0-: Unable to create status file. Error : Bad file descriptor
[2014-07-14 08:41:24.529490] E [mem-pool.c:242:__gf_free] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_volume_restart_gsyncds+0x1f) [0x7f06d813e6af] (-->/usr/lib64/libglusterfs.so.0(dict_foreach+0x45) [0x3723a18ce5] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(_local_gsyncd_start+0x204) [0x7f06d8159cd4]))) 0-: Assertion failed: GF_MEM_HEADER_MAGIC == *(uint32_t *)ptr
[2014-07-14 08:41:24.529531] D [logging.c:1805:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2014-07-14 08:41:24.529558] D [logging.c:1808:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-07-14 08:41:24
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.24
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3723a1fe56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3723a3a28f]
/lib64/libc.so.6[0x3e09c329a0]
/usr/lib64/libglusterfs.so.0(__gf_free+0xcc)[0x3723a4d17c]
/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(_local_gsyncd_start+0x204)[0x7f06d8159cd4]
/usr/lib64/libglusterfs.so.0(dict_foreach+0x45)[0x3723a18ce5]
/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_volume_restart_gsyncds+0x1f)[0x7f06d813e6af]
/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_restart_gsyncds+0x28)[0x7f06d813e718]
/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_spawn_daemons+0x38)[0x7f06d8156dd8]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3723a5b742]
/lib64/libc.so.6[0x3e09c43bf0]
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.24-1.el6rhs

How reproducible:
Did not try to reproduce. The crash happened while restoring a snapshot in a geo-rep setup and was a side effect of something unknown that happened on the setup, so it is not clear how to reproduce the glusterd crash. The steps below describe what was being done when it happened.

Steps to Reproduce:
1. Create a geo-rep setup.
2. Take multiple snapshots with geo-rep while I/O is happening on the master (follow the steps to create a snapshot with geo-rep).
3. Restore the two most recent snapshots, then restore one of the older snapshots (follow the steps to restore snapshots with geo-rep).
4. In the case where this crash happened, geo-rep start failed after the third snapshot restore with:
   "Staging failed on 10.70.43.107. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
   "Staging failed on 10.70.43.162. Error: state-file entry missing in the config file(/var/lib/glusterd/geo-replication/master_10.70.43.170_slave/gsyncd.conf)"
5. After this, restarting glusterd on the nodes where staging had failed (including the node where all the commands were executed) resulted in a glusterd crash on both of them; the crash was caused by the failure to create the geo-rep status file.

Actual results:
glusterd crashed when it failed to create the geo-rep status file.

Expected results:
glusterd should not crash.

Additional info:
On a node where the status file fails to be created during glusterd startup for any reason (for example, a runner_run failure), this bug causes glusterd to crash. The problem persists for as long as status-file creation keeps failing, so glusterd will keep crashing on every restart.
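The assertion in the log ("GF_MEM_HEADER_MAGIC == *(uint32_t *)ptr") indicates that __gf_free() was handed a pointer that does not carry the header written by GF_CALLOC/GF_MALLOC. As an illustration only (this is not the glusterd source; create_status_file and statefile are made-up stand-ins), the minimal C sketch below shows that kind of error-path mistake and a safer cleanup pattern:

/* sketch.c: error path frees a pointer it never allocated */
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a status-file creation step that fails and leaves the
 * output pointer aimed at memory that was never heap-allocated. */
static int create_status_file(char **statefile)
{
        static char scratch[] = "/var/lib/glusterd/geo-replication/...";
        *statefile = scratch;   /* not from malloc()/GF_CALLOC() */
        return -1;              /* creation failed */
}

int main(void)
{
        char *statefile = NULL;
        int   ret = create_status_file(&statefile);

        if (ret) {
                fprintf(stderr, "Unable to create status file.\n");
                /* Buggy cleanup: freeing a non-heap pointer; with the
                 * glusterfs allocator this trips the header-magic check
                 * in __gf_free() and the daemon segfaults.
                 *
                 *     free(statefile);
                 */
                /* Safer pattern: free only buffers this function itself
                 * allocated, then reset the pointer so any later cleanup
                 * does not touch it again. */
                statefile = NULL;
        }
        return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}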
Fix at https://code.engineering.redhat.com/gerrit/#/c/29533/
Verified in build glusterfs-3.6.0.27-1.

Steps:
1. Since there are no reliable steps to reproduce, verification was done with gdb by setting a breakpoint at _local_gsyncd_start.
2. The scenario can be hit by making glusterd restart the gsyncd processes and setting a few variables to force the status-file creation failure path.
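As a rough illustration of that approach (a hypothetical gdb session; the variable named ret here is an assumption and depends on the source revision), the error path can be forced roughly like this:

# run glusterd in the foreground under gdb
gdb --args /usr/sbin/glusterd -N
(gdb) break _local_gsyncd_start
(gdb) run
# breakpoint hits while glusterd_restart_gsyncds() spawns gsyncd at startup
(gdb) next                     # step until the status-file creation call has returned
(gdb) set variable ret = -1    # pretend status-file creation failed
(gdb) continue
# unpatched build: crash in __gf_free(); fixed build: the error is logged and glusterd keeps running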
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html