Description of problem:
=======================
While trying to list the snapshots under the .snaps folder, snapd crashed with client.event-threads and server.event-threads set to 4 and 5 respectively.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.6.0.51

How reproducible:
=================
Tried once

Steps to Reproduce:
===================
1. Create a 6x2 dist-rep volume and start it
   - enable quota on the volume
   - set client.event-threads to 4
   - set server.event-threads to 5
2. FUSE and NFS mount the volume and create some I/O on it:
   for i in {1..20} ; do cp -rvf /etc fetc.$i ; done
   for i in {1..20} ; do cp -rvf /etc n1etc.$i ; done
3. Create 256 snapshots in a loop
4. Activate 60 snapshots in a loop
5. Enable USS on the volume
6. After creation of all 256 snapshots completes, activate the remaining snapshots

From the fuse mount, cd to .snaps and list the snapshots -> successful
From the NFS mount, cd to .snaps and list the snapshots -> "No such file/directory"
From the fuse mount, cd to .snaps and list the snapshots again -> failed with "Transport endpoint is not connected"

Actual results:
===============
While listing the snapshots under .snaps, snapd crashed.

Expected results:
=================
All snapshots should be listed under .snaps successfully.

Additional info:
================
[root@inception core]# gluster v i

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: d1ac6dec-a438-4c9f-9b0a-671396088e40
Status: Started
Snap Volume: no
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: inception.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: inception.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick6: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick7: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick8: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick2/b2
Brick9: inception.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick10: rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick11: rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick3/b3
Brick12: rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick3/b3
Options Reconfigured:
features.uss: enable
features.barrier: disable
client.event-threads: 4
server.event-threads: 5
features.quota: on
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

bt:
===
om_err-1.41.12-21.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 openssl-1.0.1e-30.el6_6.5.x86_64 readline-6.0-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x0000003f236093a0 in ?? ()
#1  0x00000033e4425060 in gf_log_set_log_buf_size (buf_size=0) at logging.c:256
#2  0x00000033e44251ff in gf_log_disable_suppression_before_exit (ctx=0x22b3010) at logging.c:427
#3  0x00000033e443bac5 in gf_print_trace (signum=11, ctx=0x22b3010) at common-utils.c:493
#4  0x0000003f232326a0 in ?? ()
#5  0x0000000000000000 in ?? ()

[root@inception core]# gluster v status vol0
Status of volume: vol0
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick inception.lab.eng.blr.redhat.com:/rhs/brick1/b1        49155     0          Y       24032
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick1/b1    49155     0          Y       1709
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick1/b1    49155     0          Y       1212
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick1/b1    49155     0          Y       27968
Brick inception.lab.eng.blr.redhat.com:/rhs/brick2/b2        49156     0          Y       24045
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick2/b2    49156     0          Y       1722
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick2/b2    49156     0          Y       1226
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick2/b2    49156     0          Y       27981
Brick inception.lab.eng.blr.redhat.com:/rhs/brick3/b3        49157     0          Y       24058
Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/rhs/brick3/b3    49157     0          Y       1735
Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/rhs/brick3/b3    49157     0          Y       1239
Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/rhs/brick3/b3    49157     0          Y       27994
Snapshot Daemon on localhost                                 N/A       N/A        N       31079
NFS Server on localhost                                      2049      0          Y       31087
Self-heal Daemon on localhost                                N/A       N/A        Y       24079
Quota Daemon on localhost                                    N/A       N/A        Y       24119
Snapshot Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com      49845     0          Y       24540
NFS Server on rhs-arch-srv2.lab.eng.blr.redhat.com           2049      0          Y       24548
Self-heal Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com     N/A       N/A        Y       1756
Quota Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com         N/A       N/A        Y       1776
Snapshot Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com      49845     0          Y       27158
NFS Server on rhs-arch-srv4.lab.eng.blr.redhat.com           2049      0          Y       27171
Self-heal Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com     N/A       N/A        Y       28015
Quota Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com         N/A       N/A        Y       28034
Snapshot Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com      49845     0          Y       31320
NFS Server on rhs-arch-srv3.lab.eng.blr.redhat.com           2049      0          Y       31330
Self-heal Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com     N/A       N/A        Y       1260
Quota Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com         N/A       N/A        Y       1279

Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

Following is the bt from the crash:

#0  __pthread_mutex_lock (mutex=0x320) at pthread_mutex_lock.c:50
#1  0x00000033e4425060 in gf_log_set_log_buf_size (buf_size=0) at logging.c:256
#2  0x00000033e44251ff in gf_log_disable_suppression_before_exit (ctx=0x22b3010) at logging.c:427
#3  0x00000033e443bac5 in gf_print_trace (signum=11, ctx=0x22b3010) at common-utils.c:493
#4  <signal handler called>
#5  0x00000033e444f731 in __gf_free (free_ptr=0x7f911ef33c50) at mem-pool.c:231
#6  0x00000033e443da02 in gf_timer_proc (ctx=0x7f911ef35630) at timer.c:207
#7  0x0000003f236079d1 in start_thread (arg=0x7f8eb197b700) at pthread_create.c:301
#8  0x0000003f232e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

In the test case, multiple snapshots were created and then activated, and after activation the snapshots were accessed through USS; the crash is seen while accessing these snapshots. Code-wise, the crash happens during destruction of the timer thread. The timer thread is destroyed as part of glfs_fini. Normally glfs_fini is called when snapshots are deactivated or deleted, but in this case no snapshots were deleted or deactivated; here glfs_fini is called because glfs_init failed. For some reason the snapshot brick is not in the started state, which leads to the glfs_init failure.
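For context, snapd serves the snapshot view through libgfapi, so the failure path described above is the usual gfapi pattern: if glfs_init() fails, the caller invokes glfs_fini() to release what was allocated, and that teardown is what ends up destroying the timer thread. Below is a minimal standalone sketch of that call pattern; the volume name, server, and log path are placeholders, not values taken from this setup.

/* Minimal libgfapi init/cleanup pattern; illustrates the path that leads
 * into glfs_fini() when initialization fails. Placeholders only, not the
 * snapd/snapview-server code. Build: gcc demo.c -lgfapi */
#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main (void)
{
        glfs_t *fs = glfs_new ("snap-volume-placeholder");
        if (!fs)
                return 1;

        glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
        glfs_set_logging (fs, "/tmp/gfapi-demo.log", 7);

        if (glfs_init (fs) != 0) {
                /* e.g. the snapshot brick is not started: clean up the
                 * partially initialized context. This cleanup call is the
                 * one that tears down the timer thread. */
                fprintf (stderr, "glfs_init failed\n");
                glfs_fini (fs);
                return 1;
        }

        /* ... serve the snapshot view ... */
        glfs_fini (fs);
        return 0;
}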
We could not figure out the exact cause of this, since the brick and snapshot logs were missing from the sos-report. In any case, when glfs_init fails we call glfs_fini to clean up the resources that were allocated. In the timer thread the current THIS is overwritten and never restored, so THIS ends up with a wrong value, which causes the segmentation fault in the __gf_free function.
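The fix that follows from this is to save and restore the per-thread xlator pointer around each timer callback instead of overwriting it. The snippet below is a self-contained toy analogue of that save/restore pattern using a plain thread-local pointer; it is not the actual timer.c code, and every name in it is invented for illustration.

/* Toy analogue of the save/restore fix: a per-thread context pointer
 * ("THIS" in GlusterFS terms) must be saved and restored around each
 * callback, otherwise later code in the same thread dereferences whatever
 * the last callback left behind. Not GlusterFS code. */
#include <stdio.h>
#include <pthread.h>

struct ctx { const char *name; };

/* per-thread "current context", analogous to THIS in libglusterfs */
static __thread struct ctx *this_ctx;

static void callback (struct ctx *owner)
{
        printf ("callback running as %s\n", owner->name);
}

static void *timer_proc (void *arg)
{
        struct ctx thread_default = { "timer-thread-default" };
        struct ctx events[2] = { { "event-0" }, { "event-1" } };
        int i;

        (void) arg;
        this_ctx = &thread_default;

        for (i = 0; i < 2; i++) {
                struct ctx *saved = this_ctx;   /* save, like old_THIS      */
                this_ctx = &events[i];          /* switch to the event owner */
                callback (this_ctx);
                this_ctx = saved;               /* restore: without this,
                                                   the cleanup below would
                                                   run with a stale context */
        }

        /* the cleanup path still sees the thread's own context */
        printf ("cleanup running as %s\n", this_ctx->name);
        return NULL;
}

int main (void)
{
        pthread_t t;
        pthread_create (&t, NULL, timer_proc, NULL);
        pthread_join (t, NULL);
        return 0;
}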
REVIEW: http://review.gluster.org/9895 (libgfapi, timer: Fix a crash seen in timer when glfs_fini was invoked.) posted (#1) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/9895 (libgfapi, timer: Fix a crash seen in timer when glfs_fini was invoked.) posted (#2) for review on master by Rajesh Joseph (rjoseph)
REVIEW: http://review.gluster.org/9895 (libgfapi, timer: Fix a crash seen in timer when glfs_fini was invoked.) posted (#3) for review on master by Rajesh Joseph (rjoseph)
COMMIT: http://review.gluster.org/9895 (libgfapi, timer: Fix a crash seen in timer when glfs_fini was invoked.) committed in master by Vijay Bellur (vbellur)
------
commit c99c72b35fac16e08c4d170b6a46a786caaeef58
Author: Poornima G <pgurusid>
Date:   Mon Mar 16 15:47:30 2015 +0530

    libgfapi, timer: Fix a crash seen in timer when glfs_fini was invoked.

    The crash is seen when, glfs_init failed for some reason and glfs_fini
    was called for cleaning up the partial initialization.

    The fix is in two folds:
    1. In timer store and restore the THIS, previously it was being
       overwritten.
    2. In glfs_free_from_ctx() and glfs_fini() check for NULL before
       destroying.

    Change-Id: If40bf69936b873a1da8e348c9d92c66f2f07994b
    BUG: 1202290
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/9895
    Reviewed-by: Raghavendra Talur <rtalur>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Raghavendra Bhat <raghavendra>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
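For the second fold (NULL checks before destroying), the idea is simply that the cleanup path must tolerate a context whose members were never allocated because initialization failed partway through. The following is a toy, self-contained illustration of that guard pattern; my_ctx and my_ctx_cleanup are invented names, and this is not the actual glfs_free_from_ctx()/glfs_fini() code.

/* Toy analogue of NULL-guarded cleanup: safe to call on a context that was
 * only partially initialized. Not libgfapi code. */
#include <stdlib.h>
#include <pthread.h>

struct my_ctx {
        pthread_mutex_t *lock;     /* may still be NULL if init failed early */
        char            *logfile;  /* may still be NULL as well              */
};

static void my_ctx_cleanup (struct my_ctx *ctx)
{
        if (!ctx)
                return;

        if (ctx->lock) {                       /* guard against partial init */
                pthread_mutex_destroy (ctx->lock);
                free (ctx->lock);
        }
        free (ctx->logfile);                   /* free(NULL) is a no-op */
        free (ctx);
}

int main (void)
{
        /* simulate init failing before the lock was ever created */
        struct my_ctx *ctx = calloc (1, sizeof (*ctx));
        my_ctx_cleanup (ctx);                  /* must not crash */
        return 0;
}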
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user