Description of problem:
=======================
Client mount hung while running plain file creation, directory creation, and a linux untar on a disperse volume. No bricks were brought down during the IO. Below is the gdb backtrace of the process.

Backtrace:
==========
(gdb) thread apply all bt

Thread 8 (Thread 0x7f3e6dd5c700 (LWP 9691)):
#0  0x00000032aa80efbd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000030770454da in gf_timer_proc (ctx=0x1d08010) at timer.c:195
#2  0x00000032aa807a51 in start_thread (arg=0x7f3e6dd5c700) at pthread_create.c:301
#3  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7f3e6d35b700 (LWP 9692)):
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47cd96 in _L_lock_2632 () at hooks.c:129
#2  0x00000032aa477105 in __libc_mallinfo () at malloc.c:4254
#3  0x000000307705abc9 in gf_proc_dump_mem_info () at statedump.c:302
#4  0x000000307705bac2 in gf_proc_dump_info (signum=<value optimized out>, ctx=0x1d08010) at statedump.c:818
#5  0x0000000000405df1 in glusterfs_sigwaiter (arg=<value optimized out>) at glusterfsd.c:1996
#6  0x00000032aa807a51 in start_thread (arg=0x7f3e6d35b700) at pthread_create.c:301
#7  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7f3e6b11d700 (LWP 9695)):
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47d29f in _L_lock_9730 () at hooks.c:129
#2  0x00000032aa47a88b in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4094
#3  0x0000003077065a7e in __gf_default_calloc (size=2097152, cnt=1) at mem-pool.h:118
#4  0x0000003077066067 in synctask_create (env=0x1d35db0, fn=0x7f3e6a2c4bc0 <ec_synctask_heal_wrap>, cbk=0x7f3e6a2bc0c0 <ec_heal_done>, frame=<value optimized out>, opaque=0x7f3e594fed94) at syncop.c:497
#5  0x00000030770692b9 in synctask_new (env=<value optimized out>, fn=<value optimized out>, cbk=0x7f3e6a2bc0c0 <ec_heal_done>, frame=<value optimized out>, opaque=<value optimized out>) at syncop.c:566
#6  0x00007f3e6a2bc375 in ec_heal (frame=0x0, this=0x7f3e640265c0, target=18446744073709551615, minimum=-1, func=0x7f3e6a28b010 <ec_heal_report>, data=<value optimized out>, loc=0x7f3e435815b8, partial=0, xdata=0x0) at ec-heal.c:3707
#7  0x00007f3e6a28b27c in ec_check_status (fop=0x7f3e594e6f5c) at ec-common.c:167
#8  0x00007f3e6a2a699c in ec_combine (newcbk=0x7f3e590e2964, combine=<value optimized out>) at ec-combine.c:931
#9  0x00007f3e6a2a46d5 in ec_inode_write_cbk (frame=<value optimized out>, this=0x7f3e640265c0, cookie=<value optimized out>, op_ret=512, op_errno=<value optimized out>, prestat=0x7f3e6b11cb10, poststat=0x7f3e6b11caa0, xdata=0x7f3e73dd3460) at ec-inode-write.c:60
#10 0x00007f3e6a508a3c in client3_3_writev_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f3e743ebe58) at client-rpc-fops.c:860
#11 0x000000307740ed75 in rpc_clnt_handle_reply (clnt=0x7f3e6452f7f0, pollin=0x7f3e435f4de0) at rpc-clnt.c:766
#12 0x0000003077410212 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7f3e6452f820, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:894
#13 0x000000307740b8e8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
#14 0x00007f3e6b34dbcd in socket_event_poll_in (this=0x7f3e6453f460) at socket.c:2290
#15 0x00007f3e6b34f6fd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f3e6453f460, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#16 0x0000003077080f70 in event_dispatch_epoll_handler (data=0x1d70680) at event-epoll.c:572
#17 event_dispatch_epoll_worker (data=0x1d70680) at event-epoll.c:674
#18 0x00000032aa807a51 in start_thread (arg=0x7f3e6b11d700) at pthread_create.c:301
#19 0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7f3e60acd700 (LWP 9720)):
#0  0x00000032aa4df143 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00000032aa516010 in svc_run () at svc_run.c:84
#2  0x00007f3e697b2e54 in nsm_thread (argv=<value optimized out>) at nlmcbk_svc.c:121
#3  0x00000032aa807a51 in start_thread (arg=0x7f3e60acd700) at pthread_create.c:301
#4  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7f3e5bfff700 (LWP 9721)):
#0  0x00000032aa4e8f63 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003077080dd9 in event_dispatch_epoll_worker (data=0x7f3e640c4cc0) at event-epoll.c:664
#2  0x00000032aa807a51 in start_thread (arg=0x7f3e5bfff700) at pthread_create.c:301
#3  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7f3e5365c700 (LWP 9772)):
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47cf7e in _L_lock_5746 () at hooks.c:129
#2  0x00000032aa478a8b in _int_free (av=0x32aa78fe80, p=0x1d71760, have_lock=0) at malloc.c:4967
#3  0x00000030770690d2 in synctask_destroy (task=0x7f3e43605900) at syncop.c:391
#4  0x00000030770695a0 in syncenv_processor (thdata=0x1d36530) at syncop.c:687
#5  0x00000032aa807a51 in start_thread (arg=0x7f3e5365c700) at pthread_create.c:301
#6  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7f3e2bfff700 (LWP 10950)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000030770650db in syncenv_task (proc=0x1d36cb0) at syncop.c:591
#2  0x00000030770695b0 in syncenv_processor (thdata=0x1d36cb0) at syncop.c:683
#3  0x00000032aa807a51 in start_thread (arg=0x7f3e2bfff700) at pthread_create.c:301
#4  0x00000032aa4e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f3e750e4740 (LWP 9690)):
#0  0x00000032aa8082ad in pthread_join (threadid=139906061031168, thread_return=0x0) at pthread_join.c:89
#1  0x0000003077080a6d in event_dispatch_epoll (event_pool=0x1d26c90) at event-epoll.c:759
#2  0x0000000000407ad4 in main (argc=11, argv=0x7fff254647f8) at glusterfsd.c:2326
(gdb)

Version-Release number of selected component (if applicable):
=============================================================
[root@interstellar gluster]# gluster --version
glusterfs 3.7.1 built on Jun 9 2015 02:31:56
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@interstellar gluster]#

How reproducible:
=================
Seen once.

Steps to Reproduce:
1. Create an 8+3 disperse volume.
2. NFS-mount the volume on a client; create files and directories and run a linux untar.

Actual results:
===============
Client mount hung.

Expected results:

Additional info:
Correction:

How reproducible:
=================
100%. Rebooted the client, mounted the volume, ran the IO again, and it hung.
Will pick up these builds in a day or two and try to reproduce.
The fuse mount hung too, but in that case 2 of the bricks had been taken down. I have picked up the debug builds and am trying to reproduce.
(In reply to Bhaskarakiran from comment #6)
> Fuse mount too hung but with taking down 2 of the bricks. I have taken up
> the debug builds and trying to reproduce.

Could you check whether this issue is observed on volume types other than disperse (erasure-coded)?
The hang is still seen on the fuse mount.

[root@rhs-client29 ~]# mount
/dev/mapper/vg_rhsclient29-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_rhsclient29-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
transformers:/vol2 on /mnt/fuse type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

[root@ninja ~]# gluster v status vol2
Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ninja:/rhs/brick1/vol2-1              49157     0          Y       2731
Brick ninja:/rhs/brick2/vol2-2              49158     0          Y       2740
Brick ninja:/rhs/brick3/vol2-3              49159     0          Y       2747
Brick ninja:/rhs/brick4/vol2-4              49160     0          Y       2754
Brick vertigo:/rhs/brick1/vol2-5            49156     0          Y       27613
Brick vertigo:/rhs/brick2/vol2-6            49157     0          Y       19504
Brick vertigo:/rhs/brick3/vol2-7            49158     0          Y       19511
Brick ninja:/rhs/brick1/vol2-8              49161     0          Y       2765
Brick ninja:/rhs/brick2/vol2-9              49162     0          Y       2770
Brick ninja:/rhs/brick3/vol2-10             49163     0          Y       2779
Brick ninja:/rhs/brick4/vol2-11             49164     0          Y       2786
Snapshot Daemon on localhost                49165     0          Y       2855
NFS Server on localhost                     2049      0          Y       10459
Self-heal Daemon on localhost               N/A       N/A        Y       10486
Snapshot Daemon on 10.70.34.56              49160     0          Y       19539
NFS Server on 10.70.34.56                   2049      0          Y       27648
Self-heal Daemon on 10.70.34.56             N/A       N/A        Y       27670
Snapshot Daemon on transformers             49162     0          Y       12992
NFS Server on transformers                  2049      0          Y       46858
Self-heal Daemon on transformers            N/A       N/A        Y       46881
Snapshot Daemon on interstellar             49166     0          Y       14480
NFS Server on interstellar                  2049      0          Y       48872
Self-heal Daemon on interstellar            N/A       N/A        Y       48882

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

[root@ninja ~]# gluster --version
glusterfs 3.7.1 built on Jun 28 2015 11:01:17
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@ninja ~]#

[root@rhs-client29 ~]# rpm -qa | grep gluster
glusterfs-fuse-3.7.1-6.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-6.el6rhs.x86_64
glusterfs-3.7.1-6.el6rhs.x86_64
glusterfs-api-3.7.1-6.el6rhs.x86_64
glusterfs-libs-3.7.1-6.el6rhs.x86_64
[root@rhs-client29 ~]#

The fuse mount log shows the messages below continuously, even though the volume is up:

[2015-06-29 12:21:23.253607] W [MSGID: 122002] [ec-common.c:122:ec_heal_report] 0-vol2-disperse-0: Heal failed [Input/output error]
[2015-06-29 12:21:23.253934] W [rpc-clnt.c:1571:rpc_clnt_submit] 0-vol2-client-0: failed to submit rpc-request (XID: 0x5fab0 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (vol2-client-0)
[2015-06-29 12:21:23.253972] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-vol2-client-0: remote operation failed. Path: /dirs./dir.31618 (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
[2015-06-29 12:21:23.254944] W [MSGID: 122053] [ec-common.c:166:ec_check_status] 0-vol2-disperse-0: Operation failed on some subvolumes (up=7FF, mask=7FF, remaining=0, good=7EE, bad=11)
Is this supposed to work by disabling client-side heal?
Ran IO for a sufficient time and did not see the hangs with client-side heal disabled. Moving this bug to fixed.
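For anyone repeating this verification: one way to restrict client-side healing on a disperse volume might look like the commands below. The option names here are assumptions on my part, not confirmed against this build; verify them with `gluster volume set help` on the installed version before use.

```shell
# Hypothetical sketch: option names are assumptions, verify with
# `gluster volume set help` before running on a real volume.
gluster volume set vol2 disperse.background-heals 0    # no background heals from clients
gluster volume set vol2 disperse.heal-wait-qlength 0   # do not queue heals on clients
```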
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html