Bug 1230612 - Disperse volume: NFS and FUSE mounts hung with plain IO
Summary: Disperse volume: NFS and FUSE mounts hung with plain IO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Pranith Kumar K
QA Contact: Bhaskarakiran
URL:
Whiteboard:
Depends On: 1227197
Blocks: 1202842 1234768
 
Reported: 2015-06-11 08:57 UTC by Bhaskarakiran
Modified: 2016-11-23 23:12 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.7.1-7.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1234768
Environment:
Last Closed: 2015-07-29 05:02:05 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2015:1495 (SHIPPED_LIVE): Important: Red Hat Gluster Storage 3.1 update (last updated 2015-07-29 08:26:26 UTC)

Description Bhaskarakiran 2015-06-11 08:57:53 UTC
Description of problem:
=======================
The client mount hung while creating plain files and directories and running a Linux kernel untar on a disperse volume. No bricks were brought down during the IO. Below is the gdb backtrace of the hung process.

Backtrace:
==========
(gdb) thread apply all bt

Thread 8 (Thread 0x7f3e6dd5c700 (LWP 9691)):
#0  0x00000032aa80efbd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000030770454da in gf_timer_proc (ctx=0x1d08010) at timer.c:195
#2  0x00000032aa807a51 in start_thread (arg=0x7f3e6dd5c700)
    at pthread_create.c:301
#3  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7f3e6d35b700 (LWP 9692)):
#0  __lll_lock_wait_private ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47cd96 in _L_lock_2632 () at hooks.c:129
#2  0x00000032aa477105 in __libc_mallinfo () at malloc.c:4254
#3  0x000000307705abc9 in gf_proc_dump_mem_info () at statedump.c:302
#4  0x000000307705bac2 in gf_proc_dump_info (signum=<value optimized out>, 
    ctx=0x1d08010) at statedump.c:818
#5  0x0000000000405df1 in glusterfs_sigwaiter (arg=<value optimized out>)
    at glusterfsd.c:1996
#6  0x00000032aa807a51 in start_thread (arg=0x7f3e6d35b700)
    at pthread_create.c:301
#7  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7f3e6b11d700 (LWP 9695)):
#0  __lll_lock_wait_private ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47d29f in _L_lock_9730 () at hooks.c:129
#2  0x00000032aa47a88b in __libc_calloc (n=<value optimized out>, 
    elem_size=<value optimized out>) at malloc.c:4094
#3  0x0000003077065a7e in __gf_default_calloc (size=2097152, cnt=1)
    at mem-pool.h:118
#4  0x0000003077066067 in synctask_create (env=0x1d35db0, 
    fn=0x7f3e6a2c4bc0 <ec_synctask_heal_wrap>, 
    cbk=0x7f3e6a2bc0c0 <ec_heal_done>, frame=<value optimized out>, 
    opaque=0x7f3e594fed94) at syncop.c:497
#5  0x00000030770692b9 in synctask_new (env=<value optimized out>, 
    fn=<value optimized out>, cbk=0x7f3e6a2bc0c0 <ec_heal_done>, 
    frame=<value optimized out>, opaque=<value optimized out>) at syncop.c:566
#6  0x00007f3e6a2bc375 in ec_heal (frame=0x0, this=0x7f3e640265c0, 
    target=18446744073709551615, minimum=-1, 
    func=0x7f3e6a28b010 <ec_heal_report>, data=<value optimized out>, 
    loc=0x7f3e435815b8, partial=0, xdata=0x0) at ec-heal.c:3707
#7  0x00007f3e6a28b27c in ec_check_status (fop=0x7f3e594e6f5c)
    at ec-common.c:167
#8  0x00007f3e6a2a699c in ec_combine (newcbk=0x7f3e590e2964, 
    combine=<value optimized out>) at ec-combine.c:931
#9  0x00007f3e6a2a46d5 in ec_inode_write_cbk (frame=<value optimized out>, 
    this=0x7f3e640265c0, cookie=<value optimized out>, op_ret=512, 
    op_errno=<value optimized out>, prestat=0x7f3e6b11cb10, 
    poststat=0x7f3e6b11caa0, xdata=0x7f3e73dd3460) at ec-inode-write.c:60
#10 0x00007f3e6a508a3c in client3_3_writev_cbk (req=<value optimized out>, 
    iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x7f3e743ebe58) at client-rpc-fops.c:860
#11 0x000000307740ed75 in rpc_clnt_handle_reply (clnt=0x7f3e6452f7f0, 
    pollin=0x7f3e435f4de0) at rpc-clnt.c:766
#12 0x0000003077410212 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7f3e6452f820, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:894
#13 0x000000307740b8e8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#14 0x00007f3e6b34dbcd in socket_event_poll_in (this=0x7f3e6453f460)
    at socket.c:2290
#15 0x00007f3e6b34f6fd in socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7f3e6453f460, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2403
#16 0x0000003077080f70 in event_dispatch_epoll_handler (data=0x1d70680)
    at event-epoll.c:572
#17 event_dispatch_epoll_worker (data=0x1d70680) at event-epoll.c:674
#18 0x00000032aa807a51 in start_thread (arg=0x7f3e6b11d700)
    at pthread_create.c:301
#19 0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7f3e60acd700 (LWP 9720)):
#0  0x00000032aa4df143 in __poll (fds=<value optimized out>, 
    nfds=<value optimized out>, timeout=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00000032aa516010 in svc_run () at svc_run.c:84
#2  0x00007f3e697b2e54 in nsm_thread (argv=<value optimized out>)
    at nlmcbk_svc.c:121
#3  0x00000032aa807a51 in start_thread (arg=0x7f3e60acd700)
    at pthread_create.c:301
#4  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7f3e5bfff700 (LWP 9721)):
#0  0x00000032aa4e8f63 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003077080dd9 in event_dispatch_epoll_worker (data=0x7f3e640c4cc0)
    at event-epoll.c:664
#2  0x00000032aa807a51 in start_thread (arg=0x7f3e5bfff700)
    at pthread_create.c:301
#3  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7f3e5365c700 (LWP 9772)):
#0  __lll_lock_wait_private ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00000032aa47cf7e in _L_lock_5746 () at hooks.c:129
#2  0x00000032aa478a8b in _int_free (av=0x32aa78fe80, p=0x1d71760, have_lock=0)
    at malloc.c:4967
#3  0x00000030770690d2 in synctask_destroy (task=0x7f3e43605900) at syncop.c:391
#4  0x00000030770695a0 in syncenv_processor (thdata=0x1d36530) at syncop.c:687
#5  0x00000032aa807a51 in start_thread (arg=0x7f3e5365c700)
    at pthread_create.c:301
#6  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7f3e2bfff700 (LWP 10950)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000030770650db in syncenv_task (proc=0x1d36cb0) at syncop.c:591
#2  0x00000030770695b0 in syncenv_processor (thdata=0x1d36cb0) at syncop.c:683
#3  0x00000032aa807a51 in start_thread (arg=0x7f3e2bfff700)
    at pthread_create.c:301
#4  0x00000032aa4e896d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f3e750e4740 (LWP 9690)):
#0  0x00000032aa8082ad in pthread_join (threadid=139906061031168, 
    thread_return=0x0) at pthread_join.c:89
#1  0x0000003077080a6d in event_dispatch_epoll (event_pool=0x1d26c90)
    at event-epoll.c:759
#2  0x0000000000407ad4 in main (argc=11, argv=0x7fff254647f8)
    at glusterfsd.c:2326
(gdb) 
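
For reference, a backtrace like the one above can be captured by attaching gdb to the hung glusterfs client process. A minimal sketch, assuming the process is found by matching its command line (the pgrep pattern and output path are placeholders, not the exact commands used here):

# attach to the hung glusterfs client process and dump all thread backtraces
CLIENT_PID=$(pgrep -f 'glusterfs.*volname' | head -1)    # pgrep pattern is a placeholder
gdb -batch -p "$CLIENT_PID" -ex 'thread apply all bt' > /tmp/client-backtrace.txt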


Version-Release number of selected component (if applicable):
=============================================================
[root@interstellar gluster]# gluster --version
glusterfs 3.7.1 built on Jun  9 2015 02:31:56
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@interstellar gluster]# 


How reproducible:
=================
seen once

Steps to Reproduce:
1. Create an 8+3 disperse volume.
2. NFS-mount the volume on a client, then create files and directories and run a Linux kernel untar (a reproduction sketch follows these steps).
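
A hedged reproduction sketch; the hostnames, brick paths, mount point and tarball path below are placeholders, not the exact setup used here:

# 1. create and start an 8+3 disperse volume (11 bricks)
gluster volume create testvol disperse-data 8 redundancy 3 \
    server{1..11}:/bricks/brick1/testvol
gluster volume start testvol

# 2. on the client: NFS-mount (Gluster NFS serves NFSv3) and generate plain IO
mount -t nfs -o vers=3,nolock server1:/testvol /mnt/nfs
cd /mnt/nfs
mkdir -p dirs
for i in $(seq 1 1000); do mkdir dirs/dir.$i; touch dirs/dir.$i/file; done
tar xf /root/linux-4.1.tar.xz -C /mnt/nfs    # any large source-tree untar will do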

Actual results:
===============
The client mount hung.

Expected results:
=================
IO should complete and the client mount should remain responsive.

Additional info:

Comment 2 Bhaskarakiran 2015-06-11 09:02:48 UTC
Correction:

How reproducible:
=================
100%.

Rebooted the client, mounted the volume, and ran IO; it hung again.

Comment 5 Bhaskarakiran 2015-06-17 08:50:05 UTC
Will pick up these builds in a day or two and try to reproduce.

Comment 6 Bhaskarakiran 2015-06-18 10:31:24 UTC
The FUSE mount hung as well, but only after taking down 2 of the bricks. I have picked up the debug builds and am trying to reproduce.
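
For reference, individual bricks are taken down by killing their brick processes while the volume stays up. A sketch only; the volume name and brick paths below are illustrative (taken from the status output shown later in this bug), and the brick Pid is the last column of the status output:

gluster volume status vol2
kill -KILL $(gluster volume status vol2 | awk '$2 == "ninja:/rhs/brick1/vol2-1" {print $NF}')
kill -KILL $(gluster volume status vol2 | awk '$2 == "ninja:/rhs/brick2/vol2-2" {print $NF}')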

Comment 9 krishnan parthasarathi 2015-06-22 13:44:00 UTC
(In reply to Bhaskarakiran from comment #6)
> The FUSE mount hung as well, but only after taking down 2 of the bricks. I
> have picked up the debug builds and am trying to reproduce.

Could you check whether this issue is observed on volume types other than disperse (erasure-coded)?

Comment 12 Bhaskarakiran 2015-06-29 12:25:26 UTC
The hang is still seen on the FUSE mount.


[root@rhs-client29 ~]# mount
/dev/mapper/vg_rhsclient29-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_rhsclient29-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
transformers:/vol2 on /mnt/fuse type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)


[root@ninja ~]# gluster v status vol2
Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ninja:/rhs/brick1/vol2-1              49157     0          Y       2731 
Brick ninja:/rhs/brick2/vol2-2              49158     0          Y       2740 
Brick ninja:/rhs/brick3/vol2-3              49159     0          Y       2747 
Brick ninja:/rhs/brick4/vol2-4              49160     0          Y       2754 
Brick vertigo:/rhs/brick1/vol2-5            49156     0          Y       27613
Brick vertigo:/rhs/brick2/vol2-6            49157     0          Y       19504
Brick vertigo:/rhs/brick3/vol2-7            49158     0          Y       19511
Brick ninja:/rhs/brick1/vol2-8              49161     0          Y       2765 
Brick ninja:/rhs/brick2/vol2-9              49162     0          Y       2770 
Brick ninja:/rhs/brick3/vol2-10             49163     0          Y       2779 
Brick ninja:/rhs/brick4/vol2-11             49164     0          Y       2786 
Snapshot Daemon on localhost                49165     0          Y       2855 
NFS Server on localhost                     2049      0          Y       10459
Self-heal Daemon on localhost               N/A       N/A        Y       10486
Snapshot Daemon on 10.70.34.56              49160     0          Y       19539
NFS Server on 10.70.34.56                   2049      0          Y       27648
Self-heal Daemon on 10.70.34.56             N/A       N/A        Y       27670
Snapshot Daemon on transformers             49162     0          Y       12992
NFS Server on transformers                  2049      0          Y       46858
Self-heal Daemon on transformers            N/A       N/A        Y       46881
Snapshot Daemon on interstellar             49166     0          Y       14480
NFS Server on interstellar                  2049      0          Y       48872
Self-heal Daemon on interstellar            N/A       N/A        Y       48882
 
Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@ninja ~]# 


[root@ninja ~]# gluster --version
glusterfs 3.7.1 built on Jun 28 2015 11:01:17
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@ninja ~]# 

[root@rhs-client29 ~]# rpm -qa |grep gluster
glusterfs-fuse-3.7.1-6.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-6.el6rhs.x86_64
glusterfs-3.7.1-6.el6rhs.x86_64
glusterfs-api-3.7.1-6.el6rhs.x86_64
glusterfs-libs-3.7.1-6.el6rhs.x86_64
[root@rhs-client29 ~]# 


The FUSE mount log continuously shows the messages below even though the volume is up.

[2015-06-29 12:21:23.253607] W [MSGID: 122002] [ec-common.c:122:ec_heal_report] 0-vol2-disperse-0: Heal failed [Input/output error]
[2015-06-29 12:21:23.253934] W [rpc-clnt.c:1571:rpc_clnt_submit] 0-vol2-client-0: failed to submit rpc-request (XID: 0x5fab0 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (vol2-client-0)
[2015-06-29 12:21:23.253972] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-vol2-client-0: remote operation failed. Path: /dirs./dir.31618 (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
[2015-06-29 12:21:23.254944] W [MSGID: 122053] [ec-common.c:166:ec_check_status] 0-vol2-disperse-0: Operation failed on some subvolumes (up=7FF, mask=7FF, remaining=0, good=7EE, bad=11)
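
The up/mask/good/bad values in the last message appear to be per-brick bitmaps (bit 0 being the first brick of the disperse set); assuming that interpretation, they can be decoded like this:

for m in 7FF 7EE 11; do
    printf '%s = %011d\n' "$m" "$(echo "obase=2; ibase=16; $m" | bc)"
done
# 7FF = 11111111111  -> all 11 bricks of the 8+3 set are up
# 7EE = 11111101110  -> bricks reported "good" for the operation
# 11  = 00000010001  -> bricks reported "bad": the 1st and the 5th

That is, two of the eleven bricks failed the write even though all bricks were up, which presumably is what keeps triggering the failing heal attempts logged above.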

Comment 13 Anoop 2015-07-03 08:56:36 UTC
Is this supposed to work by disabling client-side heal?

Comment 15 Bhaskarakiran 2015-07-09 05:31:42 UTC
Ran IO for a sufficient amount of time and did not see the hangs with client-side heal disabled. Moving this bug to fixed.
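
For reference, a sketch of how client-side (mount-process) heals can be disabled on a disperse volume, assuming the disperse.background-heals volume option is available in the installed build; setting it to 0 turns off heals in the mount process and leaves healing to the self-heal daemon:

gluster volume set vol2 disperse.background-heals 0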

Comment 16 errata-xmlrpc 2015-07-29 05:02:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

