Description of problem:
=========================
I have set up a systemic testbed with a 4x2 volume spanning 4 nodes. I have enabled the features below; see the vol info:

Volume Name: drvol
Type: Distributed-Replicate
Volume ID: 2f0e5510-fe47-4ce8-906e-6ddc7f9334ca
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.191:/rhs/brick1/drvol
Brick2: 10.70.37.108:/rhs/brick1/drvol
Brick3: 10.70.35.3:/rhs/brick1/drvol
Brick4: 10.70.37.66:/rhs/brick1/drvol
Brick5: 10.70.35.191:/rhs/brick2/drvol
Brick6: 10.70.37.108:/rhs/brick2/drvol
Brick7: 10.70.35.3:/rhs/brick2/drvol
Brick8: 10.70.37.66:/rhs/brick2/drvol
Options Reconfigured:
cluster.use-compound-fops: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.barrier: disable
cluster.shd-max-threads: 16
performance.md-cache-timeout: 600
performance.cache-invalidation: true
features.cache-invalidation-timeout: 300
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

I then mounted the volume on 10 different clients and ran the following IOs:

From all clients: started taking a statedump of the fuse mount process every 5 minutes and moving the dumps to a dedicated directory for each host on the mount point (i.e., into the gluster volume).
From all clients: collected top and CPU usage every 2 minutes, appending the output to a per-host file on the mount point (i.e., into the gluster volume).
From 2 clients (say A and B): created a directory structure of large depth and breadth. Both client fuse processes crashed due to OOM kill, since each machine had only 4 GB of memory.
From 1 client (say C): created about 1 million files using touch.
From 1 client (say D): ran ls -l on each file by name as it was being created by client C.
From 1 client (say E): ran stat on each file by name as it was being created by client C.

I saw
that on client E, the stat hung indefinitely. The client is RHEL 6.8; memory consumption is as below:

[root@rhs-client14 ~]# top -n 1 -b | grep 5377
 5377 root      20   0 7029m 5.9g 4436 T  0.0 38.0 408:49.23 glusterfs

pstack info of the pid:

[root@rhs-client14 ~]# pstack 5377
Thread 10 (Thread 0x7fce12ab6700 (LWP 5378)):
#0  0x00007fce1abbb00d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fce1bade265 in gf_timer_proc () from /usr/lib64/libglusterfs.so.0
#2  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7fce120b5700 (LWP 5379)):
#0  0x00007fce1abb8465 in ?? () from /lib64/libpthread.so.0
#1  0x00007fce1baf4c8c in fd_ctx_dump () from /usr/lib64/libglusterfs.so.0
#2  0x00007fce1bade793 in inode_dump () from /usr/lib64/libglusterfs.so.0
#3  0x00007fce1bae0957 in inode_table_dump () from /usr/lib64/libglusterfs.so.0
#4  0x00007fce12aca289 in fuse_itable_dump () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#5  0x00007fce1bafdd92 in gf_proc_dump_xlator_info () from /usr/lib64/libglusterfs.so.0
#6  0x00007fce1bafe6c5 in gf_proc_dump_info () from /usr/lib64/libglusterfs.so.0
#7  0x00007fce1bfc53ed in glusterfs_sigwaiter ()
#8  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7fce116b4700 (LWP 5380)):
#0  0x00007fce1abb7a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce1bb09cdc in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007fce1bb17d20 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fce0ea7e700 (LWP 5383)):
#0  0x00007fce1abba334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fce1abb55d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fce1abb54a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fce1bae1741 in inode_unref () from /usr/lib64/libglusterfs.so.0
#4  0x00007fce1bacdbf2 in loc_wipe () from /usr/lib64/libglusterfs.so.0
#5  0x00007fce0de3d94e in client_local_wipe () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#6  0x00007fce0de5686c in client3_3_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#7  0x00007fce1b89a925 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#8  0x00007fce1b89ba8c in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#9  0x00007fce1b896bc8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#10 0x00007fce100a456d in socket_event_poll_in () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#11 0x00007fce100a585e in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#12 0x00007fce1bb2cc96 in event_dispatch_epoll_worker () from /usr/lib64/libglusterfs.so.0
#13 0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fce076ff700 (LWP 5384)):
#0  0x00007fce1abba334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fce1abb55d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fce1abb54a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fce1bae1741 in inode_unref () from /usr/lib64/libglusterfs.so.0
#4  0x00007fce1bacdbf2 in loc_wipe () from /usr/lib64/libglusterfs.so.0
#5  0x00007fce0cce1b7c in ob_fd_free () from /usr/lib64/glusterfs/3.8.4/xlator/performance/open-behind.so
#6  0x00007fce0cce1dd0 in ob_wake_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/performance/open-behind.so
#7  0x00007fce1bb5e186 in default_open_cbk () from /usr/lib64/libglusterfs.so.0
#8  0x00007fce0d0f2c39 in ioc_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-cache.so
#9  0x00007fce0d50f754 in ra_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/performance/read-ahead.so
#10 0x00007fce0d99139e in dht_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so
#11 0x00007fce0dbdb26d in afr_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#12 0x00007fce0de5685a in client3_3_open_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#13 0x00007fce1b89a925 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#14 0x00007fce1b89ba8c in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#15 0x00007fce1b896bc8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#16 0x00007fce100a456d in socket_event_poll_in () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#17 0x00007fce100a585e in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#18 0x00007fce1bb2cc96 in event_dispatch_epoll_worker () from /usr/lib64/libglusterfs.so.0
#19 0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fce0c26e700 (LWP 5385)):
#0  0x00007fce1abb7a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce0c8c1a6c in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fcdf7efe700 (LWP 5389)):
#0  0x00007fce1abba334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fce1abb55d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fce1abb54a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fce1bae1ac1 in inode_ref () from /usr/lib64/libglusterfs.so.0
#4  0x00007fce12abc01f in fuse_ino_to_inode () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#5  0x00007fce12abd01d in fuse_resolve_inode_init () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#6  0x00007fce12ac832a in fuse_getattr () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#7  0x00007fce12ad6270 in fuse_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#8  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fcdf74fd700 (LWP 5390)):
#0  0x00007fce1abb768c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce12ac7d9b in notify_kernel_loop () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#2  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fcca2ffa700 (LWP 6850)):
#0  0x00007fce1abb7a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fce1bb09cdc in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007fce1bb17d20 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x00007fce1abb3aa1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fce1a51caad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fce1bfa8740 (LWP 5377)):
#0  0x00007fce1abb42fd in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fce1bb2c75d in event_dispatch_epoll () from /usr/lib64/libglusterfs.so.0
#2  0x00007fce1bfc749b in main ()
[root@rhs-client14 ~]#

[root@rhs-client14 ~]# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-3.el6.x86_64
glusterfs-3.8.4-3.el6.x86_64
glusterfs-fuse-3.8.4-3.el6.x86_64
glusterfs-libs-3.8.4-3.el6.x86_64
glusterfs-debuginfo-3.8.4-3.el6.x86_64
[root@rhs-client14 ~]#
Also note that two bricks were brought down:

[root@dhcp37-108 ~]# gluster v status
Status of volume: drvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick1/drvol        N/A       N/A        N       N/A
Brick 10.70.37.108:/rhs/brick1/drvol        49154     0          Y       14087
Brick 10.70.35.3:/rhs/brick1/drvol          49154     0          Y       4790
Brick 10.70.37.66:/rhs/brick1/drvol         49154     0          Y       14602
Brick 10.70.35.191:/rhs/brick2/drvol        49156     0          Y       15070
Brick 10.70.37.108:/rhs/brick2/drvol        N/A       N/A        N       N/A
Brick 10.70.35.3:/rhs/brick2/drvol          49155     0          Y       4810
Brick 10.70.37.66:/rhs/brick2/drvol         49155     0          Y       14622
Snapshot Daemon on localhost                49152     0          Y       14278
Self-heal Daemon on localhost               N/A       N/A        Y       14128
Quota Daemon on localhost                   N/A       N/A        Y       14226
Snapshot Daemon on 10.70.35.3               49152     0          Y       4974
Self-heal Daemon on 10.70.35.3              N/A       N/A        Y       4835
Quota Daemon on 10.70.35.3                  N/A       N/A        Y       4928
Snapshot Daemon on 10.70.37.66              49152     0          Y       14795
Self-heal Daemon on 10.70.37.66             N/A       N/A        Y       14643
Quota Daemon on 10.70.37.66                 N/A       N/A        Y       14742
Snapshot Daemon on dhcp35-191.lab.eng.blr.redhat.com   49153  0  Y  15307
Self-heal Daemon on dhcp35-191.lab.eng.blr.redhat.com  N/A    N/A Y  15091
Quota Daemon on dhcp35-191.lab.eng.blr.redhat.com      N/A    N/A Y  15239

Task Status of Volume drvol
------------------------------------------------------------------------------
There are no active volume tasks
open-behind takes fd->lock and then inode->lock, whereas statedump takes inode->lock and then fd->lock, so the two paths deadlock:

void
ob_fd_free (ob_fd_t *ob_fd)
{
        loc_wipe (&ob_fd->loc);   <<--- this takes (inode->lock)

        if (ob_fd->xdata)
                dict_unref (ob_fd->xdata);

        if (ob_fd->open_frame)
                STACK_DESTROY (ob_fd->open_frame->root);

        GF_FREE (ob_fd);
}

int
ob_wake_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
             int op_ret, int op_errno, fd_t *fd_ret, dict_t *xdata)
{
        fd_t             *fd = NULL;
        struct list_head  list;
        ob_fd_t          *ob_fd = NULL;
        call_stub_t      *stub = NULL, *tmp = NULL;

        fd = frame->local;
        frame->local = NULL;

        INIT_LIST_HEAD (&list);

        LOCK (&fd->lock);   <<---- fd->lock
        {
                ob_fd = __ob_fd_ctx_get (this, fd);

                list_splice_init (&ob_fd->list, &list);

                if (op_ret < 0) {
                        /* mark fd BAD for ever */
                        ob_fd->op_errno = op_errno;
                } else {
                        __fd_ctx_del (fd, this, NULL);
                        ob_fd_free (ob_fd);   <<--- inode->lock taken while holding fd->lock
                }
        }
        UNLOCK (&fd->lock);

==============================================================

inode_dump (inode_t *inode, char *prefix)
{
        int                ret       = -1;
        xlator_t          *xl        = NULL;
        int                i         = 0;
        fd_t              *fd        = NULL;
        struct _inode_ctx *inode_ctx = NULL;
        struct list_head   fd_list;

        if (!inode)
                return;

        INIT_LIST_HEAD (&fd_list);

        ret = TRY_LOCK (&inode->lock);   <<---- takes inode->lock
        if (ret != 0) {
                return;
        }

        {
                gf_proc_dump_write ("gfid", "%s", uuid_utoa (inode->gfid));
                gf_proc_dump_write ("nlookup", "%ld", inode->nlookup);
                gf_proc_dump_write ("fd-count", "%u", inode->fd_count);
                gf_proc_dump_write ("ref", "%u", inode->ref);
                gf_proc_dump_write ("ia_type", "%d", inode->ia_type);

                if (inode->_ctx) {
                        inode_ctx = GF_CALLOC (inode->table->ctxcount,
                                               sizeof (*inode_ctx),
                                               gf_common_mt_inode_ctx);
                        if (inode_ctx == NULL) {
                                goto unlock;
                        }

                        for (i = 0; i < inode->table->ctxcount; i++) {
                                inode_ctx[i] = inode->_ctx[i];
                        }
                }

                if (dump_options.xl_options.dump_fdctx != _gf_true)
                        goto unlock;

                list_for_each_entry (fd, &inode->fd_list, inode_list) {
                        fd_ctx_dump (fd, prefix);   <<<----------------- fd->lock taken while holding inode->lock
                }
        }

fd_ctx_dump (fd_t *fd, char *prefix)
{
        struct _fd_ctx *fd_ctx = NULL;
        xlator_t       *xl     = NULL;
        int             i      = 0;

        if ((fd == NULL) || (fd->_ctx == NULL)) {
                goto out;
        }

        LOCK (&fd->lock);   <<<-------------------
        {
                if (fd->_ctx != NULL) {
                        fd_ctx = GF_CALLOC (fd->xl_count,
                                            sizeof (*fd_ctx),
                                            gf_common_mt_fd_ctx);
                        if (fd_ctx == NULL) {
                                goto unlock;
                        }

                        for (i = 0; i < fd->xl_count; i++) {
                                fd_ctx[i] = fd->_ctx[i];
                        }
                }
        }
http://review.gluster.org/15808
https://code.engineering.redhat.com/gerrit/#/c/89794/
I have run this case for two days on my systemic setup and did not hit any hang/deadlock. Hence moving to Verified.

Tested version: 3.8.4-5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html