+++ This bug was initially created as a clone of Bug #1443123 +++

Description of problem:
On an existing Gluster cluster with a USS-enabled volume and a snapshot created and activated, I enabled brick multiplexing. After this, "gluster snapshot status" timed out. Subsequent runs of "gluster volume status" and "gluster peer status" also timed out (this happens across all 4 nodes). Other commands such as "umount" and "df -Th" have also become slow.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-22.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have a setup ready with snapshots activated
2. Enable brick-multiplex
3. Run "gluster snapshot status"
4. Run "gluster volume status"

Actual results:
Error: Request timed out

Expected results:
The status should be displayed.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-04-18 10:02:11 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'. If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Vivek Das on 2017-04-18 10:29:37 EDT ---

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1443123

--- Additional comment from Atin Mukherjee on 2017-04-18 10:45:42 EDT ---

I think you are hitting BZ 1441946. Can you check whether you are seeing a stale glusterd lock entry in the glusterd log file? If so, that confirms this is the same issue.

--- Additional comment from Atin Mukherjee on 2017-04-18 23:56:31 EDT ---

(In reply to Atin Mukherjee from comment #3)
> I think you are hitting BZ 1441946. Can you check if you are seeing a stale
> glusterd lock entry in glusterd log file? If so that confirms this issue is
> same.

This doesn't look like the same issue.
On one of the nodes, dhcp43-155.lab.eng.blr.redhat.com, the backtrace of glusterd shows the process is hung:

Thread 8 (Thread 0x7fe84aff6700 (LWP 1561)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe8539802e5 in gf_timer_proc () from /lib64/libglusterfs.so.0
#4  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe84a7f5700 (LWP 1562)):
#0  0x00007fe8527da101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fe853e69ebb in glusterfs_sigwaiter ()
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fe849ff4700 (LWP 1563)):
#0  0x00007fe8520de66d in nanosleep () from /lib64/libc.so.6
#1  0x00007fe8520de504 in sleep () from /lib64/libc.so.6
#2  0x00007fe85399982d in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe8497f3700 (LWP 1564)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe853980805 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#4  0x00007fe8539765a3 in gf_log_disable_suppression_before_exit () from /lib64/libglusterfs.so.0
#5  0x00007fe85397c8e5 in gf_print_trace () from /lib64/libglusterfs.so.0
#6  <signal handler called>
#7  0x00007fe8539807a6 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#8  0x00007fe8484c2ac3 in glusterd_volume_start_glusterfs () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#9  0x00007fe8484c54cf in glusterd_brick_start () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#10 0x00007fe8484c5a4d in glusterd_restart_bricks () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#11 0x00007fe8484d8806 in glusterd_spawn_daemons () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#12 0x00007fe8539a9362 in synctask_wrap () from /lib64/libglusterfs.so.0
#13 0x00007fe852066cf0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fe848ff2700 (LWP 1565)):
#0  0x00007fe8527d6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8539ab898 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac6e0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe843f54700 (LWP 1819)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe84854bc43 in hooks_worker () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe843753700 (LWP 1820)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8539a945b in __synclock_lock () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac996 in synclock_lock () from /lib64/libglusterfs.so.0
#3  0x00007fe848497c2d in glusterd_big_locked_notify () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#4  0x00007fe85373cb84 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#5  0x00007fe8537389f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#6  0x00007fe8459031e7 in socket_connect_finish () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7  0x00007fe845907848 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#8  0x00007fe8539cce50 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#9  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe853e4b780 (LWP 1559)):
#0  0x00007fe8527d3ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe8539cd2e0 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00007fe853e66d95 in main ()

--- Additional comment from Atin Mukherjee on 2017-04-19 07:14:18 EDT ---

The above hang was caused by a node reboot. The following is the "gluster volume info" output, for reference.

[root@dhcp43-99 ~]# gluster v info

Volume Name: benbhai
Type: Distributed-Replicate
Volume ID: 3b0cb05f-629e-435b-998c-a0f5870e888a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick3
Options Reconfigured:
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: ctdb
Type: Replicate
Volume ID: 195b67be-d9af-4b2a-9c6c-b17a088a1921
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.99:/bricks/brick4/ctdb
Brick2: 10.70.43.155:/bricks/brick4/ctdb
Brick3: 10.70.42.240:/bricks/brick4/ctdb
Brick4: 10.70.43.101:/bricks/brick4/ctdb
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: fashion
Type: Distributed-Replicate
Volume ID: 3022ed91-5646-4c4e-a173-ad84cfb2556a
Status: Started
Snapshot Count: 3
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick3
Options Reconfigured:
performance.parallel-readdir: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs.log-level: DEFAULT
cluster.brick-multiplex: enable

Volume Name: samba-arbitor
Type: Distributed-Replicate
Volume ID: dfb5e983-642c-4abd-8bb4-57356e31f982
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick2 (arbiter)
Brick4: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick2
Brick5: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick3
Brick6: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick1 (arbiter)
Options Reconfigured:
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
cluster.brick-multiplex: enable

Volume Name: tmpvol
Type: Distribute
Volume ID: 255e41a5-c9a2-466c-9f56-4f126957147e
Status: Stopped
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: dhcp43-155.lab.eng.blr.redhat.com:/bricks/tmp
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

--- Additional comment from Atin Mukherjee on 2017-04-19 08:49:02 EDT ---

This looks similar to BZ 1421721.
REVIEW: https://review.gluster.org/17088 (glusterd: set conn->reconnect to null on timer cancellation) posted (#1) for review on master by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/17088 committed in master by Jeff Darcy (jeff.us)

------

commit 98dc1f08c114adea1f4133c12dff0d4c3d75b30d
Author: Atin Mukherjee <amukherj>
Date:   Thu Apr 20 13:57:27 2017 +0530

    glusterd: set conn->reconnect to null on timer cancellation

    Change-Id: Ic48e6652f431daeb0db027660f6c9de16d893f08
    BUG: 1443896
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17088
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
This bug is being closed because a release is now available that should address the reported issue. If the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/