Bug 1443896 - [BrickMultiplex] gluster command not responding and .snaps directory is not visible after executing snapshot related command
Summary: [BrickMultiplex] gluster command not responding and .snaps directory is not visible after executing snapshot related command
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1443123 1444128
 
Reported: 2017-04-20 08:28 UTC by Atin Mukherjee
Modified: 2017-05-30 18:50 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1443123
Clones: 1444128
Environment:
Last Closed: 2017-05-30 18:50:19 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Atin Mukherjee 2017-04-20 08:28:23 UTC
+++ This bug was initially created as a clone of Bug #1443123 +++

Description of problem:
On an existing Gluster cluster with a USS-enabled volume and a snapshot created and activated, I enabled brick-multiplex. After this, "gluster snapshot status" timed out.
Subsequently, "gluster volume status" and "gluster peer status" also failed with time-out errors (this happens across all 4 nodes).
Other commands such as "umount" and "df -Th" have also become slow.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-22.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have a setup ready with snapshots activated
2. Enable brick-multiplex
3. Run gluster snapshot status
4. Run gluster volume status
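
A rough CLI sketch of these steps (a sketch only: the volume and snapshot names below are placeholders, not taken from this report; brick-multiplex is a cluster-wide option, hence "set all"):

  gluster volume set VOLNAME features.uss enable          # precondition: USS enabled on the volume
  gluster snapshot create snap1 VOLNAME                   # step 1: snapshot present ...
  gluster snapshot activate snap1                         # ... and activated
  gluster volume set all cluster.brick-multiplex enable   # step 2: enable brick multiplexing
  gluster snapshot status                                  # step 3: times out
  gluster volume status                                    # step 4: times out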

Actual results:
Error : Request timed out

Expected results:
The commands should display the requested status instead of timing out

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-04-18 10:02:11 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Vivek Das on 2017-04-18 10:29:37 EDT ---

Logs : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1443123

--- Additional comment from Atin Mukherjee on 2017-04-18 10:45:42 EDT ---

I think you are hitting BZ 1441946. Can you check whether you see a stale glusterd lock entry in the glusterd log file? If so, that confirms this is the same issue.
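
One way to look for such an entry (the path is the default glusterd log location; the exact message wording is an assumption and may differ between versions):

  grep -iE 'lock.*held|another transaction' /var/log/glusterfs/glusterd.log | tail -n 20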

--- Additional comment from Atin Mukherjee on 2017-04-18 23:56:31 EDT ---

(In reply to Atin Mukherjee from comment #3)
> I think you are hitting BZ 1441946. Can you check whether you see a stale
> glusterd lock entry in the glusterd log file? If so, that confirms this is
> the same issue.

Doesn't look like the same issue. On one of the nodes, dhcp43-155.lab.eng.blr.redhat.com, the backtrace of glusterd shows that the process is hung:

Thread 8 (Thread 0x7fe84aff6700 (LWP 1561)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe8539802e5 in gf_timer_proc () from /lib64/libglusterfs.so.0
#4  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe84a7f5700 (LWP 1562)):
#0  0x00007fe8527da101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fe853e69ebb in glusterfs_sigwaiter ()
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fe849ff4700 (LWP 1563)):
#0  0x00007fe8520de66d in nanosleep () from /lib64/libc.so.6
#1  0x00007fe8520de504 in sleep () from /lib64/libc.so.6
#2  0x00007fe85399982d in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe8497f3700 (LWP 1564)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe853980805 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#4  0x00007fe8539765a3 in gf_log_disable_suppression_before_exit () from /lib64/libglusterfs.so.0
#5  0x00007fe85397c8e5 in gf_print_trace () from /lib64/libglusterfs.so.0
#6  <signal handler called>
#7  0x00007fe8539807a6 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#8  0x00007fe8484c2ac3 in glusterd_volume_start_glusterfs () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#9  0x00007fe8484c54cf in glusterd_brick_start () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#10 0x00007fe8484c5a4d in glusterd_restart_bricks () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#11 0x00007fe8484d8806 in glusterd_spawn_daemons () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#12 0x00007fe8539a9362 in synctask_wrap () from /lib64/libglusterfs.so.0
#13 0x00007fe852066cf0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fe848ff2700 (LWP 1565)):
#0  0x00007fe8527d6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8539ab898 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac6e0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe843f54700 (LWP 1819)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe84854bc43 in hooks_worker () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe843753700 (LWP 1820)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe8539a945b in __synclock_lock () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac996 in synclock_lock () from /lib64/libglusterfs.so.0
#3  0x00007fe848497c2d in glusterd_big_locked_notify () from /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#4  0x00007fe85373cb84 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#5  0x00007fe8537389f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#6  0x00007fe8459031e7 in socket_connect_finish () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7  0x00007fe845907848 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#8  0x00007fe8539cce50 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#9  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe853e4b780 (LWP 1559)):
#0  0x00007fe8527d3ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe8539cd2e0 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00007fe853e66d95 in main ()
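
For reference, a backtrace like the above can be captured from a hung glusterd with gdb, assuming gdb and the matching debuginfo packages are installed:

  gdb -p "$(pidof glusterd)" -batch -ex "thread apply all bt"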

--- Additional comment from Atin Mukherjee on 2017-04-19 07:14:18 EDT ---

The above hang was caused by a node reboot. The following is the gluster volume info output, kept here for reference.

[root@dhcp43-99 ~]# gluster v info
 
Volume Name: benbhai
Type: Distributed-Replicate
Volume ID: 3b0cb05f-629e-435b-998c-a0f5870e888a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick3
Options Reconfigured:
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable
 
Volume Name: ctdb
Type: Replicate
Volume ID: 195b67be-d9af-4b2a-9c6c-b17a088a1921
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.99:/bricks/brick4/ctdb
Brick2: 10.70.43.155:/bricks/brick4/ctdb
Brick3: 10.70.42.240:/bricks/brick4/ctdb
Brick4: 10.70.43.101:/bricks/brick4/ctdb
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable
 
Volume Name: fashion
Type: Distributed-Replicate
Volume ID: 3022ed91-5646-4c4e-a173-ad84cfb2556a
Status: Started
Snapshot Count: 3
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick3
Options Reconfigured:
performance.parallel-readdir: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs.log-level: DEFAULT
cluster.brick-multiplex: enable
 
Volume Name: samba-arbitor
Type: Distributed-Replicate
Volume ID: dfb5e983-642c-4abd-8bb4-57356e31f982
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick2 (arbiter)
Brick4: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick2
Brick5: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick3
Brick6: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick1 (arbiter)
Options Reconfigured:
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
cluster.brick-multiplex: enable
 
Volume Name: tmpvol
Type: Distribute
Volume ID: 255e41a5-c9a2-466c-9f56-4f126957147e
Status: Stopped
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: dhcp43-155.lab.eng.blr.redhat.com:/bricks/tmp
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

--- Additional comment from Atin Mukherjee on 2017-04-19 08:49:02 EDT ---

Looks similar to BZ 1421721

Comment 1 Worker Ant 2017-04-20 08:29:28 UTC
REVIEW: https://review.gluster.org/17088 (glusterd: set conn->reconnect to null on timer cancellation) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 2 Worker Ant 2017-04-20 15:14:20 UTC
COMMIT: https://review.gluster.org/17088 committed in master by Jeff Darcy (jeff.us) 
------
commit 98dc1f08c114adea1f4133c12dff0d4c3d75b30d
Author: Atin Mukherjee <amukherj>
Date:   Thu Apr 20 13:57:27 2017 +0530

    glusterd: set conn->reconnect to null on timer cancellation
    
    Change-Id: Ic48e6652f431daeb0db027660f6c9de16d893f08
    BUG: 1443896
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17088
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
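
To inspect the change locally, the commit hash above can be viewed in a glusterfs source checkout that contains the master branch (plain git, nothing else assumed):

  git show 98dc1f08c114adea1f4133c12dff0d4c3d75b30d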

Comment 3 Shyamsundar 2017-05-30 18:50:19 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

