Bug 1398188

Summary: [Arbiter] IO's Halted and heal info command hung
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: arbiter
Reporter: Karan Sandha <ksandha>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Karan Sandha <ksandha>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: rhgs-3.2
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: All
OS: Linux
CC: amukherj, pkarampu, ravishankar, rhinduja, rhs-bugs, storage-qa-internal
Fixed In Version: glusterfs-3.8.4-8
Doc Type: If docs needed, set a value
Clones: 1401404 (view as bug list)
Last Closed: 2017-03-23 05:50:23 UTC
Type: Bug
Bug Blocks: 1351528, 1401404, 1412909, 1413062
Attachments: statedumps

Description Karan Sandha 2016-11-24 09:37:22 UTC
Description of problem:
Adding arbiter bricks to a replica 2 (8x2) volume while I/O is in progress, and then issuing heal info, leads to both the I/O and the heal info command hanging.

Version-Release number of selected component (if applicable):
gluster --version
glusterfs 3.8.4 built on Nov 11 2016 06:45:08
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:
2/2
Logs and statedumps are placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

Steps to Reproduce:
1. Create an 8x2 replica 2 volume (ksandha) using the command below:
gluster volume create ksandha replica 2 10.70.47.141:/bricks//brick0/testvol_brick0 10.70.47.143:/bricks//brick0/testvol_brick1 10.70.47.144:/bricks//brick0/testvol_brick2 10.70.47.197:/bricks//brick0/testvol_brick3 10.70.47.175:/bricks//brick0/testvol_brick4 10.70.46.142:/bricks//brick0/testvol_brick5 10.70.47.141:/bricks//brick1/testvol_brick6 10.70.47.143:/bricks//brick1/testvol_brick7 10.70.47.144:/bricks//brick1/testvol_brick8 10.70.47.197:/bricks//brick1/testvol_brick9 10.70.47.175:/bricks//brick1/testvol_brick10 10.70.46.142:/bricks//brick1/testvol_brick11 10.70.47.141:/bricks//brick2/testvol_brick12 10.70.47.143:/bricks//brick2/testvol_brick13 10.70.47.144:/bricks//brick2/testvol_brick14 10.70.47.197:/bricks//brick2/testvol_brick15

[root@dhcp47-141 ~]# gluster v info
 
Volume Name: ksandha
Type: Distributed-Replicate
Volume ID: cdc6fddd-023c-4c25-ab37-c03b329d07a6
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.47.141:/bricks/brick0/testvol_brick0
Brick2: 10.70.47.143:/bricks/brick0/testvol_brick1
Brick3: 10.70.47.144:/bricks/brick0/testvol_brick2
Brick4: 10.70.47.197:/bricks/brick0/testvol_brick3
Brick5: 10.70.47.175:/bricks/brick0/testvol_brick4
Brick6: 10.70.46.142:/bricks/brick0/testvol_brick5
Brick7: 10.70.47.141:/bricks/brick1/testvol_brick6
Brick8: 10.70.47.143:/bricks/brick1/testvol_brick7
Brick9: 10.70.47.144:/bricks/brick1/testvol_brick8
Brick10: 10.70.47.197:/bricks/brick1/testvol_brick9
Brick11: 10.70.47.175:/bricks/brick1/testvol_brick10
Brick12: 10.70.46.142:/bricks/brick1/testvol_brick11
Brick13: 10.70.47.141:/bricks/brick2/testvol_brick12
Brick14: 10.70.47.143:/bricks/brick2/testvol_brick13
Brick15: 10.70.47.144:/bricks/brick2/testvol_brick14
Brick16: 10.70.47.197:/bricks/brick2/testvol_brick15
Options Reconfigured:
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp47-141 ~]# 


2. Mount the volume at /mnt/ksandha on the client (x.x.47.116), dd 10 files of 2 GB each from the mount point, and let the writes finish. For example:
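A minimal sketch of this step, assuming a FUSE mount from one of the servers (the server address and file names here are illustrative):

mount -t glusterfs 10.70.47.141:/ksandha /mnt/ksandha
for i in $(seq 1 10); do
    dd if=/dev/zero of=/mnt/ksandha/bigfile_$i bs=1M count=2048
done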

3. Kill one brick process from each of two subvolumes, for example:
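One way to do this (illustrative; the actual PIDs come from gluster v status on the respective nodes):

# on 10.70.47.141: kill the brick serving /bricks/brick0/testvol_brick0
kill -9 <brick-pid>
# on 10.70.47.143: kill the brick serving /bricks/brick2/testvol_brick13
kill -9 <brick-pid>

The resulting volume status: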
[root@dhcp47-141 ~]# gluster v status
Status of volume: ksandha
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.141:/bricks/brick0/testvol_b
rick0                                       N/A       N/A        N       N/A  
Brick 10.70.47.143:/bricks/brick0/testvol_b
rick1                                       49152     0          Y       25809
Brick 10.70.47.144:/bricks/brick0/testvol_b
rick2                                       49152     0          Y       25783
Brick 10.70.47.197:/bricks/brick0/testvol_b
rick3                                       49152     0          Y       21827
Brick 10.70.47.175:/bricks/brick0/testvol_b
rick4                                       49152     0          Y       21792
Brick 10.70.46.142:/bricks/brick0/testvol_b
rick5                                       49152     0          Y       21838
Brick 10.70.47.141:/bricks/brick1/testvol_b
rick6                                       49153     0          Y       15757
Brick 10.70.47.143:/bricks/brick1/testvol_b
rick7                                       49153     0          Y       25828
Brick 10.70.47.144:/bricks/brick1/testvol_b
rick8                                       49153     0          Y       25802
Brick 10.70.47.197:/bricks/brick1/testvol_b
rick9                                       49153     0          Y       21846
Brick 10.70.47.175:/bricks/brick1/testvol_b
rick10                                      49153     0          Y       21811
Brick 10.70.46.142:/bricks/brick1/testvol_b
rick11                                      49153     0          Y       21857
Brick 10.70.47.141:/bricks/brick2/testvol_b
rick12                                      49154     0          Y       15776
Brick 10.70.47.143:/bricks/brick2/testvol_b
rick13                                      N/A       N/A        N       N/A  
Brick 10.70.47.144:/bricks/brick2/testvol_b
rick14                                      49154     0          Y       25821
Brick 10.70.47.197:/bricks/brick2/testvol_b
rick15                                      49154     0          Y       21865
Self-heal Daemon on localhost               N/A       N/A        Y       15796
Self-heal Daemon on 10.70.47.143            N/A       N/A        Y       25867
Self-heal Daemon on 10.70.46.142            N/A       N/A        Y       21877
Self-heal Daemon on 10.70.47.197            N/A       N/A        Y       21885
Self-heal Daemon on 10.70.47.144            N/A       N/A        Y       25841
Self-heal Daemon on 10.70.47.175            N/A       N/A        Y       21831
 
Task Status of Volume ksandha
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp47-141 ~]# 

4. Now create 100,000 files (for example, with the small loop sketched below). While the creation is in progress, add the arbiter bricks to the volume using the add-brick command that follows:
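A minimal sketch of the file-creation load, run from the client mount point (file names are illustrative):

cd /mnt/ksandha
for i in $(seq 1 100000); do touch file_$i; done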


[root@dhcp47-141 ~]# gluster volume add-brick ksandha replica 3 arbiter 1 10.70.47.141:/bricks/brick3/arbiter 10.70.47.143:/bricks/brick3/arbiter 10.70.47.144:/bricks/brick3/arbiter 10.70.47.197:/bricks/brick3/arbiter 10.70.47.175:/bricks/brick3/arbiter 10.70.46.142:/bricks/brick3/arbiter 10.70.47.141:/bricks/brick4/arbiter 10.70.47.143:/bricks/brick4/arbiter 
volume add-brick: success
[root@dhcp47-141 ~]# 

root@dhcp47-141 ~]# gluster v info
 
Volume Name: ksandha
Type: Distributed-Replicate
Volume ID: ab8c2a73-73ff-4026-9b7e-2134095d2986
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x (2 + 1) = 24
Transport-type: tcp
Bricks:
Brick1: 10.70.47.141:/bricks/brick0/testvol_brick0
Brick2: 10.70.47.143:/bricks/brick0/testvol_brick1
Brick3: 10.70.47.141:/bricks/brick3/arbiter (arbiter)
Brick4: 10.70.47.144:/bricks/brick0/testvol_brick2
Brick5: 10.70.47.197:/bricks/brick0/testvol_brick3
Brick6: 10.70.47.143:/bricks/brick3/arbiter (arbiter)
Brick7: 10.70.47.175:/bricks/brick0/testvol_brick4
Brick8: 10.70.46.142:/bricks/brick0/testvol_brick5
Brick9: 10.70.47.144:/bricks/brick3/arbiter (arbiter)
Brick10: 10.70.47.141:/bricks/brick1/testvol_brick6
Brick11: 10.70.47.143:/bricks/brick1/testvol_brick7
Brick12: 10.70.47.197:/bricks/brick3/arbiter (arbiter)
Brick13: 10.70.47.144:/bricks/brick1/testvol_brick8
Brick14: 10.70.47.197:/bricks/brick1/testvol_brick9
Brick15: 10.70.47.175:/bricks/brick3/arbiter (arbiter)
Brick16: 10.70.47.175:/bricks/brick1/testvol_brick10
Brick17: 10.70.46.142:/bricks/brick1/testvol_brick11
Brick18: 10.70.46.142:/bricks/brick3/arbiter (arbiter)
Brick19: 10.70.47.141:/bricks/brick2/testvol_brick12
Brick20: 10.70.47.143:/bricks/brick2/testvol_brick13
Brick21: 10.70.47.141:/bricks/brick4/arbiter (arbiter)
Brick22: 10.70.47.144:/bricks/brick2/testvol_brick14
Brick23: 10.70.47.197:/bricks/brick2/testvol_brick15
Brick24: 10.70.47.143:/bricks/brick4/arbiter (arbiter)
Options Reconfigured:
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


5. Force-start the volume, and after 1-2 minutes issue the heal info command on one of the servers:
[root@dhcp47-141 ~]# gluster volume start ksandha force
volume start: ksandha: success
[root@dhcp47-141 ~]# 
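For reference, the heal info query issued here is the standard one:

[root@dhcp47-141 ~]# gluster volume heal ksandha info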

Actual results:
1) gluster volume heal ksandha info hangs on the servers.
2) I/O hangs on the mount point.
3) A few locks were observed in the statedumps of the servers and in the mount logs.


Expected results:
1) Healing should not be hampered and I/O should continue to run smoothly.
2) No deadlocks should be observed.


Additional info:
Ravi did some initial debugging and found a few locks on the servers x.x.47.141 and x.x.47.143 and on the mount point (x.x.47.116).

Comment 6 Ravishankar N 2016-12-02 06:20:07 UTC
Created attachment 1227157 [details]
statedumps

bricks-brick0-testvol_brick0.26116.dump.1479805293 --> brick:client-0 (140)
bricks-brick0-testvol_brick1.25809.dump.1479803748 --> brick:client-1 (143)
glusterdump.4424.dump.1479804418 --> fuse mount statedump
mnt-ksandha.log --> Fuse mount log
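
For reference, statedumps like these can be regenerated with gluster's statedump facilities (a sketch, assuming the default statedump path /var/run/gluster; the client PID is a placeholder):

# brick statedumps, run on any server in the trusted pool
gluster volume statedump ksandha
# fuse-mount statedump, on the client: send SIGUSR1 to the glusterfs client process
kill -USR1 <glusterfs-client-pid>
# then search the dump files for blocked lock entries
grep -i blocked /var/run/gluster/*.dump.*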

Comment 7 Atin Mukherjee 2016-12-05 08:51:19 UTC
Upstream mainline patch http://review.gluster.org/#/c/16024/ posted for review.

Comment 12 errata-xmlrpc 2017-03-23 05:50:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html