Bug 1208564

Summary: ls operation hangs after the first brick is killed for distributed-disperse volume
Product: [Community] GlusterFS
Component: disperse
Version: 3.6.2
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Reporter: Fang Huang <fanghuang.data>
Assignee: Pranith Kumar K <pkarampu>
CC: bugs, kaushal, rabhat
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 3.6.3
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-08-16 13:08:20 UTC

Description Fang Huang 2015-04-02 14:19:44 UTC
Description of problem:

On the client, an ls operation hangs after the first brick of a distributed-disperse volume is killed.

Version-Release number of selected component (if applicable):

glusterfs 3.6.3beta1 built on Apr  1 2015 20:11:53
Repository revision: git://git.gluster.com/glusterfs.git

How reproducible:

Always

Steps to Reproduce:
1. Create a distributed-disperse volume

Volume Name: dht-ec-vol
Type: Distributed-Disperse
Volume ID: 5697b709-4187-47ce-970b-b8c19d610055
Status: Started
Number of Bricks: 2 x (3 + 1) = 8
Transport-type: tcp
Bricks:
Brick1: cn4:/home/wtb/brick/bri1
Brick2: cn4:/home/wtb/brick/bri2
Brick3: cn4:/home/wtb/brick/bri3
Brick4: cn4:/home/wtb/brick/bri4
Brick5: cn4:/home/wtb/brick/bri5
Brick6: cn4:/home/wtb/brick/bri6
Brick7: cn4:/home/wtb/brick/bri7
Brick8: cn4:/home/wtb/brick/bri8

2. mount and create some files

mkdir  -p /mnt/wtb/dht-ec-client
mount -t glusterfs cn4:/dht-ec-vol /mnt/wtb/dht-ec-client

touch /mnt/wtb/dht-ec-client/file{0..99}

3. kill the first brick

Execute 'gluster v status dht-ec-vol', find the PID of Brick1, and kill that process (a scripted sketch of these steps follows step 4).

4. ls /mnt/wtb/dht-ec-client 
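
For reference, steps 1-4 can be scripted roughly as below. The create command is a sketch inferred from the volume layout above (eight bricks bri1..bri8 on host cn4, hence 'force' for a single-server setup), and the awk-based PID lookup assumes the PID is the last column of the 'gluster v status' output:

# 1. create the 2 x (3 + 1) distributed-disperse volume (sketch, not necessarily the exact command used)
gluster volume create dht-ec-vol disperse 4 redundancy 1 cn4:/home/wtb/brick/bri{1..8} force
gluster volume start dht-ec-vol

# 2. mount and create some files
mkdir -p /mnt/wtb/dht-ec-client
mount -t glusterfs cn4:/dht-ec-vol /mnt/wtb/dht-ec-client
touch /mnt/wtb/dht-ec-client/file{0..99}

# 3. kill the first brick (PID assumed to be the last column of the status line)
BRICK1_PID=$(gluster v status dht-ec-vol | awk '/bri1/ {print $NF}')
kill -9 "$BRICK1_PID"

# 4. list the mount point
ls /mnt/wtb/dht-ec-client

With disperse 4 redundancy 1, each (3 + 1) subvolume should tolerate one offline brick, so this listing is expected to succeed.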

Actual results:

ls hangs and shows nothing

Expected results:

file{0..99} should be displayed.

Additional info:

Comment 1 Fang Huang 2015-04-02 14:21:41 UTC
The ls operation works fine when a different brick is killed instead.

Comment 2 Pranith Kumar K 2015-05-09 16:53:49 UTC
root@pranithk-laptop - ~ 
22:22:12 :) ⚡ glusterd && gluster volume create ec2 disperse 4 redundancy 1 `hostname`:/home/gfs/ec_{0..7} force && gluster volume start ec2 && mount -t glusterfs `hostname`:/ec2 /mnt/ec2
This configuration is not optimal on most workloads. Do you want to use it ? (y/n) y
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume?  (y/n) y
volume create: ec2: success: please start the volume to access data
volume start: ec2: success

root@pranithk-laptop - ~ 
22:22:23 :) ⚡ cd /mnt/ec2

root@pranithk-laptop - /mnt/ec2 
22:22:25 :) ⚡ mkdir -p dht-ec-client

root@pranithk-laptop - /mnt/ec2 
22:22:37 :) ⚡ cd dht-ec-client

root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:22:38 :) ⚡ touch file{0..99}

root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:22:55 :) ⚡ ls -l
total 0
-rw-r--r--. 1 root root 0 May  9 22:22 file0
-rw-r--r--. 1 root root 0 May  9 22:22 file1
-rw-r--r--. 1 root root 0 May  9 22:22 file10
-rw-r--r--. 1 root root 0 May  9 22:22 file11
-rw-r--r--. 1 root root 0 May  9 22:22 file12
-rw-r--r--. 1 root root 0 May  9 22:22 file13
-rw-r--r--. 1 root root 0 May  9 22:22 file14
-rw-r--r--. 1 root root 0 May  9 22:22 file15
-rw-r--r--. 1 root root 0 May  9 22:22 file16
-rw-r--r--. 1 root root 0 May  9 22:22 file17
-rw-r--r--. 1 root root 0 May  9 22:22 file18
-rw-r--r--. 1 root root 0 May  9 22:22 file19
-rw-r--r--. 1 root root 0 May  9 22:22 file2
-rw-r--r--. 1 root root 0 May  9 22:22 file20
-rw-r--r--. 1 root root 0 May  9 22:22 file21
-rw-r--r--. 1 root root 0 May  9 22:22 file22
-rw-r--r--. 1 root root 0 May  9 22:22 file23
-rw-r--r--. 1 root root 0 May  9 22:22 file24
-rw-r--r--. 1 root root 0 May  9 22:22 file25
-rw-r--r--. 1 root root 0 May  9 22:22 file26
-rw-r--r--. 1 root root 0 May  9 22:22 file27
-rw-r--r--. 1 root root 0 May  9 22:22 file28
-rw-r--r--. 1 root root 0 May  9 22:22 file29
-rw-r--r--. 1 root root 0 May  9 22:22 file3
-rw-r--r--. 1 root root 0 May  9 22:22 file30
-rw-r--r--. 1 root root 0 May  9 22:22 file31
-rw-r--r--. 1 root root 0 May  9 22:22 file32
-rw-r--r--. 1 root root 0 May  9 22:22 file33
-rw-r--r--. 1 root root 0 May  9 22:22 file34
-rw-r--r--. 1 root root 0 May  9 22:22 file35
-rw-r--r--. 1 root root 0 May  9 22:22 file36
-rw-r--r--. 1 root root 0 May  9 22:22 file37
-rw-r--r--. 1 root root 0 May  9 22:22 file38
-rw-r--r--. 1 root root 0 May  9 22:22 file39
-rw-r--r--. 1 root root 0 May  9 22:22 file4
-rw-r--r--. 1 root root 0 May  9 22:22 file40
-rw-r--r--. 1 root root 0 May  9 22:22 file41
-rw-r--r--. 1 root root 0 May  9 22:22 file42
-rw-r--r--. 1 root root 0 May  9 22:22 file43
-rw-r--r--. 1 root root 0 May  9 22:22 file44
-rw-r--r--. 1 root root 0 May  9 22:22 file45
-rw-r--r--. 1 root root 0 May  9 22:22 file46
-rw-r--r--. 1 root root 0 May  9 22:22 file47
-rw-r--r--. 1 root root 0 May  9 22:22 file48
-rw-r--r--. 1 root root 0 May  9 22:22 file49
-rw-r--r--. 1 root root 0 May  9 22:22 file5
-rw-r--r--. 1 root root 0 May  9 22:22 file50
-rw-r--r--. 1 root root 0 May  9 22:22 file51
-rw-r--r--. 1 root root 0 May  9 22:22 file52
-rw-r--r--. 1 root root 0 May  9 22:22 file53
-rw-r--r--. 1 root root 0 May  9 22:22 file54
-rw-r--r--. 1 root root 0 May  9 22:22 file55
-rw-r--r--. 1 root root 0 May  9 22:22 file56
-rw-r--r--. 1 root root 0 May  9 22:22 file57
-rw-r--r--. 1 root root 0 May  9 22:22 file58
-rw-r--r--. 1 root root 0 May  9 22:22 file59
-rw-r--r--. 1 root root 0 May  9 22:22 file6
-rw-r--r--. 1 root root 0 May  9 22:22 file60
-rw-r--r--. 1 root root 0 May  9 22:22 file61
-rw-r--r--. 1 root root 0 May  9 22:22 file62
-rw-r--r--. 1 root root 0 May  9 22:22 file63
-rw-r--r--. 1 root root 0 May  9 22:22 file64
-rw-r--r--. 1 root root 0 May  9 22:22 file65
-rw-r--r--. 1 root root 0 May  9 22:22 file66
-rw-r--r--. 1 root root 0 May  9 22:22 file67
-rw-r--r--. 1 root root 0 May  9 22:22 file68
-rw-r--r--. 1 root root 0 May  9 22:22 file69
-rw-r--r--. 1 root root 0 May  9 22:22 file7
-rw-r--r--. 1 root root 0 May  9 22:22 file70
-rw-r--r--. 1 root root 0 May  9 22:22 file71
-rw-r--r--. 1 root root 0 May  9 22:22 file72
-rw-r--r--. 1 root root 0 May  9 22:22 file73
-rw-r--r--. 1 root root 0 May  9 22:22 file74
-rw-r--r--. 1 root root 0 May  9 22:22 file75
-rw-r--r--. 1 root root 0 May  9 22:22 file76
-rw-r--r--. 1 root root 0 May  9 22:22 file77
-rw-r--r--. 1 root root 0 May  9 22:22 file78
-rw-r--r--. 1 root root 0 May  9 22:22 file79
-rw-r--r--. 1 root root 0 May  9 22:22 file8
-rw-r--r--. 1 root root 0 May  9 22:22 file80
-rw-r--r--. 1 root root 0 May  9 22:22 file81
-rw-r--r--. 1 root root 0 May  9 22:22 file82
-rw-r--r--. 1 root root 0 May  9 22:22 file83
-rw-r--r--. 1 root root 0 May  9 22:22 file84
-rw-r--r--. 1 root root 0 May  9 22:22 file85
-rw-r--r--. 1 root root 0 May  9 22:22 file86
-rw-r--r--. 1 root root 0 May  9 22:22 file87
-rw-r--r--. 1 root root 0 May  9 22:22 file88
-rw-r--r--. 1 root root 0 May  9 22:22 file89
-rw-r--r--. 1 root root 0 May  9 22:22 file9
-rw-r--r--. 1 root root 0 May  9 22:22 file90
-rw-r--r--. 1 root root 0 May  9 22:22 file91
-rw-r--r--. 1 root root 0 May  9 22:22 file92
-rw-r--r--. 1 root root 0 May  9 22:22 file93
-rw-r--r--. 1 root root 0 May  9 22:22 file94
-rw-r--r--. 1 root root 0 May  9 22:22 file95
-rw-r--r--. 1 root root 0 May  9 22:22 file96
-rw-r--r--. 1 root root 0 May  9 22:22 file97
-rw-r--r--. 1 root root 0 May  9 22:22 file98
-rw-r--r--. 1 root root 0 May  9 22:22 file99

root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:22:57 :) ⚡ /home/pk1/.scripts/gfs -c k -v ec2 -k 0
Dir: /var/lib/glusterd/vols/ec2/run/
kill -9 2800


root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:23:02 :) ⚡ gluster v status
Status of volume: ec2
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick pranithk-laptop:/home/gfs/ec_0			N/A	N	2800
Brick pranithk-laptop:/home/gfs/ec_1			49156	Y	2811
Brick pranithk-laptop:/home/gfs/ec_2			49157	Y	2822
Brick pranithk-laptop:/home/gfs/ec_3			49158	Y	2833
Brick pranithk-laptop:/home/gfs/ec_4			49159	Y	2844
Brick pranithk-laptop:/home/gfs/ec_5			49160	Y	2855
Brick pranithk-laptop:/home/gfs/ec_6			49161	Y	2866
Brick pranithk-laptop:/home/gfs/ec_7			49162	Y	2877
NFS Server on localhost					2049	Y	2890
 
Task Status of Volume ec2
------------------------------------------------------------------------------
There are no active volume tasks
 

root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:23:05 :) ⚡ ls
file0   file14  file2   file25  file30  file36  file41  file47  file52  file58  file63  file69  file74  file8   file85  file90  file96
file1   file15  file20  file26  file31  file37  file42  file48  file53  file59  file64  file7   file75  file80  file86  file91  file97
file10  file16  file21  file27  file32  file38  file43  file49  file54  file6   file65  file70  file76  file81  file87  file92  file98
file11  file17  file22  file28  file33  file39  file44  file5   file55  file60  file66  file71  file77  file82  file88  file93  file99
file12  file18  file23  file29  file34  file4   file45  file50  file56  file61  file67  file72  file78  file83  file89  file94
file13  file19  file24  file3   file35  file40  file46  file51  file57  file62  file68  file73  file79  file84  file9   file95

root@pranithk-laptop - /mnt/ec2/dht-ec-client 
22:23:11 :) ⚡

Comment 3 Fang Huang 2015-05-20 15:17:17 UTC
I tested the latest 3.6 commit, 24b3db07685fcc39f08e15b1b5e5cd1262b91590, and the bug still exists. The following is my test script.
------------------------------------------------
# cat tests/bugs/bug-1208564.t               
#!/bin/bash

. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

function file_count() {
    ls $1 |wc -l
}

cleanup

TEST glusterd
TEST pidof glusterd

#TEST mkdir -p $B0/${V0}{0,1,2,3,4,5,6,7}
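# Note: ${v0} below is lowercase and expands to an empty string, so the bricks
# end up at $B0/{0..7} (matching the kill_brick path and the volume info further
# down); ${V0} was presumably intended.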
TEST $CLI volume create $V0 disperse 4 redundancy 1 $H0:$B0/${v0}{0,1,2,3,4,5,6,7}

EXPECT "$V0" volinfo_field $V0 'Volume Name'
EXPECT 'Created' volinfo_field $V0 'Status'
EXPECT '8' brick_count $V0

TEST $CLI volume start $V0
EXPECT_WITHIN $PROCESS_UP_TIMEOUT 'Started' volinfo_field $V0 'Status'

TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id $V0 $M0

TEST touch $M0/file{0..99}

TEST kill_brick $V0 $H0 $B0/0
EXPECT_WITHIN $PROCESS_UP_TIMEOUT '100' file_count $M0

cleanup
--------------------------------------------

The run gets stuck at the final file_count check.
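
For anyone rerunning this, the script above should be runnable from a built glusterfs source tree, either directly with bash or through the project's test harness (exact harness invocation may differ between branches):

# from the top of the glusterfs source tree, as root (paths assumed)
bash tests/bugs/bug-1208564.t
# or via the bundled harness
./run-tests.sh tests/bugs/bug-1208564.t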

More details about my test:

----------
OS version:
Red Hat Enterprise Linux Server release 6.5 (Santiago) with libc-2.12
and CentOS release 6.6 (Final) with libc-2.12

# gluster --version
glusterfs 3.6.4beta1 built on May 20 2015 22:16:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

#gluster v info all
 
Volume Name: patchy
Type: Distributed-Disperse
Volume ID: 84374ec8-1aa5-4126-908d-c458d605d905
Status: Started
Number of Bricks: 2 x (3 + 1) = 8
Transport-type: tcp
Bricks:
Brick1: mn1:/d/backends/0
Brick2: mn1:/d/backends/1
Brick3: mn1:/d/backends/2
Brick4: mn1:/d/backends/3
Brick5: mn1:/d/backends/4
Brick6: mn1:/d/backends/5
Brick7: mn1:/d/backends/6
Brick8: mn1:/d/backends/7

# gluster v status
Status of volume: patchy
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick mn1:/d/backends/0                                 N/A     N       16557
Brick mn1:/d/backends/1                                 49153   Y       16568
Brick mn1:/d/backends/2                                 49154   Y       16579
Brick mn1:/d/backends/3                                 49155   Y       16590
Brick mn1:/d/backends/4                                 49156   Y       16601
Brick mn1:/d/backends/5                                 49157   Y       16612
Brick mn1:/d/backends/6                                 49158   Y       16623
Brick mn1:/d/backends/7                                 49159   Y       16634
NFS Server on localhost                                 2049    Y       16652
 
Task Status of Volume patchy
------------------------------------------------------------------------------
There are no active volume tasks


#df
Filesystem             1K-blocks      Used  Available Use% Mounted on
/dev/mapper/35000c5005e5a3337p5
                      1784110888  85086720 1608389928   6% /
tmpfs                   16416508         4   16416504   1% /dev/shm
/dev/mapper/35000c5005e5a3337p1
                          999320     72752     874140   8% /boot
/dev/mapper/35000c5005e5a3337p2
                       103081248     61028   97777340   1% /test
mn1:patchy           10704665216 510520320 9650339456   6% /mnt/glusterfs/0

# ls /mnt/glusterfs/0/

The `ls` command hangs here.

PS: On the master branch the test passes.

Comment 4 Fang Huang 2015-05-21 03:02:45 UTC
Here is some log information from /var/log/glusterfs/mnt-glusterfs-0.log, which may be helpful.

[2015-05-20 14:19:10.887997] I [client-handshake.c:188:client_set_lk_version_cbk] 0-patchy-client-4: Server lk version = 1
[2015-05-20 14:19:11.684083] W [socket.c:620:__socket_rwv] 0-patchy-client-0: readv on 192.168.1.31:49152 failed (No data available)
[2015-05-20 14:19:11.684193] I [client.c:2215:client_rpc_notify] 0-patchy-client-0: disconnected from patchy-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2015-05-20 14:19:11.688660] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.689057] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.689665] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.690359] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.691292] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.693141] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:11.693168] E [ec-common.c:443:ec_child_select] 0-patchy-disperse-0: Insufficient available childs for this request (have 0, need 1)
[2015-05-20 14:19:12.789888] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
[2015-05-20 14:19:12.790018] W [ec-common.c:407:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable (1)
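
To pull just the disperse-related messages out of the client log mentioned above (log path taken from this comment; the grep pattern is only a convenience):

grep -E 'ec_child_select|ec-common' /var/log/glusterfs/mnt-glusterfs-0.log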

Comment 5 Kaushal 2016-08-16 13:08:20 UTC
This bug is being closed as GlusterFS-3.6 is nearing its End-Of-Life and only important security bugs will be fixed. If you still face this bug with the newer GlusterFS versions, please open a new bug.