Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1205709

Summary:	ls command blocks when one brick disconnects, even after it reconnects.
Product:	[Community] GlusterFS	Reporter:	jiademing.dd <iesool>
Component:	disperse	Assignee:	Pranith Kumar K <pkarampu>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	amukherj, annair, aspandey, bugs, fanghuang.data, iesool
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-11 09:07:04 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description jiademing.dd 2015-03-25 13:46:56 UTC

Description of problem:
   I use ten dd processes to write continuously to ten files in the mount point, like this:
      for cnt in {1..10}; do dd if=/dev/zero of=/mountpoint/test/$cnt.bak bs=1M & done

   I disconnected one brick, then ran ls /mountpoint/test. The command blocked (for longer than the timeout). I reconnected that brick, but the ls command was still blocked.
   When I killed the dd processes (kill `pidof dd`), the ls command returned.

Version-Release number of selected component (if applicable):
  This problem also occurs in 3.6.2.

How reproducible:


Steps to Reproduce:
1. Create a disperse 3 redundancy 1 volume:

Volume Name: test
Type: Disperse
Volume ID: bfdbfc8e-3dcc-4459-a1e4-9de17df03db5
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: node-1:/sda/
Brick2: node-2:/sdb/
Brick3: node-3:/sdc/
2. mkdir /mountpoint/test, then run: for cnt in {1..10}; do dd if=/dev/zero of=/mountpoint/test/$cnt.bak bs=1M & done
3. Disconnect node-3, then run ls /mountpoint/test
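
The steps above can be sketched as a small script. This is only a sketch under assumptions: MOUNT, COUNT, LS_LIMIT, DRY_RUN and the function names are illustrative and do not come from the report, the volume is assumed to be created and mounted already, and disconnecting the brick (step 3) is left as a manual action. By default the script only prints the commands it would run.

```shell
# Illustrative settings (not from the report).
MOUNT="${MOUNT:-/mountpoint}"
COUNT="${COUNT:-10}"
LS_LIMIT="${LS_LIMIT:-60}"   # seconds to wait for ls; arbitrary value
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only, 0 = execute them

# Run a command in the foreground, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Start a command in the background, or print it in dry-run mode.
run_bg() {
    if [ "$DRY_RUN" = "1" ]; then echo "$* &"; else "$@" & fi
}

# Step 2: create the directory and start COUNT continuous writers.
start_writers() {
    run mkdir -p "$MOUNT/test"
    cnt=1
    while [ "$cnt" -le "$COUNT" ]; do
        run_bg dd if=/dev/zero of="$MOUNT/test/$cnt.bak" bs=1M
        cnt=$((cnt + 1))
    done
}

# Step 3 (after the brick has been disconnected by hand): list the
# directory with an upper bound so a hang is reported instead of
# blocking the shell. GNU timeout exits with 124 when the limit is hit.
check_ls() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "timeout $LS_LIMIT ls $MOUNT/test"
        return 0
    fi
    timeout "$LS_LIMIT" ls "$MOUNT/test" > /dev/null
    if [ "$?" -eq 124 ]; then
        echo "ls did not return within ${LS_LIMIT}s"
    fi
}
```

To use it, call start_writers, disconnect the brick, then call check_ls; set DRY_RUN=0 to actually execute the commands.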

Actual results:
The ls /mountpoint/test command blocks indefinitely.

Expected results:
The ls /mountpoint/test command should return promptly.

Additional info:

Comment 1 jiademing.dd 2015-04-03 03:08:48 UTC

in heal.c:
        case EC_STATE_HEAL_PRE_INODELK_LOCK:
            if (heal->done)
                return EC_STATE_HEAL_DISPATCH;

            /* Only heal data/metadata if enough information is supplied. */
            if (uuid_is_null(heal->loc.gfid))
            {
                ec_heal_entrylk(heal, ENTRYLK_UNLOCK);

                return EC_STATE_HEAL_DISPATCH;
            }

            ec_heal_inodelk(heal, F_WRLCK, 0, 0, 0);

            return EC_STATE_HEAL_PRE_INODE_LOOKUP;

After healing the entry, the heal does not unlock the parent's entrylk before it tries to take the inodelk that it needs to heal attributes and xattrs. But because of lock_reuse, dd holds the inodelk for as long as it keeps writing. So the heal cannot get the inodelk, yet it still holds the parent's entrylk. The ls command is therefore blocked, because opendir needs to get that entrylk.
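
The dependency described above can be written down as a tiny model. This is a sketch for illustration only, not GlusterFS code: the function names and the lock labels (entrylk:parent, inodelk:file) are made up here, and the lock table simply encodes the situation from this comment. Following "who waits for which lock" and "who holds that lock" gives the chain that keeps ls blocked.

```shell
# holder_of LOCK: which actor currently holds the lock (hypothetical table
# encoding the state described in this comment).
holder_of() {
    case "$1" in
        entrylk:parent) echo heal ;;  # heal kept the parent's entrylk
        inodelk:file)   echo dd ;;    # dd keeps the inodelk (lock reuse)
    esac
}

# waits_for ACTOR: which lock the actor is blocked on (empty if none).
waits_for() {
    case "$1" in
        ls)   echo entrylk:parent ;;  # opendir needs the entrylk
        heal) echo inodelk:file ;;    # heal needs the inodelk
        dd)   echo "" ;;              # dd is not waiting, it keeps writing
    esac
}

# blocking_chain ACTOR: follow the waits until an actor is not blocked.
blocking_chain() {
    actor="$1"
    chain="$actor"
    while lock="$(waits_for "$actor")" && [ -n "$lock" ]; do
        actor="$(holder_of "$lock")"
        chain="$chain -> $actor"
    done
    echo "$chain"
}

blocking_chain ls   # prints: ls -> heal -> dd
```

The chain ends at dd, which is consistent with the observation in the description that ls returns once the dd processes are killed.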

Comment 2 Pranith Kumar K 2015-05-09 17:25:22 UTC

With code on master, hang is not observed anymore:
root@pranithk-laptop - ~ 
22:50:28 :) ⚡ glusterd && gluster volume create ec2 disperse 3 redundancy 1 `hostname`:/home/gfs/ec_{0..2} force && gluster volume start ec2 && mount -t glusterfs `hostname`:/ec2 /mnt/ec2
volume create: ec2: success: please start the volume to access data
volume start: ec2: success

root@pranithk-laptop - ~ 
22:50:44 :) ⚡ cd /mnt/ec2

root@pranithk-laptop - /mnt/ec2 
22:51:50 :) ⚡ for i in {1..10}; do eval "dd of=$i if=/dev/zero bs=1M &"; done
[1] 16673
[2] 16674
[3] 16675
[4] 16676
[5] 16677
[6] 16678
[7] 16679
[8] 16680
[9] 16681
[10] 16682

root@pranithk-laptop - /mnt/ec2 
22:52:48 :) ⚡ /home/pk1/.scripts/gfs -c k -v ec2 -k 0
Dir: /var/lib/glusterd/vols/ec2/run/
kill -9 16511


root@pranithk-laptop - /mnt/ec2 
22:52:53 :) ⚡ ls -l
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 5: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 374232
??????????? ? ?    ?           ?            ? 1
??????????? ? ?    ?           ?            ? 10
-rw-r--r--. 1 root root 96468992 May  9 22:52 2
??????????? ? ?    ?           ?            ? 3
-rw-r--r--. 1 root root 99614720 May  9 22:52 4
??????????? ? ?    ?           ?            ? 5
??????????? ? ?    ?           ?            ? 6
??????????? ? ?    ?           ?            ? 7
-rw-r--r--. 1 root root 88997888 May  9 22:52 8
-rw-r--r--. 1 root root 98041856 May  9 22:52 9

root@pranithk-laptop - /mnt/ec2 
22:52:55 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 8: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 99328
??????????? ? ?    ?            ?            ? 1
??????????? ? ?    ?            ?            ? 10
??????????? ? ?    ?            ?            ? 2
??????????? ? ?    ?            ?            ? 3
??????????? ? ?    ?            ?            ? 4
-rw-r--r--. 1 root root 101580800 May  9 22:52 5
??????????? ? ?    ?            ?            ? 6
??????????? ? ?    ?            ?            ? 7
??????????? ? ?    ?            ?            ? 8
??????????? ? ?    ?            ?            ? 9

root@pranithk-laptop - /mnt/ec2 
22:52:57 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 3: Input/output error
total 436736
-rw-r--r--. 1 root root 112459776 May  9 22:52 1
-rw-r--r--. 1 root root 114163712 May  9 22:52 10
??????????? ? ?    ?            ?            ? 2
??????????? ? ?    ?            ?            ? 3
??????????? ? ?    ?            ?            ? 4
-rw-r--r--. 1 root root 113770496 May  9 22:52 5
??????????? ? ?    ?            ?            ? 6
??????????? ? ?    ?            ?            ? 7
-rw-r--r--. 1 root root 106823680 May  9 22:52 8
??????????? ? ?    ?            ?            ? 9


root@pranithk-laptop - /mnt/ec2 
22:53:06 :( ⚡ ls -l
total 1170176
-rw-r--r--. 1 root root 118882304 May  9 22:53 1
-rw-r--r--. 1 root root 120193024 May  9 22:53 10
-rw-r--r--. 1 root root 121634816 May  9 22:53 2
-rw-r--r--. 1 root root 114163712 May  9 22:53 3
-rw-r--r--. 1 root root 125173760 May  9 22:53 4
-rw-r--r--. 1 root root 120586240 May  9 22:53 5
-rw-r--r--. 1 root root 115343360 May  9 22:53 6
-rw-r--r--. 1 root root 122683392 May  9 22:53 7
-rw-r--r--. 1 root root 111935488 May  9 22:53 8
-rw-r--r--. 1 root root 127664128 May  9 22:53 9

root@pranithk-laptop - /mnt/ec2 
22:53:07 :) ⚡ killall dd
[1]   Terminated              dd of=1 if=/dev/zero bs=1M
[4]   Terminated              dd of=4 if=/dev/zero bs=1M
[5]   Terminated              dd of=5 if=/dev/zero bs=1M
[10]+  Terminated              dd of=10 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 
22:53:18 :) ⚡ 
[2]   Terminated              dd of=2 if=/dev/zero bs=1M
[3]   Terminated              dd of=3 if=/dev/zero bs=1M
[6]   Terminated              dd of=6 if=/dev/zero bs=1M
[7]   Terminated              dd of=7 if=/dev/zero bs=1M
[8]-  Terminated              dd of=8 if=/dev/zero bs=1M
[9]+  Terminated              dd of=9 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 
22:53:19 :) ⚡

Comment 3 Pranith Kumar K 2015-05-09 17:33:40 UTC

*** Bug 1210193 has been marked as a duplicate of this bug. ***

Comment 4 jiademing.dd 2015-05-13 04:50:40 UTC

You should restart the brick that you killed, so that heal is triggered.
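
A possible way to add that step to the session from Comment 2 is sketched below. It assumes the ec2 volume and /mnt/ec2 mount used there, and it assumes that heal is started when the files are looked up again from the mount after the brick is back; VOLNAME, MOUNT, DRY_RUN and the function names are illustrative. By default the commands are only printed.

```shell
# Illustrative settings matching the session in Comment 2.
VOLNAME="${VOLNAME:-ec2}"
MOUNT="${MOUNT:-/mnt/ec2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Run a command, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

restart_brick_and_list() {
    # "volume start ... force" asks glusterd to start brick processes
    # of the volume that are not currently running.
    run gluster volume start "$VOLNAME" force
    # List the directory again while the dd writers are still running;
    # this is the command that hung in the original report.
    run ls -l "$MOUNT"
}
```

Call restart_brick_and_list after the brick has been killed and while the dd processes are still writing; set DRY_RUN=0 to execute the commands.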

Comment 5 Ashish Pandey 2017-08-11 09:07:04 UTC

The given scenario is working fine on current master and the bug is not reproducible.