Description of problem:
I use ten dd processes to write continuously to ten files in the mount point, like this:

    dd if=/dev/zero of=/mountpoint/test/cnt.bak bs=1M &    (cnt = 1, 2, 3, ..., 10)

I disconnected one brick and then ran ls /mountpoint/test. The command blocked (for longer than the timeout). I reconnected that brick, and the ls command was still blocked. After I ran kill `pidof dd`, the ls command returned.

Version-Release number of selected component (if applicable):
This problem also occurs in 3.6.2.

How reproducible:

Steps to Reproduce:
1. Create a disperse 3 redundancy 1 volume:

    Volume Name: test
    Type: Disperse
    Volume ID: bfdbfc8e-3dcc-4459-a1e4-9de17df03db5
    Status: Started
    Number of Bricks: 1 x (2 + 1) = 3
    Transport-type: tcp
    Bricks:
    Brick1: node-1:/sda/
    Brick2: node-2:/sdb/
    Brick3: node-3:/sdc/

2. mkdir /mountpoint/test, then run dd if=/dev/zero of=/mountpoint/test/cnt.bak bs=1M & (cnt = 1, 2, 3, ..., 10)
3. Disconnect node-3, then run ls /mountpoint/test

Actual results:
The ls /mountpoint/test command stays blocked.

Expected results:
The ls /mountpoint/test command returns promptly.

Additional info:
In heal.c:

    case EC_STATE_HEAL_PRE_INODELK_LOCK:
        if (heal->done)
            return EC_STATE_HEAL_DISPATCH;

        /* Only heal data/metadata if enough information is supplied. */
        if (uuid_is_null(heal->loc.gfid)) {
            ec_heal_entrylk(heal, ENTRYLK_UNLOCK);
            return EC_STATE_HEAL_DISPATCH;
        }

        ec_heal_inodelk(heal, F_WRLCK, 0, 0, 0);
        return EC_STATE_HEAL_PRE_INODE_LOOKUP;

After healing the entry, heal does not unlock the parent entrylk before it tries to take the inodelk needed to heal attrs and xattrs. But lock_reuse makes dd hold the inodelk the whole time, so heal cannot get the inodelk while it still holds its parent's entrylk. The ls command is therefore blocked, because opendir needs to get the entrylk.
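The dependency chain described above can be illustrated with a small toy model. This is only a sketch for reasoning about who waits on whom. It is not GlusterFS code, and every name in it (struct model, blocked_behind, reported_state, entrylk_released_first) is made up for illustration. It assumes each lock has a single holder and each actor waits on at most one lock.

```c
#include <assert.h>

/* Toy model only: these names are illustrative, not GlusterFS identifiers. */
enum { DD, HEAL, LS, NACTORS };
enum { PARENT_ENTRYLK, FILE_INODELK, NLOCKS };
#define NOBODY  (-1)
#define NOTHING (-1)

struct model {
    int holder[NLOCKS];      /* actor holding each lock, or NOBODY */
    int waiting_on[NACTORS]; /* lock each actor is blocked on, or NOTHING */
};

/* Follow the wait-for chain from `actor`; return 1 if it leads to `root`. */
int blocked_behind(const struct model *m, int actor, int root)
{
    assert(actor >= 0 && actor < NACTORS);
    for (int hops = 0; hops < NACTORS; hops++) {
        int lk = m->waiting_on[actor];
        if (lk == NOTHING)
            return 0;            /* actor is not waiting at all */
        int h = m->holder[lk];
        if (h == NOBODY)
            return 0;            /* lock is free, so the wait will end */
        if (h == root)
            return 1;            /* chain reaches the root holder */
        actor = h;               /* keep following the chain */
    }
    return 0;
}

/* State from the report: dd keeps the inodelk, heal keeps the parent
 * entrylk while waiting for the inodelk, and ls waits for the entrylk. */
struct model reported_state(void)
{
    struct model m;
    m.holder[PARENT_ENTRYLK] = HEAL;
    m.holder[FILE_INODELK]   = DD;
    m.waiting_on[DD]   = NOTHING;
    m.waiting_on[HEAL] = FILE_INODELK;
    m.waiting_on[LS]   = PARENT_ENTRYLK;
    return m;
}

/* Hypothetical alternative ordering (an assumption, not the actual fix):
 * heal drops the parent entrylk before it asks for the inodelk. */
struct model entrylk_released_first(void)
{
    struct model m = reported_state();
    m.holder[PARENT_ENTRYLK] = NOBODY;
    return m;
}
```

In this model, blocked_behind(&m, LS, DD) is 1 for reported_state(), matching the observation that ls only returns once dd is killed. With the hypothetical ordering in entrylk_released_first(), ls no longer depends on dd, although heal still waits for the inodelk.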
With code on master, hang is not observed anymore:

root@pranithk-laptop - ~ 22:50:28 :) ⚡ glusterd && gluster volume create ec2 disperse 3 redundancy 1 `hostname`:/home/gfs/ec_{0..2} force && gluster volume start ec2 && mount -t glusterfs `hostname`:/ec2 /mnt/ec2
volume create: ec2: success: please start the volume to access data
volume start: ec2: success

root@pranithk-laptop - ~ 22:50:44 :) ⚡ cd /mnt/ec2

root@pranithk-laptop - /mnt/ec2 22:51:50 :) ⚡ for i in {1..10}; do eval "dd of=$i if=/dev/zero bs=1M &"; done
[1] 16673
[2] 16674
[3] 16675
[4] 16676
[5] 16677
[6] 16678
[7] 16679
[8] 16680
[9] 16681
[10] 16682

root@pranithk-laptop - /mnt/ec2 22:52:48 :) ⚡ /home/pk1/.scripts/gfs -c k -v ec2 -k 0
Dir: /var/lib/glusterd/vols/ec2/run/
kill -9 16511

root@pranithk-laptop - /mnt/ec2 22:52:53 :) ⚡ ls -l
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 5: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 374232
??????????? ? ? ? ? ? 1
??????????? ? ? ? ? ? 10
-rw-r--r--. 1 root root 96468992 May 9 22:52 2
??????????? ? ? ? ? ? 3
-rw-r--r--. 1 root root 99614720 May 9 22:52 4
??????????? ? ? ? ? ? 5
??????????? ? ? ? ? ? 6
??????????? ? ? ? ? ? 7
-rw-r--r--. 1 root root 88997888 May 9 22:52 8
-rw-r--r--. 1 root root 98041856 May 9 22:52 9

root@pranithk-laptop - /mnt/ec2 22:52:55 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 8: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 99328
??????????? ? ? ? ? ? 1
??????????? ? ? ? ? ? 10
??????????? ? ? ? ? ? 2
??????????? ? ? ? ? ? 3
??????????? ? ? ? ? ? 4
-rw-r--r--. 1 root root 101580800 May 9 22:52 5
??????????? ? ? ? ? ? 6
??????????? ? ? ? ? ? 7
??????????? ? ? ? ? ? 8
??????????? ? ? ? ? ? 9

root@pranithk-laptop - /mnt/ec2 22:52:57 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 3: Input/output error
total 436736
-rw-r--r--. 1 root root 112459776 May 9 22:52 1
-rw-r--r--. 1 root root 114163712 May 9 22:52 10
??????????? ? ? ? ? ? 2
??????????? ? ? ? ? ? 3
??????????? ? ? ? ? ? 4
-rw-r--r--. 1 root root 113770496 May 9 22:52 5
??????????? ? ? ? ? ? 6
??????????? ? ? ? ? ? 7
-rw-r--r--. 1 root root 106823680 May 9 22:52 8
??????????? ? ? ? ? ? 9

root@pranithk-laptop - /mnt/ec2 22:53:06 :( ⚡ ls -l
total 1170176
-rw-r--r--. 1 root root 118882304 May 9 22:53 1
-rw-r--r--. 1 root root 120193024 May 9 22:53 10
-rw-r--r--. 1 root root 121634816 May 9 22:53 2
-rw-r--r--. 1 root root 114163712 May 9 22:53 3
-rw-r--r--. 1 root root 125173760 May 9 22:53 4
-rw-r--r--. 1 root root 120586240 May 9 22:53 5
-rw-r--r--. 1 root root 115343360 May 9 22:53 6
-rw-r--r--. 1 root root 122683392 May 9 22:53 7
-rw-r--r--. 1 root root 111935488 May 9 22:53 8
-rw-r--r--. 1 root root 127664128 May 9 22:53 9

root@pranithk-laptop - /mnt/ec2 22:53:07 :) ⚡ killall dd
[1] Terminated dd of=1 if=/dev/zero bs=1M
[4] Terminated dd of=4 if=/dev/zero bs=1M
[5] Terminated dd of=5 if=/dev/zero bs=1M
[10]+ Terminated dd of=10 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 22:53:18 :) ⚡
[2] Terminated dd of=2 if=/dev/zero bs=1M
[3] Terminated dd of=3 if=/dev/zero bs=1M
[6] Terminated dd of=6 if=/dev/zero bs=1M
[7] Terminated dd of=7 if=/dev/zero bs=1M
[8]- Terminated dd of=8 if=/dev/zero bs=1M
[9]+ Terminated dd of=9 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 22:53:19 :) ⚡
*** Bug 1210193 has been marked as a duplicate of this bug. ***
You should restart the brick that you killed, to trigger heal.
The given scenario is working fine on current master and the bug is not reproducible.