Bug 1205709 - ls command blocked when one brick disconnect, even reconnect.
Summary: ls command blocked when one brick disconnect, even reconnect.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 1210193
Depends On:
Blocks:
Reported: 2015-03-25 13:46 UTC by jiademing.dd
Modified: 2017-08-11 09:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-11 09:07:04 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description jiademing.dd 2015-03-25 13:46:56 UTC
Description of problem:
   I run ten dd processes that continuously write to ten files under the mount point, like this:
      dd if=/dev/zero of=/mountpoint/test/cnt.bak bs=1M &   (cnt = 1, 2, 3, ..., 10)

   I disconnect one brick and then run ls /mountpoint/test. The command blocks for longer than the timeout. After I reconnect the brick, ls stays blocked.
   When I run kill `pidof dd`, the ls command returns.

Version-Release number of selected component (if applicable):
  This problem also occurs in 3.6.2.

How reproducible:


Steps to Reproduce:
1. Create a disperse volume with 3 bricks and redundancy 1:

Volume Name: test
Type: Disperse
Volume ID: bfdbfc8e-3dcc-4459-a1e4-9de17df03db5
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: node-1:/sda/
Brick2: node-2:/sdb/
Brick3: node-3:/sdc/
2. mkdir /mountpoint/test, then run dd if=/dev/zero of=/mountpoint/test/cnt.bak bs=1M & (cnt = 1, 2, 3, ..., 10)
3. Disconnect node-3, then run ls /mountpoint/test
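
The write load and the hang check from the steps above can be sketched as a small helper script. This is only an illustration: the function names, the N.bak file naming (one reading of "cnt.bak"), and the 60 second limit are assumptions, not part of the original report. The optional third argument exists only so the writers can be bounded when trying the script out.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction load; names and defaults here are hypothetical.

# Start COUNT background dd writers under DIR and print their PIDs.
# If BLOCKS is given, each writer stops after that many 1 MiB blocks;
# otherwise it writes until killed, as in the report.
start_writers() {
    local dir=$1 count=${2:-10} blocks=${3:-} i
    mkdir -p "$dir" || return 1
    for i in $(seq 1 "$count"); do
        # ${blocks:+count=$blocks} expands to nothing when blocks is empty.
        dd if=/dev/zero of="$dir/$i.bak" bs=1M ${blocks:+count=$blocks} \
            >/dev/null 2>&1 &
        echo "$!"
    done
}

# Run ls on DIR with a time limit. Exit status 124 means ls was still
# blocked when the limit expired (assumes ls can be interrupted by the
# signal that timeout sends).
ls_with_timeout() {
    local dir=$1 limit=${2:-60}
    timeout "$limit" ls "$dir" >/dev/null
}
```

A possible use: run start_writers /mountpoint/test, disconnect one brick, then run ls_with_timeout /mountpoint/test and check whether the exit status is 124.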

Actual results:
The ls /mountpoint/test command blocks indefinitely.

Expected results:
The ls /mountpoint/test command should return promptly.

Additional info:

Comment 1 jiademing.dd 2015-04-03 03:08:48 UTC
in heal.c:
case EC_STATE_HEAL_PRE_INODELK_LOCK:
            if (heal->done)
                    return EC_STATE_HEAL_DISPATCH;

            /* Only heal data/metadata if enough information is supplied. */
            if (uuid_is_null(heal->loc.gfid))
            {
                ec_heal_entrylk(heal, ENTRYLK_UNLOCK);

                return EC_STATE_HEAL_DISPATCH;
            }

            ec_heal_inodelk(heal, F_WRLCK, 0, 0, 0);

            return EC_STATE_HEAL_PRE_INODE_LOOKUP;

After the entry heal, the heal code does not release the parent's entrylk before it tries to take the inodelk needed to heal attributes and xattrs. Because of lock_reuse, dd holds the inodelk the whole time. So the heal cannot get the inodelk, yet it keeps holding the parent's entrylk. The ls command therefore blocks, because opendir needs to take that entrylk.
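
The lock ordering described above can be illustrated with a toy model. This is a sketch only: locks are modeled as directories, "acquire" is a non-blocking mkdir, and a failed acquire stands for "would block". None of the function names exist in GlusterFS, and the "release" variant is just a contrast case, not necessarily how the problem was fixed.

```shell
#!/usr/bin/env bash
# Toy model of the reported lock ordering; all names are hypothetical.

LOCKDIR=$(mktemp -d)

# mkdir is atomic: it succeeds only if the lock is currently free.
try_lock() { mkdir "$LOCKDIR/$1" 2>/dev/null; }
unlock()   { rmdir "$LOCKDIR/$1" 2>/dev/null; }

# simulate hold    : heal keeps the parent's entrylk while waiting for the inodelk
# simulate release : heal drops the entrylk before asking for the inodelk
# Prints whether ls (opendir) could take the entrylk.
simulate() {
    local mode=$1 outcome
    try_lock inodelk            # dd: lock reuse keeps the inodelk held
    try_lock entrylk            # heal: entry heal takes the parent's entrylk
    if [ "$mode" = release ]; then
        unlock entrylk          # contrast case: give up the entrylk first
    fi
    try_lock inodelk || true    # heal: cannot get the inodelk, dd still has it
    if try_lock entrylk; then   # ls: opendir needs the entrylk
        outcome="ls returns"
    else
        outcome="ls blocked"
    fi
    unlock entrylk              # reset the model for the next run
    unlock inodelk
    echo "$outcome"
}
```

In this model, simulate hold reports that ls is blocked for as long as dd keeps the inodelk, which matches the observation that killing dd lets ls return.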

Comment 2 Pranith Kumar K 2015-05-09 17:25:22 UTC
With code on master, hang is not observed anymore:
root@pranithk-laptop - ~ 
22:50:28 :) ⚡ glusterd && gluster volume create ec2 disperse 3 redundancy 1 `hostname`:/home/gfs/ec_{0..2} force && gluster volume start ec2 && mount -t glusterfs `hostname`:/ec2 /mnt/ec2
volume create: ec2: success: please start the volume to access data
volume start: ec2: success

root@pranithk-laptop - ~ 
22:50:44 :) ⚡ cd /mnt/ec2

root@pranithk-laptop - /mnt/ec2 
22:51:50 :) ⚡ for i in {1..10}; do eval "dd of=$i if=/dev/zero bs=1M &"; done
[1] 16673
[2] 16674
[3] 16675
[4] 16676
[5] 16677
[6] 16678
[7] 16679
[8] 16680
[9] 16681
[10] 16682

root@pranithk-laptop - /mnt/ec2 
22:52:48 :) ⚡ /home/pk1/.scripts/gfs -c k -v ec2 -k 0
Dir: /var/lib/glusterd/vols/ec2/run/
kill -9 16511


root@pranithk-laptop - /mnt/ec2 
22:52:53 :) ⚡ ls -l
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 5: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 374232
??????????? ? ?    ?           ?            ? 1
??????????? ? ?    ?           ?            ? 10
-rw-r--r--. 1 root root 96468992 May  9 22:52 2
??????????? ? ?    ?           ?            ? 3
-rw-r--r--. 1 root root 99614720 May  9 22:52 4
??????????? ? ?    ?           ?            ? 5
??????????? ? ?    ?           ?            ? 6
??????????? ? ?    ?           ?            ? 7
-rw-r--r--. 1 root root 88997888 May  9 22:52 8
-rw-r--r--. 1 root root 98041856 May  9 22:52 9

root@pranithk-laptop - /mnt/ec2 
22:52:55 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 10: Input/output error
ls: cannot access 8: Input/output error
ls: cannot access 1: Input/output error
ls: cannot access 3: Input/output error
total 99328
??????????? ? ?    ?            ?            ? 1
??????????? ? ?    ?            ?            ? 10
??????????? ? ?    ?            ?            ? 2
??????????? ? ?    ?            ?            ? 3
??????????? ? ?    ?            ?            ? 4
-rw-r--r--. 1 root root 101580800 May  9 22:52 5
??????????? ? ?    ?            ?            ? 6
??????????? ? ?    ?            ?            ? 7
??????????? ? ?    ?            ?            ? 8
??????????? ? ?    ?            ?            ? 9

root@pranithk-laptop - /mnt/ec2 
22:52:57 :( ⚡ ls -l
ls: cannot access 9: Input/output error
ls: cannot access 4: Input/output error
ls: cannot access 2: Input/output error
ls: cannot access 7: Input/output error
ls: cannot access 6: Input/output error
ls: cannot access 3: Input/output error
total 436736
-rw-r--r--. 1 root root 112459776 May  9 22:52 1
-rw-r--r--. 1 root root 114163712 May  9 22:52 10
??????????? ? ?    ?            ?            ? 2
??????????? ? ?    ?            ?            ? 3
??????????? ? ?    ?            ?            ? 4
-rw-r--r--. 1 root root 113770496 May  9 22:52 5
??????????? ? ?    ?            ?            ? 6
??????????? ? ?    ?            ?            ? 7
-rw-r--r--. 1 root root 106823680 May  9 22:52 8
??????????? ? ?    ?            ?            ? 9


root@pranithk-laptop - /mnt/ec2 
22:53:06 :( ⚡ ls -l
total 1170176
-rw-r--r--. 1 root root 118882304 May  9 22:53 1
-rw-r--r--. 1 root root 120193024 May  9 22:53 10
-rw-r--r--. 1 root root 121634816 May  9 22:53 2
-rw-r--r--. 1 root root 114163712 May  9 22:53 3
-rw-r--r--. 1 root root 125173760 May  9 22:53 4
-rw-r--r--. 1 root root 120586240 May  9 22:53 5
-rw-r--r--. 1 root root 115343360 May  9 22:53 6
-rw-r--r--. 1 root root 122683392 May  9 22:53 7
-rw-r--r--. 1 root root 111935488 May  9 22:53 8
-rw-r--r--. 1 root root 127664128 May  9 22:53 9

root@pranithk-laptop - /mnt/ec2 
22:53:07 :) ⚡ killall dd
[1]   Terminated              dd of=1 if=/dev/zero bs=1M
[4]   Terminated              dd of=4 if=/dev/zero bs=1M
[5]   Terminated              dd of=5 if=/dev/zero bs=1M
[10]+  Terminated              dd of=10 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 
22:53:18 :) ⚡ 
[2]   Terminated              dd of=2 if=/dev/zero bs=1M
[3]   Terminated              dd of=3 if=/dev/zero bs=1M
[6]   Terminated              dd of=6 if=/dev/zero bs=1M
[7]   Terminated              dd of=7 if=/dev/zero bs=1M
[8]-  Terminated              dd of=8 if=/dev/zero bs=1M
[9]+  Terminated              dd of=9 if=/dev/zero bs=1M

root@pranithk-laptop - /mnt/ec2 
22:53:19 :) ⚡

Comment 3 Pranith Kumar K 2015-05-09 17:33:40 UTC
*** Bug 1210193 has been marked as a duplicate of this bug. ***

Comment 4 jiademing.dd 2015-05-13 04:50:40 UTC
You should restart the brick that you killed, to trigger a heal.
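
One way to do that from the command line is sketched below, as an assumption rather than the exact procedure the reporter used. It relies on the gluster CLI commands volume start ... force (which starts brick processes that are down) and volume heal. Whether volume heal applies to disperse volumes may depend on the GlusterFS version, so that part should be verified on the release under test. The function name and the GLUSTER override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: bring a killed brick back and request a heal.
# GLUSTER can be overridden (for example GLUSTER=echo) to preview the commands.
GLUSTER=${GLUSTER:-gluster}

restart_brick_and_heal() {
    local vol=$1
    # "start ... force" on an already started volume respawns dead bricks.
    $GLUSTER volume start "$vol" force || return 1
    # Ask for a heal of the volume (version support is an assumption).
    $GLUSTER volume heal "$vol"
}
```

For the volume in comment 2 this would be restart_brick_and_heal ec2, run while the dd writers are still active, followed by another ls on the mount point.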

Comment 5 Ashish Pandey 2017-08-11 09:07:04 UTC
The given scenario is working fine on current master and the bug is not reproducible.

