Bug 1400093

Summary: ls and move hung on disperse volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: disperse
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, aspandey, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1402710 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:53:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1351528, 1402710, 1404572, 1412922
Attachments: possible fix (Flags: none)

Description Nag Pavan Chilakam 2016-11-30 12:49:58 UTC
Description of problem:
======================
Two bricks in each of the two subvolumes were down, and I started creating files from three clients in the following fashion (a rough script sketch of this workload is included below):
c1: under dir1, touch f{1..10000}
c2: under /, touch x{1..1000}
c3: after the c2 action completed, started touch x{1001..10000}
I then brought up all the bricks that were down.
Then, while c1 and c3 were still running, on c2 I did a move of all the x* files into dir1.
At this point the mv hung forever.

After all files were written and healing had completed, I did an ls on / from c1 and c3; they are also not responding.
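
For illustration, a minimal shell sketch of the workload above. It is an approximation run from one host; the mount points /mnt/c1, /mnt/c2 and /mnt/c3 and the use of "gluster volume start ... force" to bring the downed bricks back are assumptions, while the file-name patterns come from the description.

# Assumed client mount points of the "disperse" volume; the original run used
# three separate client machines.
C1=/mnt/c1; C2=/mnt/c2; C3=/mnt/c3

# c1: keep creating files under dir1 in the background
mkdir -p "$C1"/dir1
( cd "$C1"/dir1 && touch f{1..10000} ) &

# c2: first batch of files directly under the volume root
( cd "$C2" && touch x{1..1000} )

# c3: once c2's batch finishes, continue the sequence in the background
( cd "$C3" && touch x{1001..10000} ) &

# Bring the killed bricks back online (run on a server node); "start force"
# restarts offline bricks and is an assumption about how it was done here.
# gluster volume start disperse force

# c2: while c1 and c3 are still running, move all x* files into dir1
# (this is the mv that hung in the reported run).
( cd "$C2" && mv x* dir1/ )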


Version-Release number of selected component (if applicable):
=======
3.8.4-5


[root@dhcp35-37 ~]# gluster v info disperse
 
Volume Name: disperse
Type: Distributed-Disperse
Volume ID: 9620f72c-f8bb-4e05-98e4-d1a2132db8a7
Status: Started
Snapshot Count: 2
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/disperse
Brick2: 10.70.35.116:/rhs/brick2/disperse
Brick3: 10.70.35.239:/rhs/brick2/disperse
Brick4: 10.70.35.135:/rhs/brick2/disperse
Brick5: 10.70.35.8:/rhs/brick2/disperse
Brick6: 10.70.35.196:/rhs/brick2/disperse
Brick7: 10.70.35.37:/rhs/brick3/disperse
Brick8: 10.70.35.116:/rhs/brick3/disperse
Brick9: 10.70.35.239:/rhs/brick3/disperse
Brick10: 10.70.35.135:/rhs/brick3/disperse
Brick11: 10.70.35.8:/rhs/brick3/disperse
Brick12: 10.70.35.196:/rhs/brick3/disperse
Options Reconfigured:
features.barrier: disable
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-37 ~]# 

[root@dhcp35-37 ~]# gluster v status disperse
Status of volume: disperse
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/disperse      49153     0          Y       32685
Brick 10.70.35.116:/rhs/brick2/disperse     49152     0          Y       13804
Brick 10.70.35.239:/rhs/brick2/disperse     49152     0          Y       6677 
Brick 10.70.35.135:/rhs/brick2/disperse     49153     0          Y       6698 
Brick 10.70.35.8:/rhs/brick2/disperse       49153     0          Y       6617 
Brick 10.70.35.196:/rhs/brick2/disperse     49153     0          Y       6700 
Brick 10.70.35.37:/rhs/brick3/disperse      49154     0          Y       32704
Brick 10.70.35.116:/rhs/brick3/disperse     49154     0          Y       13823
Brick 10.70.35.239:/rhs/brick3/disperse     49154     0          Y       6696 
Brick 10.70.35.135:/rhs/brick3/disperse     49154     0          Y       6717 
Brick 10.70.35.8:/rhs/brick3/disperse       49154     0          Y       6636 
Brick 10.70.35.196:/rhs/brick3/disperse     49154     0          Y       6719 
Snapshot Daemon on localhost                49155     0          Y       13153
Self-heal Daemon on localhost               N/A       N/A        Y       23366
Quota Daemon on localhost                   N/A       N/A        Y       13090
Snapshot Daemon on 10.70.35.8               49155     0          Y       20959
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       32233
Quota Daemon on 10.70.35.8                  N/A       N/A        Y       20907
Snapshot Daemon on 10.70.35.116             49155     0          Y       26181
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       2799 
Quota Daemon on 10.70.35.116                N/A       N/A        Y       26127
Snapshot Daemon on 10.70.35.239             49155     0          Y       21107
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       32373
Quota Daemon on 10.70.35.239                N/A       N/A        Y       21056
Snapshot Daemon on 10.70.35.196             49155     0          Y       21087
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       32376
Quota Daemon on 10.70.35.196                N/A       N/A        Y       21036
Snapshot Daemon on 10.70.35.135             49155     0          Y       21068
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       32324
Quota Daemon on 10.70.35.135                N/A       N/A        Y       21018
 
Task Status of Volume disperse
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : e401b7fd-1ef6-4d1f-a818-228cba36a128
Status               : completed           
 
[root@dhcp35-37 ~]# kill -9 32685
[root@dhcp35-37 ~]# kill -9 32704
[root@dhcp35-37 ~]# gluster v status|grep disp|grep 37
Volume halo is not started
Brick 10.70.35.37:/rhs/brick2/disperse      N/A       N/A        N       N/A  
Brick 10.70.35.37:/rhs/brick3/disperse      N/A       N/A        N       N/A  
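
For repeatability, the two PIDs killed above are the local brick processes of this node, taken from the earlier volume status output. A hedged sketch of selecting and killing them programmatically; it assumes the PID is printed in the last column of the status output, as shown above:

# Kill the brick processes of the "disperse" volume running on 10.70.35.37,
# skipping bricks that are already offline (PID shown as "N/A").
for pid in $(gluster volume status disperse | awk '/^Brick 10.70.35.37/ && $NF != "N/A" {print $NF}'); do
    kill -9 "$pid"
done
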
[root@dhcp35-37 ~]# gluster v heal disperse info|grep ntries
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
[root@dhcp35-37 ~]# 

Comment 2 Nag Pavan Chilakam 2016-11-30 12:51:46 UTC
The client log where mv got hung showed the following:
[2016-11-30 12:18:40.514765] W [MSGID: 122015] [ec-common.c:1774:ec_unlocked] 3-disperse-disperse-0: entry/inode unlocking failed (RENAME) [Invalid argument]
[2016-11-30 12:18:40.534720] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 3-disperse-disperse-1: Operation failed on some subvolumes (up=3F, mask=2E, remaining=0, good=2E, bad=11)
[2016-11-30 12:18:40.538990] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 3-disperse-disperse-0: Operation failed on some subvolumes (up=3F, mask=2E, remaining=0, good=2E, bad=11)
[2016-11-30 12:18:32.135232] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 3-disperse-disperse-1: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 3-disperse-disperse-0: Heal failed [Invalid argument]" repeated 3 times between [2016-11-30 12:18:32.108297] and [2016-11-30 12:18:32.135236]

Comment 3 Nag Pavan Chilakam 2016-11-30 13:05:38 UTC
sosreports
[qe@rhsqe-repo nchilaka]$ pwd
/var/www/html/sosreports/nchilaka
[qe@rhsqe-repo nchilaka]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo nchilaka]$ cd bug.1400093
[qe@rhsqe-repo bug.1400093]$ ls
clients  servers
[qe@rhsqe-repo bug.1400093]$

Comment 4 Ashish Pandey 2016-12-05 10:33:08 UTC
This bug is easily reproducible on a plain disperse volume too, without killing any brick.

[root@apandey gluster]# gvi
 
Volume Name: vol
Type: Disperse
Volume ID: c3e903e0-e7b5-42a3-9e75-798c4e3268a0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: apandey:/brick/gluster/vol-1
Brick2: apandey:/brick/gluster/vol-2
Brick3: apandey:/brick/gluster/vol-3
Brick4: apandey:/brick/gluster/vol-4
Brick5: apandey:/brick/gluster/vol-5
Brick6: apandey:/brick/gluster/vol-6
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


1 - Mount this volume on 3 mount points: c1, c2, and c3.
2 - On c1: mkdir /c1/dir ; cd dir
3 - On c2, touch 4000 files on the mount point, i.e. "/".
4 - After step 3, start touching the next 4000 files on c2 on the mount point, i.e. "/".
5 - On c1, start touching 10000 files under /dir/.
6 - On c3, start moving the 4000 files created in step 3 from the mount point to /dir/.
7 - On c3, from a different console, start ls in a loop.

All the clients will hang after some time (within 3-4 minutes). A rough script sketch of these steps follows.
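
A minimal shell sketch of the steps above (the mount points /c1, /c2 and /c3 are taken from the steps; the a-N file names are assumed from the dht_rename log further down, and the f-N names are placeholders):

# Step 2: create the target directory through c1
mkdir -p /c1/dir

# Steps 3-4: create 4000 files on the volume root through c2, then the next 4000
( cd /c2 && touch a-{1..4000} )
( cd /c2 && touch a-{4001..8000} ) &

# Step 5: create 10000 files under /dir through c1
( cd /c1/dir && touch f-{1..10000} ) &

# Step 6: from c3, move the first 4000 files into /dir
( cd /c3 && for f in a-{1..4000}; do mv "$f" dir/; done ) &

# Step 7: from c3, in a separate console, run ls in a loop
while true; do ls /c3 > /dev/null; done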


I took a statedump and found the following inode lock contention:

lock-dump.domain.domain=dht.file.migrate
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 12892, owner=c4582de9117f0000, client=0x7f0c800b0430, connection-id=apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0, granted at 2016-12-05 07:37:05
lock-dump.domain.domain=vol-disperse-0

[conn.0.bound_xl./brick/gluster/vol-1.active.2]
gfid=6a7c71ba-9814-41ad-9975-3a7ef4e97d60
nlookup=2528
fd-count=0
ref=9179
ia_type=2

[xlator.features.locks.vol-locks.inode]
path=/dir
mandatory=0
inodelk-count=2
lock-dump.domain.domain=dht.layout.heal
lock-dump.domain.domain=vol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 12892, owner=6c20b6d5117f0000, client=0x7f0c800b0430, connection-id=apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0, blocked at 2016-12-05 07:37:05, granted at 2016-12-05 07:37:05
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 12894, owner=e870234b277f0000, client=0x7f0c800ac5f0, connection-id=apandey-29029-2016/12/05-07:32:51:253616-vol-client-0-0-0, blocked at 2016-12-05 07:37:05

[conn.0.bound_xl./brick/gluster/vol-1.active.3]
gfid=00000000-0000-0000-0000-000000000001
nlookup=1
fd-count=1
ref=1
ia_type=2

[xlator.features.locks.vol-locks.inode]
path=/
mandatory=0
inodelk-count=3
lock-dump.domain.domain=dht.layout.heal
lock-dump.domain.domain=vol-disperse-0:self-heal
lock-dump.domain.domain=vol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 12885, owner=6c20b6d5117f0000, client=0x7f0c800b0430, connection-id=apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0, blocked at 2016-12-05 07:37:05, granted at 2016-12-05 07:37:05
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 12890, owner=ac4c67f4257f0000, client=0x7f0c800a87b0, connection-id=apandey-28922-2016/12/05-07:32:35:356027-vol-client-0-0-0, blocked at 2016-12-05 07:37:05
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 12892, owner=6825b6d5117f0000, client=0x7f0c800b0430, connection-id=apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0, blocked at 2016-12-05 07:37:05

[conn.0.bound_xl./brick/gluster/vol-1.lru.1]
gfid=00000000-0000-0000-0000-000000000005
nlookup=6
fd-count=0
ref=0
ia_type=2
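
For reference, brick statedumps like the one above can be triggered through the gluster CLI; the dump location below is the usual default and is an assumption here (it can be changed with server.statedump-path):

# Trigger a statedump of all brick processes of the volume "vol".
gluster volume statedump vol

# The dumps typically land under /var/run/gluster/ on each brick host,
# named after the brick path and PID.
ls /var/run/gluster/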



On the first brick log I see the following failed unlock request:


[2016-12-05 07:32:58.401408] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-vol-server: accepted client from apandey-29029-2016/12/05-07:32:51:253616-vol-client-0-0-0 (version: 3.8.5)
[2016-12-05 07:33:10.774116] I [login.c:76:gf_auth] 0-auth/login: allowed user names: 0b9b19e0-5b6f-4401-8e1e-53fb21a2fd9d
[2016-12-05 07:33:10.774138] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-vol-server: accepted client from apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0 (version: 3.8.5)
[2016-12-05 07:37:05.212110] E [inodelk.c:404:__inode_unlock_lock] 0-vol-locks:  Matching lock not found for unlock 0-9223372036854775807, by 6825b6d5117f0000 on 0x7f0c800b0430
[2016-12-05 07:37:05.212146] E [MSGID: 115053] [server-rpc-fops.c:278:server_inodelk_cbk] 0-vol-server: 45050: INODELK / (00000000-0000-0000-0000-000000000001), client: apandey-29101-2016/12/05-07:33:03:610275-vol-client-0-0-0, error-xlator: vol-locks [Invalid argument]


Only in the c3 logs did I find this error:


[2016-12-05 07:37:05.212211] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-1: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212268] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-2: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212315] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-3: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212345] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-0: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212381] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-5: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212387] E [MSGID: 114031] [client-rpc-fops.c:1549:client3_3_inodelk_cbk] 0-vol-client-4: remote operation failed [Invalid argument]
[2016-12-05 07:37:05.212414] W [MSGID: 122015] [ec-common.c:1749:ec_unlocked] 0-vol-disperse-0: entry/inode unlocking failed (RENAME) [Invalid argument]
[2016-12-05 07:37:05.222999] I [MSGID: 109066] [dht-rename.c:1562:dht_rename] 0-vol-dht: renaming /a-2501 (hash=vol-disperse-0/cache=vol-disperse-0) => /dir/a-2501 (hash=vol-disperse-0/cache=<nul>)

Comment 5 Pranith Kumar K 2016-12-06 13:53:47 UTC
Created attachment 1228478 [details]
possible fix

I found one race which could cause this bug: the lk-owner changes just before the inodelk unlock.
Rename takes two locks: 1) on the parent directory of the source file, and 2) on the parent directory of the destination file/location. The lk-owner of these two locks is different, i.e. the address of the respective ctx->inode_lock member. After the rename, at the time of unlock, it sets the lk-owner on the same rename fop frame and unwinds. But it can so happen that, just before ec_inodelk is called, the owner is changed to that of the second parent inode's inode_lock. This can cause this problem.

I am a bit confused about the steps; Ashish, I will take your help tomorrow at 10 AM.

Comment 7 Pranith Kumar K 2016-12-07 03:40:10 UTC
(In reply to Pranith Kumar K from comment #5)
> Created attachment 1228478 [details]
> possible fix
> 
> I found one race which could cause this bug: the lk-owner changes just
> before the inodelk unlock.
> Rename takes two locks: 1) on the parent directory of the source file, and
> 2) on the parent directory of the destination file/location. The lk-owner
> of these two locks is different, i.e. the address of the respective
> ctx->inode_lock member. After the rename, at the time of unlock, it sets
> the lk-owner on the same rename fop frame and unwinds. But it can so
> happen that, just before ec_inodelk is called, the owner is changed to
> that of the second parent inode's inode_lock. This can cause this problem.
> 
> I am a bit confused about the steps; Ashish, I will take your help
> tomorrow at 10 AM.

Ashish found that this patch leads to a crash. I just checked, and the fix is a bit more involved than anticipated.

Comment 12 Nag Pavan Chilakam 2016-12-22 10:32:25 UTC
Verified the scenarios below on 3.8.4-9 and didn't find any hang; hence moving to Verified.

TC#1:
Two bricks in each of the two subvolumes were down, and I started creating files as follows:
c1: under dir1, touch f{1..10000}
c2: under /, touch x{1..1000}
c3: after the c2 action completed, started touch x{1001..10000}
I then brought up all the bricks that were down.
Then, while c1 and c3 were still running, on c2 I did a move of all the x* files into dir1.

Behavior: ls and mv didn't hang.


TC#2: no bricks down
1 - Mount this volume on 3 mount points: c1, c2, and c3.
2 - On c1: mkdir /c1/dir ; cd dir
3 - On c2, touch 4000 files on the mount point, i.e. "/".
4 - After step 3, start touching the next 4000 files on c2 on the mount point, i.e. "/".
5 - On c1, start touching 10000 files under /dir/.
6 - On c3, start moving the 4000 files created in step 3 from the mount point to /dir/.
7 - On c3, from a different console, start ls in a loop.

Behavior: ls and mv didn't hang.

Comment 13 Nag Pavan Chilakam 2017-01-23 14:12:48 UTC
Test cases added to Polarion:
RHG3-11786 - BZ#1400093 -TC#1: simultaneous issue of lookup and write (ls and move) should not hang on disperse volume (All bricks up)
RHG3-11787 - BZ#1400093 -TC#2: simultaneous issue of lookup and write (ls and move) should not hang on disperse volume (few bricks down)

Comment 15 errata-xmlrpc 2017-03-23 05:53:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html