Bug 1311002 - NFS+attach tier:IOs hang while attach tier is issued
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: tiering
Version: mainline
Hardware: Unspecified   OS: Unspecified
Priority: urgent   Severity: urgent
Assigned To: Mohammed Rafi KC
bugs@gluster.org
tier-fuse-nfs-samba
: ZStream
Depends On: 1306194
Blocks: 1305205 1306930 1333645 1347524
 
Reported: 2016-02-23 02:29 EST by Mohammed Rafi KC
Modified: 2017-03-27 14:27 EDT (History)
8 users

See Also:
Fixed In Version: glusterfs-3.9.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1306194
Clones: 1333645
Environment:
Last Closed: 2017-03-27 14:27:08 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Mohammed Rafi KC 2016-02-23 02:29:25 EST
+++ This bug was initially created as a clone of Bug #1306194 +++

On a 16-node setup with an EC (disperse) volume, I started IOs from 3 different clients. While the IOs were in progress I attached a tier to the volume, and the IOs hung.

I tried this twice, and both times the IOs hung.

In 3.7.5-17 there used to be a temporary pause (about 5 minutes) when attach-tier was issued, but in this build (3.7.5-19) the IOs have been hung for more than 2 hours.




Volume create and attach-tier commands:
gluster v create npcvol disperse 12 disperse-data 8 10.70.37.202:/bricks/brick1/npcvol 10.70.37.195:/bricks/brick1/npcvol 10.70.35.133:/bricks/brick1/npcvol 10.70.35.239:/bricks/brick1/npcvol 10.70.35.225:/bricks/brick1/npcvol 10.70.35.11:/bricks/brick1/npcvol 10.70.35.10:/bricks/brick1/npcvol 10.70.35.231:/bricks/brick1/npcvol 10.70.35.176:/bricks/brick1/npcvol 10.70.35.232:/bricks/brick1/npcvol 10.70.35.173:/bricks/brick1/npcvol 10.70.35.163:/bricks/brick1/npcvol 10.70.37.101:/bricks/brick1/npcvol 10.70.37.69:/bricks/brick1/npcvol 10.70.37.60:/bricks/brick1/npcvol 10.70.37.120:/bricks/brick1/npcvol 10.70.37.202:/bricks/brick2/npcvol 10.70.37.195:/bricks/brick2/npcvol 10.70.35.133:/bricks/brick2/npcvol 10.70.35.239:/bricks/brick2/npcvol 10.70.35.225:/bricks/brick2/npcvol 10.70.35.11:/bricks/brick2/npcvol 10.70.35.10:/bricks/brick2/npcvol 10.70.35.231:/bricks/brick2/npcvol

gluster volume tier npcvol attach rep 2 10.70.35.176:/bricks/brick7/npcvol_hot 10.70.35.232:/bricks/brick7/npcvol_hot 10.70.35.173:/bricks/brick7/npcvol_hot 10.70.35.163:/bricks/brick7/npcvol_hot 10.70.37.101:/bricks/brick7/npcvol_hot 10.70.37.69:/bricks/brick7/npcvol_hot 10.70.37.60:/bricks/brick7/npcvol_hot 10.70.37.120:/bricks/brick7/npcvol_hot 10.70.37.195:/bricks/brick7/npcvol_hot 10.70.37.202:/bricks/brick7/npcvol_hot 10.70.35.133:/bricks/brick7/npcvol_hot 10.70.35.239:/bricks/brick7/npcvol_hot



[root@dhcp37-202 ~]# gluster v status npcvol
Status of volume: npcvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.202:/bricks/brick1/npcvol    49161     0          Y       628  
Brick 10.70.37.195:/bricks/brick1/npcvol    49161     0          Y       30704
Brick 10.70.35.133:/bricks/brick1/npcvol    49158     0          Y       24148
Brick 10.70.35.239:/bricks/brick1/npcvol    49158     0          Y       24128
Brick 10.70.35.225:/bricks/brick1/npcvol    49157     0          Y       24467
Brick 10.70.35.11:/bricks/brick1/npcvol     49157     0          Y       24272
Brick 10.70.35.10:/bricks/brick1/npcvol     49160     0          Y       24369
Brick 10.70.35.231:/bricks/brick1/npcvol    49160     0          Y       32189
Brick 10.70.35.176:/bricks/brick1/npcvol    49161     0          Y       1392 
Brick 10.70.35.232:/bricks/brick1/npcvol    49161     0          Y       26630
Brick 10.70.35.173:/bricks/brick1/npcvol    49161     0          Y       28493
Brick 10.70.35.163:/bricks/brick1/npcvol    49161     0          Y       28592
Brick 10.70.37.101:/bricks/brick1/npcvol    49161     0          Y       28410
Brick 10.70.37.69:/bricks/brick1/npcvol     49161     0          Y       357  
Brick 10.70.37.60:/bricks/brick1/npcvol     49161     0          Y       31071
Brick 10.70.37.120:/bricks/brick1/npcvol    49176     0          Y       1311 
Brick 10.70.37.202:/bricks/brick2/npcvol    49162     0          Y       651  
Brick 10.70.37.195:/bricks/brick2/npcvol    49162     0          Y       30723
Brick 10.70.35.133:/bricks/brick2/npcvol    49159     0          Y       24167
Brick 10.70.35.239:/bricks/brick2/npcvol    49159     0          Y       24148
Brick 10.70.35.225:/bricks/brick2/npcvol    49158     0          Y       24486
Brick 10.70.35.11:/bricks/brick2/npcvol     49158     0          Y       24291
Brick 10.70.35.10:/bricks/brick2/npcvol     49161     0          Y       24388
Brick 10.70.35.231:/bricks/brick2/npcvol    49161     0          Y       32208
Snapshot Daemon on localhost                49163     0          Y       810  
NFS Server on localhost                     2049      0          Y       818  
Self-heal Daemon on localhost               N/A       N/A        Y       686  
Quota Daemon on localhost                   N/A       N/A        Y       859  
Snapshot Daemon on 10.70.37.101             49162     0          Y       28538
NFS Server on 10.70.37.101                  2049      0          Y       28546
Self-heal Daemon on 10.70.37.101            N/A       N/A        Y       28439
Quota Daemon on 10.70.37.101                N/A       N/A        Y       28576
Snapshot Daemon on 10.70.37.195             49163     0          Y       30851
NFS Server on 10.70.37.195                  2049      0          Y       30859
Self-heal Daemon on 10.70.37.195            N/A       N/A        Y       30751
Quota Daemon on 10.70.37.195                N/A       N/A        Y       30889
Snapshot Daemon on 10.70.37.120             49177     0          Y       1438 
NFS Server on 10.70.37.120                  2049      0          Y       1446 
Self-heal Daemon on 10.70.37.120            N/A       N/A        Y       1339 
Quota Daemon on 10.70.37.120                N/A       N/A        Y       1477 
Snapshot Daemon on 10.70.37.69              49162     0          Y       492  
NFS Server on 10.70.37.69                   2049      0          Y       500  
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       385  
Quota Daemon on 10.70.37.69                 N/A       N/A        Y       542  
Snapshot Daemon on 10.70.37.60              49162     0          Y       31197
NFS Server on 10.70.37.60                   2049      0          Y       31205
Self-heal Daemon on 10.70.37.60             N/A       N/A        Y       31099
Quota Daemon on 10.70.37.60                 N/A       N/A        Y       31235
Snapshot Daemon on 10.70.35.239             49160     0          Y       24287
NFS Server on 10.70.35.239                  2049      0          Y       24295
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       24176
Quota Daemon on 10.70.35.239                N/A       N/A        Y       24325
Snapshot Daemon on 10.70.35.231             49162     0          Y       32340
NFS Server on 10.70.35.231                  2049      0          Y       32348
Self-heal Daemon on 10.70.35.231            N/A       N/A        Y       32236
Quota Daemon on 10.70.35.231                N/A       N/A        Y       32389
Snapshot Daemon on 10.70.35.176             49162     0          Y       1535 
NFS Server on 10.70.35.176                  2049      0          Y       1545 
Self-heal Daemon on 10.70.35.176            N/A       N/A        Y       1420 
Quota Daemon on 10.70.35.176                N/A       N/A        Y       1589 
Snapshot Daemon on dhcp35-225.lab.eng.blr.r
edhat.com                                   49159     0          Y       24623
NFS Server on dhcp35-225.lab.eng.blr.redhat
.com                                        2049      0          Y       24631
Self-heal Daemon on dhcp35-225.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       24514
Quota Daemon on dhcp35-225.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       24661
Snapshot Daemon on 10.70.35.232             49162     0          Y       26759
NFS Server on 10.70.35.232                  2049      0          Y       26767
Self-heal Daemon on 10.70.35.232            N/A       N/A        Y       26658
Quota Daemon on 10.70.35.232                N/A       N/A        Y       26805
Snapshot Daemon on 10.70.35.163             49162     0          Y       28721
NFS Server on 10.70.35.163                  2049      0          Y       28729
Self-heal Daemon on 10.70.35.163            N/A       N/A        Y       28620
Quota Daemon on 10.70.35.163                N/A       N/A        Y       28760
Snapshot Daemon on 10.70.35.11              49159     0          Y       24427
NFS Server on 10.70.35.11                   2049      0          Y       24435
Self-heal Daemon on 10.70.35.11             N/A       N/A        Y       24319
Quota Daemon on 10.70.35.11                 N/A       N/A        Y       24465
Snapshot Daemon on 10.70.35.10              49162     0          Y       24521
NFS Server on 10.70.35.10                   2049      0          Y       24529
Self-heal Daemon on 10.70.35.10             N/A       N/A        Y       24416
Quota Daemon on 10.70.35.10                 N/A       N/A        Y       24560
Snapshot Daemon on 10.70.35.133             49160     0          Y       24314
NFS Server on 10.70.35.133                  2049      0          Y       24322
Self-heal Daemon on 10.70.35.133            N/A       N/A        Y       24203
Quota Daemon on 10.70.35.133                N/A       N/A        Y       24352
Snapshot Daemon on 10.70.35.173             49162     0          Y       28625
NFS Server on 10.70.35.173                  2049      0          Y       28633
Self-heal Daemon on 10.70.35.173            N/A       N/A        Y       28521
Quota Daemon on 10.70.35.173                N/A       N/A        Y       28671
 
Task Status of Volume npcvol
------------------------------------------------------------------------------
There are no active volume tasks
 


#####after attach tier
[root@dhcp37-202 ~]# gluster v status npcvol
Status of volume: npcvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.35.239:/bricks/brick7/npcvol_ho
t                                           49161     0          Y       25252
Brick 10.70.35.133:/bricks/brick7/npcvol_ho
t                                           49161     0          Y       25276
Brick 10.70.37.202:/bricks/brick7/npcvol_ho
t                                           49164     0          Y       2028 
Brick 10.70.37.195:/bricks/brick7/npcvol_ho
t                                           49164     0          Y       31793
Brick 10.70.37.120:/bricks/brick7/npcvol_ho
t                                           49178     0          Y       2504 
Brick 10.70.37.60:/bricks/brick7/npcvol_hot 49163     0          Y       32188
Brick 10.70.37.69:/bricks/brick7/npcvol_hot 49163     0          Y       1548 
Brick 10.70.37.101:/bricks/brick7/npcvol_ho
t                                           49163     0          Y       29535
Brick 10.70.35.163:/bricks/brick7/npcvol_ho
t                                           49163     0          Y       29799
Brick 10.70.35.173:/bricks/brick7/npcvol_ho
t                                           49163     0          Y       29669
Brick 10.70.35.232:/bricks/brick7/npcvol_ho
t                                           49163     0          Y       27813
Brick 10.70.35.176:/bricks/brick7/npcvol_ho
t                                           49163     0          Y       2607 
Cold Bricks:
Brick 10.70.37.202:/bricks/brick1/npcvol    49161     0          Y       628  
Brick 10.70.37.195:/bricks/brick1/npcvol    49161     0          Y       30704
Brick 10.70.35.133:/bricks/brick1/npcvol    49158     0          Y       24148
Brick 10.70.35.239:/bricks/brick1/npcvol    49158     0          Y       24128
Brick 10.70.35.225:/bricks/brick1/npcvol    49157     0          Y       24467
Brick 10.70.35.11:/bricks/brick1/npcvol     49157     0          Y       24272
Brick 10.70.35.10:/bricks/brick1/npcvol     49160     0          Y       24369
Brick 10.70.35.231:/bricks/brick1/npcvol    49160     0          Y       32189
Brick 10.70.35.176:/bricks/brick1/npcvol    49161     0          Y       1392 
Brick 10.70.35.232:/bricks/brick1/npcvol    49161     0          Y       26630
Brick 10.70.35.173:/bricks/brick1/npcvol    49161     0          Y       28493
Brick 10.70.35.163:/bricks/brick1/npcvol    49161     0          Y       28592
Brick 10.70.37.101:/bricks/brick1/npcvol    49161     0          Y       28410
Brick 10.70.37.69:/bricks/brick1/npcvol     49161     0          Y       357  
Brick 10.70.37.60:/bricks/brick1/npcvol     49161     0          Y       31071
Brick 10.70.37.120:/bricks/brick1/npcvol    49176     0          Y       1311 
Brick 10.70.37.202:/bricks/brick2/npcvol    49162     0          Y       651  
Brick 10.70.37.195:/bricks/brick2/npcvol    49162     0          Y       30723
Brick 10.70.35.133:/bricks/brick2/npcvol    49159     0          Y       24167
Brick 10.70.35.239:/bricks/brick2/npcvol    49159     0          Y       24148
Brick 10.70.35.225:/bricks/brick2/npcvol    49158     0          Y       24486
Brick 10.70.35.11:/bricks/brick2/npcvol     49158     0          Y       24291
Brick 10.70.35.10:/bricks/brick2/npcvol     49161     0          Y       24388
Brick 10.70.35.231:/bricks/brick2/npcvol    49161     0          Y       32208
Snapshot Daemon on localhost                49163     0          Y       810  
NFS Server on localhost                     2049      0          Y       2048 
Self-heal Daemon on localhost               N/A       N/A        Y       2056 
Quota Daemon on localhost                   N/A       N/A        Y       2064 
Snapshot Daemon on 10.70.37.60              49162     0          Y       31197
NFS Server on 10.70.37.60                   2049      0          Y       32208
Self-heal Daemon on 10.70.37.60             N/A       N/A        Y       32216
Quota Daemon on 10.70.37.60                 N/A       N/A        Y       32224
Snapshot Daemon on 10.70.37.195             49163     0          Y       30851
NFS Server on 10.70.37.195                  2049      0          Y       31813
Self-heal Daemon on 10.70.37.195            N/A       N/A        Y       31821
Quota Daemon on 10.70.37.195                N/A       N/A        Y       31829
Snapshot Daemon on 10.70.37.120             49177     0          Y       1438 
NFS Server on 10.70.37.120                  2049      0          Y       2524 
Self-heal Daemon on 10.70.37.120            N/A       N/A        Y       2532 
Quota Daemon on 10.70.37.120                N/A       N/A        Y       2540 
Snapshot Daemon on 10.70.37.101             49162     0          Y       28538
NFS Server on 10.70.37.101                  2049      0          Y       29555
Self-heal Daemon on 10.70.37.101            N/A       N/A        Y       29563
Quota Daemon on 10.70.37.101                N/A       N/A        Y       29571
Snapshot Daemon on 10.70.37.69              49162     0          Y       492  
NFS Server on 10.70.37.69                   2049      0          Y       1574 
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       1582 
Quota Daemon on 10.70.37.69                 N/A       N/A        Y       1590 
Snapshot Daemon on 10.70.35.173             49162     0          Y       28625
NFS Server on 10.70.35.173                  2049      0          Y       29690
Self-heal Daemon on 10.70.35.173            N/A       N/A        Y       29698
Quota Daemon on 10.70.35.173                N/A       N/A        Y       29713
Snapshot Daemon on 10.70.35.231             49162     0          Y       32340
NFS Server on 10.70.35.231                  2049      0          Y       1022 
Self-heal Daemon on 10.70.35.231            N/A       N/A        Y       1033 
Quota Daemon on 10.70.35.231                N/A       N/A        Y       1043 
Snapshot Daemon on 10.70.35.176             49162     0          Y       1535 
NFS Server on 10.70.35.176                  2049      0          Y       2627 
Self-heal Daemon on 10.70.35.176            N/A       N/A        Y       2635 
Quota Daemon on 10.70.35.176                N/A       N/A        Y       2659 
Snapshot Daemon on 10.70.35.239             49160     0          Y       24287
NFS Server on 10.70.35.239                  2049      0          Y       25272
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       25280
Quota Daemon on 10.70.35.239                N/A       N/A        Y       25288
Snapshot Daemon on dhcp35-225.lab.eng.blr.r
edhat.com                                   49159     0          Y       24623
NFS Server on dhcp35-225.lab.eng.blr.redhat
.com                                        2049      0          Y       25622
Self-heal Daemon on dhcp35-225.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       25630
Quota Daemon on dhcp35-225.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       25638
Snapshot Daemon on 10.70.35.11              49159     0          Y       24427
NFS Server on 10.70.35.11                   2049      0          Y       25455
Self-heal Daemon on 10.70.35.11             N/A       N/A        Y       25463
Quota Daemon on 10.70.35.11                 N/A       N/A        Y       25471
Snapshot Daemon on 10.70.35.133             49160     0          Y       24314
NFS Server on 10.70.35.133                  2049      0          Y       25296
Self-heal Daemon on 10.70.35.133            N/A       N/A        Y       25304
Quota Daemon on 10.70.35.133                N/A       N/A        Y       25312
Snapshot Daemon on 10.70.35.10              49162     0          Y       24521
NFS Server on 10.70.35.10                   2049      0          Y       25578
Self-heal Daemon on 10.70.35.10             N/A       N/A        Y       25586
Quota Daemon on 10.70.35.10                 N/A       N/A        Y       25594
Snapshot Daemon on 10.70.35.232             49162     0          Y       26759
NFS Server on 10.70.35.232                  2049      0          Y       27833
Self-heal Daemon on 10.70.35.232            N/A       N/A        Y       27841
Quota Daemon on 10.70.35.232                N/A       N/A        Y       27866
Snapshot Daemon on 10.70.35.163             49162     0          Y       28721
NFS Server on 10.70.35.163                  2049      0          Y       29819
Self-heal Daemon on 10.70.35.163            N/A       N/A        Y       29827
Quota Daemon on 10.70.35.163                N/A       N/A        Y       29852
 
Task Status of Volume npcvol
------------------------------------------------------------------------------
Task                 : Tier migration      
ID                   : 524ad8fe-a743-47df-a4e9-edd2db05c60b
Status               : in progress         
 





The following IOs were triggered before the attach and were still running while the tier was attached:

1) client1: created a 300MB file and started copying it to new files
for i in {2..50};do cp hlfile.1 hlfile.$i;done

2) client2: created a 50MB file and repeatedly copied it to new names
for i in {2..1000};do cp rename.1 rename.$i;done

3) client3: Linux kernel untar, plus copying a 3GB file to create new files in a loop
for i in {1..10};do cp File.mkv cheema$i.mkv;done

4) client4: created 10000 zero-byte files, then triggered removal of 5000 of them so that the deletes were still running during attach-tier
[root@rhs-client30 zerobyte]# rm -rf zb{5000..10000}
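For reference, the client workloads above can be combined into one scaled-down local sketch (sizes and counts are reduced from the report's 300MB/50MB/3GB originals, and a scratch directory stands in for the mount point):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# client1: seed a file, then copy it out to many new files
dd if=/dev/zero of=hlfile.1 bs=1024 count=32 2>/dev/null
for i in {2..10}; do cp hlfile.1 hlfile.$i; done

# client2: seed a file, then copy it to new names in a loop
# (the report describes this as a rename loop; the pasted command copies)
dd if=/dev/zero of=rename.1 bs=1024 count=16 2>/dev/null
for i in {2..10}; do cp rename.1 rename.$i; done

# client4: many zero-byte files, then remove half of them
touch zb{1..100}
rm -f zb{51..100}
```

Pointed at an NFS mount of the volume instead of a scratch directory, and scaled back up, this exercises the same IO mix that hung during attach-tier.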

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-02-10 04:45:45 EST ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2016-02-10 05:18:43 EST ---

sosreports of both clients and servers are available at:
[nchilaka@rhsqe-repo nchilaka]$ chmod -R 0777 bug.1306194
[nchilaka@rhsqe-repo nchilaka]$ pwd
/home/repo/sosreports/nchilaka

--- Additional comment from Mohammed Rafi KC on 2016-02-10 11:13:31 EST ---

There is a blocking lock held on one of the bricks which was not released; all other clients were waiting on this lock. We couldn't identify the owner of the lock because, by the time we looked, the ping timer had expired and the lock had been released.

After that the I/Os resumed. We need to find out which client acquired the lock and why it was not released.
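On a live hang, a brick statedump would show the held and blocked inodelks before the ping timer releases them. A sketch using gluster's standard statedump facility (volume name from this report; /var/run/gluster is the default statedump directory and may differ per build):

```shell
# Trigger statedumps for all bricks of the volume; each brick process
# writes a dump file into the statedump directory.
gluster volume statedump npcvol

# Look for inodelk entries: granted entries name the client holding the
# lock, blocked entries show who is waiting on it.
grep -B2 -A6 "inodelk" /var/run/gluster/*.dump.*
```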

--- Additional comment from Soumya Koduri on 2016-02-11 09:00:05 EST ---

When we tried to reproduce the issue, we saw "Stale file handle" errors after attach-tier. When we did RCA using gdb, we found that ESTALE is returned via svc_client (which is enabled by USS). So we disabled USS and re-ran the test. Now the mount points hang.


On the server side, the volume got unexported -

[skoduri@skoduri ~]$ showmount -e 10.70.35.225
Export list for 10.70.35.225:
[skoduri@skoduri ~]$ 


Tracing back through the logs and the code:

[2016-02-11 13:26:02.540565] E [MSGID: 112070] [nfs3.c:896:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: finalvol
[2016-02-11 13:28:02.600425] E [MSGID: 112070] [nfs3.c:896:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: finalvol
[2016-02-11 13:28:02.600546] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully


This message is logged when the volume is not in the nfs->initedxl[] list. That list is updated as part of "nfs_startup_subvolume()", which is invoked on notify of "GF_EVENT_CHILD_UP". So we suspect the nfs xlator never received this event, which left the volume unexported. Attaching the nfs log for further debugging.

--- Additional comment from Mohammed Rafi KC on 2016-02-11 09:55:09 EST ---

During nfs graph initialization, we do a lookup on the root. It looks like this lookup is blocked on a lock held by another nfs process. We need to figure out why the nfs server that acquired the lock failed to unlock it.

--- Additional comment from Raghavendra G on 2016-02-12 01:43:36 EST ---

Rafi reported that stale-lock/unlock failures are seen even while the first lookup on root is happening. Here is the most likely RCA. I am assuming "tier-dht" has two dht subvols, "hot-dht" and "cold-dht", and that the stale lock is found on one of the bricks belonging to hot-dht.

1. Lookup on / on tier-dht.
2. Lookup is wound to hashed subvol - cold-dht and is successful.
3. tier-dht figures out / is a directory and does a lookup on both hot-dht and cold-dht.
4. on hot-dht, some subvols - say c1, c2 - are down. But lookup is still successful as some other subvols (say c3, c4) are up.
5. lookup on / is successful on cold-dht.
6. tier-dht decides it needs to heal layout of "/".

From here I am skipping events on cold-dht as they are irrelevant for this RCA.

7. tier-dht winds inodelk on hot-dht. hot-dht winds it to first subvol in the layout-list (Say c1 in this case). Note that subvols with 0 ranges are stored in the beginning of the list. All the subvols where lookup failed (say because of ENOTCONN) ends up with 0 ranges. The relative order of subvols with 0 ranges is undefined and depends on whose lookup failed first.
8. c1 comes up
9. hot-dht acquires lock on c1.
10. tier-dht tries to refresh its layout of /. Winds lookup on hot and cold dhts again.
11. hot-dht sees that the layout's generation number lags the current generation number (as c1 came up after the lookup on / completed). It issues a fresh lookup and reconstructs the layout for /. Since c2 is still down, it is pushed to the beginning of the layout's subvol list.
12. tier-dht is done with healing. It issues unlock on hot-dht.
13. hot-dht winds unlock call to first subvol in layout of /, which is c2.
14. unlock fails with ENOTCONN and a stale lock is left on c1.
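The list bookkeeping in steps 7-14 can be modeled with a toy shell sketch (subvol names from the comment above; this mirrors only the "first subvol in the layout list" selection, not DHT itself):

```shell
# Layout subvol list at lock time: c1 and c2 are down, so they carry
# 0 ranges and sit at the front of the list; c1 happens to come first.
layout="c1 c2 c3 c4"
lock_target=${layout%% *}      # inodelk is wound to the first subvol: c1
# c1 then comes up, so the lock is granted and held on c1.

# Layout refresh during healing: c2 is still down, so the rebuilt list
# pushes it to the front.
layout="c2 c1 c3 c4"
unlock_target=${layout%% *}    # unlock is wound to the first subvol: c2

# unlock goes to c2 (down, ENOTCONN); the lock on c1 is never released.
echo "locked on $lock_target, unlocked on $unlock_target"
```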

--- Additional comment from Raghavendra G on 2016-02-12 01:46:00 EST ---

Steps 7 and 8 can be swapped for more clarity; the RCA is still valid.

--- Additional comment from Mohammed Rafi KC on 2016-02-12 05:04:36 EST ---

Based on comment 6, this could be an intrusive fix that requires testing for pure dht as well as tier. A way to recover from the hang without interrupting application continuity is to restart the nfs server, which can be done with volume start force. This restarts only the nfs server, provided no other process requires a restart.
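The restart-only-NFS workaround mentioned above can be issued as follows; a sketch, assuming the volume from this report (start force restarts dead daemons such as the NFS server without disturbing bricks that are already online):

```shell
# Restart the gluster NFS server (and any other dead daemon) without
# touching bricks that are already running.
gluster volume start npcvol force

# Verify the volume is exported over NFS again.
showmount -e localhost
```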

--- Additional comment from Laura Bailey on 2016-02-14 21:45:57 EST ---

Rafi, based on https://bugzilla.redhat.com/show_bug.cgi?id=1303045#c3, I shouldn't document this as a known issue, right?

--- Additional comment from Mohammed Rafi KC on 2016-02-15 01:15:28 EST ---

Yes, I have included this as part of the bug 1303045 .

--- Additional comment from Laura Bailey on 2016-02-15 20:07:44 EST ---

Thanks Rafi, removing this from the tracker bug.

--- Additional comment from nchilaka on 2016-02-17 01:17:25 EST ---

Workaround testing: I tested the workaround by restarting the volume using force.
The IOs resumed, which means the workaround is fine, but there is a small problem that has been discussed, for which bz#1309186 ("file creates fail with 'failed to open <filename>: Too many levels of symbolic links' for file create/write when restarting NFS using vol start force") has been raised.
Comment 1 Vijay Bellur 2016-02-23 02:36:21 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#1) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 2 Vijay Bellur 2016-02-27 13:42:49 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#2) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 3 Vijay Bellur 2016-03-04 00:56:10 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#3) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 4 Vijay Bellur 2016-03-04 11:45:18 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#4) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 5 Vijay Bellur 2016-03-08 16:49:15 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#5) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 6 Vijay Bellur 2016-03-09 01:56:50 EST
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#6) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 7 Vijay Bellur 2016-03-15 02:15:00 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#7) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 8 Vijay Bellur 2016-03-16 08:13:52 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#8) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 9 Vijay Bellur 2016-05-03 07:01:09 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#9) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 10 Vijay Bellur 2016-05-04 08:42:37 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#10) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 11 Vijay Bellur 2016-05-05 06:54:34 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#11) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 12 Vijay Bellur 2016-05-05 09:34:58 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#12) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 13 Vijay Bellur 2016-05-05 13:33:55 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#13) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 14 Vijay Bellur 2016-05-06 01:16:30 EDT
REVIEW: http://review.gluster.org/13492 (dht:remember locked subvol and send unlock to the same) posted (#14) for review on master by mohammed rafi  kc (rkavunga@redhat.com)
Comment 15 Vijay Bellur 2016-05-06 04:54:32 EDT
COMMIT: http://review.gluster.org/13492 committed in master by Raghavendra G (rgowdapp@redhat.com) 
------
commit ef0db52bc55a51fe5e3856235aed0230b6a188fe
Author: Mohammed Rafi KC <rkavunga@redhat.com>
Date:   Tue May 3 14:43:20 2016 +0530

    dht:remember locked subvol and send unlock to the same
    
    During locking we send lock request to cached subvol,
    and normally we unlock to the cached subvol
    But with parallel fresh lookup on a directory, there
    is a race window where the cached subvol can change
    and the unlock can go into a different subvol from
    which we took lock.
    
    This will result in a stale lock held on one of the
    subvol.
    
    So we will store the details of subvol which we took the lock
    and will unlock from the same subvol
    
    Change-Id: I47df99491671b10624eb37d1d17e40bacf0b15eb
    BUG: 1311002
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
    Reviewed-on: http://review.gluster.org/13492
    Reviewed-by: N Balachandran <nbalacha@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Comment 16 Shyamsundar 2017-03-27 14:27:08 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/
