Description of problem:
git operations fail after an add-brick operation. This is because data is looked up even on the newly added bricks, which should not happen.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a distribute volume and mount it over FUSE.
2. Create a git repo on the mount point.
3. In the git repo, add files and commit them:
   for i in `seq 1 10000`; do dd if=/dev/urandom of=ddfile_$i bs=1M count=1; git add ddfile_$i; git commit -m "ddfile"; done
4. Perform an add-brick operation on the volume.
5. Once add-brick is done, git operations start failing.

Actual results:
git operations fail with the following error message:

1048576 bytes (1.0 MB) copied, 0.20487 s, 5.1 MB/s
fatal: Unable to read current working directory: No such file or directory
fatal: Unable to read current working directory: No such file or directory
1+0 records in
1+0 records out

Expected results:
There should be no errors.

Additional info:
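The reproduction loop from step 3 can be wrapped into a small script. This is a minimal sketch, not part of the original report: the mount-point path and the `repro` helper name are assumptions, and the file count is a parameter so a scaled-down run is possible.

```shell
#!/bin/sh
# Sketch of the reproducer: commit many 1 MB files one at a time inside
# a git repo on the (assumed) FUSE mount, so that an add-brick done in
# parallel lands in the middle of the commit stream.
repro() {
    mnt=$1      # e.g. /mnt/glustervol (assumed mount point)
    count=$2    # number of files; the report used 10000

    mkdir -p "$mnt/repo" || return 1
    cd "$mnt/repo" || return 1
    git init -q .

    i=1
    while [ "$i" -le "$count" ]; do
        # 1 MB of random data per file, each committed individually.
        dd if=/dev/urandom of="ddfile_$i" bs=1M count=1 2>/dev/null
        git add "ddfile_$i"
        git commit -q -m "ddfile"
        i=$((i + 1))
    done
}

# Example: repro /mnt/glustervol 10000
```

Running `gluster volume add-brick` from another shell while this loop is in flight is what triggers the "Unable to read current working directory" failures described above.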
Created attachment 935306 [details] Client log
On a Windows client (Samba), add-brick results in permission denied errors (I/O errors).

Errors on the client:
cp: cannot create regular file ‘971.docx’: Permission denied
cp: cannot create regular file ‘974.docx’: Permission denied
cp: cannot create regular file ‘975.docx’: Permission denied
cp: cannot create regular file ‘977.docx’: Permission denied
cp: cannot create regular file ‘978.docx’: Permission denied
cp: cannot create regular file ‘979.docx’: Permission denied
cp: cannot create regular file ‘983.docx’: Permission denied
cp: cannot create regular file ‘984.docx’: Permission denied
cp: cannot create regular file ‘985.docx’: Permission denied
cp: cannot create regular file ‘986.docx’: Permission denied
cp: cannot create regular file ‘987.docx’: Permission denied
cp: cannot create regular file ‘989.docx’: Permission denied
cp: cannot create regular file ‘991.docx’: Permission denied
cp: cannot create regular file ‘993.docx’: Permission denied

From /var/log/samba/glusterfs-testvol.10.70.35.48.log:
[2014-09-08 12:31:34.958229] W [client-rpc-fops.c:2677:client3_3_opendir_cbk] 2-testvol-client-2: remote operation failed: Stale file handle. Path: /rhsdata01/ms (b34da6ae-b39f-431d-a111-06be74d44caa)
[2014-09-08 12:31:35.010739] W [client-rpc-fops.c:2677:client3_3_opendir_cbk] 2-testvol-client-2: remote operation failed: Stale file handle. Path: /rhsdata01/ms (b34da6ae-b39f-431d-a111-06be74d44caa)
[2014-09-08 12:31:36.014934] W [client-rpc-fops.c:2677:client3_3_opendir_cbk] 2-testvol-client-2: remote operation failed: Stale file handle. Path: /rhsdata01/ms (b34da6ae-b39f-431d-a111-06be74d44caa)
[2014-09-08 12:31:37.063339] W [client-rpc-fops.c:2677:client3_3_opendir_cbk] 2-testvol-client-2: remote operation failed: Stale file handle. Path: /rhsdata01/ms (b34da6ae-b39f-431d-a111-06be74d44caa)
Here is the issue:
1. Create a 6x2 volume and start it.
2. Enable quota.
3. Set a quota limit of 800GB on "/".
4. Mount the volume over NFS.
5. touch <mount-point>/a
6. mkdir <mount-point>/dir
7. Start iozone -a.
8. Add a brick:
[root@nfs1 ~]# gluster volume add-brick vol0 10.70.37.156:/bricks/d1r1-add 10.70.37.81:/bricks/d1r2-add
volume add-brick: success
9. Start rebalance with force:
[root@nfs1 ~]# gluster volume rebalance vol0 start force
volume rebalance: vol0: success: Starting rebalance on volume vol0 has been successful.
ID: 2adeb93f-54bb-42ad-97df-9bade1e1c847
10. Check the iozone execution on the mount point.

Result: iozone fails:
4096    64  1194472 1818012 288408 283895 3340563 2154808  50890 5306022 187451 2254337 2708986 240291 2329851
4096   128  1897111 2985308 405790 5867831 7642951 3263773  84846 4399673 222428 2213100 2639075 297523 5020702
4096   256  1110972 2711124 298542 2594046 3182161 1667656 106100 3680502 594916 1023232 1231813 270307 4971306
4096   512  1577320 2522443 268591 5032468 5894001 3136268 205025 3771811 357922 2031724 2925823 311078 4487004
4096  1024  1231195 2609016 184329 4779081 6013663 3330202 200962 4639689 386196 1628607 2222548 333934 5418140
4096  2048  1990298 1758645 349370 5127086 4545172 2212815 296942 4297324 219165 1565819 3512686 383626 3650781
4096  4096  2031724 2972395 309063 5013377 7816828 3657777 345533 3806915 315147 2542602 2059243 298563 3893185
8192     4  Error writing block 1844, fd= 3
write: No such file or directory

iozone: interrupted

exiting iozone

real    1m2.843s
user    0m0.551s
sys     0m13.098s

[root@nfs1 ~]# gluster volume rebalance vol0 status
        Node  Rebalanced-files        size     scanned    failures     skipped      status   run time in secs
   ---------       -----------  ----------  ----------  ----------  ----------  ----------   ----------------
   localhost                 1      0Bytes           3           0           0   completed               0.00
 10.70.37.81                 0      0Bytes           2           0           0   completed               0.00
 10.70.37.50                 0      0Bytes           2           0           0   completed               0.00
 10.70.37.95                 0      0Bytes           2           0           0   completed               1.00
volume rebalance: vol0: success:
Please review and sign off on the edited doc text.
RCA:
====
After add-brick, a named lookup was not sent on the directory in question, so the directory was not created on the new brick. However, opendir is wound to that node, and hence fails with ENOENT/ESTALE.

Fix:
====
Two solutions are possible:
1. Send a named lookup during resolution of that inode in the new graph.
2. Make sure opendir is sent only to those bricks which were part of the parent directory's layout in the previous graph.

The tricky part of option 2 is finding the parent of the inode on which opendir is done, since during opendir we only get the gfid of the directory (without the pargfid). One option is to read the symbolic link for that gfid from the gfid backend and extract the pargfid from it, but that seems like a messy hack.
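To make the "messy hack" concrete: on a gluster brick, a directory's gfid is stored under `.glusterfs/<g0g1>/<g2g3>/<gfid>` as a symlink whose target contains the parent's gfid and the directory's basename, so the pargfid can be recovered with a readlink. The sketch below illustrates this; the `pargfid_of` helper name and all gfid values are made up for illustration, and this is not the fix that was actually merged.

```shell
#!/bin/sh
# Illustrative sketch: recover a directory's pargfid from the gfid
# symlink in a brick's .glusterfs backend. On real bricks the layout is:
#   <brick>/.glusterfs/<g0g1>/<g2g3>/<gfid> -> ../../<p0p1>/<p2p3>/<pargfid>/<basename>
pargfid_of() {
    brick=$1
    gfid=$2
    # First two and next two hex chars of the gfid index the subdirs.
    d1=$(printf %s "$gfid" | cut -c1-2)
    d2=$(printf %s "$gfid" | cut -c3-4)
    target=$(readlink "$brick/.glusterfs/$d1/$d2/$gfid") || return 1
    # The second-to-last component of the link target is the parent gfid.
    basename "$(dirname "$target")"
}
```

This is exactly the extra backend round trip the comment calls a hack: the pargfid is reachable, but only by reading on-disk state that the opendir code path was never meant to consult.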
A duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1278399 Fixed by: https://code.engineering.redhat.com/gerrit/#/c/61036/2
Tested with build glusterfs-server-3.7.5-6: created a 2x2 volume, mounted it on a client using FUSE, created 200 nested directories, and cd'd to the leaf directory (../dir199/dir200). Then added two new bricks to the volume. While rebalance was in progress, the client was able to run ls and mkdir, but mkdir hung; no I/O error popped up for the hang. Filed bug 1283990 for the hang and marking this bug verified.
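The deep-directory setup used in this verification can be scripted. This is a minimal sketch of that setup only, assuming a helper name (`make_chain`) and a FUSE mount path that are not in the original comment; the add-brick and rebalance steps still have to be run from the gluster CLI.

```shell
#!/bin/sh
# Build a dir1/dir2/.../dirN chain under the given root and print the
# leaf path, matching the dir199/dir200 naming used in the verification.
make_chain() {
    root=$1
    depth=$2
    path=$root
    i=1
    while [ "$i" -le "$depth" ]; do
        path="$path/dir$i"
        i=$((i + 1))
    done
    mkdir -p "$path" && printf '%s\n' "$path"
}

# Example (mount path assumed): cd "$(make_chain /mnt/testvol 200)"
```

Sitting in the leaf directory while bricks are added exercises the same opendir/resolution path on the new graph that the RCA describes.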
Hi Susant,

The doc text is edited. Please sign off on it if it looks OK.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html