Bug 1322518

Summary: SAMBA+TIER : File when created from one windows client over the same volume mount is not accessible from other windows client over same volume mount
Product: Red Hat Gluster Storage
Reporter: Vivek Das <vdas>
Component: samba
Assignee: Michael Adam <madam>
Status: CLOSED DEFERRED
QA Contact: Vivek Das <vdas>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: aspandey, gdeschner, ira, kramdoss, madam, mzywusko, nlevinki, pkarampu, rhinduja, rhs-smb, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-05 10:36:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vivek Das 2016-03-30 15:47:58 UTC
Description of problem:

When a tiered volume is mounted on multiple Windows clients and a Microsoft Office file containing some data is created on one of them, the same file cannot be accessed from another Windows client that has mounted the same tiered volume (the file does not open).



Version-Release number of selected component (if applicable):
samba-4.4.0-1.el7rhgs.x86_64
glusterfs-3.7.9-1.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Mount the tiered volume on multiple Windows clients (cold tier: Distributed-Disperse, 2 x (8 + 4); hot tier: Distributed-Replicate, 4 x 2 = 8).

2. Create Microsoft Office files containing data on one Windows client.

3. Open the Microsoft Office file from Windows Explorer on a different Windows client.

Actual results:
The file doesn't open. 

Expected results:
The file should open successfully and show the data in it.

Additional info:

Comment 1 Vivek Das 2016-03-30 15:59:00 UTC
sosreports uploaded at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1322518/

Comment 3 Ira Cooper 2016-04-06 20:06:14 UTC
Vivek:

I just sat down with Dan in Westford, and we found the following:

1. I don't know which version of Windows you used, or which Office program. That makes reproduction very difficult. Please send that information.

2. The behavior you see may be due to share modes, and 100% normal. Please verify.

3. When we used smbclient to put a file to a tiered volume and then ran "ls" from another client, it showed a size of 0 bytes. When the tier was removed, the file showed the correct size.

Clearly #3 is a bug, and the type of bug that might make Office or any other app behave strangely. I'd like us to get that cleaned up before looking into this in more detail.

Thanks,

-Ira

Comment 5 Ira Cooper 2016-04-07 11:00:16 UTC
To be clear: .docx means Word here?

And how much time is spent between operations?  (Tiering means timing matters, alas.)

Also, can you reproduce the issue without office, just using smbclient?

What is the type of error thrown?

But this is a big improvement, and a big step in the right direction.

Thanks,

Comment 6 rjoseph 2016-04-13 15:07:53 UTC
I debugged the problem a little more and found that the exact problem is rename.

The following set of operations easily recreates the problem:


Client1:
+ test.txt is accessible

Client2:
+ test.txt is accessible

Client1:
+ create new file ("new.txt")
+ rename new.txt to test.txt
+ test.txt is accessible

Client2:
+ test.txt is NOT accessible

If time permits, I will investigate this a little more.

Comment 7 Michael Adam 2016-04-13 15:59:30 UTC
(In reply to rjoseph from comment #6)
> I debugged the problem little more and found that the exact problem is
> rename. 
> 
> The following set of operations can easily recreate the problem
> 
> 
> Client1:
> + test.txt is accessible
> 
> Client2:
> + test.txt is accessible
> 
> Client1:
> + create new file ("new.txt")
> + rename new.txt to test.txt
> + test.txt is accessible
> 
> Client2:
> + test.txt is NOT accessible
> 
> If time permits I will do little more investigation on the same.

Thanks for the analysis!

It looks very plausible to me that the above pattern is a reproducer. It should be possible to reproduce the problem with that pattern in smbclient.

Can you confirm that, Rajesh?

Comment 8 rjoseph 2016-04-13 16:55:29 UTC
The problem is reproducible via smbclient as well.

client1:

smb: \> open bad.txt
open file \bad.txt: for read/write fnum 50799
smb: \> close 50799
smb: \> rm bad.txt 
smb: \> rename sample.txt bad.txt


client2:

smb: \> open bad.txt 
open file \bad.txt: for read/write fnum 64454
smb: \> close 64454
smb: \> open bad.txt 
Failed to open file \gala.txt. NT_STATUS_INVALID_PARAMETER


I opened two smbclient sessions and performed the above operations interactively.

Initially bad.txt is accessible from both clients, but after the rename it could no longer be opened from client2.

Comment 9 rjoseph 2016-04-14 11:51:25 UTC
Update after further analysis of the bug:

Currently the problem is seen if the cold or hot tier is a disperse volume. This prompted me to check a disperse volume directly, without tiering.

The problem is seen with a distributed-disperse volume as well, though the behavior differs slightly between the two volume types. On a distributed-disperse volume, running "ls" on the parent directory of the file (bad.txt) makes the subsequent open succeed. On a tiered volume the open always fails, even after "ls" on the parent directory.

client2:

smb: \> open bad.txt 
open file \bad.txt: for read/write fnum 64454
smb: \> close 64454
smb: \> open bad.txt 
Failed to open file \bad.txt. NT_STATUS_INVALID_PARAMETER
smb: \> ls
  .                                   D        0  Thu Apr 14 16:26:21 2016
  ..                                  D        0  Thu Apr 14 16:26:21 2016
  bad.txt                             R        0  Thu Apr 14 16:26:21 2016
smb: \> open bad.txt 
open file \bad.txt: for read/write fnum 1749



From the initial analysis, it seems that in glfs_resolve_component the first lookup (syncop_lookup) always fails with ESTALE, which causes glfs_resolve_component to generate a new inode and gfid and trigger a new lookup. On a normal distribute-replicate volume the second lookup always passes. On disperse, the second lookup also fails, so glfs_resolve_component fails.
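
The retry on the working (distribute-replicate) path can be illustrated with a toy model. This is only a sketch, not gfapi code: Backend, StaleHandle and resolve are hypothetical names, gfids are plain UUID values, and the disperse failure is not modelled here. It shows a lookup with a cached gfid failing after the name has been renamed over, followed by a fresh lookup by name that succeeds.

```python
import uuid


class StaleHandle(Exception):
    """Stands in for ESTALE in this toy model."""


class Backend:
    """Hypothetical stand-in for the bricks: maps a file name to a gfid."""

    def __init__(self):
        self.names = {}

    def create(self, name):
        self.names[name] = uuid.uuid4()
        return self.names[name]

    def rename(self, old, new):
        # The target name now points at the gfid of the renamed file.
        self.names[new] = self.names.pop(old)

    def lookup(self, name, gfid=None):
        current = self.names[name]
        # A caller that passes an outdated gfid gets a stale-handle error.
        if gfid is not None and gfid != current:
            raise StaleHandle(name)
        return current


def resolve(backend, cache, name):
    """Look up with the cached gfid first; on a stale handle, retry fresh."""
    try:
        gfid = backend.lookup(name, cache.get(name))
    except StaleHandle:
        gfid = backend.lookup(name)  # fresh lookup by name only
    cache[name] = gfid
    return gfid
```

In this model, a client that cached the gfid of bad.txt before "rename sample.txt bad.txt" hits StaleHandle on its first lookup and gets the new gfid from the second one.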

Comment 11 Ashish Pandey 2016-04-26 12:34:59 UTC
These are the findings in EC:

After getting the response from the server, EC updates loc->gfid as well. To do so, it compares loc->gfid and iatt->ia_gfid in "ec_loc_gfid_check". If loc->gfid is null, it just copies iatt->ia_gfid into it and returns success.

If the two gfids differ, it returns failure with the op_errno received from the server, which is the stale-gfid case.

At this point, "glfs_resolve_component" sends a fresh lookup. It creates a new inode and gfid and sends the lookup. However, it does not reset loc->gfid for the fresh lookup.

For the second (fresh) lookup, EC gets a proper response from the backend. But "ec_loc_gfid_check" fails again, because it compares loc->gfid (which is still the old one) with iatt->ia_gfid (received from the server).

There could be two solutions for this:
1 - For a fresh lookup, reset loc->gfid to null.
2 - If [1] is not possible, we have to handle the ESTALE case in EC in a different way.
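
The check and the first proposed solution can be modelled in a few lines. This is a simplified sketch, not the EC source: Loc is a hypothetical, reduced stand-in for the loc structure, loc_gfid_check only mirrors the behaviour described above for ec_loc_gfid_check, and fresh_lookup with its reset_gfid flag is invented for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Loc:
    """Hypothetical, reduced stand-in for loc: only the fields discussed."""
    name: str
    gfid: Optional[uuid.UUID] = None


def loc_gfid_check(loc, server_gfid):
    """Model of the check described for ec_loc_gfid_check.

    A null loc.gfid is filled in from the server reply; a mismatch fails.
    """
    if loc.gfid is None:
        loc.gfid = server_gfid
        return True
    return loc.gfid == server_gfid


def fresh_lookup(loc, server_gfid, reset_gfid):
    """Retry after ESTALE; reset_gfid=True models proposed solution 1."""
    if reset_gfid:
        loc.gfid = None
    return loc_gfid_check(loc, server_gfid)
```

With reset_gfid=False the retry still compares against the old gfid and fails, which matches what is seen on disperse. With reset_gfid=True the null gfid is filled in from the reply and the retry passes.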

Comment 12 Ashish Pandey 2016-04-28 10:21:06 UTC
As we only have the parent gfid and the name in loc for a fresh lookup, having gfid set to an old gfid is incorrect.

Assigning it to gfapi team.

Comment 14 Vivek Das 2016-05-02 12:43:06 UTC
An update on this. With the new build provided (3.7.9-3), I created a Microsoft Word file with some data on Windows client 1, and the file size was shown correctly. Then on Windows client 2 I mounted the same tiered volume, opened the same file (created on client 1), and appended some more data. The updated file size was shown correctly as well. Now when I log back in to Windows client 1, the file size is shown as 0 KB, even after repeatedly refreshing the volume share.
Is this happening because of the lookup issue?

Comment 20 Michael Adam 2018-04-05 10:36:42 UTC
As far as I know, we don't support SMB + tier.