Bug 1581306

Summary: [GSS][SAS library corruption on GlusterFS]
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: nravinas
Component: fuse
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.3
CC: abhishku, amukherj, apaladug, atumball, bkunal, csaba, gsapienz, nbalacha, nchilaka, nravinas, rgowdapp, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: transactional-workload
Fixed In Version: glusterfs-3.12.2-25
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1593078 (view as bug list)
Environment:
Last Closed: 2018-11-28 10:06:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1593078, 1637393
Bug Blocks:
Attachments:
  parsefuse utility (flags: none)

Comment 10 Csaba Henk 2018-05-24 12:01:38 UTC
Created attachment 1441040 [details]
parsefuse utility

This can be used to convert fusedumps to JSON, with

parsefuse-7.22rhel7 -format json <dumpfile> > <dumpfile>.json

or (compressed)

parsefuse-7.22rhel7 -format json <dumpfile> | gzip > <dumpfile>.json.gz
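
The resulting JSON is plain text, so for a quick look at a compressed dump something like this works (assuming standard tools on the box):

gzip -dc <dumpfile>.json.gz | less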

Comment 24 Raghavendra G 2018-06-01 12:56:20 UTC
(In reply to Csaba Henk from comment #23)
> The other possibility is that just at the time when the base and the lck
> files get resolved to the same id, some other client is performing a RENAME
> on them which is not sufficiently synchronized, and the LOOKUPs on this side
> can hit an inconsistent intermediate state; in this case the hardlinked
> situation is ephemeral.

There is one scenario in dht_rename where, for a brief period, lookups on src and dst can both succeed and identify them as hardlinks to each other. For this scenario to happen, the following conditions must hold for a rename (src, dst):

(dst-cached != src-cached) and (dst-hashed == src-hashed) and (src-cached != dst-hashed)

In this scenario, the control flow of dht-rename is:

1. link (src, dst) on src-cached.
2. rename (src, dst) on dst-hashed/src-hashed (Note that dst-hashed == src-hashed).
3. the rest of the rename, which removes the hardlink src on src-cached.

Note that between steps 2 and 3, until the hardlink is removed:
* lookup (src) would fail on src-hashed, resulting in lookup-everywhere. Since the hardlink src exists on src-cached, the lookup will succeed, mapping src to the inode with src-gfid.
* lookup (dst) would identify a linkto file on dst-hashed. The linkto file points to src-cached, following which we find the hardlink dst on src-cached. lookup (dst) succeeds, mapping dst to the inode with src-gfid.

Both src and dst would be identified as hardlinks to the file with src-gfid in the client's inode table. The same result is conveyed back to the application.

If we've hit this scenario, we would see lookup (src) failing on src-hashed and the file eventually being found on src-cached through lookup-everywhere. diagnostics.client-log-level needs to be set to DEBUG to see these logs.
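
For reference, this can be turned on with the regular gluster CLI; <VOLNAME> below is just a placeholder:

# raise the client log level on the volume (run on any node of the pool)
gluster volume set <VOLNAME> diagnostics.client-log-level DEBUG

# once the logs are collected, revert to the default level
gluster volume reset <VOLNAME> diagnostics.client-log-level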

The other workaround (if the hypothesis is true) is to set cluster.lookup-optimize to on. When lookup-optimize is enabled, dht-lookup doesn't resort to lookup-everywhere when src isn't found on src-hashed; instead it just conveys a failure to the application. Since the lookup won't reach src-cached, it won't find the hardlink.
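
If we want to try that workaround, it would be along these lines (again, <VOLNAME> is a placeholder):

# enable lookup-optimize so a miss on the hashed subvol fails immediately
gluster volume set <VOLNAME> cluster.lookup-optimize on

With this set, the failed lookup is returned to the application directly, so the transient hardlink on src-cached is never observed.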

Comment 28 Raghavendra G 2018-06-20 02:26:11 UTC
(In reply to Raghavendra G from comment #24)
> (In reply to Csaba Henk from comment #23)
> 
> The other workaround (if the hypothesis is true) is to set
> cluster.lookup-optimize to on. When lookup-optimize is enabled, dht-lookup
> doesn't resort to lookup-everywhere when src isn't found on src-hashed;
> instead it just conveys a failure to the application. Since the lookup won't
> reach src-cached, it won't find the hardlink.

Since patch [1], we've made hashed-subvol the subvolume on which locks are taken during rename. This means the presence of entrylks on hashed-subvol on the basename of the src/dst of a rename indicates a rename in progress. On observing such locks, dht_lookup can:

* acquire entrylk on parent inode with basename
* Do the lookup
* unlock entrylk
* unwind the lookup

This will help lookup synchronize with rename and hence preserve the atomicity of rename. Note that this algorithm works even when the first lookup fails with ENOENT. Also, the cost of synchronization is isolated to lookups happening during a rename; lookups happening outside the rename window won't suffer the cost of synchronization.

This will be a code change and I'll be submitting a patch to do this.
[1] https://review.gluster.org/19547/

Comment 29 Nithya Balachandran 2018-06-20 03:44:32 UTC
(In reply to Raghavendra G from comment #28)
> (In reply to Raghavendra G from comment #24)
> > (In reply to Csaba Henk from comment #23)
> > 
> > The other workaround (if the hypothesis is true) is to set
> > cluster.lookup-optimize to on. When lookup-optimize is enabled, dht-lookup
> > doesn't resort to lookup-everywhere when src isn't found on src-hashed;
> > instead it just conveys a failure to the application. Since the lookup
> > won't reach src-cached, it won't find the hardlink.
> 
> Since patch [1], we've made hashed-subvol the subvolume on which locks are
> taken during rename. This means the presence of entrylks on hashed-subvol on
> the basename of the src/dst of a rename indicates a rename in progress. On
> observing such locks,

How will dht_lookup check for locks? Does it need to take an entry lock in order to find out whether there is a lock already taken?

> dht_lookup can:
> 
> * acquire entrylk on parent inode with basename
> * Do the lookup
> * unlock entrylk
> * unwind the lookup
> 
> This will help lookup synchronize with rename and hence preserve the
> atomicity of rename. Note that this algorithm works even when the first
> lookup fails with ENOENT. Also, the cost of synchronization is isolated to
> lookups happening during a rename; lookups happening outside the rename
> window won't suffer the cost of synchronization.
> 
> This will be a code change and I'll be submitting a patch to do this.
> [1] https://review.gluster.org/19547/

Comment 31 Raghavendra G 2018-06-25 05:24:13 UTC
(In reply to Nithya Balachandran from comment #29)
> (In reply to Raghavendra G from comment #28)
> > (In reply to Raghavendra G from comment #24)
> > > (In reply to Csaba Henk from comment #23)
> > > 
> > > The other workaround (if the hypothesis is true) is to set
> > > cluster.lookup-optimize to on. When lookup-optimize is enabled,
> > > dht-lookup doesn't resort to lookup-everywhere when src isn't found on
> > > src-hashed; instead it just conveys a failure to the application. Since
> > > the lookup won't reach src-cached, it won't find the hardlink.
> > 
> > Since patch [1], we've made hashed-subvol the subvolume on which locks are
> > taken during rename. This means the presence of entrylks on hashed-subvol
> > on the basename of the src/dst of a rename indicates a rename in progress.
> > On observing such locks,
> 
> How will dht_lookup check for locks? Does it need to take an entry lock in
> order to find out whether there is a lock already taken?

See the following thread on gluster-devel for more details:
https://www.spinics.net/lists/gluster-devel/msg25006.html