Bug 763375 (GLUSTER-1643)

Summary: Initial requests after mount ESTALE if DHT subvolumes connect after nfs startup
Product: [Community] GlusterFS
Reporter: Shehjar Tikoo <shehjart>
Component: nfs
Assignee: Shehjar Tikoo <shehjart>
Severity: high
Priority: low
Version: 3.1-alpha
CC: aavati, amarts, gluster-bugs, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: nfs
Bug Depends On: 763373    
Bug Blocks:    

Description Shehjar Tikoo 2010-09-18 04:44:54 EDT
The complete log is the same as for 1641, at dev:/share/tickets/1641
Comment 1 Shehjar Tikoo 2010-09-18 04:56:45 EDT
Adding dependency and adding Amar, Vijay and Avati because this needs a discussion of the policy surrounding whether dht should notify gnfs even if one node goes down.
Comment 2 Shehjar Tikoo 2010-09-18 07:43:21 EDT
o Setting to Blocker because of customer issues.
o Seen on nfs-beta-rc11.

If NFS has detected that the volume start-up in nfs_startup_subvolume and nfs_start_subvol_lookup_cbk has failed, it needs to wait for the subvolume to start-up correctly before allowing NFS requests on it. Allowing NFS requests results in the situation below.

The OPENDIR errors are seen because the lookup, too, had failed in nfs_start_subvol_lookup_cbk, but NFS continued acting as if everything were OK. Of course, this bug depends on 1641 because only a fix to that will allow NFS to detect volume start-up failure.

[2010-09-18 04:37:14] T [rpcsvc.c:1176:rpcsvc_record_read_partial_frag] rpc-service: Fragment remaining: 0
[2010-09-18 04:37:14] T [rpcsvc.c:2293:rpcsvc_handle_vectored_frag] rpc-service: Vectored frag complete
[2010-09-18 04:37:14] T [rpcsvc.c:2235:rpcsvc_update_vectored_state] rpc-service: Vectored RPC vector read
[2010-09-18 04:37:14] D [rpcsvc.c:1266:rpcsvc_program_actor] rpc-service: Actor found: NFS3 - WRITE
[2010-09-18 04:37:14] D [nfs3-helpers.c:2345:nfs3_log_rw_call] nfs-nfsv3: XID: f4f0825, WRITE: args: FH: hashcount 3, xlid 0, gen 5517976529369300997, ino 2843738595, offset: 42205184,  count: 65536, UNSTABLE
[2010-09-18 04:37:14] T [nfs3.c:1836:nfs3_write] nfs-nfsv3: FH to Volume: distribute
[2010-09-18 04:37:14] T [nfs3-helpers.c:2970:nfs3_fh_resolve_inode] nfs-nfsv3: FH needs inode resolution
[2010-09-18 04:37:14] T [nfs3-helpers.c:2903:nfs3_fh_resolve_inode_hard] nfs-nfsv3: FH hard resolution: ino: 2843738595, gen: 5517976529369300997, hashidx: 1
[2010-09-18 04:37:14] T [nfs3-helpers.c:2908:nfs3_fh_resolve_inode_hard] nfs-nfsv3: Dir will be opened: /
[2010-09-18 04:37:14] T [nfs-fops.c:417:nfs_fop_opendir] nfs: Opendir: /
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume returned -1 (Invalid argument)
Comment 3 Shehjar Tikoo 2010-09-20 00:06:58 EDT
The first part of the fix involves not exporting the volume from NFS until all volumes have returned a Child Up. This is fixed and works in:


The second part requires that the subvolume become inaccessible from NFS when NFS gets a Child Down, so that an access on a downed subvolume does not result in ESTALE for NFS.

The log that helped figure the second part out is at (size warning):
Comment 4 Shehjar Tikoo 2010-09-20 02:08:52 EDT
(In reply to comment #3)

> Second part requires that the subvolume become inaccessible from NFS when NFS
> gets a Child Down so that an access on a downed subvolume does not result in

The problem in handling this scenario is that DHT returns an unequal number of CHILD_UPs and CHILD_DOWNs.

This introduces a problem for gnfs because it cannot keep track of whether all distribute children have come back after a disconnect. gnfs is receiving multiple CHILD_UPs for the same child that went down earlier and perhaps even multiple CHILD_DOWNs for the same distribute child.
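
To illustrate why unequal event counts break the bookkeeping, here is a small model in Python. It is only a sketch, not gnfs code: the function name and the event stream are made up for illustration and are not taken from the logs.

```python
def naive_children_up(events, initially_up=0):
    """Naive bookkeeping: +1 for every CHILD_UP, -1 for every CHILD_DOWN."""
    up = initially_up
    for event in events:
        up += 1 if event == "CHILD_UP" else -1
    return up


# Hypothetical stream for a distribute volume with two children: both come
# up, one goes down, and its return is reported twice.
events = ["CHILD_UP", "CHILD_UP", "CHILD_DOWN", "CHILD_UP", "CHILD_UP"]
print(naive_children_up(events))  # prints 3, although only 2 children exist
```

With duplicates in the stream, the counter no longer says whether every child is back, which is the situation gnfs is in.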
Comment 5 Shehjar Tikoo 2010-09-20 22:22:45 EDT
Email conversation attached:

Yes, I noticed that this is causing some problems in the way layout is written for top level directory.

I was thinking of holding the CHILD_UP till all the subvolumes come up or for a timeout period (maybe 10 sec). That should solve the current issues.


On Mon, Sep 20, 2010 at 3:57 PM, Shehjar Tikoo <shehjart@gluster.com> wrote:

    Hi guys

    What disruptions can be expected if I change dht_notify behaviour such that it propagates only unique CHILD_UPs and CHILD_DOWNs to parent translators?
Comment 6 Shehjar Tikoo 2010-09-20 23:18:39 EDT
Here is what we agreed to:

o have distribute return CHILD-UP only when all subvolumes are up.
o have distribute return CHILD-DOWN even if one child goes down.

In response,
o FUSE can continue working as normal.
o NFS can disallow access to this subvolume.
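
The agreed rule can be modelled roughly as follows. This is a Python sketch under assumptions, not the dht_notify implementation: the class and attribute names are hypothetical, and it assumes that the parent is told only about transitions (so a second child going down produces no further CHILD_DOWN, and nothing is sent before the first CHILD_UP).

```python
class DistributeNotifier:
    """Model of distribute collapsing per-subvolume events for its parent."""

    def __init__(self, subvols):
        # Every subvolume is considered down until its CHILD_UP arrives.
        self.up = {name: False for name in subvols}
        # What the parent translator currently believes about distribute.
        self.parent_sees_up = False

    def notify(self, subvol, event):
        """Record the event; return the event to propagate upward, or None."""
        self.up[subvol] = (event == "CHILD_UP")
        now_up = all(self.up.values())
        if now_up == self.parent_sees_up:
            return None  # no change visible to the parent
        self.parent_sees_up = now_up
        return "CHILD_UP" if now_up else "CHILD_DOWN"
```

For two subvolumes, the first CHILD_UP yields nothing, the second yields CHILD_UP, and the first CHILD_DOWN after that yields CHILD_DOWN. Duplicate events from the same subvolume change nothing, so the parent sees matched UP/DOWN pairs.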

The next problem for nfs is, how to disallow access:

o It can stop accepting requests from a client connection, but NFS clients such as the Linux one use the same connection between a client and server to access multiple subvols. So blocking a connection will block access to other subvolumes also.

o It can stop accepting requests for a particular subvol only. The problem here is that the decision to ignore a request has to be made after it has been accepted by the server. After accepting, we can either a) ignore the request or b) queue it for serving it once the volume is back.

With a), there is a possibility that clients with broken retransmission logic will error out on not receiving a reply.

With b), queueing the request means we need to be careful not to queue multiple retransmissions from a client with the same XID; otherwise a queue with a second CREATE request will clobber the writes after the first CREATE request. What we need then is a duplicate request cache to avoid queueing a request with a duplicate RPC XID.
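
A rough model of what b) would need, in Python. This is a sketch only: the class and method names are hypothetical, and it assumes requests can be keyed by (client, XID), since XIDs are chosen per client.

```python
from collections import OrderedDict


class PendingQueue:
    """Queue of requests held back while a subvolume is down."""

    def __init__(self):
        # Insertion-ordered map from (client, xid) to the queued request.
        self._pending = OrderedDict()

    def enqueue(self, client, xid, request):
        """Queue the request unless the same (client, xid) is already queued."""
        key = (client, xid)
        if key in self._pending:
            return False  # retransmission of a request we already hold
        self._pending[key] = request
        return True

    def drain(self):
        """Return the queued requests in arrival order and empty the queue."""
        requests = list(self._pending.values())
        self._pending.clear()
        return requests
```

A retransmitted CREATE with the same XID is rejected by enqueue, so only the first copy is replayed when the volume comes back and later WRITEs are not clobbered.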
Comment 7 Shehjar Tikoo 2010-09-21 00:36:43 EDT
Given the time constraints for 3.1 beta, going with approach (a) and hoping that Linux clients will be the predominant use case. Filing a bug for duplicate request cache in case we do run into the situation described above.
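
Approach (a) can be pictured as a per-subvolume gate. The following Python model is a sketch only and not the nfs3 code: the names are hypothetical, and "ignore" is modelled as returning None, meaning no reply is sent and the client is expected to retransmit.

```python
class ExportGate:
    """Serve requests only for subvolumes that are currently up."""

    def __init__(self, subvols):
        # A subvolume is not served until its CHILD_UP has been seen.
        self.enabled = {name: False for name in subvols}

    def on_event(self, subvol, event):
        """Enable the subvolume on CHILD_UP, disable it on anything else."""
        self.enabled[subvol] = (event == "CHILD_UP")

    def handle(self, subvol, request, serve):
        """Serve the request, or drop it silently if the subvolume is down."""
        if not self.enabled[subvol]:
            return None  # no reply is generated
        return serve(request)
```

Because the decision is made per subvolume and not per connection, requests for other subvolumes on the same client connection are still served.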
Comment 8 Vijay Bellur 2010-09-22 04:14:16 EDT
PATCH: http://patches.gluster.com/patch/4918 in master (core: Introduce string representation of GF_EVENTS)
Comment 9 Vijay Bellur 2010-09-22 04:14:26 EDT
PATCH: http://patches.gluster.com/patch/4920 in master (distribute: Propagate CHILD-UP when all subvols are up)
Comment 10 Vijay Bellur 2010-09-22 04:14:31 EDT
PATCH: http://patches.gluster.com/patch/4921 in master (nfs, nfs3: Base volume access on CHILD-UP-DOWN event)
Comment 11 Vijay Bellur 2010-09-25 04:05:15 EDT
PATCH: http://patches.gluster.com/patch/4982 in master (nfs: Fix multiple subvolume CHILD-UP support)
Comment 12 Vijay Bellur 2010-10-11 07:31:40 EDT
PATCH: http://patches.gluster.com/patch/5435 in master (Revert "distribute: Propagate CHILD-UP when all subvols are up")
Comment 13 Vijay Bellur 2010-10-11 07:31:46 EDT
PATCH: http://patches.gluster.com/patch/5436 in master (dht: change behaviour CHILD_UP/DOWN/CONNECTING event propagation)