Red Hat Bugzilla – Bug 763375
Initial requests after mount ESTALE if DHT subvolumes connect after nfs startup
Last modified: 2015-12-01 11:45:32 EST
Complete log same as 1641 at dev:/share/tickets/1641
Adding dependency and adding Amar, Vijay and Avati because this needs a discussion of the policy surrounding whether dht should notify gnfs even if one node goes down.
o Setting to Blocker because of customer issues.
o Seen on nfs-beta-rc11.
If NFS has detected that the volume start-up in nfs_startup_subvolume and nfs_start_subvol_lookup_cbk has failed, it needs to wait for the subvolume to start up correctly before allowing NFS requests on it. Allowing NFS requests anyway results in the situation shown in the log below.
The OPENDIR errors are seen because the lookup on root had failed in nfs_start_subvol_lookup_cbk, but NFS continued acting as if everything was OK. Of course, this bug depends on 1641 because only a fix to that will allow NFS to detect volume start-up failure.
[2010-09-18 04:37:14] T [rpcsvc.c:1176:rpcsvc_record_read_partial_frag] rpc-service: Fragment remaining: 0
[2010-09-18 04:37:14] T [rpcsvc.c:2293:rpcsvc_handle_vectored_frag] rpc-service: Vectored frag complete
[2010-09-18 04:37:14] T [rpcsvc.c:2235:rpcsvc_update_vectored_state] rpc-service: Vectored RPC vector read
[2010-09-18 04:37:14] D [rpcsvc.c:1266:rpcsvc_program_actor] rpc-service: Actor found: NFS3 - WRITE
[2010-09-18 04:37:14] D [nfs3-helpers.c:2345:nfs3_log_rw_call] nfs-nfsv3: XID: f4f0825, WRITE: args: FH: hashcount 3, xlid 0, gen 5517976529369300997, ino 2843738595, offset: 42205184, count: 65536, UNSTABLE
[2010-09-18 04:37:14] T [nfs3.c:1836:nfs3_write] nfs-nfsv3: FH to Volume: distribute
[2010-09-18 04:37:14] T [nfs3-helpers.c:2970:nfs3_fh_resolve_inode] nfs-nfsv3: FH needs inode resolution
[2010-09-18 04:37:14] T [nfs3-helpers.c:2903:nfs3_fh_resolve_inode_hard] nfs-nfsv3: FH hard resolution: ino: 2843738595, gen: 5517976529369300997, hashidx: 1
[2010-09-18 04:37:14] T [nfs3-helpers.c:2908:nfs3_fh_resolve_inode_hard] nfs-nfsv3: Dir will be opened: /
[2010-09-18 04:37:14] T [nfs-fops.c:417:nfs_fop_opendir] nfs: Opendir: /
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-1: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-1 returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-2: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-2 returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-3: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-3 returned -1 (Invalid argument)
The first part of the fix involves not exporting the volume from NFS until all volumes have returned a Child Up. This is fixed and works in:
Second part requires that the subvolume become inaccessible from NFS when NFS gets a Child Down so that an access on a downed subvolume does not result in ESTALE for NFS.
The log that helped figure out the second part is at (size warning):
(In reply to comment #3)
> Second part requires that the subvolume become inaccessible from NFS when NFS
> gets a Child Down so that an access on a downed subvolume does not result in
> ESTALE for NFS.
The problem in handling this scenario is that DHT returns an unequal number of CHILD_UPs and CHILD_DOWNs.
This introduces a problem for gnfs because it cannot keep track of whether all distribute children have come back after a disconnect. gnfs is receiving multiple CHILD_UPs for the same child that went down earlier and perhaps even multiple CHILD_DOWNs for the same distribute child.
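To illustrate why unequal event counts break tracking, here is a minimal Python sketch. It is not GlusterFS code; the function names, child names, and event stream are hypothetical. A counter that adds one per CHILD_UP and subtracts one per CHILD_DOWN drifts as soon as an event is repeated, while a set of children known to be up is unaffected by repeats. The set-based version assumes the receiver can tell which child an event came from.

```python
def naive_up_count(events):
    """Add 1 per CHILD_UP and subtract 1 per CHILD_DOWN (drifts on duplicates)."""
    count = 0
    for event, _child in events:
        count += 1 if event == "CHILD_UP" else -1
    return count


def children_up(events):
    """Track the set of children currently up; repeated events are idempotent."""
    up = set()
    for event, child in events:
        if event == "CHILD_UP":
            up.add(child)
        else:
            up.discard(child)
    return up


# Hypothetical stream: child-2 goes down once but is reported up twice afterwards.
events = [
    ("CHILD_UP", "child-1"),
    ("CHILD_UP", "child-2"),
    ("CHILD_UP", "child-3"),
    ("CHILD_DOWN", "child-2"),
    ("CHILD_UP", "child-2"),
    ("CHILD_UP", "child-2"),  # duplicate CHILD_UP
]

print(naive_up_count(events))        # 4, although only 3 children exist
print(sorted(children_up(events)))   # ['child-1', 'child-2', 'child-3']
```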
Email conversation attached:
Yes, I noticed that this is causing some problems in the way the layout is written for the top-level directory.
I was thinking of holding the CHILD_UP until all the subvolumes come up or for a timeout period (maybe 10 sec). That should solve the current issues.
On Mon, Sep 20, 2010 at 3:57 PM, Shehjar Tikoo <email@example.com> wrote:
What disruptions can be expected if I change dht_notify behaviour such that it propagates only unique CHILD_UPs and CHILD_DOWNs to parent translators?
Here is what we agreed to:
o have distribute return CHILD-UP only when all subvolumes are up.
o have distribute return CHILD-DOWN even if one child goes down.
o FUSE can continue working as normal.
o NFS can disallow access to this subvolume.
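The agreed policy for distribute can be sketched as a small state machine. This is an illustrative Python model only, not the dht_notify implementation; the class and method names are hypothetical. It assumes events arrive tagged with the child they came from, and it forwards an event to the parent only when the aggregate state changes, so the parent sees strictly alternating CHILD_UP and CHILD_DOWN.

```python
class DistributeNotifier:
    """Hypothetical model: report CHILD_UP only when all children are up,
    and CHILD_DOWN as soon as any one child goes down."""

    def __init__(self, children):
        self.state = {child: False for child in children}  # child -> is up?
        self.reported_up = False  # last aggregate state sent to the parent

    def on_child_event(self, child, event):
        """Record the event and return what to propagate upward, or None."""
        if child not in self.state:
            raise ValueError(f"unknown child: {child}")
        self.state[child] = (event == "CHILD_UP")
        all_up = all(self.state.values())
        if all_up and not self.reported_up:
            self.reported_up = True
            return "CHILD_UP"
        if not all_up and self.reported_up:
            self.reported_up = False
            return "CHILD_DOWN"
        return None  # duplicate or no aggregate change: nothing to propagate


notifier = DistributeNotifier(["child-1", "child-2"])
for child, event in [
    ("child-1", "CHILD_UP"),
    ("child-2", "CHILD_UP"),
    ("child-2", "CHILD_UP"),    # duplicate, suppressed
    ("child-1", "CHILD_DOWN"),
]:
    print(child, event, "->", notifier.on_child_event(child, event))
```

One design choice in this sketch is that a CHILD_DOWN received before any aggregate CHILD_UP was reported is not forwarded, since the parent never considered the volume up.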
The next problem for nfs is, how to disallow access:
o It can stop accepting requests from a client connection, but NFS clients such as the Linux client use the same connection between a client and server to access multiple subvols. So blocking a connection will block access to other subvolumes also.
o It can stop accepting requests for a particular subvol only. The problem here is that the decision to ignore a request has to be made after it has been accepted by the server. After accepting, we can either a) ignore the request or b) queue it to be served once the volume is back.
With a), there is a possibility that clients with broken retransmission logic will error out on not receiving a reply.
With b), queueing the request means we need to be careful that we do not queue multiple retransmissions from a client with the same XID; otherwise a queue with a second CREATE request will clobber the writes after the first CREATE request. What we need then is a duplicate request cache to avoid queueing a request with a duplicate RPC XID.
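The duplicate check that option b) needs can be sketched as follows. This is a hedged Python illustration, not gnfs code; the class name, the client identifier, and the request strings are hypothetical. It assumes a request is identified by the pair (client, XID), since XIDs are only meaningful per client; a real duplicate request cache would likely compare more fields.

```python
from collections import OrderedDict


class PendingRequestQueue:
    """Hypothetical queue for requests held while a volume is down.
    Retransmissions with an already-queued (client, XID) are rejected."""

    def __init__(self):
        self._pending = OrderedDict()  # (client, xid) -> request, in arrival order

    def enqueue(self, client, xid, request):
        """Queue the request; return False if it is a retransmission."""
        key = (client, xid)
        if key in self._pending:
            return False  # duplicate XID from the same client: do not queue again
        self._pending[key] = request
        return True

    def drain(self):
        """Return queued requests in arrival order and empty the queue."""
        requests = list(self._pending.values())
        self._pending.clear()
        return requests


queue = PendingRequestQueue()
queue.enqueue("client-a", 0x100, "CREATE /f")
queue.enqueue("client-a", 0x101, "WRITE /f")
queue.enqueue("client-a", 0x100, "CREATE /f")  # retransmission, ignored
print(queue.drain())  # ['CREATE /f', 'WRITE /f']
```

Without the duplicate check, the retransmitted CREATE would be replayed after the WRITE, which is the clobbering scenario described above.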
Given the time constraints for 3.1 beta, we are going with approach (a) and hoping that Linux clients will be the predominant use case. A bug is being filed for a duplicate request cache in case we do run into the situation described above.
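Approach (a) amounts to a per-volume gate consulted after a request has been accepted. The following Python sketch is only a model of that idea under stated assumptions; VolumeGate, on_event, and handle are hypothetical names, not the gnfs API. Requests for a volume that is not up are dropped without a reply, leaving the client to retransmit.

```python
class VolumeGate:
    """Hypothetical per-volume access gate driven by CHILD_UP/CHILD_DOWN."""

    def __init__(self):
        self._up = set()  # names of volumes currently accessible

    def on_event(self, volume, event):
        """Update volume accessibility from an event."""
        if event == "CHILD_UP":
            self._up.add(volume)
        elif event == "CHILD_DOWN":
            self._up.discard(volume)

    def handle(self, volume, request, serve):
        """Serve the request if the volume is up; otherwise drop it
        silently (return None, meaning no reply is sent)."""
        if volume not in self._up:
            return None
        return serve(request)


gate = VolumeGate()
serve = lambda request: "reply to " + request

print(gate.handle("distribute", "WRITE", serve))  # None: volume not yet up
gate.on_event("distribute", "CHILD_UP")
print(gate.handle("distribute", "WRITE", serve))  # reply to WRITE
gate.on_event("distribute", "CHILD_DOWN")
print(gate.handle("distribute", "WRITE", serve))  # None: dropped again
```

Because the gate is keyed by volume rather than by connection, other volumes reached over the same client connection are unaffected.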
PATCH: http://patches.gluster.com/patch/4918 in master (core: Introduce string representation of GF_EVENTS)
PATCH: http://patches.gluster.com/patch/4920 in master (distribute: Propagate CHILD-UP when all subvols are up)
PATCH: http://patches.gluster.com/patch/4921 in master (nfs, nfs3: Base volume access on CHILD-UP-DOWN event)
PATCH: http://patches.gluster.com/patch/4982 in master (nfs: Fix multiple subvolume CHILD-UP support)
PATCH: http://patches.gluster.com/patch/5435 in master (Revert "distribute: Propagate CHILD-UP when all subvols are up")
PATCH: http://patches.gluster.com/patch/5436 in master (dht: change behaviour CHILD_UP/DOWN/CONNECTING event propagation)