Bug 763375 (GLUSTER-1643)

Summary:	Initial requests after mount ESTALE if DHT subvolumes connect after nfs startup
Product:	[Community] GlusterFS	Reporter:	Shehjar Tikoo <shehjart>
Component:	nfs	Assignee:	Shehjar Tikoo <shehjart>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	low
Version:	3.1-alpha	CC:	aavati, amarts, gluster-bugs, vbellur
Hardware:	All
OS:	Linux
Doc Type:	Bug Fix
Regression:	RTP	Mount Type:	nfs
Bug Depends On:	763373

Description Shehjar Tikoo 2010-09-18 08:44:54 UTC

The complete log is the same as for 1641, at dev:/share/tickets/1641

Comment 1 Shehjar Tikoo 2010-09-18 08:56:45 UTC

Adding dependency and adding Amar, Vijay and Avati because this needs a discussion of the policy surrounding whether dht should notify gnfs even if one node goes down.

Comment 2 Shehjar Tikoo 2010-09-18 11:43:21 UTC

o Setting to Blocker because of customer issues.
o Seen on nfs-beta-rc11.

If NFS has detected, in nfs_startup_subvolume and nfs_start_subvol_lookup_cbk, that the volume start-up has failed, it needs to wait for the subvolume to start up correctly before allowing NFS requests on it. Allowing NFS requests before that results in the situation below.

The OPENDIR errors are seen because the lookup on root had failed in nfs_start_subvol_lookup_cbk, but NFS continued acting as if everything were OK. Of course, this bug depends on 1641, because only a fix to that will allow NFS to detect volume start-up failure.

[2010-09-18 04:37:14] T [rpcsvc.c:1176:rpcsvc_record_read_partial_frag] rpc-service: Fragment remaining: 0
[2010-09-18 04:37:14] T [rpcsvc.c:2293:rpcsvc_handle_vectored_frag] rpc-service: Vectored frag complete
[2010-09-18 04:37:14] T [rpcsvc.c:2235:rpcsvc_update_vectored_state] rpc-service: Vectored RPC vector read
[2010-09-18 04:37:14] D [rpcsvc.c:1266:rpcsvc_program_actor] rpc-service: Actor found: NFS3 - WRITE
[2010-09-18 04:37:14] D [nfs3-helpers.c:2345:nfs3_log_rw_call] nfs-nfsv3: XID: f4f0825, WRITE: args: FH: hashcount 3, xlid 0, gen 5517976529369300997, ino 2843738595, offset: 42205184,  count: 65536, UNSTABLE
[2010-09-18 04:37:14] T [nfs3.c:1836:nfs3_write] nfs-nfsv3: FH to Volume: distribute
[2010-09-18 04:37:14] T [nfs3-helpers.c:2970:nfs3_fh_resolve_inode] nfs-nfsv3: FH needs inode resolution
[2010-09-18 04:37:14] T [nfs3-helpers.c:2903:nfs3_fh_resolve_inode_hard] nfs-nfsv3: FH hard resolution: ino: 2843738595, gen: 5517976529369300997, hashidx: 1
[2010-09-18 04:37:14] T [nfs3-helpers.c:2908:nfs3_fh_resolve_inode_hard] nfs-nfsv3: Dir will be opened: /
[2010-09-18 04:37:14] T [nfs-fops.c:417:nfs_fop_opendir] nfs: Opendir: /
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-1: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-1 returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-2: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-2 returned -1 (Invalid argument)
[2010-09-18 04:37:14] D [client-protocol.c:2371:client_opendir] 10.1.100.203-3: OPENDIR 1 (/): failed to get remote inode number
[2010-09-18 04:37:14] D [dht-common.c:1667:dht_fd_cbk] distribute: subvolume 10.1.100.203-3 returned -1 (Invalid argument)
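The gating described above can be sketched roughly as follows. This is a minimal Python illustration and not the gnfs code: `SubvolGate`, `on_root_lookup` and `admit` are hypothetical names, and the real logic would live in C around nfs_startup_subvolume and nfs_start_subvol_lookup_cbk.

```python
class SubvolGate:
    """Tracks which subvolumes have completed start-up (hypothetical helper)."""

    def __init__(self):
        self._started = set()

    def on_root_lookup(self, subvol, ok):
        # Called when the start-up lookup on the subvolume root completes.
        if ok:
            self._started.add(subvol)
        else:
            self._started.discard(subvol)

    def admit(self, subvol):
        # A request is admitted only after a successful root lookup.
        return subvol in self._started


gate = SubvolGate()
gate.on_root_lookup("distribute", ok=False)
before = gate.admit("distribute")   # start-up failed, request refused
gate.on_root_lookup("distribute", ok=True)
after = gate.admit("distribute")    # start-up succeeded, request allowed
```

With a check like `admit` in front of the request path, the WRITE in the log above would be refused up front instead of reaching OPENDIR on subvolumes that never started.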

Comment 3 Shehjar Tikoo 2010-09-20 04:06:58 UTC

The first part of the fix involves not exporting the volume from NFS until all volumes have returned a Child Up. This is fixed and works in:

http://dev.gluster.com/~shehjart/nfs-export-on-root-lookup-success.mbox

The second part requires that the subvolume become inaccessible from NFS when NFS gets a Child Down, so that an access on a downed subvolume does not result in ESTALE for NFS.


The log that helped figure out the second part is at (size warning):
dev:/share/tickets/1641/nfs-trace.tar.gz

Comment 4 Shehjar Tikoo 2010-09-20 06:08:52 UTC

(In reply to comment #3)

> Second part requires that the subvolume become inaccessible from NFS when NFS
> gets a Child Down so that an access on a downed subvolume does not result in
> ESTALE for NFS.
> 

The problem in handling this scenario is that DHT sends an unequal number of CHILD_UPs and CHILD_DOWNs.

This introduces a problem for gnfs because it cannot keep track of whether all distribute children have come back after a disconnect. gnfs is receiving multiple CHILD_UPs for the same child that went down earlier and perhaps even multiple CHILD_DOWNs for the same distribute child.
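One way to make such events countable is to record the state of each child and forward only real transitions. The sketch below is a hedged Python illustration; `ChildTracker` and its methods are hypothetical, and it assumes the code runs somewhere the identity of the child is known (for example inside distribute's notify path), which is not necessarily true for gnfs itself.

```python
class ChildTracker:
    """Records per-child up/down state and filters repeated events (hypothetical)."""

    def __init__(self, children):
        # Every child is assumed down until its first CHILD_UP.
        self._up = {child: False for child in children}

    def notify(self, child, event):
        # Returns True only when the event changes the recorded state.
        is_up = event == "CHILD_UP"
        if self._up[child] == is_up:
            return False  # duplicate CHILD_UP or CHILD_DOWN
        self._up[child] = is_up
        return True

    def all_up(self):
        return all(self._up.values())


tracker = ChildTracker(["10.1.100.203-1", "10.1.100.203-2"])
events = [
    ("10.1.100.203-1", "CHILD_UP"),
    ("10.1.100.203-1", "CHILD_UP"),    # repeated
    ("10.1.100.203-2", "CHILD_UP"),
    ("10.1.100.203-1", "CHILD_DOWN"),
    ("10.1.100.203-1", "CHILD_DOWN"),  # repeated
]
changes = [tracker.notify(child, event) for child, event in events]
```

Only the entries that return True would be passed on, so the number of ups and downs seen by the parent stays balanced per child.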

Comment 5 Shehjar Tikoo 2010-09-21 02:22:45 UTC

Email conversation attached:

Yes, I noticed that this is causing some problems in the way the layout is written for the top-level directory.

I was thinking of holding the CHILD_UP until all the subvolumes come up, or for a timeout period (maybe 10 sec). That should solve the current issues.

-Amar

On Mon, Sep 20, 2010 at 3:57 PM, Shehjar Tikoo <shehjart> wrote:

    Hi guys

    What disruptions can be expected if I change dht_notify behaviour such that it propagates only unique CHILD_UPs and CHILD_DOWNs to parent translators?

Comment 6 Shehjar Tikoo 2010-09-21 03:18:39 UTC

Here is what we agreed to:

o have distribute return CHILD-UP only when all subvolumes are up.
o have distribute return CHILD-DOWN even if one child goes down.

In response,
o FUSE can continue working as normal.
o NFS can disallow access to this subvolume.
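The agreed policy can be sketched as a small state machine. This is an illustrative Python sketch, not the distribute translator: `DistributeNotifier` and `on_child_event` are hypothetical names, and suppressing repeated CHILD_DOWNs once the parent has already been told is an extra assumption of this sketch.

```python
class DistributeNotifier:
    """Decides which event, if any, to propagate to the parent (hypothetical)."""

    def __init__(self, subvols):
        self._up = {s: False for s in subvols}
        self._reported_up = False  # what the parent currently believes

    def on_child_event(self, subvol, event):
        # Returns the event to propagate upwards, or None for nothing.
        self._up[subvol] = event == "CHILD_UP"
        if all(self._up.values()):
            if not self._reported_up:
                self._reported_up = True
                return "CHILD_UP"      # every subvolume is now up
            return None
        if self._reported_up:
            self._reported_up = False
            return "CHILD_DOWN"        # first subvolume to go down
        return None                    # parent already knows we are down


notifier = DistributeNotifier(["a", "b", "c"])
sequence = [
    ("a", "CHILD_UP"), ("b", "CHILD_UP"), ("c", "CHILD_UP"),
    ("b", "CHILD_DOWN"), ("c", "CHILD_DOWN"),
    ("b", "CHILD_UP"), ("c", "CHILD_UP"),
]
propagated = [notifier.on_child_event(s, e) for s, e in sequence]
```

Under these assumptions the parent sees exactly one CHILD_UP per full recovery and one CHILD_DOWN per outage, which is what NFS needs in order to allow or disallow access to the subvolume.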

The next problem for nfs is, how to disallow access:

o It can stop accepting requests from a client connection, but NFS clients such as the Linux one use the same connection between a client and server to access multiple subvols. So blocking a connection will block access to the other subvolumes as well.

o It can stop accepting requests for a particular subvol only. The problem here is that the decision to ignore a request has to be made after it has been accepted by the server. After accepting, we can either a) ignore the request or b) queue it to be served once the volume is back.

With a), there is a possibility that clients with broken retransmission logic will error out on not receiving a reply.

With b), queueing the request means we need to be careful not to queue multiple retransmissions from a client with the same XID; otherwise a queued second CREATE request will clobber the writes that followed the first CREATE request. What we need then is a duplicate request cache to avoid queueing a request with a duplicate RPC XID.
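A rough sketch of the queue for option b), with duplicate XIDs filtered out, could look like the following. It is a hypothetical Python illustration: `PendingQueue`, `enqueue` and `drain` are made-up names, and keying on the (client, XID) pair assumes that XIDs are only unique per client.

```python
from collections import OrderedDict


class PendingQueue:
    """Holds requests for a downed subvolume, one per (client, XID) (hypothetical)."""

    def __init__(self):
        self._pending = OrderedDict()  # preserves arrival order

    def enqueue(self, client, xid, request):
        # Returns False when the request is a retransmission already queued.
        key = (client, xid)
        if key in self._pending:
            return False
        self._pending[key] = request
        return True

    def drain(self):
        # Called once the subvolume is back; returns requests in arrival order.
        requests = list(self._pending.values())
        self._pending.clear()
        return requests


queue = PendingQueue()
first_create = queue.enqueue("client-1", 1, "CREATE f")
write = queue.enqueue("client-1", 2, "WRITE f")
retransmit = queue.enqueue("client-1", 1, "CREATE f")  # same XID as the first CREATE
replayed = queue.drain()
```

Because the retransmitted CREATE is dropped at enqueue time, replaying the queue cannot recreate the file after the WRITE has been applied.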

Comment 7 Shehjar Tikoo 2010-09-21 04:36:43 UTC

Given the time constraints for 3.1 beta, going with approach (a) and hoping that Linux clients will be the predominant use case. Filing a bug for duplicate request cache in case we do run into the situation described above.

Comment 8 Vijay Bellur 2010-09-22 08:14:16 UTC

PATCH: http://patches.gluster.com/patch/4918 in master (core: Introduce string representation of GF_EVENTS)

Comment 9 Vijay Bellur 2010-09-22 08:14:26 UTC

PATCH: http://patches.gluster.com/patch/4920 in master (distribute: Propagate CHILD-UP when all subvols are up)

Comment 10 Vijay Bellur 2010-09-22 08:14:31 UTC

PATCH: http://patches.gluster.com/patch/4921 in master (nfs, nfs3: Base volume access on CHILD-UP-DOWN event)

Comment 11 Vijay Bellur 2010-09-25 08:05:15 UTC

PATCH: http://patches.gluster.com/patch/4982 in master (nfs: Fix multiple subvolume CHILD-UP support)

Comment 12 Vijay Bellur 2010-10-11 11:31:40 UTC

PATCH: http://patches.gluster.com/patch/5435 in master (Revert "distribute: Propagate CHILD-UP when all subvols are up")

Comment 13 Vijay Bellur 2010-10-11 11:31:46 UTC

PATCH: http://patches.gluster.com/patch/5436 in master (dht: change behaviour CHILD_UP/DOWN/CONNECTING event propagation)