Bug 1687360
Summary: RHEL7.4 NFS4.1 client and server repeated SEQUENCE / TEST_STATEIDs with SEQUENCE reply has SEQ4_STATUS_RECALLABLE_STATE_REVOKED set - even after NFS server patch to properly return NFS4ERR_DELEG_REVOKED or NFS4ERR_BAD_STATEID for revoked delegations

Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: NFS
Version: 7.6
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: high
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Dave Wysochanski <dwysocha>
Assignee: Scott Mayhew <smayhew>
QA Contact: Zhi Li <yieli>
CC: agaikwad, awshaikh, baumanmo, bcodding, Bert.Deknuydt, bfields, cww, dweglinski, dwysocha, fsorenso, james-p, jiyin, jwedgeco, plambri, rbergant, revers, rhandlin, rik.theys, rmonther, sgardner, smayhew, steved, swhiteho, tbecker, xzhou, yieli, yoyang
Clone Of: 1552203
Last Closed: 2020-01-23 23:30:00 UTC
Bug Depends On: 1552203
Bug Blocks: 1477664, 1594286, 1711360
Description — Dave Wysochanski, 2019-03-11 11:16:21 UTC
Created attachment 1542889 [details]
tcpdump script to be used with nfs client systemtap script
Created attachment 1542890 [details]
nfs client systemtap script
Created attachment 1542891 [details]
nfs server simulated error injection systemtap script
Unfortunately this probably will not make RHEL7.7 unless there is a breakthrough this week where we find missing patches that can be backported. At this point I don't think we even know whether this is an upstream issue or a RHEL-only issue.

Hey Bruce, I may have found a race that could lead to this problem.

Background: the SEQ4_STATUS_RECALLABLE_STATE_REVOKED bit gets set on a SEQUENCE reply only in the following code, when the cl_revoked list is not empty:

~~~
3023 __be32
3024 nfsd4_sequence(struct svc_rqst *rqstp,
3025                struct nfsd4_compound_state *cstate,
3026                struct nfsd4_sequence *seq)
3027 {
...
3121 out:
3122         switch (clp->cl_cb_state) {
3123         case NFSD4_CB_DOWN:
3124                 seq->status_flags = SEQ4_STATUS_CB_PATH_DOWN;
3125                 break;
3126         case NFSD4_CB_FAULT:
3127                 seq->status_flags = SEQ4_STATUS_BACKCHANNEL_FAULT;
3128                 break;
3129         default:
3130                 seq->status_flags = 0;
3131         }
3132         if (!list_empty(&clp->cl_revoked))
3133                 seq->status_flags |= SEQ4_STATUS_RECALLABLE_STATE_REVOKED;
3134 out_no_session:
3135         if (conn)
3136                 free_conn(conn);
3137         spin_unlock(&nn->client_lock);
3138         return status;
3139 out_put_session:
3140         nfsd4_put_session_locked(session);
3141         goto out_no_session;
3142 }
~~~

There's one place we add to the list:

~~~
fs/nfsd/nfs4state.c
 989 static void revoke_delegation(struct nfs4_delegation *dp)
 990 {
 991         struct nfs4_client *clp = dp->dl_stid.sc_client;
 992
 993         WARN_ON(!list_empty(&dp->dl_recall_lru));
 994
 995         put_clnt_odstate(dp->dl_clnt_odstate);
 996         nfs4_put_deleg_lease(dp->dl_stid.sc_file);
 997
 998         if (clp->cl_minorversion == 0)
 999                 nfs4_put_stid(&dp->dl_stid);
1000         else {
1001                 dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
1002                 spin_lock(&clp->cl_lock);
1003                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
1004                 spin_unlock(&clp->cl_lock);
1005         }
1006 }
~~~

So the question is: could we have added to this list and forgotten to delete something we should have deleted?
Well, as it turns out, it looks like a race could occur between the above code and the handling of a FREE_STATEID. In handling FREE_STATEID, if sc_type == NFS4_REVOKED_DELEG_STID, we call list_del_init(&dp->dl_recall_lru):

~~~
5082 __be32
5083 nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
5084                    struct nfsd4_free_stateid *free_stateid)
5085 {
5086         stateid_t *stateid = &free_stateid->fr_stateid;
5087         struct nfs4_stid *s;
5088         struct nfs4_delegation *dp;
5089         struct nfs4_client *cl = cstate->session->se_client;
5090         __be32 ret = nfserr_bad_stateid;
5091
5092         spin_lock(&cl->cl_lock);
5093         s = find_stateid_locked(cl, stateid);
5094         if (!s)
5095                 goto out_unlock;
5096         switch (s->sc_type) {
5097         case NFS4_DELEG_STID:
5098                 ret = nfserr_locks_held;
5099                 break;
5100         case NFS4_OPEN_STID:
5101                 ret = check_stateid_generation(stateid, &s->sc_stateid, 1);
5102                 if (ret)
5103                         break;
5104                 ret = nfserr_locks_held;
5105                 break;
5106         case NFS4_LOCK_STID:
5107                 atomic_inc(&s->sc_count);
5108                 spin_unlock(&cl->cl_lock);
5109                 ret = nfsd4_free_lock_stateid(stateid, s);
5110                 goto out;
5111         case NFS4_REVOKED_DELEG_STID:
5112                 dp = delegstateid(s);
5113                 list_del_init(&dp->dl_recall_lru);
5114                 spin_unlock(&cl->cl_lock);
5115                 nfs4_put_stid(s);
5116                 ret = nfs_ok;
5117                 goto out;
5118         /* Default falls through and returns nfserr_bad_stateid */
5119         }
5120 out_unlock:
5121         spin_unlock(&cl->cl_lock);
5122 out:
5123         return ret;
5124 }
~~~

What if one thread inside revoke_delegation() races with a second thread inside nfsd4_free_stateid()?
Let's say we have:

thread1:

~~~
 989 static void revoke_delegation(struct nfs4_delegation *dp)
 990 {
 991         struct nfs4_client *clp = dp->dl_stid.sc_client;
 992
 993         WARN_ON(!list_empty(&dp->dl_recall_lru));
 994
 995         put_clnt_odstate(dp->dl_clnt_odstate);
 996         nfs4_put_deleg_lease(dp->dl_stid.sc_file);
 997
 998         if (clp->cl_minorversion == 0)
 999                 nfs4_put_stid(&dp->dl_stid);
1000         else {
1001                 dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
~~~

thread2:

~~~
5092         spin_lock(&cl->cl_lock);
5093         s = find_stateid_locked(cl, stateid);
5094         if (!s)
5095                 goto out_unlock;
5096         switch (s->sc_type) {
5097         case NFS4_DELEG_STID:
5098                 ret = nfserr_locks_held;
5099                 break;
5100         case NFS4_OPEN_STID:
5101                 ret = check_stateid_generation(stateid, &s->sc_stateid, 1);
5102                 if (ret)
5103                         break;
5104                 ret = nfserr_locks_held;
5105                 break;
5106         case NFS4_LOCK_STID:
5107                 atomic_inc(&s->sc_count);
5108                 spin_unlock(&cl->cl_lock);
5109                 ret = nfsd4_free_lock_stateid(stateid, s);
5110                 goto out;
5111         case NFS4_REVOKED_DELEG_STID:
5112                 dp = delegstateid(s);
5113                 list_del_init(&dp->dl_recall_lru);
5114                 spin_unlock(&cl->cl_lock);
5115                 nfs4_put_stid(s);
5116                 ret = nfs_ok;
5117                 goto out;
5118         /* Default falls through and returns nfserr_bad_stateid */
5119         }
5120 out_unlock:
5121         spin_unlock(&cl->cl_lock);
~~~

thread1:

~~~
1002                 spin_lock(&clp->cl_lock);
1003                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
1004                 spin_unlock(&clp->cl_lock);
1005         }
1006 }
~~~

Thread1 will set sc_type = NFS4_REVOKED_DELEG_STID but lose the race on cl->cl_lock. Thread2 will win the race and call list_del_init(&dp->dl_recall_lru), but this won't remove anything from cl_revoked since Thread1 hasn't added the entry yet. Thread1 will then add the entry to the cl_revoked list, and it won't be removed until the client is destroyed. I think this race also exists in upstream code, as it looks similar.
A naive fix would be to move the spinlock a little higher in revoke_delegation() to cover the setting of sc_type:

~~~
[dwysocha@dwysocha rhel7-kernel]$ git diff
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 13023d6..7b97ed3 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -998,8 +998,8 @@ static void revoke_delegation(struct nfs4_delegation *dp)
         if (clp->cl_minorversion == 0)
                 nfs4_put_stid(&dp->dl_stid);
         else {
-                dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
                 spin_lock(&clp->cl_lock);
+                dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
                 spin_unlock(&clp->cl_lock);
         }
~~~

I haven't fully worked out what Thread1 actually would be as far as a code path, so I don't have a reproducer yet. But it looks like this race could occur and could lead to what we're seeing: the NFS server has cl_revoked non-empty, but the NFS client must have already forgotten about this delegation. Otherwise the NFS client would send TEST_STATEID on it and would receive NFS4ERR_DELEG_REVOKED due to the patch for https://bugzilla.redhat.com/show_bug.cgi?id=1552203

One final related thought before I look for a reproducer. Assume we have the patch from the previous comment, and thread2 wins the race and sees sc_type == NFS4_DELEG_STID while handling FREE_STATEID: it will return NFS4ERR_LOCKS_HELD to the NFS client. I don't see any handling of that error in the NFS client, either in rhel7 or upstream, which seems odd.

We don't have a reproducer, and a possible patch would need to go upstream first, so this won't make 7.7.

That's interesting. In https://tools.ietf.org/html/rfc5661#section-18.38 it says "The FREE_STATEID operation is used to free a stateid that no longer has any associated locks (including opens, byte-range locks, delegations, and layouts)."
So if the stateid still represents a real delegation, the client should be sending DELEGRETURN; it should only be sending FREE_STATEID if it knows the delegation has already been revoked--but it can't know that until after a server thread has added the stateid to a revoked list and returned the SEQUENCE reply.

But now that I say that--that STATE_REVOKED bit only tells the client that *something* was revoked; it doesn't necessarily know what. So the client may just send FREE_STATEIDs for all of the state it knows about, ignoring any that return any kind of error?

So the situation here may be that we're revoking a bunch of delegations at once. The cl_revoked list is already non-empty and the client has already started sending out random FREE_STATEIDs, and one of them happens to arrive just as we're revoking another delegation.

That seems like a small window for a race, but maybe if you're doing a ton of revocations it becomes possible?

Your patch would make sense anyway, good find.

(In reply to J. Bruce Fields from comment #13)
> That's interesting.
> In https://tools.ietf.org/html/rfc5661#section-18.38 it says "The
> FREE_STATEID operation is used to free a stateid that no longer has any
> associated locks (including opens, byte-range locks, delegations, and
> layouts)."
>
> So if the stateid still represents a real delegation, the client should be
> sending DELEGRETURN; it should only be sending FREE_STATEID if it knows the
> delegation has already been revoked--but it can't know that till after a
> server thread has added the stateid to a revoked list and returned the
> SEQUENCE reply.

Hmm.. I was looking today at the NFS client code and found that it looks like it could send FREE_STATEID following a TEST_STATEID that completes with NFS4ERR_EXPIRED or a similar error. Not sure exactly what sequence would cause that, or whether it's possible or not. Maybe if the client gets a delegation and a network partition occurs, you won't see a DELEGRETURN but a TEST_STATEID / FREE_STATEID sequence? Depends on what the server returns though - I saw NFS4ERR_BAD_STATEID for the TEST_STATEID, not NFS4ERR_EXPIRED.
nfs41_check_delegation_stateid() -> nfs41_test_and_free_expired_stateid()

~~~
2477 #if defined(CONFIG_NFS_V4_1)
2478 static int nfs41_test_and_free_expired_stateid(struct nfs_server *server,
2479                 nfs4_stateid *stateid,
2480                 struct rpc_cred *cred)
2481 {
2482         int status;
2483
2484         switch (stateid->type) {
2485         default:
2486                 break;
2487         case NFS4_INVALID_STATEID_TYPE:
2488         case NFS4_SPECIAL_STATEID_TYPE:
2489                 return -NFS4ERR_BAD_STATEID;
2490         case NFS4_REVOKED_STATEID_TYPE:
2491                 goto out_free;
2492         }
2493
2494         status = nfs41_test_stateid(server, stateid, cred);
2495         switch (status) {
2496         case -NFS4ERR_EXPIRED:
2497         case -NFS4ERR_ADMIN_REVOKED:
2498         case -NFS4ERR_DELEG_REVOKED:
2499                 break;
2500         default:
2501                 return status;
2502         }
2503 out_free:
2504         /* Ack the revoked state to the server */
2505         nfs41_free_stateid(server, stateid, cred, true);
2506         return -NFS4ERR_EXPIRED;
2507 }
~~~

> But now that I say that--that STATE_REVOKED bit doesn't tell the client that
> *something* was revoked, it doesn't necessarily know what. So the client
> may just send FREE_STATEID's for all of the state it knows about, ignoring
> any that return any kind of error?
>
> So the situation here may be that we're revoking a bunch of delegations at
> once. The cl_revoked list is already non-empty and the client has already
> started sending out random FREE_STATEID's, and one of them happens to arrive
> just as we're revoking another delegation.

Hmmm, yeah, I'll have to check that scenario and see what can generate the FREE_STATEIDs.

> That seems like a small window for a race, but maybe if you're doing a ton
> of revocations it becomes possible?
>
> Your patch would make sense anyway, good find.

Ok cool. I'll study this a bit more tomorrow, then assuming I can come up with a header that makes sense I'll send that 1-liner to the list.

Created attachment 1550559 [details]
patch to fix race on cl_revoked list
Created attachment 1550822 [details]
Patch to fix this bug against 5.1-rc3
Created attachment 1551104 [details]
one possible code flow that may lead to race condition
So far no luck reproducing this or showing that the race condition can occur. Scott and I have been talking on IRC, but so far neither of us can repro this. To my knowledge I've not even been able to see FREE_STATEID sent from the client, or even cl_revoked populated. It looks like maybe for this to occur you need:

1. The NFS server to trigger recalls of delegations, which will populate the del_recall_lru list. This is easily done by having one NFS client hold files open (i.e. tail -f) while a second NFS client, or a process on the server, either truncates or writes to the files.
2. The del_recall_lru list to remain populated for at least /proc/fs/nfsd/nfsv4leasetime, so that nfs4_laundromat runs and calls revoke_delegation(), which adds to cl_revoked.

So far #2 seems elusive, but I think Scott got this at one point.

Ok, so for #2 Scott tried a small stap on the client and it's really effective at getting a good portion of the NFS client behavior:

~~~
stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") { printf("%s: delaying\n", ppfunc()); mdelay(10); }'
~~~

I ran the above on an NFS client (957.5.1.el7) plus this:

~~~
for i in $(seq 1 5000); do tail -f /mnt/nfs/test/file-$i.bin & done
~~~

Then on a second NFS client did this:

~~~
for i in $(seq 1 5000); do date >> /mnt/nfs/test/file-$i.bin & done
~~~

I didn't get the hang yet, but I think Scott may have. In a tcpdump I see a TEST_STATEID reply with NFS4ERR_DELEG_REVOKED followed immediately by FREE_STATEID, and SEQUENCE replies with SEQ4_STATUS_RECALLABLE_STATE_REVOKED set. The stream of operations clears for me though and I don't get a hang - eventually at the end I just see an idle connection with SEQUENCE sent every 10s (lease time is 15 seconds), and no SEQ4_STATUS_RECALLABLE_STATE_REVOKED bit set. The only piece missing for the race to happen is nfsd's nfs4_laundromat thread getting delayed on the cl_lock inside revoke_delegation().
This seems like it may be a stretch, but it's definitely possible to get contention on cl_lock, and the other missing pieces are there now.

(In reply to Dave Wysochanski from comment #22)
> Ok so for #2 scott tried a small stap on the client and it's really
> effective to get a good portion of the NFS client behavior:
> stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") {
> printf("%s: delaying\n", ppfunc()); mdelay(10); }'

While the above is an artificially injected delay, I think it's at least possible such a delay could happen. Consider the code path below.

* nfs_delegation_find_inode_server: a loop that is O(# delegations), taking delegation->lock. We would not expect delegation->lock to be highly contended, but some contention is possible due to the state manager running through the list at the same time.
* nfs_delegation_find_inode -> nfs_delegation_find_inode_server: only called from nfs4_callback_recall (should only be called once with a Linux NFS server) and nfs4_callback_getattr (should never be called with a Linux NFS server).
* It is possible inode->i_lock is contended, though it should only be one spinlock per nfs4_callback_recall.

~~~
 71 __be32 nfs4_callback_recall(struct cb_recallargs *args, void *dummy,
 72                             struct cb_process_state *cps)
 73 {
 74         struct inode *inode;
 75         __be32 res;
 76
 77         res = htonl(NFS4ERR_OP_NOT_IN_SESSION);
 78         if (!cps->clp) /* Always set for v4.0. Set in cb_sequence for v4.1 */
 79                 goto out;
 80
 81         dprintk_rcu("NFS: RECALL callback request from %s\n",
 82                 rpc_peeraddr2str(cps->clp->cl_rpcclient, RPC_DISPLAY_ADDR));
 83
 84         res = htonl(NFS4ERR_BADHANDLE);
 85         inode = nfs_delegation_find_inode(cps->clp, &args->fh);
...
781 /**
782  * nfs_delegation_find_inode - retrieve the inode associated with a delegation
783  * @clp: client state handle
784  * @fhandle: filehandle from a delegation recall
785  *
786  * Returns pointer to inode matching "fhandle," or NULL if a matching inode
787  * cannot be found.
788  */
789 struct inode *nfs_delegation_find_inode(struct nfs_client *clp,
790                                         const struct nfs_fh *fhandle)
791 {
792         struct nfs_server *server;
793         struct inode *res = NULL;
794
795         rcu_read_lock();
796         list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
797                 res = nfs_delegation_find_inode_server(server, fhandle);
798                 if (res != NULL)
799                         break;
800         }
801         rcu_read_unlock();
802         return res;
803 }

761 static struct inode *
762 nfs_delegation_find_inode_server(struct nfs_server *server,
763                                  const struct nfs_fh *fhandle)
764 {
765         struct nfs_delegation *delegation;
766         struct inode *res = NULL;
767
768         list_for_each_entry_rcu(delegation, &server->delegations, super_list) {
769                 spin_lock(&delegation->lock);
770                 if (delegation->inode != NULL &&
771                     nfs_compare_fh(fhandle, &NFS_I(delegation->inode)->fh) == 0) {
772                         res = igrab(delegation->inode);
773                 }
774                 spin_unlock(&delegation->lock);
775                 if (res != NULL)
776                         break;
777         }
778         return res;
779 }

1230 struct inode *igrab(struct inode *inode)
1231 {
1232         spin_lock(&inode->i_lock);
1233         if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
1234                 __iget(inode);
1235                 spin_unlock(&inode->i_lock);
1236         } else {
1237                 spin_unlock(&inode->i_lock);
1238                 /*
1239                  * Handle the case where s_op->clear_inode is not been
1240                  * called yet, and somebody is calling igrab
1241                  * while the inode is getting freed.
1242                  */
1243                 inode = NULL;
1244         }
1245         return inode;
1246 }
1247 EXPORT_SYMBOL(igrab);
~~~

The customer reproduced with the patch from comment #17 and gave us tcpdump and mountstats data that shows a FREE_STATEID op count of 0. So it's safe to say the customer is hitting something other than the theoretical race I described earlier.
I'm actually not sure the data the customer provided shows a loop though - under "normal operation", due to how this bit works, there could be a crap ton of TEST_STATEIDs and SEQUENCE ops with RECALLABLE_STATE_REVOKED set if there are a ton of delegations (which sounds like is the case for the customer).

Right now I'm looking at this patch as a possibility, since it's not yet backported to RHEL7 and it affects the client's CB_RECALL reply status code and whether the client will send a DELEGRETURN:

~~~
commit 6c342655022d5189c45e4f7ed0cc8048c9ad9815
Author: Trond Myklebust <trond.myklebust>
Date:   Thu Jun 7 14:22:00 2018 -0400

    NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()

    If the attempt to recall the delegation fails because the inode is in
    the process of being evicted from cache, then use NFS4ERR_DELAY to ask
    the server to retry later.

    Signed-off-by: Trond Myklebust <trond.myklebust>
~~~

I've been able to reproduce this a couple of times now. My client is running 3.10.0-957.5.1.el7.x86_64 and my server is running 4.20.16-200.fc29.x86_64. Both VMs have 4 CPUs and 8G of memory. The server has 64 nfsd threads and a 15 second lease time. On the client I'm running in one window:

~~~
# stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") { printf("%s: delaying\n", ppfunc()); mdelay(10); }'
~~~

and in another window:

~~~
# for i in `seq -w 1 10000`; do sleep 2m </mnt/t/dir1/file.$i & done
~~~

(Note: I already created the files ahead of time.) As soon as the bash prompt returns on the client, I run the following on the server:

~~~
# for i in `seq -w 1 10000`; do date >/export/dir1/file.$i & done
~~~

At first the server starts doing CB_RECALLs and the client starts doing DELEGRETURNs. Then the server begins revoking delegations and the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag is set. The client starts doing TEST_STATEID and FREE_STATEID (when the response from the server is NFS4ERR_DELEG_REVOKED).
I even see the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag get cleared briefly, but it is set again a short time later. Eventually I see the last couple of CB_RECALLs returning NFS4ERR_DELAY, and finally a CB_RECALL returning NFS4ERR_SEQ_MISORDERED. After the NFS4ERR_SEQ_MISORDERED, I see no more CB_RECALLs. Eventually, after all the sleep commands finish, I see no more OPEN or TEST_STATEID ops, but the bare SEQUENCE ops to keep the lease active still have the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag set in the replies. So I took a vmcore of the server at this point.

~~~
crash> p nfsd_net_id
nfsd_net_id = $1 = 10
crash> p &init_net
$2 = (struct net *) 0xffffffffbe314d40
crash> p $2->gen
$3 = (struct net_generic *) 0xffff92853750f180
crash> p $3->ptr[10]
$4 = (void *) 0xffff928532619200
crash> p &((struct nfsd_net *)0xffff928532619200)->client_lru
$5 = (struct list_head *) 0xffff928532619288
crash> list -H 0xffff928532619288 -o nfs4_client.cl_lru
ffff9285245da958
crash> p &((struct nfs4_client *)0xffff9285245da958)->cl_revoked
$6 = (struct list_head *) 0xffff9285245da9c0
crash> list -H 0xffff9285245da9c0 -o nfs4_delegation.dl_recall_lru|wc -l
8470
~~~

That's 8470 delegations on the cl_revoked list.

~~~
crash> p &((struct nfs4_client *)0xffff9285245da958)->cl_cb_waitq
$7 = (struct rpc_wait_queue *) 0xffff9285245dad38
crash> rpc_wait_queue.qlen 0xffff9285245dad38
  qlen = 8315
~~~

8315 tasks on the cl_cb_waitq. None of them are making progress. I know this because earlier I had a live crash session running on the server and there were no changes for hours. Probably because cl_cb_slot_busy is set:

~~~
crash> nfs4_client.cl_cb_slot_busy 0xffff9285245da958
  cl_cb_slot_busy = 1
crash> px &((struct rpc_wait_queue *)0xffff9285245dad38)->tasks[0]
$8 = (struct list_head *) 0xffff9285245dad40
~~~

All of the tasks have a tk_action of rpc_prepare_task, so those haven't been transmitted yet (which makes sense, because I definitely didn't see 10,000 CB_RECALLs in wireshark).
~~~
crash> list -H 0xffff9285245dad40 -o rpc_task.u -s rpc_task.tk_action |head
ffff92851b978f00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978e00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978800
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978a00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978000
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
~~~

I think maybe there's something wrong with how NFS4ERR_SEQ_MISORDERED is handled in nfsd4_cb_sequence_done().

I have been reviewing the list of commits upstream but not yet backported to RHEL7. This is a possibly interesting NFS client commit, given: 1) the customer's indication that the repro involves interrupting a build process, and 2) Scott's observation of SEQ_MISORDERED, which is an unusual error.

~~~
commit 3453d5708b33efe76f40eca1c0ed60923094b971
Author: Trond Myklebust <trond.myklebust>
Date:   Wed Jun 20 17:53:34 2018 -0400

    NFSv4.1: Avoid false retries when RPC calls are interrupted

    A 'false retry' in NFSv4.1 occurs when the client attempts to transmit
    a new RPC call using a slot+sequence number combination that references
    an already cached one. Currently, the Linux NFS client will do this if
    a user process interrupts an RPC call that is in progress.
    The problem with doing so is that we defeat the main mechanism used by
    the server to differentiate between a new call and a replayed one. Even
    if the server is able to perfectly cache the arguments of the old call,
    it cannot know if the client intended to replay or send a new call.
    The obvious fix is to bump the sequence number pre-emptively if an RPC
    call is interrupted, but in order to deal with the corner cases where
    the interrupted call is not actually received and processed by the
    server, we need to interpret the error NFS4ERR_SEQ_MISORDERED as a sign
    that we need to either wait or locate a correct sequence number that
    lies between the value we sent, and the last value that was acked by a
    SEQUENCE call on that slot.
~~~
    Signed-off-by: Trond Myklebust <trond.myklebust>
    Tested-by: Jason Tibbitts <tibbs.edu>

We do not have a reproducer, and the only person hitting this has said they can no longer reproduce after 3.10.0-1062.4.1.el7. We have patches that fix issues in this area of code, but as far as I understand there was a series involved, and the cost/benefit is unclear at this point.