Bug 2217964

Summary: NFSv4.1+ client is not freezing the session table upon BADSESSION, leading to improper reuse of the slot and causing the application to fail with EIO
Product: Red Hat Enterprise Linux 9
Reporter: Olga Kornieskaia <kolga>
Component: kernel
Assignee: Benjamin Coddington <bcodding>
kernel sub component: NFS
QA Contact: Zhi Li <yieli>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: bcodding, nfs-team, xzhou, yoyang
Version: 9.2
Keywords: Triaged
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-5.14.0-340.el9
Clone Of: 2217963
Last Closed: 2023-11-07 08:48:33 UTC
Type: Bug
Bug Depends On: 2217963

Description Olga Kornieskaia 2023-06-27 16:38:50 UTC
+++ This bug was initially created as a clone of Bug #2217963 +++

Description of problem:

During session trunking testing, when one of the trunked server IPs left the trunking group, the application failed with EIO.

The sequence of events that led to the failure was as follows:
1. The client sends a LOCK request to the server on IP1. The server processes the request and populates the session reply cache. However, the reply never reaches the client because the connection gets reset.
2. At this point, one of the LIFs (IP2) migrates and triggers a change in trunking group membership.
3. The client retries the request, but this time the request is sent to the server on IP2. This causes the server to return NFS4ERR_BADSESSION.
4. Upon receiving BADSESSION, the client initiates session recovery, which schedules the state manager thread, but the thread doesn't get to run right away. The client releases the slot without advancing its sequence number.
5. A different LOCK operation gets to run. Because the session table isn't frozen until the state manager actually runs, the client reuses the released slot and sends a LOCK (with different arguments) to the server on IP1. Since the slot and sequence number match the request from step 1, the server treats it as a retransmission and replies out of its cache.
6. The client recovers the session and then proceeds to use the locking stateid received in step 5. However, that stateid came from the cached reply to the step 1 LOCK and is bogus for this file handle. The server fails the request with BAD_STATEID, which leads to an EIO error.
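
A minimal Python sketch of why step 5 gets a stateid for the wrong file, assuming simplified NFSv4.1 SEQUENCE slot rules: a request whose sequence number equals the one cached on the slot is treated as a retransmission and answered from the reply cache; cached + 1 is a new request; anything else is misordered. ServerSlot, server_lock(), SeqResult and the "lock-stateid-for-<file>" replies are hypothetical names for illustration, not server code. Real servers may optionally detect such a "false retry" (NFS4ERR_SEQ_FALSE_RETRY); this model deliberately does not.

```python
from dataclasses import dataclass
from enum import Enum


class SeqResult(Enum):
    NEW = "new"              # seqid == cached + 1: execute and cache the reply
    REPLAY = "replay"        # seqid == cached: answer from the reply cache
    MISORDERED = "misordered"


@dataclass
class ServerSlot:
    """Hypothetical, simplified server-side session slot (not real server code)."""
    seqid: int = 0           # seqid of the last request executed on this slot
    cached_reply: str = ""   # reply kept for replay on retransmission


def server_lock(slot: ServerSlot, seqid: int, file: str) -> tuple[SeqResult, str]:
    """Handle a LOCK on `file` arriving on `slot` with sequence number `seqid`.

    A seqid equal to the cached one is answered from the cache without
    comparing arguments (no false-retry detection in this model).
    """
    if seqid == slot.seqid:
        return SeqResult.REPLAY, slot.cached_reply
    if seqid != slot.seqid + 1:
        return SeqResult.MISORDERED, ""
    slot.seqid = seqid
    slot.cached_reply = f"lock-stateid-for-{file}"
    return SeqResult.NEW, slot.cached_reply


def replay_scenario() -> tuple[SeqResult, str]:
    slot = ServerSlot()
    client_seqid = 1  # client's next seqid for this slot

    # Step 1: LOCK on fileA executes on IP1 and its reply is cached,
    # but the reply is lost when the connection resets.
    server_lock(slot, client_seqid, "fileA")

    # Steps 3-4: the retry to IP2 gets BADSESSION and the client releases
    # the slot without advancing client_seqid.

    # Step 5: a different LOCK, on fileB, reuses the slot with the same seqid.
    return server_lock(slot, client_seqid, "fileB")


result, reply = replay_scenario()
print(result.name, reply)  # REPLAY lock-stateid-for-fileA
```

The second LOCK never executes: it receives the stateid cached for fileA, which is what later fails with BAD_STATEID in step 6.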

The fix is in Trond's testing branch and is slated for 6.5.


commit c907e72f58ed979a24a9fdcadfbc447c51d5e509
Author: Olga Kornievskaia <kolga>
Date:   Sun Jun 18 17:32:25 2023 -0400

    NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
    
    When the client receives NFS4ERR_BADSESSION, it schedules recovery
    and starts the state manager thread, which in turn freezes the
    session table and does not allow any new requests to use the
    no-longer-valid session. However, it is possible that before
    the state manager thread runs, a new operation would use the
    released slot that received BADSESSION and therefore did not
    have its sequence number updated. Such reuse of the slot can
    lead to application errors.
    
    Fixes: 5c441544f045 ("NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()")
    Signed-off-by: Olga Kornievskaia <kolga>
    Signed-off-by: Trond Myklebust <trond.myklebust>
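
The effect of the fix can be sketched with a toy client-side slot table. This is a minimal Python model, not the Linux client code: ClientSlotTable, alloc_slot(), on_badsession(), recover_session() and the freeze_now switch are hypothetical. freeze_now=False models the old behaviour (the table stays open until the state manager runs); freeze_now=True models freezing the table as soon as BADSESSION arrives, so a second LOCK must wait for recovery instead of reusing the stale slot.

```python
from typing import Optional


class ClientSlotTable:
    """Hypothetical client-side slot table model (not the Linux kernel code)."""

    def __init__(self, nslots: int = 1) -> None:
        self.nslots = nslots
        self.frozen = False            # while True, no new slots are handed out
        self.in_use: set[int] = set()
        self.seqids = [1] * nslots     # next seqid to send on each slot
        self.session = 1               # bumped when the state manager recovers

    def alloc_slot(self) -> Optional[int]:
        """Return a free slot number, or None if the caller must wait."""
        if self.frozen:
            return None
        for i in range(self.nslots):
            if i not in self.in_use:
                self.in_use.add(i)
                return i
        return None

    def on_badsession(self, slot: int, freeze_now: bool) -> None:
        """Handle NFS4ERR_BADSESSION on `slot`; its seqid is NOT advanced."""
        if freeze_now:
            self.frozen = True         # the fix: freeze before releasing the slot
        self.in_use.discard(slot)

    def recover_session(self) -> None:
        """State manager: switch to a new session with fresh slots, then thaw."""
        self.session += 1
        self.seqids = [1] * self.nslots
        self.in_use.clear()
        self.frozen = False


def next_lock_slot(freeze_now: bool) -> tuple[Optional[int], Optional[int]]:
    """Return (slot, seqid) a second LOCK would get before recovery runs."""
    table = ClientSlotTable()
    slot = table.alloc_slot()              # step 1: first LOCK takes slot 0
    table.on_badsession(slot, freeze_now)  # steps 3-4: BADSESSION, slot released
    reused = table.alloc_slot()            # step 5: another LOCK before recovery
    if reused is None:
        return None, None
    return reused, table.seqids[reused]


print(next_lock_slot(freeze_now=False))  # (0, 1): stale seqid on the old session
print(next_lock_slot(freeze_now=True))   # (None, None): LOCK waits for recovery
```

Freezing at the moment the error is seen closes the window between releasing the slot and the state manager thread getting scheduled, which is the window exploited in step 5.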

Asking for this to be fixed in zstream.

Comment 13 errata-xmlrpc 2023-11-07 08:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6583