Bug 2127067
Summary: | [RHEL 9] BUG: KASAN: use-after-free in nfsd4_cb_prepare+0x227/0x250 [nfsd] | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Zhi Li <yieli> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
kernel sub component: | NFS | QA Contact: | Zhi Li <yieli> |
Status: | CLOSED MIGRATED | Docs Contact: | |
Severity: | unspecified | | |
Priority: | unspecified | CC: | adscvr, bcodding, chuck.lever, jiyin, jlayton, smayhew, steved, xzhou, yoyang |
Version: | 9.2 | Keywords: | MigratedToJIRA |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-09-23 11:15:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 2091421 | | |
Description
Zhi Li
2022-09-15 09:14:17 UTC
This issue can also be reproduced on rhel8, tracked by bz1849472.

Making this bug public, since I suspect this is related to an upstream bug Chuck has been hunting here:

    https://bugzilla.linux-nfs.org/show_bug.cgi?id=394

All of the places where this has been reproduced so far seem to be running a regression test for bz1687360. Here's the script that runs in that test:

    Server() {
        rlPhaseStartSetup do-$role-Setup-
            rlFileBackup /etc/exports /etc/nfs.conf
            run "mkdir -p $expdir"
            run "echo \"$expdir *(rw,no_root_squash)\" >/etc/exports"
            run "echo -e \"[nfsd]\n threads=64\n lease-time=15\" > /etc/nfs.conf"
            run "service_nfs restart"
            run "tmux new -d \"tcpdump -U -q -i $(getDefaultNic) -w $PCAP_FILE -s 0 host $CLIENT\""
            run "ps aux | grep -v grep | grep -i tmux"
        rlPhaseEnd

        rlPhaseStartTest do-$role-Test-
            run "rhts-sync-set -s servReady"
            run "rhts-sync-block -s testDone $CLIENT"
            run 'for i in $(seq 1 10000); do date > $expdir/file_$i & done'
            run "sleep 200"
            run "pkill tcpdump && sleep 3"
            run "ps aux | grep -v grep | grep tcpdump" 1
            run "tshark -ntad -r ${PCAP_FILE} &> /tmp/test.log"
            run 'count=`cat /tmp/test.log | grep TEST_STATEID | wc -l`'
            run "test $count = 1" 0 "NFS client should only send TEST_STATEID one time, got $count"
            run "cat /tmp/test.log | grep TEST_STATEID | tail" -
            run "rlFileSubmit $PCAP_FILE"
            run "rhts-sync-set -s servtestDone"
            run "rhts-sync-block -s clicleanDone $CLIENT"
        rlPhaseEnd

        rlPhaseStartCleanup do-$role-Cleanup-
            rlFileRestore
            run "rm -rf $expdir"
        rlPhaseEnd
    }

    Client() {
        rlPhaseStartSetup do-$role-Setup-
            run "mkdir -p $nfsmp"
            Vers=`uname -r`
            [[ "$Vers" != *+debug ]] && {
                run "debuginfo-install -y kernel-${Vers}"
                run 'rpm -q kernel-debuginfo-$Vers || brewinstall.sh -onlydebuginfo kernel-${Vers%.*}'
            }
            run "rhts-sync-block -s servReady $SERVER"
        rlPhaseEnd

        rlPhaseStartTest do-$role-Test-
            run "mount -overs=4.2 $SERVER:$expdir $nfsmp"
            run 'for i in $(seq 1 10000); do touch $nfsmp/file_$i; done'
            run "stap -gve 'probe module(\"nfsv4\").function(\"nfs4_callback_recall\") { printf(\"%s: delaying\n\", ppfunc()); mdelay(10); }' &> $STAP_OUT &"
            run "while true; do grep -q \"Pass 5: starting run\" $STAP_OUT && break; sleep 1; done"
            run "cat $STAP_OUT"
            run 'for i in $(seq 1 10000); do sleep 2m < $nfsmp/file_$i & done'
            run "rhts-sync-set -s testDone"
            run "rhts-sync-block -s servtestDone $SERVER"
            while ps aux | grep -v grep | grep sleep &> /dev/null; do
                pkill sleep
                sleep 3
            done
            run "ps aux | grep -v grep | grep sleep" 1
            run "umount -l $nfsmp"
            run "rm -rf $nfsmp"
            run "rhts-sync-set -s clicleanDone"
        rlPhaseEnd

        rlPhaseStartCleanup do-$role-Cleanup-
            rlFileRestore
        rlPhaseEnd
    }

I don't have the test harness set up, but this gives you an idea of what this test is doing. It sets a very short lease lifetime, and then introduces some artificial delays on the client-side recall handling. I guess this is to set up a situation where the delegation gets fully revoked.

Hypothesis: the problem is in the revoked delegation handling.

I'll try to turn this into something I can run by hand and see if I can reproduce it that way.

I think I may have spotted it:

    [jlayton@tleilax kernel-rhel9]$ git diff
    diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
    index 9069599c699f..e6384e350785 10064
    --- a/fs/nfsd/nfs4state.c
    +++ b/fs/nfsd/nfs4state.c
    @@ -5991,7 +5991,6 @@ nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
     	if (!*s)
     		return nfserr_bad_stateid;
     	if (((*s)->sc_type == NFS4_REVOKED_DELEG_STID) && !return_revoked) {
    -		nfs4_put_stid(*s);
     		if (cstate->minorversion)
     			return nfserr_deleg_revoked;
     		return nfserr_bad_stateid;

In the case where we find a REVOKED stateid, we are putting the reference but still filling out the return pointer.

No, spoke too soon. It's ugly, but the callers seem to handle that ok. I'll still plan to send a cleanup patch to make nfsd4_lookup_stateid only set the return pointer on success, since that's best practice.
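
For illustration, the "only set the return pointer on success" convention mentioned above can be sketched in a small userspace C model. This is not the nfsd code and not the planned cleanup patch: struct stid, lookup_stid(), stid_get(), stid_put(), and the enum values are hypothetical stand-ins invented for this sketch, and the refcount is a plain int rather than a kernel refcount type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real nfsd types. */
enum stid_type { STID_OPEN, STID_DELEG, STID_REVOKED_DELEG };

struct stid {
	int refcount;
	enum stid_type type;
};

enum lookup_status {
	LOOKUP_OK = 0,
	LOOKUP_BAD_STATEID,
	LOOKUP_DELEG_REVOKED
};

void stid_get(struct stid *s)
{
	s->refcount++;
}

void stid_put(struct stid *s)
{
	/* Catch an unbalanced put in this toy model. */
	assert(s->refcount > 0);
	s->refcount--;
}

/*
 * Model of a lookup that hands out a reference.
 * *out is written only when LOOKUP_OK is returned, so a caller can
 * never be left holding a pointer whose reference was already dropped.
 */
enum lookup_status lookup_stid(struct stid *found, int return_revoked,
			       struct stid **out)
{
	if (!found)
		return LOOKUP_BAD_STATEID;

	/* The lookup takes its own reference on the object it found. */
	stid_get(found);

	if (found->type == STID_REVOKED_DELEG && !return_revoked) {
		/* Drop the reference and leave *out untouched on error. */
		stid_put(found);
		return LOOKUP_DELEG_REVOKED;
	}

	*out = found;
	return LOOKUP_OK;
}
```

With this shape, an error return leaves the caller's pointer at whatever it was initialised to (typically NULL), so a later unconditional put in the caller cannot touch an object whose reference has already been released.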
I spent a bunch of time on Friday trying to reproduce this, but no joy. I also crawled through the delegation handling code (again) in the hopes I'd spot the refcounting issue, but I still don't see it. I did come away with a few cleanup patches that I'll post soon.

I may have to table working on this until we can come up with a way to reliably reproduce it.

I'm running Fedora Server 37 and I hit the same BUG, which makes the system totally unstable. The only solution I've found was to downgrade to libnfsidmap-1:2.6.2-1.rc3.fc37 and nfs-utils-1:2.6.2-1.rc3.fc37. Now the system is stable. Hope this helps.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

    "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.