Bug 2331382 - Kernel 6.12.* causes NFS/mount hangs when using cachefilesd (NFS caching)
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 41
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-12-10 14:52 UTC by Bert DeKnuydt
Modified: 2025-06-28 09:36 UTC
CC List: 19 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:



Description Bert DeKnuydt 2024-12-10 14:52:33 UTC
1. Please describe the problem:

Using 6.12.* as NFS-4.2 client, with cachefilesd ON, causes
mount/unmount problems. Switching off cachefilesd solves
the problems.

2. What is the Version-Release number of the kernel:

6.12.4 (but same on .1 and .3)

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This has worked without problems in all 6.11.*
The first problems appeared with 6.12.1.

(However: there were previously similar problems in Fedora 40 with
much older kernels)

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

It is not 100% reproducible, but simply logging in with an NFS-4.2 home directory
triggers a hang with a half-mounted filesystem.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not tried with Rawhide yet (as 6.12 is already ahead of current)

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

There is nothing logged about this, not in the dmesg, not in the journal.


Reproducible: Always

Comment 1 Bert DeKnuydt 2024-12-19 09:23:37 UTC
Some extra info:

* Kernel 6.12.5-200.fc41 has the exact same problem

* Also aarch64 has the exact same problem

Comment 2 Jack Snodgrass 2025-01-16 17:18:01 UTC
I have the same issue.  I am on: 
Linux 6.12.9-200.fc41.x86_64
nfs-utils-2.8.1-4.rc2.fc41.x86_64

I start up cachefilesd

I have: 
NV SERVER   PORT DEV          FSID                              FSC
v4 0a0c0e01  801 0:90         5c95aeb110ab56f0:0                yes
v4 0a0c0e1e  801 0:76         e228d38d2b7a0f8c:0                yes

I have files (newly created AFTER cachefilesd is started) in  /var/cache/fscache
find  /var/cache/fscache/ -type f  | wc
     58      58    9135

but very shortly after trying to access anything on my nfs share the process accessing the nfs share hangs. 

dmesg and journalctl don't show any errors. 

I do have some 'stuck' processes that seemed to have started up at the time of the file system hang: 

root        2177     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2178     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2182     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2183     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2186     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2198     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...
root        2199     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...
root        2200     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...

I am not 100% certain that they are related... but I don't recall seeing those before... and the system is definitely 'waiting' on something and the time is about the time I had the issue.... so.... 

This is the FIRST time I've ever looked at cachefilesd so I don't have any idea if it worked on a different kernel version.
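
The two checks in this comment (which NFS volumes have FSC enabled, and how many files sit in the cache) can be sketched as a small script. This is only a sketch under assumptions: the table is assumed to come from /proc/fs/nfsfs/volumes with FSC as the last column, the function names `count_fsc_volumes` and `count_cache_files` are made up for illustration, and the `VOLUMES_FILE` and `CACHE_DIR` overrides are hypothetical conveniences.

```shell
#!/bin/sh
# Count NFS volumes whose FSC column reads "yes".
# Assumes the layout quoted above: one header line, then one
# volume per line, with FSC as the last field.
count_fsc_volumes() {
    awk 'NR > 1 && $NF == "yes" { n++ } END { print n + 0 }' "$1"
}

# Count regular files below a cache directory.
count_cache_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Assumed default locations; override via the (hypothetical) variables.
volumes=${VOLUMES_FILE:-/proc/fs/nfsfs/volumes}
cache_dir=${CACHE_DIR:-/var/cache/fscache}

if [ -r "$volumes" ]; then
    echo "volumes with FSC enabled: $(count_fsc_volumes "$volumes")"
fi
if [ -d "$cache_dir" ]; then
    echo "files in cache: $(count_cache_files "$cache_dir")"
fi
```

Running it once before and once after touching the NFS share would show whether the cache is being populated at all before the hang sets in.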

Comment 3 Jack Snodgrass 2025-01-16 18:21:10 UTC
I downloaded and installed 

dnf install kernel-modules-core-6.11.4-301.fc41.x86_64  \
kernel-core-6.11.4-301.fc41.x86_64 \
kernel-modules-6.11.4-301.fc41.x86_64 \
kernel-tools-libs-6.11.4-301.fc41.x86_64 \
kernel-tools-6.11.4-301.fc41.x86_64 \
kernel-modules-extra-6.11.4-301.fc41.x86_64 \
kernel-6.11.4-301.fc41.x86_64

using koji download-build

and booted up with the older, kernel-6.11.4-301.fc41.x86_64 kernel. 
uname reports: Linux 6.11.4-301.fc41.x86_64

Now I have 1500+ files in my /var/cache/fscache dir ( -vs- 50 when it started and hung up ) and my nfs stuff works and does not hang, so I can say that the older 6.11.4-301 kernel works with the cachefilesd stuff. 

I checked again and I still have: 
root        4223     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4224     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4225     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4226     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4227     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4228     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4229     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4230     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
but the nfs stuff seems to be working so I don't think that those are related to the issue with the newer 6.12 kernel.
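
For anyone who wants to repeat the rollback described in this comment, the steps might look roughly like the sketch below. It is an illustration, not the exact commands that were run: the helper names `run` and `print_rollback_commands` and the `RUN` switch are invented here, and the script only prints the commands unless `RUN=1` is set. The package list is the one given in the comment.

```shell
#!/bin/sh
# Print a command, or execute it when RUN=1 (hypothetical switch).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Fetch a kernel build from Koji and install the subpackages
# listed in the comment above.
print_rollback_commands() {
    version=$1   # e.g. 6.11.4-301.fc41
    arch=$2      # e.g. x86_64
    run koji download-build --arch="$arch" "kernel-$version"
    set --
    for pkg in kernel kernel-core kernel-modules kernel-modules-core \
               kernel-modules-extra kernel-tools kernel-tools-libs; do
        set -- "$@" "./$pkg-$version.$arch.rpm"
    done
    run sudo dnf install "$@"
}

print_rollback_commands 6.11.4-301.fc41 x86_64
```

The dry-run default is deliberate: it lets the package list be reviewed before anything is downloaded or installed.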

Comment 4 Ian Donaldson 2025-01-17 01:08:31 UTC
I'm seeing same on 6.12.9-200.fc41.x86_64 with cachefilesd enabled.

Comment 5 Ian Donaldson 2025-01-17 01:10:38 UTC
(rolled back to 6.11.11-300.fc41.x86_64 which works fine)

Comment 6 Bert DeKnuydt 2025-01-20 10:25:38 UTC
It seems kernel 6.12.10 solved the problem.  At least, I'm no longer able to trigger the problem.

Comment 7 Bert DeKnuydt 2025-01-23 08:54:11 UTC
Correction: it happens a lot less frequently.

Comment 8 Francesco Simula 2025-06-26 13:30:19 UTC
The very same problem reappeared instantly and repeatably on kernel 6.15 on Fedora 42.

Comment 9 Bert DeKnuydt 2025-06-26 14:22:08 UTC
Seconded, with all 6.15.{1..3} affected.

Actually, it's even worse: after a reboot into 6.15.3, even before any NFS is actually mounted, a 'systemctl stop cachefilesd' 
can already hang the whole machine. So you need to 'systemctl disable' it before anyone boots into the fresh kernel.

As we heavily use NFS-caching, the recurrence of this is really a pain.  But it seems a little-used feature outside of academia...
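
A pre-reboot check along these lines might help catch the situation before users boot into an affected kernel. It is a rough sketch only: the function name `is_reported_kernel` is made up, and the pattern covers just the 6.12.* and 6.15.* series reported in this bug, which is an assumption about the affected range and not a confirmed list.

```shell
#!/bin/sh
# Return success if the version string belongs to a kernel series
# reported in this bug (6.12.* or 6.15.*). Assumption: other series
# are simply not reported here, which does not prove they are safe.
is_reported_kernel() {
    case "$1" in
        6.12.*|6.15.*) return 0 ;;
        *) return 1 ;;
    esac
}

# List installed kernel-core versions and warn about reported ones.
if command -v rpm >/dev/null 2>&1; then
    rpm -q --qf '%{VERSION}\n' kernel-core 2>/dev/null | while read -r v; do
        if is_reported_kernel "$v"; then
            echo "installed kernel $v is in a reported series;" \
                 "consider: systemctl disable cachefilesd"
        fi
    done
fi
```

The script only prints a hint and never stops the service itself, since stopping cachefilesd on an affected kernel is exactly what was seen to hang the machine.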

Comment 10 Francesco Simula 2025-06-26 18:49:56 UTC
(In reply to Bert DeKnuydt from comment #9)
> Actually, it's even worse: after a reboot into 6.15.3, even before any NFS
> is actually mounted, a 'systemctl stop cachefilesd' 
> can already hang the whole machine. So you need to 'systemctl disable' it
> before anyone boots into the fresh kernel.

Identical behaviour here - really not fun when different users that left the lab in the evening and authorized the automatic package update at reboot come all up to you howling that their machine has frozen...
 
> As we heavily use NFS-caching, the recurrence of this is really a pain. 
> But it seems a little-used feature outside of academia...

At this point, I'm considering simply removing cachefilesd and being done with it - the perpetual risk of hosing the whole lab for an ordinary upgrade-and-reboot cycle (which has already occurred several times) can't be reasonably justified without hard performance numbers in favour of keeping it enabled or a glaring difference in responsiveness, which I really don't see...

Comment 11 Bert DeKnuydt 2025-06-28 09:36:57 UTC
FYI: 6.15.4, with quite some NFS fixes, still suffers.

@Francesco:  As for performance of cachefilesd: we measured no increased responsiveness on the NFS client (in fact the opposite: a bit more latency), but ... a lot less traffic to the NFS server. And that makes it worthwhile for us. When it works.

