Bug 2331382

Summary: Kernel 6.12.* causes NFS/mount hangs when using cachefilesd (NFS caching)
Product: Fedora
Component: kernel
Version: 41
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: unspecified
Reporter: Bert DeKnuydt <deknuydt>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: acaringi, adscvr, airlied, alciregi, Bert.Deknuydt, boroske, bskeggs, francesco.simula, hdegoede, hpa, idonaldson0, jack, josef, kernel-maint, linville, masami256, mchehab, pgnd, ptalbert, steved, suraj.ghimire7
Last Closed: 2025-12-16 18:02:37 UTC

Description Bert DeKnuydt 2024-12-10 14:52:33 UTC
1. Please describe the problem:

Using 6.12.* as an NFS 4.2 client with cachefilesd enabled causes
mount/unmount problems. Switching off cachefilesd makes the
problems go away.

2. What is the Version-Release number of the kernel:

6.12.4 (but same on .1 and .3)

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This worked without problems on all 6.11.* kernels.
The first problems appeared with 6.12.1.

(However: there were previously similar problems in Fedora 40 with
much older kernels)

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

It's not 100% certain, but simply logging in with an NFS 4.2 home directory
triggers a hang with a half-mounted filesystem.
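
A minimal sketch of the client setup in question (server and export names here are hypothetical; the real home directories are presumably automounted at login):

$ sudo systemctl enable --now cachefilesd
$ sudo mount -t nfs4 -o vers=4.2,fsc nfs.example.com:/export/home /mnt/home
$ ls -lR /mnt/home    # accessing the mount shortly afterwards hangs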

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not tried with Rawhide yet (as 6.12 is already ahead of the current release)

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

There is nothing logged about this, neither in dmesg nor in the journal.
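
When nothing reaches the logs, a blocked-task dump can still show where things are stuck; a generic technique, not something taken from this report:

$ echo w | sudo tee /proc/sysrq-trigger    # dump uninterruptible (D-state) tasks to the kernel log
$ sudo dmesg | tail -n 60                  # stack traces show where they are blocked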


Reproducible: Always

Comment 1 Bert DeKnuydt 2024-12-19 09:23:37 UTC
Some extra info:

* Kernel 6.12.5-200.fc41 has the exact same problem

* Also aarch64 has the exact same problem

Comment 2 Jack Snodgrass 2025-01-16 17:18:01 UTC
I have the same issue.  I am on: 
Linux 6.12.9-200.fc41.x86_64
nfs-utils-2.8.1-4.rc2.fc41.x86_64

I start up cachefilesd

I have (from /proc/fs/nfsfs/volumes):
NV SERVER   PORT DEV          FSID                              FSC
v4 0a0c0e01  801 0:90         5c95aeb110ab56f0:0                yes
v4 0a0c0e1e  801 0:76         e228d38d2b7a0f8c:0                yes

I have files (newly created AFTER cachefilesd is started) in  /var/cache/fscache
find  /var/cache/fscache/ -type f  | wc
     58      58    9135

but very shortly after trying to access anything on my NFS share, the accessing process hangs.

dmesg and journalctl don't show any errors. 
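
(For anyone hitting this: a quick way to list processes stuck in uninterruptible sleep, a sketch using standard ps fields:)

$ ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'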

I do have some 'stuck' processes that seem to have started up at the time of the filesystem hang:

root        2177     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2178     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2182     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2183     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2186     678  0 11:21 ?        00:00:00 systemd-nsresourcework: waiting...
root        2198     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...
root        2199     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...
root        2200     679  0 11:22 ?        00:00:00 systemd-userwork: waiting...

I am not 100% certain that they are related... but I don't recall seeing those before... and the system is definitely 'waiting' on something, and the time is about the time I had the issue... so...

This is the FIRST time I've ever looked at cachefilesd so I don't have any idea if it worked on a different kernel version.

Comment 3 Jack Snodgrass 2025-01-16 18:21:10 UTC
I downloaded and installed 

dnf install kernel-modules-core-6.11.4-301.fc41.x86_64  \
kernel-core-6.11.4-301.fc41.x86_64 \
kernel-modules-6.11.4-301.fc41.x86_64 \
kernel-tools-libs-6.11.4-301.fc41.x86_64 \
kernel-tools-6.11.4-301.fc41.x86_64 \
kernel-modules-extra-6.11.4-301.fc41.x86_64 \
kernel-6.11.4-301.fc41.x86_64

using koji download-build
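
(That is, something like the following; exact NVRs as listed above:)

$ koji download-build --arch=x86_64 kernel-6.11.4-301.fc41
$ sudo dnf install ./kernel*6.11.4-301.fc41.x86_64.rpm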

and booted up with the older kernel-6.11.4-301.fc41.x86_64 kernel.
uname reports: Linux 6.11.4-301.fc41.x86_64

Now I have 1500+ files in my /var/cache/fscache dir (vs. ~50 when it started and hung) and my NFS mounts work and do not hang, so I can say that the older 6.11.4-301 kernel works with cachefilesd.
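
(A quick way to watch the cache fill, for anyone repeating this check:)

$ watch -n 5 'find /var/cache/fscache -type f | wc -l'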

I checked again and I still have: 
root        4223     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4224     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4225     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4226     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4227     671  0 12:16 ?        00:00:00 systemd-userwork: waiting...
root        4228     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4229     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
root        4230     670  0 12:16 ?        00:00:00 systemd-nsresourcework: waiting...
but the NFS mounts seem to be working, so I don't think those are related to the issue with the newer 6.12 kernel.

Comment 4 Ian Donaldson 2025-01-17 01:08:31 UTC
I'm seeing the same on 6.12.9-200.fc41.x86_64 with cachefilesd enabled.

Comment 5 Ian Donaldson 2025-01-17 01:10:38 UTC
(rolled back to 6.11.11-300.fc41.x86_64 which works fine)

Comment 6 Bert DeKnuydt 2025-01-20 10:25:38 UTC
It seems kernel 6.12.10 solved the problem.  At least, I'm no longer able to trigger the problem.

Comment 7 Bert DeKnuydt 2025-01-23 08:54:11 UTC
Correction: it happens a lot less frequently.

Comment 8 Francesco Simula 2025-06-26 13:30:19 UTC
The very same problem reappeared instantly and repeatably on kernel 6.15 on Fedora 42.

Comment 9 Bert DeKnuydt 2025-06-26 14:22:08 UTC
Seconded, with all 6.15.{1..3} affected.

Actually, it's even worse: after a reboot into 6.15.3, even before any NFS is actually mounted, a 'systemctl stop cachefilesd' 
can already hang the whole machine. So you need to 'systemctl disable' it before anyone boots into the fresh kernel.
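
A sketch of that workaround, for clarity; note the plain 'disable', since stopping the service is itself what hangs on 6.15.x:

$ sudo systemctl disable cachefilesd    # do NOT use 'stop' or '--now' here
$ sudo reboot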

As we heavily use NFS caching, the recurrence of this is really a pain.  But it seems to be a little-used feature outside of academia...

Comment 10 Francesco Simula 2025-06-26 18:49:56 UTC
(In reply to Bert DeKnuydt from comment #9)
> Actually, it's even worse: after a reboot into 6.15.3, even before any NFS
> is actually mounted, a 'systemctl stop cachefilesd' 
> can already hang the whole machine. So you need to 'systemctl disable' it
> before anyone boots into the fresh kernel.

Identical behaviour here - really not fun when the users who left the lab in the evening and authorized the automatic package update at reboot all come up to you howling that their machines have frozen...
 
> As we heavily use NFS caching, the recurrence of this is really a pain.
> But it seems to be a little-used feature outside of academia...

At this point, I'm considering simply removing cachefilesd and being done with it - the perpetual risk of hosing the whole lab in an ordinary upgrade-and-reboot cycle (which has already happened several times) can't reasonably be justified without hard performance numbers in favour of keeping it enabled, or a glaring difference in responsiveness, which I really don't see...

Comment 11 Bert DeKnuydt 2025-06-28 09:36:57 UTC
FYI: 6.15.4, despite quite a few NFS fixes, still suffers.

@Francesco:  As for the performance of cachefilesd: we measured no increased responsiveness on the NFS client (in fact, the opposite: a bit more latency), but ... a lot less traffic to the NFS server. And that makes it worthwhile for us. When it works.
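
(If anyone wants numbers: client-side NFS operation counts and FS-Cache activity can be compared with caching on and off; a sketch using standard tools:)

$ nfsstat -c                                   # per-operation NFS client counters
$ grep -E 'Reads|IO' /proc/fs/fscache/stats    # FS-Cache read/write activity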

Comment 12 pgnd 2025-07-09 18:10:54 UTC
unlike the NFS-triggered hangs described above, I see a consistent failure of `cachefilesd` at start
i'm not sure whether it's the same issue, or not ...

posting here, even though NFS doesn't appear to be required in this case, since it

	involves cachefilesd + recent Fedora kernel
	indicates kernel-level breakage in FS-Cache / cachefiles backend

with,

$ distro
	Name: Fedora Linux 42 (Adams)
	Version: 42
	Codename:

$ uname -rm
	6.15.4-200.fc42.x86_64 x86_64

$ lsmod | grep cachefiles
	cachefiles            204800  0
	netfs                 602112  1 cachefiles

$ grep dir /etc/cachefilesd.conf
	dir /fscache

$ mount | grep /fscache
	tmpfs on /fscache type tmpfs (rw,relatime,size=8388608k,mode=755,inode64)

$ systemctl start cachefilesd.service
	Job for cachefilesd.service failed because the control process exited with error code.
	See "systemctl status cachefilesd.service" and "journalctl -xeu cachefilesd.service" for details.


$ journalctl -f

	Jul 09 13:57:59 svr systemd[1]: Starting cachefilesd.service - CacheFiles daemon (loc)...
	Jul 09 13:57:59 svr kernel: CacheFiles: Failed to register: -95
	Jul 09 13:57:59 svr systemd[1]: cachefilesd.service: Control process exited, code=exited, status=1/FAILURE
	Jul 09 13:57:59 svr systemd[1]: cachefilesd.service: Failed with result 'exit-code'.
	Jul 09 13:57:59 svr systemd[1]: Failed to start cachefilesd.service - CacheFiles daemon (loc).
	Jul 09 13:58:00 svr systemd[1]: cachefilesd.service: Scheduled restart job, restart counter is at 1.

$ test -d /proc/fs/cachefiles && echo OK || echo Missing
	Missing

$ zgrep CACHEFILES /proc/config.gz
	CONFIG_CACHEFILES=m
	# CONFIG_CACHEFILES_DEBUG is not set
	# CONFIG_CACHEFILES_ERROR_INJECTION is not set
	CONFIG_CACHEFILES_ONDEMAND=y

iiuc, the error (`-95`, `EOPNOTSUPP`) here indicates a kernel-side failure of the cachefiles backend registration request
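
(decoding -95, just to confirm the mapping; a generic one-liner:)

$ python3 -c 'import os; print(os.strerror(95))'
	Operation not supported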

occurs here even with no NFS mounts and a valid/writable cache dir -> `/fscache` on tmpfs

which suggests a possible regression in kernel-side cachefiles support in 6.15.x.

taken together with the reports above, the broken backend appears to affect both initialization and NFS interaction, depending on system usage.

i know i didn't have this problem previously.  i haven't bisected, or even tested earlier kernel versions yet.

Comment 13 pgnd 2025-07-09 19:12:11 UTC
an alternative is to switch to in-kernel FS-Cache v2, using its non-fixed/dynamic memory cache

given

$ grep -i fscache /boot/config-6.15.4-200.fc42.x86_64
	CONFIG_FSCACHE=y
	CONFIG_FSCACHE_STATS=y
	CONFIG_NFS_FSCACHE=y
	CONFIG_CEPH_FSCACHE=y
	CONFIG_CIFS_FSCACHE=y
	CONFIG_AFS_FSCACHE=y
	CONFIG_9P_FSCACHE=y

with `cachefiles` removed

$ rpm -qa | grep -i cachefiles
	(empty)

and `fsc` usage enabled

$ grep fsc /etc/auto.nfs4
	TEST            -fstype=nfs4,vers=4.2,_netdev,...,fsc,...         machine.example.com:/
	                                                  ^^^

when accessing `TEST`, e.g.,

$ cat /proc/fs/fscache/stats | grep -E "Cookies|Acquire|Reads|IO"
	Reads  : DR=0 RA=25015 RF=0 RS=0 WB=0 WBZ=0
	Cookies: n=20572 v=1 vcol=0 voom=0
	Acquire: n=28995 ok=28995 oom=0
	IO     : rd=0 wr=0 mis=0

$ ps ax | grep -E "fscache|cachefiles"

$ cat /proc/fs/fscache/caches
	CACHE    REF   VOLS  OBJS  ACCES S NAME
	======== ===== ===== ===== ===== = ===============
	00000024     1     1     0     0 - -

$

Comment 14 pgnd 2025-07-09 19:34:20 UTC
looks like FS-Cache v2 can support a disk-backed cache, but currently
_only_ (?) via the legacy cachefiles backend.

which is what looks broken here.

so, for now, FSCv2 is RAM-only unless/until cachefiles is functional again.
which is limiting if:

  the working data set is larger than RAM,
  the cache needs to persist across reboots, or
  other memory pressure is significant

Comment 15 Bert DeKnuydt 2025-07-28 11:28:19 UTC
It seems the problems with NFS/fsc/cachefilesd are solved in 6.15.8-200.fc42.x86_64. 
Ran it over the weekend on a dozen machines without problems.

There is indeed a relevant change in the changelog:

--
Zizhi Wo (1):
      cachefiles: Fix the incorrect return value in __cachefiles_write()
--
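
(For anyone verifying their own install: the fix can be spotted in the installed package's changelog; a sketch, with the package NVR as reported above:)

$ rpm -q --changelog kernel-core-6.15.8-200.fc42.x86_64 | grep -i cachefiles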

Now what about 6.16 :)

Comment 16 Adam Williamson 2025-12-02 01:55:08 UTC
This message is a reminder that Fedora Linux 41 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 41 on 2025-12-15.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '41'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 41 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 17 Samyak Jain (RedHat) 2025-12-16 18:02:37 UTC
Fedora Linux 41 entered end-of-life (EOL) status on 2025-12-15.

Fedora Linux 41 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.