Bug 2055362 - NFS mount hang on kernel 5.16.10
Summary: NFS mount hang on kernel 5.16.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 35
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 2061035
Depends On:
Blocks:
 
Reported: 2022-02-16 18:47 UTC by Gurenko Alex
Modified: 2022-06-29 11:17 UTC (History)
CC: 34 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-02 16:26:36 UTC
Type: Bug
Embargoed:


Attachments
dmesg output (103.95 KB, text/plain), 2022-02-16 18:47 UTC, Gurenko Alex

Description Gurenko Alex 2022-02-16 18:47:56 UTC
Created attachment 1861556 [details]
dmesg output

1. Please describe the problem: Cannot mount NFS


2. What is the Version-Release number of the kernel: 5.16.10-200.fc35.x86_64


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

  Worked perfectly on 5.16.9-200.fc35.x86_64


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

   sudo mount -v -t nfs4 <nas_ip>:<share> /tmp/nas


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

   As it's a regression from the previous kernel, I didn't try Rawhide.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

  No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

  Attached is the log from one of the machines. Verified on 2 different machines: a laptop (Wi-Fi) and a desktop (LAN).

I have systemd mount and automount units configured, and they time out at 30 seconds as configured. When I try to mount manually, verbose output shows:

$ sudo mount -v -t nfs4 <NAS IP>:/<SHARE> ./NAS
mount.nfs: timeout set for Wed Feb 16 19:40:02 2022
mount.nfs: trying text-based options 'vers=4.2,addr=<NAS IP>,clientaddr=<LAPTOP IP>'

However, the command still hangs past the specified timeout.

No entries in journalctl.
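
For reference, a minimal sketch of such a mount/automount pair (unit names follow the /tmp/nas mount point; the address and share below are placeholders, not the exact units in use):

  # /etc/systemd/system/tmp-nas.mount
  [Unit]
  Description=NAS NFS share

  [Mount]
  What=<nas_ip>:/<share>
  Where=/tmp/nas
  Type=nfs4
  TimeoutSec=30

  # /etc/systemd/system/tmp-nas.automount
  [Unit]
  Description=Automount for the NAS NFS share

  [Automount]
  Where=/tmp/nas
  TimeoutIdleSec=600

  [Install]
  WantedBy=multi-user.target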

Comment 1 Justin M. Forbes 2022-02-16 19:00:57 UTC
Mind trying the f36 kernel 5.17-rc4? https://koji.fedoraproject.org/koji/buildinfo?buildID=1917827
Curious if the backport of a patch is missing something it depends on, or if the patch is just bad.

Comment 2 Gurenko Alex 2022-02-16 19:42:56 UTC
(In reply to Justin M. Forbes from comment #1)
> Mind trying the f36 kernel 5.17-rc4?
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1917827
> Curious if the backport of a patch is missing something it depends on, or if
> the patch is just bad.

I've tried 5.17.0-0.rc4.96.fc36.x86_64 kernel and same problem reproduces there as well.

Comment 3 Gurenko Alex 2022-02-21 09:54:50 UTC
@jforbes After several reports of success with NFS and a Fedora server as the target, I tried a few VMs this morning, and indeed, if a RHEL 8 machine is acting as the server, it works with this kernel. However, my original NFS shares are hosted on a QNAP NAS, and no matter what I do, I cannot mount those shares with this kernel, while it works perfectly fine with 5.16.9. I'm not sure how to proceed or what to try.

Comment 4 wayne6001 2022-02-22 19:10:27 UTC
Is your QNAP NAS firmware up to date? (In the QNAP admin settings: Control Panel > System > Firmware Update > Live Update.)

Probably won't help, but just in case: long ago, in a QNAP community support forum post, someone worked around NFSv4 mounting timeouts by turning NFSv4 support off in the QNAP NAS settings, saving, and then turning it back on again.

Comment 5 Gurenko Alex 2022-02-22 20:09:01 UTC
(In reply to wayne6001 from comment #4)
> Is your QNAP NAS firmware up to date? (In the QNAP admin settings: Control
> Panel > System > Firmware Update > Live Update.)
> 
> Probably won't help, but just in case: long ago, in a QNAP community support
> forum post, someone worked around NFSv4 mounting timeouts by turning NFSv4
> support off in the QNAP NAS settings, saving, and then turning it back on
> again.

Thank you for the suggestion; however, I got the latest firmware about 5 days ago and it didn't help. Without changing anything on the QNAP side, the previous kernel works perfectly fine, so... In any case, I've opened a ticket with QNAP to see what their take on it is, but I'm still waiting for a reply.

Comment 6 Gurenko Alex 2022-02-23 12:22:23 UTC
I've tried the 5.17.0-0.rc5.102.fc36.x86_64 kernel today and still hit the same problem. Still battling the first line of support at QNAP at this point; no actionable information from their side. Also waiting for the 5.16.11 build to try it out.

Comment 7 Gurenko Alex 2022-02-23 19:50:59 UTC
No luck with 5.16.11-200.fc35.x86_64...

Comment 8 teppot 2022-02-26 06:57:07 UTC
Happens for me too, also on a QNAP NAS with the latest firmware.

According to [1], upstream 5.16.10 had many NFS related changes.

[1] https://lore.kernel.org/all/20220214092510.221474733@linuxfoundation.org/

Comment 9 Mattia Verga 2022-02-26 13:58:07 UTC
I've been hit by the same problem today, also on a QNAP NAS.
The strange thing is that until yesterday I was using kernel 5.16.10-200.fc35.x86_64 and everything was working. This morning I upgraded my system, and after a reboot into kernel 5.16.11-200.fc35.x86_64 I found the NFSv4 mount command stuck as described above.
I tried to reboot into 5.16.10-200, but it doesn't work anymore. I have to boot into 5.16.9-200 or set the mount option 'vers=3'.

Comment 10 Mattia Verga 2022-02-26 14:04:27 UTC
Forget what I said before; until yesterday I was on 5.16.9-200, so the problem started with 5.16.10 indeed.

Comment 11 Gurenko Alex 2022-02-26 16:20:08 UTC
(In reply to Teppo Turtiainen from comment #8)
> Happens for me too, also on a QNAP NAS with the latest firmware.
> 
> According to [1], upstream 5.16.10 had many NFS related changes.
> 
> [1] https://lore.kernel.org/all/20220214092510.221474733@linuxfoundation.org/

Can you please also open tickets with Qnap to raise their awareness of the problem?

Comment 12 Christoph Roeper 2022-02-26 16:35:24 UTC
Same here: it started with kernels 5.16.10 and 5.16.11 on Fedora; kernel 5.16.9 works fine. QNAP firmware is 4.3.3.1864 from 2021-12-12. Not sure if it is related to the QNAP; nothing changed on the QNAP when it stopped working with the new Fedora kernel update a few days ago, unless the QNAP NFSv4 implementation is simply outdated.

Comment 13 Frank Crawford 2022-02-27 00:32:59 UTC
While I am not having this problem (I have a different NFS issue), there is a very similar ticket for Arch Linux, and they believe they have found the issue.

The issue revolves around https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/nfs/nfs4proc.c#n3859, and in particular the fact that QNAP does not handle NFS_CAP_FS_LOCATIONS properly.

Their fix is:

  /* Restrict FS_LOCATIONS to NFS v4.2+ to work around Qnap knfsd-3.4.6 bug */
  if (res.attr_bitmask[0] & FATTR4_WORD0_FS_LOCATIONS && minorversion >= 2)
          server->caps |= NFS_CAP_FS_LOCATIONS;

in nfs4proc.c
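
For context, the unpatched code at that spot enables the capability purely from the server's advertised attribute bitmask; roughly (a sketch inferred from the patch above, not a verbatim copy of the kernel source):

  if (res.attr_bitmask[0] & FATTR4_WORD0_FS_LOCATIONS)
          server->caps |= NFS_CAP_FS_LOCATIONS;

With NFS_CAP_FS_LOCATIONS set, the client goes on to query the fs_locations attribute during mount (trunking discovery), and that is the request these servers apparently mishandle.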

Comment 14 Frank Crawford 2022-02-27 00:34:00 UTC
I forgot, full details are here https://bbs.archlinux.org/viewtopic.php?id=274259

Comment 15 Chris Egolf 2022-03-06 20:08:39 UTC
I started having random issues with "kernel: nfs: server x.x.x.x not responding, still trying" using an NFS share on a Synology DS1813+/DSM 7.0 NAS.  The issues started after upgrading from kernel-5.16.9-200 to kernel-5.16.11-200, and now 5.16.12-200 as well.  The NAS is running an NFSv4 server and the share is mounted with type nfs4.

More background: the NFS share is mounted as libvirt NetFS storage for KVM guests.  When the error occurs, the VMs become unresponsive and libvirt/virsh commands hang as well.  Sometimes it will recover ("kernel: nfs: server x.x.x.x OK"), but that might take minutes or hours.

Comment 16 Bradi 2022-03-09 04:18:00 UTC
Is this getting any traction from Fedora / Red Hat / kernel maintainers or QNAP?  How can I help raise visibility of this issue?

Comment 17 Gurenko Alex 2022-03-09 08:56:30 UTC
(In reply to Bradi from comment #16)
> Is this getting any traction from Fedora / Red Hat / kernel maintainers or
> QNAP?  How can I help raise visibility of this issue?

No response from anyone. I keep pinging QNAP every other day, but zero response.

Comment 18 Bradi 2022-03-09 09:04:29 UTC
Can we pinpoint the actual package at fault with QNAP / Synology?  Is there something we can say definitively is a problem?

I don't know enough to be able to log a ticket with QNAP at this point, but I have logged my own bug report with Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=2061035

Comment 19 Gurenko Alex 2022-03-09 09:16:03 UTC
(In reply to Bradi from comment #18)
> Can we pinpoint the actual package at fault with QNAP / Synology?  Is there
> something we can say definitively is a problem?
> 
> I don't know enough to be able to log a ticket with QNAP at this point, but
> I have logged my own bug report with Fedora:
> https://bugzilla.redhat.com/show_bug.cgi?id=2061035

If the forum reference is to be believed, the culprit is the knfsd-3.4.6 package on the QNAP side, which mishandles the FS_LOCATIONS capability that the client code in nfs4proc.c probes for.

As for the ticket with QNAP: as a user, it's not *really* your task to pinpoint the problem. You open a ticket to raise awareness at QNAP and to increase the priority; the more users report the problem, the better the chance that someone will look at it.

Comment 20 Bradi 2022-03-09 09:58:03 UTC
Logged ticket "Q-202203-76200" with QNAP.

Comment 21 Gurenko Alex 2022-03-09 10:10:40 UTC
I've opened a ticket asking them to look at my other ticket, and I got the following reply:

"For clarification - you are constantly updating the kernal of your Fedora 35 machines. Is that correct? 
I will try to escalte the case to the development department, but since this rather seems to be linux support (this issue seems kernel version of Fedora related to me) then QNAP it is unclear if the development department will acceppt the case"

Comment 22 Bradi 2022-03-09 10:15:06 UTC
Did you link this bug to your QNAP ticket?  May be worthwhile, since it does seem to indicate the problem is not kernel related and more package related (on the QNAP side).

Comment 23 Gurenko Alex 2022-03-09 10:18:18 UTC
(In reply to Bradi from comment #22)
> Did you link this bug to your QNAP ticket?  May be worthwhile, since it does
> seem to indicate the problem is not kernel related and more package related
> (on the QNAP side).

Yes, I've linked this BZ and the Arch forum thread, and added a snippet of the code and the package name.

Comment 24 Bradi 2022-03-09 10:39:49 UTC
Thanks, Alex, I appreciate all you have done.  Hopefully QNAP (and other vendors) will take notice.

Comment 25 Mattia Verga 2022-03-09 16:44:07 UTC
I have posted a thread on the QNAP community forum with a link to this bug:
https://forum.qnap.com/viewtopic.php?f=35&t=165385

Another user commented that there's also a Debian bug report for this, but I cannot find it. There are also several other users opening threads about not being able to mount NFS shares anymore.

You should make clear to QNAP support that this is not a Fedora kernel bug, but rather an incompatibility between the QNAP NFS implementation and Linux kernels 5.16.10 and above.

Meanwhile, can we backport the patch from Arch Linux in comment #13?

Comment 26 Bradi 2022-03-10 23:54:26 UTC
*** Bug 2061035 has been marked as a duplicate of this bug. ***

Comment 27 Bradi 2022-03-15 08:25:47 UTC
After contacting QNAP, I am not confident a fix or workaround will be forthcoming from them.  My device is EOL, and I can't blame QNAP for not spending resources investigating my problem.  I'm currently running NFSv3 but will look at building or upgrading to a TrueNAS server in the near future.

Comment 28 Per Osbeck 2022-03-16 08:32:36 UTC
I'm a bit confused as to why this hasn't already been reverted upstream, given that it's a clear regression from 5.16.9.

Having everyone downgrade their NFS version or replace their QNAP NFS hardware doesn't seem like a way forward.

Now that this kernel is also getting out there in stable FCOS (https://github.com/coreos/fedora-coreos-tracker/issues/1121), it looks (for my part, at least) like additional workarounds are needed in k8s clusters: either pin an older, Dirty Pipe-vulnerable kernel or change everything to use a lower NFS version (and volumeMounts doesn't support mount options directly, only via a PV).

Comment 29 Bradi 2022-03-16 09:56:07 UTC
I personally don't have the skills or experience to definitively state the root cause of this issue, but I have done enough troubleshooting to believe this is a problem with my QNAP device and not anything to do with Fedora or the Linux kernel.  While I am not able to mount any shares from my QNAP device using NFSv4, I am able to mount NFSv4 shares on the latest kernel from one Fedora 35 workstation to another (both running kernel 5.16.14).  I also don't believe this should be resolved by device-specific workarounds in any package, as that is more likely to discourage vendors from fixing issues in their own products and to create an unmaintainable list of vendor-specific patches.

If this issue turns out to be caused by a kernel or Fedora package, then I would be surprised if it weren't already fixed in an upgrade.

Comment 30 Gurenko Alex 2022-03-16 10:07:38 UTC
(In reply to Bradi from comment #29)
> I personally don't have the skills or experience to definitively state the
> root cause of this issue, but I have done enough troubleshooting to believe
> this is a problem with my QNAP device and not anything to do with Fedora or
> the Linux kernel.  While I am not able to mount any shares from my QNAP
> device using NFSv4, I am able to mount NFSv4 shares on the latest kernel
> from one Fedora 35 workstation to another (both running kernel 5.16.14).  I
> also don't believe this should be resolved by device-specific workarounds
> in any package, as that is more likely to discourage vendors from fixing
> issues in their own products and to create an unmaintainable list of
> vendor-specific patches.
> 
> If this issue turns out to be caused by a kernel or Fedora package, then I
> would be surprised if it weren't already fixed in an upgrade.

The issue was introduced upstream, and the first report was on Feb 21st. So, as stated in the upstream conversation, this should have been either a) fixed or b) reverted, but neither happened. It does not really matter who's at fault here; this is a regression that was introduced upstream and made it into the stable Fedora kernel. That shouldn't happen, IMO; otherwise I don't understand the purpose of the testing repo, since the issue was reported and acknowledged there but still made it to stable. Now we're 4 weeks into broken functionality, and the problem is spreading as those kernels are made available in other flavors and use cases.

*Maybe* the problem can be resolved on the QNAP side, but in my limited experience with their product and support (I only purchased it this Christmas), their approach is a complete disappointment, along with a total lack of communication (again, I have not heard anything from them in a week).

On the other hand, based on https://bugzilla.redhat.com/show_bug.cgi?id=2055362#c15, Synology is also affected, and I would assume the same goes for any other third party that relies on the same library.

Unfortunately, I don't have a clue how to raise the visibility of this matter (apart from raising the priority and adding the regression tag, which I did today).

Comment 31 Bradi 2022-03-16 10:13:45 UTC
(In reply to Gurenko Alex from comment #30)
> The issue was introduced upstream, and the first report was on Feb 21st.
> So, as stated in the upstream conversation, this should have been either a)
> fixed or b) reverted, but neither happened. It does not really matter who's
> at fault here; this is a regression that was introduced upstream and made
> it into the stable Fedora kernel. That shouldn't happen, IMO; otherwise I
> don't understand the purpose of the testing repo, since the issue was
> reported and acknowledged there but still made it to stable. Now we're 4
> weeks into broken functionality, and the problem is spreading as those
> kernels are made available in other flavors and use cases.
> 
> *Maybe* the problem can be resolved on the QNAP side, but in my limited
> experience with their product and support (I only purchased it this
> Christmas), their approach is a complete disappointment, along with a total
> lack of communication (again, I have not heard anything from them in a
> week).
> 
> On the other hand, based on
> https://bugzilla.redhat.com/show_bug.cgi?id=2055362#c15, Synology is also
> affected, and I would assume the same goes for any other third party that
> relies on the same library.
> 
> Unfortunately, I don't have a clue how to raise the visibility of this
> matter (apart from raising the priority and adding the regression tag,
> which I did today).

I agree that this is not a situation that either Fedora or the kernel maintainers would want to see.  However, I would prefer not to stop (or regress) kernel development because some OEMs are using outdated services.  A better, more workable solution would be to provide accurate information regarding the problem, workarounds, and suggestions on how best to proceed.  Hence Bugzilla :)

Comment 32 Bradi 2022-03-16 10:46:06 UTC
I do understand the frustration of purchasing an expensive piece of equipment that shortly after purchase is unable to do what you bought it for.  This is not a problem for Fedora or the kernel maintainers to be concerned about, though.  If you have a supported OEM device that exhibits these issues, I would wholeheartedly encourage you to log support tickets and keep communicating with your vendor about it; if they resolve the issue in newer firmware, it shouldn't be hard for the community to take advantage of that.  Or you can find a vendor that is more proactive :(

Comment 33 Justin M. Forbes 2022-03-16 12:50:22 UTC
I also understand the frustration. I have been following this upstream. There are often cases where upstream makes a change which is clearly correct, and it breaks something that was working fine before.  Sometimes the answer is a quirk to work around buggy hardware, but that can not be done when we do not control the hardware. There are options that have been passed around upstream, including disabling FS_LOCATIONS on servers advertising anything below NFS 4.2, though that seems to have gotten little traction upstream. The other option would require passing a different argument notrunkdiscovery, to mount, which would only work on updated client kernels which support it.  This is also messy. As of yet, there has been no consensus or clear solution, 5.17 is still not going to work with these buggy server implementations. QNAP is shipping newer kernels with OS updates for supported products, but at some point their products fall to EOL and get no further updates (and QNAP systems showing this issue are likely vulnerable to a few CVEs that were fixed in their later releases as well).  So where does that leave us?

I am willing to patch 5.16.15 to disable FS_LOCATIONS on servers advertising anything below NFS v4.2, and that patch will remain through the 5.16 life cycle.  Roughly 3-4 weeks from now, the 5.17 rebases will start. I will not carry that patch for 5.17 kernels, but that does give us a little bit more time to get an upstream solution.  I can make no guarantees that upstream will have a solution, or that the solution they do agree to will not require additional mount options be passed. Of course there is also the option to mount the QNAP shares as NFS v3.
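
For concreteness, the two workarounds would look roughly like this (server and path are placeholders, and the notrunkdiscovery option only exists on client kernels that carry it):

  # pin a protocol version the server handles
  sudo mount -t nfs -o vers=3 <nas_ip>:/<share> /mnt/nas

  # or, on a client kernel supporting the proposed option,
  # skip trunking discovery at mount time
  sudo mount -t nfs4 -o notrunkdiscovery <nas_ip>:/<share> /mnt/nas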

Given that the 5.16.15 update will be a temporary solution, I will not mark it as fixing this bug.  The bug needs to remain open until upstream comes to some agreement.

Comment 34 Gurenko Alex 2022-03-16 13:40:29 UTC
(In reply to Justin M. Forbes from comment #33)
> I also understand the frustration. I have been following this upstream.
> There are often cases where upstream makes a change which is clearly
> correct, and it breaks something that was working fine before.  Sometimes
> the answer is a quirk to work around buggy hardware, but that can not be
> done when we do not control the hardware. There are options that have been
> passed around upstream, including disabling FS_LOCATIONS on servers
> advertising anything below NFS 4.2, though that seems to have gotten little
> traction upstream. The other option would require passing a different
> argument notrunkdiscovery, to mount, which would only work on updated client
> kernels which support it.  This is also messy. As of yet, there has been no
> consensus or clear solution, 5.17 is still not going to work with these
> buggy server implementations. QNAP is shipping newer kernels with OS updates
> for supported products, but at some point their products fall to EOL and get
> no further updates (and QNAP systems showing this issue are likely
> vulnerable to a few CVEs that were fixed in their later releases as well). 
> So where does that leave us?
> 
> I am willing to patch 5.16.15 to disable FS_LOCATIONS on servers advertising
> anything below NFS v4.2, and that patch will remain through the 5.16 life
> cycle.  Roughly 3-4 weeks from now, the 5.17 rebases will start. I will not
> carry that patch for 5.17 kernels, but that does give us a little bit more
> time to get an upstream solution.  I can make no guarantees that upstream
> will have a solution, or that the solution they do agree to will not require
> additional mount options be passed. Of course there is also the option to
> mount the QNAP shares as NFS v3.
> 
> Given that the 5.16.15 update will be a temporary solution, I will not mark
> it as fixing this bug.  The bug needs to remain open until upstream comes to
> some agreement.

This sounds like a good solution to me. Are you bringing this conversation upstream? On a slightly different but not unrelated topic: for some reason I can force NFSv3 with a manual mount, but my systemd mount unit ignores the nfsvers=3 and vers=3 option flags, so it does not work for me that way. Do you happen to have any insight/advice on that?

Comment 35 Bradi 2022-03-16 13:48:45 UTC
Thank you Justin.  You can pass vers=3 in /etc/fstab if that helps your situation.
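
If the share is defined in a native .mount unit rather than fstab, the version option belongs in Options= under [Mount]; a sketch with placeholder paths:

  [Mount]
  What=<nas_ip>:/<share>
  Where=/mnt/nas
  Type=nfs
  Options=vers=3

One thing worth checking when vers=3 appears to be ignored: Type=nfs4 forces protocol version 4, so a v3 mount needs Type=nfs.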

Comment 36 Dusty Mabe 2022-03-16 13:50:06 UTC
In the FCOS issue we have reports that reverting to `vers=4.0` works, so IIUC users don't need to go all the way back to NFS v3.

https://github.com/coreos/fedora-coreos-tracker/issues/1121#issuecomment-1066859336

Comment 37 Benjamin Coddington 2022-03-16 13:54:24 UTC
Are folks really stuck on kernel v3.4.6 on QNAP?  I pulled down the GPL sources from QNAP (https://sourceforge.net/projects/qosgpl/), and the kernel versions within are v4.2, v4.14, and v5.10.

I haven't located what server version is actually needed, but I'm surprised we're trying to fix the client for a server released 10 years ago.  Can someone with shell access to a QNAP verify kernel versions, and that they're really using the most up-to-date version?

Comment 38 Benjamin Coddington 2022-03-16 13:56:41 UTC
https://www.qnap.com/en-us/release-notes/kernel

Comment 39 Christoph Roeper 2022-03-16 13:57:31 UTC
Yes, I can confirm it works with the NFS 4.0 setting in fstab with my QNAP, with only NFSv4 enabled. No need to go back to NFS v2/v3.
Example:
qnap:/Public /home/user/Mounts/Qnap/Public nfs4 _netdev,rw,noauto,vers=4.0,user 0 0

Comment 40 Gurenko Alex 2022-03-16 14:01:11 UTC
Yes, I can also confirm that forcing vers=4.0 works, both manually and through systemd mount/automount units! Thank you, that was a very important piece of information.

(In reply to Benjamin Coddington from comment #37)
> Are folks really stuck on kernel v3.4.6 on QNAP?  I pulled down the GPL
> sources from QNAP (https://sourceforge.net/projects/qosgpl/), and the kernel
> versions within are v4.2, v4.14, and v5.10.
> 
> I haven't located what server version is actually needed, but I'm surprised
> we're trying to fix the client for a server released 10 years ago.  Can
> someone with shell access to a QNAP verify kernel versions, and that they're
> really using the most up-to-date version?

My QNAP is currently running QuTS Hero (with ZFS support), and the latest version is based on "Linux 5.10.60-qnap #1 SMP Tue Feb 15 08:43:23 CST 2022 x86_64 GNU/Linux". 3.4.6 was referring to the knfsd package version (if I'm not mistaken), not the kernel.

Comment 41 Benjamin Coddington 2022-03-16 14:16:41 UTC
(In reply to Gurenko Alex from comment #40)
> My QNAP is currently running QuTS Hero (with ZFS support), and the latest
> version is based on "Linux 5.10.60-qnap #1 SMP Tue Feb 15 08:43:23 CST 2022
> x86_64 GNU/Linux". 3.4.6 was referring to the knfsd package version (if I'm
> not mistaken), not the kernel.

Very interesting -- knfsd is part of the upstream Linux kernel, though it does have userspace components.  I wonder what's in this knfsd-3.4.6 package, and how QNAP is distinguishing their knfsd service version from their kernel version?

Comment 42 Bradi 2022-03-16 14:20:59 UTC
It appears QNAP has released a newer version (v4.3.6.1965 build 20220302) for my device on 2022-03-04.  This is interesting, since their support recommended I upgrade to the latest version (v4.3.6.1907 build 20220107) on 2022-03-10 as part of their troubleshooting.

- Below are the kernel versions for NAS models that are supported by QTS 4.3.6: (1) Kernel 3.10.20: TS-128, TS-228 (2) Kernel 3.2.26: TS-x31, TS-x31U (3) Kernel 4.2.8: all other models supported by QTS 4.3.6
- Due to the limitations of future kernel updates, QTS 4.3.6 is the final available QTS update for the following NAS models: TS-EC1679U-SAS-RP, TS-EC1679U-RP, TS-1679U-RP, TS-EC1279U-SAS-RP, TS-EC1279U-RP, TS-1279U-RP, TS-1079 Pro, TS-EC879U-RP, TS-879U-RP, TS-1270U-RP, TS-870U-RP, TS-470U-RP, TS-470U-SP, TS-879 Pro, TVS-870, TS-870 Pro, TS-870, TVS-670, TS-670 Pro, TS-670, TVS-470, TS-470 Pro, and TS-470.

I can also confirm that using "vers=4.0" succeeds on my setup.

Comment 43 Justin M. Forbes 2022-03-16 14:40:40 UTC
(In reply to Gurenko Alex from comment #34)
> 
> This sounds like a good solution to me. Are you bringing this conversation
> upstream?

This conversation is upstream, and has been going for a bit. I posted the most recent relevant thread in the FCOS issue. There is no resolution yet.

Comment 44 Per Osbeck 2022-03-16 16:12:35 UTC
QTS 5.0 lists kernel 5.10.x as one of its major improvements.

I'm running QTS 5.0, but my device is limited to the 4.2.8 kernel. (https://www.qnap.com/en-us/release-notes/kernel)

Does someone have a QTS 5.0 device with the 5.10 kernel to test?

Comment 45 Justin M. Forbes 2022-03-16 20:58:29 UTC
The 5.16.15 kernel has finished building. Anyone with an impacted QNAP want to give it a try (make sure to remove the vers argument from the fstab) and let me know if it is working?

https://koji.fedoraproject.org/koji/buildinfo?buildID=1934820

Again, there is no guarantee that this will continue once we move to 5.17, but we should have some indication of what QNAP and/or upstream intend to do.   The vers=4.0 argument should always work though.

Comment 46 Gurenko Alex 2022-03-16 21:20:07 UTC
(In reply to Justin M. Forbes from comment #45)
> The 5.16.15 kernel has finished building. Anyone with an impacted QNAP want
> to give it a try (make sure to remove the vers argument from the fstab) and
> let me know if it is working?
> 
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1934820
> 
> Again, there is no guarantee that this will continue once we move to 5.17,
> but we should have some indication of what QNAP and/or upstream intend to
> do.   The vers=4.0 argument should always work though.

Thank you very much. I've just tried 5.16.15, with quite interesting results: dropping the 4.0 pinning still does not work, unfortunately; however, 4.1 pinning now works with this .15 kernel. I've re-tested with 5.16.14 and confirm that 4.1 didn't work previously with .14.
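
(For anyone comparing results: nfsstat -m prints the options each live mount actually negotiated, including the effective vers=. Example invocation, with the output shape abbreviated:

  $ nfsstat -m
  /tmp/nas from <nas_ip>:/<share>
   Flags: rw,vers=4.1,rsize=524288,wsize=524288,hard,proto=tcp,...

The vers= field shows what the client and server really agreed on, regardless of what was requested.)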

Comment 47 Chris Egolf 2022-03-16 21:47:11 UTC
(In reply to Justin M. Forbes from comment #45)
> The 5.16.15 kernel has finished building. Anyone with an impacted QNAP want
> to give it a try (make sure to remove the vers argument from the fstab) and
> let me know if it is working?
> 
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1934820
> 
> Again, there is no guarantee that this will continue once we move to 5.17,
> but we should have some indication of what QNAP and/or upstream intend to
> do.   The vers=4.0 argument should always work though.

I will try this kernel as well.  My issue is slightly different from the QNAP-related issues mentioned by others.  As mentioned in Comment 15, I'm using a Synology running an NFSv4 server and exporting a share containing VM images used by several KVM/libvirt servers.  The KVM/libvirt servers (all Fedora 35) are the NFS clients in this case.  The shared storage is mounted as nfs4 (vers=4.0) via libvirtd.

Mounting the share works immediately.  The problem is that after running for several hours, the KVM host will start logging errors such as:

kernel: nfs: server x.x.x.x not responding, still trying 

I have not been able to determine what initially causes the error, but once it happens, the VMs become unresponsive.  Attempts to shut them down with virsh commands or the GUI all time out, and eventually the servers must be reset manually -- normal shutdowns eventually hang and time out.

Sometimes, however, if left alone, the KVM server might log the following after several hours:

kernel: nfs: server x.x.x.x OK

Unfortunately, this is not guaranteed, and it will eventually return to "not responding".  

This started after upgrading from kernel-5.16.9-200.fc35 (which still works) to any later kernel in the Fedora repos.  The only fix that worked consistently is to keep the KVM host on kernel-5.16.9-200.

Comment 48 Justin M. Forbes 2022-03-17 06:03:05 UTC
One more test if you would. It is not done building yet, but it is very late here and I need some sleep. Could you please try the 5.16.15-201 build at https://koji.fedoraproject.org/koji/taskinfo?taskID=84311635 when it finishes?  Again, without the forced vers in the mount options.

Comment 49 Per Osbeck 2022-03-17 07:15:40 UTC
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1934820
Does not work with QNAP.

> https://koji.fedoraproject.org/koji/taskinfo?taskID=84311635
Works like a charm without any special mount options.


I have also opened a new support ticket with QNAP.

Comment 50 Gurenko Alex 2022-03-17 11:22:19 UTC
(In reply to Justin M. Forbes from comment #48)
> One more test if you would. It is not done building yet, but it is very late
> here and I need some sleep. Could you please try the 5.16.15-201 build at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=84311635 when it
> finishes?  Again, without the forced vers in the mount options.

I can also confirm that -201 is working without the need for version pinning.

Comment 51 Justin M. Forbes 2022-03-17 11:48:56 UTC
Thanks, filing the -201 kernel in Bodhi.

Comment 52 Chris Egolf 2022-03-17 21:40:29 UTC
The -200 kernel has been stable for over 24 hours.  Anything 5.16.10+ usually caused the NFS kernel error within 4-12 hours.

Comment 53 Graham Miller 2022-03-18 22:26:52 UTC
I can confirm that adding vers=4.0 to fstab resolved the issue in my environment.

Fedora 35 (kernel 5.16.14-200.fc35)

QNAP QTS 5.0.0.1932 (kernel-5.10.60)

Comment 54 Chris Egolf 2022-03-20 01:15:13 UTC
Spoke too soon.  Both the -200 and -201 kernels eventually caused the same kernel/NFS error.  If there's any other data or logs I can collect to help track this down, let me know.

Another data point I've noticed is that even though multiple KVM/libvirt guests are using the NFS share for disk image storage, when the error occurs, usually only one guest is affected.  The others are responsive and can be logged into and shut down either interactively or via the 'virsh shutdown' command.  The file systems for these VMs are still writable, and the images stored on the NFS share are updated, unlike on the system that becomes unresponsive.  I haven't been able to determine a pattern as to which system becomes unresponsive.

Comment 55 Gurenko Alex 2022-03-20 14:35:45 UTC
(In reply to Chris Egolf from comment #54)
> Spoke too soon.  Both the -200 and -201 kernels eventually caused the same
> kernel/NFS error.  If there's any other data or logs I can collect to help
> track this down, let me know.
> 
> Another data point I've noticed is that even though multiple KVM/libvirt
> guests are using the NFS share for disk image storage, when the error
> occurs, usually only one guest is affected.  The others are responsive and
> can be logged into and shut down either interactively or via the 'virsh
> shutdown' command.  The file systems for these VMs are still writable, and
> the images stored on the NFS share are updated, unlike on the system that
> becomes unresponsive.  I haven't been able to determine a pattern as to
> which system becomes unresponsive.

I think your issue is unrelated to the problem here and is worth opening as a separate BZ. While it may be caused by some patch set, you're not having a problem related to the mount itself, and let's not mix different issues; it makes them impossible to track.

Comment 56 jole.secondary 2022-03-21 03:20:33 UTC
I just wanted to chime in that kernel 5.16.15-101 fixes my issue of mounting my QNAP box, which broke with all kernel versions after 5.16.9-100.
My relevant mtab entry:

<QNAP>:/<mnt> /home/<user>/<mnt> nfs4 rw,nosuid,nodev,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none 0 0

Comment 57 Gurenko Alex 2022-04-01 23:46:22 UTC
For those who still use a supported NAS from QNAP: I got an update today to version h5.0.0.1986. While the fix is not explicitly mentioned, I've also just tried the 5.17.1-200.fc35.x86_64 kernel, and I can still access my NFS shares both without an explicit NFS version and with an explicit 4.2 version. I'm curious whether the revert of the patch was also migrated to the 5.17 branch, or whether QNAP silently fixed the problem.

Comment 58 Justin M. Forbes 2022-04-02 16:25:28 UTC
No, the patch was not reverted. A proper upstream fix made it into 5.18 and I backported it. Specifically:

Author: Olga Kornievskaia <kolga>
Date:   Wed Mar 16 18:24:26 2022 -0400

    NFSv4.1 provide mount option to toggle trunking discovery
    
    Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
    toggle whether or not the client will engage in actively discovery
    of trunking locations.
    
    v2 make notrunkdiscovery default
    
    Signed-off-by: Olga Kornievskaia <kolga>
    Fixes: 1976b2b31462 ("NFSv4.1 query for fs_location attr on a new file system")
    Signed-off-by: Trond Myklebust <trond.myklebust>
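
In practice this means kernels with the fix skip the mount-time fs_locations query by default (notrunkdiscovery is the default per v2 of the patch), and the old behavior becomes opt-in; for example, with placeholder server and path:

  # default on fixed kernels: no trunking discovery at mount time
  sudo mount -t nfs4 <nas_ip>:/<share> /mnt/nas

  # opt back in, for servers that handle fs_locations correctly
  sudo mount -t nfs4 -o trunkdiscovery <nas_ip>:/<share> /mnt/nas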

Comment 59 Justin M. Forbes 2022-04-02 16:26:36 UTC
With the revert in 5.16 kernels, and the proper fix in 5.17+, I am going to close this bug.

Comment 60 Gurenko Alex 2022-04-02 18:48:15 UTC
Hm, this is very weird. I've been setting up a laptop with a fresh installation of F36 Beta, and it _didn't_ work unless "vers=4.0" was used. Was the patch cherry-picked for the F36 kernel as well?

Comment 61 Justin M. Forbes 2022-04-04 17:36:53 UTC
Cherry picked, but it is not in the 5.17.1 kernel.  It will be in 5.17.2+

Comment 62 Gurenko Alex 2022-04-04 17:38:55 UTC
(In reply to Justin M. Forbes from comment #61)
> Cherry picked, but it is not in the 5.17.1 kernel.  It will be in 5.17.2+

Perfect, thanks a lot!

Comment 63 Graham Miller 2022-04-14 21:30:43 UTC
I recently tested kernel 5.16.19-200.fc35: I removed vers=4.0 from fstab and restricted NFS connections to v4 on the QNAP NAS (to make sure), and automount is working perfectly here. I have not tested older kernels.

Comment 64 Gurenko Alex 2022-06-29 11:17:36 UTC
For posterity: QNAP finally released an update, h5.0.0.2069 build 20220614 from 2022-06-23 (for my TS-473A; your update may vary), that includes the fix for the original problem. While things have been working for a long time thanks to the revert of the default behavior, I went ahead and re-tested the old kernel and can confirm that it now works as expected as well.

