Bug 750926

Summary: Netfs fails to unmount unreachable NFS filesystems
Product: [Fedora] Fedora Reporter: Orion Poplawski <orion>
Component: systemd Assignee: systemd-maint
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 18 CC: Colin.Simpson, david.halliwell, d.bz-redhat, fedora, harald, iarlyy, jcapik, johannbg, jonathan, lnykryn, marmalodak, mschmidt, msekleta, notting, plautrba, systemd-maint, vpavlin, zbyszek
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 735458 Environment:
Last Closed: 2013-09-13 01:38:55 UTC Type: ---
Bug Depends On: 735458    
Bug Blocks:    

Description Orion Poplawski 2011-11-02 20:38:18 UTC
+++ This bug was initially created as a clone of Bug #735458 +++

Description of problem:

The netfs service fails to unmount NFS filesystems when the NFS server is unreachable. This bug was uncovered when Bug 676851 was rolled out.
This causes all of our machines using NFS filesystems to hang on shut down.


How reproducible:

Consistently reproducible for RHEL 5.7 systems using NFS filesystems.
initscripts version: initscripts-8.45.38-2.el5.x86_64


Steps to Reproduce:

1.  mkdir /mnt/mountpoint
2.  mount nfsserver:/path /mnt/mountpoint
3.  /etc/init.d/network stop
4.  /etc/init.d/netfs stop

  
Actual results:

netfs will hang while attempting to unmount the NFS filesystem.


Expected results:

netfs should unmount the NFS filesystem.


Additional info:

When netfs unmounts NFS filesystems, it now takes the following steps:
1.  Try to unmount normally
2.  Use /sbin/fuser to try and send a kill signal to anyone trying to access this file
3.  Steps 1 and 2 are repeated 3 times if mounts remain
4.  Finally the filesystem is force lazy unmounted

The problem arises if the network is already down (making the NFS server unreachable):
- At step 1, the unmount fails
- At step 2 the /sbin/fuser kill command hangs, preventing the rest of the script from running.

This was not an issue until 5.7, as previously the filesystem was unconditionally force lazy unmounted at step 1.

The most direct solution would presumably be to only run the /sbin/fuser kill command if the unmount attempt returns filesystem busy, not for NFS server unreachable.
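
One way to keep a hung fuser from blocking the rest of the stop sequence is to bound every step with a timeout. The following is a minimal Python sketch of the four steps above, not the actual netfs script (which is shell). The names run_bounded and stop_netfs_mount are hypothetical, the 10-second limit is arbitrary, and "fuser -k -m" is assumed to be the kill command meant in step 2.

```python
import subprocess


def run_bounded(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort: a process in uninterruptible sleep (state D) will not
        # die until the NFS server answers, so do not wait for it again.
        proc.kill()
        return None


def stop_netfs_mount(mountpoint, run=run_bounded, retries=3, timeout_s=10):
    """Sketch of the netfs stop steps with every command bounded in time."""
    for _ in range(retries):
        # Step 1: try to unmount normally.
        if run(["umount", mountpoint], timeout_s) == 0:
            return "unmounted"
        # Step 2: ask fuser to kill users of the mount; a hang only costs
        # timeout_s here instead of blocking the whole script.
        run(["/sbin/fuser", "-k", "-m", mountpoint], timeout_s)
    # Step 4: fall back to a forced lazy unmount.
    rc = run(["umount", "-f", "-l", mountpoint], timeout_s)
    return "lazy-unmounted" if rc == 0 else "failed"
```

Popen.wait is used rather than subprocess.run because, as far as I know, run waits for the child again after killing it on timeout, which could itself block on a child stuck in state D. The timed-out child is deliberately left unreaped.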


I'm seeing this as well on Fedora 16: when the network cable is unplugged, /etc/NetworkManager/dispatcher.d/05-netfs is called and runs netfs stop.

initscripts-9.34-2.fc16.i686

Comment 1 Orion Poplawski 2011-11-03 19:46:14 UTC
The problem is that umount will return busy (32) first if someone is using the filesystem, so I don't think we can tell the difference between a reachable and an unreachable server.  While it would probably be nice to kill the processes using the filesystem, I think for NFS it's probably safer to just unmount.

psmisc 22.14 has a patch to prevent it from hanging - https://sourceforge.net/tracker/index.php?func=detail&aid=1963033&group_id=15273&atid=315273 has some more info.  I installed it on my F16 machine and it does appear to work, though it spawns a *huge* number of processes and a couple of stale ones get left.  Apparently this may be an option in 22.15.  There may be better ways to tackle this, but fixing fuser is probably the most correct option.

Comment 2 Fedora End Of Life 2012-08-07 16:26:27 UTC
This message is a notice that Fedora 15 is now at end of life. Fedora
has stopped maintaining and issuing updates for Fedora 15. It is
Fedora's policy to close all bug reports from releases that are no
longer maintained. At this time, all open bugs with a Fedora 'version'
of '15' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we were unable to fix it before Fedora 15 reached end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" (top right of this page) and open it against that
version of Fedora.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Fedora End Of Life 2013-01-16 22:47:29 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 4 Didier 2013-06-17 08:30:19 UTC
Problem is still present with psmisc-22.16-1.fc17.x86_64 :

# umount -a -f -l -t nfs
- hangs,
- eventually (after many minutes) fails with "umount.nfs: /auto/XXX: Stale NFS file handle",
- and the NFS share is not unmounted.

# df
- hangs

# dmesg
...
kernel: [1296121.709474] nfs: server XXX not responding, timed out
...
(repeated)

# ps aux | grep umount
root     13070  0.0  0.0 119896  1052 pts/15   S+   10:26   0:00 umount -a -f -l -t nfs
root     13071  0.0  0.0  26032  1168 pts/15   D+   10:26   0:00 /sbin/umount.nfs /auto/XXX -l -f



This is extremely annoying when e.g. moving laptops with mounted NFS shares between locations, without first unmounting.
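
Since df hangs in this state, one way to list the NFS mounts is to parse the text of /proc/mounts instead. To my knowledge, reading it does not contact the server. A minimal Python sketch follows. The function name nfs_mountpoints is hypothetical, and matching only the types "nfs" and "nfs4" is an assumption.

```python
import re


def nfs_mountpoints(mounts_text):
    """Return mountpoints of NFS filesystems from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        # Fields: device, mountpoint, fstype, options, dump, pass.
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            # Special characters in paths appear as octal escapes,
            # for example a space is written as \040.
            path = re.sub(r"\\([0-7]{3})",
                          lambda m: chr(int(m.group(1), 8)),
                          fields[1])
            points.append(path)
    return points
```

On a live system the input would come from reading /proc/mounts; the function takes plain text so it can be tried on a saved copy.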

Comment 5 Jaromír Cápík 2013-06-19 13:07:42 UTC
Even with the recently introduced non-blocking mode, changes still need to be made in initscripts. Once the non-blocking mode patch is accepted upstream, I'll change the component to initscripts.

Comment 6 Orion Poplawski 2013-06-26 21:54:02 UTC
The patch is in upstream, but it needs to be enabled with --enable-timeout-stat.  I don't quite understand what the --enable-timeout-stat=static version does though.  I also think we could really use a run-time option instead of compile time.

Comment 7 Jaromír Cápík 2013-06-27 16:10:46 UTC
Hi Orion.

Believe it or not, I started fighting this issue quite a long time ago, and it's a tricky one. The timeout doesn't solve anything. It's pretty unreliable and has a negative performance impact on the fuser tool. If the timeout interval is too short, you see unwanted timeouts when the system is busy; if it's too long, it takes ages for all the hung stat calls to time out. The root cause lies in the whole concept of rebooting / umounting / killing processes / NFS.
I don't understand why this bug was created as a clone of a RHEL5 bug. Fedora has systemd and the solution differs from RHEL5. I don't even know if fuser still plays any role here. I'm changing the component to systemd, because there's nothing I can do with psmisc in order to prevent this from happening.

Regards,
Jaromir.

Comment 8 Michal Schmidt 2013-06-27 16:24:37 UTC
The netfs service belonged to initscripts. Also it's gone in F18.

Whether it's a bug at a lower level (umount.nfs, kernel?) I don't know. Nothing about NFS can surprise me anymore.

Comment 9 Jaromír Cápík 2013-06-27 16:47:43 UTC
Hi Michal.

AFAIK there are more causes of NFS-related reboot hangs. As I experienced some of them in F18, it seems to me that systemd needs to be fine-tuned too. But of course, it's up to you. Bill must be happy the issue finally got back to him again. This hot potato game lasts too long and users are suffering.

Regards,
Jaromir.

Comment 10 Michal Schmidt 2013-06-27 17:07:09 UTC
(In reply to Jaromír Cápík from comment #9)
> AFAIK there are more causes of NFS-related reboot hangs. As I
> experienced some of them in F18, it seems to me that systemd needs to be
> fine-tuned too.

That's quite possible. However, this BZ does not contain any information implicating systemd.

> Bill must be happy the issue finally got back to him again.
> This hot potato game lasts too long and users are suffering.

Sorry, I did not notice this BZ was already assigned to initscripts before.

The problem is that it's not clear what this BZ is meant to be about. It surely does not seem to be about "netfs" anymore.

Comment 11 Jaromír Cápík 2013-06-27 17:45:47 UTC
It's a long-standing conceptual deadlock. I'll tell you more about the scenario I was fighting with in the past. It's possible it has evolved/changed since then. So, here's the story.

When a hard-mounted NFS share becomes unreachable, all blocking calls like stat(2) made by the fuser tool wait forever until the server becomes reachable again (and that often doesn't happen). So the fuser tool waits for the NFS server forever. But the fuser output is used to generate the list of processes that need to be killed because they use the mountpoint, so these processes cannot be killed even if you let the fuser tool time out, and consequently you cannot umount the mountpoint. The processes which block the NFS mountpoint simply cannot be reliably detected and killed. Funny, isn't it?

In the past Bill Nottingham tried to lazy umount such shares, but that apparently resulted in hangs too. Maybe he'll tell you more, because I don't know much about his findings. Does it mean it's caused by umount or the kernel? I don't know. I only know it often happens on servers with higher uptime. Unfortunately we were unable to reproduce the issue in the lab. Somebody would have to sacrifice himself and do a very deep analysis and document what exactly happens. We're lacking a reliable reproduction scenario. And I have no idea if this story is still valid for Fedora 17 and later. Maybe not.
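
The blocking stat(2) described here can at least be kept out of the calling process by running it in a child and giving up after a deadline. The following is a minimal Python sketch of that idea, not what fuser or psmisc actually does. The names STAT_PROBE and mount_responds are hypothetical and the 5-second default is arbitrary. As comment 7 points out, any such timeout is only a heuristic: too short gives false alarms on a busy system, too long makes each check slow.

```python
import subprocess
import sys

# Default probe: a child Python process performs the stat(2) call, so a hang
# on an unreachable server blocks the child rather than the caller.
STAT_PROBE = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])"]


def mount_responds(path, timeout_s=5.0, probe=STAT_PROBE):
    """Return True if the probe finished within timeout_s, False otherwise.

    A quick failure (for example "No such file") still counts as a response:
    the question is only whether the filesystem answered at all.
    """
    child = subprocess.Popen(probe + [path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    try:
        child.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible sleep (state D)
        # ignores SIGKILL until the server answers, so do not wait for it.
        child.kill()
        return False
```

A caller could use mount_responds("/mnt/mountpoint") to decide whether to skip the fuser step and go straight to a lazy unmount. It does not solve the underlying problem that the processes holding the mount cannot be reliably identified.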

Comment 12 Orion Poplawski 2013-06-27 18:00:18 UTC
See also bug 851665

Comment 13 Fedora End Of Life 2013-07-04 06:21:37 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Fedora End Of Life 2013-08-01 17:53:28 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 15 Lukáš Nykrýn 2013-08-02 15:38:41 UTC
Netfs is no longer in initscripts.

Comment 16 Lennart Poettering 2013-09-13 01:38:55 UTC
So, netfs doesn't exist in initscripts anymore, and systemd doesn't use fuser or anything. I don't see how this would apply to systemd. Closing.

Comment 17 John Schmitt 2013-09-13 08:56:21 UTC
Wouldn't it be better to assign this bug to a more appropriate component than close it?

Comment 18 Harald Hoyer 2013-09-13 10:58:04 UTC
(In reply to John Schmitt from comment #17)
> Wouldn't it be better to assign this bug to a more appropriate component
> than close it?

The component is systemd, but it should not be a bug anymore.

Test:

- mount nfs
- pull the network cable
- shutdown

systemd will maybe stall for 3 minutes, but should continue

Comment 19 Harald Hoyer 2013-09-13 11:27:07 UTC
(In reply to Harald Hoyer from comment #18)
> (In reply to John Schmitt from comment #17)
> > Wouldn't it be better to assign this bug to a more appropriate component
> > than close it?
> 
> The component is systemd, but it should not be a bug anymore.
> 
> Test:
> 
> - mount nfs
> - pull the network cable
> - shutdown
> 
> systemd will maybe stall for 3 minutes, but should continue

ok, wrong.. it hangs in the kernel:

https://bugzilla.redhat.com/show_bug.cgi?id=1007607
https://bugzilla.redhat.com/show_bug.cgi?id=1007745

Comment 20 Lennart Poettering 2013-09-13 20:21:45 UTC
(In reply to John Schmitt from comment #17)
> Wouldn't it be better to assign this bug to a more appropriate component
> than close it?

Well, I don't know any more appropriate one. But certainly not systemd nor initscripts... Knock yourself out and reopen it and assign it to some component...