Bug 1312002 - hangs on reboot or shutdown when nfs file system mounted
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
x86_64 Linux
Priority: urgent  Severity: high
Assigned To: systemd-maint
QA Contact: Frantisek Sumsal
Keywords: ZStream
Duplicates: 1420676 1462962
Depends On:
Blocks: 1298243 1420851 1466365 1469559 1473733 1522983 1519245
Reported: 2016-02-25 08:43 EST by Rikard
Modified: 2018-03-06 06:22 EST
Fixed In Version: systemd-219-46.el7
Clones: 1519245
Type: Bug

Description Rikard 2016-02-25 08:43:09 EST
Description of problem:
NFS mount points cannot be unmounted and cause a hang on reboot.

Version-Release number of selected component (if applicable):

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
All latest errata to date applied

How reproducible:
mount NFS in /etc/fstab
nfs01:/nfs/home /nfs/home      nfs vers=3,hard,intr,rsize=8192,wsize=8192,tcp 0 0

Steps to Reproduce:
1. Boot the server and let it mount the NFS volume
2. Have a login shell or some other process running on the NFS mount point
3. Reboot the server with "init 6"

Actual results:
The server never reboots. Output from /var/log/messages:

Feb 25 10:45:00 rhel7 systemd: Stopping Login Service...
Feb 25 10:45:00 rhel7 systemd: Stopped Login Service.
Feb 25 10:45:00 rhel7 systemd: Stopped Permit User Sessions.
Feb 25 10:45:00 rhel7 systemd: Stopped target Remote File Systems.
Feb 25 10:45:00 rhel7 systemd: Stopping Remote File Systems.
Feb 25 10:45:00 rhel7 systemd: Unmounting /nfs/home...
Feb 25 10:45:03 rhel7 systemd: Received SIGRTMIN+20 from PID 7757 (plymouthd).
Feb 25 10:46:30 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:48:00 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:49:31 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:51:01 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
etc .....

Expected results:
server to reboot

Additional info:
Found a workaround in Bug 1214466, from Eric Bakkum: add "dbus.service" to the "After=syslog.target" line in "/usr/lib/systemd/system/wpa_supplicant.service"


[Unit]
Description=WPA Supplicant daemon
After=syslog.target dbus.service

[Service]
ExecStart=/usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf $INTERFACES $DRIVERS $OTHER_ARGS


With this config loaded the server reboots
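
Editing the file under /usr/lib directly is likely to be lost on the next package update. A minimal sketch of the same ordering change as a systemd drop-in follows. It assumes the usual /etc/systemd/system/<unit>.d/ drop-in location; the helper name write_after_dropin and the file name it generates are hypothetical, not part of the original workaround.

```shell
#!/bin/sh
# Sketch only: write_after_dropin is a hypothetical helper, not an existing tool.
# $1 = unit to modify, $2 = unit it should be ordered after,
# $3 = systemd configuration root (defaults to /etc/systemd/system).
write_after_dropin() {
    unit=$1
    after=$2
    root=${3:-/etc/systemd/system}
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # Dependency settings such as After= are additive, so the drop-in
    # extends the packaged unit instead of replacing its After= line.
    printf '[Unit]\nAfter=%s\n' "$after" > "$dir/10-after-$after.conf"
}

# Usage (as root), followed by a reload so systemd picks up the change:
#   write_after_dropin wpa_supplicant.service dbus.service
#   systemctl daemon-reload
```

The optional third argument exists only so the helper can be tried against a scratch directory before touching the real configuration.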
Comment 2 Susant Sahani 2016-06-09 01:44:35 EDT
Seems like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1334573
Comment 3 Chris Cheney 2016-07-25 12:26:38 EDT
BZ#1334573 appears to be about autofs; this bug also hangs when autofs is not in use.
Comment 4 Lukáš Nykrýn 2016-09-06 05:59:43 EDT
Can you please try that with the RHEL 7.3 beta?
Comment 17 Renaud Métrich 2017-06-02 02:57:55 EDT
*** Bug 1420676 has been marked as a duplicate of this bug. ***
Comment 18 Renaud Métrich 2017-06-02 03:09:18 EDT
The exact same shutdown/reboot issue happens when network connectivity to the NFS server is lost and a shutdown/reboot is issued.
Loss of network connectivity to the NFS server means, for example:
- unavailability of NFS server
- no route to NFS server (broken switch)

To easily reproduce the issue with NFS hard mounts:
1. stop the NFS server
2. trigger the reboot or shutdown of the NFS client

System will never shut down or reboot, printing "nfsshareX.mount unmounting timed out. Stopping." messages.
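
To see which mounts on a client are exposed to this, the hard NFS mounts can be listed from the mount table. This is a minimal sketch, assuming the standard /proc/mounts layout (device, mount point, type, options); the function name list_hard_nfs_mounts is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the mount point of every NFS mount that uses the "hard" option.
# $1 = mount table to read (defaults to /proc/mounts).
list_hard_nfs_mounts() {
    awk '
        # Field 3 is the filesystem type, field 4 the comma-separated options.
        ($3 == "nfs" || $3 == "nfs4") {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "hard") { print $2; break }
        }
    ' "${1:-/proc/mounts}"
}

# Usage:
#   list_hard_nfs_mounts
```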

Occasionally, after 1800 seconds (30 minutes), the system will just die with the following messages (systemd was in debug mode):

[ 1831.602805] systemd[1]: Timed out starting Reboot.
[ 1831.607376] systemd[1]: Job reboot.target/start failed with result 'timeout'.
[ 1831.617183] systemd[1]: Forcibly rebooting as result of failure.
[ !!  ] [ 1831.625209] systemd[1]: Shutting down.
Forcibly rebooting as result of failure.
[ 1831.649519] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 1831.675961] systemd-journald[465]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 1841.666618] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 1841.680423] systemd-shutdown[1]: Sending SIGKILL to PID xxx (umount).
[ 1841.801172] systemd-shutdown[1]: Unmounting file systems.
[ 1841.805417] systemd-shutdown[1]: Unmounting /run/user/0.

This happens because there seems to be no timeout for the NFS umount in the kernel.
This is certainly not a systemd issue but an NFS issue.

IMO, it's a major issue, since networks are unstable by nature.
Comment 19 masanari iida 2017-06-06 04:50:47 EDT
NetworkManager-1.4.0-13.el7 and later include the following fix:
2016-11-02 Thomas Haller <thaller@redhat.com> -  1.4.0-13
- core: don't unmanage devices on shutdown (rh#1371126)
(which is related to https://bugzilla.redhat.com/show_bug.cgi?id=1311988 )

My question to Renaud Métrich is, which version of NetworkManager did you use
when you encountered the symptom?
Comment 20 Renaud Métrich 2017-06-06 05:19:40 EDT
This is NetworkManager-1.4.0-20.el7_3.x86_64
Comment 21 Renaud Métrich 2017-06-06 08:58:17 EDT
I can reproduce the exact same behaviour with the following setup:

1. NFS mount /share1
2. A dummy service "test-simple.service" that enters the /share1 directory to keep /share1 busy (it runs sleep/ls)


with /root/share1 being:

# cat /root/share1 
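#!/bin/sh
# Wait until the NFS share is mounted (its directory listing is non-empty).
while [ -z "$(/bin/ls /share1)" ]; do
	sleep 1
done
echo "/share1 is now mounted ..."
# Keep the working directory on the share so that the mount stays busy.
cd /share1
while :; do
	sleep 10
done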

NOTE: the service has deliberately not been configured with a dependency on "/share1", in order to reproduce the issue.

3. dummy service is made not to stop correctly

  ExecStop=/bin/sleep 300

4. issue reboot from terminal

What we see:

NFS unmount fails:
[   60.822617] umount[10737]: umount.nfs: /share1: device is busy
[  OK  ] Failed unmounting /share1.

Interface is taken down:
[   61.411976] network[10751]: Shutting down interface eth0:  Device 'eth0' successfully disconnected.

test-simple is getting killed after 1m30:
[**    ] A stop job is running for Test 'simple' service (1min 29s / 1min 30s)[  150.937597] systemd[1]: test-simple.service stopping timed out. Terminating.
[  OK  [  150.963373] systemd[1]: Stopped Test 'simple' service.

Systemd finishes the shutdown, but never reboots:
[  OK  ] Reached target Shutdown.
[  151.282370] systemd[1]: Reached target Final Step.
[  151.282378] systemd[1]: Starting Final Step.
[  151.283008] systemd[1]: Starting Reboot...
[  151.283271] systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[  151.294818] systemd[1]: Shutting down.
[  151.309229] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  150.900422] lvmetad[492]: Failed to accept connection errno 11.
[  151.326900] systemd-journald[464]: Received SIGTERM from PID 1 (systemd-shutdow).
[  151.361927] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  151.370835] systemd-shutdown[1]: Unmounting file systems.

The "NFS server not responding" message appears later:
[  331.656199] nfs: server not responding, still trying

The shutdown/reboot never happens; eventually the VM starts to eat all the CPU (after the 1800s systemd timeout???).

Of course, when an After=/Requires= dependency on share1.mount is added to the dummy service, the issue stops happening, since the NFS umount is then performed only after the dummy service has been killed.
However, in real life such a bad thing may happen: the system administrator may not always be able to know which mounts an application depends on.
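
One way to narrow down which processes are pinning a mount is to look at their working directories. This is a minimal sketch, assuming the Linux /proc/<pid>/cwd symlinks; the function name procs_with_cwd_under is hypothetical, and the sketch only covers working directories, not open files.

```shell
#!/bin/sh
# Sketch only: print "PID CWD" for every process whose working directory
# is the given mount point or lies below it.
# $1 = mount point, $2 = proc root (defaults to /proc).
procs_with_cwd_under() {
    mnt=${1%/}
    proc=${2:-/proc}
    for link in "$proc"/[0-9]*/cwd; do
        # Processes may exit, or be unreadable without root; skip those.
        cwd=$(readlink "$link" 2>/dev/null) || continue
        case $cwd in
            "$mnt"|"$mnt"/*)
                pid=${link%/cwd}
                pid=${pid##*/}
                printf '%s %s\n' "$pid" "$cwd"
                ;;
        esac
    done
}

# Usage (as root, to see all processes):
#   procs_with_cwd_under /share1
```

Reading the symlink does not access the share itself, so this should not block when the NFS server is unreachable.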

So, clearly, some hardening must be performed: for example, do not let the stop of the "Remote File Systems" target complete on shutdown if some remote umounts have failed, and/or retry remote umounts until they succeed, use the "force" flag in umounts, etc.
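
The "retry, then force" idea can be sketched as a small shell helper. This only illustrates the suggestion above and is not what the eventual systemd fix does; the function name retry is hypothetical, and a forced umount may still fail on a busy hard mount.

```shell
#!/bin/sh
# Sketch only: run a command up to ATTEMPTS times, sleeping DELAY seconds
# between attempts. Returns 0 as soon as the command succeeds, 1 otherwise.
# $1 = attempts, $2 = delay in seconds, remaining arguments = command.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Do not sleep after the final failed attempt.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage: try a normal umount a few times, then fall back to a forced one.
#   retry 5 2 umount /share1 || umount -f /share1
```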
Comment 22 masanari iida 2017-06-07 07:03:23 EDT
Hello Renaud
Thanks for the detailed steps to reproduce the issue.

Hello Rikard, how about your original symptom?
Have you tested with latest systemd and/or NetworkManager?
Comment 26 Bertrand 2017-08-08 03:18:01 EDT
Additionally, the issue does not seem to be limited to NFS version 3. A test done with NFS version 4 on RHEL7.3 shows similar behavior, with the NetworkManager service disabled at startup on the Client Machine.

Step 1: 
Set up an NFSv4 share on the RHEL7.3 Server Machine (a VM, in my case running RHEL7.3 / 3.10.0-514.21.1.el7.x86_64)

Step 2: 
Mount NFSv4 Share on RHEL7.3 Client Machine (A VM, in my case running RHEL7.3 / 3.10.0-514.26.1.el7.x86_64)
e.g: on /mounted type nfs4 (rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=,local_lock=none,addr=

Step 3: 
Stop NFSv4 Service on RHEL7.3 Server Machine

Step 4: 
Attempting to reboot the RHEL7.3 Client Machine results in the machine hanging.

a) If the mounted filesystem is being accessed, then when the reboot command is issued, the network stack stays up and running. I can ssh to the Client Machine.
b) If the mounted filesystem is not being accessed, then when the reboot command is issued, the network stack is no longer available. I can no longer ssh to the Client Machine.
Comment 30 Anatoly Pugachev 2017-09-07 04:50:33 EDT
Can someone change version from 7.2 to 7.3 in bug header? Thanks.
Comment 31 Christian Horn 2017-09-07 05:09:10 EDT
(In reply to Anatoly Pugachev from comment #30)
> Can someone change version from 7.2 to 7.3 in bug header? Thanks.
Why do you think this should be done?
The original report was for 7.2 and the issue seems to still exist, so if you have hit it on 7.3, that is in line with this bz.
Comment 33 Christian Horn 2017-09-11 21:58:40 EDT
This is starting to cause us pain; we see that in the number of attached cases, plus customers behind at least 2 partners.

systemd and nfs are involved here.  As I understand it, getting lazy umount into rhel7 (as brought up in bz1408791) would be a solution, and it worked for us before in rhel6, but the bz got CLOSED CANTFIX.

We need to come up with something: things as common as a temporary network issue or a restarted NFS server prevent clients from properly rebooting.

Is a different forum required to discuss this issue which affects systemd and NFS areas, i.e. a mail thread involving 2 upstream lists?
Comment 34 J. Bruce Fields 2017-09-12 09:54:25 EDT
I've added a comment to https://bugzilla.redhat.com/show_bug.cgi?id=1408791 in an attempt to better understand the motivation for the lazy umount change.

It's not always going to be possible to shut down NFS clients cleanly in the face of network issues and/or bad application behavior--perhaps user expectations are unrealistic in some of these cases?  But it also sounds like there are cases where we could do better.
Comment 38 Kyle Walker 2017-10-05 12:53:47 EDT
*** Bug 1462962 has been marked as a duplicate of this bug. ***
Comment 40 Lukáš Nykrýn 2017-10-06 07:50:11 EDT
fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/155 -> post
Comment 49 Thomas Jones 2018-02-01 00:25:15 EST
Is there an expected release-date for this fix?

Running into what appears to be a similar or the same issue when hosting a slow-to-shutdown Java application on an EFS share (AWS's NFS 4.1 interface to S3). I can work around the issue by adding a TimeoutStopSec=10 to my application's service definition. However, this means I'm SIGKILLing my application rather than allowing it to gracefully shut down.
Comment 50 Filip Krska 2018-02-09 07:09:58 EST
Hello Thomas, 7.4.Z Stream Bug 1519245 for this issue was fixed and released with Errata https://access.redhat.com/errata/RHBA-2018:0155 

Best Regards, Filip