Bug 1312002

Summary: hangs on reboot or shutdown when nfs file system mounted
Product: Red Hat Enterprise Linux 7
Reporter: Rikard <rikard.oberg>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: high
Priority: urgent
Version: 7.2
CC: ajmitchell, arawat, bart.demeester, bcodding, bfields, brault, brubisch, bsingh, ccheney, chorn, davor, ddouwsma, dwysocha, fadamo, fkrska, fsumsal, jaeshin, jbyrd, kfujii, kwalker, leif, luf, masanari.iida, matorola, mlinden, mpoole, mr.xkurt, msivakum, myllynen, onatalen, redhat, rkothiya, rmetrich, rsawhill, smayhew, ssahani, stefan, sweettea, swhiteho, systemd-maint-list, systemd-maint, tnagata, yoguma, yoyang
Target Milestone: rc
Keywords: ZStream
Hardware: x86_64
OS: Linux
Fixed In Version: systemd-219-46.el7
Doc Type: If docs needed, set a value
Clones: 1519245 (view as bug list)
Last Closed: 2018-04-10 11:16:36 UTC
Type: Bug
Bug Blocks: 1298243, 1420851, 1466365, 1469559, 1473733, 1519245, 1522983

Description Rikard 2016-02-25 13:43:09 UTC
Description of problem:
NFS mount points cannot be unmounted and cause a hang on reboot.

Version-Release number of selected component (if applicable):

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
All latest errata to date applied
wpa_supplicant-2.0-17.el7_1.x86_64
systemd-219-19.el7_2.4.x86_64
systemd-libs-219-19.el7_2.4.x86_64
systemd-python-219-19.el7_2.4.x86_64
systemd-sysv-219-19.el7_2.4.x86_64

How reproducible:
mount NFS in /etc/fstab
nfs01:/nfs/home /nfs/home      nfs vers=3,hard,intr,rsize=8192,wsize=8192,tcp 0 0

Steps to Reproduce:
1. Boot the server and let it mount the nfs volume
2. Have a login shell or some process on the nfs mount point
3. reboot the server "init 6"

Actual results:
The server never reboots. Output from /var/log/messages:

Feb 25 10:45:00 rhel7 systemd: Stopping Login Service...
Feb 25 10:45:00 rhel7 systemd: Stopped Login Service.
Feb 25 10:45:00 rhel7 systemd: Stopped Permit User Sessions.
Feb 25 10:45:00 rhel7 systemd: Stopped target Remote File Systems.
Feb 25 10:45:00 rhel7 systemd: Stopping Remote File Systems.
Feb 25 10:45:00 rhel7 systemd: Unmounting /nfs/home...
Feb 25 10:45:03 rhel7 systemd: Received SIGRTMIN+20 from PID 7757 (plymouthd).
Feb 25 10:46:30 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:48:00 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:49:31 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
Feb 25 10:51:01 rhel7 systemd: nfs-home.mount unmounting timed out. Stopping.
etc .....

Expected results:
server to reboot

Additional info:
Found a workaround in Bug 1214466, from Eric Bakkum: add "dbus.service" to the "After=syslog.target" line in "/usr/lib/systemd/system/wpa_supplicant.service"

Example:

[Unit]
Description=WPA Supplicant daemon
Before=network.target
After=syslog.target dbus.service

[Service]
Type=dbus
BusName=fi.w1.wpa_supplicant1
EnvironmentFile=-/etc/sysconfig/wpa_supplicant
ExecStart=/usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wpa_supplicant.conf $INTERFACES $DRIVERS $OTHER_ARGS

[Install]
WantedBy=multi-user.target

With this config loaded the server reboots
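
Rather than editing the packaged unit file under /usr/lib (which a package update would overwrite), the same workaround can also be applied as a drop-in override. This is the standard systemd drop-in mechanism, sketched here with the dependency from the workaround above:

```
# /etc/systemd/system/wpa_supplicant.service.d/override.conf
[Unit]
# Order wpa_supplicant after dbus, so it is stopped before dbus
# on shutdown (the workaround from Bug 1214466).
After=syslog.target dbus.service
```

Run "systemctl daemon-reload" after creating the drop-in so systemd picks it up.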

Comment 2 Susant Sahani 2016-06-09 05:44:35 UTC
Seems like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1334573

Comment 3 Chris Cheney 2016-07-25 16:26:38 UTC
BZ#1334573 appears to be about autofs; this bug also hangs without autofs in use.

Comment 4 Lukáš Nykrýn 2016-09-06 09:59:43 UTC
Can you please try that with the rhel-7.3 beta?

Comment 17 Renaud Métrich 2017-06-02 06:57:55 UTC
*** Bug 1420676 has been marked as a duplicate of this bug. ***

Comment 18 Renaud Métrich 2017-06-02 07:09:18 UTC
The exact same shutdown/reboot issue happens when network connectivity to the NFS server is lost and a shutdown/reboot is issued.
Loss of network connectivity to the NFS server means, for example:
- the NFS server is unavailable
- there is no route to the NFS server (broken switch)

To easily reproduce the issue with NFS hard mounts:
1. stop the NFS server
2. trigger the reboot or shutdown of the NFS client

System will never shut down or reboot, printing "nfsshareX.mount unmounting timed out. Stopping." messages.
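
One mitigation sketch (an assumption on my part, not something tested in this report): hard mounts retry forever by design, so a mount configured with "soft" and bounded timeo/retrans values lets the umount fail with an error instead of hanging indefinitely. Note that soft mounts can cause silent data loss for applications writing during an outage, so this is a trade-off, not a fix:

```
# /etc/fstab - soft variant of the reproducer mount (illustrative only)
# timeo is in tenths of a second; after retrans retries the call fails
# with an error instead of retrying forever.
nfs01:/nfs/home /nfs/home nfs vers=3,soft,timeo=100,retrans=3,tcp,rsize=8192,wsize=8192 0 0
```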

Occasionally, after 1800 seconds (30 minutes), system will just die with the following messages (systemd was in debug mode):

[ 1831.602805] systemd[1]: Timed out starting Reboot.
[ 1831.607376] systemd[1]: Job reboot.target/start failed with result 'timeout'.
[ 1831.617183] systemd[1]: Forcibly rebooting as result of failure.
[ !!  ] Forcibly rebooting as result of failure.
[ 1831.625209] systemd[1]: Shutting down.
[ 1831.649519] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 1831.675961] systemd-journald[465]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 1841.666618] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 1841.680423] systemd-shutdown[1]: Sending SIGKILL to PID xxx (umount).
...
[ 1841.801172] systemd-shutdown[1]: Unmounting file systems.
[ 1841.805417] systemd-shutdown[1]: Unmounting /run/user/0.


This happens because there seems to be no timeout on the NFS umount in the kernel.
For sure, it's not a systemd issue but an NFS issue.

IMO, it's a major issue, since networks are unstable by nature.

Comment 19 masanari iida 2017-06-06 08:50:47 UTC
NetworkManager-1.4.0-13.el7 or later includes the following fix:
2016-11-02 Thomas Haller <thaller> -  1.4.0-13
- core: don't unmanage devices on shutdown (rh#1371126)
(which is related to https://bugzilla.redhat.com/show_bug.cgi?id=1311988 )

My question to Renaud Métrich is, which version of NetworkManager did you use
when you encountered the symptom?

Comment 20 Renaud Métrich 2017-06-06 09:19:40 UTC
This is NetworkManager-1.4.0-20.el7_3.x86_64

Comment 21 Renaud Métrich 2017-06-06 12:58:17 UTC
I can reproduce the exact same behaviour with the following setup:

1. NFS mount /share1
2. a dummy service "test-simple.service" that enters the /share1 directory to keep /share1 busy (does sleep/ls)

  ExecStart=/root/share1

with /root/share1 being:

# cat /root/share1 
#!/bin/bash

# Wait until /share1 is actually mounted (the listing becomes non-empty)
while [ -z "$(/bin/ls /share1)" ]; do
	sleep 1
done
echo "/share1 is now mounted ..."
# Keep the mount point busy forever
cd /share1
while :; do
	/bin/ls
	sleep 10
done

NOTE: the service has deliberately not been configured with a dependency on "/share1", in order to reproduce the issue.

3. dummy service is made not to stop correctly

  ExecStop=/bin/sleep 300

4. issue reboot from terminal
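
For reference, the dummy unit described in steps 2 and 3 can be sketched as follows (this file is my reconstruction from the snippets above, not taken verbatim from the test setup; the default TimeoutStopSec of 90 s is what produces the 1min-30s kill seen in the log):

```
# /etc/systemd/system/test-simple.service  (reconstructed sketch)
[Unit]
Description=Test 'simple' service
# NOTE: deliberately no Requires=/After= dependency on share1.mount

[Service]
Type=simple
ExecStart=/root/share1
# Simulate a service that does not stop correctly (step 3)
ExecStop=/bin/sleep 300

[Install]
WantedBy=multi-user.target
```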


What we see:

NFS unmount fails:
[   60.822617] umount[10737]: umount.nfs: /share1: device is busy
[  OK  ] Failed unmounting /share1.

Interface is taken down:
[   61.411976] network[10751]: Shutting down interface eth0:  Device 'eth0' successfully disconnected.

test-simple is getting killed after 1m30:
[**    ] A stop job is running for Test 'simple' service (1min 29s / 1min 30s)
[  150.937597] systemd[1]: test-simple.service stopping timed out. Terminating.
[  150.963373] systemd[1]: Stopped Test 'simple' service.

Systemd finishes the shutdown, but never reboots:
[  OK  ] Reached target Shutdown.
[  151.282370] systemd[1]: Reached target Final Step.
[  151.282378] systemd[1]: Starting Final Step.
[  151.283008] systemd[1]: Starting Reboot...
[  151.283271] systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[  151.294818] systemd[1]: Shutting down.
[  151.309229] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  150.900422] lvmetad[492]: Failed to accept connection errno 11.
[  151.326900] systemd-journald[464]: Received SIGTERM from PID 1 (systemd-shutdow).
[  151.361927] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  151.370835] systemd-shutdown[1]: Unmounting file systems.

NFS server unreachable pops out later:
[  331.656199] nfs: server 192.168.122.211 not responding, still trying

Shutdown/reboot never happens, eventually the VM starts to eat all the CPU after some time (the 1800s systemd timeout???).

Of course, when adding the Before/Requires dependency on share1.mount to the dummy service, the issue stops happening, since the NFS umount will be performed only after the dummy service has been killed.
However, in real life such bad things may happen; it is not always possible for the system administrator to know which mounts an application depends on.
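
The explicit dependency on share1.mount mentioned above can be written with RequiresMountsFor, which makes systemd add the Requires= and After= pair automatically (a sketch of the standard directive, applied to the dummy service from this comment):

```
# /etc/systemd/system/test-simple.service.d/mounts.conf
[Unit]
# Pulls in Requires= and After= on share1.mount, so on shutdown the
# service is stopped before /share1 is unmounted.
RequiresMountsFor=/share1
```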

So, clearly, some hardening must be performed: for example, do not consider the "Remote File Systems" target stopped on shutdown while some remote umounts have failed, and/or retry remote umounts until they succeed, or use the "force" flag for umounts, etc.

Comment 22 masanari iida 2017-06-07 11:03:23 UTC
Hello Renaud
Thanks for the detailed steps to reproduce the issue.

Hello Rikard, how about your original symptom?
Have you tested with latest systemd and/or NetworkManager?

Comment 26 Bertrand 2017-08-08 07:18:01 UTC
Additionally, the issue does not seem to be limited to NFS version 3. A test done with NFS version 4 on RHEL 7.3 shows similar behavior, with the NetworkManager service disabled at startup on the client machine.

Step 1: 
Set up NFSv4 share on RHEL7.3 Server Machine (A VM, 192.168.1.23,  in my case running RHEL7.3 / 3.10.0-514.21.1.el7.x86_64)

Step 2: 
Mount NFSv4 Share on RHEL7.3 Client Machine (A VM, 192.168.1.36 in my case running RHEL7.3 / 3.10.0-514.26.1.el7.x86_64)
e.g:
192.168.1.23:/share1 on /mounted type nfs4 (rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.36,local_lock=none,addr=192.168.1.23)

Step 3: 
Stop NFSv4 Service on RHEL7.3 Server Machine

Step 4: 
Attempting to reboot the RHEL7.3 Client Machine results in the machine hanging.

Observation:
a) If the mounted filesystem is being accessed, then when the reboot command is issued, the network stack stays up and running. I can ssh to the Client Machine.
b) If the mounted filesystem is not being accessed, then when the reboot command is issued, the network stack is no longer available. I can no longer ssh to the Client Machine.

Comment 30 Anatoly Pugachev 2017-09-07 08:50:33 UTC
Can someone change version from 7.2 to 7.3 in bug header? Thanks.

Comment 31 Christian Horn 2017-09-07 09:09:10 UTC
(In reply to Anatoly Pugachev from comment #30)
> Can someone change version from 7.2 to 7.3 in bug header? Thanks.
Why do you think this should be done?
The original report was for 7.2, the issue seems to still exist, so if you have hit it on 7.3 that is in line with this bz.

Comment 33 Christian Horn 2017-09-12 01:58:40 UTC
This is starting to cause us pain; we see that in the number of attached cases, plus customers behind at least 2 partners.

systemd and nfs are involved here.  To my understanding, getting lazy umount into rhel7 (as brought up in bz1408791) would be a solution which worked for us before in rhel6, but the bz got CLOSED CANTFIX.

We need to come up with something; things as common as a temporary network issue or a restarted NFS server prevent clients from properly rebooting.

Is a different forum required to discuss this issue which affects systemd and NFS areas, i.e. a mail thread involving 2 upstream lists?

Comment 34 J. Bruce Fields 2017-09-12 13:54:25 UTC
I've added a comment to https://bugzilla.redhat.com/show_bug.cgi?id=1408791 in an attempt to better understand the motivation for the lazy umount change.

It's not always going to be possible to shut down NFS clients cleanly in the face of network issues and/or bad application behavior--perhaps user expectations are unrealistic in some of these cases?  But it also sounds like there are cases where we could do better.

Comment 38 Kyle Walker 2017-10-05 16:53:47 UTC
*** Bug 1462962 has been marked as a duplicate of this bug. ***

Comment 40 Lukáš Nykrýn 2017-10-06 11:50:11 UTC
fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/155 -> post

Comment 49 Thomas Jones 2018-02-01 05:25:15 UTC
Is there an expected release-date for this fix?

Running into what appears to be the same or a similar issue when hosting a slow-to-shutdown Java application on an EFS share (AWS's managed NFS 4.1 service). I can work around the issue by adding TimeoutStopSec=10 to my application's service definition. However, this means I'm SIGKILLing my application rather than allowing it to shut down gracefully.
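
The TimeoutStopSec workaround mentioned above looks like this as a drop-in (the unit name "myapp.service" is hypothetical):

```
# /etc/systemd/system/myapp.service.d/stop-timeout.conf
[Service]
# Give the application only 10 s to stop; after that systemd sends
# SIGKILL, so it cannot hold the NFS/EFS mount busy during shutdown.
TimeoutStopSec=10
```

As noted, this trades a graceful application shutdown for a reliable reboot.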

Comment 50 Filip Krska 2018-02-09 12:09:58 UTC
Hello Thomas, the 7.4.Z stream Bug 1519245 for this issue was fixed and released with errata https://access.redhat.com/errata/RHBA-2018:0155 

Best Regards, Filip

Comment 61 errata-xmlrpc 2018-04-10 11:16:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0711

Comment 68 Thomas Jones 2018-06-22 17:03:33 UTC
I'm not seeing a "solution" in that errata? I just did a test redeployment of an NFS-using 7.5 instance and it appears to still be experiencing the hang problem. The behavior does not appear to be fixed. Do I really need to open a new bug to report this lack of fixative-action? Seems this bug didn't actually result in the fix it claimed to result in. :-\

Comment 69 Dave Wysochanski 2018-06-22 18:58:11 UTC
(In reply to Thomas Jones from comment #68)
> I'm not seeing a "solution" in that errata? I just did a test redeployment
> of an NFS-using 7.5 instance and it appears to still be experiencing the
> hang problem. The behavior does not appear to be fixed. Do I really need to
> open a new bug to report this lack of fixative-action? Seems this bug didn't
> actually result in the fix it claimed to result in. :-\

There is another open bug about NFS hangs on shutdown with a pending systemd update in https://bugzilla.redhat.com/show_bug.cgi?id=1571098

Please note a "hang on reboot due to NFS" could be the result of a number of conditions.  So yes, in general, you have to file a new case or bug to properly diagnose the hang, even if you feel the high-level symptom is identical: the underlying fix to any given package may be different, so you cannot, for example, just ask for a bug that has an errata on it to be re-opened.  Red Hat does not do this, since there was a patch for this bug, but it may not fix all of the underlying conditions causing a hang.

Comment 70 Thomas Jones 2018-06-22 19:07:18 UTC
Unfortunately, that bug doesn't seem to be generically visible. :-\

I know that in my case, I've got a Java application that's slow to release the disk (even though its systemd unit thinks the process has exited) and it's holding the NFS shares open long enough that we still end up in the scenario where NFS doesn't shut down before networking does ...and then wedges.

Comment 71 Stefan Lasiewski 2018-08-24 20:35:10 UTC
It's not clear that this issue is fixed at all.

Echoing what @Thomas said, https://bugzilla.redhat.com/show_bug.cgi?id=1571098 isn't publicly available, and thus it's not clear how that is related to this issue.

This bug was closed with a resolution of ERRATA, and refers us to https://access.redhat.com/errata/RHBA-2018:0711 . However, RHBA-2018:0711 only references us back to this bug, #1312002. Furthermore, the release notes at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.5_release_notes/ make no specific mention of the bug reported here.

Comment 72 Donald Douwsma 2018-08-25 04:45:58 UTC
Note: Bug 1571098 is tracking the RHEL-7.6 release. A zstream bug was cloned for a fix in RHEL-7.5 and closed as errata https://access.redhat.com/errata/RHBA-2018:2447 (systemd-219-57.el7_5.1), which was released on 2018-08-16.

Comment 73 Plumber Bot 2022-01-21 15:38:43 UTC
Dropping the stale needinfo. If our input is still needed, please set the needinfo again.