Bug 1312002
Summary: | hangs on reboot or shutdown when nfs file system mounted | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Rikard <rikard.oberg> | |
Component: | systemd | Assignee: | systemd-maint | |
Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | 7.2 | CC: | ajmitchell, arawat, bart.demeester, bcodding, bfields, brault, brubisch, bsingh, ccheney, chorn, davor, ddouwsma, dwysocha, fadamo, fkrska, fsumsal, jaeshin, jbyrd, kfujii, kwalker, leif, luf, masanari.iida, matorola, mlinden, mpoole, mr.xkurt, msivakum, myllynen, onatalen, redhat, rkothiya, rmetrich, rsawhill, smayhew, ssahani, stefan, sweettea, swhiteho, systemd-maint-list, systemd-maint, tnagata, yoguma, yoyang | |
Target Milestone: | rc | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | systemd-219-46.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1519245 | Environment: | ||
Last Closed: | 2018-04-10 11:16:36 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1298243, 1420851, 1466365, 1469559, 1473733, 1519245, 1522983 |
Description
Rikard
2016-02-25 13:43:09 UTC
This seems like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1334573

BZ#1334573 appears to be about autofs; this also hangs without autofs in use.

Can you please try that with the RHEL 7.3 beta?

*** Bug 1420676 has been marked as a duplicate of this bug. ***

The exact same shutdown/reboot issue happens when network connectivity to the NFS server is lost and a shutdown/reboot is issued. Loss of network connectivity to the NFS server means, for example:
- unavailability of the NFS server
- no route to the NFS server (broken switch)

To easily reproduce the issue with NFS hard mounts:
1. stop the NFS server
2. trigger the reboot or shutdown of the NFS client

The system will never shut down or reboot, printing "nfsshareX.mount unmounting timed out. Stopping." messages.
Occasionally, after 1800 seconds (30 minutes), the system will just die with the following messages (systemd was in debug mode):

    [ 1831.602805] systemd[1]: Timed out starting Reboot.
    [ 1831.607376] systemd[1]: Job reboot.target/start failed with result 'timeout'.
    [ 1831.617183] systemd[1]: Forcibly rebooting as result of failure.
    [ !! ] [ 1831.625209] systemd[1]: Shutting down. Forcibly rebooting as result of failure.
    [ 1831.649519] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
    [ 1831.675961] systemd-journald[465]: Received SIGTERM from PID 1 (systemd-shutdow).
    [ 1841.666618] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
    [ 1841.680423] systemd-shutdown[1]: Sending SIGKILL to PID xxx (umount).
    ...
    [ 1841.801172] systemd-shutdown[1]: Unmounting file systems.
    [ 1841.805417] systemd-shutdown[1]: Unmounting /run/user/0.

This happens because there seems to be no timeout in the NFS umount in the kernel. For sure, it's not a systemd issue but an NFS issue. IMO, it's a major issue, since the network is unstable by nature.

NetworkManager-1.4.0-13.el7 or later includes the following fix:
    2016-11-02 Thomas Haller <thaller> - 1.4.0-13
    - core: don't unmanage devices on shutdown (rh#1371126)

(which is related to https://bugzilla.redhat.com/show_bug.cgi?id=1311988 )

My question to Renaud Métrich is: which version of NetworkManager did you use when you encountered the symptom?

This is NetworkManager-1.4.0-20.el7_3.x86_64

I can reproduce the exact same behaviour with the following setup:

1. NFS mount /share1

2. dummy service "test-simple.service" entering the /share1 directory to keep /share1 busy (does sleep/ls)

    ExecStart=/root/share1

   with /root/share1 being:

    # cat /root/share1
    #!/bin/bash
    # Wait until /share1 has content, i.e. the NFS share is mounted
    while [ -z "$(/bin/ls /share1)" ]; do
        sleep 1
    done
    echo "/share1 is now mounted ..."
    # Keep the mount busy: stay inside it and list it every 10 seconds
    cd /share1
    while :; do
        /bin/ls
        sleep 10
    done

   NOTE: the service has deliberately not been configured with a dependency on "/share1", for the purpose of reproducing the issue.

3. dummy service is made not to stop correctly

    ExecStop=/bin/sleep 300

4. issue reboot from terminal

What we see:

NFS unmount fails:

    [ 60.822617] umount[10737]: umount.nfs: /share1: device is busy
    [ OK ] Failed unmounting /share1.

Interface is taken down:

    [ 61.411976] network[10751]: Shutting down interface eth0: Device 'eth0' successfully disconnected.

test-simple is getting killed after 1m30:

    [** ] A stop job is running for Test 'simple' service (1min 29s / 1min 30s)[ 150.937597] systemd[1]: test-simple.service stopping timed out. Terminating.
    [ OK [ 150.963373] systemd[1]: Stopped Test 'simple' service.

Systemd finishes the shutdown, but never reboots:

    [ OK ] Reached target Shutdown.
    [ 151.282370] systemd[1]: Reached target Final Step.
    [ 151.282378] systemd[1]: Starting Final Step.
    [ 151.283008] systemd[1]: Starting Reboot...
    [ 151.283271] systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
    [ 151.294818] systemd[1]: Shutting down.
    [ 151.309229] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
    [ 150.900422] lvmetad[492]: Failed to accept connection errno 11.
    [ 151.326900] systemd-journald[464]: Received SIGTERM from PID 1 (systemd-shutdow).
    [ 151.361927] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
    [ 151.370835] systemd-shutdown[1]: Unmounting file systems.

NFS server unreachable pops out later:

    [ 331.656199] nfs: server 192.168.122.211 not responding, still trying

Shutdown/reboot never happens; eventually the VM starts to eat all the CPU after some time (the 1800s systemd timeout???).

Of course, when adding an ordering/Requires dependency on share1.mount to the dummy service, the issue stops happening, since the NFS umount will be performed only after the dummy service has been killed.
However, in real life such a bad thing may happen: it may not always be possible for the system administrator to know which mounts the application depends on.

So, clearly, some hardening must be performed: for example, not letting the stop of the "Remote File System" target be reached on shutdown if some remote umounts have failed, and/or retrying remote umounts until they succeed, using the "force" flag in umounts, etc.

Hello Renaud,
Thanks for the detailed steps to reproduce the issue.

Hello Rikard, how about your original symptom? Have you tested with the latest systemd and/or NetworkManager?

Additionally, the issue does not seem to be limited to NFS version 3. A test done with NFS version 4 on RHEL7.3 shows similar behavior, with the NetworkManager service disabled at startup on the client machine.
Step 1: Set up an NFSv4 share on the RHEL7.3 Server Machine (a VM, 192.168.1.23, in my case running RHEL7.3 / 3.10.0-514.21.1.el7.x86_64)

Step 2: Mount the NFSv4 share on the RHEL7.3 Client Machine (a VM, 192.168.1.36, in my case running RHEL7.3 / 3.10.0-514.26.1.el7.x86_64), e.g.:

    192.168.1.23:/share1 on /mounted type nfs4 (rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.36,local_lock=none,addr=192.168.1.23)

Step 3: Stop the NFSv4 service on the RHEL7.3 Server Machine

Step 4: Attempt a reboot of the RHEL7.3 Client Machine, which results in the machine hanging.

Observation:
a) If the mounted filesystem is being accessed, then when the reboot command is issued, the network stack stays up and running. I can ssh to the Client Machine.
b) If the mounted filesystem is not being accessed, then when the reboot command is issued, the network stack is no longer available. I can no longer ssh to the Client Machine.

Can someone change version from 7.2 to 7.3 in bug header? Thanks.

(In reply to Anatoly Pugachev from comment #30)
> Can someone change version from 7.2 to 7.3 in bug header? Thanks.

Why do you think this should be done? The original report was for 7.2 and the issue seems to still exist, so if you have hit it on 7.3, that is in line with this bz.

This is starting to cause us pain; we see that in the number of attached cases, plus customers behind at least 2 partners.

systemd and nfs are both involved here. To my understanding, getting lazy umount into rhel7 (as brought up in bz1408791) would be a solution, one which worked for us before in rhel6, but the bz got CLOSED CANTFIX.

We need to come up with something: things as common as a temporary network issue or a restarted NFS server prevent clients from properly rebooting.

Is a different forum required to discuss this issue, which affects both the systemd and NFS areas, i.e. a mail thread involving 2 upstream lists?
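The reproduction above depends on the share being mounted with the "hard" option, as the mount line in Step 2 shows. As a minimal sketch, not taken from any comment in this bug, a client's hard NFS mounts could be listed from /proc/mounts-style input like this. The function name list_hard_nfs_mounts is hypothetical, and the sketch assumes the usual /proc/mounts field order (device, mount point, filesystem type, options).

```shell
#!/bin/bash
# Hypothetical helper (not from the bug report): print the mount point of
# every NFS filesystem that is mounted with the "hard" option.
# Input: lines in /proc/mounts format on stdin
#        (device mountpoint fstype options dump pass).
list_hard_nfs_mounts() {
    local device mountpoint fstype options rest
    while read -r device mountpoint fstype options rest; do
        # Only NFSv3 ("nfs") and NFSv4 ("nfs4") entries are of interest.
        case "$fstype" in
            nfs|nfs4) ;;
            *) continue ;;
        esac
        # Wrap the option list in commas so "hard" only matches as a whole option.
        case ",$options," in
            *,hard,*) printf '%s\n' "$mountpoint" ;;
        esac
    done
}

# Typical use on a client: list_hard_nfs_mounts < /proc/mounts
```

Any mount point this prints is one where an unreachable server may block I/O and unmounting, which is the situation the steps above provoke.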
I've added a comment to https://bugzilla.redhat.com/show_bug.cgi?id=1408791 in an attempt to better understand the motivation for the lazy umount change.

It's not always going to be possible to shut down NFS clients cleanly in the face of network issues and/or bad application behavior--perhaps user expectations are unrealistic in some of these cases? But it also sounds like there are cases where we could do better.

*** Bug 1462962 has been marked as a duplicate of this bug. ***

fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/155 -> post

Is there an expected release date for this fix? I am running into what appears to be a similar or the same issue when hosting a slow-to-shutdown Java application on an EFS share (AWS's NFS 4.1 interface to S3).

I can work around the issue by adding a TimeoutStopSec=10 to my application's service definition. However, this means I'm SIGKILLing my application rather than allowing it to gracefully shut down.

Hello Thomas,

7.4.Z Stream Bug 1519245 for this issue was fixed and released with Errata https://access.redhat.com/errata/RHBA-2018:0155

Best Regards,
Filip

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0711

I'm not seeing a "solution" in that errata? I just did a test redeployment of an NFS-using 7.5 instance and it appears to still be experiencing the hang problem. The behavior does not appear to be fixed. Do I really need to open a new bug to report this lack of fixative-action? Seems this bug didn't actually result in the fix it claimed to result in. :-\

(In reply to Thomas Jones from comment #68)
> I'm not seeing a "solution" in that errata?
> I just did a test redeployment of an NFS-using 7.5 instance and it appears
> to still be experiencing the hang problem. The behavior does not appear to
> be fixed. Do I really need to open a new bug to report this lack of
> fixative-action? Seems this bug didn't actually result in the fix it
> claimed to result in. :-\

There is another open bug about NFS hangs on shutdown with a pending systemd update in https://bugzilla.redhat.com/show_bug.cgi?id=1571098

Please note a "hang on reboot due to NFS" could be the result of a number of conditions. So yes, in general, you have to file a new case or bug to properly diagnose the hang, even if you feel the high-level symptom is identical. The underlying fix to any given package may be different, so you cannot, for example, just ask for a bug that has an errata on it to be re-opened. Red Hat does not do this, since there was a patch for this bug but it may not fix all of the underlying conditions causing a hang.

Unfortunately, that bug doesn't seem to be generically visible. :-\

I know that in my case, I've got a Java application that's slow to release the disk (even though its systemd unit thinks the process has exited) and it's holding the NFS shares open long enough that we still end up in the scenario where NFS doesn't shut down before networking does ...and then wedges.

It's not clear that this issue is fixed at all. Echoing what @Thomas said, https://bugzilla.redhat.com/show_bug.cgi?id=1571098 isn't publicly available, and thus it's not clear how that is related to this issue.

This bug was closed with a resolution of ERRATA, and refers us to https://access.redhat.com/errata/RHBA-2018:0711 . However, RHBA-2018:0711 only refers us back to this bug, #1312002. Furthermore, the release notes at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.5_release_notes/ make no specific mention of the bug reported here.

Note: Bug 1571098 is tracking the RHEL-7.6 release.
A zstream bug was cloned for a fix in RHEL-7.5 and closed as errata https://access.redhat.com/errata/RHBA-2018:2447 (systemd-219-57.el7_5.1), which was released on 2018-08-16.

Dropping the stale needinfo. If our input is still needed, please set the needinfo again.
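One workaround reported earlier in this bug is to give the service a dependency on the NFS mount, so that the service is stopped before the share is unmounted. The following is a minimal sketch of generating a systemd drop-in for that, assuming the RequiresMountsFor= directive is used (it adds Requires= and After= dependencies on the mount units needed to reach the listed paths). The function name print_mount_dropin and the drop-in file name used below are hypothetical, and paths containing spaces are not handled. Given the different hang causes discussed above, this is not claimed to avoid every hang.

```shell
#!/bin/bash
# Hypothetical helper (not from the bug report): print a systemd drop-in that
# makes a service require, and be ordered after, the mounts backing the paths.
print_mount_dropin() {
    if [ "$#" -eq 0 ]; then
        echo "usage: print_mount_dropin PATH..." >&2
        return 1
    fi
    local path
    for path in "$@"; do
        # RequiresMountsFor= expects absolute paths; reject anything else.
        case "$path" in
            /*) ;;
            *) echo "not an absolute path: $path" >&2; return 1 ;;
        esac
    done
    printf '[Unit]\n'
    # The directive takes a space-separated list of absolute paths.
    printf 'RequiresMountsFor=%s\n' "$*"
}
```

For the reproducer in this bug, the output of "print_mount_dropin /share1" could be saved as /etc/systemd/system/test-simple.service.d/mounts.conf, followed by "systemctl daemon-reload".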