Bug 1462962 - systemd-shutdown hard hang when NFS mounts are unavailable
Status: CLOSED DUPLICATE of bug 1312002
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: systemd-maint
QA Contact: qe-baseos-daemons
Depends On:
Blocks: 1420851 1466365
 
Reported: 2017-06-19 14:43 EDT by Kyle Walker
Modified: 2017-10-05 12:55 EDT
CC: 6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-05 12:53:47 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Kyle Walker 2017-06-19 14:43:14 EDT
Description of problem:
 When an NFS version 3 share is unavailable during shutdown, the failure can escalate into a hard hang. Specifically, the hang occurs in the following loop:

src/core/shutdown.c
<snip>
        /* Unmount all mountpoints, swaps, and loopback devices */
        for (retries = 0; retries < FINALIZE_ATTEMPTS; retries++) {
                bool changed = false;

                if (use_watchdog)
                        watchdog_ping();

                /* Let's trim the cgroup tree on each iteration so
                   that we leave an empty cgroup tree around, so that
                   container managers get a nice notify event when we
                   are down */
                if (cgroup)
                        cg_trim(SYSTEMD_CGROUP_CONTROLLER, cgroup, false);

                if (need_umount) {
                        log_info("Unmounting file systems.");
                        r = umount_all(&changed);
                        if (r == 0) {
                                need_umount = false;
                                log_info("All filesystems unmounted.");
                        } else if (r > 0)
                                log_info("Not all file systems unmounted, %d left.", r);
                        else
                                log_error_errno(r, "Failed to unmount file systems: %m");
                }

<snip>

When NFS version 3 mount points are unavailable, the umount above hangs with no timeout. This type of failure should only delay the shutdown operation, not hang it indefinitely.
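
For context only: user space can ask the kernel for a forced or lazy unmount via umount2(), which is one commonly suggested mitigation when a server is unreachable. The sketch below is purely illustrative (the helper name and the /mnt/test path are made up for this example, and whether systemd itself uses these flags varies by version); neither flag is guaranteed to avoid blocking for every kernel or mount configuration.
~~~~
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper, not systemd code: try a forced unmount first
 * (aborts in-flight NFS requests), then fall back to a lazy unmount
 * (detach now, clean up whenever the server comes back). */
static int force_or_lazy_umount(const char *path) {
        if (umount2(path, MNT_FORCE) == 0)
                return 0;

        if (umount2(path, MNT_DETACH) == 0)
                return 0;

        fprintf(stderr, "Could not unmount %s: %s\n", path, strerror(errno));
        return -errno;
}

int main(void) {
        /* /mnt/test is the example mount point from the reproducer below. */
        return force_or_lazy_umount("/mnt/test") == 0 ? 0 : 1;
}
~~~~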
 

Version-Release number of selected component (if applicable):
 systemd-219-30.el7_3.9

How reproducible:
 Easily

Steps to Reproduce:
1. Configure an NFS v3 mount point on the client system
2. Disconnect the NFS v3 export from the server side (for example, by stopping the NFS server)
3. Issue a restart

Actual results:
 Output similar to the following is visible on the console when the NFS v3 mount point is mounted at /mnt/test:
~~~~
[**    ] A stop job is running for /mnt/test (1min 30s / 3min)[  221.946558] systemd[1]: mnt-test.mount unmounting timed out. Stopping.
[**    ] A stop job is running for /mnt/test (3min / 3min)[  312.196510] systemd[1]: mnt-test.mount unmounting timed out. Stopping.
[***   ] A stop job is running for /mnt/test (4min 31s / 6min 1s)[  402.446732] systemd[1]: mnt-test.mount unmounting timed out. Stopping.
[ !!  ] Forcibly rebooting as result of failure.
[  431.447144] systemd[1]: Job reboot.target/start timed out.
[  431.447742] systemd[1]: Timed out starting Reboot.
[  431.448170] systemd[1]: Job reboot.target/start failed with result 'timeout'.
[  431.448561] systemd[1]: Forcibly rebooting as result of failure.
[  431.448931] systemd[1]: Shutting down.
[  431.244885] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  431.263215] systemd-journald[457]: Received SIGTERM from PID 1 (systemd-shutdow).
[  431.295701] type=1305 audit(1497897377.802:185): audit_pid=0 old=602 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[  441.284108] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  441.295134] systemd-shutdown[1]: Sending SIGKILL to PID 1540 (umount).
[  441.298339] systemd-shutdown[1]: Unmounting file systems.
[  441.300740] systemd-shutdown[1]: Unmounting /run/user/0.
[  745.209555] nfs: server <IP> not responding, still trying
~~~~


Expected results:
 Following the "Unmounting" messages above, the shutdown should proceed, with output indicating that the reboot is being forced due to the extended failure of the NFS mounts.


Additional info:
Comment 4 Bertrand 2017-08-08 03:24:39 EDT
Similar case: BZ# 1312002

Additionally, the issue does not seem to be limited to NFS version 3. A test with NFS version 4 on RHEL 7.3 shows similar behaviour, with the NetworkManager service disabled from starting on the client machine.

Step 1: 
Set up NFSv4 share on RHEL7.3 Server Machine (A VM, 192.168.1.23,  in my case running RHEL7.3 / 3.10.0-514.21.1.el7.x86_64)

Step 2: 
Mount NFSv4 Share on RHEL7.3 Client Machine (A VM, 192.168.1.36 in my case running RHEL7.3 / 3.10.0-514.26.1.el7.x86_64)
e.g.:
192.168.1.23:/share1 on /mounted type nfs4 (rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.36,local_lock=none,addr=192.168.1.23)

Step 3: 
Stop NFSv4 Service on RHEL7.3 Server Machine

Step 4: 
Attempting a reboot of the RHEL7.3 Client Machine results in the machine hanging.

Observation:
a) If the mounted filesystem is being accessed when the reboot command is issued, the network stack stays up and running; I can still ssh to the Client Machine.
b) If the mounted filesystem is not being accessed when the reboot command is issued, the network stack is no longer available; I can no longer ssh to the Client Machine.
Comment 5 Bertrand 2017-08-08 08:29:11 EDT
However, with JobTimeoutSec in /usr/lib/systemd/system/reboot.target changed from 30min to, for example, 5 minutes, the client machine does eventually reboot.
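
For reference, the equivalent change can be made without editing the packaged unit file by using a drop-in; the path and the 5-minute value below simply mirror the example in this comment (run "systemctl daemon-reload" afterwards for it to take effect):
~~~~
# /etc/systemd/system/reboot.target.d/job-timeout.conf
[Unit]
JobTimeoutSec=5min
~~~~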
Comment 7 Kyle Walker 2017-08-09 12:17:59 EDT
@Bertrand,

Just to note, there seem to be a number of different stall behaviours at the end of reboot when a backing NFS server is unavailable. The one above regarding JobTimeoutSec is just one of them.

The hang in the context of this particular bug report is a stall in the following operation:

(gdb) bt
#0  0x00007f708d0b3eca in mount () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f708dc04e59 in mount_points_list_umount.2401 (head=0xffff88013740ba90,
    changed=0xffff88013ff9e350, log_error=false) at src/core/umount.c:387
#2  0x00007f708dbffe9b in umount_all (changed=<synthetic pointer>) at src/core/umount.c:526
#3  main (argc=<optimized out>, argv=0x7ffe86d66458) at src/core/shutdown.c:234


The specific hang is in:

(gdb) list
382                              * somehwere else via a bind mount. If we
383                              * explicitly remount the super block of that
384                              * alias read-only we hence should be
385                              * relatively safe regarding keeping the fs we
386                              * can otherwise not see dirty. */
387                             mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, NULL);
388                     }
389
390                     /* Skip / and /usr since we cannot unmount that
391                      * anyway, since we are running from it. They have


In each case, the underlying kernel shows the following backtrace for the stalled task:

[<ffffffffa02c8e24>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffa02ca364>] __rpc_execute+0x154/0x430 [sunrpc]
[<ffffffffa02cd39e>] rpc_execute+0x5e/0xa0 [sunrpc]
[<ffffffffa02c0310>] rpc_run_task+0x70/0x90 [sunrpc]
[<ffffffffa02c0380>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffa06995bb>] nfs3_rpc_wrapper.constprop.11+0x6b/0xb0 [nfsv3]
[<ffffffffa069a296>] nfs3_proc_getattr+0x56/0xb0 [nfsv3]
[<ffffffffa05dd14f>] __nfs_revalidate_inode+0xbf/0x310 [nfs]
[<ffffffffa05dd952>] nfs_revalidate_inode+0x22/0x60 [nfs]
[<ffffffffa05d4beb>] nfs_weak_revalidate+0x4b/0xf0 [nfs]
[<ffffffff812091a7>] complete_walk+0x87/0xe0
[<ffffffff8120c453>] path_lookupat+0x83/0x7a0
[<ffffffff8120cb9b>] filename_lookup+0x2b/0xc0
[<ffffffff812105b7>] user_path_at_empty+0x67/0xc0
[<ffffffff8120436b>] SyS_readlinkat+0x5b/0x140
[<ffffffff8120446b>] SyS_readlink+0x1b/0x20
[<ffffffff81697709>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


Note, the above behaviour is shared with BZ1312002. That bug report specifically targets the condition where the network is disabled while NFS mounts still need to be unmounted. However, in this instance we should not stall endlessly in that MS_REMOUNT|MS_RDONLY mount operation just because the backing NFS server is inaccessible.
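
As a rough illustration of how that remount could be bounded (a sketch only, not an actual patch; the helper name and the 30-second timeout are arbitrary), the blocking mount() call can be issued from a forked child that the parent kills once a deadline passes:
~~~~
#include <signal.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: perform the read-only remount in a child process
 * and give up on it after a timeout, so an unreachable NFS server cannot
 * wedge systemd-shutdown forever. */
static int remount_ro_with_timeout(const char *path, unsigned timeout_sec) {
        pid_t pid = fork();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                /* Child: this call may block indefinitely on a dead NFS server. */
                mount(NULL, path, NULL, MS_REMOUNT|MS_RDONLY, NULL);
                _exit(0);
        }

        for (unsigned elapsed = 0; elapsed < timeout_sec; elapsed++) {
                int status;
                if (waitpid(pid, &status, WNOHANG) == pid)
                        return 0;        /* remount finished in time */
                sleep(1);
        }

        /* Timed out: kill the stuck child and carry on with shutdown. */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        fprintf(stderr, "Remounting %s read-only timed out.\n", path);
        return -1;
}

int main(void) {
        /* /mnt/test is the example NFS mount point from the reproducer. */
        return remount_ro_with_timeout("/mnt/test", 30) == 0 ? 0 : 1;
}
~~~~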

- Kyle Walker
Comment 11 Kyle Walker 2017-10-05 12:53:47 EDT
Based on the efforts in bug 1312002 indicating that systemd is the best place to address this hang, the best course of action here is to consolidate the work there.

I'm currently closing this bug as a Duplicate of 1312002 and will continue efforts there.

- Kyle Walker

*** This bug has been marked as a duplicate of bug 1312002 ***
