Bug 1466221

Summary: Update of nfs-utils breaks running nfs mounts and blocks reboot
Product: [Fedora] Fedora Reporter: David Jansen <jansen>
Component: systemdAssignee: systemd-maint
Status: CLOSED WORKSFORME QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 25CC: bcodding, bfields, harald, jansen, jlayton, jsynacek, kay, lnykryn, lpoetter, mschmidt, msekleta, smayhew, ssahani, s, steved, systemd-maint, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-26 15:25:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Jansen 2017-06-29 09:40:38 UTC
Description of problem:

Last night, nfs-utils was updated (from nfs-utils-1.3.4-1.rc2.fc25.x86_64 to nfs-utils-2.1.1-5.rc4.fc25.x86_64). This went well when no disks of the system were currently mounted, but if something was mounted from another computer, the nfs server leaves some processes in a 'D' state, and subsequent attempts to restart the nfs service complain about an unkillable process. This also happens on a reboot, so direct access (reset button) is required to recover.
After the reboot, nfs access seems to work fine again. So the updated version is working, but the update process has (severe) side effects.

Version-Release number of selected component (if applicable):
nfs-utils-2.1.1-5.rc4.fc25.x86_64

How reproducible:

Steps to Reproduce:
1. Install the 'old' version of nfs-utils, 1.3.4)
2. Export a local disk over nfs and mount it from another machine (the other machine does't have to run Fedora)
2. Install the update (2.1.1)

Actual results:
Unkillable process on nfs server, and hanging nfs disk on client

Expected results:


Additional info:
Note: on some systems, I could recover nfs operation by downgrading nfs-utils through dnf, but that only worked if there were no unkillable nfs server processes

Comment 1 J. Bruce Fields 2017-06-29 20:56:09 UTC
(In reply to David Jansen from comment #0)
> This went well when no disks of the
> system were currently mounted, but if something was mounted from another
> computer, the nfs server leaves some processes in a 'D' state, and
> subsequent attempts to restart the nfs service complain about an unkillable
> process.  This also happens on a reboot, so direct access (reset button) is
> required to recover.

In the above, I'm not completely clear when you're talking about the NFS server and when you're talking about the NFS client.

Are the processes in D state process on the NFS client, and is it the NFS client(s) that you had to reset, or the server, or both?

If it's reproduceable, a network trace taken during the upgrade might be interesting.

(So: run something like "tcpdump -s0 -wtmp.pcap host myclient', then examine tmp.pcap in wireshark and/or attach it to this bug.

Comment 2 David Jansen 2017-06-30 07:24:45 UTC
Processes can hang on both server and client.
On nfs server: the nfs service is not restarting after the upgrade, or the old nfs service is hanging, and becomes unkillable, which also blocks a reboot (but after a hard reboot, nfs is ok again)
On nfs clients that happened to have a disk mounted at the time of the update, the mount becomes unresponsive. 'umount -f -l' will take care of that, mostly.

Comment 3 J. Bruce Fields 2017-06-30 18:45:15 UTC
OK, and rereading the discussion, I now understand that it's the NFS server that you're upgrading.  So the client hangs are a consequence of a server side bug.

"subsequent attempts to restart the nfs service complain about an unkillable process."

Do you have the exact text of the complaint?

Comment 4 David Jansen 2017-10-12 14:50:50 UTC
Journal entries like this, when the nfs-server was being restarted:

Jun 29 10:13:00 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Processes still around after SIGKILL. Ignoring.
Jun 29 10:13:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=24104 DF PROTO=TCP SPT=27698 DPT=22 WINDOW=
Jun 29 10:14:10 jeker.strw.leidenuniv.nl audit[34920]: CRYPTO_KEY_USER pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:39:e3:e5:35:ff:4b:ee:6d:e
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37239): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:3
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37240): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:18:9f:f7:a7:80:ac:af:57:e7:eb:86:a5:51:54:96:70:25:15:74:3
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37241): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37242): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37243): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:3
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37244): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:18:9f:f7:a7:80:ac:af:57:e7:eb:86:a5:51:54:96:70:25:15:74:3
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37245): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5
Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=1112 audit(1498724050.269:37246): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=login acct="(unknown)" exe="/usr/sbin/sshd" hostname=? addr=132.229.214.166 terminal=ssh
Jun 29 10:14:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: State 'stop-post' timed out. Terminating.
Jun 29 10:14:32 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=19615 DF PROTO=TCP SPT=43453 DPT=22 WINDOW=
Jun 29 10:15:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=44 ID=37459 DF PROTO=TCP SPT=15577 DPT=22 WINDOW=
Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: State 'stop-final-sigterm' timed out. Killing.
Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Killing process 33804 (rpc.nfsd) with signal SIGKILL.
Jun 29 10:16:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=8574 DF PROTO=TCP SPT=35115 DPT=22 WINDOW=2
Jun 29 10:17:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=47 ID=34445 DF PROTO=TCP SPT=59231 DPT=22 WINDOW=
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 29 10:17:31 jeker.strw.leidenuniv.nl audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nfs-server comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS server and services.
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Unit entered failed state.
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Failed with result 'timeout'.
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFSv4 ID-name mapping service...
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFS Mount Daemon...
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFSv4 ID-name mapping service.
Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS Mount Daemon.

Comment 5 Steve Dickson 2017-10-16 12:10:04 UTC
Maybe the systemd people can tell us what's going on here... 

I have no idea..

Comment 6 Michal Schmidt 2017-10-16 13:06:48 UTC
(In reply to David Jansen from comment #4)
> Jun 29 10:13:00 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> Processes still around after SIGKILL. Ignoring.

Wasn't there an earlier warning message (about 90 seconds before this one) saying "State 'stop-sigterm' timed out. Killing."?

The RPM scriptlets of nfs-utils called the command "systemctl try-restart nfs-server". systemd tried to stop the service. Presumably it first ran the ExecStop command of the service. Then it noticed there were still some processes alive in the service's cgroup, so it sent SIGTERM to them. After a timeout when it saw they were still around, it sent them SIGKILL. They still did not exit. This is when systemd emits the warning. Probably because they are stuck in the kernel in TASK_UNINTERRUPTIBLE ('D') state.
Then systemd went on to run the ExecStopPost commands...

> Jun 29 10:14:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> State 'stop-post' timed out. Terminating.

An ExecStopStop command got stuck too. It did not finish in 90 seconds. systemd sent SIGTERM...

> Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> State 'stop-final-sigterm' timed out. Killing.
> Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> Killing process 33804 (rpc.nfsd) with signal SIGKILL.

90 seconds later the process was still alive. systemd sent SIGKILL...

> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> Processes still around after final SIGKILL. Entering failed mode.

And the process is still stuck. systemd considers the service failed.

> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS server and
> services.
> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> Unit entered failed state.
> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service:
> Failed with result 'timeout'.

Services whose stopping is ordered after nfs-server.service were then stopped:

> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFSv4 ID-name
> mapping service...
> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFS Mount
> Daemon...
> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFSv4 ID-name
> mapping service.
> Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS Mount
> Daemon.

I see no indication of a systemd bug.

Comment 7 Steve Dickson 2017-10-16 21:41:37 UTC
Thank you for your analogy... 

Basically is sounds like a process was hung on an NFS mount or
an NFS umount was hung... 

Would these hang(s) because by the network being brought down
before NFS?

Comment 8 Michal Schmidt 2017-10-18 14:50:11 UTC
Steve,
nobody mentioned anything about the network going down.
And it's the *server* that's hanging. Shouldn't it be possible to restart the server service regardless of the network connection state?

Comment 9 Steve Dickson 2017-10-18 15:32:54 UTC
(In reply to Michal Schmidt from comment #8)
> Steve,
> nobody mentioned anything about the network going down.
The description talks about this happening during a reboot
so I'm assuming the system was on its way down
 
> And it's the *server* that's hanging. Shouldn't it be possible to restart
> the server service regardless of the network connection state?
Process in the 'D' state generally happen on the client which
is the confusing part (at least for me :-) )

But to answer your question, in general it shouldn't matter but
that may not be the case... With v3 the clients have to be
notified about the reboot (but that happens in background 
so if there is no network it *should* keep trying).

Also loopback interface needs to be around so the server
can register with rpcbind. I'm not sure what the server
does if rpcbind is not around.

So it is best to have a functioning network when the
server is brought up and down... which I'm hoping 
the nfs-server service file specifies.

Comment 10 Michal Schmidt 2017-10-18 15:39:07 UTC
The stuck processes appeared already during the nfs-utils package update.

I tried to reproduce this, but failed. You mention v3. I only tried the default, which is v4 (?) on a recent Fedora, so that may be why it did not reproduce.

Comment 11 David Jansen 2017-10-26 09:32:48 UTC
I've not been able to reproduce the issue later; it only happened at the time of this rpm upgrade. so something in the upgrade was slightly incompatible, but after a reboot, the new packages seem to work just fine. 
Also, on systems where no files on nfs were open at the time of the update, there was no problem; the updates nfs daemons were properly restarted and later file access went just fine. 
So the problem would only occur when files were open on a remote disk at the time of this update.

As far as I'm concerned, such a one-time issue is not too much of a problem

Comment 12 Steve Dickson 2017-10-26 15:25:54 UTC
(In reply to David Jansen from comment #11)
> I've not been able to reproduce the issue later; it only happened at the
> time of this rpm upgrade. so something in the upgrade was slightly
> incompatible, but after a reboot, the new packages seem to work just fine. 
> Also, on systems where no files on nfs were open at the time of the update,
> there was no problem; the updates nfs daemons were properly restarted and
> later file access went just fine. 
> So the problem would only occur when files were open on a remote disk at the
> time of this update.
> 
> As far as I'm concerned, such a one-time issue is not too much of a problem

Ok... then I'll close this... Thanks!