Bug 1466221
Summary: | Update of nfs-utils breaks running nfs mounts and blocks reboot | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | David Jansen <jansen> |
Component: | systemd | Assignee: | systemd-maint |
Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 25 | CC: | bcodding, bfields, harald, jansen, jlayton, jsynacek, kay, lnykryn, lpoetter, mschmidt, msekleta, smayhew, ssahani, s, steved, systemd-maint, zbyszek |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-10-26 15:25:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
David Jansen
2017-06-29 09:40:38 UTC
(In reply to David Jansen from comment #0) > This went well when no disks of the > system were currently mounted, but if something was mounted from another > computer, the nfs server leaves some processes in a 'D' state, and > subsequent attempts to restart the nfs service complain about an unkillable > process. This also happens on a reboot, so direct access (reset button) is > required to recover. In the above, I'm not completely clear when you're talking about the NFS server and when you're talking about the NFS client. Are the processes in D state process on the NFS client, and is it the NFS client(s) that you had to reset, or the server, or both? If it's reproduceable, a network trace taken during the upgrade might be interesting. (So: run something like "tcpdump -s0 -wtmp.pcap host myclient', then examine tmp.pcap in wireshark and/or attach it to this bug. Processes can hang on both server and client. On nfs server: the nfs service is not restarting after the upgrade, or the old nfs service is hanging, and becomes unkillable, which also blocks a reboot (but after a hard reboot, nfs is ok again) On nfs clients that happened to have a disk mounted at the time of the update, the mount becomes unresponsive. 'umount -f -l' will take care of that, mostly. OK, and rereading the discussion, I now understand that it's the NFS server that you're upgrading. So the client hangs are a consequence of a server side bug. "subsequent attempts to restart the nfs service complain about an unkillable process." Do you have the exact text of the complaint? Journal entries like this, when the nfs-server was being restarted: Jun 29 10:13:00 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Processes still around after SIGKILL. Ignoring. Jun 29 10:13:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=24104 DF PROTO=TCP SPT=27698 DPT=22 WINDOW= Jun 29 10:14:10 jeker.strw.leidenuniv.nl audit[34920]: CRYPTO_KEY_USER pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:39:e3:e5:35:ff:4b:ee:6d:e Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37239): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:3 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37240): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:18:9f:f7:a7:80:ac:af:57:e7:eb:86:a5:51:54:96:70:25:15:74:3 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.265:37241): pid=34920 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37242): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37243): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:93:93:86:1c:91:f8:70:84:dc:a9:08:af:6d:15:68:32:02:c0:64:3 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37244): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:18:9f:f7:a7:80:ac:af:57:e7:eb:86:a5:51:54:96:70:25:15:74:3 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=2404 audit(1498724050.269:37245): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=SHA256:c9:3a:c2:34:80:27:4c:9c:dd:df:0d:cb:79:6c:fb:0b:d5:07:43:5 Jun 29 10:14:10 jeker.strw.leidenuniv.nl kernel: audit: type=1112 audit(1498724050.269:37246): pid=34919 uid=0 auid=4294967295 ses=4294967295 msg='op=login acct="(unknown)" exe="/usr/sbin/sshd" hostname=? addr=132.229.214.166 terminal=ssh Jun 29 10:14:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: State 'stop-post' timed out. Terminating. Jun 29 10:14:32 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=19615 DF PROTO=TCP SPT=43453 DPT=22 WINDOW= Jun 29 10:15:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=44 ID=37459 DF PROTO=TCP SPT=15577 DPT=22 WINDOW= Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: State 'stop-final-sigterm' timed out. Killing. Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Killing process 33804 (rpc.nfsd) with signal SIGKILL. Jun 29 10:16:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=46 ID=8574 DF PROTO=TCP SPT=35115 DPT=22 WINDOW=2 Jun 29 10:17:27 jeker.strw.leidenuniv.nl kernel: IPTABLES-SSH IN=enp9s0 OUT= MAC=18:03:73:1e:3c:96:00:08:e3:ff:fc:6c:08:00 SRC=116.31.116.6 DST=132.229.222.19 LEN=60 TOS=0x00 PREC=0x00 TTL=47 ID=34445 DF PROTO=TCP SPT=59231 DPT=22 WINDOW= Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Processes still around after final SIGKILL. Entering failed mode. Jun 29 10:17:31 jeker.strw.leidenuniv.nl audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nfs-server comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS server and services. Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Unit entered failed state. Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: Failed with result 'timeout'. Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFSv4 ID-name mapping service... Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFS Mount Daemon... Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFSv4 ID-name mapping service. Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS Mount Daemon. Maybe the systemd people can tell us what's going on here... I have no idea.. (In reply to David Jansen from comment #4) > Jun 29 10:13:00 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > Processes still around after SIGKILL. Ignoring. Wasn't there an earlier warning message (about 90 seconds before this one) saying "State 'stop-sigterm' timed out. Killing."? The RPM scriptlets of nfs-utils called the command "systemctl try-restart nfs-server". systemd tried to stop the service. Presumably it first ran the ExecStop command of the service. Then it noticed there were still some processes alive in the service's cgroup, so it sent SIGTERM to them. After a timeout when it saw they were still around, it sent them SIGKILL. They still did not exit. This is when systemd emits the warning. Probably because they are stuck in the kernel in TASK_UNINTERRUPTIBLE ('D') state. Then systemd went on to run the ExecStopPost commands... > Jun 29 10:14:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > State 'stop-post' timed out. Terminating. An ExecStopStop command got stuck too. It did not finish in 90 seconds. systemd sent SIGTERM... > Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > State 'stop-final-sigterm' timed out. Killing. > Jun 29 10:16:01 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > Killing process 33804 (rpc.nfsd) with signal SIGKILL. 90 seconds later the process was still alive. systemd sent SIGKILL... > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > Processes still around after final SIGKILL. Entering failed mode. And the process is still stuck. systemd considers the service failed. > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS server and > services. > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > Unit entered failed state. > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: nfs-server.service: > Failed with result 'timeout'. Services whose stopping is ordered after nfs-server.service were then stopped: > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFSv4 ID-name > mapping service... > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopping NFS Mount > Daemon... > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFSv4 ID-name > mapping service. > Jun 29 10:17:31 jeker.strw.leidenuniv.nl systemd[1]: Stopped NFS Mount > Daemon. I see no indication of a systemd bug. Thank you for your analogy... Basically is sounds like a process was hung on an NFS mount or an NFS umount was hung... Would these hang(s) because by the network being brought down before NFS? Steve, nobody mentioned anything about the network going down. And it's the *server* that's hanging. Shouldn't it be possible to restart the server service regardless of the network connection state? (In reply to Michal Schmidt from comment #8) > Steve, > nobody mentioned anything about the network going down. The description talks about this happening during a reboot so I'm assuming the system was on its way down > And it's the *server* that's hanging. Shouldn't it be possible to restart > the server service regardless of the network connection state? Process in the 'D' state generally happen on the client which is the confusing part (at least for me :-) ) But to answer your question, in general it shouldn't matter but that may not be the case... With v3 the clients have to be notified about the reboot (but that happens in background so if there is no network it *should* keep trying). Also loopback interface needs to be around so the server can register with rpcbind. I'm not sure what the server does if rpcbind is not around. So it is best to have a functioning network when the server is brought up and down... which I'm hoping the nfs-server service file specifies. The stuck processes appeared already during the nfs-utils package update. I tried to reproduce this, but failed. You mention v3. I only tried the default, which is v4 (?) on a recent Fedora, so that may be why it did not reproduce. I've not been able to reproduce the issue later; it only happened at the time of this rpm upgrade. so something in the upgrade was slightly incompatible, but after a reboot, the new packages seem to work just fine. Also, on systems where no files on nfs were open at the time of the update, there was no problem; the updates nfs daemons were properly restarted and later file access went just fine. So the problem would only occur when files were open on a remote disk at the time of this update. As far as I'm concerned, such a one-time issue is not too much of a problem (In reply to David Jansen from comment #11) > I've not been able to reproduce the issue later; it only happened at the > time of this rpm upgrade. so something in the upgrade was slightly > incompatible, but after a reboot, the new packages seem to work just fine. > Also, on systems where no files on nfs were open at the time of the update, > there was no problem; the updates nfs daemons were properly restarted and > later file access went just fine. > So the problem would only occur when files were open on a remote disk at the > time of this update. > > As far as I'm concerned, such a one-time issue is not too much of a problem Ok... then I'll close this... Thanks! |