Bug 1409012

Summary: nfs-server runs before network is ready; "failed to resolve" for all hosts; nothing is exported
Product: Red Hat Enterprise Linux 7 Reporter: Chris Schanzle <bugzilla>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED ERRATA QA Contact: Yongcheng Yang <yoyang>
Severity: low Docs Contact:
Priority: low    
Version: 7.3CC: bugzilla, chunwang, eguan, jiyin, steved, yoyang
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: nfs-utils-1.3.0-0.41.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-01 19:48:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chris Schanzle 2016-12-28 22:47:33 UTC
Greetings, this is coming from a community user running CentOS 7.3.  I hope it is welcomed as the effort to write this is an expression of my appreciation of the work you share.

Description of problem:
systemd may start the nfs-server service before the network is up.  Thus it cannot resolve hostnames in /etc/exports.
This is never retried and the server ends up not serving files as desired.


Version-Release number of selected component (if applicable):
nfs-utils-1.3.0-0.33.el7.x86_64

How reproducible:
100% since the upgrade to 7.3 (never noticed this in 7.1 or 7.2).  System is a spinning-rust HD-based Dell T5600 with PERC H710P (LSI Logic / Symbios Logic MegaRAID SAS 2208).
However, bootup can be racey.  Could reproduce 3 out of 7 tries on another ssd-based system, where things where twice things were exported to a *partial* list of hostnames.


Steps to Reproduce:

create /etc/exports exporting to several DNS-only hostnames
TIP: export to one hostname that is not resolvable so you always get an entry in the journal.
I use a wired network connection, configured via DHCP, which can take several seconds to get an IP.
systemctl enable nfs-server
I had enabled NetworkManager-wait-online.service
boot system, check "showmount -e" or "exportfs"

Actual results:
filesystems are not exported; output of "exportfs" is empty.

The below logs show exportfs is run before the network is up.  Other boots with the issue shows about an 8 second delay to get an IP.

journalctl -b -u nfs-server -u NetworkManager
...
Dec 22 15:51:27 hercules.cam.nist.gov systemd[1]: Starting Network Manager...
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.2791] NetworkManager (version 1.4.0-13.el7_3) is starting...
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.2793] Read config: /etc/NetworkManager/NetworkManager.conf
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.3301] manager[0x7fe9d45a90d0]: monitoring kernel firmware directory '/lib/firmwar
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.3452] dns-mgr[0x7fe9d45a2040]: init: dns=default, rc-manager=file
Dec 22 15:51:28 hercules.cam.nist.gov systemd[1]: Started Network Manager.
Dec 22 15:51:28 hercules.cam.nist.gov systemd[1]: Starting NFS server and services...
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.5511] settings: loaded plugin ifcfg-rh: (c) 2007 - 2015 Red Hat, Inc.  To report 
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve xanadu.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve xanadu.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve tigger.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve tigger.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve plum.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov exportfs[1320]: exportfs: Failed to resolve plum.cam.nist.gov
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.6193] settings: loaded plugin iBFT: (c) 2014 Red Hat, Inc.  To report bugs please
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.6193] settings: loaded plugin keyfile: (c) 2007 - 2015 Red Hat, Inc.  To report b
Dec 22 15:51:28 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439888.6204] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em1 (d924258b
Dec 22 15:51:28 hercules.cam.nist.gov systemd[1]: Started NFS server and services.
...
Dec 22 15:51:34 hercules.cam.nist.gov dhclient[1426]: DHCPREQUEST on em1 to 255.255.255.255 port 67 (xid=0x5b075789)
Dec 22 15:51:41 hercules.cam.nist.gov dhclient[1426]: DHCPREQUEST on em1 to 255.255.255.255 port 67 (xid=0x5b075789)
Dec 22 15:51:41 hercules.cam.nist.gov dhclient[1426]: DHCPACK from 129.6.88.116 (xid=0x5b075789)
Dec 22 15:51:41 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439901.5860] dhcp4 (em1):   address 129.6.88.21
Dec 22 15:51:41 hercules.cam.nist.gov NetworkManager[1235]: <info>  [1482439901.5860] dhcp4 (em1):   plen 24 (255.255.255.0)
...


After creating the following, the system is booting with exported filesystems 4 times out of 4 tries.

$ cat /etc/systemd/system/nfs-server.service.d/wait.conf
[Unit]
After=NetworkManager-wait-online.service

Consider also adding Requires (unlikely if IP addresses are used in /etc/exports, don't want nfs-service to fail) or Wants (highly likely, so NMwo will be started even if not enabled already, if I rtfm correctly).

Thank you!

Comment 1 Steve Dickson 2017-01-04 19:08:10 UTC
What I don't understand why only your site is seeing 
this problem... I'm exporting dns hostname is a very
common... Plus Requires= network.target in the 
nfs-server.service should bring DNS up.

Comment 2 Chris Schanzle 2017-01-05 02:57:11 UTC
>Plus Requires= network.target in the nfs-server.service should bring DNS up.

Wrong, per the fine manual!

network.target means little per the documentation referenced in the unit file -- https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

If you really want DNS resolution to work, you need to wait for an interface to be up and online, not just have the stack initialized.  Hence my suggestion for wait-online, but per those docs, perhaps I should have used:
After=network-online.target
Wants=network-online.target

Comment 3 Chris Schanzle 2017-01-05 18:35:20 UTC
Not that I understand this, but experimentally speaking, with /etc/systemd/system/nscd.service.d/network-online.conf containing:

[Unit]
After=network-online.target
Wants=network-online.target

and rebooting, host resolution was not working again.  By looking at the journal:

journalctl -b -u nscd -u NetworkManager -u network-online.target -u NetworkManager-wait-online.target

it was clear systemd started nscd before NetworkManager, well-before the network was even configured.  Here are some snippets:

Jan 05 12:17:18 localhost.localdomain systemd[1]: Starting Name Service Cache Daemon...
Jan 05 12:17:20 localhost.localdomain systemd[1]: Starting Network Manager...
Jan 05 12:17:20 localhost.localdomain systemd[1]: Started Name Service Cache Daemon.
Jan 05 12:17:22 localhost.localdomain NetworkManager[1360]: <info>  [1483636642.0293] NetworkManager (version 1.4.2-2.fc25) is starting...
Jan 05 12:17:33 localhost.localdomain dhclient[1540]: bound to 129.6.88.146 -- renewal in 158521 seconds.
Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483636653.4606] device (eno1): Activation: successful, device activated.
Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483636653.4627] connectivity: check for uri 'http://fedoraproject.org/static/hotspot.txt' failed with 'Error resolving 'fedoraproject.org': Name or service not known'

~20 mins passes...I ran "nscd -i hosts" to get resolution working again:

Jan 05 12:37:34 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483637854.9200] manager: NetworkManager state is now CONNECTED_GLOBAL

So it seems After+Wants=network-online doesn't do what is stated in the docs, at least in Fedora 25.  Using After+Wants=NetworkManager-wait-online.service seems to do what is desired/expected.

But still, it does not explain the issues with nscd.

Perhaps worth noting, in one of these reboots, it appeared IPv6 hostname resolution worked, but not IPv4.  While our network supports IPv6, the system does not have an IPv6 address.  E.g.,

[root@grad ~]# getent hosts  fedoraproject.org
2607:f188::dead:beef:cafe:fed1 fedoraproject.org
2605:bc80:3010:600:dead:beef:cafe:feda fedoraproject.org
2604:1580:fe00:0:dead:beef:cafe:fed1 fedoraproject.org
2605:bc80:3010:600:dead:beef:cafe:fed9 fedoraproject.org
2610:28:3090:3001:dead:beef:cafe:fed3 fedoraproject.org

[root@grad ~]# telnet www.google.com
telnet: www.google.com: Name or service not known

[root@grad ~]# getent hosts www.google.com
2607:f8b0:400c:c0b::69 www.google.com
[root@grad ~]# getent hosts www.microsoft.com
[root@grad ~]# getent hosts www.amazon.com

Comment 4 Chris Schanzle 2017-01-05 18:39:38 UTC
My apologies, the above comment was meant for bug #1367565, but some overlaps here as well. [humble shrug]

Comment 7 Steve Dickson 2017-04-26 16:59:35 UTC
This is also needed

commit 09e5c6c2a3f8eac91d5353e6d4ff6aee7757ab08
Author: Steve Dickson <steved>
Date:   Mon Apr 24 11:25:39 2017 -0400

    systemd: Afters are also needed for the Wants=network-online.target

Comment 9 Yongcheng Yang 2017-04-28 02:00:12 UTC
Seems Chunyu has been tracking this issue in upstream.
If possible, please help to generate/update testcase to cover it.

Comment 12 Yongcheng Yang 2017-06-20 08:15:24 UTC
As we have never seen this problem within our test environment, just
verified the systemd scripts have changed to use "network-online".

Moving to VERIFIED and will update testcase to cover it.
-----------------------------------------------------------------
[root@ ~]# rpm -q nfs-utils
nfs-utils-1.3.0-0.44.el7.x86_64
[root@ ~]# systemctl list-dependencies nfs | grep network-online
● ├─network-online.target
[root@ ~]# 
[root@ ~]# systemctl cat nfs | grep network-online
Wants=rpcbind.socket network-online.target
After= network-online.target local-fs.target
[root@ ~]# 
[root@ ~]# systemctl cat nfs-mountd.service | grep network-online
Wants=network-online.target
After=network-online.target local-fs.target
[root@ ~]# 
[root@ ~]# systemctl cat rpc-statd-notify.service | grep network-online
Wants=network-online.target
After=local-fs.target network-online.target nss-lookup.target
[root@ ~]# 
[root@ ~]# systemctl cat rpc-statd.service | grep network-online
Wants=network-online.target
After=network-online.target nss-lookup.target rpcbind.socket
[root@ ~]# 


However, there still exits "network.target", don't know whether we
should replace all of them with "network-online.target".

Will read the doc and test some more to check it.
-----------------------------------------------------------------
[root@ ~]# rpm -ql nfs-utils | grep .*service$ | xargs grep "network.target"
/usr/lib/systemd/system/nfs-server.service:Requires= network.target proc-fs-nfsd.mount
/usr/lib/systemd/system/nfs.service:Requires= network.target proc-fs-nfsd.mount
[root@ ~]# systemctl cat nfs | grep 'network.target'
Requires= network.target proc-fs-nfsd.mount
[root@ ~]#

Comment 14 errata-xmlrpc 2017-08-01 19:48:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2233