Bug 1410288 - DNSMasq and NetworkManager scripts cause boot issues with network resources
Summary: DNSMasq and NetworkManager scripts cause boot issues with network resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Scott Dodson
QA Contact: Gan Huang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-05 01:45 UTC by Brennan Vincello
Modified: 2017-11-28 21:52 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The NetworkManager dispatcher script responsible for configuring a host to use dnsmasq operated in a non-atomic manner, which could have resulted in failed DNS queries during bootup. The script has been refactored to ensure that required services are verified before /etc/resolv.conf is reconfigured.
Clone Of:
Environment:
Last Closed: 2017-11-28 21:52:23 UTC
Target Upstream Version:
sdodson: needinfo-


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Brennan Vincello 2017-01-05 01:45:27 UTC
Description of problem:

Since 3.2, OCP has been using dnsmasq to allow resolution of internal services from pods and other cluster components. The way this is done, though, causes a race condition with the boot process.

NetworkManager kicks off a script that modifies resolv.conf and inserts dnsmasq in front of everything. During that window, however, if another service (in our case NFS) tries to do DNS lookups, those lookups fail because no DNS servers are defined, which in turn causes no NFS mounts to be mounted.

You can see how this can be problematic: the cluster is only healthy and working when NFS is working, but NFS cannot work because it cannot resolve DNS -- hence the race condition.
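For illustration only (a temp file stands in for /etc/resolv.conf, and the real dispatcher script is more involved), the failure window of a non-atomic truncate-then-append rewrite can be shown deterministically:

```shell
#!/bin/sh
# Simulated non-atomic rewrite of resolv.conf: between the truncate and
# the append, any DNS lookup sees a file with no nameservers at all.
resolv="$(mktemp)"
printf 'nameserver 10.0.0.2\n' > "$resolv"     # state before the dispatcher runs

: > "$resolv"                                  # step 1: truncate (window opens)
if ! grep -q '^nameserver' "$resolv"; then
  echo "window open: no nameservers, lookups fail (e.g. mount.nfs)"
fi
printf 'nameserver 127.0.0.1\n' >> "$resolv"   # step 2: point at dnsmasq (window closes)
grep '^nameserver' "$resolv"
rm -f "$resolv"
```

An atomic replacement (write a complete temp file, then mv it over resolv.conf) removes this window, because the rename swaps the file in a single step and readers only ever see the old or the new contents.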

Version-Release number of selected component (if applicable): OCP 3.3

How reproducible: Intermittent. 

On reboots, there is no guarantee that it will happen, as the whole thing hinges on timing and on the NetworkManager script running at the same time as other services.

Steps to Reproduce:
1. Reboot node
2. Observe error in logs: 
"mount[1427]: mount.nfs: Failed to resolve server: Name or service not known"

Actual results: 

(see logs comment #1)

Expected results: 

Resolve servers more reliably after reboot.

Additional info:

This is fairly catastrophic: right now, I cannot rely on evacuating a node, rebooting it, and reliably bringing it back into the fold.

Comment 2 Boris Kurktchiev 2017-01-05 19:07:43 UTC
Since this is spawning off a case I opened, I figured I should do some due diligence and offer a solution that may or may not be suitable.

The issue is timing, mainly between the nfs*.service units and NetworkManager-dispatcher.service, since the dispatcher is what OCP uses to execute its script and make the dnsmasq changes. The simplest solution is to add

After=nfs*.service to NetworkManager-dispatcher.service

Mind you: 1) I don't think you can wildcard unit names, so a list would have to be generated; 2) I am encountering this bug with NFS, but nothing keeps it from popping up with other services that need to do DNS lookups during bootup, so this may not really be the optimal solution.

The other option is to add
Type=idle to NetworkManager-dispatcher.service, which should make it the last service to run. But again, a caveat: I don't know who else, other than OCP, needs that dispatcher service to run scripts.
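As a sketch of the first suggestion (unit names here are examples; nfs-client.target is the usual client-side unit, and since systemd does not glob unit names, each one must be listed explicitly), the ordering could be expressed as a drop-in file:

```shell
#!/bin/sh
# Sketch: generate a systemd drop-in that orders the dispatcher after the
# NFS client units. The real target directory would be
# /etc/systemd/system/NetworkManager-dispatcher.service.d; a temp dir is
# used here so the sketch runs without root.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
cat > "$DROPIN_DIR/10-after-nfs.conf" <<'EOF'
[Unit]
# systemd does not expand nfs*.service, so list each NFS unit explicitly.
After=nfs-client.target rpc-statd.service
EOF
echo "wrote $DROPIN_DIR/10-after-nfs.conf"
```

After copying the file into the real drop-in directory, `systemctl daemon-reload` makes the ordering take effect on the next boot.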

Anyway, that's my 2c so far.

Comment 3 Ben Bennett 2017-01-06 14:59:43 UTC
I think this is an installer issue, reassigning.

Comment 4 Boris Kurktchiev 2017-01-06 15:01:31 UTC
Yes, this comes from the Ansible playbooks. For what it's worth, as a workaround I currently have After=NetworkManager-dispatcher.service in nfs-client.service, which SEEMS to have solved the issue. But as I said, I don't know that this guarantees it won't happen with some other network service that needs to do DNS lookups while the dispatcher script does its thing.

Comment 5 Scott Dodson 2017-01-06 15:29:19 UTC
Boris,

We've made some improvements to make the configuration a bit more atomic, which would hopefully eliminate this. Would you mind checking whether you have a diff between what's in /etc/NetworkManager/dispatcher.d/99-origin-dns.sh and the latest from here?

https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh

If there's a diff, and you update a host to that version and reboot (without applying your workaround), does it improve?

Thanks,
Scott
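The "more atomic" approach referenced above can be sketched as follows. This is a simplified stand-in for 99-origin-dns.sh, not its actual contents; `check_resolver`, the dig probe, and the paths are illustrative assumptions:

```shell
#!/bin/sh
# Illustrative sketch only: verify the local resolver answers before
# touching resolv.conf, then swap the file in one atomic rename so no
# reader ever sees a half-written config.
check_resolver() {
  # Hypothetical health check: succeeds only if dnsmasq on 127.0.0.1
  # actually answers a query (dig is provided by bind-utils).
  dig @127.0.0.1 +time=2 +tries=1 . >/dev/null 2>&1
}

update_resolv_conf() {
  resolv="${1:-/etc/resolv.conf}"
  check_resolver || return 1             # leave resolv.conf alone until dnsmasq is up
  tmp="$(mktemp "${resolv}.XXXXXX")" || return 1
  printf 'nameserver 127.0.0.1\n' > "$tmp"
  mv -f "$tmp" "$resolv"                 # rename(2) is atomic on the same filesystem
}
```

Because the old file stays fully intact until the mv, a DNS lookup during boot sees either the original nameservers or the dnsmasq one, never an empty list.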

Comment 6 Boris Kurktchiev 2017-01-06 15:50:42 UTC
I will need to do some coordinating to get this tested; it might be this afternoon before I can get everything situated for a reboot test. Note, however, that I will have to do this a bunch of times. As the original description stated, it doesn't happen every time, and sometimes it doesn't happen at all; it just ends up being a timing thing, and depends on whether one service does something in the middle of another.

Comment 7 Boris Kurktchiev 2017-01-12 15:10:03 UTC
OK, I have run it with the new 99-origin-dns.sh and haven't run into the problem. However, I am wary of declaring this fixed, as the nature of the bug is a timing thing, and even though 10 reboots in a row did not trigger it, that does not mean the 11th would not have.

To me, the best solution here is still something that leverages systemd and its Before=/After= directives to get the timing correct.

Comment 8 Boris Kurktchiev 2017-01-18 15:47:22 UTC
So I just ran a rolling restart and the NFS service failed, so the new code did not solve the problem, it just made it happen less often :/

Comment 9 Scott Dodson 2017-01-18 15:55:24 UTC
OK, thanks for getting back to me. We'll see about making it happen post-boot.

Comment 12 Scott Dodson 2017-08-24 18:51:52 UTC
The dispatcher script has been updated to be more atomic, but I'm not sure that has addressed this. Requesting that QE attempt to reproduce this with the current codebase.

Comment 13 Gan Huang 2017-08-28 08:21:36 UTC
Tried with openshift-ansible-3.5.110-1.git.0.6f1f193.el7.noarch.rpm (recent changes to `99-origin-dns.sh` are not included in this package).

Steps:

1) Trigger an all-in-one cluster installation with NFS as the docker-registry backend storage.

#cat inventory
[OSEv3:children]
masters
nodes
nfs

[OSEv3:vars]
<--snip-->
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_nfs_options="*(rw,root_squash,sync,no_wdelay)"
openshift_hosted_registry_storage_nfs_directory=/var/lib/exports
openshift_hosted_registry_storage_volume_name=regpv
openshift_hosted_registry_storage_access_modes=["ReadWriteMany"]
openshift_hosted_registry_storage_volume_size=17G

#openshift host definition start
[masters]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com

[nodes]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}" openshift_schedulable=true

[nfs]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root 

#openshift host definition end

2) Reboot the server repeatedly after completing the installation

3) Check whether there were logs indicating an NFS startup failure, as in comment 1

Results:

2) Installation succeeded

3) No NFS-related failures were found in the logs after *20* reboot attempts, and all the pods still worked well.

The logs for the NFS and NetworkManager services will be attached.

Comment 15 Gan Huang 2017-08-28 08:36:55 UTC
OS info:

# uname -r
3.10.0-693.el7.x86_64

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

# openshift version
openshift v3.5.5.31.19
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# rpm -qa |grep nfs
nfs-utils-1.3.0-0.48.el7.x86_64
libnfsidmap-0.25-17.el7.x86_64

QE could not reproduce the issue. Please let us know if this is sufficient.

Comment 16 Gan Huang 2017-08-29 08:34:16 UTC
Hi, Brennan,

QE is having trouble reproducing this. Any advice?

Comment 18 Gan Huang 2017-09-15 07:06:51 UTC
Have not seen the failure so far during 3.7 testing.

Moving to verified. Please reopen and provide detailed info if it's still an issue for you.

Verified with openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch.rpm

Comment 22 errata-xmlrpc 2017-11-28 21:52:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

