Description of problem:
Since 3.2, OCP has used dnsmasq to allow resolution of internal services from pods, etc. The way this is done causes a race condition with the boot process. NetworkManager kicks off a dispatcher script that modifies resolv.conf so that dnsmasq is inserted in front of everything; during that window, if another service (in our case NFS) tries to do DNS lookups, the lookups fail because no DNS servers are defined, and as a result no NFS mounts are mounted. You can see how this is problematic when the cluster is only healthy and working when NFS is working, but NFS cannot work because it cannot resolve DNS -- hence the race condition.

Version-Release number of selected component (if applicable):
OCP 3.3

How reproducible:
Intermittent. On reboots there is no guarantee it will happen, since the whole thing hinges on timing and on the NetworkManager dispatcher script running at the same time as other services.

Steps to Reproduce:
1. Reboot node
2. Observe error in logs: "mount[1427]: mount.nfs: Failed to resolve server: Name or service not known"

Actual results:
(see logs in comment #1)

Expected results:
Servers resolve reliably after reboot.

Additional info:
This is fairly catastrophic: right now I cannot rely on evacuating a node and reliably being able to reboot it and bring it back into the fold.
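For anyone skimming, here is a minimal sketch of the failure mode. This is NOT the shipped 99-origin-dns.sh, and 127.0.0.1 is only a placeholder nameserver; the point is the window between truncating resolv.conf and writing the new entry, which is where a lookup from mount.nfs can land.

#!/bin/bash
# Racy rewrite: resolv.conf is briefly empty, so any lookup in that
# window sees no nameservers and fails.
: > /etc/resolv.conf
echo "nameserver 127.0.0.1" >> /etc/resolv.conf

# More atomic variant: build the file elsewhere, then rename it into
# place, so readers always see either the old or the new contents.
tmp=$(mktemp /etc/resolv.conf.XXXXXX)
echo "nameserver 127.0.0.1" > "$tmp"
mv -f "$tmp" /etc/resolv.conf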
Since this is spawning off a case I opened, I figured I should do some due diligence and offer a solution that may or may not be suitable. The issue is timing, mainly the timing between the nfs*.service units and NetworkManager-dispatcher.service, since the dispatcher is what OCP uses to execute its script and make the dnsmasq changes. The simplest solution is to add After=nfs*.service to NetworkManager-dispatcher.service. Mind you: 1) I don't think you can wildcard, so a list would have to be generated; 2) I am hitting this bug with NFS, but nothing keeps it from popping up with other services that need to do DNS lookups during boot, so this may not be the most optimal solution. The other option is to add Type=idle to NetworkManager-dispatcher.service, which should delay its execution until the remaining startup jobs have been dispatched, but again a caveat: I don't know who else needs the dispatcher service to run scripts other than OCP. Anyway, that's my 2c so far.
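For illustration only, the suggested ordering tweak could be expressed as a systemd drop-in rather than an edit of the packaged unit. The unit names used here (remote-fs.target, nfs-client.target) are assumptions standing in for the real list, since, as noted above, After= does not take wildcards.

mkdir -p /etc/systemd/system/NetworkManager-dispatcher.service.d
cat > /etc/systemd/system/NetworkManager-dispatcher.service.d/10-after-nfs.conf <<'EOF'
[Unit]
# Run the dispatcher only after the NFS-related units (illustrative
# names; replace with the generated list of real units).
After=remote-fs.target nfs-client.target
EOF
systemctl daemon-reload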
I think this is an installer issue, reassigning.
Yes, this comes from the ansible playbooks. For what it's worth, as a workaround I currently have After=NetworkManager-dispatcher.service in nfs-client.service, which SEEMS to have solved the issue. But as I said, I don't know that it's guaranteed not to happen with some other network service that needs to do DNS lookups while the dispatcher script does its thing.
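A sketch of that workaround as a drop-in, so the vendor unit stays untouched. Whether the unit on a given host is nfs-client.service or nfs-client.target is an assumption here; adjust to match whatever the mounts actually hang off of.

mkdir -p /etc/systemd/system/nfs-client.service.d
cat > /etc/systemd/system/nfs-client.service.d/10-wait-for-dispatcher.conf <<'EOF'
[Unit]
# Hold NFS client startup until the NetworkManager dispatcher (and thus
# the dnsmasq/resolv.conf rewrite) has finished.
After=NetworkManager-dispatcher.service
EOF
systemctl daemon-reload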
Boris,

We've made some improvements to make the configuration a bit more atomic, which will hopefully eliminate this. Mind checking whether you have a diff between what's in /etc/NetworkManager/dispatcher.d/99-origin-dns.sh and the latest from here?

https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh

If there is a diff, and you update a host to it and reboot (without applying your workaround), does it improve?

Thanks, Scott
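One way to do that comparison directly on the host (assuming curl and outbound HTTPS; the paths and URL are the ones from the comment above):

curl -s https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh \
  | diff -u /etc/NetworkManager/dispatcher.d/99-origin-dns.sh -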
I will need to do some coordinating to get this tested; it might be this afternoon before I can get everything situated for a reboot test. Note, however, that I will have to do this a number of times: as the original description stated, it doesn't happen every time, and sometimes it doesn't happen at all. It ends up being a matter of timing and whether one service does something in the middle of another.
OK, I have run it with the new 99-origin-dns.sh and haven't run into the problem. However, I am wary of declaring this fixed, since the nature of the bug is a timing issue: ten reboots in a row without triggering it doesn't mean the eleventh wouldn't have. To me the best solution here is still something that leverages systemd and its Before=/After= directives to get the ordering correct.
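While weighing an ordering fix, a per-boot look at when the dispatcher and the mount units actually became active can show whether they overlapped on a given boot. This uses standard systemd tooling only; nothing OCP-specific is assumed.

systemd-analyze critical-chain NetworkManager-dispatcher.service
systemd-analyze critical-chain remote-fs.target
# Or the raw timestamp for a specific unit:
systemctl show -p ActiveEnterTimestamp NetworkManager-dispatcher.service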
So I just ran a rolling restart and the NFS service failed; the new code you have did not solve the problem, it just made it happen less often. :/
Ok, thanks for getting back to me. We'll see about making it happen post boot.
The dispatcher script has been updated to be more atomic but I'm not sure it's addressed this. Requesting that QE attempt to reproduce this with the current codebase.
Tried with openshift-ansible-3.5.110-1.git.0.6f1f193.el7.noarch.rpm (recent changes to `99-origin-dns.sh` are not applied to the package).

Steps:
1) Trigger an all-in-one cluster with NFS docker-registry backend storage.

# cat inventory
[OSEv3:children]
masters
nodes
nfs

[OSEv3:vars]
<--snip-->
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_nfs_options="*(rw,root_squash,sync,no_wdelay)"
openshift_hosted_registry_storage_nfs_directory=/var/lib/exports
openshift_hosted_registry_storage_volume_name=regpv
openshift_hosted_registry_storage_access_modes=["ReadWriteMany"]
openshift_hosted_registry_storage_volume_size=17G

#openshift host definition start
[masters]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com

[nodes]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_hostname=host-8-241-31.host.centralci.eng.rdu2.redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}" openshift_schedulable=true

[nfs]
host-8-241-31.host.centralci.eng.rdu2.redhat.com ansible_user=root ansible_ssh_user=root
#openshift host definition end

2) Reboot the server repeatedly after completing the installation.
3) Check whether there are logs indicating NFS startup failure as in comment 1.

Results:
2) Installation succeeded.
3) No NFS-related failures found in the logs after *20* reboot attempts, and all pods still worked well. The logs for the NFS and NetworkManager services will be attached.
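A simple per-boot check for step 3 could be a journal grep for the message from the original description (the exact invocation is an assumption; any equivalent log search works):

journalctl -b --no-pager | grep 'mount.nfs: Failed to resolve server' \
  && echo "failure reproduced this boot" \
  || echo "no NFS resolution failure this boot"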
OS info:

# uname -r
3.10.0-693.el7.x86_64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
# openshift version
openshift v3.5.5.31.19
kubernetes v1.5.2+43a9be4
etcd 3.1.0
# rpm -qa |grep nfs
nfs-utils-1.3.0-0.48.el7.x86_64
libnfsidmap-0.25-17.el7.x86_64

QE could not reproduce the issue. Please let us know if this is sufficient.
Hi Brennan, QE is having trouble reproducing this. Any advice?
Have not seen the failure so far during 3.7 testing. Moving to verified. Please reopen and provide detailed info if this is still an issue for you. Verified with openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch.rpm.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188