Description of problem:
When a user stops dnsmasq with "systemctl stop dnsmasq", atomic-openshift-node is stopped along with it. If the user then recovers dnsmasq with "systemctl start dnsmasq", atomic-openshift-node does not recover automatically, so the pod never reaches the Running state and stays Pending indefinitely. The user has to start atomic-openshift-node manually to get the pod running.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.6.173.0.48-1.git.0.1609d30.el7.noarch

How reproducible:
Always, by running the commands below from the CLI.

Steps to Reproduce:
1. systemctl stop dnsmasq
2. oadm registry
3. systemctl start dnsmasq

Actual results:
The pod never reaches the Running state; it stays Pending forever because atomic-openshift-node does not recover automatically.

Expected results:
OpenShift should automatically recover atomic-openshift-node when dnsmasq is started again.

Additional info:
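A quick way to see why the node stops together with dnsmasq is to inspect the node unit's dependency properties (the unit names below assume the default OCP 3.x node service, atomic-openshift-node):

  # which dependency types tie the node service to dnsmasq?
  systemctl show atomic-openshift-node -p Requires -p Wants -p PartOf -p After
  # reverse view from dnsmasq's side
  systemctl list-dependencies --reverse dnsmasq

If the node unit lists dnsmasq under Requires= (or PartOf=), an explicit "systemctl stop dnsmasq" is propagated to the node service by systemd, which matches the behaviour described above.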
The issue seems more related to Networking or Master. Changing the component to Networking and also CC'ing the Master component's default assignee :)
@sdodson: Do you know of any systemd magic we can do to the unit files here?
PartOf= causes stop/restart to propagate to us whenever the target unit gets those events. WantedBy= will cause a start of dnsmasq to also start the node, which seems like it would fix the concern here, but to me it would be unexpected for starting dnsmasq to trigger the node service if the node service had been stopped for some other reason. So perhaps:

[Unit]
PartOf=dnsmasq.service

[Install]
WantedBy=dnsmasq.service

Alternatively, we could just remove the Requires= and rely on Wants= to ensure that dnsmasq is requested to start when the node starts; but then if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.

See man 5 systemd.unit for a more thorough description of these options.
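If it helps to see the difference in isolation before touching the real unit, a throwaway pair of units (names entirely made up for the experiment) demonstrates the propagation behaviour:

  # /etc/systemd/system/fake-dnsmasq.service  (hypothetical test unit)
  [Service]
  ExecStart=/bin/sleep infinity

  # /etc/systemd/system/fake-node.service  (hypothetical test unit)
  [Unit]
  Requires=fake-dnsmasq.service   # swap to Wants= to see the non-propagating variant
  After=fake-dnsmasq.service
  [Service]
  ExecStart=/bin/sleep infinity

  systemctl daemon-reload
  systemctl start fake-node       # pulls in and starts fake-dnsmasq as well
  systemctl stop fake-dnsmasq     # with Requires=, fake-node is stopped too; with Wants=, it keeps running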
> Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts but if dnsmasq were stopped the node would continue running and pods would immediately have broken dns.

This option seems preferable: dnsmasq is wanted for full functionality, but the node process does not need a restart just because dnsmasq restarts. The possibility of someone stopping dnsmasq and thereby breaking DNS is a fair trade-off; the admin stopped it, so the admin has to start it again for things to work. As an example, this is what we do with openvswitch.

Possible fix in PR: https://github.com/openshift/openshift-ansible/pull/6843

@Scott, what do you think?
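For clarity, the shape of the change in the node unit's [Unit] section is roughly the following (a sketch only; the PR above is authoritative for what actually ships):

  [Unit]
  # Requires=dnsmasq.service    <- dropped: this is what made a dnsmasq stop cascade to the node
  Wants=dnsmasq.service         # dnsmasq is still pulled in when the node starts
  After=dnsmasq.service         # and the node is still ordered after it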
Yeah that sounds fine to me, /lgtm'd that PR
Tested on OCP 3.9.0-0.41.0 with openshift-ansible-3.9.0-0.41.0; the issue has been fixed. The node service is no longer stopped when dnsmasq is stopped.

[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:40:56 EST; 13s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 12s ago
[root@ip-172-18-3-105 ~]# systemctl stop dnsmasq
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: inactive (dead) since Thu 2018-02-08 02:41:35 EST; 1s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 40s ago
[root@ip-172-18-3-105 ~]# systemctl restart atomic-openshift-node
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:41:44 EST; 2s ago
   Active: active (running) since Thu 2018-02-08 02:41:45 EST; 1s ago
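One extra check that could be added to the verification, assuming the default unit name: confirm which dependency type now points at dnsmasq:

  systemctl show atomic-openshift-node -p Requires -p Wants | grep dnsmasq

If the fix took the Wants=-only approach discussed above, dnsmasq.service should appear under Wants= and no longer under Requires=.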
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489