Bug 1532960

Summary: No Auto-recovery for atomic-openshift-node
Product: OpenShift Container Platform
Component: Networking
Version: 3.6.0
Target Release: 3.9.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Ritesh Arya <sarya0113>
Assignee: Rajat Chopra <rchopra>
QA Contact: Meng Bo <bmeng>
CC: aos-bugs, bbennett, jokerman, mfojtik, mmccomas, sdodson, zzhao
Last Closed: 2018-03-28 14:18:26 UTC

Description Ritesh Arya 2018-01-10 06:01:16 UTC
Description of problem:
When a user stops dnsmasq with systemctl stop dnsmasq, atomic-openshift-node is stopped along with it. If the user then recovers dnsmasq with systemctl start dnsmasq, atomic-openshift-node does not recover automatically. Any pod created in the meantime therefore never reaches the Running state and stays Pending forever; the user has to start atomic-openshift-node manually to get the pod running.
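
This matches systemd's stop-propagation behavior for a hard dependency. A minimal sketch, assuming the node unit at the time carried something like the following (illustrative only; the shipped unit may differ):

[Unit]
# Requires= is a hard dependency: an explicit stop (or restart) of
# dnsmasq.service propagates to this unit, but starting dnsmasq.service
# does NOT start this unit again.
Requires=dnsmasq.service
After=dnsmasq.service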

Version-Release number of selected component (if applicable): 
atomic-openshift-utils-3.6.173.0.48-1.git.0.1609d30.el7.noarch


How reproducible:
Reproducible by executing the commands below through the CLI (a status-check sketch follows the steps).


Steps to Reproduce:
1. systemctl stop dnsmasq
2. oadm registry
3. systemctl start dnsmasq
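
A sketch for observing the failure at each step (commands only, output omitted; assumes running as root on the node):

# After step 1: atomic-openshift-node is stopped along with dnsmasq.
systemctl status dnsmasq atomic-openshift-node | grep Active

# After step 2: the registry pod stays Pending because the node is down.
oc get pods

# After step 3: dnsmasq is active again, but atomic-openshift-node stays
# inactive until it is started manually.
systemctl status dnsmasq atomic-openshift-node | grep Active
systemctl start atomic-openshift-node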

Actual results:
The pod never reaches the Running state and stays Pending forever, because atomic-openshift-node does not recover automatically.

Expected results:
OpenShift should automatically recover atomic-openshift-node when dnsmasq is started again.


Additional info:

Comment 1 Xingxing Xia 2018-01-11 06:34:57 UTC
Seems the issue is more related to Networking or Master. Changing the component to Networking and also CC'ing the Master component's default assignee :)

Comment 2 Ben Bennett 2018-01-17 20:07:47 UTC
@sdodson: Do you know of any systemd magic we can do to the unit files here?

Comment 3 Scott Dodson 2018-01-17 20:37:39 UTC
PartOf= causes stop/restart to propagate to us whenever the target unit gets those events. WantedBy= would cause a start of dnsmasq to also start the node, which seems like it would fix the concern here, but to me it would be unexpected for starting dnsmasq to trigger the node service in the event that the node service had been stopped deliberately for some other reason.

So perhaps:

[Unit]
# Propagate dnsmasq's explicit stop/restart events to the node service.
PartOf=dnsmasq.service

[Install]
# Make a start of dnsmasq pull the node service in as well.
WantedBy=dnsmasq.service

Alternatively, we could just remove the Requires= and rely on Wants= to ensure that dnsmasq is requested to start when the node starts; but if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.
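
A sketch of that alternative in the node unit (illustrative only):

[Unit]
# Wants= is a soft dependency: starting the node requests a start of
# dnsmasq.service, but stopping dnsmasq.service no longer stops the node.
Wants=dnsmasq.service
After=dnsmasq.service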

See man 5 systemd.unit for a more thorough description of these options.
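
For local experiments with either variant, the usual systemd drop-in workflow avoids editing the shipped unit file (a sketch; the override path is the systemd default):

# Opens an editor on
# /etc/systemd/system/atomic-openshift-node.service.d/override.conf
systemctl edit atomic-openshift-node

# Reload unit definitions (redundant after systemctl edit, but harmless)
# and restart the node service to apply the change.
systemctl daemon-reload
systemctl restart atomic-openshift-node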

Comment 4 Rajat Chopra 2018-01-23 21:36:15 UTC
> Alternatively, we could just remove the Requires= and rely on Wants= to ensure that dnsmasq is requested to start when the node starts; but if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.

This seems preferable: dnsmasq is wanted for full functionality, but the node process does not need a restart if dnsmasq were to reinvent itself.
The possibility that someone stopping dnsmasq breaks DNS is a fair deal: the admin stopped it, so the admin has to start it again for things to function. This is what we already do with openvswitch, for example.

Possible fix in PR: https://github.com/openshift/openshift-ansible/pull/6843

@Scott, what do you think?

Comment 5 Scott Dodson 2018-01-24 18:06:34 UTC
Yeah that sounds fine to me, /lgtm'd that PR

Comment 7 Meng Bo 2018-02-08 07:44:38 UTC
Tested on OCP 3.9.0-0.41.0 with openshift-ansible-3.9.0-0.41.0; the issue has been fixed. The node service is no longer stopped when dnsmasq is stopped.

[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:40:56 EST; 13s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 12s ago
[root@ip-172-18-3-105 ~]# systemctl stop dnsmasq 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: inactive (dead) since Thu 2018-02-08 02:41:35 EST; 1s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 40s ago
[root@ip-172-18-3-105 ~]# systemctl restart atomic-openshift-node 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:41:44 EST; 2s ago
   Active: active (running) since Thu 2018-02-08 02:41:45 EST; 1s ago
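
A complementary check one could run (commands only; with the Wants= approach, starting dnsmasq is deliberately not expected to start a stopped node):

systemctl stop atomic-openshift-node dnsmasq
systemctl start dnsmasq
systemctl status dnsmasq atomic-openshift-node | grep Active
# Expected: dnsmasq active, node still inactive; start the node explicitly.
systemctl start atomic-openshift-node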

Comment 10 errata-xmlrpc 2018-03-28 14:18:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489