Bug 1532960 - No Auto-recovery for atomic-openshift-node
Summary: No Auto-recovery for atomic-openshift-node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Rajat Chopra
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-10 06:01 UTC by Ritesh Arya
Modified: 2018-07-11 06:59 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:18:26 UTC
Target Upstream Version:
Embargoed:




Links
GitHub: https://github.com/openshift/openshift-ansible/pull/6843 (last updated 2018-02-02 19:13:41 UTC)
Red Hat Product Errata: RHBA-2018:0489 (last updated 2018-03-28 14:18:45 UTC)

Description Ritesh Arya 2018-01-10 06:01:16 UTC
Description of problem:
When a user stops dnsmasq with systemctl stop dnsmasq, atomic-openshift-node is stopped along with it. If the user then recovers dnsmasq with systemctl start dnsmasq, atomic-openshift-node does not start again automatically. As a result, pods never reach the Running state; they remain Pending indefinitely until the user starts atomic-openshift-node manually.

Version-Release number of selected component (if applicable): 
atomic-openshift-utils-3.6.173.0.48-1.git.0.1609d30.el7.noarch


How reproducible:
Always, by running the commands below through the CLI.


Steps to Reproduce:
1. systemctl stop dnsmasq
2. oadm registry
3. systemctl start dnsmasq

Actual results:
The pod never reaches the Running state; it remains Pending indefinitely because atomic-openshift-node does not recover automatically.
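
For reference, a quick way to observe the failure state on the node host (standard oc and systemctl commands; the exact pod name varies per deployment):

oc get pods                                 # registry pod shows Pending
systemctl is-active atomic-openshift-node   # inactive, because stopping dnsmasq stopped it
systemctl start atomic-openshift-node       # the manual recovery described above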

Expected results:
OpenShift should automatically recover atomic-openshift-node when dnsmasq is started again.


Additional info:

Comment 1 Xingxing Xia 2018-01-11 06:34:57 UTC
Seems the issue is more related to Networking or Master. Changing the component to Networking and also CC'ing the Master component's default assignee :)

Comment 2 Ben Bennett 2018-01-17 20:07:47 UTC
@sdodson: Do you know of any systemd magic we can do to the unit files here?

Comment 3 Scott Dodson 2018-01-17 20:37:39 UTC
PartOf= causes stop/restart to propagate to us whenever the target gets those events. WantedBy= would cause a start of dnsmasq to also start the node, which seems like it would fix the concern here, but to me it would be unexpected for starting dnsmasq to trigger the node service to start if the node service had been stopped for some other reason.

So perhaps
[Unit]
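# Propagate stop and restart of dnsmasq.service to this unit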
PartOf=dnsmasq.service

[Install]
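# When this unit is enabled, starting dnsmasq.service also starts it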
WantedBy=dnsmasq.service

Alternatively, we could just remove the Requires= and rely on Wants= to ensure that dnsmasq is requested to start when the node starts; but if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.

See man 5 systemd.unit for a more thorough description of these options.
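
For illustration, a minimal sketch of the Wants= alternative as a drop-in for the node unit (the file path and layout here are assumptions; the actual change landed via the openshift-ansible PR linked in the next comment):

# /etc/systemd/system/atomic-openshift-node.service.d/dnsmasq.conf (illustrative path)
[Unit]
# Wants= pulls dnsmasq in when the node starts, but unlike Requires=,
# stopping dnsmasq does not take the node down with it.
Wants=dnsmasq.service
# Keep the ordering so dnsmasq is up before the node starts.
After=dnsmasq.service

With this, systemctl stop dnsmasq leaves atomic-openshift-node running; pods would simply have broken DNS until dnsmasq is started again.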

Comment 4 Rajat Chopra 2018-01-23 21:36:15 UTC
> Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts but if dnsmasq were stopped the node would continue running and pods would immediately have broken dns.

This seems like the favourable option: dnsmasq is wanted for complete functionality, but the node process does not need a restart if dnsmasq comes back on its own.
As for the possibility of someone stopping dnsmasq and thereby breaking DNS, that is a fair deal: the admin stopped it, so the admin has to start it again for things to function. As an example, this is what we do with openvswitch.

Possible fix in PR: https://github.com/openshift/openshift-ansible/pull/6843

@Scott, what do you think?

Comment 5 Scott Dodson 2018-01-24 18:06:34 UTC
Yeah that sounds fine to me, /lgtm'd that PR

Comment 7 Meng Bo 2018-02-08 07:44:38 UTC
Tested on OCP 3.9.0-0.41.0 with openshift-ansible-3.9.0-0.41.0; the issue has been fixed. The node service is no longer stopped when dnsmasq is stopped.

[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:40:56 EST; 13s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 12s ago
[root@ip-172-18-3-105 ~]# systemctl stop dnsmasq 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: inactive (dead) since Thu 2018-02-08 02:41:35 EST; 1s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 40s ago
[root@ip-172-18-3-105 ~]# systemctl restart atomic-openshift-node 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:41:44 EST; 2s ago
   Active: active (running) since Thu 2018-02-08 02:41:45 EST; 1s ago

Comment 10 errata-xmlrpc 2018-03-28 14:18:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

