Bug 1532960 - No Auto-recovery for atomic-openshift-node
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Priority: medium
Severity: medium
Target Release: 3.9.0
Assigned To: Rajat Chopra
QA Contact: Meng Bo
Reported: 2018-01-10 01:01 EST by Ritesh Arya
Modified: 2018-07-11 02:59 EDT
Last Closed: 2018-03-28 10:18:26 EDT
Type: Bug

External Trackers:
- Github: https://github.com/openshift/openshift-ansible/pull/6843 (last updated 2018-02-02 14:13 EST)
- Red Hat Product Errata: RHBA-2018:0489 (last updated 2018-03-28 10:18 EDT)
Description Ritesh Arya 2018-01-10 01:01:16 EST
Description of problem:
When the user stops dnsmasq with "systemctl stop dnsmasq", atomic-openshift-node is stopped as well. When the user then tries to recover by running "systemctl start dnsmasq", atomic-openshift-node does not start again automatically, so pods never reach the Running state and remain Pending indefinitely. The user has to start atomic-openshift-node manually to get pods running again.

Version-Release number of selected component (if applicable): 
atomic-openshift-utils-3.6.173.0.48-1.git.0.1609d30.el7.noarch


How reproducible:
By executing the commands below through the CLI.


Steps to Reproduce:
1. systemctl stop dnsmasq
2. oadm registry
3. systemctl start dnsmasq

Actual results:
The pod never reaches the Running state; it remains Pending indefinitely because atomic-openshift-node does not recover automatically.

Expected results:
OpenShift should automatically recover atomic-openshift-node when dnsmasq is started again.


Additional info:
Comment 1 Xingxing Xia 2018-01-11 01:34:57 EST
Seems the issue is more related to Networking or Master. Changing the component to Networking and also CC'ing the Master component's default assignee :)
Comment 2 Ben Bennett 2018-01-17 15:07:47 EST
@sdodson: Do you know of any systemd magic we can do to the unit files here?
Comment 3 Scott Dodson 2018-01-17 15:37:39 EST
PartOf causes stop/restart to propagate to us whenever the target gets those events. WantedBy would cause a start of dnsmasq to also start the node, which seems like it would fix the concern here; but to me it would be unexpected for starting dnsmasq to trigger the node service to start in the event that the node service had been stopped for some other reason.

So perhaps
[Unit]
PartOf=dnsmasq.service

[Install]
WantedBy=dnsmasq.service

Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts; but if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.

See man 5 systemd.unit for a more thorough description of these options.
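As an illustration only (not taken from any PR), the PartOf/WantedBy idea above could be expressed as a systemd drop-in, leaving the packaged unit file untouched. The drop-in path and file name here are assumptions based on standard systemd conventions:

```ini
# /etc/systemd/system/atomic-openshift-node.service.d/10-dnsmasq.conf
# Hypothetical drop-in sketching the PartOf/WantedBy approach above.

[Unit]
# Stopping or restarting dnsmasq.service propagates the same action
# to atomic-openshift-node.
PartOf=dnsmasq.service

[Install]
# After "systemctl enable", starting dnsmasq would also pull in the node.
WantedBy=dnsmasq.service
```

A "systemctl daemon-reload" (and re-enabling the unit) would be needed for the drop-in to take effect.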
Comment 4 Rajat Chopra 2018-01-23 16:36:15 EST
> Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts; but if dnsmasq were stopped, the node would continue running and pods would immediately have broken DNS.

It seems like this is preferable, because dnsmasq is wanted for complete functionality but the node process does not need a restart if dnsmasq were to restart itself. The possibility that stopping dnsmasq leaves pods with broken DNS is a fair deal: the admin stopped it, and has to start it again for things to function. As an example, this is what we do with openvswitch.

Possible fix in PR: https://github.com/openshift/openshift-ansible/pull/6843
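The shape of that change can be sketched as a unit-file fragment. The actual diff is in the linked PR, so the lines below are an assumption for illustration only:

```ini
# Sketch (assumed, not the actual PR contents) of the atomic-openshift-node
# unit's dnsmasq dependency after dropping Requires= in favor of Wants=.

[Unit]
# Wants= asks systemd to start dnsmasq along with the node, but unlike
# Requires=, the node keeps running if dnsmasq is stopped or fails.
Wants=dnsmasq.service
# Still order the node after dnsmasq so DNS is up when the node starts.
After=dnsmasq.service
```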

@Scott, what do you think?
Comment 5 Scott Dodson 2018-01-24 13:06:34 EST
Yeah that sounds fine to me, /lgtm'd that PR
Comment 7 Meng Bo 2018-02-08 02:44:38 EST
Tested on OCP 3.9.0-0.41.0 with openshift-ansible-3.9.0-0.41.0; the issue has been fixed. The node service is no longer stopped when dnsmasq is stopped.

[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:40:56 EST; 13s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 12s ago
[root@ip-172-18-3-105 ~]# systemctl stop dnsmasq 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: inactive (dead) since Thu 2018-02-08 02:41:35 EST; 1s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 40s ago
[root@ip-172-18-3-105 ~]# systemctl restart atomic-openshift-node 
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:41:44 EST; 2s ago
   Active: active (running) since Thu 2018-02-08 02:41:45 EST; 1s ago
Comment 10 errata-xmlrpc 2018-03-28 10:18:26 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
