Bug 1532960
| Summary: | No Auto-recovery for atomic-openshift-node | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ritesh Arya <sarya0113> |
| Component: | Networking | Assignee: | Rajat Chopra <rchopra> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.6.0 | CC: | aos-bugs, bbennett, jokerman, mfojtik, mmccomas, sdodson, zzhao |
| Target Milestone: | --- | ||
| Target Release: | 3.9.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-03-28 14:18:26 UTC | Type: | Bug |
Description
Ritesh Arya
2018-01-10 06:01:16 UTC
Seems the issue is more related to Network or Master. Changing the component to Networking and also CC'ing the Master component's default assignee :)

@sdodson: Do you know of any systemd magic we can do to the unit files here?

PartOf causes stop/restart to propagate to us whenever the target gets those events. WantedBy will cause a start of dnsmasq to start the node, which seems like it would fix the concern here, but to me it would be unexpected for starting dnsmasq to trigger a start of the node service if the node service had been stopped for some other reason. So perhaps:

    [Unit]
    PartOf=dnsmasq.service

    [Install]
    WantedBy=dnsmasq.service

Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts, but if dnsmasq were stopped the node would continue running and pods would immediately have broken DNS.

See man 5 systemd.unit for a more thorough description of these options.

> Alternatively, we could just remove the Requires and rely on Wants to ensure that dnsmasq is requested to start when the node starts but if dnsmasq were stopped the node would continue running and pods would immediately have broken dns.

This seems favourable because dnsmasq is wanted for complete functionality, but the node process does not need a restart if dnsmasq were to re-invent itself. The possibility of someone stopping dnsmasq and causing broken DNS is a fair deal: the admin who stops it has to start it again for things to function. As an example, this is what we do with openvswitch.

Possible fix in PR: https://github.com/openshift/openshift-ansible/pull/6843

@Scott, what do you think?

Yeah, that sounds fine to me, /lgtm'd that PR.

Tested on ocp 3.9.0-0.41.0 and openshift-ansible-3.9.0-0.41.0; the issue has been fixed. The node service is no longer stopped when dnsmasq is stopped.

[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:40:56 EST; 13s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 12s ago
[root@ip-172-18-3-105 ~]# systemctl stop dnsmasq
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: inactive (dead) since Thu 2018-02-08 02:41:35 EST; 1s ago
   Active: active (running) since Thu 2018-02-08 02:40:57 EST; 40s ago
[root@ip-172-18-3-105 ~]# systemctl restart atomic-openshift-node
[root@ip-172-18-3-105 ~]# systemctl status dnsmasq atomic-openshift-node | grep Active
   Active: active (running) since Thu 2018-02-08 02:41:44 EST; 2s ago
   Active: active (running) since Thu 2018-02-08 02:41:45 EST; 1s ago

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
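
For anyone checking a node by hand, the Requires-versus-Wants distinction discussed in this bug can be inspected with a small helper. This is only a rough sketch, not part of the fix: `dep_kind` is a hypothetical function name, and the unit-file text piped into it below is made-up sample input, not the actual atomic-openshift-node unit file. It also ignores systemd details such as drop-in overrides and empty assignments that reset a list.

```shell
#!/bin/sh
# dep_kind UNIT (hypothetical helper): read unit-file text on stdin and print
# how UNIT is referenced: "Requires", "Wants", or "none".
# Simplification: only plain Requires=/Wants= lines are considered.
dep_kind() {
    awk -v unit="$1" '
        /^[ \t]*(Requires|Wants)[ \t]*=/ {
            split($0, kv, "=")
            key = kv[1]
            gsub(/[ \t]/, "", key)              # strip whitespace around the key
            n = split(kv[2], deps, /[ \t]+/)    # a directive may list several units
            for (i = 1; i <= n; i++)
                if (deps[i] == unit) found[key] = 1
        }
        END {
            if (found["Requires"]) print "Requires"
            else if (found["Wants"]) print "Wants"
            else print "none"
        }
    '
}

# Sample input (assumed, for illustration only).
printf '[Unit]\nRequires=dnsmasq.service\nAfter=dnsmasq.service\n' | dep_kind dnsmasq.service
printf '[Unit]\nWants=dnsmasq.service\nAfter=dnsmasq.service\n' | dep_kind dnsmasq.service
```

On a real host, the same function could presumably be fed from `systemctl cat atomic-openshift-node`. A result of "Requires" would correspond to the behaviour reported here (stopping dnsmasq also stops the node), while "Wants" would match the behaviour shown in the verification transcript above.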