Bug 1854801
| Summary: | Killing ovs-vswitchd on a node puts the sdn pod into CrashLoopBackOff status. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huirwang |
| Component: | Networking | Assignee: | Aniket Bhat <anbhat> |
| Networking sub component: | openshift-sdn | QA Contact: | huirwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | unspecified | CC: | anbhat, anusaxen, dmellado, rbrattai |
| Version: | 4.6 | Keywords: | UpcomingSprint |
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:12:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1848374 | | |
Description
huirwang
2020-07-08 09:22:30 UTC
ovs-vswitchd is set to Restart=on-failure. According to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=, "on-failure" does not restart the service on a clean exit code, so it won't restart after a pkill. It looks like ovs-vswitchd.service ships in RHCOS, so we need to modify it to Restart=always.

The expectation that pkill ovs-vswitchd does not permanently stop ovs-vswitchd is inherited from the previous behavior, where CNO and the OVS DaemonSet would restart OVS pods that were terminated. Does CNO have the ability to monitor the host OVS to enforce the same policy?

When moving to system OVS, we should rely on the --monitor flag to ensure that any such inadvertent (or, in some cases, intentional) kill of the process does not leave vswitchd offline. I see that our RHCOS images don't have the flag set, i.e. I see --no-monitor in the service files.

I'm no systemd expert, but I thought the plan was to move all daemon-monitoring responsibilities to systemd and not use mechanisms like --monitor. Looking at the original BZ for the test case, https://bugzilla.redhat.com/show_bug.cgi?id=1669311, the original problem was OVS getting OOM-killed, which means either SIGTERM or SIGKILL. I'm not sure how the current systemd unit will react when OOM-killed. It is possible the --monitor process could also be OOM-killed, whereas systemd itself I imagine is immune and thus always present to restart daemons.

Scratch my previous comment. It looks like the service files do indeed have Restart=on-failure set. However, SIGTERM is not one of the signals to which it will respond. From the systemd documentation:

> Takes one of no, on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always. If set to no (the default), the service will not be restarted. If set to on-success, it will be restarted only when the service process exits cleanly. In this context, a clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE, and additionally, exit statuses and signals specified in SuccessExitStatus=. If set to on-failure, the service will be restarted when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the aforementioned four signals), when an operation (such as service reload) times out, and when the configured watchdog timeout is triggered. If set to on-abnormal, the service will be restarted when the process is terminated by a signal (including on core dump, excluding the aforementioned four signals), when an operation times out, or when the watchdog timeout is triggered. If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status. If set to on-watchdog, the service will be restarted only if the watchdog timeout for the service expires. If set to always, the service will be restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout.

So back to the question Ross is asking: do we want to support pkill as one of the modes? If so, we need to set Restart=always.
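For illustration, a minimal sketch of how the policy could be switched to Restart=always on a node via a systemd drop-in; the drop-in path, file name, and RestartSec value are assumptions for this example, not the change that was actually shipped in RHCOS:

```sh
# Check how systemd will treat a SIGTERM (e.g. from pkill) today:
systemctl show ovs-vswitchd.service -p Restart
# -> Restart=on-failure  (SIGTERM counts as a clean exit, so no restart)

# Hypothetical drop-in that makes systemd restart the daemon on any exit,
# including the four "clean" signals (SIGHUP, SIGINT, SIGTERM, SIGPIPE):
mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat > /etc/systemd/system/ovs-vswitchd.service.d/10-restart-always.conf <<'EOF'
[Service]
Restart=always
# Illustrative throttle so a genuinely crash-looping ovs-vswitchd
# does not restart unbounded.
RestartSec=2
EOF
systemctl daemon-reload
```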
Just for reference, here are the restart policies for other services on 4.6.0-0.nightly-2020-07-15-091743:

    sh-4.4# systemctl show '*' -p Restart -p Names --type=service --state=active | grep -A1 'Restart=[^n]'
    Restart=always
    Names=getty
    --
    Restart=always
    Names=systemd-udevd.service
    --
    Restart=always
    Names=systemd-logind.service
    --
    Restart=on-failure
    Names=sshd.service
    --
    Restart=on-failure
    Names=ovs-vswitchd.service
    --
    Restart=on-abnormal
    Names=crio.service
    --
    Restart=always
    Names=systemd-journald.service
    --
    Restart=always
    Names=kubelet.service
    --
    Restart=on-failure
    Names=ovsdb-server.service
    --
    Restart=on-failure
    Names=sssd.service
    --
    Restart=always
    Names=serial-getty
    --
    Restart=on-failure
    Names=NetworkManager.service

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
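For completeness, a rough verification sketch for a node where the restart policy is in effect; the commands are standard systemd/oc tooling, and the expected results noted in the comments are assumptions based on this report rather than output captured from a cluster:

```sh
# Kill the daemon the same way the test case does:
pkill ovs-vswitchd

# systemd should bring it straight back:
sleep 5
systemctl is-active ovs-vswitchd.service        # expected: active
journalctl -u ovs-vswitchd.service -n 20 --no-pager

# The sdn pods should stay Running instead of entering CrashLoopBackOff:
oc -n openshift-sdn get pods -o wide
```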