Bug 1854801
| Summary: | Killing ovs-vswitchd on a node puts the sdn pod into CrashLoopBackOff status. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huirwang |
| Component: | Networking | Assignee: | Aniket Bhat <anbhat> |
| Networking sub component: | openshift-sdn | QA Contact: | huirwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | unspecified | CC: | anbhat, anusaxen, dmellado, rbrattai |
| Version: | 4.6 | Keywords: | UpcomingSprint |
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:12:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1848374 | | |
Description
huirwang
2020-07-08 09:22:30 UTC
ovs-vswitchd is set to Restart=on-failure. According to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=, "on-failure" does not restart the service on a clean exit code, so it won't restart after a pkill. It looks like ovs-vswitchd.service ships in RHCOS, so we need to modify it to Restart=always.

The expectation that pkill ovs-vswitchd does not permanently stop ovs-vswitchd is inherited from the previous behavior, where CNO and the OVS DaemonSet would restart OVS pods that were terminated. Does CNO have the ability to monitor the host OVS to enforce the same policy?

When moving to system OVS, we should rely on the --monitor flag to ensure that any such inadvertent (or, in some cases, intentional) kill of the process does not leave vswitchd offline. I see that our RHCOS images don't have the flag set, i.e. I see --no-monitor in the service files.

I'm no systemd expert, but I thought the plan was to move all daemon-monitoring responsibilities to systemd and not use mechanisms like --monitor. Looking at the original BZ for the test case, https://bugzilla.redhat.com/show_bug.cgi?id=1669311, the original problem was OVS getting OOM-killed, which means either SIGTERM or SIGKILL. I'm not sure how the current systemd unit will react when OOM-killed. It is possible the --monitor process could also be OOM-killed, whereas systemd itself I imagine is immune and thus always present to restart daemons.

Scratch my previous comment. It looks like the service files do indeed have Restart=on-failure set. However, SIGTERM is not one of the signals to which it will respond. From the systemd documentation:

> Takes one of no, on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always. If set to no (the default), the service will not be restarted. If set to on-success, it will be restarted only when the service process exits cleanly. In this context, a clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE, and additionally, exit statuses and signals specified in SuccessExitStatus=. If set to on-failure, the service will be restarted when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the aforementioned four signals), when an operation (such as service reload) times out, and when the configured watchdog timeout is triggered. If set to on-abnormal, the service will be restarted when the process is terminated by a signal (including on core dump, excluding the aforementioned four signals), when an operation times out, or when the watchdog timeout is triggered. If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status. If set to on-watchdog, the service will be restarted only if the watchdog timeout for the service expires. If set to always, the service will be restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout.

So back to the question Ross is asking: do we want to support pkill as one of the modes? If so, we need to set Restart=always.
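For illustration, a minimal sketch of how the policy could be switched to Restart=always on a node via a systemd drop-in; the drop-in path, file name, and RestartSec value are assumptions for this example, not the change that was actually shipped in RHCOS:

```sh
# Check how systemd will treat a SIGTERM (e.g. from pkill) today:
systemctl show ovs-vswitchd.service -p Restart
# -> Restart=on-failure  (SIGTERM counts as a clean exit, so no restart)

# Hypothetical drop-in that makes systemd restart the daemon on any exit,
# including the four "clean" signals (SIGHUP, SIGINT, SIGTERM, SIGPIPE):
mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat > /etc/systemd/system/ovs-vswitchd.service.d/10-restart-always.conf <<'EOF'
[Service]
Restart=always
# Illustrative throttle so a genuinely crash-looping ovs-vswitchd
# does not restart unbounded.
RestartSec=2
EOF
systemctl daemon-reload
```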
Just for reference, here are the restart policies for other services on 4.6.0-0.nightly-2020-07-15-091743:

    sh-4.4# systemctl show '*' -p Restart -p Names --type=service --state=active | grep -A1 'Restart=[^n]'
    Restart=always
    Names=getty
    --
    Restart=always
    Names=systemd-udevd.service
    --
    Restart=always
    Names=systemd-logind.service
    --
    Restart=on-failure
    Names=sshd.service
    --
    Restart=on-failure
    Names=ovs-vswitchd.service
    --
    Restart=on-abnormal
    Names=crio.service
    --
    Restart=always
    Names=systemd-journald.service
    --
    Restart=always
    Names=kubelet.service
    --
    Restart=on-failure
    Names=ovsdb-server.service
    --
    Restart=on-failure
    Names=sssd.service
    --
    Restart=always
    Names=serial-getty
    --
    Restart=on-failure
    Names=NetworkManager.service

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
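For completeness, a rough verification sketch for a node where the restart policy is in effect; the commands are standard systemd/oc tooling, and the expected results noted in the comments are assumptions based on this report rather than output captured from a cluster:

```sh
# Kill the daemon the same way the test case does:
pkill ovs-vswitchd

# systemd should bring it straight back:
sleep 5
systemctl is-active ovs-vswitchd.service        # expected: active
journalctl -u ovs-vswitchd.service -n 20 --no-pager

# The sdn pods should stay Running instead of entering CrashLoopBackOff:
oc -n openshift-sdn get pods -o wide
```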