Bug 1868536 - patching networkType to invalid value causes master and worker node reboots, ovs-vswitchd fails to start and cluster fails
Summary: patching networkType to invalid value causes master and worker node reboots, ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Peng Liu
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-13 03:06 UTC by Ross Brattain
Modified: 2020-10-27 16:28 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:28:06 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github: openshift/machine-config-operator pull 2015 (closed) Bug 1868536: Watching the networkType in the status of Network.config.openshift.io (last updated 2020-11-04 08:13:00 UTC)
Red Hat Product Errata: RHBA-2020:4196 (last updated 2020-10-27 16:28:24 UTC)

Description Ross Brattain 2020-08-13 03:06:20 UTC
Description of problem:

There are two issues:
* Patching the networks.config.openshift.io networkType to an invalid value causes master and worker node reboots.

* Once the nodes reboot, ovs-vswitchd fails to start and the cluster fails because the network is broken.

Version-Release number of selected component (if applicable):

4.6.0-0.nightly-2020-08-12-071533


How reproducible:

Always

Steps to Reproduce:
1. oc patch networks.config.openshift.io cluster --type=merge -p '{"spec":{"networkType":"bad"}}'


Actual results:

Master and worker nodes reboot. On reboot, systemd fails to start ovs-vswitchd.
The sdn pods are in CrashLoopBackOff.
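
The failing state can be confirmed with standard commands (the node name below is a placeholder):

# SDN pods should show CrashLoopBackOff
oc get pods -n openshift-sdn
# ovs-vswitchd should show as failed on the node
oc debug node/<node-name> -- chroot /host systemctl status ovs-vswitchd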

Expected results:

Invalid networkType values should be rejected.
Nodes should not reboot.
ovs-vswitchd must always start on boot.

Additional info:

Recover by restarting ovs-vswitchd on all the nodes:

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host systemctl start ovs-vswitchd & done

After starting ovs-vswitchd the node will reboot again.
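
A full recovery presumably also requires reverting the patch to the plugin the cluster was installed with (OpenShiftSDN below is an assumption; substitute the original value):

oc patch networks.config.openshift.io cluster --type=merge -p '{"spec":{"networkType":"OpenShiftSDN"}}'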

Comment 3 Peng Liu 2020-08-13 07:28:16 UTC
This is caused by PR https://github.com/openshift/machine-config-operator/pull/1846, which disables the openvswitch services in the MachineConfig when networkType is neither 'OpenShiftSDN' nor 'OVNKubernetes'; the Machine Config Operator then triggers a reboot. I think we may need to add some validation for networkType in the CRD.
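
For illustration, such CRD validation could take the form of an OpenAPI enum on the field, e.g.:

# Hypothetical fragment of the networks.config.openshift.io CRD schema;
# this sketch is not the fix that merged, and a closed enum would conflict
# with third-party plugin support, as discussed in later comments.
openAPIV3Schema:
  properties:
    spec:
      properties:
        networkType:
          type: string
          enum:
          - OpenShiftSDN
          - OVNKubernetes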

Comment 6 Dan Winship 2020-08-17 16:15:44 UTC
Yeah, this is more correctly described as "if you set the networkType to something other than OpenShiftSDN or OVNKubernetes, then we don't run components that are only required for OpenShiftSDN and OVNKubernetes"

> Steps to Reproduce:
> 1. oc patch networks.config.openshift.io cluster --type=merge -p '{"spec":{"networkType":"bad"}}'

If you set `networkType` to something that CNO doesn't support, then CNO will un-deploy the existing network plugin and stop trying to manage the network. As Casey said, this is necessary for third-party plugin support.

If the real original issue was "I accidentally made a typo and set networkType to `OVNKurbunets` and then everything broke" then we can think about how to be more resilient against stuff like that.

If the real original issue was "I wanted to see what would happen if I set networkType to a bogus value", well, now you know...
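
For reference, the effect on the machine configuration can be observed in the rendered MachineConfig for a pool; a sketch assuming standard MCO objects (the grep pattern is illustrative):

# Find the rendered MachineConfig currently targeted by the worker pool
RENDERED=$(oc get mcp worker -o jsonpath='{.spec.configuration.name}')
# See how the Open vSwitch units are configured in it
oc get mc "$RENDERED" -o yaml | grep -B2 -A2 ovs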

Comment 8 Dan Winship 2020-08-17 16:30:29 UTC
The problem is that we don't know what the set of allowed values is. In hindsight, we probably should have required you to explicitly set some "I'm not using CNO" flag as well, but we can't add that now, for compatibility.

Comment 11 Casey Callendrello 2020-08-18 11:43:26 UTC
We don't need an admission controller; we can just report back via the operator Status. If someone asks us to do something we can't do, then we mark Degraded and set a message. This will happen so rarely in production clusters that I don't think we need to cover every possible case.
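
Illustratively, such a condition might look like the following on the operator status (the reason and message strings here are invented, not what CNO actually reports):

status:
  conditions:
  - type: Degraded
    status: "True"
    reason: UnsupportedNetworkType
    message: networkType "bad" is not supported by the cluster-network-operator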

Comment 15 Ross Brattain 2020-08-20 20:21:01 UTC
QE note: another testcase OCP-22419 for SDN-179 tests {"spec":{"networkType":"None"}}

Comment 16 Ross Brattain 2020-08-20 21:23:20 UTC
Similar case with OCP-29299.  On an SDN cluster we test setting {"spec":{"networkType":"OVNKubernetes"}} without setting the network-migration annotation.

Once the networkType is patched MCO enables

name: ovs-configuration.service
enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}

ovs-configuration.service probably should not be run unless the migration annotation is set.

Comment 17 Peng Liu 2020-08-21 03:12:35 UTC
(In reply to Ross Brattain from comment #16)
> Similar case with OCP-29299.  On an SDN cluster we test setting
> {"spec":{"networkType":"OVNKubernetes"}} without setting the
> network-migration annotation.
> 
> Once the networkType is patched MCO enables
> 
> name: ovs-configuration.service
> enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}
> 
> ovs-configuration.service probably should not be run unless the migration
> annotation is set.

With the current fix, the networkType in controllerconfigs.machineconfiguration.openshift.io follows the status instead of the spec. During migration, the status of Network.config.openshift.io is only updated by CNO when the annotation is set, so the migration case can work. However, the migration has other problems that still need to be fixed; it does not work yet. I'm working on that.
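
The spec/status split the fix relies on can be seen directly (both fields exist on Network.config.openshift.io):

oc get networks.config.openshift.io cluster -o jsonpath='spec: {.spec.networkType}{"\n"}status: {.status.networkType}{"\n"}'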

Comment 22 errata-xmlrpc 2020-10-27 16:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

