Bug 1868536
| Field | Value |
|---|---|
| Summary | patching networkType to invalid value causes master and worker node reboots, ovs-vswitchd fails to start and cluster fails |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | openshift-sdn |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Ross Brattain <rbrattai> |
| Assignee | Peng Liu <pliu> |
| QA Contact | huirwang |
| CC | anusaxen, bbennett, cdc, danw, huirwang, mharri, pliu, wking |
| Doc Type | If docs needed, set a value |
| Last Closed | 2020-10-27 16:28:06 UTC |
| Type | Bug |
Description
Ross Brattain
2020-08-13 03:06:20 UTC
It's caused by PR https://github.com/openshift/machine-config-operator/pull/1846, which disables the openvswitch services in the MachineConfig when networkType is neither 'OpenShiftSDN' nor 'OVNKubernetes'. The Machine Config Operator then triggers a reboot. I think we may need to add some validation for networkType in the CRD.

Yeah, this is more correctly described as "if you set the networkType to something other than OpenShiftSDN or OVNKubernetes, then we don't run components that are only required for OpenShiftSDN and OVNKubernetes".
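The gating behaviour described above can be sketched in a few lines. This is only an illustrative model, not the Machine Config Operator's code: the names `OVS_NETWORK_TYPES` and `openvswitch_enabled` are hypothetical, and the only fact taken from the comments is that the openvswitch services are kept for `OpenShiftSDN` and `OVNKubernetes` and disabled for anything else.

```python
# Hypothetical model of the gating rule described in the comment above.
# The two plugin names come from the bug report; everything else is assumed.
OVS_NETWORK_TYPES = frozenset({"OpenShiftSDN", "OVNKubernetes"})


def openvswitch_enabled(network_type: str) -> bool:
    """Return True only for network types that need the openvswitch services."""
    return network_type in OVS_NETWORK_TYPES


# Any other value, including a bogus one, switches the services off.
for value in ("OpenShiftSDN", "OVNKubernetes", "bad", "None"):
    print(value, openvswitch_enabled(value))
```

Under this rule a value such as `bad` is treated exactly like a deliberate third-party plugin name, which is why the patch in the report ends with the services disabled rather than with a validation error.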
> Steps to Reproduce:
> 1. oc patch networks.config.openshift.io cluster -p \{\"spec\":\{\"networkType\":\"bad\"\}\} --type=merge
If you set `networkType` to something that CNO doesn't support, then CNO will un-deploy the existing network plugin and stop trying to manage the network. As Casey said, this is necessary for third-party plugin support.
If the real original issue was "I accidentally made a typo and set networkType to `OVNKurbunets` and then everything broke" then we can think about how to be more resilient against stuff like that.
If the real original issue was "I wanted to see what would happen if I set networkType to a bogus value", well, now you know...
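One way to be "more resilient" against the typo case mentioned above is to flag values that look like a misspelling of a known plugin name while still accepting unrelated names for third-party plugins. The following is a minimal sketch under that assumption; `suggest_network_type`, `KNOWN_NETWORK_TYPES`, and the 0.6 similarity cutoff are illustrative choices, not anything CNO implements.

```python
import difflib

# Plugin names taken from the bug report; the list name is hypothetical.
KNOWN_NETWORK_TYPES = ["OpenShiftSDN", "OVNKubernetes"]


def suggest_network_type(value, cutoff=0.6):
    """Return the closest known type if `value` looks like a typo, else None.

    An exact match or a clearly unrelated name (for example a third-party
    plugin) yields None, so only near-misses are flagged.
    """
    if value in KNOWN_NETWORK_TYPES:
        return None
    matches = difflib.get_close_matches(
        value, KNOWN_NETWORK_TYPES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(suggest_network_type("OVNKurbunets"))
print(suggest_network_type("bad"))
```

A check like this could only produce a warning, not a rejection, because the set of legitimate third-party values is not known in advance.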
The problem is that we don't know what the set of allowed values is. In hindsight, we probably should have required you to explicitly set some "I'm not using CNO" flag as well, but we can't add that now, for compatibility.

We don't need an admission controller; we can just report back via the operator Status. If someone asks us to do something we can't do, then we mark Degraded and set a message. This will happen so rarely in production clusters that I don't think we need to cover every possible case.

QE note: another testcase, OCP-22419 for SDN-179, tests {"spec":{"networkType":"None"}}.

Similar case with OCP-29299. On an SDN cluster we test setting {"spec":{"networkType":"OVNKubernetes"}} without setting the network-migration annotation.

Once the networkType is patched, MCO enables

name: ovs-configuration.service
enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}

ovs-configuration.service probably should not be run unless the migration is set.

(In reply to Ross Brattain from comment #16)
> Similar case with OCP-29299. On an SDN cluster we test setting
> {"spec":{"networkType":"OVNKubernetes"}} without setting the
> network-migration annotation.
>
> Once the networkType is patched MCO enables
>
> name: ovs-configuration.service
> enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}
>
> ovs-configuration.service probably should not be run unless the migration is
> set.

With the current fix, the networkType of controllerconfigs.machineconfiguration.openshift.io follows the status instead of the spec. During migration, the status of Network.config.openshift.io will only be updated by CNO when the annotation is set, so it can work in the migration case. However, for the migration there are other problems to be fixed, and it still doesn't work. I'm working on that.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
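The "status follows spec only when allowed" behaviour discussed in the comments above can be modelled roughly as follows. This is a simplified sketch, not the operators' actual code: `Result`, `reconcile_network_type`, `ovs_configuration_enabled`, and the wording of the Degraded message are all hypothetical, and treating every un-annotated change as Degraded is an assumption drawn from the "mark Degraded and set a message" proposal.

```python
from dataclasses import dataclass


@dataclass
class Result:
    status_network_type: str  # value the (hypothetical) operator would publish
    degraded: bool
    message: str = ""


def reconcile_network_type(spec_type, status_type, migration_allowed):
    """Decide what status to report for a requested networkType change.

    Assumption: a change is applied only when the network-migration
    annotation is set; otherwise status stays put and Degraded is reported.
    """
    if spec_type == status_type:
        return Result(status_type, degraded=False)
    if migration_allowed:
        return Result(spec_type, degraded=False)
    return Result(
        status_type,
        degraded=True,
        message=(
            f"cannot change networkType from {status_type!r} to "
            f"{spec_type!r} without the network-migration annotation"
        ),
    )


def ovs_configuration_enabled(status_type):
    """Same condition as the quoted template, evaluated against status."""
    return status_type == "OVNKubernetes"


blocked = reconcile_network_type("OVNKubernetes", "OpenShiftSDN", False)
print(blocked.degraded, ovs_configuration_enabled(blocked.status_network_type))
```

Because the service decision reads the status value, patching only the spec on an SDN cluster leaves `ovs-configuration.service` disabled in this model until the migration is explicitly allowed.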