Description of problem:
There are two issues:
* Patching the networks.config.openshift.io networkType to an invalid value causes master and worker node reboots.
* Once the node reboots ovs-vswitchd fails to start and cluster fails because the network is broken.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-12-071533

How reproducible:
Always

Steps to Reproduce:
1. oc patch networks.config.openshift.io cluster -p \{\"spec\":\{\"networkType\":\"bad\"\}\} --type=merge

Actual results:
Master and worker nodes reboot.
On reboot systemd fails to start ovs-vswitchd.
sdn pods are in CrashLoopBackOff

Expected results:
Invalid networkType values should be rejected.
Nodes should not reboot.
ovs-vswitchd must always start on boot.

Additional info:
Recover by restarting ovs-vswitchd on all the nodes

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host systemctl start ovs-vswitchd & done

After starting ovs-vswitchd the node will reboot again.
It's caused by PR https://github.com/openshift/machine-config-operator/pull/1846. That PR disables the openvswitch services in the MachineConfig when networkType is neither 'OpenShiftSDN' nor 'OVNKubernetes'. The Machine Config Operator then rolls out the changed MachineConfig, which triggers the reboot. I think we may need to add some validation for networkType in the CRD.
Yeah, this is more correctly described as "if you set the networkType to something other than OpenShiftSDN or OVNKubernetes, then we don't run components that are only required for OpenShiftSDN and OVNKubernetes"

> Steps to Reproduce:
> 1. oc patch networks.config.openshift.io cluster -p \{\"spec\":\{\"networkType\":\"bad\"\}\} --type=merge

If you set `networkType` to something that CNO doesn't support, then CNO will un-deploy the existing network plugin and stop trying to manage the network. As Casey said, this is necessary for third-party plugin support.

If the real original issue was "I accidentally made a typo and set networkType to `OVNKurbunets` and then everything broke" then we can think about how to be more resilient against stuff like that. If the real original issue was "I wanted to see what would happen if I set networkType to a bogus value", well, now you know...
The problem is that we don't know what the set of allowed values is. In hindsight, we probably should have required you to explicitly set some "I'm not using CNO" flag as well, but we can't add that now, for compatibility.
We don't need an admission controller; we can just report back via the operator Status. If someone asks us to do something we can't do, then we mark Degraded and set a message. This will happen so rarely in production clusters that I don't think we need to cover every possible case.
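A rough sketch of what "mark Degraded and set a message" could look like, in Go. Everything here is an assumption for illustration: the `condition` struct is a simplified stand-in rather than the real operator status type, `degradedCondition` is a hypothetical helper, and treating "a networkType change without an allowed migration" as the thing the operator can't do is my reading of the thread, not confirmed CNO behavior.

```go
package main

import "fmt"

// condition is a hypothetical, simplified stand-in for an operator status
// condition. It is not the real ClusterOperator API type.
type condition struct {
	Type    string
	Status  string
	Message string
}

// degradedCondition compares the requested networkType (from spec) with the
// one currently running. Assumption: a change can only be acted on when a
// migration has been explicitly allowed; otherwise the operator keeps the
// running plugin and reports Degraded instead of tearing anything down.
func degradedCondition(applied, requested string, migrationAllowed bool) condition {
	if requested == applied || migrationAllowed {
		return condition{Type: "Degraded", Status: "False"}
	}
	return condition{
		Type:   "Degraded",
		Status: "True",
		Message: fmt.Sprintf(
			"cannot change networkType from %q to %q; still running %q",
			applied, requested, applied),
	}
}

func main() {
	c := degradedCondition("OpenShiftSDN", "bad", false)
	fmt.Printf("%s=%s %s\n", c.Type, c.Status, c.Message)
}
```

The point of this shape is that the request is not rejected up front; the running network stays as it is and the admin sees why nothing changed.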
QE note: another testcase OCP-22419 for SDN-179 tests {"spec":{"networkType":"None"}}
Similar case with OCP-29299. On an SDN cluster we test setting {"spec":{"networkType":"OVNKubernetes"}} without setting the network-migration annotation.

Once the networkType is patched MCO enables

  name: ovs-configuration.service
  enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}

ovs-configuration.service probably should not be run unless the migration is set.
(In reply to Ross Brattain from comment #16)
> Similar case with OCP-29299. On an SDN cluster we test setting
> {"spec":{"networkType":"OVNKubernetes"}} without setting the
> network-migration annotation.
>
> Once the networkType is patched MCO enables
>
> name: ovs-configuration.service
> enabled: {{if eq .NetworkType "OVNKubernetes"}}true{{else}}false{{end}}
>
> ovs-configuration.service probably should not be run unless the migration is
> set.

With the current fix, the networkType in controllerconfigs.machineconfiguration.openshift.io follows the status of Network.config.openshift.io instead of the spec. During migration, CNO only updates that status when the annotation is set, so the fix also holds in the migration case. However, migration has other problems that still need to be fixed, so it does not work yet. I'm working on that.
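As an illustration of the spec-versus-status point, a small Go sketch follows. The `networkConfig` struct and both helper functions are hypothetical names made up for this example; only the "follow status, not spec" rule and the `OVNKubernetes` comparison come from this thread.

```go
package main

import "fmt"

// networkConfig is a hypothetical, trimmed-down view of the
// Network.config.openshift.io object.
type networkConfig struct {
	SpecNetworkType   string // what the admin requested
	StatusNetworkType string // what CNO reports as deployed
}

// controllerConfigNetworkType returns the value passed on to the
// MachineConfig templates. With the fix it follows status, not spec.
func controllerConfigNetworkType(n networkConfig) string {
	return n.StatusNetworkType
}

// ovsConfigurationEnabled mirrors the template condition quoted above.
func ovsConfigurationEnabled(networkType string) bool {
	return networkType == "OVNKubernetes"
}

func main() {
	// OCP-29299 scenario: spec patched, no migration annotation, so CNO
	// has not updated status.
	patched := networkConfig{
		SpecNetworkType:   "OVNKubernetes",
		StatusNetworkType: "OpenShiftSDN",
	}
	fmt.Println(ovsConfigurationEnabled(controllerConfigNetworkType(patched)))

	// Migration scenario: annotation set, CNO has updated status.
	migrated := networkConfig{
		SpecNetworkType:   "OVNKubernetes",
		StatusNetworkType: "OVNKubernetes",
	}
	fmt.Println(ovsConfigurationEnabled(controllerConfigNetworkType(migrated)))
}
```

In the first case the service stays disabled because status still says OpenShiftSDN; in the second it is enabled once status changes.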
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196