Bug 1877984

Summary: using OpenshiftSDN in install-config causes install failure post bootstrap
Product: OpenShift Container Platform Reporter: Greg Sheremeta <gshereme>
Component: Networking Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: danw, dhellmann, eparis, fpan, jsica, ricarril, sdodson, wking, yanyang
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:40:07 UTC Type: Bug

Description Greg Sheremeta 2020-09-11 00:56:42 UTC
Description of problem:

Follow-up to BZ 1877481.

In 4.5, `networkType: OpenshiftSDN` worked.
In 4.6, it results in an installation failure with hard-to-understand messages.
Using `networkType: OpenShiftSDN` works in 4.6. (The only difference is the capital S in "Shift".)

The install log usually looks like this:
time="2020-09-10T20:55:25Z" level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-09-10-145837"
time="2020-09-10T20:55:25Z" level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.6.0-0.nightly-2020-09-10-145837: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-161-165.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-64cb83bf095afac90544003fc5b9f2b6\\\\\\\" not found\\\", Node ip-10-0-244-171.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-64cb83bf095afac90544003fc5b9f2b6\\\\\\\" not found\\\", Node ip-10-0-230-197.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-64cb83bf095afac90544003fc5b9f2b6\\\\\\\" not found\\\"\", retrying"
time="2020-09-10T20:55:25Z" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-09-10-145837"
time="2020-09-10T20:55:25Z" level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"
time="2020-09-10T20:55:26Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=hgbr6ffn

There should instead be some upfront validation. Or perhaps the difference in case shouldn't break the install, as was true in 4.5.

Version-Release number of selected component (if applicable):
4.6 nightly

How reproducible:
always


Steps to Reproduce:
1. Set `networkType: OpenshiftSDN` in the install-config YAML.
2. Run the install.

Actual results:
failure

Expected results:
successful install

Additional info:
Follow-up to BZ 1877481.

Comment 1 Eric Paris 2020-09-17 16:56:55 UTC
This is a user-facing API that changed, regressed, and broke real customers. This is a 4.6 blocker.

Comment 2 Scott Dodson 2020-09-17 17:05:28 UTC
*** Bug 1877481 has been marked as a duplicate of this bug. ***

Comment 3 Dan Winship 2020-09-17 17:16:10 UTC
CNO normalizes the value to its canonical form in the network config Status, but MCO now has code that reads the network config Spec (createDiscoveredControllerConfigSpec() in machine-config-operator/pkg/operator/render.go) because it wants to set up system OVS correctly from the get-go. So I guess we need to be case-insensitive there too.
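
For illustration only, a case-insensitive lookup of this kind could be sketched in Go as below. This is not the actual MCO or CNO code. `canonicalNetworkTypes` and `canonicalizeNetworkType` are hypothetical names, and the list contents are an assumption. Unknown values pass through unchanged, so other network plugins are not affected.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalNetworkTypes holds canonical spellings; hypothetical for this sketch.
var canonicalNetworkTypes = []string{"OpenShiftSDN", "OVNKubernetes"}

// canonicalizeNetworkType returns the canonical spelling when value matches a
// known type ignoring case; otherwise it returns value unchanged.
func canonicalizeNetworkType(value string) string {
	for _, canonical := range canonicalNetworkTypes {
		if strings.EqualFold(value, canonical) {
			return canonical
		}
	}
	return value
}

func main() {
	fmt.Println(canonicalizeNetworkType("OpenshiftSDN"))    // prints "OpenShiftSDN"
	fmt.Println(canonicalizeNetworkType("openshiftsdn"))    // prints "OpenShiftSDN"
	fmt.Println(canonicalizeNetworkType("SomeOtherPlugin")) // prints "SomeOtherPlugin"
}
```

`strings.EqualFold` avoids allocating lowercased copies, and returning the canonical string means later exact comparisons keep working.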

Comment 4 Ricardo Carrillo Cruz 2020-09-21 15:11:26 UTC
Agreed on a draft PR to fix this at the installer layer, thus changing the component and resetting the status from POST to NEW:

https://github.com/openshift/machine-config-operator/pull/2101

Comment 5 Scott Dodson 2020-09-22 15:50:26 UTC
The install-config API for this field is an opaque string. Therefore, the canonicalization should not happen in the installer. Moving back to the networking component.

$ openshift-install explain installconfig.networking.networkType
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  NetworkType is the type of network to install. The default is OpenShiftSDN

Comment 7 zhaozhanqi 2020-09-25 05:12:24 UTC
Verified this bug on 4.6.0-0.nightly-2020-09-24-095222;
installing with 'OpenshiftSDN' also works.

Comment 10 errata-xmlrpc 2020-10-27 16:40:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196