Bug 1901355

Summary: [Azure][4.7] Invalid vm size from customized compute nodes does not fail properly
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Version: 4.7
Target Release: 4.7.0
Reporter: Etienne Simard <esimard>
Assignee: Jeremiah Stuever <jstuever>
QA Contact: Etienne Simard <esimard>
CC: bleanhar, jstuever
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:35:48 UTC

Description Etienne Simard 2020-11-24 22:54:03 UTC
Version:

./openshift-install 4.7.0-0.nightly-2020-11-20-234717
built from commit 68282c185253d4831514b20623b1717535c5e6f2
release image registry.svc.ci.openshift.org/ocp/release@sha256:b8667356942dce0e049d44470ba94f0dc1fa64876b324621cfb13c4fb25b9069


Platform:

Azure

Please specify:

IPI with custom install-config.yaml

What happened?

I entered an invalid VM type/size in install-config.yaml for the workers only.

The worker NIC was still created in Azure, and the installer did not fail gracefully with a proper error message; instead, the install timed out waiting on degraded cluster operators:

~~~
INFO Waiting up to 40m0s for the cluster at https://api.esimardwrk03.qe.azure.devcluster.openshift.com:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2020-11-20-234717: 97% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2020-11-20-234717: 98% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.7.0-0.nightly-2020-11-20-234717: 98% complete, waiting on authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring 
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_DeploymentAvailableReplicasCheckFailed::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost::WellKnownReadyController_SyncError: OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.54.209:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
ERROR OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready 
ERROR IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server 
ERROR OAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address 
ERROR OAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps "oauth-openshift" not found 
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this) 
ERROR OAuthServerDeploymentDegraded: deployments.apps "oauth-openshift" not found 
ERROR OAuthServerRouteDegraded: no ingress for host oauth-openshift.apps.esimardwrk03.qe.azure.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication 
ERROR RouteDegraded: no ingress for host oauth-openshift.apps.esimardwrk03.qe.azure.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication 
INFO Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionDeployment_MissingDeployment::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints 
INFO ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods). 
INFO OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.54.209:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
INFO WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this) 
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform 
INFO Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route "console" is not available at canonical host [] 
INFO OAuthClientSyncProgressing: route "console" is not available at canonical host [] 
INFO Cluster operator console Available is Unknown with NoData:  
INFO Cluster operator image-registry Available is False with NoReplicasAvailable: Available: The deployment does not have available replicas 
INFO ImagePrunerAvailable: Pruner CronJob has been created 
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed 
INFO Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available. 
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available. 
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-86d99b9467-scc4m" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Pod "router-default-86d99b9467-5bzng" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1) 
INFO Cluster operator insights Disabled is False with AsExpected:  
INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available 
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available 
INFO Cluster operator monitoring Available is False with :  
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring
~~~
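For reference, the leftover worker NIC can be confirmed in the cluster resource group with the Azure CLI. A minimal check, assuming the installer's usual `<infra-id>-rg` resource group naming (the group name below is illustrative):

~~~
# List the NICs left behind in the cluster resource group (illustrative name)
az network nic list --resource-group esimardwrk03-xxxxx-rg --output table
~~~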

What did you expect to happen?

Fail gracefully, similarly to a control plane (master) provisioning error, without provisioning the NIC.


Equivalent error on a control plane node:
~~~
ERROR Error: Error creating Linux Virtual Machine "esimardmst04-ccpbd-master-0" (Resource Group "esimardmst04-ccpbd-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The value does_not_exist provided for the VM size is not valid.
~~~

Ideally, the invalid VM size would be caught by install-config validation before any cloud resources are deployed.
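As a sketch of what such a pre-flight check could look like from the command line, the Azure CLI can list the VM sizes actually offered in a region, so a bogus `type` value can be spotted before anything is provisioned (region and size taken from this report; the `--query` filter is illustrative):

~~~
# List all VM sizes available in the target region
az vm list-sizes --location northcentralus --output table

# Filter for a specific size; an empty result means the size is not valid there
az vm list-sizes --location northcentralus --query "[?name=='does_not_exist']"
~~~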


How to reproduce it (as minimally and precisely as possible)?

Generate an install-config.yaml file and configure a custom worker VM type that does not exist:

~~~
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: does_not_exist # any name that is not a valid Azure VM size
  replicas: 3
~~~
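With that in place, the full reproduction is just the normal IPI flow (the asset directory name below is illustrative):

~~~
# Generate a default install-config.yaml, then edit the compute section as above
./openshift-install create install-config --dir mycluster

# Attempt the install; before the fix this provisions resources and times out,
# after the fix it fails fast during install-config validation
./openshift-install create cluster --dir mycluster --log-level debug
~~~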

Comment 1 Jeremiah Stuever 2020-12-02 18:38:05 UTC
This should be handled with
https://issues.redhat.com/browse/CORS-1549
https://github.com/openshift/installer/pull/4419

Comment 2 Etienne Simard 2020-12-07 22:35:11 UTC
Hello Jeremiah,

I tested it with a nightly build and it works. 
Nit: I noticed that the beginning of the error message refers to `Master Machines` for both the compute and the control nodes.

`level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": compute[0].platform.azure.type: Invalid value: "potatoes": not found in region northcentralus`

Should the first part be adjusted?

Comment 3 Jeremiah Stuever 2020-12-09 19:06:02 UTC
I do not believe so... the string "Master Machines" comes from pre-existing code outside the scope of this bz.

Comment 5 Etienne Simard 2020-12-09 21:00:09 UTC
Verified with: 4.7.0-0.nightly-2020-12-04-013308

./openshift-install 4.7.0-0.nightly-2020-12-04-013308
built from commit b9701c56ece235c8a988530816aac84980a91bdd
release image registry.svc.ci.openshift.org/ocp/release@sha256:2352dfe2655dcc891e3c09b4c260b9e346e930ee4dcdc96c6a7fd003860ef100

~~~
...
info msg=Credentials loaded from file ...
fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": compute[0].platform.azure.type: Invalid value: "potatoes": not found in region northcentralus
~~~

Comment 8 errata-xmlrpc 2021-02-24 15:35:48 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633