Bug 1956251

Summary: [4.8] [IPI] [Provider Network Secondary NIC] Installation failed on worker instance creation (duplicate workers)
Product: OpenShift Container Platform
Reporter: Udi Shkalim <ushkalim>
Component: Installer
Installer sub component: OpenShift on OpenStack
Assignee: egarcia
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
CC: egarcia
Version: 4.8
Keywords: TestBlocker
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-05-05 15:37:51 UTC
Type: Bug

Description Udi Shkalim 2021-05-03 10:02:06 UTC
Version:

$ openshift-install version
openshift-install 4.8.0-0.nightly-2021-04-26-151924
built from commit fdf0269c32e313b6213ec36a84a5a849b43318f4
release image registry.ci.openshift.org/ocp/release@sha256:45a9d4b7e2a8e55ff20f1518436e965663161fab7cc3bec7e6c49376e2b62711

Platform:
OpenStack

Please specify:
* IPI

The installer fails to create the worker instances in OpenStack; all three workers go to ERROR with no networks attached:
(shiftstack) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| 712c46bf-6d2c-4779-8a09-9637007e1ce9 | ostest-8cr65-worker-0-tgndg | ERROR  |                                     | ostest-8cr65-rhcos |        |
| f3033a66-e0d2-42d8-bd75-190f934ebd92 | ostest-8cr65-worker-0-9mbf8 | ERROR  |                                     | ostest-8cr65-rhcos |        |
| bcbbf313-177a-467d-90b7-8f264d0b2021 | ostest-8cr65-worker-0-6dpb9 | ERROR  |                                     | ostest-8cr65-rhcos |        |
| d8c1e447-d43c-43d6-bebe-dde8a8e12a9e | ostest-8cr65-master-2       | ACTIVE | ostest-8cr65-openshift=10.196.3.238 | ostest-8cr65-rhcos |        |
| 269ab17a-3216-4dd7-a966-7efa92ef74e5 | ostest-8cr65-master-1       | ACTIVE | ostest-8cr65-openshift=10.196.0.166 | ostest-8cr65-rhcos |        |
| 00d90ccf-8b92-4c03-865a-f51973371936 | ostest-8cr65-master-0       | ACTIVE | ostest-8cr65-openshift=10.196.3.229 | ostest-8cr65-rhcos |        |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
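
The empty Networks column on the three ERROR'd workers points at port creation/attachment. The failure reason Nova recorded can be read from the fault field of any of the instances above, for example:

$ openstack server show 712c46bf-6d2c-4779-8a09-9637007e1ce9 -f value -c fault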



time="2021-04-28T15:25:33Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-04-26-151924: 647 of 677 done (95% complete)"
time="2021-04-28T15:25:34Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-04-26-151924: 649 of 677 done (95% complete)"
time="2021-04-28T15:27:30Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-apiserver, monitoring"
time="2021-04-28T15:29:15Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-04-26-151924: 654 of 677 done (96% complete)"
time="2021-04-28T15:30:30Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2021-04-28T15:36:24Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2021-04-28T15:54:30Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.8.0-0.nightly-2021-04-26-151924: 655 of 677 done (96% complete)"
time="2021-04-28T15:57:00Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2021-04-28T15:58:34Z" level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::Route_InvalidCanonicalHost::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.ostest.shiftstack.com in route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: \nOAuthServerRouteEndpointAccessibleControllerDegraded: route \"openshift-authentication/oauth-openshift\": status does not have a host address\nOAuthServerServiceEndpointAccessibleControllerDegraded: Get \"https://172.30.178.123:443/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\nOAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nRouteDegraded: no ingress for host oauth-openshift.apps.ostest.shiftstack.com in route oauth-openshift in namespace openshift-authentication\nWellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap \"oauth-openshift\" not found (check authentication operator, it is supposed to create this)"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerServiceEndpointAccessibleControllerAvailable: Get \"https://172.30.178.123:443/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\nOAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints \"oauth-openshift\" not found\nReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).\nWellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap \"oauth-openshift\" not found (check authentication operator, it is supposed to create this)"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator console Available is Unknown with NoData: "
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator image-registry Available is False with NoReplicasAvailable: Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed"
time="2021-04-28T15:58:34Z" level=error msg="Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not have available replicas"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
time="2021-04-28T15:58:34Z" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-84fb65d5cd-xzgrg\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod \"router-default-84fb65d5cd-c9vgw\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2021-04-28T15:58:34Z" level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator monitoring Available is False with UpdatingconfigurationsharingFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error."
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator network ManagementStateDegraded is False with : "
time="2021-04-28T15:58:34Z" level=info msg="Cluster operator network Progressing is True with Deploying: Deployment \"openshift-network-diagnostics/network-check-source\" is not available (awaiting 1 nodes)"
time="2021-04-28T15:58:34Z" level=error msg="Cluster initialization failed because one or more operators are not functioning properly.\nThe cluster should be accessible for troubleshooting as detailed in the documentation linked below,\nhttps://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html\nThe 'wait-for install-complete' subcommand can then be used to continue the installation"
time="2021-04-28T15:58:34Z" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"




The must-gather archive is too large to attach.
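
For reference, a typical way to collect and compress the diagnostics to fit attachment limits (paths are illustrative, default gather image assumed):

$ oc adm must-gather --dest-dir=./must-gather
$ tar -cJf must-gather.tar.xz must-gather/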

ClusterID: 32d672b8-cd10-4ce3-9b2e-308bfad98f86
ClusterVersion: Installing "4.8.0-0.nightly-2021-04-26-151924" for 4 days: Unable to apply 4.8.0-0.nightly-2021-04-26-151924: some cluster operators have not yet rolled out
ClusterOperators:
	clusteroperator/authentication is not available (OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.178.123:443/healthz": dial tcp 172.30.178.123:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)) because IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.ostest.shiftstack.com in route oauth-openshift in namespace openshift-authentication
OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
OAuthServerDeploymentDegraded:
OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a host address
OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.178.123:443/healthz": dial tcp 172.30.178.123:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
RouteDegraded: no ingress for host oauth-openshift.apps.ostest.shiftstack.com in route oauth-openshift in namespace openshift-authentication
WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
	clusteroperator/console is not available () because All is well
	clusteroperator/image-registry is not available (Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created) because Degraded: The deployment does not have available replicas
	clusteroperator/ingress is not available (Not all ingress controllers are available.) because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-84fb65d5cd-xzgrg" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-84fb65d5cd-c9vgw" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/2 of replicas are available)
	clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
	clusteroperator/network is progressing: Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
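
The same operator summary can be pulled from the live cluster at any time, using the kubeconfig the installer wrote (<install-dir> is a placeholder):

$ export KUBECONFIG=<install-dir>/auth/kubeconfig
$ oc get clusteroperators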




What did you expect to happen?
The installation should succeed.

How to reproduce it (as minimally and precisely as possible)?
Reproducible 100% of the time with the install-config below.

Anything else we need to know?

install-config:
# This file is autogenerated by infrared openshift plugin
apiVersion: v1
baseDomain: "shiftstack.com"
clusterID:  "2539313a-8f78-5205-919e-ed0d8ae76763"
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['1ae8b499-30de-4122-b355-d6562dbfd0a9']
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: []
  replicas: 3
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr:             10.128.0.0/14
    hostSubnetLength: 9
  serviceCIDR: 172.30.0.0/16
  machineCIDR: 10.196.0.0/16
  type: "Kuryr"
platform:
  openstack:
    cloud:            "shiftstack"
    externalNetwork:  "nova"
    region:           "regionOne"
    computeFlavor:    "m4.xlarge"
    lbFloatingIP:     "10.46.22.183"
    ingressFloatingIP:     "10.46.22.178"
    externalDNS:      ["10.46.0.31"]
pullSecret: |

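To rule out an environment problem, the provider network wired in through additionalNetworkIDs can be checked directly; the ID below is copied from the install-config above:

$ openstack network show 1ae8b499-30de-4122-b355-d6562dbfd0a9
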
Comment 1 egarcia 2021-05-03 19:19:23 UTC
I see that your worker nodes all failed to come up. When you view those instances in OpenStack, what failure reason does it give? The installer is failing because it can't deploy the OpenShift services without workers, unless you make the masters schedulable.
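
(For the record, one known way to make the masters schedulable, as a stopgap so ingress pods can land on them, is to patch the cluster scheduler config:

$ oc patch schedulers.config.openshift.io cluster --type merge -p '{"spec":{"mastersSchedulable":true}}'

That only works around the missing workers; it doesn't fix them.)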

Comment 2 Udi Shkalim 2021-05-04 08:57:40 UTC
So I think this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1955969

Comment 3 egarcia 2021-05-04 13:39:16 UTC
Ah yes, there seems to be a regression in that right now. To unblock yourself, try to work around the issue by using the ports directive instead, as sketched below: https://github.com/openshift/cluster-api-provider-openstack/blob/f17f967e01975ac0142f4d8a8011c9caca8b61d6/pkg/apis/openstackproviderconfig/v1alpha1/types.go#L201-L234
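
A minimal sketch of that workaround in a worker MachineSet's spec.template.spec.providerSpec.value, assuming the PortOpts fields from the linked types.go (networkID, nameSuffix, fixedIPs); the network ID is the one from the install-config and <provider-subnet-id> is a placeholder:

ports:
- networkID: 1ae8b499-30de-4122-b355-d6562dbfd0a9  # the secondary provider network
  nameSuffix: provider
  fixedIPs:
  - subnetID: <provider-subnet-id>  # placeholder: a subnet on the provider network

This would sit alongside the existing primary machine network definition; the exact shape should be checked against the linked types.go.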

Comment 4 egarcia 2021-05-05 15:37:51 UTC
Closing as duplicate. https://bugzilla.redhat.com/show_bug.cgi?id=1955969 is also marked as a test blocker.

*** This bug has been marked as a duplicate of bug 1955969 ***