Created attachment 1864514 [details]
openshift-install log

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          22h     Unable to apply 4.9.0-0.nightly-2022-03-05-200607: some cluster operators have not yet rolled out

Platform: baremetal IPI

When deploying a cluster on a dual-stack control plane with an infinite lease defined on the DHCP server, the installer fails: the Ingress operator is degraded, and the workers are provisioned but never join the cluster.

must-gather and install log attached
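For reference, an infinite lease on the control-plane network can be expressed in dnsmasq roughly as below. This is only a sketch: the file name and the IPv4/IPv6 ranges are placeholders, not the actual lab configuration; `infinite` in the lease-time field is the setting that triggers the failure described above.

```shell
# Hypothetical dnsmasq fragment (placeholder ranges, written to the current
# directory for illustration); "infinite" is the lease-time field.
cat > ./dnsmasq-ctlplane.conf <<'EOF'
dhcp-range=192.168.111.20,192.168.111.60,infinite
dhcp-range=fd00:1101::20,fd00:1101::60,64,infinite
EOF
grep -c infinite ./dnsmasq-ctlplane.conf
```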
Notes:
- Issue was reproduced at least 3 times in a row, on 4.9.23 as well as on the nightly
- Dual-stack control plane with a regular DHCP lease was deployed successfully

Summary from must-gather:

ClusterID: ecd6af98-7afb-44aa-a5ab-c5eb246889d8
ClusterVersion: Installing "4.9.0-0.nightly-2022-03-05-200607" for 10 hours: Unable to apply 4.9.0-0.nightly-2022-03-05-200607: some cluster operators have not yet rolled out
ClusterOperators:

clusteroperator/authentication is not available (
  OAuthServerRouteEndpointAccessibleControllerAvailable: route.route.openshift.io "oauth-openshift" not found
  OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.209.118:443/healthz": dial tcp 172.30.209.118:443: connect: connection refused
  OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
  ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
  WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this))
because
  IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
  OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route oauth-openshift in namespace openshift-authentication
  OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
  OAuthServerDeploymentDegraded:
  OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a host address
  OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.209.118:443/healthz": dial tcp 172.30.209.118:443: connect: connection refused
  OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
  WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)

clusteroperator/console is not available (RouteHealthAvailable: console route is not admitted)
because
  DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route console in namespace openshift-console
  RouteHealthDegraded: console route is not admitted
  SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route console in namespace openshift-console

clusteroperator/ingress is not available (The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable:
  DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.))
because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state:
  PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-5c66c775-786zq" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-5c66c775-v2t6q" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.),
  DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.),
  DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1),
  DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/2 of replicas are available)

clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.)
because Failed to rollout the stack. Error:
  updating alertmanager: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
  updating thanos querier: waiting for Thanos Querier Route to become ready failed: waiting for route openshift-monitoring/thanos-querier: no status available
  updating prometheus-k8s: waiting for Prometheus Route to become ready failed: waiting for route openshift-monitoring/prometheus-k8s: no status available
  updating grafana: waiting for Grafana Route to become ready failed: waiting for route openshift-monitoring/grafana: no status available
  updating kube-state-metrics: reconciling kube-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/kube-state-metrics: got 1 unavailable replicas
  updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
  updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 2 unavailable replicas

clusteroperator/network is progressing: Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
I would not define this bug as a blocker, since this configuration is not a major customer scenario (both dual stack on the control plane and an infinite DHCP lease on that network). Still, it looks like the mechanism that converts an infinite lease into a static IP breaks the workers' ability to join the cluster, and the Ingress setup as a result.
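As an illustration of the suspected mechanism only (this is an assumption about the failure mode, not actual installer code): once the infinite lease is converted to static addressing, a DHCP-supplied FQDN would no longer be available, so hostname selection would fall back to the short name.

```shell
# Illustration (assumed behavior, not installer code): hostname selection
# with and without a DHCP-provided FQDN.
pick_hostname() {
  dhcp_fqdn="$1"   # empty once the lease has been converted to a static IP
  fallback="$2"    # short name from some other source (e.g. reverse DNS)
  if [ -n "$dhcp_fqdn" ]; then echo "$dhcp_fqdn"; else echo "$fallback"; fi
}
pick_hostname "worker-0.ostest.test.metalkube.org" "worker-0"  # regular lease
pick_hostname "" "worker-0"                                    # infinite lease made static
```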
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE   VERSION
master-0-0   Ready    master   26h   v1.22.5+5c84e52
master-0-1   Ready    master   26h   v1.22.5+5c84e52
master-0-2   Ready    master   26h   v1.22.5+5c84e52

[kni@provisionhost-0-0 ~]$ oc get baremetalhost -n openshift-machine-api
NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-k8vck-master-0         true
openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-k8vck-master-1         true
openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-k8vck-master-2         true
openshift-worker-0-0   provisioned              ocp-edge-cluster-0-k8vck-worker-0-vhqmc   true
openshift-worker-0-1   provisioned              ocp-edge-cluster-0-k8vck-worker-0-jrjfr   true

[kni@provisionhost-0-0 ~]$ oc get machines -n openshift-machine-api
NAME                                      PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-k8vck-master-0         Running                              26h
ocp-edge-cluster-0-k8vck-master-1         Running                              26h
ocp-edge-cluster-0-k8vck-master-2         Running                              26h
ocp-edge-cluster-0-k8vck-worker-0-jrjfr   Provisioned                          26h
ocp-edge-cluster-0-k8vck-worker-0-vhqmc   Provisioned                          26h
Reassigning to see if the networking team can take a look at this.
It looks like this might be a hostname issue. The kubelet logs are full of:

  "Error getting node" err="node \"worker-0\" not found"

But when I look at the BMH for worker-0, I see:

  hostname: worker-0.ostest.test.metalkube.org

So it looks like the BMH picked the FQDN and kubelet picked the short name for some reason. I'm not really clear why infinite leases would have anything to do with the hostname selection process, though. I also see the same mismatch for masters, so I might just be wrong too. However, I know the master deployment process is a little different from workers, so maybe that's why? I need to try a deployment without infinite leases and see if the same thing is happening.
Yeah, in a dual-stack deployment without infinite leases, we get FQDNs for the node names:

$ oc get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    master   18h   v1.22.5+5c84e52

With infinite leases we get short names:

$ oc get nodes
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   18h   v1.22.5+5c84e52

I'm not sure what would have changed from 4.9 to 4.10 to trigger this. I think we have the same DHCPv6 logic, which means we should be getting FQDNs anytime there is an IPv6 DHCP address involved.
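A quick sanity check of the mismatch (the hostnames are taken from the logs above; the comparison itself is just shell string handling): the kubelet's node name is exactly the first DNS label of the FQDN the BMH reports, so this is one host registered under two name forms, and lookups by one form miss the other.

```shell
# Values observed above: kubelet registered the short name, the BMH carries
# the FQDN. Strip everything after the first dot and compare.
node_from_kubelet="worker-0"                        # from the kubelet logs
bmh_hostname="worker-0.ostest.test.metalkube.org"   # from the BMH status
[ "${bmh_hostname%%.*}" = "$node_from_kubelet" ] && echo "same host, different name forms"
```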
*** This bug has been marked as a duplicate of bug 2038249 ***