Bug 2061641 - Workers failed to join cluster during deployment when dual stack with infinite lease for ctr-plane was configured
Summary: Workers failed to join cluster during deployment when dual stack with infinite lease for ctr-plane was configured
Keywords:
Status: CLOSED DUPLICATE of bug 2038249
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-08 06:15 UTC by Victor Voronkov
Modified: 2022-03-15 17:26 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-15 17:26:05 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-install log (21.02 KB, text/plain)
2022-03-08 06:15 UTC, Victor Voronkov
no flags

Description Victor Voronkov 2022-03-08 06:15:49 UTC
Created attachment 1864514 [details]
openshift-install log

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          22h     Unable to apply 4.9.0-0.nightly-2022-03-05-200607: some cluster operators have not yet rolled out


Platform: baremetal IPI

When deploying a cluster on a dual-stack ctr-plane with an infinite lease defined on the DHCP server, the installer fails: the Ingress operator is degraded, and the workers are provisioned but do not join the cluster.

must-gather and install log attached
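
For context, on a dnsmasq-based DHCP server an "infinite" lease is typically expressed as the lease-time field of dhcp-range. The ranges below are purely illustrative, not the lab's actual configuration:

dhcp-range=192.168.111.20,192.168.111.60,infinite
dhcp-range=fd2e:6f44:5dd8:c956::10,fd2e:6f44:5dd8:c956::99,64,infinite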

Comment 1 Victor Voronkov 2022-03-08 06:18:39 UTC
Notes:
- The issue was reproduced at least 3 times in a row on 4.9.23 as well as on the nightly build
- Dual stack ctr-plane with a regular DHCP lease was deployed successfully
- Summary from the must-gather:

ClusterID: ecd6af98-7afb-44aa-a5ab-c5eb246889d8
ClusterVersion: Installing "4.9.0-0.nightly-2022-03-05-200607" for 10 hours: Unable to apply 4.9.0-0.nightly-2022-03-05-200607: some cluster operators have not yet rolled out
ClusterOperators:
	clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: route.route.openshift.io "oauth-openshift" not found
OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.209.118:443/healthz": dial tcp 172.30.209.118:443: connect: connection refused
OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)) because IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route oauth-openshift in namespace openshift-authentication
OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
OAuthServerDeploymentDegraded: 
OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a host address
OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.209.118:443/healthz": dial tcp 172.30.209.118:443: connect: connection refused
OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
	clusteroperator/console is not available (RouteHealthAvailable: console route is not admitted) because DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route console in namespace openshift-console
RouteHealthDegraded: console route is not admitted
SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com in route console in namespace openshift-console
	clusteroperator/ingress is not available (The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)) because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-5c66c775-786zq" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-5c66c775-v2t6q" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/2 of replicas are available)
	clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
updating thanos querier: waiting for Thanos Querier Route to become ready failed: waiting for route openshift-monitoring/thanos-querier: no status available
updating prometheus-k8s: waiting for Prometheus Route to become ready failed: waiting for route openshift-monitoring/prometheus-k8s: no status available
updating grafana: waiting for Grafana Route to become ready failed: waiting for route openshift-monitoring/grafana: no status available
updating kube-state-metrics: reconciling kube-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/kube-state-metrics: got 1 unavailable replicas
updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 2 unavailable replicas
	clusteroperator/network is progressing: Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready

Comment 3 Victor Voronkov 2022-03-08 10:27:05 UTC
I would not define this bug as a blocker, since this configuration (dual stack on the ctr-plane combined with an infinite DHCP lease on that network) is not a major customer scenario. Still, it looks like the mechanism that converts an infinite lease into a static IP breaks the workers' ability to join the cluster, and as a result the Ingress setup.
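
A minimal way to check whether that conversion actually happened on a host (a sketch only, assuming console or SSH access to the worker; the connection name is a placeholder):

[core@worker-0 ~]$ nmcli -f NAME,DEVICE,TYPE connection show
[core@worker-0 ~]$ nmcli connection show "<connection-name>" | grep -E 'ipv[46]\.method'
# ipv4.method/ipv6.method set to "manual" would indicate the lease was converted to a
# static configuration; "auto" means the interface is still using DHCP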

Comment 4 Victor Voronkov 2022-03-08 10:28:16 UTC
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE   VERSION
master-0-0   Ready    master   26h   v1.22.5+5c84e52
master-0-1   Ready    master   26h   v1.22.5+5c84e52
master-0-2   Ready    master   26h   v1.22.5+5c84e52
[kni@provisionhost-0-0 ~]$ oc get baremetalhost -n openshift-machine-api
NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-k8vck-master-0         true     
openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-k8vck-master-1         true     
openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-k8vck-master-2         true     
openshift-worker-0-0   provisioned              ocp-edge-cluster-0-k8vck-worker-0-vhqmc   true     
openshift-worker-0-1   provisioned              ocp-edge-cluster-0-k8vck-worker-0-jrjfr   true     
[kni@provisionhost-0-0 ~]$ oc get machines -n openshift-machine-api
NAME                                      PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-k8vck-master-0         Running                              26h
ocp-edge-cluster-0-k8vck-master-1         Running                              26h
ocp-edge-cluster-0-k8vck-master-2         Running                              26h
ocp-edge-cluster-0-k8vck-worker-0-jrjfr   Provisioned                          26h
ocp-edge-cluster-0-k8vck-worker-0-vhqmc   Provisioned                          26h

Comment 5 Bob Fournier 2022-03-08 17:30:59 UTC
Reassigning to see if the networking team can take a look at this.

Comment 6 Ben Nemec 2022-03-09 18:31:39 UTC
It looks like this might be a hostname issue. The kubelet logs are full of:

"Error getting node" err="node \"worker-0\" not found"

But when I look at the BMH for worker-0, I see:

hostname: worker-0.ostest.test.metalkube.org

So it looks like the BMH picked the FQDN and kubelet picked the short name for some reason.

I'm not really clear on why infinite leases would have anything to do with the hostname selection process, though. I also see the same mismatch for the masters, so I might just be wrong. However, I know the master deployment process is a little different from the workers', so maybe that's why. I need to try a deployment without infinite leases and see if the same thing happens.
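
A quick way to compare the two names (a sketch only; the BMH status field path and CSR fields are from memory and may differ slightly by release):

$ oc get bmh -n openshift-machine-api openshift-worker-0-0 -o jsonpath='{.status.hardware.hostname}{"\n"}'
$ oc get csr -o jsonpath='{range .items[*]}{.spec.username}{"\n"}{end}' | sort -u
# kubelet CSRs show up as system:node:<name>, which reveals the name kubelet is using
# and on the worker host itself (console/SSH):
[core@worker-0 ~]$ hostnamectl --static; hostnamectl --transient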

Comment 7 Ben Nemec 2022-03-10 17:55:27 UTC
Yeah, in a dual stack deployment without infinite leases, we get FQDNs for the node names:

$ oc get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    master   18h   v1.22.5+5c84e52

With infinite leases we get short names:

$ oc get nodes
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   18h   v1.22.5+5c84e52

I'm not sure what would have changed from 4.9 to 4.10 to trigger this. I think we have the same DHCP6 logic, which means we should be getting FQDNs anytime an IPv6 DHCP address is involved.
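
One way to confirm where a host's name actually came from (illustrative commands only; NetworkManager log wording varies by version):

[core@master-0 ~]$ hostnamectl
[core@master-0 ~]$ journalctl -u NetworkManager | grep -i hostname
# with a finite DHCPv6 lease one would expect the FQDN to arrive via DHCP/DNS and be set
# as the transient hostname; a short name here is consistent with that path being skipped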

Comment 8 Victor Voronkov 2022-03-15 17:26:05 UTC

*** This bug has been marked as a duplicate of bug 2038249 ***

