Bug 2102158 - Unable to deploy 4.11 Dual Stack in hybrid cluster with two bare metal workers
Summary: Unable to deploy 4.11 Dual Stack in hybrid cluster with two bare metal workers
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Derek Higgins
QA Contact: Amit Ugol
URL:
Whiteboard:
Duplicates: 2102157 (view as bug list)
Depends On:
Blocks: 2100035 2110029
 
Reported: 2022-06-29 12:12 UTC by Greg Kopels
Modified: 2022-11-21 10:49 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2110029 (view as bug list)
Environment:
Last Closed: 2022-11-21 10:49:18 UTC
Target Upstream Version:
Embargoed:



Description Greg Kopels 2022-06-29 12:12:53 UTC
Description of problem:
Unable to deploy 4.11 dualstack in hybrid cluster with two bare metal workers.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-25-193227

How reproducible:
Easily reproducible in our CI pipeline. Deploy OCP with dual stack.

Steps to Reproduce:
1. Deploy OCP 4.11 version configured as dual stack cluster

Actual results:
Masters are in status NotReady and workers don't come up.

Expected results:
Deploy dual stack cluster.

Additional info:
In the same environment I am able to deploy 4.10 dual stack clusters.

Comment 1 Greg Kopels 2022-06-29 12:14:00 UTC
Deployment logs can be found here:
https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/view/CNF-core/job/CNF/job/test-ocp-general-cnf-core-playground/32/

Kubernetes API at https://api.hlxcl7.lab.eng.tlv2.redhat.com:6443.
22:33:55  W0628 22:33:54.934837 4150634 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ConfigMap: Get "https://api.hlxcl7.lab.eng.tlv2.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=9541&timeoutSeconds=301&watch=true": dial tcp [2620:52:0:2e38::700]:6443: connect: no route to host
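The client-go error above shows the installer has no route to the IPv6 API VIP. A small helper to pull the unreachable target out of such "dial tcp" lines (the helper name is ours, for illustration; the sample line is copied from the log above):

```shell
# Sketch: extract the unreachable target from a client-go "dial tcp" error.
# extract_dial_target is our own helper name, for illustration only.
extract_dial_target() {
  sed -n 's/.*dial tcp \(\[[^]]*\]:[0-9]*\).*/\1/p'
}

echo 'dial tcp [2620:52:0:2e38::700]:6443: connect: no route to host' \
  | extract_dial_target
# prints [2620:52:0:2e38::700]:6443
```

From a failing host, `curl -gk https://[2620:52:0:2e38::700]:6443/readyz` against the extracted address would confirm whether the "no route to host" is still reproducible.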

Comment 2 Derek Higgins 2022-06-30 11:12:44 UTC
*** Bug 2102157 has been marked as a duplicate of this bug. ***

Comment 3 Derek Higgins 2022-06-30 11:19:58 UTC
On one of the failing masters the ovnkube-node container is failing

[root@hlxcl7-master-1 core]# crictl logs 3a6f95d2a0507 |& tail
I0629 15:47:48.616774   28984 ovs.go:206] Exec(5): stderr: ""
I0629 15:47:48.616796   28984 ovs.go:202] Exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=c6\:05\:2d\:06\:c8\:28
I0629 15:47:48.621688   28984 ovs.go:205] Exec(6): stdout: ""
I0629 15:47:48.621718   28984 ovs.go:206] Exec(6): stderr: ""
I0629 15:47:48.686769   28984 gateway_init.go:261] Initializing Gateway Functionality
I0629 15:47:48.686923   28984 gateway_localnet.go:163] Node local addresses initialized to: map[10.130.0.2:{10.130.0.0 fffffe00} 10.46.56.76:{10.46.56.0 ffffff00} 127.0.0.1:{127.0.0.0 ff000000} 2620:52:0:2e38::706:{2620:52:0:2e38::706 ffffffffffffffffffffffffffffffff} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:3::2:{fd01:0:0:3:: ffffffffffffffff0000000000000000} fe80::5054:ff:fe57:18a8:{fe80:: ffffffffffffffff0000000000000000} fe80::68d6:50ff:fea4:19b3:{fe80:: ffffffffffffffff0000000000000000} fe80::c405:2dff:fe06:c828:{fe80:: ffffffffffffffff0000000000000000}]
I0629 15:47:48.687017   28984 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 5
I0629 15:47:48.687083   28984 helper_linux.go:97] Found default gateway interface br-ex 10.46.56.254
I0629 15:47:48.687120   28984 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 5
F0629 15:47:48.687179   28984 ovnkube.go:133] failed to get default gateway interface
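The log suggests ovnkube found an IPv4 default gateway on br-ex but no IPv6 one, which would explain "failed to get default gateway interface" on a dual-stack node. A minimal sketch of that check (the function and sample route lines are ours; on a real node you would feed it the output of `ip -4 route show default` and `ip -6 route show default`):

```shell
# Sketch: verify a node has default routes for BOTH address families on
# br-ex. has_default_route and the sample routes below are illustrative.
has_default_route() {
  routes="$1"; dev="$2"
  printf '%s\n' "$routes" | grep -q "^default.*dev ${dev}" && echo yes || echo no
}

v4_routes="default via 10.46.56.254 dev br-ex proto dhcp metric 48"
v6_routes=""   # what ovnkube apparently saw: no IPv6 default route

echo "ipv4 default on br-ex: $(has_default_route "$v4_routes" br-ex)"  # yes
echo "ipv6 default on br-ex: $(has_default_route "$v6_routes" br-ex)"  # no
```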

We've also noticed that the ip= kernel parameter is set to ip=dhcp. We'll try a test build with ip=dhcp,dhcp6 to represent the dual-stack setup and see if this fixes the issue.
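That ip= inspection can be sketched as a quick classifier (the helper name and sample cmdline are ours; on a live node you would pass it `$(cat /proc/cmdline)`):

```shell
# Sketch: classify what the ip= kernel parameter requests.
# check_dualstack_ip_param is a name made up for illustration.
check_dualstack_ip_param() {
  cmdline="$1"
  # isolate the ip= token; || true keeps the pipeline alive when absent
  ipparam=$(printf '%s\n' "$cmdline" | tr ' ' '\n' | grep '^ip=' || true)
  case "$ipparam" in
    *dhcp,dhcp6*) echo "dual-stack" ;;
    *dhcp6*)      echo "ipv6-only"  ;;
    *dhcp*)       echo "ipv4-only"  ;;
    *)            echo "none"       ;;
  esac
}

check_dualstack_ip_param "BOOT_IMAGE=/vmlinuz-4.18 ip=dhcp rd.neednet=1"
# prints "ipv4-only" -- the symptom described above
```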

Comment 4 Derek Higgins 2022-07-05 14:47:12 UTC
Note that the 3 dual-stack nightly jobs on 4.11 (e2e-metal-ipi-serial-ovn-dualstack, metal-ipi-ovn-dualstack and e2e-metal-ipi-ovn-dualstack-local-gateway) are all successfully deploying clusters, so this doesn't appear to be a problem in all dual-stack environments.

I've also asked the reporter to test a PR that forces ip=dhcp,dhcp6; I'll report back once we know if it made a difference.

Comment 7 Greg Kopels 2022-07-14 23:05:18 UTC
Finally have a cluster to work with. I tried deploying 4.10.20 dual stack and hit the same issue as with 4.11.

[gkopels@ ~]$ oc get nodes
NAME                                             STATUS     ROLES    AGE   VERSION
hlxcl7-master-0.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb
hlxcl7-master-1.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb
hlxcl7-master-2.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb


I am able to install 4.9 dualstack with no problem.

Can someone have a look with me?

Comment 9 Greg Kopels 2022-08-10 06:35:48 UTC
A draft PR has been created to test the fix: https://github.com/openshift/installer/pull/6063
I am able to create a build with cluster-bot but am unable to deploy it in our CI: our CI does not have access to the repo where the build is stored. With some help I am trying to run our pipeline manually to pull this build, but so far I have been unsuccessful. I have another meeting today to attempt the deployment.

Comment 10 Greg Kopels 2022-08-10 06:54:51 UTC
Hi @pparasur @dhiggins, we are having difficulty deploying the cluster-bot build in our CI (issues with the repo and the infrastructure needed to reach it). Is there any way you can test PR 6063 and then merge it? At that point I can test it in the nightly image. Thanks

Comment 12 Greg Kopels 2022-08-10 13:35:17 UTC
We were able to run the build from cluster-bot with the fix https://github.com/openshift/installer/pull/6063.
The result was the same as before: workers don't come up.

[root@helix08 tmp7]# oc get nodes
NAME                                             STATUS     ROLES    AGE     VERSION
hlxcl7-master-0.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
hlxcl7-master-1.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
hlxcl7-master-2.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
[root@helix08 tmp7]#

Comment 13 Greg Kopels 2022-11-11 13:35:17 UTC
Hi,
Together with Derek Higgins we were able to validate a Dual Stack deployment with no issues using 4.11.0-0.nightly-2022-11-10-202051.

Comment 16 Derek Higgins 2022-11-21 10:49:18 UTC
Closing based on the above comments.

