Description of problem:
Unable to deploy a 4.11 dualstack hybrid cluster with two bare metal workers.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-25-193227

How reproducible:
Easily reproducible in our CI pipeline. Deploy OCP with dual stack.

Steps to Reproduce:
1. Deploy OCP 4.11 configured as a dual stack cluster.

Actual results:
Masters are in status NotReady and workers don't come up.

Expected results:
The dual stack cluster deploys successfully.

Additional info:
In the same environment I am able to deploy 4.10 dual stack clusters.
Deployment logs can be found here:
https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/view/CNF-core/job/CNF/job/test-ocp-general-cnf-core-playground/32/

Kubernetes API at https://api.hlxcl7.lab.eng.tlv2.redhat.com:6443.

22:33:55 W0628 22:33:54.934837 4150634 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ConfigMap: Get "https://api.hlxcl7.lab.eng.tlv2.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=9541&timeoutSeconds=301&watch=true": dial tcp [2620:52:0:2e38::700]:6443: connect: no route to host
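The error above shows the API endpoint being dialed over IPv6 and failing with "no route to host". One way to tell whether only one address family is affected is to probe each family separately from the host running the installer. The following is a minimal sketch, not part of the installer or the CI pipeline; the function name probe_api and its return shape are made up for illustration.

```python
import socket


def probe_api(host, port=6443, timeout=5.0):
    """Try a TCP connect to host:port over IPv4 and IPv6 separately.

    Returns a dict mapping "IPv4"/"IPv6" to a list of (address, result)
    tuples, where result is "ok" or the error text. (Hypothetical helper.)
    """
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror as exc:
            # No address of this family could be resolved for the host.
            results[label] = [(None, f"resolution failed: {exc}")]
            continue
        attempts = []
        for _fam, socktype, proto, _canon, sockaddr in infos:
            try:
                with socket.socket(family, socktype, proto) as sock:
                    sock.settimeout(timeout)
                    sock.connect(sockaddr)
                attempts.append((sockaddr[0], "ok"))
            except OSError as exc:
                # Covers "no route to host", refusals and timeouts.
                attempts.append((sockaddr[0], str(exc)))
        results[label] = attempts
    return results
```

Calling probe_api("api.hlxcl7.lab.eng.tlv2.redhat.com") from the provisioning host would show whether the IPv4 and IPv6 API addresses behave differently. A successful TCP connect only proves reachability, not that the API server is healthy.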
*** Bug 2102157 has been marked as a duplicate of this bug. ***
On one of the failing masters the ovnkube-node container is failing:

[root@hlxcl7-master-1 core]# crictl logs 3a6f95d2a0507 |& tail
I0629 15:47:48.616774 28984 ovs.go:206] Exec(5): stderr: ""
I0629 15:47:48.616796 28984 ovs.go:202] Exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=c6\:05\:2d\:06\:c8\:28
I0629 15:47:48.621688 28984 ovs.go:205] Exec(6): stdout: ""
I0629 15:47:48.621718 28984 ovs.go:206] Exec(6): stderr: ""
I0629 15:47:48.686769 28984 gateway_init.go:261] Initializing Gateway Functionality
I0629 15:47:48.686923 28984 gateway_localnet.go:163] Node local addresses initialized to: map[10.130.0.2:{10.130.0.0 fffffe00} 10.46.56.76:{10.46.56.0 ffffff00} 127.0.0.1:{127.0.0.0 ff000000} 2620:52:0:2e38::706:{2620:52:0:2e38::706 ffffffffffffffffffffffffffffffff} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:3::2:{fd01:0:0:3:: ffffffffffffffff0000000000000000} fe80::5054:ff:fe57:18a8:{fe80:: ffffffffffffffff0000000000000000} fe80::68d6:50ff:fea4:19b3:{fe80:: ffffffffffffffff0000000000000000} fe80::c405:2dff:fe06:c828:{fe80:: ffffffffffffffff0000000000000000}]
I0629 15:47:48.687017 28984 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 5
I0629 15:47:48.687083 28984 helper_linux.go:97] Found default gateway interface br-ex 10.46.56.254
I0629 15:47:48.687120 28984 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 5
F0629 15:47:48.687179 28984 ovnkube.go:133] failed to get default gateway interface

We've also noticed that the ip= param in the kernel params is ip=dhcp. We'll try a test build with ip=dhcp,dhcp6 to represent a dual stack setup and see if this fixes the issue.
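The ip= observation can be checked on a node by inspecting its kernel command line. The sketch below assumes only the simple ip=<mode>[,<mode>] form mentioned in this comment (ip=dhcp versus ip=dhcp,dhcp6). It does not handle interface-qualified or static ip= forms, and the helper names ip_modes and covers_dual_stack are made up for illustration.

```python
def ip_modes(cmdline):
    """Collect the comma-separated modes from every ip= argument.

    Assumption: only the simple "ip=dhcp" / "ip=dhcp,dhcp6" form is used,
    as in this bug; other ip= syntaxes are not parsed correctly.
    """
    modes = set()
    for arg in cmdline.split():
        if arg.startswith("ip="):
            modes.update(m for m in arg[len("ip="):].split(",") if m)
    return modes


def covers_dual_stack(cmdline):
    """True if the command line requests both DHCPv4 and DHCPv6."""
    return {"dhcp", "dhcp6"} <= ip_modes(cmdline)
```

On a node, the input would be the contents of /proc/cmdline, for example covers_dual_stack(open("/proc/cmdline").read()). A False result only confirms the observation above; it does not by itself show that ip=dhcp is the cause of the ovnkube-node failure.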
Note that the 3 dualstack nightlies on 4.11 (e2e-metal-ipi-serial-ovn-dualstack, metal-ipi-ovn-dualstack and e2e-metal-ipi-ovn-dualstack-local-gateway) are all successfully deploying clusters, so this doesn't appear to be a problem in all dualstack environments.

I've also asked the reporter to test a PR that forces ip=dhcp,dhcp6. I'll report back once we know if it made a difference.
I finally have a cluster to work with. I tried deploying 4.10.20 dualstack and have the same issue as with 4.11.

[gkopels@ ~]$ oc get nodes
NAME                                             STATUS     ROLES    AGE   VERSION
hlxcl7-master-0.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb
hlxcl7-master-1.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb
hlxcl7-master-2.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   71m   v1.23.5+3afdacb

I am able to install 4.9 dualstack with no problem. Who can have a look with me?
A draft PR has been created to test the fix: https://github.com/openshift/installer/pull/6063

I am able to create a build with cluster-bot but am unable to deploy it in our CI, because our CI does not have access to the repo where this build is stored. With some help I am trying to run our pipeline manually to pull this build, but so far I have been unsuccessful. I have another meeting today to attempt the deployment.
Hi @pparasur @dhiggins, we are having difficulties deploying the cluster-bot build in our CI because of issues with the repo and the infrastructure needed to reach it. Is there any way you can test PR 6063 and then merge it? I could then test it in the nightly image. Thanks
We were able to run the build from cluster-bot with the fix https://github.com/openshift/installer/pull/6063. The result was the same as before. Workers don't come up.

[root@helix08 tmp7]# oc get nodes
NAME                                             STATUS     ROLES    AGE     VERSION
hlxcl7-master-0.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
hlxcl7-master-1.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
hlxcl7-master-2.hlxcl7.lab.eng.tlv2.redhat.com   NotReady   master   3h11m   v1.24.0+9546431
[root@helix08 tmp7]#
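For repeated retests like the one above, the NotReady check can be scripted against the text output of oc get nodes. This is a minimal sketch that assumes the default column layout shown in this bug (a header row containing NAME and STATUS); the function name not_ready_nodes is made up for illustration.

```python
def not_ready_nodes(oc_output):
    """Return the names of nodes whose STATUS does not include "Ready".

    Assumes the default whitespace-separated `oc get nodes` table with a
    header row containing NAME and STATUS columns.
    """
    lines = [line for line in oc_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    failing = []
    for line in lines[1:]:
        cols = line.split()
        # STATUS can be composite, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in cols[status_col].split(","):
            failing.append(cols[name_col])
    return failing
```

Fed the output above, it would list all three masters. It also makes the absence of worker rows easy to spot by comparing the number of parsed nodes with the expected node count.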
Hi,

Together with Derek Higgins, we were able to validate a dual stack deployment with no issues using 4.11.0-0.nightly-2022-11-10-202051.
Closing based on the above comments.