Bug 2036112

Summary: Installation with OVNKubernetes fails
Product: OpenShift Container Platform Reporter: Itzik Brown <itbrown>
Component: InstallerAssignee: Stephen Finucane <stephenfin>
Installer sub component: OpenShift on OpenStack QA Contact: Jon Uriarte <juriarte>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, emacchi, mifiedle, pprinett, rlobillo
Version: 4.10Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-20 15:25:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Itzik Brown 2021-12-29 18:33:18 UTC
openshift-install 4.10.0-0.nightly-2021-12-23-153012

openshift-install 4.10.0-0.nightly-2021-12-23-153012


What happened?

Installation fails. Worker nodes are provisioned but don't become ready

$ oc  --namespace=openshift-machine-api get deployments
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator   1/1     1            1           52m
cluster-baremetal-operator    1/1     1            1           52m
machine-api-controllers       1/1     1            1           44m
machine-api-operator          1/1     1            1           52m

$ oc get nodes
NAME                    STATUS   ROLES    AGE   VERSION
ostest-ptlx5-master-0   Ready    master   51m   v1.22.1+6859754
ostest-ptlx5-master-1   Ready    master   51m   v1.22.1+6859754
ostest-ptlx5-master-2   Ready    master   51m   v1.22.1+6859754

From machine api controller log:
I1229 18:28:52.359841       1 controller.go:175] ostest-ptlx5-worker-0-cvd7j: reconciling Machine
I1229 18:28:53.855465       1 controller.go:298] ostest-ptlx5-worker-0-cvd7j: reconciling machine triggers idempotent update
E1229 18:28:55.558392       1 controller.go:300] ostest-ptlx5-worker-0-cvd7j: error updating machine: Operation cannot be fulfilled on machines.machine.openshift.io "ostest-ptlx5-worker-0-cvd7j": the object has been modified; please apply your changes to the latest version and try again, retrying in 30s seconds

Comment 2 Emilien Macchi 2022-01-05 17:24:58 UTC
Deployment is working in CI: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ovn/1478671317682098176

But we have events that repeat pathologically:
event happened 181 times, something is wrong: ns/openshift-ovn-kubernetes pod/ovnkube-master-5phk2 node/7v8ww7v1-a4db4-cbqbx-master-2 - reason/BackOff Back-off restarting failed container

This looks different from what you have but it needs more investigation.

Comment 4 Stephen Finucane 2022-01-14 16:06:52 UTC
Can we get the install-config.yaml and OpenStack logs for this deployment. The latter should provide a little more information on what was configured while the former should allow us to reproduce this issue.

Comment 8 Stephen Finucane 2022-01-20 15:25:31 UTC
We investigated this. The worker node had two default routes with an identical metric.

  default via dev br-ex proto dhcp metric 100
  default via dev ens4 proto dhcp metric 100

We deleted the latter of these and things magically worked. This means this is duplicate of bug 2035326, and the fix is [1]. Closing as a duplicate.

[1] https://github.com/openshift/machine-config-operator/pull/2898

*** This bug has been marked as a duplicate of bug 2035326 ***