Bug 2036112 - Installation with OVNKubernetes fails
Summary: Installation with OVNKubernetes fails
Status: CLOSED DUPLICATE of bug 2035326
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.10.0
Assignee: Stephen Finucane
QA Contact: Jon Uriarte
Depends On:
TreeView+ depends on / blocked
Reported: 2021-12-29 18:33 UTC by Itzik Brown
Modified: 2022-01-20 15:25 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-01-20 15:25:31 UTC
Target Upstream Version:

Attachments (Terms of Use)

Description Itzik Brown 2021-12-29 18:33:18 UTC
openshift-install 4.10.0-0.nightly-2021-12-23-153012

openshift-install 4.10.0-0.nightly-2021-12-23-153012


What happened?

Installation fails. Worker nodes are provisioned but don't become ready

$ oc  --namespace=openshift-machine-api get deployments
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator   1/1     1            1           52m
cluster-baremetal-operator    1/1     1            1           52m
machine-api-controllers       1/1     1            1           44m
machine-api-operator          1/1     1            1           52m

$ oc get nodes
NAME                    STATUS   ROLES    AGE   VERSION
ostest-ptlx5-master-0   Ready    master   51m   v1.22.1+6859754
ostest-ptlx5-master-1   Ready    master   51m   v1.22.1+6859754
ostest-ptlx5-master-2   Ready    master   51m   v1.22.1+6859754

From machine api controller log:
I1229 18:28:52.359841       1 controller.go:175] ostest-ptlx5-worker-0-cvd7j: reconciling Machine
I1229 18:28:53.855465       1 controller.go:298] ostest-ptlx5-worker-0-cvd7j: reconciling machine triggers idempotent update
E1229 18:28:55.558392       1 controller.go:300] ostest-ptlx5-worker-0-cvd7j: error updating machine: Operation cannot be fulfilled on machines.machine.openshift.io "ostest-ptlx5-worker-0-cvd7j": the object has been modified; please apply your changes to the latest version and try again, retrying in 30s seconds

Comment 2 Emilien Macchi 2022-01-05 17:24:58 UTC
Deployment is working in CI: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ovn/1478671317682098176

But we have events that repeat pathologically:
event happened 181 times, something is wrong: ns/openshift-ovn-kubernetes pod/ovnkube-master-5phk2 node/7v8ww7v1-a4db4-cbqbx-master-2 - reason/BackOff Back-off restarting failed container

This looks different from what you have but it needs more investigation.

Comment 4 Stephen Finucane 2022-01-14 16:06:52 UTC
Can we get the install-config.yaml and OpenStack logs for this deployment. The latter should provide a little more information on what was configured while the former should allow us to reproduce this issue.

Comment 8 Stephen Finucane 2022-01-20 15:25:31 UTC
We investigated this. The worker node had two default routes with an identical metric.

  default via dev br-ex proto dhcp metric 100
  default via dev ens4 proto dhcp metric 100

We deleted the latter of these and things magically worked. This means this is duplicate of bug 2035326, and the fix is [1]. Closing as a duplicate.

[1] https://github.com/openshift/machine-config-operator/pull/2898

*** This bug has been marked as a duplicate of bug 2035326 ***

Note You need to log in before you can comment on or make changes to this bug.