Bug 1973704 - node not joining as member in etcd, etcd-operator cannot communicate with etcd endpoints
Summary: node not joining as member in etcd, etcd-operator cannot communicate with etcd endpoints
Keywords:
Status: CLOSED DUPLICATE of bug 1980135
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Dan Winship
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-18 14:04 UTC by Yolanda Robla
Modified: 2022-06-10 13:45 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-30 13:58:58 UTC
Target Upstream Version:
Embargoed:


Attachments
etcd operator log (1.64 MB, text/plain)
2021-06-18 14:10 UTC, Yolanda Robla

Description Yolanda Robla 2021-06-18 14:04:00 UTC
Description of problem:

When deploying an IPv6 cluster, I have an issue with a node not joining etcd as a member. etcd-operator complains with:


{"level":"warn","ts":"2021-07-06T12:17:54.918Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5676d5d7-987a-43a2-905b-36472a6aefb5/[2620:52:0:1310::11]:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [2620:52:0:1310::12]:2379: connect: connection refused\""}
{"level":"info","ts":"2021-07-06T12:17:54.918Z","caller":"health/log.go:29","msg":"health check","addresses":["https://[2620:52:0:1310::12]:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2021-07-06T09:10:47.292Z","error":"client timeout exceeded: 1s"}

I debugged by entering the etcd-operator pod and executing the following commands:

sh-4.4# curl https://[2620:52:0:1310::13]:2379
curl: (7) Failed to connect to 2620:52:0:1310::13 port 2379: Connection timed out
sh-4.4# curl https://[2620:52:0:1310::11]:2379
curl: (7) Failed to connect to 2620:52:0:1310::13 port 2379: Connection timed out
sh-4.4# curl https://[2620:52:0:1310::12]:2379
curl: (7) Failed to connect to 2620:52:0:1310::12 port 2379: Connection refused

etcd-operator can only communicate with master-1, where it is deployed (the ::12 one).

However, if I perform the same curl commands from master-1 itself, I can connect properly:

[root@master-1 core]#  curl -k https://[2620:52:0:1310::11]:2379
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
[root@master-1 core]#  curl -k https://[2620:52:0:1310::12]:2379
curl: (7) Failed to connect to 2620:52:0:1310::12 port 2379: Connection refused
[root@master-1 core]#  curl -k https://[2620:52:0:1310::13]:2379
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate


So it seems to be a problem of communication between the pod network and the host network.
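
For reference, a minimal sketch of how this check can be reproduced from the operator pod (the pod name is generated, so it is looked up first; the addresses are the ones from this cluster, and /metrics is the same path used later in comment 3):

```
# Locate the etcd-operator pod (its name is generated per replica set)
oc -n openshift-etcd-operator get pods
# Open a shell inside it (pod network side)
oc -n openshift-etcd-operator rsh deployment/etcd-operator
# From that shell, probe each master's host IP on the etcd client port
curl -kv https://[2620:52:0:1310::11]:2379/metrics
# Repeat the same curl from a master host (host network side) to compare
```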

Comment 1 Yolanda Robla 2021-06-18 14:10:38 UTC
Created attachment 1792063 [details]
etcd operator log

Comment 2 Yolanda Robla 2021-06-18 14:13:23 UTC
This is the etcdctl member list:

[root@test-operator-master-1 /]# etcdctl member list -w table
+------------------+---------+------------------------+------------------------------------+------------------------------------------------------------------+------------+
|        ID        | STATUS  |          NAME          |             PEER ADDRS             |                           CLIENT ADDRS                           | IS LEARNER |
+------------------+---------+------------------------+------------------------------------+------------------------------------------------------------------+------------+
| 3f966dacb5368073 | started | test-operator-master-2 |  https://[2620:52:0:1310::13]:2380 | https://[2620:52:0:1310::13]:2379,unixs://[2620:52:0:1310::13]:0 |      false |
| 4b25da5e15cb3df4 | started |         etcd-bootstrap | https://[2620:52:0:1310::12e]:2380 |                               https://[2620:52:0:1310::12e]:2379 |      false |
| 8c680d48e5a19d58 | started | test-operator-master-1 |  https://[2620:52:0:1310::12]:2380 | https://[2620:52:0:1310::12]:2379,unixs://[2620:52:0:1310::12]:0 |      false |
+------------------+---------+------------------------+------------------------------------+------------------------------------------------------------------+------------+

Comment 3 Sam Batschelet 2021-06-18 15:51:32 UTC
After reviewing the etcd-operator logs, `test-operator-master-1` was failing the operator's health checks. But test-operator-master-1 was healthy as reported by the etcd logs and the quorum-guard health checks.


```
I0618 14:02:03.946852       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"199cf30a-2f6b-49c7-9dfa-cfb9bbd1b081", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: test-operator-master-1
```

```
NAME                                       READY   STATUS             RESTARTS   AGE
etcd-quorum-guard-58cb69847b-fs26p         1/1     Running            0          3h1m
etcd-quorum-guard-58cb69847b-njbnc         0/1     Running            0          3h1m
etcd-quorum-guard-58cb69847b-qpk6r         1/1     Running            0          3h1m
etcd-test-operator-master-0                3/4     CrashLoopBackOff   30         6m27s
etcd-test-operator-master-1                4/4     Running            8          179m
etcd-test-operator-master-2                4/4     Running            4          3h
installer-2-test-operator-master-0         0/1     Completed          0          161m
installer-2-test-operator-master-1         0/1     Completed          0          179m
installer-2-test-operator-master-2         0/1     Completed          0          3h1m
installer-3-test-operator-master-0         0/1     Completed          0          119m
revision-pruner-2-test-operator-master-1   0/1     Completed          0          161m
revision-pruner-2-test-operator-master-2   0/1     Completed          0          179m
```
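
For completeness, member health can also be cross-checked from inside one of the etcd pods, where the etcdctl environment (endpoints and certificates) is already set up. A minimal sketch using the pod names from the listing above; the container to target may differ by release:

```
# Shell into a running etcd pod (add -c etcdctl if the default container differs)
oc -n openshift-etcd rsh etcd-test-operator-master-2
# From that shell, list members and check every endpoint in the cluster
etcdctl member list -w table
etcdctl endpoint health --cluster
```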

This was caused by the inability of the etcd-operator pod to communicate with the master-1 host network.

curl test against master-1 (no route)

```
sh-4.4# curl -vvv https://[2620:52:0:1310::12]:2379/metrics -k
*   Trying 2620:52:0:1310::12...
* TCP_NODELAY set
```

vs master-2 (failure expected)

```
sh-4.4# curl -vvv https://[2620:52:0:1310::13]:2379/metrics -k
*   Trying 2620:52:0:1310::13...
* TCP_NODELAY set
* Connected to 2620:52:0:1310::13 (2620:52:0:1310::13) port 2379 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS alert, bad certificate (554):
* error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
* Closing connection 0
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
```

Moving to OVN team to triage.

Comment 4 Yolanda Robla 2021-06-18 17:38:47 UTC
So a curl from the same host/pod on master-1 was working. If I do the curl from a different host/pod, it just times out. And the pod on master-2 is reachable, so it is a problem specific to the pods on master-1.

Comment 5 Dan Winship 2021-06-21 18:44:03 UTC
(In reply to Yolanda Robla from comment #0)
> Initially, the master nodes get a dns server that is not reachable, as a
> consequence nodes cannot get inverse resolution.

Why is this? That seems... bad?

> Even when dns is fixed, the node seems to be registered incorrectly, and
> keeps erroring, instead of regenerating the member list.

I'm not sure how much etcd is expected to be able to automatically recover from misconfigurations, as opposed to needing manual intervention to fix things in that case.


Is this cluster still up? Can I get access to it? It's hard to debug the network state without any further information.

Comment 6 Yolanda Robla 2021-06-22 14:19:05 UTC
I still have the cluster, but I managed to fix the problems.
What I did to achieve it was to fix the DNS manually, then do an `oc delete node` plus a reboot on each of the nodes. When they came back, they didn't hit the OVN problems and were able to join the cluster.
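
Roughly, that recovery amounts to the following per node (a sketch, assuming DNS has already been corrected on the host; the CSR approval step is an assumption about re-registration, not something stated above):

```
# Remove the stale node object so it re-registers cleanly
oc delete node test-operator-master-0
# Reboot the host so kubelet and OVN come up with a clean network state
ssh core@test-operator-master-0 sudo systemctl reboot
# If the node comes back with pending CSRs, approve them (assumption)
oc get csr -o name | xargs oc adm certificate approve
```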

Comment 7 Yolanda Robla 2021-07-06 12:34:31 UTC
I updated the bug with the latest description. It is not a DNS problem anymore, and the communication issue continues. I am deploying with IPv6, 4.8.0-rc.1, on virtualized masters.

Comment 8 Yolanda Robla 2021-07-06 12:45:26 UTC
I cannot attach must-gather logs because it just fails. But this is the output I get:

ClusterID: e6cb21bf-701e-4ef9-afc2-eb688d549ece
ClusterVersion: Installing "4.8.0-rc.1" for 6 hours: Unable to apply 4.8.0-rc.1: some cluster operators have not yet rolled out
ClusterOperators:
	clusteroperator/authentication is not available (WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://[2620:52:0:1310::12]:6443/.well-known/oauth-authorization-server: dial tcp [2620:52:0:1310::12]:6443: i/o timeout) because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (crashlooping container is waiting in apiserver-78ff5fc9d8-s62xp pod)
WellKnownReadyControllerDegraded: failed to GET kube-apiserver oauth endpoint https://[2620:52:0:1310::12]:6443/.well-known/oauth-authorization-server: dial tcp [2620:52:0:1310::12]:6443: i/o timeout
	clusteroperator/etcd is degraded because BootstrapTeardownDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
ClusterMemberControllerDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdEndpointsDegraded: could not create etcd client: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdMembersControllerDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdMembersDegraded: 2 of 3 members are available, master-0.clus2.t5g.lab.eng.bos.redhat.com is unhealthy
StaticPodsDegraded: pod/etcd-master-1.clus2.t5g.lab.eng.bos.redhat.com container "etcd" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd pod=etcd-master-1.clus2.t5g.lab.eng.bos.redhat.com_openshift-etcd(6174abdc-9461-47c2-bfb8-608de49e12fe)
	clusteroperator/machine-config is not available (Cluster not available for 4.8.0-rc.1) because Failed to resync 4.8.0-rc.1 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
	clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (crashlooping container is waiting in apiserver-558ccc44f7-sfrbl pod)


[must-gather      ] OUT namespace/openshift-must-gather-kdqkc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-4q44q created
[must-gather      ] OUT pod for plug-in image mg.tar created
[must-gather-97x9x] OUT gather did not start: unable to pull image: ImagePullBackOff: Back-off pulling image "mg.tar"
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-4q44q deleted
[must-gather      ] OUT namespace/openshift-must-gather-kdqkc deleted


When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: e6cb21bf-701e-4ef9-afc2-eb688d549ece
ClusterVersion: Installing "4.8.0-rc.1" for 7 hours: Unable to apply 4.8.0-rc.1: some cluster operators have not yet rolled out
ClusterOperators:
	clusteroperator/authentication is not available (WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://[2620:52:0:1310::12]:6443/.well-known/oauth-authorization-server: dial tcp [2620:52:0:1310::12]:6443: i/o timeout) because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (crashlooping container is waiting in apiserver-78ff5fc9d8-s62xp pod)
WellKnownReadyControllerDegraded: failed to GET kube-apiserver oauth endpoint https://[2620:52:0:1310::12]:6443/.well-known/oauth-authorization-server: dial tcp [2620:52:0:1310::12]:6443: i/o timeout
	clusteroperator/etcd is degraded because BootstrapTeardownDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
ClusterMemberControllerDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdEndpointsDegraded: could not create etcd client: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdMembersControllerDegraded: failed to get etcd client: failed to make etcd client for endpoints [https://[2620:52:0:1310::13]:2379 https://[2620:52:0:1310::11]:2379 https://[2620:52:0:1310::12]:2379 https://[2620:52:0:1310::1de]:2379]: context deadline exceeded
EtcdMembersDegraded: 2 of 3 members are available, master-0.clus2.t5g.lab.eng.bos.redhat.com is unhealthy
StaticPodsDegraded: pod/etcd-master-1.clus2.t5g.lab.eng.bos.redhat.com container "etcd" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd pod=etcd-master-1.clus2.t5g.lab.eng.bos.redhat.com_openshift-etcd(6174abdc-9461-47c2-bfb8-608de49e12fe)
	clusteroperator/machine-config is not available (Cluster not available for 4.8.0-rc.1) because Failed to resync 4.8.0-rc.1 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
	clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (crashlooping container is waiting in apiserver-558ccc44f7-sfrbl pod)


Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-authentication...
Gathering data for ns/openshift-authentication-operator...
Gathering data for ns/openshift-ingress...
Gathering data for ns/openshift-oauth-apiserver...
Gathering data for ns/openshift-machine-api...
Gathering data for ns/openshift-cloud-credential-operator...
Gathering data for ns/openshift-config-operator...
Gathering data for ns/openshift-console-operator...
Gathering data for ns/openshift-console...
Gathering data for ns/openshift-cluster-storage-operator...
Gathering data for ns/openshift-dns-operator...
Gathering data for ns/openshift-dns...
Gathering data for ns/openshift-etcd-operator...
Gathering data for ns/openshift-etcd...
Gathering data for ns/openshift-image-registry...
Gathering data for ns/openshift-ingress-operator...
Gathering data for ns/openshift-ingress-canary...
Gathering data for ns/openshift-insights...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Gathering data for ns/openshift-kube-scheduler...
Gathering data for ns/openshift-kube-scheduler-operator...
Gathering data for ns/openshift-kube-storage-version-migrator...
Gathering data for ns/openshift-kube-storage-version-migrator-operator...
Gathering data for ns/openshift-cluster-machine-approver...
Gathering data for ns/openshift-machine-config-operator...
Gathering data for ns/openshift-kni-infra...
Gathering data for ns/openshift-openstack-infra...
Gathering data for ns/openshift-ovirt-infra...
Gathering data for ns/openshift-vsphere-infra...
Gathering data for ns/openshift-marketplace...
Gathering data for ns/openshift-monitoring...
Gathering data for ns/openshift-user-workload-monitoring...
Gathering data for ns/openshift-multus...
Gathering data for ns/openshift-ovn-kubernetes...
Gathering data for ns/openshift-host-network...
Gathering data for ns/openshift-network-diagnostics...
Gathering data for ns/openshift-network-operator...
Gathering data for ns/openshift-cluster-node-tuning-operator...
Gathering data for ns/openshift-apiserver-operator...
Gathering data for ns/openshift-apiserver...
Gathering data for ns/openshift-controller-manager-operator...
Gathering data for ns/openshift-controller-manager...
Gathering data for ns/openshift-cluster-samples-operator...
Gathering data for ns/openshift...
Gathering data for ns/openshift-operator-lifecycle-manager...
Gathering data for ns/openshift-service-ca-operator...
Gathering data for ns/openshift-service-ca...
Gathering data for ns/openshift-cluster-csi-drivers...
Wrote inspect data to must-gather.local.5651019809961815697.
error running backup collection: errors ocurred while gathering data:
    [skipping gathering namespaces/openshift-machine-api due to error: one or more errors ocurred while gathering pod-specific data for namespace: openshift-machine-api

    one or more errors ocurred while gathering container data for pod metal3-image-cache-w9tnw:

    [previous terminated container "metal3-httpd" in pod "metal3-image-cache-w9tnw" not found, container "metal3-httpd" in pod "metal3-image-cache-w9tnw" is waiting to start: PodInitializing, previous terminated container "metal3-ipa-downloader" in pod "metal3-image-cache-w9tnw" not found, container "metal3-ipa-downloader" in pod "metal3-image-cache-w9tnw" is waiting to start: PodInitializing], skipping gathering clusterroles.rbac.authorization.k8s.io/system:registry due to error: clusterroles.rbac.authorization.k8s.io "system:registry" not found, skipping gathering clusterrolebindings.rbac.authorization.k8s.io/registry-registry-role due to error: clusterrolebindings.rbac.authorization.k8s.io "registry-registry-role" not found, skipping gathering secrets/support due to error: secrets "support" not found, skipping gathering podnetworkconnectivitychecks.controlplane.operator.openshift.io due to error: the server doesn't have a resource type "podnetworkconnectivitychecks", skipping gathering endpoints/host-etcd-2 due to error: endpoints "host-etcd-2" not found]e

Comment 9 Yolanda Robla 2021-07-07 15:12:50 UTC
More issues found...
Once the primary interface joins the OVN bridge, it loses access to the router:

[root@master-0 core]# ping -6 2620:52:0:1310::fe
PING 2620:52:0:1310::fe(2620:52:0:1310::fe) 56 data bytes

... 
timeout


If we do an `nmcli con down br-ex`, the interface recovers access to the router again:
PING 2620:52:0:1310::fe(2620:52:0:1310::fe) 56 data bytes
64 bytes from 2620:52:0:1310::fe: icmp_seq=1 ttl=64 time=1.10 ms

The ipv6 routes are something like:

::1 dev lo proto kernel metric 256 pref medium
2620:52:0:1310::11 dev br-ex proto kernel metric 100 pref medium
2620:52:0:1310::/64 dev br-ex proto ra metric 100 pref medium
fd01:0:0:2::/64 dev ovn-k8s-mp0 proto kernel metric 256 pref medium
fd01::/60 via fd01:0:0:2::1 dev ovn-k8s-mp0 metric 1024 pref medium
fd02::/112 via fe80::200:5eff:fe00:201 dev br-ex metric 1024 mtu 1400 pref medium
fe80::/64 dev br-ex proto kernel metric 100 pref medium
fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
fe80::/64 dev bd2b5377a738a7b proto kernel metric 256 pref medium
fe80::/64 dev 502838c675d9637 proto kernel metric 256 pref medium
fe80::/64 dev bc33650510ca18d proto kernel metric 256 pref medium
fe80::/64 dev 2e38e838223b585 proto kernel metric 256 pref medium
fe80::/64 dev 5f2d10bb59ab5da proto kernel metric 256 pref medium
fe80::/64 dev ce7bf82e33841d0 proto kernel metric 256 pref medium
fe80::/64 dev ea793bb6e3654ff proto kernel metric 256 pref medium
fe80::/64 dev 56f24b5464afba5 proto kernel metric 256 pref medium
fe80::/64 dev b5ff40060ae392a proto kernel metric 256 pref medium
fe80::/64 dev 1ac04219bbbf021 proto kernel metric 256 pref medium
fe80::/64 dev df49e0706a41f59 proto kernel metric 256 pref medium
fe80::/64 dev 1173023fb1b0474 proto kernel metric 256 pref medium
fe80::/64 dev f02f11a25ce984a proto kernel metric 256 pref medium
fe80::/64 dev 133c439ad6bac3f proto kernel metric 256 pref medium
fe80::/64 dev b8cb90330f96a5c proto kernel metric 256 pref medium
fe80::/64 dev 210e1a86995f110 proto kernel metric 256 pref medium
fe80::/64 dev d3efdb43eafc82e proto kernel metric 256 pref medium
fe80::/64 dev a0460d9300ca35b proto kernel metric 256 pref medium
fe80::/64 dev 846d8e6af4bd2ae proto kernel metric 256 pref medium
fe80::/64 dev 2a851660c3c86de proto kernel metric 256 pref medium
fe80::/64 dev 76655a4eb2407c4 proto kernel metric 256 pref medium
fe80::/64 dev d96febf56aa01ad proto kernel metric 256 pref medium
fe80::/64 dev 178859532efd07e proto kernel metric 256 pref medium
fe80::/64 dev 303263169c31e31 proto kernel metric 256 pref medium
fe80::/64 dev 2b894cc845da96d proto kernel metric 256 pref medium
fe80::/64 dev b51475bea53a33a proto kernel metric 256 pref medium
fe80::/64 dev 241588223e4d2a4 proto kernel metric 256 pref medium
fe80::/64 dev c20cabee7698475 proto kernel metric 256 pref medium
fe80::/64 dev 7e9b931d5637a28 proto kernel metric 256 pref medium
fe80::/64 dev b3536b19169ab9e proto kernel metric 256 pref medium
fe80::/64 dev d0a7dcb09da783f proto kernel metric 256 pref medium
fe80::/64 dev fcf6ee938514a24 proto kernel metric 256 pref medium
fe80::/64 dev 337e22260e92415 proto kernel metric 256 pref medium
fe80::/64 dev 63fa92468180a9d proto kernel metric 256 pref medium
fe80::/64 dev bb5d705f245faad proto kernel metric 256 pref medium
fe80::/64 dev f807d2abd3dde9b proto kernel metric 256 pref medium
fe80::/64 dev 22927c1211c49b1 proto kernel metric 256 pref medium
fe80::/64 dev 08ddd05afca7a85 proto kernel metric 256 pref medium
fe80::/64 dev 7488acde9cf98a8 proto kernel metric 256 pref medium
fe80::/64 dev 1fb96d7f137d638 proto kernel metric 256 pref medium
fe80::/64 dev 53837813fd591e7 proto kernel metric 256 pref medium
fe80::/64 dev 112a68b863f8ffd proto kernel metric 256 pref medium
fe80::/64 dev 050c65130d3c3d2 proto kernel metric 256 pref medium
fe80::/64 dev 0f1ba19ca64be5d proto kernel metric 256 pref medium
fe80::/64 dev ad3b755a6113ccb proto kernel metric 256 pref medium
default proto ra metric 100 pref medium
	nexthop via fe80::200:5eff:fe00:201 dev br-ex weight 1 
	nexthop via fe80::9e8a:cb00:6704:ab00 dev br-ex weight 1 
	nexthop via fe80::9e8a:cb00:6704:9200 dev br-ex weight 1

Comment 10 Yolanda Robla 2021-07-09 13:36:30 UTC
With the help of Marius Cornea and Jaime Caamaño I've been able to see the problem.
The issue is that our DHCP addresses come with a /128 mask from the IT router, while our machineNetwork is /64. This is the address I get on a sample VM:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2620:52:0:1310::14/128 scope global dynamic noprefixroute 

My install-config.yaml has this machineNetwork:

networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 2620:52:0:1310::/64
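
A quick way to see the mismatch on a node versus the install config (a sketch; the interface name is from the sample VM above, and on deployed masters the address ends up on br-ex instead):

```
# Prefix length actually handed out by DHCPv6 (here /128, see above)
ip -6 addr show scope global
# machineNetwork in install-config.yaml says /64
grep -A1 machineNetwork install-config.yaml
```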

When OVN is configured, it seems to create the gateway router port networks based on the /128 mask, not /64:

[root@test-operator-installer ~]# oc -n openshift-ovn-kubernetes exec -it ovnkube-master-7j6wr -c ovnkube-master -- ovn-nbctl find  Logical_Router_Port | grep -A1 rtoe-GR
name                : rtoe-GR_master-2.clus2.t5g.lab.eng.bos.redhat.com
networks            : ["2620:52:0:1310::13/128"]
--
  name                : rtoe-GR_master-1.clus2.t5g.lab.eng.bos.redhat.com
networks            : ["2620:52:0:1310::12/128"]
--
  name                : rtoe-GR_master-0.clus2.t5g.lab.eng.bos.redhat.com
networks            : ["2620:52:0:1310::11/128"]

When I manually modified the port networks with the commands given by Jaime, everything started working:

oc -n  openshift-ovn-kubernetes exec -ti ovnkube-master-7j6wr -c ovnkube-master -- ovn-nbctl set Logical_Router_Port  rtoe-GR_master-0.clus2.t5g.lab.eng.bos.redhat.com networks='["2620:52:0:1310::11/64"]'
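
For reference, the same workaround can be applied to the other two gateway router ports (a sketch based on the command above; this only edits the running OVN northbound database, so it is a temporary workaround, not the fix):

```
# Map each master's gateway router port to its host address, using the
# machineNetwork /64 prefix instead of the /128 that was programmed
declare -A ports=(
  ["rtoe-GR_master-0.clus2.t5g.lab.eng.bos.redhat.com"]="2620:52:0:1310::11/64"
  ["rtoe-GR_master-1.clus2.t5g.lab.eng.bos.redhat.com"]="2620:52:0:1310::12/64"
  ["rtoe-GR_master-2.clus2.t5g.lab.eng.bos.redhat.com"]="2620:52:0:1310::13/64"
)
for port in "${!ports[@]}"; do
  oc -n openshift-ovn-kubernetes exec ovnkube-master-7j6wr -c ovnkube-master -- \
    ovn-nbctl set Logical_Router_Port "$port" networks="[\"${ports[$port]}\"]"
done
```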

Comment 12 Dan Winship 2021-07-30 13:58:58 UTC
Marking this as a duplicate of bug 1980135. Jaime has already submitted a patch to upstream ovn-kubernetes to deal with this problem.

*** This bug has been marked as a duplicate of bug 1980135 ***

