Bug 1973215 - [OVN] EgressIP no longer works after a cluster upgrade
Summary: [OVN] EgressIP no longer works after a cluster upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Alexander Constantinescu
QA Contact: huirwang
URL:
Whiteboard:
Duplicates: 1987258 (view as bug list)
Depends On:
Blocks: 1997049 2001542
 
Reported: 2021-06-17 12:55 UTC by philipp.dallig
Modified: 2021-10-18 17:35 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1997049 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:35:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
kubeovn-master, who took over (200.20 KB, text/plain)
2021-06-21 14:31 UTC, philipp.dallig
kubeovn-master before deletion (317.28 KB, text/plain)
2021-06-21 14:32 UTC, philipp.dallig
must-gather before deletion (16.82 MB, application/gzip)
2021-06-21 14:34 UTC, philipp.dallig
must-gather after deletion (16.82 MB, application/gzip)
2021-06-21 14:35 UTC, philipp.dallig


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 679 0 None None None 2021-08-23 15:51:27 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:35:44 UTC

Description philipp.dallig 2021-06-17 12:55:21 UTC
Description of problem:
I am using ovn-kubernetes in OKD 4.7. I created an EgressIP and noticed that my pods lose connectivity to the external system during upgrades.
After some investigation, I found I can reproduce this problem by deleting the active ovnkube-master pod.

Version-Release number of selected component (if applicable):

oc adm release info quay.io/openshift/okd:4.7.0-0.okd-2021-06-04-191031 --commit-urls
-> ovn-kubernetes - https://github.com/openshift/ovn-kubernetes/commit/e71b4ad737e589c10eb9ab4b396e6c1bae052a35


How reproducible:
100%

Steps to Reproduce:
1. Create an EgressIP
2. Create a Pod that is affected by the EgressIP
3. Delete the active ovnkube-master pod
4. Try to ping the local gateway from the created pod.
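The reproduction steps above can be sketched roughly as follows. The EgressIP name and address mirror the `vsphere-cloud-provider` object shown later in comment 7; the namespace name, the `env: egress` label, the pod/deployment names, and the gateway address are all hypothetical placeholders, and the active ovnkube-master pod has to be identified by hand:

```yaml
# Step 1: create an EgressIP (apiVersion/kind are the OVN-Kubernetes CRD;
# the namespaceSelector label "env: egress" is a hypothetical placeholder).
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: vsphere-cloud-provider
spec:
  egressIPs:
  - 10.20.16.129
  namespaceSelector:
    matchLabels:
      env: egress
```

Steps 2-4 would then amount to labelling a namespace to match the selector (`oc label namespace <ns> env=egress`), deleting the active ovnkube-master pod (`oc -n openshift-ovn-kubernetes delete pod <active-ovnkube-master>`), and pinging the local gateway from a pod in that namespace.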

Actual results:
Gateway is not reachable

Expected results:
Gateway is reachable

Additional info:
After deleting the active ovnkube-master pod, create a second pod and the gateway is accessible from this pod again.

Comment 1 Alexander Constantinescu 2021-06-18 13:40:04 UTC
Hi 

Could you please provide a must-gather or logs from ovnkube-master? There is very little to go on here, so I am closing this as INSUFFICIENT_DATA until that data is provided.

/Alexander

Comment 2 philipp.dallig 2021-06-21 14:31:52 UTC
Created attachment 1792670 [details]
kubeovn-master, who took over

Comment 3 philipp.dallig 2021-06-21 14:32:43 UTC
Created attachment 1792671 [details]
kubeovn-master before deletion

Comment 4 philipp.dallig 2021-06-21 14:34:21 UTC
Created attachment 1792672 [details]
must-gather before deletion

Comment 5 philipp.dallig 2021-06-21 14:35:37 UTC
Created attachment 1792674 [details]
must-gather after deletion

Comment 6 philipp.dallig 2021-06-21 14:36:43 UTC
Hello,
thank you very much for looking at this ticket. I thought this problem was easy enough to reproduce locally, which is why I did not attach the must-gather data at first.

I ran must-gather before deleting the active ovnkube-master. Please note that I had to restart my pods and recreate the EgressIP several times to get it working. I think the fault lies somewhere in ovnkube-master, so I have also attached the log of the active ovnkube-master pod from before it was deleted.


After all the pods were able to reach the external service via the EgressIP, I deleted the active ovnkube-master. I am also attaching the log of the ovnkube-master pod that took over after the deletion.


Unfortunately I am not a networking expert, but if it helps, I can test a patched version of the network plugin in my test cluster.

With kind regards
Philipp

Comment 7 philipp.dallig 2021-06-23 11:55:45 UTC
Hi Alexander

It seems that even nodes that are not assigned the EgressIP respond to ARP requests for it.

{code}
13:47 $ oc get egressips.k8s.ovn.org 
NAME                     EGRESSIPS      ASSIGNED NODE                         ASSIGNED EGRESSIPS
vsphere-cloud-provider   10.20.16.129   worker1-cl1-dc99.s-ocp.cloud.avm.de   10.20.16.129
{code}

worker1-cl1-dc99.s-ocp.cloud.avm.de IP: 10.20.16.41  MAC: 00:50:56:00:00:41
worker2-cl1-dc99.s-ocp.cloud.avm.de IP: 10.20.16.42  MAC: 00:50:56:00:00:42
worker3-cl1-dc99.s-ocp.cloud.avm.de IP: 10.20.16.43  MAC: 00:50:56:00:00:43


{code}
[root@master3 ~]# arping 10.20.16.129
ARPING 10.20.16.129 from 10.20.16.13 br-ex
Unicast reply from 10.20.16.129 [00:50:56:00:00:41]  1.600ms
Unicast reply from 10.20.16.129 [00:50:56:00:00:43]  1.626ms
Unicast reply from 10.20.16.129 [00:50:56:00:00:42]  1.913ms
{code}

The tracepath output also looks strange.
{code}
[root@master3 ~]# tracepath 10.20.16.129
 1?: [LOCALHOST]                      pmtu 1500
 1:  worker1-cl1-dc99.s-ocp.cloud.avm.de                   0.238ms 
 1:  worker1-cl1-dc99.s-ocp.cloud.avm.de                   0.157ms 
 2:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.216ms asymm  1 
 3:  no reply
 4:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.368ms asymm  1 
 5:  no reply
 6:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.494ms asymm  1 
 7:  no reply
 8:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.639ms asymm  1 
 9:  no reply
10:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.533ms asymm  1 
11:  no reply
12:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.571ms asymm  1 
13:  no reply
14:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.630ms asymm  1 
15:  no reply
16:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.792ms asymm  1 
17:  no reply
18:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.771ms asymm  1 
19:  no reply
20:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.789ms asymm  1 
21:  no reply
22:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   1.135ms asymm  1 
23:  no reply
24:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   0.889ms asymm  1 
25:  no reply
26:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   1.867ms asymm  1 
27:  no reply
28:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   1.107ms asymm  1 
29:  no reply
30:  worker2-cl1-dc99.s-ocp.cloud.avm.de                   1.069ms asymm  1 
     Too many hops: pmtu 1500
     Resume: pmtu 1500
{code}

I hope this helps to find the right solution for this problem.

With kind regards
Philipp

Comment 8 philipp.dallig 2021-07-08 07:16:44 UTC
Finally, I found out why the routing does not work: every time the ovnkube-master container restarts, the internal node IP changes, and the old routes on the ovn_cluster_router are not deleted.

I see two possible solutions:
1) Delete the old routes before adding new ones.
2) Do not change the internal node IP.

I would prefer the second solution, as it should not affect any existing connections in the cluster.

These are the routing policies on my ovn_cluster_router after several restarts of the ovnkube-master container.


ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.20.16.11:9641,ssl:10.20.16.12:9641,ssl:10.20.16.13:9641 lr-policy-list 177fc56f-efc6-44e7-b8df-c29d58cf89f2
Routing Policies
      1005 ip4.src == 10.128.0.2 && ip4.dst == 10.20.16.12 /* master2.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.128.4.2 && ip4.dst == 10.20.16.42 /* worker2-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.129.0.2 && ip4.dst == 10.20.16.13 /* master3.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.129.4.2 && ip4.dst == 10.20.16.43 /* worker3-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.130.0.2 && ip4.dst == 10.20.16.11 /* master1.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.130.4.2 && ip4.dst == 10.20.16.44 /* worker4-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1005 ip4.src == 10.131.2.2 && ip4.dst == 10.20.16.41 /* worker1-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-master1.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.11 /* master1.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-master2.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.12 /* master2.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-master3.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.13 /* master3.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-worker1-cl1-dc99.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.41 /* worker1-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-worker2-cl1-dc99.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.42 /* worker2-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-worker3-cl1-dc99.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.43 /* worker3-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1004 inport == "rtos-worker4-cl1-dc99.s-ocp.cloud.avm.de" && ip4.dst == 10.20.16.44 /* worker4-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.128.0.2  && ip4.dst != 10.128.0.0/14 /* inter-master2.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.128.4.2  && ip4.dst != 10.128.0.0/14 /* inter-worker2-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.129.0.2  && ip4.dst != 10.128.0.0/14 /* inter-master3.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.129.4.2  && ip4.dst != 10.128.0.0/14 /* inter-worker3-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.130.0.2  && ip4.dst != 10.128.0.0/14 /* inter-master1.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.130.4.2  && ip4.dst != 10.128.0.0/14 /* inter-worker4-cl1-dc99.s-ocp.cloud.avm.de */         reroute
      1003 ip4.src == 10.131.2.2  && ip4.dst != 10.128.0.0/14 /* inter-worker1-cl1-dc99.s-ocp.cloud.avm.de */         reroute
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.128.0.0/14           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.11/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.12/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.13/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.41/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.42/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.43/32           allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.20.16.44/32           allow
       100                              ip4.src == 10.128.0.3         reroute                100.64.0.8
       100                              ip4.src == 10.128.0.3         reroute                100.64.0.2
       100                              ip4.src == 10.128.0.3         reroute                100.64.0.7
       100                              ip4.src == 10.128.0.6         reroute                100.64.0.2
       100                              ip4.src == 10.128.0.6         reroute                100.64.0.7
       100                              ip4.src == 10.128.0.6         reroute                100.64.0.8
       100                             ip4.src == 10.128.4.15         reroute                100.64.0.7
       100                             ip4.src == 10.128.4.15         reroute                100.64.0.8
       100                             ip4.src == 10.128.4.15         reroute                100.64.0.2
       100                              ip4.src == 10.129.0.6         reroute                100.64.0.7
       100                              ip4.src == 10.129.0.6         reroute                100.64.0.8
       100                              ip4.src == 10.129.0.6         reroute                100.64.0.2
       100                              ip4.src == 10.129.0.8         reroute                100.64.0.2
       100                              ip4.src == 10.129.0.8         reroute                100.64.0.7
       100                              ip4.src == 10.129.0.8         reroute                100.64.0.8
       100                             ip4.src == 10.129.4.13         reroute                100.64.0.8
       100                             ip4.src == 10.129.4.13         reroute                100.64.0.7
       100                             ip4.src == 10.129.4.13         reroute                100.64.0.2
       100                              ip4.src == 10.130.0.4         reroute                100.64.0.8
       100                              ip4.src == 10.130.0.4         reroute                100.64.0.7
       100                              ip4.src == 10.130.0.4         reroute                100.64.0.2
       100                              ip4.src == 10.130.0.5         reroute                100.64.0.7
       100                              ip4.src == 10.130.0.5         reroute                100.64.0.8
       100                              ip4.src == 10.130.0.5         reroute                100.64.0.2
       100                              ip4.src == 10.130.0.6         reroute                100.64.0.2
       100                              ip4.src == 10.130.0.6         reroute                100.64.0.7
       100                              ip4.src == 10.130.0.6         reroute                100.64.0.8
       100                             ip4.src == 10.130.4.18         reroute                100.64.0.8
       100                             ip4.src == 10.130.4.18         reroute                100.64.0.7
       100                             ip4.src == 10.130.4.18         reroute                100.64.0.2
       100                             ip4.src == 10.131.2.14         reroute                100.64.0.7
       100                             ip4.src == 10.131.2.14         reroute                100.64.0.2
       100                             ip4.src == 10.131.2.14         reroute                100.64.0.8
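The stale entries are visible above as multiple priority-100 reroute policies for the same `ip4.src`. A minimal sketch of the first proposed solution (detecting stale candidates from a saved `lr-policy-list` dump): the file path is illustrative, the sample lines are copied from the listing above, and the actual deletion with `ovn-nbctl lr-policy-del` is shown only as a comment since it must run against the live northbound database:

```shell
# Save the `ovn-nbctl lr-policy-list` output to a file first (sample lines
# here are copied from the listing above; the path is illustrative).
cat > /tmp/policies.txt <<'EOF'
       100                              ip4.src == 10.128.0.3         reroute                100.64.0.8
       100                              ip4.src == 10.128.0.3         reroute                100.64.0.2
       100                              ip4.src == 10.129.0.6         reroute                100.64.0.7
EOF

# A priority-100 reroute policy source IP that appears more than once is a
# candidate for stale entries left behind by earlier ovnkube-master runs.
# Field 4 of each line is the source IP; `uniq -d` keeps only duplicates.
awk '$1 == 100 {print $4}' /tmp/policies.txt | sort | uniq -d

# Each flagged source could then be cleared (and recreated by the current
# ovnkube-master) with something like:
#   ovn-nbctl lr-policy-del ovn_cluster_router 100 'ip4.src == 10.128.0.3'
```

In the sample above, `10.128.0.3` has two reroute policies (to 100.64.0.8 and 100.64.0.2), so the pipeline would print it as a stale candidate.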

Comment 9 philipp.dallig 2021-07-12 12:17:28 UTC
@jtanenba
I have created a PR (https://github.com/ovn-org/ovn-kubernetes/pull/2331) to fix this problem. I hope you can help move it forward.

Comment 12 Alexander Constantinescu 2021-09-01 09:07:26 UTC
*** Bug 1987258 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2021-10-18 17:35:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
