Bug 2027244 - [OVN] EgressIP Route creation around three second to late
Summary: [OVN] EgressIP Route creation around three second to late
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
: 4.8.z
Assignee: ffernand
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-11-29 08:00 UTC by philipp.dallig
Modified: 2023-09-15 01:17 UTC (History)
7 users (show)

Fixed In Version: OCP 4.8.29
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-23 08:43:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 912 0 None Merged [release-4.8] Bug 2041767: Fix pod handlers and pod IP parsing for egress IP 2022-02-09 15:30:15 UTC
Red Hat Product Errata RHBA-2022:0559 0 None None None 2022-02-23 08:43:36 UTC

Description philipp.dallig 2021-11-29 08:00:22 UTC
Description of problem:
Pods with a matching egressIP are given a reroute policy so that external traffic is routed through the egressIP. Unfortunately, the creation of this reroute policy is quite late. The started pod process uses the node IP for early connections.

Version-Release number of selected component (if applicable):
OKD 4.8.0-0.okd-2021-11-14-052418 with ovn-kubernetes (https://github.com/openshift/ovn-kubernetes/tree/f1d74e354da9337853677d4f20b03fc14107beb3)

Steps to Reproduce:
1. Create a EgressIP resource
2. Create a Pod. This Pod must be affected by the EgressIP selectors.
3. Check the creation of the redirection policy in the active ovn-kube-master log

Actual results:

As you can see in the log below, the creation of the reroute policy occurs about three seconds later than the annotation for the IP address.

W1126 13:39:32.962839       1 pods.go:349] Failed to get options for port: bes-jenkins_mvn-jdk11-mongodb-speech-3t6vs-ldjmg
I1126 13:39:32.963029       1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.10.5/23"],"mac_address":"0a:58:0a:80:0a:05","gateway_ips":["10.128.10.1"],"ip_address":"10.128.10.5/23","gateway_ip":"10.128.10.1"}}] on pod bes-jenkins/mvn-jdk11-mongodb-speech-3t6vs-ldjmg
2021-11-26T13:39:33.037Z|01885|nbctl|INFO|Running command run -- add address_set de5c11b1-0490-4e4e-bcf5-338d5bf071c4 addresses "\"10.128.10.5\""
I1126 13:39:33.051125       1 pods.go:304] [bes-jenkins/mvn-jdk11-mongodb-speech-3t6vs-ldjmg] addLogicalPort took 88.930583ms
2021-11-26T13:39:36.242Z|01886|nbctl|INFO|Running command run --id=@lr-policy -- create logical_router_policy action=reroute "match=\"ip4.src == 10.128.10.5\"" priority=100 "nexthop=\"100.64.0.23\"" external_ids:name=bes-jenkins -- add logical_router ovn_cluster_router policies @lr-policy
2021-11-26T13:39:36.261Z|01887|nbctl|INFO|Running command run --id=@nat -- create nat type=snat logical_port=k8s-worker2-cl4-dc1.ocp.cloud.mycompany.com "external_ip=\"10.20.14.4\"" "logical_ip=\"10.128.10.5\"" external_ids:name=bes-jenkins -- add logical_router GR_worker2-cl4-dc1.ocp.cloud.mycompany.com nat @nat

Expected results:
The reroute policy creation should be must faster. The best solution would be if the Pod gets its IP.

Additional info:

Comment 1 Alexander Constantinescu 2022-01-13 14:48:57 UTC
Hi

A couple of things on this:

1. There has been a "performance improvement" for egress IP. It's currently being back-ported to 4.9 https://github.com/openshift/ovn-kubernetes/pull/884/commits/8e6ef439426bd471151548c43832d9c9284e2d44 so 4.8 is coming after. Previously the egress IP setup needed to wait for pod.Status.podIPs to be set before we could proceed with the setup, that takes a lot longer than just reacting to the annotation ovn sets when it has allocated an IP address.

2. 1. doesn't matter because egress IP can only work once the pod has received an IP address and is running. I.e: just reading logs and making a determination of whether this has performance impacts is moot point for egress IP. The egress IP performance is dependent on the CRI, Kubelet, ovnkube-master's IP allocation for the pod, ovnkube-node's networking setup for the pod, etc. Even if the logs would have indicated a quicker time, it still wouldn't have matter for what concerns the question: "__can__ the pod use egress IP?" because it can only start sending traffic once __all__ of that is done, not once those logs have been written.

3. Given 2. we also need to know what conditions you're cluster has. How many pods are you running? How many nodes? How many namespaces? Performance is impacted by all of this, and sometimes things are just slow because there's a lot of things going on. 

I am thinking about closing this as a NOTABUG. But I'll keep it open for posterity until answers and more information come in w.r.t the cluster setup. 

/Alex.

Comment 4 Alexander Constantinescu 2022-01-13 16:49:00 UTC
Please follow https://bugzilla.redhat.com/show_bug.cgi?id=2034668 for an answer as to when https://github.com/openshift/ovn-kubernetes/pull/884/commits/8e6ef439426bd471151548c43832d9c9284e2d44 is back-ported to 4.8. But like I said in https://bugzilla.redhat.com/show_bug.cgi?id=2027244#c1 this is not really a bug in that the egress IP code doesn't completely decide how fast a pod starts using an egress IP. If there's a big delta between the time a pod matching an egress IP is started and running and the time its egress traffic gets the egress IP assigned, then that's a bug. But those log lines are not an indication of something conclusive and necessarily wrong.

Comment 5 philipp.dallig 2022-01-14 10:57:27 UTC
Thank you very much for your reply. I have taken a look at the "performance improvement" for the egress IP. I think this change fixes this problem. Unfortunately, I'm stuck on okd 4.8.0-0.okd-2021-11-14-052418 right now and can't update for other reasons.
After updating to okd 4.9 with your patch, I will let you know if the problem has been fixed.

Comment 10 philipp.dallig 2022-02-10 15:04:23 UTC
I have updated my pre prod environment to OKD 4.9.0-0.okd-2022-01-29-035536. A review of the log shows a significant speed improvement. Everything is done in milliseconds. In my eyes, the bug can be closed.

I0210 10:25:32.262224       1 egressip.go:416] EgressIP: vault has matched on pod: kubernetes-external-secrets-567c97cf48-5xfgm in namespace: external-secrets
I0210 10:25:32.275289       1 egressip.go:427] EgressIP: vault update for pod: kubernetes-external-secrets-567c97cf48-5xfgm in namespace: external-secrets
I0210 10:25:32.277769       1 pods.go:293] Creating logical port for external-secrets_kubernetes-external-secrets-567c97cf48-5xfgm on switch worker3-cl1-dc99.s-ocp.cloud.avm.de [66eb4bba-709b-4a17-9a5a-2df29978d65c]
I0210 10:25:32.279553       1 pods.go:491] Annotation values: ip=[10.129.4.7/23] ; mac=0a:58:0a:81:04:07 ; gw=[10.129.4.1]
Annotation=map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.4.7/23"],"mac_address":"0a:58:0a:81:04:07","gateway_ips":["10.129.4.1"],"ip_address":"10.129.4.7/23","gateway_ip":"10.129.4.1"}}]
I0210 10:25:32.281066       1 kube.go:67] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.4.7/23"],"mac_address":"0a:58:0a:81:04:07","gateway_ips":["10.129.4.1"],"ip_address":"10.129.4.7/23","gateway_ip":"10.129.4.1"}}] on pod external-secrets/kubernetes-external-secrets-567c97cf48-5xfgm
I0210 10:25:32.367885       1 port_cache.go:55] port-cache(external-secrets_kubernetes-external-secrets-567c97cf48-5xfgm): added port &{name:external-secrets_kubernetes-external-secrets-567c97cf48-5xfgm uuid:8e5fd222-f9d8-4d28-a889-53a98cab581c logicalSwitch:worker3-cl1-dc99.s-ocp.cloud.avm.de ips:[0xc001f9a750] mac:[10 88 10 129 4 7] expires:{wall:0 ext:0 loc:<nil>}}
I0210 10:25:32.369126       1 pods.go:282] [external-secrets/kubernetes-external-secrets-567c97cf48-5xfgm] addLogicalPort took 91.370815ms, OVN Execute time 14.5123ms, pod Annotation time: 72.271737ms
I0210 10:25:32.369222       1 egressip.go:427] EgressIP: vault update for pod: kubernetes-external-secrets-567c97cf48-5xfgm in namespace: external-secrets
I0210 10:25:32.369288       1 egressip.go:427] EgressIP: vault update for pod: kubernetes-external-secrets-567c97cf48-5xfgm in namespace: external-secrets
I0210 10:25:32.369413       1 ovs.go:209] exec(1071): /usr/bin/ovn-nbctl --timeout=15 --format=csv --data=bare --no-heading --columns=_uuid find logical_router_policy match="ip4.src == 10.129.4.7" priority=100 external_ids:name=vault nexthops=["100.64.0.8"]
I0210 10:25:32.395977       1 ovs.go:209] exec(1072): /usr/bin/ovn-nbctl --timeout=15 --id=@lr-policy create logical_router_policy action=reroute match="ip4.src == 10.129.4.7" priority=100 nexthops=["100.64.0.8"] external_ids:name=vault -- add logical_router ovn_cluster_router policies @lr-policy
2022-02-10T10:25:32.401Z|00737|ovn_dbctl|INFO|Running command run --id=@lr-policy -- create logical_router_policy action=reroute "match=\"ip4.src == 10.129.4.7\"" priority=100 "nexthops=[\"100.64.0.8\"]" external_ids:name=vault -- add logical_router ovn_cluster_router policies @lr-policy

Comment 16 errata-xmlrpc 2022-02-23 08:43:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.32 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0559

Comment 17 Red Hat Bugzilla 2023-09-15 01:17:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.