Bug 2101992 - [Azure] IP address release: After deleting and recreating egressIP object, egress traffic was intermittently broken for about 1 minute
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Andreas Karis
QA Contact: huirwang
URL:
Whiteboard:
Duplicates: 2090997
Depends On:
Blocks: 2105801
Reported: 2022-06-29 03:37 UTC by huirwang
Modified: 2023-01-17 19:51 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2105801
Environment:
Last Closed: 2023-01-17 19:50:54 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 1214 | 0 | None | open | Bug 2111534: Downstream Merge: 27-07-2022 | 2022-08-02 11:00:32 UTC
Github | ovn-org ovn-kubernetes pull 3065 | 0 | None | open | Unwire OVNKubernetes before scheduling CloudPrivateIPConfig deletion | 2022-07-10 16:55:51 UTC
Red Hat Product Errata | RHSA-2022:7399 | 0 | None | None | None | 2023-01-17 19:51:16 UTC

Description huirwang 2022-06-29 03:37:09 UTC
Description of problem:


Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-28-160049

How reproducible:
Frequently

Steps to Reproduce:
$ oc get nodes -o wide
NAME                                           STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION              CONTAINER-RUNTIME
huirwang-0629a-mbxn4-master-0                  Ready    master   88m   v1.24.0+9ddc8b1   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-master-1                  Ready    master   81m   v1.24.0+9ddc8b1   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-master-2                  Ready    master   88m   v1.24.0+9ddc8b1   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-worker-southcentralus-1   Ready    worker   62m   v1.24.0+9ddc8b1   10.0.1.5      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-worker-southcentralus-2   Ready    worker   62m   v1.24.0+9ddc8b1   10.0.1.4      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8

1. Label one worker node as egress node
$ oc label node  huirwang-0629a-mbxn4-worker-southcentralus-1 k8s.ovn.org/egress-assignable=
node/huirwang-0629a-mbxn4-worker-southcentralus-1 labeled
2. Create one egressip object
$ oc get egressip
NAME         EGRESSIPS   ASSIGNED NODE                                  ASSIGNED EGRESSIPS
egressip-1   10.0.1.27   huirwang-0629a-mbxn4-worker-southcentralus-1   10.0.1.27
3. Create one test namespace and test pods, and label them so that they match the egressip object's selectors.
4. Delete the existing egressip object and recreate it.
5. Check the outbound traffic from a test pod.
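
The contents of config1.yaml, which is used in the transcript below, are not included in this report. As a rough sketch only, an equivalent EgressIP manifest could be generated as follows. The name and address come from the `oc get egressip` output in step 2; the `env: egress-test` namespace label and the helper name `build_egressip` are hypothetical placeholders, not values from the reporter's setup.

```python
import json


def build_egressip(name, egress_ips, namespace_labels):
    """Build an EgressIP manifest (k8s.ovn.org/v1) as a plain dict."""
    return {
        "apiVersion": "k8s.ovn.org/v1",
        "kind": "EgressIP",
        "metadata": {"name": name},
        "spec": {
            "egressIPs": list(egress_ips),
            # Namespaces carrying these labels send their egress traffic
            # through the egress IP.
            "namespaceSelector": {"matchLabels": dict(namespace_labels)},
        },
    }


# Name and address are taken from the `oc get egressip` output in step 2;
# the label is a hypothetical placeholder.
MANIFEST = build_egressip("egressip-1", ["10.0.1.27"], {"env": "egress-test"})

if __name__ == "__main__":
    print(json.dumps(MANIFEST, indent=2))
```

The printed JSON can be saved to a file and passed to `oc create -f`, which accepts JSON as well as YAML. Under the assumption above, the test namespace would need the matching label, for example `oc label namespace test env=egress-test`.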

$ oc delete egressip --all
egressip.k8s.ovn.org "egressip-1" deleted
$ oc create -f ~/script/tmp/egressip/config1.yaml 
egressip.k8s.ovn.org/egressip-1 created
$ oc rsh -n test test-rc-d7dvk 
~ $ while true; do curl 10.0.99.4:9095 --connect-timeout 5; echo;date; sleep 2;done
10.0.1.27
Wed Jun 29 03:27:09 UTC 2022
10.0.1.27
Wed Jun 29 03:27:11 UTC 2022
10.0.1.27
Wed Jun 29 03:27:13 UTC 2022
10.0.1.27
Wed Jun 29 03:27:15 UTC 2022
10.0.1.27
Wed Jun 29 03:27:17 UTC 2022
10.0.1.27
Wed Jun 29 03:27:19 UTC 2022
10.0.1.27
Wed Jun 29 03:27:21 UTC 2022
10.0.1.27
Wed Jun 29 03:27:23 UTC 2022
10.0.1.27
Wed Jun 29 03:27:25 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:27:32 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:27:39 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:27:46 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:27:53 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:28:00 UTC 2022
curl: (28) Connection timeout after 5000 ms

Wed Jun 29 03:28:07 UTC 2022
10.0.1.4
Wed Jun 29 03:28:12 UTC 2022
10.0.1.4
Wed Jun 29 03:28:14 UTC 2022
10.0.1.4
Wed Jun 29 03:28:16 UTC 2022
10.0.1.4
Wed Jun 29 03:28:18 UTC 2022
10.0.1.4
Wed Jun 29 03:28:20 UTC 2022
curl: (28) Connection timeout after 5000 ms

Wed Jun 29 03:28:27 UTC 2022
curl: (28) Connection timeout after 5000 ms

Wed Jun 29 03:28:34 UTC 2022
10.0.1.27
Wed Jun 29 03:28:36 UTC 2022
10.0.1.27
Wed Jun 29 03:28:38 UTC 2022
10.0.1.27
Wed Jun 29 03:28:40 UTC 2022
10.0.1.27
Wed Jun 29 03:28:42 UTC 2022
10.0.1.27
Wed Jun 29 03:28:44 UTC 2022
10.0.1.27
Wed Jun 29 03:28:46 UTC 2022
10.0.1.27
Wed Jun 29 03:28:48 UTC 2022
^C
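
To quantify the interruption instead of eyeballing the transcript, the loop output can be collapsed into phases. The following is a minimal sketch, not part of the original test procedure: it assumes the output format shown above (a result line followed by a `date` line) and also tolerates the single-line `result --- date` format used later in this report. The function names and the abbreviated `SAMPLE` excerpt are illustrative. Because `date` runs after `curl` returns, each timestamp marks the end of a probe.

```python
import re
from datetime import datetime

# Matches `date` output as it appears in the transcript,
# e.g. "Wed Jun 29 03:27:09 UTC 2022".
DATE_RE = re.compile(
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} UTC \d{4}"
)
DATE_FMT = "%a %b %d %H:%M:%S UTC %Y"

# Abbreviated excerpt of the transcript above.
SAMPLE = """\
10.0.1.27
Wed Jun 29 03:27:25 UTC 2022
curl: (28) Connection timeout after 5001 ms

Wed Jun 29 03:27:32 UTC 2022
curl: (28) Connection timeout after 5000 ms

Wed Jun 29 03:28:07 UTC 2022
10.0.1.4
Wed Jun 29 03:28:12 UTC 2022
"""


def parse_probes(text):
    """Return a list of (timestamp, result) pairs, one per loop iteration."""
    probes = []
    pending = None  # last non-empty line seen before a timestamp
    for line in text.splitlines():
        match = DATE_RE.search(line)
        if match is None:
            if line.strip():
                pending = line.strip()
            continue
        # Normalize whitespace so single-digit days ("Jun  9") parse too.
        when = datetime.strptime(" ".join(match.group(0).split()), DATE_FMT)
        inline = line[:match.start()].replace("---", "").strip()
        result = inline or pending
        pending = None
        if result is not None:
            probes.append((when, result))
    return probes


def summarize(probes):
    """Collapse consecutive probes with the same outcome into phases."""
    phases = []
    for when, result in probes:
        label = "timeout" if result.startswith("curl:") else result
        if phases and phases[-1]["label"] == label:
            phases[-1]["end"] = when
            phases[-1]["count"] += 1
        else:
            phases.append({"label": label, "start": when, "end": when, "count": 1})
    return phases


if __name__ == "__main__":
    for phase in summarize(parse_probes(SAMPLE)):
        print(phase["label"], phase["start"].time(), phase["end"].time(), phase["count"])
```

Run against the full transcript, a summary like this separates the phases in which traffic leaves with the egress IP (10.0.1.27), times out, or leaves with the node IP (10.0.1.4).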

Actual results:
Egress traffic was intermittently broken for about 1 minute.

Expected results:
No interruption of egress traffic.

Additional info:

Comment 4 Andreas Karis 2022-07-10 16:54:46 UTC
Ironically, we ran into 2 unrelated issues.

One occurs upon unassignment of an EgressIP (it affects all clouds, but Azure in particular because it is so slow at detaching IP addresses), and the other upon attachment of IPs on Azure. The latter is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2105801

Comment 5 Andreas Karis 2022-07-10 17:01:19 UTC
Can you check with https://github.com/openshift/ovn-kubernetes/pull/1180 ? You will still see drops while the IP address is being assigned, but when you delete the egressip, you should no longer see drops.

So, comment #0 (just the deletion operation) should be fixed.

For comment #2 and #3, this will require a change to the CNCC.

Comment 6 Andreas Karis 2022-07-10 18:08:47 UTC
Some further information for testing, and on what this bug addresses in particular: again, only deletion, because IP address assignment is tracked in a separate bug.

Delete an egressip object:
~~~
[akaris@linux 2101992]$ oc delete egressip egressip
egressip.k8s.ovn.org "egressip" deleted
~~~

During this operation, run a curl loop. In this extreme case, you can see an outage of nearly 2 minutes for external traffic from the pod:
~~~
sh-5.1# while true; do curl ifconfig.me --connect-timeout 1; echo -n " --- "; date; sleep 1; done
20.1.196.203 --- Sun Jul 10 18:00:41 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:42 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:43 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:44 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:45 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:46 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:47 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:49 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:00:51 UTC 2022





curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:00:53 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:00:55 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:00:57 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:00:59 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:01 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:03 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:05 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:07 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:09 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:11 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:13 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:15 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:17 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:19 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:21 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:23 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:25 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:27 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:29 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:31 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:33 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:35 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:37 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:39 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:41 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:43 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:45 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:47 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:49 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:51 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:53 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:01:55 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:57 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:01:59 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:01 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:03 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:05 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:07 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:09 UTC 2022
curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:11 UTC 2022



curl: (28) Connection timeout after 1001 ms
 --- Sun Jul 10 18:02:13 UTC 2022
^C
sh-5.1# while true; do curl ifconfig.me --connect-timeout 1; echo -n " --- "; date; sleep 1; done
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:02:15 UTC 2022
curl: (28) Connection timeout after 1000 ms
 --- Sun Jul 10 18:02:17 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:18 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:19 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:21 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:22 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:23 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:24 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:25 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:26 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:27 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:28 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:29 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:30 UTC 2022
^C
sh-5.1# 
~~~

This outage can be explained by looking at the CNCC logs:
~~~
oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-696787bd76-vdsxn
(...)
I0710 17:46:51.199647       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 17:46:51.205296       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 18:00:48.860733       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:00:48.866791       1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.129.8" will be deleted from node: "ci-ln-4p9hg2b-1d09d-c4pt7-worker-eastus21-g2f59"
I0710 18:00:48.873548       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:01:52.778965       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:02:18.570761       1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.129.8 object has been marked for complete deletion
I0710 18:02:18.570800       1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.129.8", deleting it completely
I0710 18:02:18.582754       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:02:18.583456       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 18:02:18.586842       1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.129.8" in work queue no longer exists
I0710 18:02:18.586869       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
[akaris@linux 2101992]$ 
~~~

The outage lasts from 18:00:48 to 18:02:18 (about 90 seconds), which is exactly the time that Azure needs to report a successful IP address release to the CNCC.
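
The length of this window can be recomputed from the CNCC log lines above. The sketch below is illustrative only: it assumes the klog line format shown in the log excerpt, and because klog timestamps carry no year, the `year` parameter (2022 here) is an assumption. The function names are hypothetical; the two marker strings are copied from the log messages above.

```python
import re
from datetime import datetime

# klog header: level, MMDD, time with microseconds, thread id, file:line.
KLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)

# Two lines copied from the CNCC log excerpt above.
LOG = """\
I0710 18:00:48.866791       1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.129.8" will be deleted from node: "ci-ln-4p9hg2b-1d09d-c4pt7-worker-eastus21-g2f59"
I0710 18:02:18.570800       1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.129.8", deleting it completely
"""


def parse_klog(line, year=2022):
    """Return (timestamp, message) for a klog line, or None if it does not match."""
    m = KLOG_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(
        f"{year}-{m['month']}-{m['day']} {m['time']}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return stamp, m["message"]


def release_duration(log_text, year=2022):
    """Time between the deletion request and the finalizer cleanup, or None."""
    start = end = None
    for line in log_text.splitlines():
        parsed = parse_klog(line.strip(), year)
        if parsed is None:
            continue
        stamp, message = parsed
        if start is None and "will be deleted from node" in message:
            start = stamp
        elif start is not None and "Cleaning up IP address and finalizer" in message:
            end = stamp
            break
    if start is None or end is None:
        return None
    return end - start


if __name__ == "__main__":
    print(release_duration(LOG))
```

For the two lines in `LOG`, the result is just under 90 seconds, which matches the gap between the last successful curl before the outage and the first successful one after it.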

With the fix for this bug, this outage will no longer occur upon deletion.

Comment 9 Andreas Karis 2022-07-11 11:51:31 UTC
*** Bug 2090997 has been marked as a duplicate of this bug. ***

Comment 10 Andreas Karis 2022-08-09 15:44:29 UTC
The downstream merge landed with https://github.com/openshift/ovn-kubernetes/pull/1214

Comment 14 errata-xmlrpc 2023-01-17 19:50:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

