Description of problem:
Deleting and recreating an EgressIP object intermittently breaks egress traffic from the matching pods for about one minute.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-28-160049

How reproducible:
Frequently

Steps to Reproduce:

$ oc get nodes -o wide
NAME                                           STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION              CONTAINER-RUNTIME
huirwang-0629a-mbxn4-master-0                  Ready    master   88m   v1.24.0+9ddc8b1   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-master-1                  Ready    master   81m   v1.24.0+9ddc8b1   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-master-2                  Ready    master   88m   v1.24.0+9ddc8b1   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-worker-southcentralus-1   Ready    worker   62m   v1.24.0+9ddc8b1   10.0.1.5      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8
huirwang-0629a-mbxn4-worker-southcentralus-2   Ready    worker   62m   v1.24.0+9ddc8b1   10.0.1.4      <none>        Red Hat Enterprise Linux CoreOS 411.86.202206272059-0 (Ootpa)   4.18.0-372.9.1.el8.x86_64   cri-o://1.24.1-7.rhaos4.11.gita69f315.el8

1. Label one worker node as an egress node:
$ oc label node huirwang-0629a-mbxn4-worker-southcentralus-1 k8s.ovn.org/egress-assignable=
node/huirwang-0629a-mbxn4-worker-southcentralus-1 labeled

2. Create one egressip object (a sketch of the manifest is included under "Additional info" below):
$ oc get egressip
NAME         EGRESSIPS   ASSIGNED NODE                                  ASSIGNED EGRESSIPS
egressip-1   10.0.1.27   huirwang-0629a-mbxn4-worker-southcentralus-1   10.0.1.27

3. Create one test namespace and test pods, and add the label matched by the egressip object to the namespace.

4. Delete the existing egressip object and recreate it.

5. Check the outbound traffic from the test pod:
$ oc delete egressip --all
egressip.k8s.ovn.org "egressip-1" deleted
$ oc create -f ~/script/tmp/egressip/config1.yaml
egressip.k8s.ovn.org/egressip-1 created
$ oc rsh -n test test-rc-d7dvk
~ $ while true; do curl 10.0.99.4:9095 --connect-timeout 5; echo;date; sleep 2;done
10.0.1.27
Wed Jun 29 03:27:09 UTC 2022
10.0.1.27
Wed Jun 29 03:27:11 UTC 2022
10.0.1.27
Wed Jun 29 03:27:13 UTC 2022
10.0.1.27
Wed Jun 29 03:27:15 UTC 2022
10.0.1.27
Wed Jun 29 03:27:17 UTC 2022
10.0.1.27
Wed Jun 29 03:27:19 UTC 2022
10.0.1.27
Wed Jun 29 03:27:21 UTC 2022
10.0.1.27
Wed Jun 29 03:27:23 UTC 2022
10.0.1.27
Wed Jun 29 03:27:25 UTC 2022
curl: (28) Connection timeout after 5001 ms
Wed Jun 29 03:27:32 UTC 2022
curl: (28) Connection timeout after 5001 ms
Wed Jun 29 03:27:39 UTC 2022
curl: (28) Connection timeout after 5001 ms
Wed Jun 29 03:27:46 UTC 2022
curl: (28) Connection timeout after 5001 ms
Wed Jun 29 03:27:53 UTC 2022
curl: (28) Connection timeout after 5001 ms
Wed Jun 29 03:28:00 UTC 2022
curl: (28) Connection timeout after 5000 ms
Wed Jun 29 03:28:07 UTC 2022
10.0.1.4
Wed Jun 29 03:28:12 UTC 2022
10.0.1.4
Wed Jun 29 03:28:14 UTC 2022
10.0.1.4
Wed Jun 29 03:28:16 UTC 2022
10.0.1.4
Wed Jun 29 03:28:18 UTC 2022
10.0.1.4
Wed Jun 29 03:28:20 UTC 2022
curl: (28) Connection timeout after 5000 ms
Wed Jun 29 03:28:27 UTC 2022
curl: (28) Connection timeout after 5000 ms
Wed Jun 29 03:28:34 UTC 2022
10.0.1.27
Wed Jun 29 03:28:36 UTC 2022
10.0.1.27
Wed Jun 29 03:28:38 UTC 2022
10.0.1.27
Wed Jun 29 03:28:40 UTC 2022
10.0.1.27
Wed Jun 29 03:28:42 UTC 2022
10.0.1.27
Wed Jun 29 03:28:44 UTC 2022
10.0.1.27
Wed Jun 29 03:28:46 UTC 2022
10.0.1.27
Wed Jun 29 03:28:48 UTC 2022
^C

Actual results:
Egress traffic was intermittently broken for about 1 minute.

Expected results:
No intermittent breakage of egress traffic.

Additional info:
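The egressip manifest used above (~/script/tmp/egressip/config1.yaml) is not included in this report. What follows is only a minimal sketch of what it presumably looks like, based on the egressip-1 object and the 10.0.1.27 address shown in the steps above; the namespace label used here is purely hypothetical and only has to match whatever label was added to the test namespace in step 3:

~~~
# Hypothetical reconstruction of config1.yaml; the object name and egress IP
# come from this report, the namespace label "env: qe" is an assumption.
cat <<'EOF' | oc create -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 10.0.1.27
  namespaceSelector:
    matchLabels:
      env: qe    # assumed label; must match the label on the test namespace
EOF
~~~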
Ironically, we ran into two unrelated issues here: one upon unassignment of an EgressIP (this affects all clouds, but Azure in particular because it is so slow at detaching IP addresses), and one upon attachment of IPs on Azure. For the latter, see https://bugzilla.redhat.com/show_bug.cgi?id=2105801
Can you check with https://github.com/openshift/ovn-kubernetes/pull/1180? You will still see drops while the IP address is being assigned, but when you delete the egressip you should no longer see drops. So comment #0 (just the deletion operation) should be fixed. Comments #2 and #3 will require a change to the CNCC.
Just some further info for testing and for what this bug in particular addresses: again, only deletion, as IP address assignment is another bug.

Delete an egressip object:
~~~
[akaris@linux 2101992]$ oc delete egressip egressip
egressip.k8s.ovn.org "egressip" deleted
~~~

During this operation, run a curl. In this extreme case, you can see an outage of nearly 2 minutes for external traffic from the pod:
~~~
sh-5.1# while true; do curl ifconfig.me --connect-timeout 1; echo -n " --- "; date; sleep 1; done
20.1.196.203 --- Sun Jul 10 18:00:41 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:42 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:43 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:44 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:45 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:46 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:47 UTC 2022
20.1.196.203 --- Sun Jul 10 18:00:49 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:00:51 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:00:53 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:00:55 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:00:57 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:00:59 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:01 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:03 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:05 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:07 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:09 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:11 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:13 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:15 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:17 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:19 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:21 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:23 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:25 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:27 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:29 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:31 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:33 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:35 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:37 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:39 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:41 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:43 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:45 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:47 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:49 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:51 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:53 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:01:55 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:57 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:01:59 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:01 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:03 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:05 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:07 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:09 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:11 UTC 2022
curl: (28) Connection timeout after 1001 ms --- Sun Jul 10 18:02:13 UTC 2022
^C
sh-5.1# while true; do curl ifconfig.me --connect-timeout 1; echo -n " --- "; date; sleep 1; done
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:02:15 UTC 2022
curl: (28) Connection timeout after 1000 ms --- Sun Jul 10 18:02:17 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:18 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:19 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:21 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:22 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:23 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:24 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:25 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:26 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:27 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:28 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:29 UTC 2022
20.1.196.203 --- Sun Jul 10 18:02:30 UTC 2022
^C
sh-5.1#
~~~

This outage can be explained by looking at the CNCC logs:
~~~
oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-696787bd76-vdsxn
(...)
I0710 17:46:51.199647       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 17:46:51.205296       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 18:00:48.860733       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:00:48.866791       1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.129.8" will be deleted from node: "ci-ln-4p9hg2b-1d09d-c4pt7-worker-eastus21-g2f59"
I0710 18:00:48.873548       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:01:52.778965       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:02:18.570761       1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.129.8 object has been marked for complete deletion
I0710 18:02:18.570800       1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.129.8", deleting it completely
I0710 18:02:18.582754       1 controller.go:182] Assigning key: 10.0.129.8 to cloud-private-ip-config workqueue
I0710 18:02:18.583456       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
I0710 18:02:18.586842       1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.129.8" in work queue no longer exists
I0710 18:02:18.586869       1 controller.go:160] Dropping key '10.0.129.8' from the cloud-private-ip-config workqueue
[akaris@linux 2101992]$
~~~

The outage lasts from 18:00:48 to 18:02:18, which is exactly the time that Azure needs to report a successful IP address release to the CNCC. With the fix for this bug, this outage will no longer occur upon deletion.
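For anyone retesting this, one way to line up the pod-side outage window with the CloudPrivateIPConfig lifecycle is to watch the object and the CNCC logs while the curl loop runs. This is only a sketch based on the logs above: the CloudPrivateIPConfig name is assumed to be the egress IP itself (10.0.129.8 here), and the CNCC is assumed to run as the cloud-network-config-controller deployment in the openshift-cloud-network-config-controller namespace.

~~~
# Terminal 1: watch the CloudPrivateIPConfig for the egress IP while deleting
# the EgressIP; the object name is the IP address (taken from the logs above).
oc get cloudprivateipconfig 10.0.129.8 -o yaml --watch

# Terminal 2: follow the CNCC logs with timestamps so that the "will be deleted
# from node" and "marked for complete deletion" messages can be correlated with
# the curl timeouts seen from the pod.
oc logs -n openshift-cloud-network-config-controller \
    deployment/cloud-network-config-controller -f --timestamps

# Terminal 3 (inside the test pod): the same curl loop as above.
while true; do curl ifconfig.me --connect-timeout 1; echo -n " --- "; date; sleep 1; done
~~~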
*** Bug 2090997 has been marked as a duplicate of this bug. ***
The downstream merge landed with https://github.com/openshift/ovn-kubernetes/pull/1214
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399