Bug 1910533

Summary: [OVN] It takes about 5 minutes for EgressIP failover to work
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Reporter: huirwang
Assignee: Alexander Constantinescu <aconstan>
QA Contact: huirwang
CC: aconstan, anbhat
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.7
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-02-24 15:48:24 UTC
Bug Blocks: 1920482

Description huirwang 2020-12-24 09:25:30 UTC
Description of problem:
It takes about 5 minutes for EgressIP failover to work.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-21-131655 

How reproducible:
Sometimes

Steps to Reproduce:
1. Label two nodes as EgressIP nodes.
2. Create an EgressIP object.
oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2020-12-24T08:58:02Z"
    generation: 2
    managedFields:
    - apiVersion: k8s.ovn.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:egressIPs: {}
          f:namespaceSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:team: {}
          f:podSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:team: {}
      manager: oc
      operation: Update
      time: "2020-12-24T08:58:02Z"
    - apiVersion: k8s.ovn.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:items: {}
      manager: ovnkube
      operation: Update
      time: "2020-12-24T08:58:02Z"
    name: egressip2
    resourceVersion: "551266"
    uid: 2561ea84-af86-4f08-a085-3e0eabac235b
  spec:
    egressIPs:
    - 172.31.249.203
    - 172.31.249.202
    namespaceSelector:
      matchLabels:
        team: red
    podSelector:
      matchLabels:
        team: blue
  status:
    items:
    - egressIP: 172.31.249.203
      node: huirwang-470-rgw66-master-1
    - egressIP: 172.31.249.202
      node: huirwang-470-rgw66-master-2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

3. Create namespace hrw and pods in it, then label the namespace and the pods to match the matchLabels above (namespace: team=red, pods: team=blue).

4. From one pod, repeatedly access an external server. Meanwhile, stop the kubelet service on node huirwang-470-rgw66-master-1, which makes the node NotReady.

oc rsh -n hrw test-rc-gcw4v
~ $ while true; do date; curl 172.31.249.80:9095 --connect-timeout 2;sleep 2;done

172.31.249.203Thu Dec 24 09:03:51 UTC 2020
172.31.249.203Thu Dec 24 09:03:53 UTC 2020
172.31.249.203Thu Dec 24 09:03:55 UTC 2020
172.31.249.203Thu Dec 24 09:03:57 UTC 2020
172.31.249.203Thu Dec 24 09:03:59 UTC 2020
172.31.249.203Thu Dec 24 09:04:01 UTC 2020
172.31.249.203Thu Dec 24 09:04:03 UTC 2020
172.31.249.203Thu Dec 24 09:04:05 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:09 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:04:13 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:18 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:22 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:04:26 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:30 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:34 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:38 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:42 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:46 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:50 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:04:54 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:04:58 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:05:02 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:05:06 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:05:10 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:05:14 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:05:18 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:22 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:26 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:30 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:34 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:38 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:42 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:46 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:50 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:54 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:05:58 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:02 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:06 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:10 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:14 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:18 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:22 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:26 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:30 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:34 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:38 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:42 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:46 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:50 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:06:54 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:06:58 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:02 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:06 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:10 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:14 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:18 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:22 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:26 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:30 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:34 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:07:38 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:07:42 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:07:46 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:07:50 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:07:54 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:07:58 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:02 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:06 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:10 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:08:14 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:18 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:22 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:26 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:30 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:34 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:38 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:08:42 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:08:46 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:50 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:54 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:08:58 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:09:02 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:06 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:10 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:14 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:18 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:22 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:26 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:30 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:34 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:38 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:42 UTC 2020
curl: (28) Connection timed out after 2001 milliseconds
Thu Dec 24 09:09:46 UTC 2020
curl: (28) Connection timed out after 2000 milliseconds
Thu Dec 24 09:09:50 UTC 2020
172.31.249.203Thu Dec 24 09:09:52 UTC 2020
172.31.249.203Thu Dec 24 09:09:54 UTC 2020
172.31.249.203Thu Dec 24 09:09:56 UTC 2020
172.31.249.203Thu Dec 24 09:09:58 UTC 2020
172.31.249.203Thu Dec 24 09:10:00 UTC 2020
172.31.249.203Thu Dec 24 09:10:02 UTC 2020
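
Note on reading the output above: each loop iteration prints the date first and the curl result second, and the response body has no trailing newline, so a successful response (172.31.249.203) appears glued to the date of the next iteration. A line starting with "curl: (28)" therefore belongs to the request started at the date printed just before it. The following is only a rough sketch for measuring the outage from such a log; the function name outage_seconds is made up for illustration, and it assumes the whole log falls within one calendar day.

```shell
#!/bin/sh
# outage_seconds: read the curl loop output on stdin and print the number of
# seconds between the first timed-out request and the first request that
# succeeded again. Hypothetical helper, not part of any tool.
outage_seconds() {
  awk '
    # Convert HH:MM:SS to seconds since midnight (assumes a single day).
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }

    /^curl: \(28\)/ {
      # The request started at the previously printed date timed out.
      if (first_fail == "") first_fail = prev
      failing = 1
      next
    }

    match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
      now = substr($0, RSTART, RLENGTH)
      # Text before the weekday is the response body of the previous request,
      # so a prefixed date line after a failure marks the recovery.
      if (failing && $0 !~ /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) /) {
        print secs(prev) - secs(first_fail)
        exit
      }
      prev = now
    }
  '
}
```

With the output saved to a file (the name curl-loop.log is an assumption), `outage_seconds < curl-loop.log` gives 345 for the run above: the first failed request started at 09:04:05 and the first successful one at 09:09:50, which is 5 minutes 45 seconds.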




Actual results:

The EgressIP object is updated almost immediately, but connectivity recovers slowly: it takes more than 5 minutes for traffic to flow again.

oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2020-12-24T08:58:02Z"
    generation: 5
    managedFields:
    - apiVersion: k8s.ovn.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:egressIPs: {}
          f:namespaceSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:team: {}
          f:podSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:team: {}
      manager: oc
      operation: Update
      time: "2020-12-24T08:58:02Z"
    - apiVersion: k8s.ovn.org/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:items: {}
      manager: ovnkube
      operation: Update
      time: "2020-12-24T08:58:02Z"
    name: egressip2
    resourceVersion: "553045"
    uid: 2561ea84-af86-4f08-a085-3e0eabac235b
  spec:
    egressIPs:
    - 172.31.249.203
    - 172.31.249.202
    namespaceSelector:
      matchLabels:
        team: red
    podSelector:
      matchLabels:
        team: blue
  status:
    items:
    - egressIP: 172.31.249.203
      node: huirwang-470-rgw66-master-2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Expected results:

The failover should take effect quickly.

Additional info:
I cannot reproduce it every time, but it has happened about 2 to 3 times.
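
Since the problem is intermittent, it may help to script steps 1 to 3 so the setup can be repeated quickly. The sketch below is not the exact procedure used in this report. The function names are made up, the manifest only reproduces the spec shown in step 2, and the node label k8s.ovn.org/egress-assignable is an assumption (it is the label OVN-Kubernetes uses for egress-capable nodes, but the report does not say which label was applied). It also assumes namespace hrw and its pods already exist.

```shell
#!/bin/sh
# Sketch of steps 1-3. This only defines functions; nothing runs until called.

# Print an EgressIP manifest matching the spec shown in step 2.
egressip_manifest() {
  cat <<'EOF'
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip2
spec:
  egressIPs:
  - 172.31.249.203
  - 172.31.249.202
  namespaceSelector:
    matchLabels:
      team: red
  podSelector:
    matchLabels:
      team: blue
EOF
}

# setup_egressip NODE...: label the given nodes, create the EgressIP object,
# then label namespace hrw and its pods to match the selectors.
setup_egressip() {
  for node in "$@"; do
    # Assumption: label that marks a node as able to host egress IPs.
    oc label node "$node" k8s.ovn.org/egress-assignable="" --overwrite
  done
  egressip_manifest | oc apply -f -
  oc label namespace hrw team=red --overwrite
  oc label pods --all -n hrw team=blue --overwrite
}
```

For the cluster in this report the call would look like `setup_egressip huirwang-470-rgw66-master-1 huirwang-470-rgw66-master-2`, followed by `oc get egressip -o yaml` to confirm the assignment.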

Comment 2 Alexander Constantinescu 2021-01-06 15:59:35 UTC
FYI: Upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/1939

Comment 7 errata-xmlrpc 2021-02-24 15:48:24 UTC
Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0
security, bug fix, and enhancement update), and where to find the updated files,
follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633