Bug 1877273

Summary: [OVN] EgressIP cannot fail over to available nodes after one egressIP node shutdown
Product: OpenShift Container Platform
Reporter: huirwang
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: ovn-kubernetes
QA Contact: huirwang
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aconstan, acossett, amulmule, bbennett, ChetRHosey, danw, jboxman, jnordell, skanakal, vpickard
Version: 4.6   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Doc Text:
Cause: When a node experienced networking issues (or the kubelet failed to function properly and the node went into a non-ready state), the egress IPs assigned to that node were never re-assigned elsewhere.
Consequence: The egress IP functionality was broken, as packets were still routed to the faulty egress node, which could not serve traffic.
Fix: We now verify the state of all egress nodes periodically, by pinging each egress node and verifying the node object's state.
Result: In case a node goes down, the egress IPs are now re-assigned and the functionality keeps working by re-directing egress traffic to another node.
Story Points: ---
Clone Of:
Clones: 1898160 (view as bug list)
Last Closed: 2021-02-24 15:17:43 UTC
Type: Bug
Bug Blocks: 1898160    

Description huirwang 2020-09-09 09:31:43 UTC
Description of problem:
Two nodes are configured as egress IP nodes. The pods currently use one of them for outgoing traffic, but once that node is shut down, the pods' outgoing traffic breaks instead of failing over to the other node.

Version-Release number of selected component (if applicable):
4.6.0-0.ci-2020-09-08-214738

How reproducible:
Always

Steps to Reproduce:
1. Label two nodes as egress-assignable (a verification command is sketched below):
oc label node compute-1 "k8s.ovn.org/egress-assignable"=""
oc label node compute-0 "k8s.ovn.org/egress-assignable"=""
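The labels can be confirmed with a label selector (a verification step assumed here, not part of the original report):

oc get nodes -l k8s.ovn.org/egress-assignable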
2. Create namespace test and pod hello-pod in it, then add labels to the namespace and the pod (a sample pod manifest is sketched below):

oc label ns test name=test
oc label pod hello-pod team=blue -n test
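
For reference, a minimal hello-pod manifest along these lines should work; the report does not include the pod definition, so the image is an assumption:

apiVersion: v1
kind: Pod
metadata:
  name: hello-pod
  namespace: test
spec:
  containers:
  - name: hello-pod
    # Assumed image: any image that provides a shell and curl works here.
    image: docker.io/alpine/curl
    command: ["sleep", "3600"]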

3. Create the EgressIP object (applied with the command shown after the manifest):

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip7
spec:
  egressIPs:
  - 139.178.76.20 
  - 139.178.76.21
  podSelector:
    matchLabels:
      team: blue
  namespaceSelector:
    matchLabels:
      name: test
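
The manifest is saved to a file and applied with (egressip7.yaml is an assumed file name; the report does not show this step):

oc create -f egressip7.yaml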
4. Check the EgressIP object; the egress IPs were assigned to the two nodes.
oc get egressip egressip7 -o yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  creationTimestamp: "2020-09-09T05:43:42Z"
  generation: 2
  managedFields:
  - apiVersion: k8s.ovn.org/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:egressIPs: {}
        f:namespaceSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:name: {}
        f:podSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:team: {}
    manager: oc
    operation: Update
    time: "2020-09-09T05:43:42Z"
  - apiVersion: k8s.ovn.org/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:items: {}
    manager: ovnkube
    operation: Update
    time: "2020-09-09T05:43:43Z"
  name: egressip7
  resourceVersion: "185731"
  selfLink: /apis/k8s.ovn.org/v1/egressips/egressip7
  uid: 640e20aa-2acd-45fa-be08-e83b73905ea4
spec:
  egressIPs:
  - 139.178.76.20
  - 139.178.76.21
  namespaceSelector:
    matchLabels:
      name: test
  podSelector:
    matchLabels:
      team: blue
status:
  items:
  - egressIP: 139.178.76.20
    node: compute-0
  - egressIP: 139.178.76.21
    node: compute-1


5. Access external websites from the test pod; the first configured egress IP is used as the source IP.
oc rsh -n test hello-pod
/ # curl ifconfig.me
139.178.76.20

6. Shut down node compute-0 (one possible method is sketched below).
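One way to do this, assuming debug access to the host (the report does not specify how the node was shut down):

oc debug node/compute-0 -- chroot /host shutdown -h now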
oc get nodes
NAME              STATUS     ROLES    AGE     VERSION
compute-0         NotReady   worker   6h52m   v1.19.0-rc.2+068702d
compute-1         Ready      worker   6h52m   v1.19.0-rc.2+068702d
control-plane-0   Ready      master   7h1m    v1.19.0-rc.2+068702d
control-plane-1   Ready      master   7h1m    v1.19.0-rc.2+068702d
control-plane-2   Ready      master   7h1m    v1.19.0-rc.2+068702d

7. Access external websites from the test pod again.
oc rsh -n test hello-pod
/ # curl --connect-timeout 5 ifconfig.me
curl: (7) Failed to connect to ifconfig.me port 80: Operation timed out

Actual results:
The pod cannot connect outside the cluster after node compute-0 is shut down.


Expected results:
The pod should still be able to connect outside the cluster; egress traffic should fail over to the other available egress IP.

Additional info:

Comment 5 Alexander Constantinescu 2020-09-23 07:40:29 UTC
Hi Huiran

I am going to push this out to 4.7, for the following reasons:

1) In OVN/OVS we cannot have multiple reroutes matching the same traffic to multiple egress nodes. For this we would need the OVN RFE https://bugzilla.redhat.com/show_bug.cgi?id=1881826 to be implemented.
2) Even if multiple reroutes to multiple egress nodes existed, we cannot ensure that if a node silently dies (i.e. the OpenShift/Kubernetes API server is not aware of it) traffic then flows through the node which is still functioning. For that we would need OVN RFE https://bugzilla.redhat.com/show_bug.cgi?id=1847570.

I believe this is not a big use case, so it should be fine to wait until the 4.7 release.

Comment 6 acossett 2020-10-26 19:20:38 UTC
This is really important to fix as soon as possible: the customer cannot go to production without a working failover scenario, and the application will be down until the faulty node is destroyed. It also does not make sense to configure two IPs if failover is not working.
This is a big use case for Telco and Financial customers.

Comment 7 Jason Boxman 2020-10-26 20:14:43 UTC
It looks like we sort of covered this with:

"If a node is deleted by a cluster administrator, any egress IP addresses assigned to it are automatically reassigned, subject to the previously described conditions."

But it isn't working?

Is this a known issue for OCP 4.6 GA?

Thanks!

Comment 8 Dan Winship 2020-10-27 13:33:23 UTC
(In reply to Alexander Constantinescu from comment #5)
> This is not a big use case I believe and should be fine waiting for until
> the 4.7 release.

Doh. So I guess there was confusion in all the scurrying to finish 4.6 features, but this is absolutely a mandatory part of the feature. OVN-Kubernetes needs to actively detect when nodes become unreachable, and move their egress IPs away when they do. It can't just assume Nodes will get deleted if they are unavailable. See poll()/check() in https://github.com/openshift/sdn/blob/master/pkg/network/master/egressip.go for the openshift-sdn version.

OVN-Kubernetes also needs code to rebalance egress IPs when they get too unbalanced. (E.g., once the above problem is fixed, then after an upgrade the last egress node to reboot would end up with 0 egress IPs assigned.) What OpenShift SDN does (ReallocateEgressIPs() in https://github.com/openshift/sdn/blob/master/pkg/network/common/egressip.go) is this: every time a node or an egress IP is added or removed, it computes both an "incremental" allocation (like what ovn-kubernetes does now) and a "from scratch" allocation (i.e., how it would have chosen to allocate the IPs if none of them were already assigned). Then, if any node has more than twice as many egress IPs in the "from scratch" allocation as it would have had in the "incremental" allocation, it knows things have gotten unbalanced and it needs to proactively move some IPs over to the underallocated node(s).
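To make that heuristic concrete, here is a rough Go sketch of the imbalance check described above (hypothetical code, not the actual openshift-sdn implementation; needsRebalance and the allocation maps are illustrative):

package main

import "fmt"

// needsRebalance compares the "incremental" allocation (what exists now)
// against a "from scratch" allocation, and reports nodes that the
// from-scratch pass would have given more than twice as many egress IPs:
// those nodes are underallocated and should receive IPs during rebalancing.
func needsRebalance(incremental, fromScratch map[string][]string) []string {
	var underallocated []string
	for node, scratchIPs := range fromScratch {
		if len(scratchIPs) > 2*len(incremental[node]) {
			underallocated = append(underallocated, node)
		}
	}
	return underallocated
}

func main() {
	// After an upgrade, the last node to reboot ended up with no egress IPs.
	incremental := map[string][]string{
		"node-a": {"10.0.0.10", "10.0.0.11", "10.0.0.12"},
		"node-b": {},
	}
	fromScratch := map[string][]string{
		"node-a": {"10.0.0.10", "10.0.0.11"},
		"node-b": {"10.0.0.12"},
	}
	fmt.Println(needsRebalance(incremental, fromScratch)) // [node-b]
}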

Comment 9 Alexander Constantinescu 2020-11-16 14:58:53 UTC
This is ready for testing. It has been integrated on master (i.e. 4.7) with PR https://github.com/openshift/ovn-kubernetes/pull/317, so I am setting it to MODIFIED. I am working on the back-port to 4.6.

Comment 15 Jason Boxman 2021-01-25 16:01:11 UTC
So does this need a docs update once it is merged?

Thanks!

Comment 16 Jason Boxman 2021-02-03 01:34:04 UTC
Docs update for this BZ:

https://github.com/openshift/openshift-docs/pull/28956

Is this okay?

Thanks!

Comment 17 acossett 2021-02-23 19:50:17 UTC
With the latest code change, why are BOTH IPs reassigned to new nodes when only one fails?

Initial working flow:
Active traffic flows from the egress pod attached to this EgressIP.
Pod-to-external traffic is NATed to 10.0.32.112 (all good, working; view the starting config below).

-----
The failover step is to shut down the node ovn-qgwkn-worker-canadacentral3-k4qk5 (= 10.0.32.112) and expect the traffic to then exit with 10.0.32.111 (as per the initial starting config).

Results:
IP .111 is re-assigned to another node automatically and the traffic flow is interrupted; you can see that both IPs no longer match the right nodes...

Result Config after the node shutdown
~/Documents/ocp4/ovn_egressip » oc get egressIP                                                                                                       
NAME            EGRESSIPS     ASSIGNED NODE                           ASSIGNED EGRESSIPS
egressip-test   10.0.32.111   ovn-qgwkn-worker-canadacentral2-rhswn   10.0.32.111
status:
  items:
  - egressIP: 10.0.32.111
    node: ovn-qgwkn-worker-canadacentral2-rhswn
  - egressIP: 10.0.32.112
    node: ovn-qgwkn-worker-canadacentral1-bskbf

--------------------------------------------------------------------
Starting Config
~/Documents/ocp4/ovn_egressip » oc get egressIP                                                                                                       
NAME            EGRESSIPS     ASSIGNED NODE                           ASSIGNED EGRESSIPS
egressip-test   10.0.32.111   ovn-qgwkn-worker-canadacentral1-bskbf   10.0.32.111

Node #1: ovn-qgwkn-worker-canadacentral1-bskbf = 10.0.32.111
Node #2: ovn-qgwkn-worker-canadacentral3-k4qk5 = 10.0.32.112

egressIP yaml :
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-test
spec:
  egressIPs:
  - 10.0.32.111
  - 10.0.32.112
  namespaceSelector:
    matchLabels:
      name: example-egressip1
status:
  items:
  - egressIP: 10.0.32.111
    node: ovn-qgwkn-worker-canadacentral1-bskbf
  - egressIP: 10.0.32.112
    node: ovn-qgwkn-worker-canadacentral3-k4qk5


Expectation (manual mode):
#1 10.0.32.111 should not be reassigned (left untouched), and pod traffic should start exiting with this IP.
#2 10.0.32.112 should be inactive until the node comes back, or be reassigned after roughly 5 minutes of inactivity?

Comment 19 errata-xmlrpc 2021-02-24 15:17:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633