Bug 1995887

Summary: [OVN]After reboot egress node, lr-policy-list was not correct, some duplicate records or missed internal IPs
Product: OpenShift Container Platform Reporter: huirwang
Component: Networking Assignee: Ben Bennett <bbennett>
Networking sub component: ovn-kubernetes QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: anbhat, andcosta, cldavey, openshift-bugs-escalate, tidawson
Version: 4.7   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Clone Of:
: 2034668 (view as bug list) Environment:
Last Closed: 2022-03-10 16:05:18 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2034668    

Description huirwang 2021-08-20 04:04:52 UTC
Description of problem:
After rebooting an egress node, the lr-policy-list was incorrect: some records were duplicated and some were missing internal IPs.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-19-184748

How reproducible:
Frequently 

Steps to Reproduce:
1. Label 3 egress nodes and create one EgressIP object
...
spec:
    egressIPs:
    - 172.31.248.103
    - 172.31.248.104
    - 172.31.248.105
    namespaceSelector:
      matchLabels:
        name: test
    podSelector: {}
  status:
    items:
    - egressIP: 172.31.248.103
      node: compute-1
    - egressIP: 172.31.248.104
      node: compute-2
    - egressIP: 172.31.248.105
      node: compute-0
...
2. Create two namespaces, label each namespace with name=test, then create 10 pods in each namespace.

3. Check the lr-policy-list 

sh-4.4#  ovn-nbctl lr-policy-list ovn_cluster_router  | grep "100 "
       100                             ip4.src == 10.128.2.45         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.46         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.47         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.48         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.49         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.50         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.31         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.32         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.33         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.34         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.35         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.51         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.52         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.53         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.54         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.55         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.56         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.57         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.58         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.59         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
sh-4.4#  ovn-nbctl lr-policy-list ovn_cluster_router  | grep "100 " | wc -l
20

4. Then reboot egress node compute-0

5. After the node is back to Ready, check lr-policy-list again

Actual results:
There are some duplicate records. For example, for the same pod IP one record has 3 internal IPs and another has 2 internal IPs:
      100                             ip4.src == 10.131.0.56         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
      100                             ip4.src == 10.131.0.56         reroute                100.64.0.6, 100.64.0.7

Some records are missing an internal IP. With only two internal IPs in the record, curl from the affected pods uses only 2 egress nodes, even though 3 egress nodes are available.
100                             ip4.src == 10.131.0.57         reroute                100.64.0.6, 100.64.0.7

$ oc rsh -n test2 test-rc-sbp6f
~ $  while true; do curl 172.31.249.80:9095;sleep 2; echo ""; done;
172.31.248.104
172.31.248.104
172.31.248.103
172.31.248.104
172.31.248.103
172.31.248.103
172.31.248.104
172.31.248.104
172.31.248.103
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
.....
172.31.248.104
172.31.248.104
172.31.248.103
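
To make the skew in the curl output above easier to see, the observed source IPs can be tallied against the configured egress IPs. This is only a sketch for offline analysis of captured output: the function names and the sample text are hypothetical, and the configured IP list is copied from the EgressIP object in the reproduction steps.

```python
from collections import Counter

# Egress IPs configured in the EgressIP object from the reproduction steps.
CONFIGURED_EGRESS_IPS = ["172.31.248.103", "172.31.248.104", "172.31.248.105"]


def tally_source_ips(curl_output):
    """Count how often each IP appears, assuming one IP per output line."""
    lines = (line.strip() for line in curl_output.splitlines())
    # Skip blank lines and elision markers such as "....."
    return Counter(line for line in lines if line and line[0].isdigit())


def unused_egress_ips(curl_output, configured=CONFIGURED_EGRESS_IPS):
    """Return the configured egress IPs that never show up in the output."""
    seen = tally_source_ips(curl_output)
    return [ip for ip in configured if seen[ip] == 0]


# Hypothetical excerpt in the same shape as the captured curl loop output.
sample = """\
172.31.248.104
172.31.248.104
172.31.248.103
.....
172.31.248.104
"""

print(tally_source_ips(sample))
print(unused_egress_ips(sample))
```

An egress IP that never appears over a long enough run suggests that its internal IP is absent from the pod's reroute policy.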

sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router  | grep "100 "
       100                             ip4.src == 10.128.2.45         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.46         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.47         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.48         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.49         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.128.2.50         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.31         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.32         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.33         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.34         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.129.2.35         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.51         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.51         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.52         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.52         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.53         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.53         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.54         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.54         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.55         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.55         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.56         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.56         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.57         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.58         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.58         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.59         reroute                100.64.0.6, 100.64.0.7
       100                             ip4.src == 10.131.0.59         reroute                100.64.0.5, 100.64.0.6, 100.64.0.7

sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router  | grep "100 " | wc -l
28
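
The listing above can also be audited mechanically rather than by eye. The following is a minimal sketch, assuming the output of `ovn-nbctl lr-policy-list ovn_cluster_router` has been saved as text and that every priority-100 policy is expected to carry 3 nexthops, as in this setup. The helper names `parse_policies` and `audit` are hypothetical and not part of any OVN tooling.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   100   ip4.src == 10.131.0.56   reroute   100.64.0.6, 100.64.0.7
POLICY_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+reroute\s+(.+?)\s*$")


def parse_policies(text):
    """Return a list of (priority, match, nexthops) tuples from saved output."""
    policies = []
    for line in text.splitlines():
        m = POLICY_RE.match(line)
        if m:
            nexthops = tuple(h.strip() for h in m.group(3).split(","))
            policies.append((int(m.group(1)), m.group(2).strip(), nexthops))
    return policies


def audit(text, expected_nexthops=3, priority=100):
    """Return (duplicates, incomplete) lists of match expressions.

    duplicates: matches that appear in more than one policy record.
    incomplete: matches whose every record has fewer nexthops than expected.
    """
    by_match = defaultdict(list)
    for prio, match, nexthops in parse_policies(text):
        if prio == priority:
            by_match[match].append(nexthops)
    duplicates = sorted(m for m, hops in by_match.items() if len(hops) > 1)
    incomplete = sorted(
        m for m, hops in by_match.items()
        if all(len(h) < expected_nexthops for h in hops)
    )
    return duplicates, incomplete


# Excerpt of the records reported after the reboot.
sample = """\
       100    ip4.src == 10.128.2.45    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.56    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.56    reroute    100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.57    reroute    100.64.0.6, 100.64.0.7
"""

duplicates, incomplete = audit(sample)
print("duplicate matches:", duplicates)
print("matches missing a nexthop:", incomplete)
```

On this excerpt, 10.131.0.56 is reported as duplicated and 10.131.0.57 as missing a nexthop, which corresponds to the two symptoms described under Actual results.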

Expected results:
After an egress node reboots, lr-policy-list should contain the same records as before the reboot.
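
One way to check this expectation is to save the lr-policy-list output before and after the reboot and compare the two snapshots as sets of records. The sketch below assumes both snapshots are plain-text captures of the command output; `policy_set` and `diff_snapshots` are hypothetical helper names.

```python
def policy_set(text):
    """Parse saved lr-policy-list output into a set of comparable records."""
    records = set()
    for line in text.splitlines():
        if " reroute " not in line:
            continue
        left, right = line.split(" reroute ", 1)
        fields = left.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        priority, match = fields
        match = " ".join(match.split())  # normalize column padding
        # Sort nexthops so that their listing order does not matter.
        nexthops = tuple(sorted(h.strip() for h in right.split(",")))
        records.add((int(priority), match, nexthops))
    return records


def diff_snapshots(before, after):
    """Return (missing, unexpected) records relative to the 'before' snapshot."""
    b, a = policy_set(before), policy_set(after)
    return sorted(b - a), sorted(a - b)


# Excerpts in the shape of the captures taken before and after the reboot.
before = """\
       100    ip4.src == 10.131.0.56    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.57    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7
"""
after = """\
       100    ip4.src == 10.131.0.56    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.56    reroute    100.64.0.6, 100.64.0.7
       100    ip4.src == 10.131.0.57    reroute    100.64.0.6, 100.64.0.7
"""

missing, unexpected = diff_snapshots(before, after)
print("missing after reboot:", missing)
print("unexpected after reboot:", unexpected)
```

Because a set collapses byte-identical records, this comparison catches the duplicates seen here (same match, different nexthops) but would not flag two fully identical rows. Both lists being empty corresponds to the expected result.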

Additional info:
Moreover, after deleting the two namespaces, test and test2 (which removes all test pods), some lr-policy-list records are still left behind.

The workaround is to restart the ovnkube-master pods.

Comment 1 Alexander Constantinescu 2021-10-12 10:41:11 UTC
Hi Huiran

I can't remember exactly but I get the feeling this problem was linked to https://bugzilla.redhat.com/show_bug.cgi?id=1973215 

1973215 was fixed on 4.9 before code freeze, so could you try to reproduce this problem with the latest version of 4.9 to verify if they indeed are duplicates? 

If the problem has not been resolved on 4.9: could you provide a kubeconfig / must-gather? 

Thanks in advance!

Comment 7 Alexander Constantinescu 2021-12-22 13:10:49 UTC
*** Bug 2034790 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-03-10 16:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 17 Red Hat Bugzilla 2023-09-18 04:25:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days