Description of problem:
After rebooting an egress node, the lr-policy-list is not correct: some records are duplicated and some are missing internal IPs.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-19-184748

How reproducible:
Frequently

Steps to Reproduce:
1. Label 3 egress nodes and create one egressip object (a command sketch for steps 1-2 follows the steps):
...
spec:
  egressIPs:
  - 172.31.248.103
  - 172.31.248.104
  - 172.31.248.105
  namespaceSelector:
    matchLabels:
      name: test
  podSelector: {}
status:
  items:
  - egressIP: 172.31.248.103
    node: compute-1
  - egressIP: 172.31.248.104
    node: compute-2
  - egressIP: 172.31.248.105
    node: compute-0
...
2. Create two namespaces, label both with name=test, then create 10 pods in each namespace.
3. Check the lr-policy-list:
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 "
100 ip4.src == 10.128.2.45 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.46 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.47 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.48 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.49 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.50 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.31 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.32 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.33 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.34 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.35 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.51 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.52 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.53 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.54 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.55 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.56 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.57 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.58 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.59 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 " | wc -l
20
4. Reboot egress node compute-0.
5. After the node is back to Ready, check the lr-policy-list again.
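For reference, a minimal command sketch for steps 1-2, assuming the standard OVN-Kubernetes egress-assignable node label; the file name egressip.yaml, the deployment name, and the pod image are placeholders for illustration, while node and namespace names are taken from the report above:

$ oc label node compute-0 k8s.ovn.org/egress-assignable=""
$ oc label node compute-1 k8s.ovn.org/egress-assignable=""
$ oc label node compute-2 k8s.ovn.org/egress-assignable=""
$ oc create -f egressip.yaml                                   # the EgressIP object shown above
$ oc create ns test  && oc label ns test  name=test
$ oc create ns test2 && oc label ns test2 name=test
$ oc -n test  create deployment test-pods --image=quay.io/openshift/origin-hello-openshift --replicas=10   # example image
$ oc -n test2 create deployment test-pods --image=quay.io/openshift/origin-hello-openshift --replicas=10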
Actual results:
There are duplicate records, e.g. one entry with 3 internal IPs and another with only 2 internal IPs:
100 ip4.src == 10.131.0.56 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.56 reroute 100.64.0.6, 100.64.0.7
Some records are missing an internal IP and only list two internal IPs; curl from the related pods only uses 2 egress nodes, even though 3 egress nodes are available:
100 ip4.src == 10.131.0.57 reroute 100.64.0.6, 100.64.0.7
$ oc rsh -n test2 test-rc-sbp6f
~ $ while true; do curl 172.31.249.80:9095; sleep 2; echo ""; done;
172.31.248.104
172.31.248.104
172.31.248.103
172.31.248.104
172.31.248.103
172.31.248.103
172.31.248.104
172.31.248.104
172.31.248.103
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
172.31.248.104
.....
172.31.248.104
172.31.248.104
172.31.248.103
Full lr-policy-list after the reboot:
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 "
100 ip4.src == 10.128.2.45 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.46 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.47 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.48 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.49 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.128.2.50 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.31 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.32 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.33 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.34 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.129.2.35 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.51 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.51 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.52 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.52 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.53 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.53 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.54 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.54 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.55 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.55 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.56 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.56 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.57 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.58 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.58 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.59 reroute 100.64.0.6, 100.64.0.7
100 ip4.src == 10.131.0.59 reroute 100.64.0.5, 100.64.0.6, 100.64.0.7
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 " | wc -l
28

Expected results:
After the egress node reboots, the lr-policy-list should have the same records as before the reboot.

Additional info:
Moreover, after deleting the two namespaces test and test2 (i.e. all test pods are gone), some lr-policy-list entries are still left behind. The workaround is to restart the ovnkube-master pods.
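For reference, a minimal sketch of that workaround, assuming the ovnkube-master pods in the openshift-ovn-kubernetes namespace carry the app=ovnkube-master label (the label selector may differ by release):

$ oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-master
# after the master pods come back, re-check the policies from an ovnkube pod:
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 " | wc -l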
Hi Huiran,

I can't remember exactly, but I have the feeling this problem was linked to https://bugzilla.redhat.com/show_bug.cgi?id=1973215. Bug 1973215 was fixed on 4.9 before code freeze, so could you try to reproduce this problem with the latest 4.9 build to verify whether they are indeed duplicates?

If the problem has not been resolved on 4.9, could you provide a kubeconfig / must-gather? Thanks in advance!
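For reference, a sketch of collecting the requested data; the gather_network_logs collection script is an assumption and may not be available in every must-gather image:

$ oc get clusterversion                        # confirm the exact 4.9 build being tested
$ oc adm must-gather                           # default collection, includes OVN-Kubernetes pod logs
$ oc adm must-gather -- gather_network_logs    # network-focused collection, if the script is present in the image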
*** Bug 2034790 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days