Bug 2029742 - [ovn] Stale lr-policy-list and snat rules left for egressip
Summary: [ovn] Stale lr-policy-list and snat rules left for egressip
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: ffernand
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks: 2034477 2048841
Reported: 2021-12-07 08:28 UTC by huirwang
Modified: 2022-04-05 10:20 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:32:20 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 910 | 0 | None | Merged | Bug 2029742: egressip: fix usage of clientModel doAfter | 2022-01-19 18:21:30 UTC
Github | ovn-org ovn-kubernetes pull 2735 | 0 | None | Merged | egressip: fix usage of clientModel doAfter | 2022-01-12 14:55:35 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:32:41 UTC

Description huirwang 2021-12-07 08:28:57 UTC
Description of problem:
NAT rules for egressip were not cleared, even after restarting the ovnkube-master pods

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-06-201335

How reproducible:


Steps to Reproduce:
1. Label one node as egress node
2. Create egressip object
 oc get egressip -o yaml
....
  spec:
    egressIPs:
    - 172.31.249.117
    namespaceSelector:
      matchLabels:
        org: pm
    podSelector: {}
  status:
    items:
    - egressIP: 172.31.249.117
      node: compute-0
....
3. Create namespace ds36l with 10 pods in it, and label the namespace with org=pm
4. Scale the CNO to 0
 oc scale deployment network-operator -n openshift-network-operator --replicas 0
5. Delete the ovnkube-master ds, then scale the test pods down to 1 replica
6. Scale the CNO to 1
 oc scale deployment network-operator -n openshift-network-operator --replicas 1
deployment.apps/network-operator scaled

7. Check the lr-policy-list and snat entries
 ovn-nbctl lr-policy-list ovn_cluster_router | grep "100 "
       100                             ip4.src == 10.128.2.28         reroute                100.64.0.6
       100                             ip4.src == 10.128.2.29         reroute                100.64.0.6
       100                             ip4.src == 10.128.2.30         reroute                100.64.0.6
       100                             ip4.src == 10.128.2.31         reroute                100.64.0.6
       100                             ip4.src == 10.128.2.32         reroute                100.64.0.6
       100                             ip4.src == 10.128.2.33         reroute                100.64.0.6
       100                             ip4.src == 10.131.0.22         reroute                100.64.0.6
       100                             ip4.src == 10.131.0.23         reroute                100.64.0.6
       100                             ip4.src == 10.131.0.24         reroute                100.64.0.6
       100                             ip4.src == 10.131.0.25         reroute                100.64.0.6

8. The NAT rules are not correct. They cover more than the 10 pod IPs above, because I had already run some egressip regression tests on this cluster and had repeated this case a couple of times.
sh-4.4# ovn-nbctl --format=csv --no-heading find nat external_ids:name=egressip
e371ed02-2d3d-4f47-83f8-e47327323a16,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.30""",k8s-compute-0,"{stateless=""false""}",snat
bd532783-71ed-4e7f-81e3-05c3c13e189a,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.131.0.21""",k8s-compute-0,"{stateless=""false""}",snat
32135123-a84b-4918-88ef-cb20003fbd04,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.128.2.14""",k8s-compute-0,"{stateless=""false""}",snat
4da801e4-c9c5-4220-929d-c2e8e504dd24,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.131.0.24""",k8s-compute-0,"{stateless=""false""}",snat
78eaece4-766e-4358-bfbb-ac6c1e210d05,[],[],{name=egressip},"""172.31.248.53""",[],"""""","""10.128.2.64""",k8s-compute-0,"{stateless=""false""}",snat
ca7582a5-a7f1-4bed-97e0-9a518e91d558,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.128.2.13""",k8s-compute-0,"{stateless=""false""}",snat
7b1fdd26-1f81-408c-8cad-6d9fcd03731f,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.29""",k8s-compute-0,"{stateless=""false""}",snat
e008c686-44f8-49ef-9690-845567b800db,[],[],{name=egressip},"""172.31.248.53""",[],"""""","""10.128.2.65""",k8s-compute-0,"{stateless=""false""}",snat
727c0dc6-fc19-4c6a-b890-e57b97dacd64,[],[],{name=egressip},"""172.31.248.53""",[],"""""","""10.128.2.61""",k8s-compute-0,"{stateless=""false""}",snat
46231587-9d93-40b0-a9b6-db90383ca133,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.131.0.10""",k8s-compute-0,"{stateless=""false""}",snat
9ee85407-8a48-4963-9bac-d734c674bdff,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.10""",k8s-compute-0,"{stateless=""false""}",snat
ed59f208-8503-4763-a12c-23d44306d13a,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.131.0.9""",k8s-compute-0,"{stateless=""false""}",snat
5cc6fac9-a3b3-4355-a704-5c796746951d,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.128.2.18""",k8s-compute-0,"{stateless=""false""}",snat
42cdda4b-3d9b-4d31-80f4-30cfc6863283,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.32""",k8s-compute-0,"{stateless=""false""}",snat
8462e02c-c2ba-4e9a-83ae-2dffbd0019f2,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.128.2.16""",k8s-compute-0,"{stateless=""false""}",snat
4bcd434c-a9cf-4b44-968c-e33e84015f3b,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.128.2.17""",k8s-compute-0,"{stateless=""false""}",snat
535535d0-db85-435c-8acb-2362688a20a9,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.131.0.23""",k8s-compute-0,"{stateless=""false""}",snat
19c316b2-5a5e-44c2-bc3e-dfe95cdf7596,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.14""",k8s-compute-0,"{stateless=""false""}",snat
d25eb913-3287-4650-9f3a-05110508e25d,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.9""",k8s-compute-0,"{stateless=""false""}",snat
55b8da53-fc85-4092-b985-3c18411f6a88,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.131.0.15""",k8s-compute-0,"{stateless=""false""}",snat
e3256ad4-63d1-4f02-a114-e3539129139a,[],[],{name=egressip},"""172.31.248.53""",[],"""""","""10.128.2.63""",k8s-compute-0,"{stateless=""false""}",snat
4339efb5-b75a-429a-8f0b-b11f93f97686,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.31""",k8s-compute-0,"{stateless=""false""}",snat
ac1fdc15-f300-4fa5-a271-74982450601a,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.13""",k8s-compute-0,"{stateless=""false""}",snat
d93ec1b5-9774-4422-b0ee-8334354d2d32,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.131.0.25""",k8s-compute-0,"{stateless=""false""}",snat
5190eef6-335f-4a98-ac68-e7e033aafde4,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.12""",k8s-compute-0,"{stateless=""false""}",snat
c9b4eb58-d2d0-4b55-a2bc-c246f536eb1f,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.131.0.22""",k8s-compute-0,"{stateless=""false""}",snat
a0628f39-ee24-4e73-9bc7-f93e3f83410e,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.131.0.23""",k8s-compute-0,"{stateless=""false""}",snat
9217a2f9-06cb-4e56-94b0-d5be7dc808fc,[],[],{name=egressip},"""172.31.248.212""",[],"""""","""10.128.2.11""",k8s-compute-0,"{stateless=""false""}",snat
4b0d787b-facb-4119-92c6-b6b1d66a4788,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.28""",k8s-compute-0,"{stateless=""false""}",snat
4ac19f7d-9c7f-4982-b0f0-ad305170fbaa,[],[],{name=egressip},"""172.31.248.53""",[],"""""","""10.128.2.60""",k8s-compute-0,"{stateless=""false""}",snat
ac8a781e-0d17-4928-89a4-0f809720aa04,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.131.0.22""",k8s-compute-0,"{stateless=""false""}",snat
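To make the stale entries in the output above easier to spot, the CSV can be filtered by script. The following is a minimal sketch, not part of the original report. It assumes the column positions seen in the rows above (fifth field is the external IP, eighth field is the logical IP), and the function names, the expected egress IP argument, and the set of live pod IPs are hypothetical inputs the caller supplies.

```python
import csv
import io

# Three rows copied verbatim from the ovn-nbctl output above.
SAMPLE = '''e371ed02-2d3d-4f47-83f8-e47327323a16,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.30""",k8s-compute-0,"{stateless=""false""}",snat
bd532783-71ed-4e7f-81e3-05c3c13e189a,[],[],{name=egressip},"""172.31.248.78""",[],"""""","""10.131.0.21""",k8s-compute-0,"{stateless=""false""}",snat
4b0d787b-facb-4119-92c6-b6b1d66a4788,[],[],{name=egressip},"""172.31.249.117""",[],"""""","""10.128.2.28""",k8s-compute-0,"{stateless=""false""}",snat
'''


def parse_nat_rows(csv_text):
    """Return (external_ip, logical_ip) pairs from the CSV text.

    Column positions are an assumption read off the sample rows:
    index 4 holds the external IP and index 7 the logical IP.
    """
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 8:
            continue  # skip blank or truncated lines
        # The values keep one layer of literal double quotes after CSV decoding.
        pairs.append((row[4].strip('"'), row[7].strip('"')))
    return pairs


def find_stale(pairs, expected_egress_ip, live_pod_ips):
    """Return pairs whose egress IP or pod IP is no longer expected."""
    return [
        (ext, log)
        for ext, log in pairs
        if ext != expected_egress_ip or log not in live_pod_ips
    ]
```

For example, `find_stale(parse_nat_rows(SAMPLE), "172.31.249.117", {"10.128.2.28"})` would flag the first two sample rows, where the live pod set `{"10.128.2.28"}` is an illustrative value rather than one taken from the cluster.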



Actual results:
Stale lr-policy-list and snat rules are left behind


Expected results:
No stale lr-policy-list or snat rules are left behind

Additional info:

Comment 3 ffernand 2021-12-21 23:13:34 UTC
Stale rows are not getting deleted because of the following error in the transaction:

I1221 22:49:13.197021      37 model_client.go:313] Delete operations generated as: [{Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {a97cd20d-c22b-4a99-882f-e19ebf9d\
6af7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]


E1221 22:49:13.197075      37 egressip.go:877] XXX syncStaleEgressReroutePolicy will delete egressip: {a97cd20d-c22b-4a99-882f-e19ebf9d6af7 reroute map[name:egressip] ip4.src == 10.244.1.27 <nil> [100.64.0.4] map[] 100}
I1221 22:49:13.197388      37 model_client.go:304] Mutate operations generated as: [{Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:delete Value:{GoSet:[{GoUUID:a97cd20d-c22b-4a99-882f-e19\
ebf9d6af7}]}}] Timeout:0 Where:[where column _uuid == {4a681460-5950-463c-9fb9-745721734569}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I1221 22:49:13.197447      37 transact.go:41] Configuring OVN: [{Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {891376ae-6907-4ba9-b72f-d476ebaeb5c6}] Until: Durabl\
e:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {a97cd20d-c22b-4a99-882f-e19ebf9d6af7}] Until: Durable:<nil> Comment:<nil\
> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:delete Value:{GoSet:[{GoUUID:a97cd20d-c22b-4a99-882f-e19ebf9d6af7}]}}] Timeout:0 Where:[where column _uuid == {4a681\
460-5950-463c-9fb9-745721734569}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I1221 22:49:13.197651      37 client.go:726]  "msg"="transacting operations"  "database"="OVN_Northbound" "operations"="[{Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid\
 == {891376ae-6907-4ba9-b72f-d476ebaeb5c6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {a97cd20d-c22b-4\
a99-882f-e19ebf9d6af7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:delete Value:{GoSet:[{GoUUID:a97cd20d-c22b-4a99-882f-e19ebf\
9d6af7}]}}] Timeout:0 Where:[where column _uuid == {4a681460-5950-463c-9fb9-745721734569}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"



E1221 22:49:13.198459      37 egressip.go:895] Unable to remove stale logical router policies, err: error in transact with ops [{Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where colum\
n _uuid == {891376ae-6907-4ba9-b72f-d476ebaeb5c6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Router_Policy Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {a97cd20d\
-c22b-4a99-882f-e19ebf9d6af7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:delete Value:{GoSet:[{GoUUID:a97cd20d-c22b-4a99-882f\
-e19ebf9d6af7}]}}] Timeout:0 Where:[where column _uuid == {4a681460-5950-463c-9fb9-745721734569}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details\
: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Router_Policy row 891376ae-6907-4ba9-b72f-d476ebaeb5c6 because of 1 remaining referen\
ce(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Router_Policy row 891376ae-6907-4ba9-b72f-d476ebaeb5c6 because of 1 remaining reference(s)


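Something is wrong in the delete mutation. The transaction deletes two Logical_Router_Policy rows (891376ae-6907-4ba9-b72f-d476ebaeb5c6 and a97cd20d-c22b-4a99-882f-e19ebf9d6af7), but the mutation on Logical_Router only removes a97cd20d-c22b-4a99-882f-e19ebf9d6af7 from the policies column, so 891376ae-6907-4ba9-b72f-d476ebaeb5c6 is still referenced when it is deleted. To be investigated further.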
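The failure can be reproduced on paper with a toy model of the check. This is a minimal sketch, not how OVSDB or ovn-kubernetes implements it. The function name is hypothetical, and it assumes the router referenced both policy rows before the transaction, which the log only confirms indirectly.

```python
def dangling_deletes(referenced, deleted, dereferenced):
    """Return deleted row UUIDs that would still be referenced at commit.

    referenced:   UUIDs in the router's policies column before the transaction
    deleted:      UUIDs removed by the delete operations
    dereferenced: UUIDs removed from the policies column by the mutate operation
    """
    still_referenced = set(referenced) - set(dereferenced)
    return sorted(set(deleted) & still_referenced)


# UUIDs taken from the log above.
POLICY_A = "891376ae-6907-4ba9-b72f-d476ebaeb5c6"
POLICY_B = "a97cd20d-c22b-4a99-882f-e19ebf9d6af7"

# Shape of the failing transaction: two deletes, but only one UUID in the mutation.
failing = dangling_deletes(
    referenced=[POLICY_A, POLICY_B],
    deleted=[POLICY_A, POLICY_B],
    dereferenced=[POLICY_B],
)

# Hypothetical corrected shape: every deleted row is also dereferenced.
fixed = dangling_deletes(
    referenced=[POLICY_A, POLICY_B],
    deleted=[POLICY_A, POLICY_B],
    dereferenced=[POLICY_A, POLICY_B],
)
```

In this model `failing` contains only the 891376ae row, which matches the row named in the referential integrity violation, while `fixed` is empty.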

Comment 11 errata-xmlrpc 2022-03-10 16:32:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

