Bug 2023220 - ACL for a deleted egressfirewall still present on node join switch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Riccardo Ravaioli
QA Contact: huirwang
URL:
Whiteboard:
Depends On: 2023216
Blocks: 2011666
 
Reported: 2021-11-15 09:03 UTC by Riccardo Ravaioli
Modified: 2021-12-13 12:06 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2023216
Environment:
Last Closed: 2021-12-13 12:06:24 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:5003 0 None Closed Intermittent DNS lookup errors for cluster.local addresses 2022-06-16 13:24:39 UTC

Comment 2 Riccardo Ravaioli 2021-12-01 17:10:46 UTC
Moving to MODIFIED, since the code in 4.9 cannot trigger the issue found in 4.8 and 4.7.

What was said for the 4.10 BZ applies also here:

  "with respect to the original bug found in 4.7 and 4.8 (#2011666), in 4.9 the implementation of the egress firewall feature changed and the two issues found in the code in 4.7 and 4.8 are already addressed: 

(1) when updating an existing egress firewall, we are no longer adding and then removing a temporary ACL with external ID egressFirewall=$NS-blockAll, blocking outgoing traffic from all pods in the namespace. 

(2) the syncEgressFirewall method already makes sure that all egress firewall ACLs in OVN correspond to egress firewalls in the API server."


In addition to that, syncEgressFirewall in 4.9 also takes care of cleanup after switching the gateway mode from local to shared (shared to local is not currently supported): it deletes any stale ACLs that might be carried over when upgrading from a 4.8.z cluster in local gateway mode showing this issue to a 4.9.z cluster in shared gateway mode.
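To see what syncEgressFirewall is reconciling, the ACLs in the NB database can also be compared by hand against the EgressFirewalls known to the API server. This is only an illustrative sketch of that comparison, not the actual implementation:

```shell
# Dump the external IDs of all ACLs and look for any carrying an
# egressFirewall tag; each tag should correspond to an EgressFirewall
# object in the API server.
ovn-nbctl --columns=external_ids list acl | grep egressFirewall

# List the EgressFirewalls the API server actually knows about.
kubectl get egressfirewall --all-namespaces
```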

In light of this, verification can proceed in the following way, similarly to what was suggested for the 4.10 bug but with one more subtlety, as explained below.


In order to verify that this issue is solved in 4.9, we should make sure that all EgressFirewall ACLs at startup correspond to actual EgressFirewalls in the API server. So let's cover two cases:

*** case 1: no EgressFirewall, shared gw mode
- ssh into the ovn-k master pod and add spurious ACLs to the node logical switches (as used in local gateway mode) and to the join switch (as used in shared gateway mode):

ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker2 acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-control-plane acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch join acls @acl

In the example above, I used "ovn-worker", "ovn-worker2" and "ovn-control-plane" as node switches (the node names of a kind cluster; substitute your own node names); I also added ACLs to the node logical switches, simulating ACLs carried over from an upgrade plus a gateway mode switch from local to shared.

- delete ovn-k master pod
- wait for ovn-k master pod to be up again, then ssh into ovn-k master and verify that the ACLs from above have been deleted: 
  ovn-nbctl list acl
  ovn-nbctl acl-list $node
  ovn-nbctl acl-list join
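As a more targeted check, the spurious ACLs can be searched for directly by the external ID used above (a suggested query; "default-blockAll" is the value set in this example):

```shell
# After the ovn-k master pod has restarted, this should print nothing,
# since all ACLs tagged egressFirewall=default-blockAll must have been
# cleaned up at startup:
ovn-nbctl --columns=_uuid,match find acl external_ids:egressFirewall=default-blockAll
```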

*** case 2: with an EgressFirewall, shared gw mode
- add a simple egressfirewall, like:

$ cat ef.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress: 
  - type: Allow
    to:
      cidrSelector: 8.8.8.0/24
  - type: Allow
    to:
      dnsName: github.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0

$ kubectl apply -f ef.yaml
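For case 2, a verification analogous to case 1 could look like this (a sketch, assuming the regular egress firewall ACLs are tagged with the namespace name, here "default", by analogy with the $NS-blockAll naming mentioned above):

```shell
# Restart the ovn-k master pod, wait for it to come back, then check
# that the ACLs implementing the EgressFirewall are still present:
ovn-nbctl --columns=priority,action,match find acl external_ids:egressFirewall=default

# ...and that no spurious blockAll ACLs survive:
ovn-nbctl --columns=_uuid find acl external_ids:egressFirewall=default-blockAll
```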


Lastly, we can repeat the steps above for local gateway mode, keeping in mind that:
- moving from shared to local is not currently allowed
- consequently, spurious ACLs on the "join" switch cannot be carried over from shared gw mode if we are currently in local gw mode.

In this scenario, we can simply test that spurious ACLs on node logical switches (thus excluding the join switch) get deleted after a restart:

ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker2 acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-control-plane acls @acl
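As in case 1, after deleting the ovn-k master pod and waiting for it to come back up, the node switches should be clean (switch names are the kind example names used above):

```shell
# None of the node switches should still carry the drop ACL matching
# "1.2.3.0/24" after the restart:
for sw in ovn-worker ovn-worker2 ovn-control-plane; do
  echo "== $sw =="
  ovn-nbctl acl-list "$sw"
done
```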

Comment 7 errata-xmlrpc 2021-12-13 12:06:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.11 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5003

