Bug 2023220 - ACL for a deleted egressfirewall still present on node join switch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Riccardo Ravaioli
QA Contact: huirwang
URL:
Whiteboard:
Depends On: 2023216
Blocks: 2011666
 
Reported: 2021-11-15 09:03 UTC by Riccardo Ravaioli
Modified: 2021-12-13 12:06 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2023216
Environment:
Last Closed: 2021-12-13 12:06:24 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:5003 0 None Closed Intermittent DNS lookup errors for cluster.local addresses 2022-06-16 13:24:39 UTC

Comment 2 Riccardo Ravaioli 2021-12-01 17:10:46 UTC
Moving to MODIFIED, since the code in 4.9 cannot trigger the issue found in 4.8 and 4.7.

What was said for the 4.10 BZ applies also here:

  "with respect to the original bug found in 4.7 and 4.8 (#2011666), in 4.9 the implementation of the egress firewall feature changed and the two issues found in the code in 4.7 and 4.8 are already addressed: 

(1) when updating an existing egress firewall, we are no longer adding and then removing a temporary ACL with external ID egressFirewall=$NS-blockAll, blocking outgoing traffic from all pods in the namespace. 

(2) the syncEgressFirewall method already makes sure that all egress firewall ACLs in OVN correspond to egress firewalls in the API server."


In addition to that, syncEgressFirewall in 4.9 also takes care of cleanup after switching the gateway mode from local to shared (shared to local is not currently supported): it deletes any stale ACLs that might be carried over when upgrading from a 4.8.z cluster in local gateway mode showing this issue to a 4.9.z cluster in shared gateway mode.
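To see what syncEgressFirewall is reconciling, the ACLs in the NB database can also be compared by hand against the EgressFirewalls known to the API server. This is only an illustrative sketch of that comparison, not the actual implementation:

```shell
# Dump the external IDs of all ACLs and look for any carrying an
# egressFirewall tag; each tag should correspond to an EgressFirewall
# object in the API server.
ovn-nbctl --columns=external_ids list acl | grep egressFirewall

# List the EgressFirewalls the API server actually knows about.
kubectl get egressfirewall --all-namespaces
```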

In light of this, verification can proceed in the following way, similarly to what was suggested for the 4.10 bug but with one more subtlety, as explained below.


In order to verify that this issue is solved in 4.9, we should make sure that all EgressFirewall ACLs at startup correspond to actual EgressFirewalls in the API server. So let's cover two cases:

*** case 1: no EgressFirewall, shared gw mode
- ssh into the ovn-k master pod and add spurious ACLs to the node logical switches (as used in local gateway mode) and to the join switch (as used in shared gateway mode):

ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker2 acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-control-plane acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch join acls @acl

In the example above, I used "ovn-worker", "ovn-worker2" and "ovn-control-plane" as node switches (the node names of a kind cluster; substitute your own node names); I also added ACLs to the node logical switches, simulating ACLs carried over from an upgrade plus a gateway mode switch from local to shared.

- delete ovn-k master pod
- wait for ovn-k master pod to be up again, then ssh into ovn-k master and verify that the ACLs from above have been deleted: 
  ovn-nbctl list acl
  ovn-nbctl acl-list $node
  ovn-nbctl acl-list join
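As a more targeted check, the spurious ACLs can be searched for directly by the external ID used above (a suggested query; "default-blockAll" is the value set in this example):

```shell
# After the ovn-k master pod has restarted, this should print nothing,
# since all ACLs tagged egressFirewall=default-blockAll must have been
# cleaned up at startup:
ovn-nbctl --columns=_uuid,match find acl external_ids:egressFirewall=default-blockAll
```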

*** case 2: with an EgressFirewall, shared gw mode
- add a simple egressfirewall, like:

$ cat ef.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress: 
  - type: Allow
    to:
      cidrSelector: 8.8.8.0/24
  - type: Allow
    to:
      dnsName: github.com
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0

$ kubectl apply -f ef.yaml
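For case 2, a verification analogous to case 1 could look like this (a sketch, assuming the regular egress firewall ACLs are tagged with the namespace name, here "default", by analogy with the $NS-blockAll naming mentioned above):

```shell
# Restart the ovn-k master pod, wait for it to come back, then check
# that the ACLs implementing the EgressFirewall are still present:
ovn-nbctl --columns=priority,action,match find acl external_ids:egressFirewall=default

# ...and that no spurious blockAll ACLs survive:
ovn-nbctl --columns=_uuid find acl external_ids:egressFirewall=default-blockAll
```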


Lastly, we can repeat the steps above for local gateway mode, keeping in mind that:
- moving from shared to local is not currently allowed
- consequently, spurious ACLs on the "join" switch cannot be carried over from shared gw mode if we are currently in local gw mode.

In this scenario, we can simply test that spurious ACLs on node logical switches (thus excluding the join switch) get deleted after a restart:

ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-worker2 acls @acl
ovn-nbctl --id=@acl create acl action=drop direction=to-lport priority=10000 match="1.2.3.0/24" external-ids:egressFirewall=default-blockAll -- add logical_switch ovn-control-plane acls @acl
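As in case 1, after deleting the ovn-k master pod and waiting for it to come back up, the node switches should be clean (switch names are the kind example names used above):

```shell
# None of the node switches should still carry the drop ACL matching
# "1.2.3.0/24" after the restart:
for sw in ovn-worker ovn-worker2 ovn-control-plane; do
  echo "== $sw =="
  ovn-nbctl acl-list "$sw"
done
```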

Comment 7 errata-xmlrpc 2021-12-13 12:06:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.11 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5003

