Bug 2017650 - [OVN] EgressFirewall cannot be applied correctly if the cluster has Windows nodes
Summary: [OVN] EgressFirewall cannot be applied correctly if the cluster has Windows nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Surya Seetharaman
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-27 06:13 UTC by huirwang
Modified: 2022-03-10 16:22 UTC
CC: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:22:12 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub openshift/ovn-kubernetes pull 908: "Bug 2017650: EF: Pull up switch names from cache" (open, last updated 2022-01-14 20:51:44 UTC)
- Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:22:25 UTC)

Description huirwang 2021-10-27 06:13:08 UTC
Description of problem:
[OVN] EgressFirewall cannot be applied correctly if the cluster has Windows nodes.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-10-21-232712

How reproducible:


Steps to Reproduce:
1. Set up a vSphere cluster with a flexy job using the profile "73_IPI on vSphere 7.0 & OVN & WindowsContainer".

2. Create a test project and apply the following EgressFirewall in it:
kind: EgressFirewall
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.badiu.com 
  - type: Allow
    to:
      dnsName: yahoo.com
    ports:
      - protocol: TCP
        port: 80
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
3. Check the EgressFirewall status.

Actual results:
$ oc get egressfirewall -n test
NAME      EGRESSFIREWALL STATUS
default   EgressFirewall Rules not correctly added

I1026 10:38:38.047876       1 egressfirewall.go:210] Adding egressFirewall default in namespace test
2021-10-26T10:38:38.061Z|02429|nbctl|INFO|Running command run -- create address_set name=a5773229689678011375 external-ids:name=www.badiu.com_v4
2021-10-26T10:38:38.072Z|02430|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-master-0 acls @acl
2021-10-26T10:38:38.083Z|02431|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-worker-rlj72 acls @acl
2021-10-26T10:38:38.095Z|02432|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-worker-8pd4l acls @acl
2021-10-26T10:38:38.106Z|02433|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch winworker-mff7b acls @acl
E1026 10:38:38.106660       1 ovn.go:893] error executing create ACL command, stderr: "ovn-nbctl: no row \"winworker-mff7b\" in table Logical_Switch\n", OVN command '/usr/bin/ovn-nbctl --timeout=15 --id=@acl create acl priority=9999 direction=to-lport match="(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14" action=allow external-ids:egressFirewall=test -- add logical_switch winworker-mff7b acls @acl' failed: exit status 1
I1026 10:38:38.106699       1 kube.go:131] Updating status on EgressFirewall default in namespace test


hello-pod is located on a Linux node:
# oc get pod -n test -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP             NODE                               NOMINATED NODE   READINESS GATES
hello-pod   1/1     Running   0          20h   10.128.2.151   huirwang1026a-46x5j-worker-8pd4l   <none>           <none>

$ oc rsh -n test hello-pod
/ # curl -I www.google.com
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Date: Wed, 27 Oct 2021 06:11:51 GMT
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Expires: Wed, 27 Oct 2021 06:11:51 GMT
Cache-Control: private
Set-Cookie: 1P_JAR=2021-10-27-06; expires=Fri, 26-Nov-2021 06:11:51 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=511=LCkmCuPBsCzQ4rBD-NJw4t9TW1YslnqffNuY4mFS5xTg5hTBtVT53rlKOeKlTE1anRSM6Pa3-jUt6ML52lBpl_dtql3O8S2kb06U8NKCOKgtOUXKgKDMyL4T--WK7p8aqtz2-JLrJU7kazn6_THsMT2lJM4tceHdZFAuXlaTUK4; expires=Thu, 28-Apr-2022 06:11:51 GMT; path=/; domain=.google.com; HttpOnly

Note that the curl above succeeds even though the Deny 0.0.0.0/0 rule should block www.google.com, which confirms the rules were not applied. The cluster has mixed Linux and Windows nodes:
$ oc get nodes -o wide
NAME                               STATUS   ROLES    AGE   VERSION                       INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
huirwang1026a-46x5j-master-0       Ready    master   22h   v1.20.0+bbbc079               172.31.249.90    172.31.249.90    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-master-1       Ready    master   22h   v1.20.0+bbbc079               172.31.249.59    172.31.249.59    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-master-2       Ready    master   22h   v1.20.0+bbbc079               172.31.249.92    172.31.249.92    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-worker-8pd4l   Ready    worker   22h   v1.20.0+bbbc079               172.31.249.16    172.31.249.16    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-worker-rlj72   Ready    worker   22h   v1.20.0+bbbc079               172.31.249.22    172.31.249.22    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
winworker-mff7b                    Ready    worker   22h   v1.20.0-1081+d0b1ad449a08b3   172.31.249.219   172.31.249.219   Windows Server Standard                                        10.0.19041.508                 docker://20.10.7
winworker-wz8f5                    Ready    worker   22h   v1.20.0-1081+d0b1ad449a08b3   172.31.249.140   172.31.249.140   Windows Server Standard                                        10.0.19041.508                 docker://20.10.7

Expected results:
The EgressFirewall should be added successfully in this kind of cluster and should be enforced for pods located on Linux nodes.

Additional info:
Note: I could not reproduce this issue on 4.9 build 4.9.0-0.nightly-2021-10-26-041726 with a cluster from the same flexy profile.

Comment 2 Surya Seetharaman 2022-01-11 22:02:03 UTC
Apologies for the delay in getting to this bug. Is this still an issue on the 4.7 cluster?

Comment 3 Surya Seetharaman 2022-01-11 22:03:19 UTC
If it is still an issue, can I please have access to the Windows cluster where this is being observed? I am unable to reproduce this right now, so it would be good to get my hands on a reproducer.

Cheers,
Surya.

Comment 5 Surya Seetharaman 2022-01-13 14:25:07 UTC
Took a look at the cluster. This is a bug, and it happens only in hybrid-overlay mode. We try to create ACLs with the following command and attach them to each node switch in local gateway (LGW) mode (<4.8 OCP):

for _, logicalSwitch := range logicalSwitches {
	if uuids == "" {
		// No existing ACL: create one and attach it to this logical
		// switch in a single nbctl transaction.
		_, stderr, err := util.RunOVNNbctl("--id=@acl", "create", "acl",
			fmt.Sprintf("priority=%d", priority),
			fmt.Sprintf("direction=%s", toLport), match, "action="+action,
			fmt.Sprintf("external-ids:egressFirewall=%s", externalID),
			"--", "add", "logical_switch", logicalSwitch,
			"acls", "@acl")
		if err != nil {
			return fmt.Errorf("error executing create ACL command, stderr: %q, %+v", stderr, err)
		}
	} else {
		// ACL(s) already exist: attach each existing UUID to the switch.
		for _, uuid := range strings.Fields(uuids) {
			_, stderr, err := util.RunOVNNbctl("add", "logical_switch", logicalSwitch, "acls", uuid)
			if err != nil {
				return fmt.Errorf("error adding ACL to logical switch %s, stderr: %q, %+v",
					logicalSwitch, stderr, err)
			}
		}
	}
}

and logicalSwitches are constructed from:

if config.Gateway.Mode == config.GatewayModeLocal {
	// Local gateway mode: the ACL must be attached to every node's
	// logical switch, so collect all node names.
	nodes, err := oc.watchFactory.GetNodes()
	if err != nil {
		return fmt.Errorf("unable to setup egress firewall ACLs on cluster nodes, err: %v", err)
	}
	for _, node := range nodes {
		logicalSwitches = append(logicalSwitches, node.Name)
	}
} else {
	// Shared gateway mode: a single ACL on the join switch suffices.
	logicalSwitches = append(logicalSwitches, types.OVNJoinSwitch)
}

This takes the whole list of nodes in the cluster. We need to skip hybrid-overlay nodes, because hybrid-overlay (Windows) nodes don't have ovn-k topology configured and therefore have no node logical switch, as the NB database on the reproducer cluster shows (there is no switch for the winworker nodes):
sh-4.4# ovn-nbctl ls-list    
1e1e489b-eea2-49f2-97c0-6ca8522c73a1 (ext_huirwang-011347-7vxst-master-0)
9d332752-3bd7-485d-996b-2eedd19b02ee (ext_huirwang-011347-7vxst-master-1)
78dce192-e897-4701-9108-83fe47afbd07 (ext_huirwang-011347-7vxst-master-2)
41db2b20-4a2c-4a3b-9e92-17b12469932e (ext_huirwang-011347-7vxst-worker-g6dzw)
b58453d0-299f-48cd-98ee-5fa7f754d5c6 (ext_huirwang-011347-7vxst-worker-gdmzq)
12cc994f-a881-4d85-b876-e57993bac112 (huirwang-011347-7vxst-master-0)
d7a4e30d-5e35-4ec9-8c26-e1cedab6f13e (huirwang-011347-7vxst-master-1)
a4051baa-19e8-4de7-ab87-32b07156c099 (huirwang-011347-7vxst-master-2)
524a94bf-cc95-4202-9789-f2761e422eee (huirwang-011347-7vxst-worker-g6dzw)
fca70df5-a7b9-4b47-a8d8-ba1632ca70d5 (huirwang-011347-7vxst-worker-gdmzq)
ce7416ad-5200-4c2d-9cef-d43f5f845ad1 (join)
290a2ea3-8df0-44af-87bd-984a219a2eab (node_local_switch)
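
As a rough illustration (my sketch, not the actual patch), the node loop above could simply skip hybrid-overlay nodes. Here I assume Windows nodes can be identified by the well-known kubernetes.io/os label; the actual fix in PR 908 instead pulls the switch names from the cache:

for _, node := range nodes {
	// Assumption: hybrid-overlay nodes are the Windows nodes, identified
	// by the well-known kubernetes.io/os label; they have no node switch
	// in the NB DB, so no ACL can be attached for them.
	if node.Labels["kubernetes.io/os"] == "windows" {
		continue
	}
	logicalSwitches = append(logicalSwitches, node.Name)
}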


Setting severity and priority to medium.

Comment 6 Surya Seetharaman 2022-01-13 17:59:12 UTC
Posted the upstream fix: https://github.com/ovn-org/ovn-kubernetes/pull/2749. Once it lands, we need to backport it downstream and do the nbctl equivalent of it in the <4.10 releases.
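
For the <4.10 branches, a minimal sketch of what that nbctl equivalent could look like (an assumption on my part, not the final backport): ask the NB DB which logical switches actually exist, and only keep node names that have a switch. It reuses util.RunOVNNbctl from the snippets above and assumes the k8s.io/apimachinery/pkg/util/sets helper is available:

// Sketch only: derive the switch list from the NB DB instead of the node list.
stdout, stderr, err := util.RunOVNNbctl("--data=bare", "--no-heading",
	"--columns=name", "find", "logical_switch")
if err != nil {
	return fmt.Errorf("failed to list logical switches, stderr: %q, %v", stderr, err)
}
// Switch names contain no whitespace, so splitting the bare output on
// fields is safe; keep only nodes whose logical switch actually exists.
existingSwitches := sets.NewString(strings.Fields(stdout)...)
for _, node := range nodes {
	if existingSwitches.Has(node.Name) {
		logicalSwitches = append(logicalSwitches, node.Name)
	}
}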

Comment 12 errata-xmlrpc 2022-03-10 16:22:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

