Description of problem:
[OVN] EgressFirewall cannot be applied correctly if the cluster has Windows nodes.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-10-21-232712

How reproducible:

Steps to Reproduce:
1. Set up a vSphere cluster with a flexy job using profile "73_IPI on vSphere 7.0 & OVN & WindowsContainer".
2. Create a test project and an EgressFirewall:

kind: EgressFirewall
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.badiu.com
  - type: Allow
    to:
      dnsName: yahoo.com
    ports:
    - protocol: TCP
      port: 80
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0

3.

Actual results:

$ oc get egressfirewall -n test
NAME      EGRESSFIREWALL STATUS
default   EgressFirewall Rules not correctly added

I1026 10:38:38.047876 1 egressfirewall.go:210] Adding egressFirewall default in namespace test
2021-10-26T10:38:38.061Z|02429|nbctl|INFO|Running command run -- create address_set name=a5773229689678011375 external-ids:name=www.badiu.com_v4
2021-10-26T10:38:38.072Z|02430|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-master-0 acls @acl
2021-10-26T10:38:38.083Z|02431|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-worker-rlj72 acls @acl
2021-10-26T10:38:38.095Z|02432|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch huirwang1026a-46x5j-worker-8pd4l acls @acl
2021-10-26T10:38:38.106Z|02433|nbctl|INFO|Running command run --id=@acl -- create acl priority=9999 direction=to-lport "match=\"(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14\"" action=allow external-ids:egressFirewall=test -- add logical_switch winworker-mff7b acls @acl
E1026 10:38:38.106660 1 ovn.go:893] error executing create ACL command, stderr: "ovn-nbctl: no row \"winworker-mff7b\" in table Logical_Switch\n", OVN command '/usr/bin/ovn-nbctl --timeout=15 --id=@acl create acl priority=9999 direction=to-lport match="(ip4.dst == $a5773229689678011375) && ip4.src == $a5811396932658691220 && ip4.dst != 10.128.0.0/14" action=allow external-ids:egressFirewall=test -- add logical_switch winworker-mff7b acls @acl' failed: exit status 1
I1026 10:38:38.106699 1 kube.go:131] Updating status on EgressFirewall default in namespace test

hello-pod is located on a Linux node:

# oc get pod -n test -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP             NODE                               NOMINATED NODE   READINESS GATES
hello-pod   1/1     Running   0          20h   10.128.2.151   huirwang1026a-46x5j-worker-8pd4l   <none>           <none>

$ oc rsh -n test hello-pod
/ # curl -I www.google.com
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Date: Wed, 27 Oct 2021 06:11:51 GMT
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Expires: Wed, 27 Oct 2021 06:11:51 GMT
Cache-Control: private
Set-Cookie: 1P_JAR=2021-10-27-06; expires=Fri, 26-Nov-2021 06:11:51 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=511=LCkmCuPBsCzQ4rBD-NJw4t9TW1YslnqffNuY4mFS5xTg5hTBtVT53rlKOeKlTE1anRSM6Pa3-jUt6ML52lBpl_dtql3O8S2kb06U8NKCOKgtOUXKgKDMyL4T--WK7p8aqtz2-JLrJU7kazn6_THsMT2lJM4tceHdZFAuXlaTUK4; expires=Thu, 28-Apr-2022 06:11:51 GMT; path=/; domain=.google.com; HttpOnly

This cluster has a mix of Linux and Windows nodes:

$ oc get nodes -o wide
NAME                               STATUS   ROLES    AGE   VERSION                       INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
huirwang1026a-46x5j-master-0       Ready    master   22h   v1.20.0+bbbc079               172.31.249.90    172.31.249.90    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)    4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-master-1       Ready    master   22h   v1.20.0+bbbc079               172.31.249.59    172.31.249.59    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)    4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-master-2       Ready    master   22h   v1.20.0+bbbc079               172.31.249.92    172.31.249.92    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)    4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-worker-8pd4l   Ready    worker   22h   v1.20.0+bbbc079               172.31.249.16    172.31.249.16    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)    4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
huirwang1026a-46x5j-worker-rlj72   Ready    worker   22h   v1.20.0+bbbc079               172.31.249.22    172.31.249.22    Red Hat Enterprise Linux CoreOS 47.84.202110212231-0 (Ootpa)    4.18.0-305.19.1.el8_4.x86_64   cri-o://1.20.5-7.rhaos4.7.gite80c8db.el8
winworker-mff7b                    Ready    worker   22h   v1.20.0-1081+d0b1ad449a08b3   172.31.249.219   172.31.249.219   Windows Server Standard                                         10.0.19041.508                 docker://20.10.7
winworker-wz8f5                    Ready    worker   22h   v1.20.0-1081+d0b1ad449a08b3   172.31.249.140   172.31.249.140   Windows Server Standard                                         10.0.19041.508                 docker://20.10.7

Expected results:
The EgressFirewall can be added successfully in this kind of cluster and works for pods located on Linux nodes.

Additional info:
BTW, I didn't reproduce this issue in the 4.9 build 4.9.0-0.nightly-2021-10-26-041726 on a cluster from the same flexy profile.
Apologies for the delay in getting to this bug. Is this still an issue on the 4.7 cluster?
If it is still an issue, can I please have access to the Windows cluster where this is being observed? I am unable to reproduce this right now, so it would be good to get my hands on a reproducer. Cheers, Surya.
Took a look at the cluster. This is a bug, and it happens only in hybrid-overlay mode. We create the ACLs with the following code, attaching them to each node switch in local gateway (LGW) mode (< 4.8 OCP):

for _, logicalSwitch := range logicalSwitches {
	if uuids == "" {
		_, stderr, err := util.RunOVNNbctl("--id=@acl", "create", "acl",
			fmt.Sprintf("priority=%d", priority),
			fmt.Sprintf("direction=%s", toLport), match, "action="+action,
			fmt.Sprintf("external-ids:egressFirewall=%s", externalID),
			"--", "add", "logical_switch",
			logicalSwitch, "acls", "@acl")
		if err != nil {
			return fmt.Errorf("error executing create ACL command, stderr: %q, %+v", stderr, err)
		}
	} else {
		for _, uuid := range strings.Fields(uuids) {
			_, stderr, err := util.RunOVNNbctl("add", "logical_switch", logicalSwitch, "acls", uuid)
			if err != nil {
				return fmt.Errorf("error adding ACL to joinsSwitch %s failed, stderr: %q, %+v", logicalSwitch, stderr, err)
			}
		}
	}
}

and logicalSwitches is constructed from:

if config.Gateway.Mode == config.GatewayModeLocal {
	nodes, err := oc.watchFactory.GetNodes()
	if err != nil {
		return fmt.Errorf("unable to setup egress firewall ACLs on cluster nodes, err: %v", err)
	}
	for _, node := range nodes {
		logicalSwitches = append(logicalSwitches, node.Name)
	}
} else {
	logicalSwitches = append(logicalSwitches, types.OVNJoinSwitch)
}

i.e. from the whole list of nodes in the cluster. We need to skip hybrid-overlay nodes here, because hybrid-overlay nodes don't have the ovn-k topology configured, so there is no node logical switch for them in NBDB:

sh-4.4# ovn-nbctl ls-list
1e1e489b-eea2-49f2-97c0-6ca8522c73a1 (ext_huirwang-011347-7vxst-master-0)
9d332752-3bd7-485d-996b-2eedd19b02ee (ext_huirwang-011347-7vxst-master-1)
78dce192-e897-4701-9108-83fe47afbd07 (ext_huirwang-011347-7vxst-master-2)
41db2b20-4a2c-4a3b-9e92-17b12469932e (ext_huirwang-011347-7vxst-worker-g6dzw)
b58453d0-299f-48cd-98ee-5fa7f754d5c6 (ext_huirwang-011347-7vxst-worker-gdmzq)
12cc994f-a881-4d85-b876-e57993bac112 (huirwang-011347-7vxst-master-0)
d7a4e30d-5e35-4ec9-8c26-e1cedab6f13e (huirwang-011347-7vxst-master-1)
a4051baa-19e8-4de7-ab87-32b07156c099 (huirwang-011347-7vxst-master-2)
524a94bf-cc95-4202-9789-f2761e422eee (huirwang-011347-7vxst-worker-g6dzw)
fca70df5-a7b9-4b47-a8d8-ba1632ca70d5 (huirwang-011347-7vxst-worker-gdmzq)
ce7416ad-5200-4c2d-9cef-d43f5f845ad1 (join)
290a2ea3-8df0-44af-87bd-984a219a2eab (node_local_switch)

Setting severity and priority to medium.
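For illustration, a minimal sketch of the kind of filter the node loop above needs. This is not the actual upstream patch; it reuses the surrounding context of the snippet above, and identifying hybrid-overlay (Windows) nodes via the kubernetes.io/os label is an assumption for the sketch, the real fix may key off a hybrid-overlay annotation instead:

if config.Gateway.Mode == config.GatewayModeLocal {
	nodes, err := oc.watchFactory.GetNodes()
	if err != nil {
		return fmt.Errorf("unable to setup egress firewall ACLs on cluster nodes, err: %v", err)
	}
	for _, node := range nodes {
		// Hybrid-overlay (Windows) nodes have no node logical switch in
		// NBDB, so adding ACLs to them can only fail; skip them.
		// Assumption: they are identified by the kubernetes.io/os label.
		if node.Labels["kubernetes.io/os"] == "windows" {
			continue
		}
		logicalSwitches = append(logicalSwitches, node.Name)
	}
} else {
	logicalSwitches = append(logicalSwitches, types.OVNJoinSwitch)
}

Whatever marker is used, the point is the same: only nodes that actually have a node logical switch in NBDB should end up in logicalSwitches.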
Posted the upstream fix: https://github.com/ovn-org/ovn-kubernetes/pull/2749. Once it lands, we need to backport it downstream and do the nbctl equivalent of this in the < 4.10 releases.
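Until that lands, a hedged sketch of what an "nbctl equivalent" could look like in the < 4.10 branches: before attaching ACLs, check inside the loop over logicalSwitches whether the node's Logical_Switch row actually exists. The find database command and the --data/--no-heading/--columns options are standard ovn-nbctl usage, but the exact shape of the backport may differ:

// Sketch only: skip switches that do not exist in the NB database.
// "find logical_switch name=<switch>" prints nothing when there is no row.
uuid, stderr, err := util.RunOVNNbctl("--data=bare", "--no-heading",
	"--columns=_uuid", "find", "logical_switch",
	fmt.Sprintf("name=%s", logicalSwitch))
if err != nil {
	return fmt.Errorf("error looking up logical switch %s, stderr: %q, %+v",
		logicalSwitch, stderr, err)
}
if uuid == "" {
	// No row in Logical_Switch (e.g. a hybrid-overlay/Windows node); skip it.
	continue
}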
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056