Bug 2109442 - Network policies are not implemented or updated by OVN-Kubernetes
Summary: Network policies are not implemented or updated by OVN-Kubernetes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: ffernand
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 2057723
Blocks: 2115926
 
Reported: 2022-07-21 08:27 UTC by philipp.dallig
Modified: 2023-09-15 01:56 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2048538
Environment:
Last Closed: 2022-08-23 18:29:02 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 1195 | 0 | None | open | [release-4.10] Bug 2109442: Fix GetPodsBySelector/GetNamespacesBySelector so MatchExpressions is respected | 2022-08-04 15:16:29 UTC
Red Hat Product Errata | RHBA-2022:6095 | 0 | None | None | None | 2022-08-23 18:29:25 UTC

Description philipp.dallig 2022-07-21 08:27:00 UTC
An important commit was missed during the downstream merge
Commit: https://github.com/openshift/ovn-kubernetes/pull/956/commits/96b2a2555a654d72a8546366032063a98a016f29
Initial downstream merge to master branch: https://github.com/openshift/ovn-kubernetes/pull/956
Downstream merge into the Release 4.10 branch: https://github.com/openshift/ovn-kubernetes/pull/971
Pull request to include the missing commit in Release 4.10: https://github.com/openshift/ovn-kubernetes/pull/1195

+++ This bug was initially created as a clone of Bug #2048538 +++

Description of problem:

In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes.
For one application this means it cannot reach the DNS service because the network policy that allows that is not being implemented.

In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
~~~
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-01-27T14:41:05Z"
  generation: 2
  name: default-deny
  namespace: customer-debug
  resourceVersion: "311846645"
  uid: 87646222-c86d-4000-8997-7f0557ac34cf
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
~~~

In one of our dev clusters this network policy is enforced.


Version-Release number of selected component (if applicable):

OCP 4.8.25

How reproducible:

This happens randomly and is very difficult to predict.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

The case has the must-gathers from the cluster attached.

--- Additional comment from Tim Rozet on 2022-02-03 16:04:01 UTC ---

Having finished my analysis of the logs, I see several bugs/errors happening here, all of which compound to either make network policies fail to be enforced properly or cause them to stay enforced when they shouldn't be:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]]  }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]

This error is spammed throughout the log, but is benign. On pod add we could fail to get the OVN annotation due to a race with the pod handler. However, once the pod handler annotates the pod, an update event will happen and this code will be executed again. I'm going to stop printing this error on pod add.
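
A minimal sketch of the idea for number 2, not the actual ovn-kubernetes code: a missing annotation is suppressed on add (because an update event is expected to follow) but still reported on update. The names `handlePod`, `podIPs`, and `errNoAnnotation` are hypothetical, and the annotation key is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ovnAnnotation is the annotation key assumed for this sketch.
const ovnAnnotation = "k8s.ovn.org/pod-networks"

// errNoAnnotation is a hypothetical sentinel for "pod not annotated yet".
var errNoAnnotation = errors.New("could not find OVN pod annotation")

// podIPs returns the raw annotation value, or errNoAnnotation if the pod
// handler has not annotated the pod yet.
func podIPs(annotations map[string]string) (string, error) {
	v, ok := annotations[ovnAnnotation]
	if !ok {
		return "", errNoAnnotation
	}
	return v, nil
}

// handlePod returns the error that should be logged, if any. A missing
// annotation on add is expected, so it is suppressed in that case only.
func handlePod(annotations map[string]string, isAdd bool) error {
	_, err := podIPs(annotations)
	if err != nil && isAdd && errors.Is(err, errNoAnnotation) {
		return nil
	}
	return err
}

func main() {
	unannotated := map[string]string{"openshift.io/scc": "anyuid"}
	fmt.Println(handlePod(unannotated, true))  // add event: suppressed
	fmt.Println(handlePod(unannotated, false)) // update event: reported
}
```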

3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. The bug references stateful sets, but this was really true of any pod being added. When the network policy is created or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. Without the fix, the network policy is created and potentially skips being applied to some pods in the namespace. The fix makes the network policy handler wait until the pod is added to the cache. This is already fixed in 4.8.29.
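
The "wait until the pod is added to the cache" behaviour for number 3 can be sketched roughly as a poll with a timeout. This is an illustration under assumptions, not the shipped fix: `portCache` and `waitForPort` are hypothetical names, and the real handler may use a different waiting strategy.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// portCache is a hypothetical stand-in for the internal logical port cache.
type portCache struct {
	mu    sync.Mutex
	ports map[string]string // logical port name -> UUID
}

func newPortCache() *portCache {
	return &portCache{ports: make(map[string]string)}
}

func (c *portCache) add(name, uuid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ports[name] = uuid
}

func (c *portCache) get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	uuid, ok := c.ports[name]
	return uuid, ok
}

var errPortNotFound = errors.New("logical port not found in cache")

// waitForPort polls the cache until the port appears or the timeout expires.
func waitForPort(c *portCache, name string, interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		if uuid, ok := c.get(name); ok {
			return uuid, nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("%w: %s", errPortNotFound, name)
		}
		time.Sleep(interval)
	}
}

func main() {
	cache := newPortCache()
	// Simulate the pod handler adding the port slightly after the policy handler asks for it.
	go func() {
		time.Sleep(20 * time.Millisecond)
		cache.add("cd-argocd-cdteam_testssl2", "uuid-1")
	}()
	uuid, err := waitForPort(cache, "cd-argocd-cdteam_testssl2", 5*time.Millisecond, time.Second)
	fmt.Println(uuid, err)
}
```

Without the wait, the first lookup would fail and the policy would be created without that pod's port.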

4. policy.go:1166] failed to add IPs  ... set contains duplicate value

The duplicate value being added here is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates); however, I'm still going to add checks to ensure we filter out any duplicate values before adding them to the cache or sending the RPC to OVN. I'm going to ensure a proper fix goes into master and then backport it to 4.8z.
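
As a rough illustration of the duplicate filtering described for number 4 (a sketch only; `uniqueIPs` is a hypothetical helper, not a function from ovn-kubernetes), the addresses can be de-duplicated before they are handed to the cache or to OVN:

```go
package main

import "fmt"

// uniqueIPs returns the input addresses with duplicates removed,
// preserving first-seen order.
func uniqueIPs(ips []string) []string {
	seen := make(map[string]struct{}, len(ips))
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		if _, dup := seen[ip]; dup {
			continue // already queued for the set, skip it
		}
		seen[ip] = struct{}{}
		out = append(out, ip)
	}
	return out
}

func main() {
	// The same load balancer VIP appears twice in the input.
	vips := []string{"10.0.0.1", "10.0.0.2", "10.0.0.1"}
	fmt.Println(uniqueIPs(vips)) // [10.0.0.1 10.0.0.2]
}
```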

5. E0125 18:40:32.759129       1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat

This is the most egregious bug. First of all, the log is not printing the actual error. Second, this failure causes the network policy to fail creation, and it is then not retried (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy, just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.
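
A minimal sketch of both points for number 5, under stated assumptions: the returned error wraps the underlying cause so the log shows it, and failed policies are remembered so a later pass can retry them. `retryQueue`, `create`, and `retryAll` are hypothetical names; the actual retry logic in the linked PRs may look quite different.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient simulates a transient OVN transaction failure.
var errTransient = errors.New("transient OVN transaction failure")

// retryQueue remembers policies whose creation failed (hypothetical type).
type retryQueue struct {
	pending map[string]int // policy key -> failed attempts so far
}

func newRetryQueue() *retryQueue {
	return &retryQueue{pending: make(map[string]int)}
}

// create runs createFn once. On failure it records the policy for a later
// retry and returns an error that wraps the underlying cause.
func (q *retryQueue) create(key string, createFn func(string) error) error {
	if err := createFn(key); err != nil {
		q.pending[key]++
		return fmt.Errorf("failed to create port group for network policy %s: %w", key, err)
	}
	delete(q.pending, key)
	return nil
}

// retryAll re-attempts every pending policy and returns how many remain.
func (q *retryQueue) retryAll(createFn func(string) error) int {
	keys := make([]string, 0, len(q.pending))
	for key := range q.pending {
		keys = append(keys, key)
	}
	for _, key := range keys {
		if err := q.create(key, createFn); err != nil {
			fmt.Println("retry failed:", err)
		}
	}
	return len(q.pending)
}

func main() {
	q := newRetryQueue()
	calls := 0
	flaky := func(string) error { // fails on the first call only
		calls++
		if calls == 1 {
			return errTransient
		}
		return nil
	}
	if err := q.create("ie-st-montun-filebeat/allow-prometheus", flaky); err != nil {
		fmt.Println(err)
	}
	fmt.Println("pending after retry:", q.retryAll(flaky))
}
```

In a real controller, something like `retryAll` would be driven by a periodic timer rather than called once.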

--- Additional comment from Tim Rozet on 2022-02-03 21:59:27 UTC ---

Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792

--- Additional comment from Tim Rozet on 2022-02-03 22:41:21 UTC ---

Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794

--- Additional comment from Tim Rozet on 2022-02-04 23:23:59 UTC ---

Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797

Will need a follow-up part 2 after this is reviewed + accepted.

--- Additional comment from Tim Rozet on 2022-02-09 01:45:15 UTC ---

Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.

--- Additional comment from Andy Bartlett on 2022-02-09 10:33:15 UTC ---

@trozet Do you have a link for the BZ / PR for:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]]  }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

Many thanks,

Andy

--- Additional comment from Tim Rozet on 2022-02-14 16:55:30 UTC ---

Yeah, the fix for number 1 is a one-liner in the ebay/libovsdb library:

https://github.com/openshift/ovn-kubernetes/commit/35677418d2bbfddb6229e1d776bba2064dde646b#diff-88e093886eb91e9ca5f9234d74a5f756c0251d685c141c902a7833d95bec5345R27

@@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error) {
		return nil, errors.New("OvsSet supports only Go Slice types")
	}

-	var ovsSet []interface{}
+	ovsSet := make([]interface{}, 0, v.Len())
	for i := 0; i < v.Len(); i++ {
		ovsSet = append(ovsSet, v.Index(i).Interface())
	}
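
One plausible reading of why this one-liner matters (an interpretation, not something stated in the comment): in Go, a nil slice and an empty non-nil slice encode differently with `encoding/json`. When the input slice is empty, the old `var ovsSet []interface{}` stays nil, whereas `make` always yields a non-nil slice. The sketch below only demonstrates the standard library behaviour; `encode` is a hypothetical helper, and the exact serialization path inside libovsdb is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode marshals a value to a JSON string and returns any marshal error.
func encode(v interface{}) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	var nilSet []interface{}           // what the old declaration leaves behind for empty input
	emptySet := make([]interface{}, 0) // what the fixed code produces for empty input

	a, _ := encode([]interface{}{"set", nilSet})
	b, _ := encode([]interface{}{"set", emptySet})
	fmt.Println(a) // ["set",null]
	fmt.Println(b) // ["set",[]]
}
```

A server expecting `["set", <array>]` would reject the first form, which is consistent with the syntax error in the log.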

--- Additional comment from Tim Rozet on 2022-02-15 14:51:40 UTC ---

Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823

--- Additional comment from Tim Rozet on 2022-02-16 17:21:06 UTC ---

Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826

Comment 9 errata-xmlrpc 2022-08-23 18:29:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.28 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6095

Comment 10 Red Hat Bugzilla 2023-09-15 01:56:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

