Bug 2048538 - Network policies are not implemented or updated by OVN-Kubernetes
Summary: Network policies are not implemented or updated by OVN-Kubernetes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 2057723
Reported: 2022-01-31 13:20 UTC by Andy Bartlett
Modified: 2023-09-15 01:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2057723 2109442 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 956 0 None Merged Bug 2048538: [DownstreamMerge] 2-14-22 2022-03-02 14:32:21 UTC
Github openshift ovn-kubernetes pull 966 0 None Merged Bug 2048538: [DownstreamMerge] 2-22-22 2022-03-02 14:32:22 UTC
Github ovn-org ovn-kubernetes pull 2792 0 None Merged Fixes handling errors for getting IPs for pods 2022-03-02 14:32:22 UTC
Github ovn-org ovn-kubernetes pull 2794 0 None Merged Duplicates in addrsets 2022-03-02 14:32:23 UTC
Github ovn-org ovn-kubernetes pull 2809 0 None Merged Adds retry mechanism for Network Policy 2022-03-02 14:32:24 UTC
Github ovn-org ovn-kubernetes pull 2823 0 None Merged Fixes race for namespace logging level update 2022-03-02 14:32:24 UTC
Github ovn-org ovn-kubernetes pull 2826 0 None Merged Fixes delete retry on network policy recreation 2022-03-02 14:32:25 UTC
Github ovn-org ovn-kubernetes pull 2847 0 None open NP Retry: return error for ensureAddrSet 2022-03-04 16:36:14 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:46:22 UTC

Description Andy Bartlett 2022-01-31 13:20:40 UTC
Description of problem:

In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes.
For one application this means it cannot reach the DNS service because the network policy that allows that is not being implemented.

In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
~~~
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-01-27T14:41:05Z"
  generation: 2
  name: default-deny
  namespace: customer-debug
  resourceVersion: "311846645"
  uid: 87646222-c86d-4000-8997-7f0557ac34cf
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
~~~

In one of our dev clusters this network policy is enforced.


Version-Release number of selected component (if applicable):

OCP 4.8.25

How reproducible:

This happens randomly and is very difficult to predict.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

The support case has the must-gathers from the cluster attached.

Comment 7 Tim Rozet 2022-02-03 16:04:01 UTC
Having finished my analysis of the logs, I see several bugs/errors happening here. Together they either cause network policies to fail to be enforced properly or may cause them to stay enforced when they shouldn't be:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]]  }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]

This error is spammed throughout the log, but is benign. On pod add we could fail to get the OVN annotation due to racing with the pod handler. However, once the pod handler annotates the pod, an update event will happen and this code will be executed again. I'm going to stop printing this error on pod add.

3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. The bug references stateful sets, but this was really true of any pod being added. When the network policy is created or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. The fix makes the network policy handler wait until the pod is added to the cache. Otherwise the network policy is created and potentially skips being applied to some pods in the namespace. This is already fixed in 4.8.29.

4. policy.go:1166] failed to add IPs  ... set contains duplicate value

The duplicate value being added here is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates); however, I'm still going to add checks to ensure we filter out any duplicate values before adding them to the cache or sending the RPC to OVN. I'm going to ensure a proper fix goes into master and then backport it to 4.8z.

5. E0125 18:40:32.759129       1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat

This is the most egregious bug. First of all, the log is not printing the actual error. Second, this failure causes the network policy to fail creation, and then it is not retried again (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.

Comment 8 Tim Rozet 2022-02-03 21:59:27 UTC
Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792

Comment 9 Tim Rozet 2022-02-03 22:41:21 UTC
Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794
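
As a rough illustration of the duplicate filtering described for number 4 (this is a sketch of the idea, not the code in the PR; uniqueIPs is a hypothetical helper name), the list of IPs can be deduplicated before it is cached or sent to OVN:

```go
package main

// uniqueIPs returns ips with duplicates removed, keeping the first
// occurrence of each value and preserving the original order.
// Assumption: IPs are compared as plain strings, so two different textual
// spellings of the same address would not be treated as duplicates.
func uniqueIPs(ips []string) []string {
	seen := make(map[string]struct{}, len(ips))
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		if _, dup := seen[ip]; dup {
			continue
		}
		seen[ip] = struct{}{}
		out = append(out, ip)
	}
	return out
}
```

For example, a load balancer VIP that shows up twice in the input is only passed on once, which avoids the "set contains duplicate value" rejection seen in the log.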

Comment 10 Tim Rozet 2022-02-04 23:23:59 UTC
Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797

Will need a follow up part 2 after this is reviewed + accepted.

Comment 11 Tim Rozet 2022-02-09 01:45:15 UTC
Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.

Comment 12 Andy Bartlett 2022-02-09 10:33:15 UTC
@trozet Do you have a link for the BZ / PR for:

1.policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]]  }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

Many thanks,

Andy

Comment 13 Tim Rozet 2022-02-14 16:55:30 UTC
Yeah, the fix for number 1 is a one-liner in the ebay/libovsdb library:

https://github.com/openshift/ovn-kubernetes/commit/35677418d2bbfddb6229e1d776bba2064dde646b#diff-88e093886eb91e9ca5f9234d74a5f756c0251d685c141c902a7833d95bec5345R27

@@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error) {
		return nil, errors.New("OvsSet supports only Go Slice types")
	}

-	var ovsSet []interface{}
+	ovsSet := make([]interface{}, 0, v.Len())
	for i := 0; i < v.Len(); i++ {
		ovsSet = append(ovsSet, v.Index(i).Interface())
	}
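
A note on why this one-line change plausibly matters: in Go, a nil slice and an empty non-nil slice encode differently with encoding/json. The sketch below shows only that standard library behavior. encodeSet, nilElems and emptyElems are hypothetical helpers, and I have not checked that the library's own encoder builds the message exactly this way, so treat it as an illustration of the likely mechanism behind the expected ["set", <array>] error.

```go
package main

import "encoding/json"

// encodeSet marshals elems in the two-element ["set", [...]] notation that
// the error message in this bug refers to.
func encodeSet(elems []interface{}) (string, error) {
	b, err := json.Marshal([]interface{}{"set", elems})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// nilElems mirrors the old code: a declared but never appended-to slice.
func nilElems() []interface{} {
	var s []interface{}
	return s
}

// emptyElems mirrors the patched code: an allocated slice of length zero.
func emptyElems() []interface{} {
	return make([]interface{}, 0)
}
```

encodeSet(nilElems()) yields ["set",null], while encodeSet(emptyElems()) yields ["set",[]]. Only the second form has an array in the second position, so an empty port list would be rejected with the old code and accepted with the patched code.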

Comment 15 Tim Rozet 2022-02-15 14:51:40 UTC
Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823

Comment 17 Tim Rozet 2022-02-16 17:21:06 UTC
Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826

Comment 23 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 24 Red Hat Bugzilla 2023-09-15 01:51:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

