Description of problem:
In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes. For one application this means it cannot reach the DNS service, because the network policy that allows that traffic is not being implemented. In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
~~~
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-01-27T14:41:05Z"
  generation: 2
  name: default-deny
  namespace: customer-debug
  resourceVersion: "311846645"
  uid: 87646222-c86d-4000-8997-7f0557ac34cf
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
~~~
In one of our dev clusters this network policy is enforced.

Version-Release number of selected component (if applicable):
OCP 4.8.25

How reproducible:
This happens randomly and is very difficult to predict.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The must-gathers from the cluster are attached to the case.
Upon finishing my analysis of the logs, there are several bugs/errors happening here. All of them compound to either make network policies fail to be enforced properly, or to keep them enforced when they shouldn't be:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]

This error is spammed throughout the log, but it is benign. On pod add we can fail to get the OVN annotation because we race with the pod handler. However, once the pod handler annotates the pod, an update event fires and this code is executed again. I'm going to stop printing this error on pod add.

3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. That bug references stateful sets, but it really applies to any pod being added. When a network policy is created, or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. The fix makes the network policy handler wait until the pod is added to the cache; otherwise the network policy is created but potentially skips being applied to some pods in the namespace. This is already fixed in 4.8.29.

4. policy.go:1166] failed to add IPs ... set contains duplicate value

The duplicate value being added here is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates). However, I'm still going to add checks to ensure we filter out any duplicate values before adding them to the cache or sending the RPC to OVN (a minimal sketch of this kind of filtering follows this list). I'm going to get a proper fix into master and then backport it to 4.8z.

5. E0125 18:40:32.759129 1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat

This is the most egregious bug. First, the log line does not print the actual error. Second, this failure causes the network policy creation to fail, and it is not retried again (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy, just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.
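To make the intent of item 4 concrete, here is a minimal, self-contained sketch of the kind of duplicate filtering described there. The function name and types are illustrative only and are not the actual ovn-kubernetes helpers:
~~~
package main

import "fmt"

// dedupIPs returns ips with duplicates removed while preserving order.
// The idea (illustrative, not the real ovn-kubernetes code) is to drop
// repeated entries, such as a load-balancer VIP that shows up twice,
// before the list is handed to the OVN transaction, so OVN never sees
// a "set contains duplicate value" error.
func dedupIPs(ips []string) []string {
	seen := make(map[string]struct{}, len(ips))
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		if _, ok := seen[ip]; ok {
			continue // already added, skip the duplicate
		}
		seen[ip] = struct{}{}
		out = append(out, ip)
	}
	return out
}

func main() {
	// A VIP (10.0.0.5 here) appearing twice is kept only once.
	fmt.Println(dedupIPs([]string{"10.0.0.5", "10.0.0.6", "10.0.0.5"}))
	// Output: [10.0.0.5 10.0.0.6]
}
~~~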
Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792
Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794
Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797. Will need a follow-up part 2 after this is reviewed and accepted.
Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.
@trozet Do you have a link for the BZ / PR for:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

Many thanks,
Andy
Yeah, the fix for number 1 is a one-liner in the ebay/libovsdb library: https://github.com/openshift/ovn-kubernetes/commit/35677418d2bbfddb6229e1d776bba2064dde646b#diff-88e093886eb91e9ca5f9234d74a5f756c0251d685c141c902a7833d95bec5345R27
~~~
@@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error) {
 		return nil, errors.New("OvsSet supports only Go Slice types")
 	}

-	var ovsSet []interface{}
+	ovsSet := make([]interface{}, 0, v.Len())
 	for i := 0; i < v.Len(); i++ {
 		ovsSet = append(ovsSet, v.Index(i).Interface())
 	}
~~~
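For anyone wondering why that one-liner matters, my reading (not spelled out in the commit itself) is that a declared-but-never-appended slice stays nil, and encoding/json marshals a nil slice as null rather than as an empty array, which would line up with the expected ["set", <array>] syntax error above. A standalone snippet showing just the marshaling difference:
~~~
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var nilSlice []interface{}           // never appended to: remains nil
	emptySlice := make([]interface{}, 0) // explicitly allocated: empty but non-nil

	a, _ := json.Marshal(nilSlice)
	b, _ := json.Marshal(emptySlice)

	fmt.Println(string(a)) // prints: null (a nil slice is not serialized as a JSON array)
	fmt.Println(string(b)) // prints: []   (an empty, non-nil slice is serialized as a JSON array)
}
~~~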
Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823
Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069