Bug 2024880 - Egress IP breaks when network policies are applied
Summary: Egress IP breaks when network policies are applied
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ben Bennett
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks: 2026302
Reported: 2021-11-19 11:24 UTC by Mridul Markandey
Modified: 2022-03-16 01:37 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:29:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 373 0 None open Bug 2024880: [EgressIP] move `ct(commit)` action from OVS group to flow 2021-11-23 14:49:03 UTC
Red Hat Issue Tracker INSIGHTOCP-537 0 None None None 2021-11-23 14:39:27 UTC
Red Hat Knowledge Base (Solution) 6534811 0 None None None 2021-11-23 13:36:31 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:30:04 UTC
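
The linked openshift/sdn pull 373 title points at the fix: the `ct(commit)` conntrack action for egress IP traffic was moved from an OVS group into a flow. A rough way to look at this on an affected node is sketched below; the bridge name `br0` is the openshift-sdn default, and where the command is run (on the node or inside the sdn pod) and the grep patterns are illustrative assumptions, not taken from the bug.

```
# Dump OpenFlow groups and flows on the SDN bridge and look for ct(commit);
# per the PR title, before the fix the commit sits in a group, after it in a flow.
ovs-ofctl -O OpenFlow13 dump-groups br0 | grep 'ct(commit'
ovs-ofctl -O OpenFlow13 dump-flows br0 | grep 'ct(commit'
```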

Description Mridul Markandey 2021-11-19 11:24:42 UTC
Description of problem:
After upgrading to RHOCP v4.8.17, EgressIP breaks when a default network policy is applied.

Version-Release number of selected component (if applicable):
v4.8.20


How reproducible:
Always

Steps to Reproduce:
1. The issue can be reproduced using the default network policy YAML attached to the Bugzilla.
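
The attached YAML itself is not captured in this report. As an illustration only, a minimal reproduction sketch follows; it assumes the ingress-only "allow same namespace" policy shape shown later in comment 33, and all namespace, node, pod, and IP values are placeholders rather than values from the attachment.

```
# Minimal reproduction sketch (all names and IPs are placeholder assumptions)
oc new-project egress-test

# Assign an egress IP to the namespace and to a hosting node (openshift-sdn)
oc patch netnamespace egress-test --type=merge -p '{"egressIPs": ["192.0.2.10"]}'
oc patch hostsubnet <worker-node> --type=merge -p '{"egressIPs": ["192.0.2.10"]}'

# Apply an ingress-only policy that allows traffic only from the same namespace
oc apply -n egress-test -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress
EOF

# From a pod in the namespace, try to reach an external host; on affected
# 4.8.z/4.9.z versions the connection fails instead of leaving via 192.0.2.10
oc rsh -n egress-test <pod> curl -sI http://<external-host>/
```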


Actual results:
EgressIP does not work when network policies are applied. If the policy is removed, the application starts working through the EgressIP again.

Expected results:
EgressIP should work in the presence of a default network policy.


Additional info:
As requested by the engineering team, I am raising this Bugzilla. The previous discussion took place in this BZ [https://bugzilla.redhat.com/show_bug.cgi?id=2008987].

Comment 18 huirwang 2021-11-24 09:17:33 UTC
Verified in 4.10.0-0.nightly-2021-11-24-030137; EgressIP worked with the network policy configured. The curl to 172.31.249.80:9095 below returns 172.31.249.201, the configured egress IP.

$ oc get hostsubnet
NAME                                  HOST                                  HOST IP          SUBNET          EGRESS CIDRS          EGRESS IPS
qe-huirwang1124b-x2z4m-master-0       qe-huirwang1124b-x2z4m-master-0       172.31.249.55    10.128.0.0/23                         
qe-huirwang1124b-x2z4m-master-1       qe-huirwang1124b-x2z4m-master-1       172.31.249.160   10.129.0.0/23                         
qe-huirwang1124b-x2z4m-master-2       qe-huirwang1124b-x2z4m-master-2       172.31.249.121   10.130.0.0/23                         
qe-huirwang1124b-x2z4m-worker-5rh7p   qe-huirwang1124b-x2z4m-worker-5rh7p   172.31.249.32    10.128.2.0/23   ["172.31.249.0/24"]   ["172.31.249.201"]
qe-huirwang1124b-x2z4m-worker-8x5c5   qe-huirwang1124b-x2z4m-worker-8x5c5   172.31.249.3     10.131.0.0/23
$ oc get netnamespace test
NAME   NETID     EGRESS IPS
test   3821487   ["172.31.249.201"]

$ oc get networkpolicy -n test -oyaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2021-11-24T08:47:06Z"
    generation: 1
    managedFields:
    - apiVersion: networking.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:ingress: {}
          f:policyTypes: {}
      manager: kubectl-create
      operation: Update
      time: "2021-11-24T08:47:06Z"
    name: test-podselector-and-ipblock
    namespace: test
    resourceVersion: "33366"
    uid: b920fada-395d-4039-94cd-0d777bdc87dd
  spec:
    ingress:
    - from:
      - ipBlock:
          cidr: 10.129.2.32/32
      - ipBlock:
          cidr: 10.131.0.0/24
      - ipBlock:
          cidr: 10.128.2.38/32
    podSelector: {}
    policyTypes:
    - Ingress
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc rsh -n test test-rc-897bw
~ $ curl 172.31.249.80:9095
172.31.249.201~ 
$ curl www.google.com -I
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Date: Wed, 24 Nov 2021 08:50:54 GMT
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Expires: Wed, 24 Nov 2021 08:50:54 GMT
Cache-Control: private
Set-Cookie: 1P_JAR=2021-11-24-08; expires=Fri, 24-Dec-2021 08:50:54 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=511=ZchGIK5lR5eNtv-2BdT8K277sBGv9JhR9wDeAxdHpp77mT78NUzUJ6KGkt0kcwBIGZ5DX4TBqCFpPYOx0-DTX8O5_4zkDhYvzuMuhvinKeh7VV0SYsnj7oiB2bAaKrHIDsUEKAxNlJm0gUxxC8NlXsH__YoK2MUdktyR6Ob2ec8; expires=Thu, 26-May-2022 08:50:54 GMT; path=/; domain=.google.com; HttpOnly

Comment 21 Vadim Rutkovsky 2021-11-29 14:51:08 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 24 Alexander Constantinescu 2021-11-30 11:31:36 UTC
> Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers using Egress IPs in namespaces that have network policies applied which do not explicitly allow access from the endpoints that the pods in those namespaces are trying to connect to.

> What is the impact?  Is it serious enough to warrant blocking edges?

The pods matching the egress IP will not have external connectivity unless the network policy is removed or modified to explicitly allow connectivity from those endpoints.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

The only remediation is removing the network policy or modifying it to explicitly allow the required traffic.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

All 4.8.z and 4.9.z versions are impacted, and customers upgrading from 4.7 to 4.8 will most likely hit this issue if they have this network policy configuration.
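
As a rough way to gauge exposure on a given cluster, the sketch below lists namespaces that combine an egress IP with at least one NetworkPolicy. The field names follow the openshift-sdn NetNamespace API; the loop itself is only an illustration, not part of the bug.

```
# List namespaces that have both egress IPs and NetworkPolicies (illustrative)
for ns in $(oc get netnamespaces -o jsonpath='{.items[*].metadata.name}'); do
  ips=$(oc get netnamespace "$ns" -o jsonpath='{.egressIPs}')
  np=$(oc get networkpolicy -n "$ns" --no-headers 2>/dev/null | wc -l)
  if [ -n "$ips" ] && [ "$np" -gt 0 ]; then
    echo "$ns: egressIPs=$ips networkpolicies=$np"
  fi
done
```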

Comment 26 Lalatendu Mohanty 2021-11-30 14:56:24 UTC
> All 4.8.z and 4.9.z versions are impacted, and customers upgrading from 4.7 to 4.8 will most likely hit this issue if they have this network policy configuration.

Because all 4.8.z releases have this issue and it took this long to surface, it seems only a small proportion of customers are likely to face it. Also, removing edges to all of 4.8.z would impact customers very negatively, as those edges have been present for so long. So we are not planning to block upgrade edges for this bug. However, if the bug starts impacting more clusters, we will reconsider blocking the edges.

Comment 33 Nick Su 2021-12-28 02:13:11 UTC
Hi

I have tested in my OCP 4.8.10; egressIP does not work with the network policy below:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: NAMESPACE
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
```

Only after modifying the networkPolicy to allow traffic from the default namespace does the egressIP start to work. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1700431; it looks like this is not yet fixed.
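
For reference, a policy of the kind this comment describes (allowing ingress from the default namespace in addition to the same namespace) might look like the sketch below. The `name=default` label and the selector are assumptions: the label may need to be added first, and clusters whose namespaces already carry a name label (for example kubernetes.io/metadata.name) can select on that instead.

```
# Illustrative workaround shape only; label and selector values are assumptions
oc label namespace default name=default --overwrite

oc apply -n NAMESPACE -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-default-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
  policyTypes:
  - Ingress
EOF
```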

Comment 37 errata-xmlrpc 2022-03-10 16:29:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
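
To confirm whether a cluster already carries the fix, checking the running version and the available update path is enough. A minimal sketch follows; the exact target version to pick depends on the update channel and may be newer than the 4.10.3 release named in the advisory.

```
# Check the running version and available updates, then update (illustrative)
oc get clusterversion
oc adm upgrade
oc adm upgrade --to=4.10.3
```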

