Bug 1748034

Summary: [3.11] SDN - migrating from multitenant to networkpolicy does not work
Product: OpenShift Container Platform
Reporter: Samuel <smoro>
Component: Networking
Assignee: Jason Boxman <jboxman>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
CC: aos-bugs, arghosh, bbeaudoi, bbennett, dcaldwel, jboxman, jolee, mharri, ph.hutter, weliang
Version: 3.11.0
Flags: jboxman: needinfo-, jboxman: needinfo-
Target Release: 3.11.z
Hardware: x86_64
OS: Linux
Last Closed: 2020-05-08 18:37:28 UTC
Type: Bug

Attachments:

Description	Flags
Testing passed	none
Testing failed	none

Description Samuel 2019-09-02 14:03:39 UTC

Description of problem:


I am working with an OpenShift 3.11.98 cluster that was originally deployed with the multitenant network plugin, and I am now trying to reconfigure it to use networkpolicy.



How reproducible:

Always.
Seen on OCP 3.11.98 with a customer.
Reproduced on my own OKD 3.11.


Steps to Reproduce:

1. Deploy OCP 3.11 with the multitenant network plugin.

2. Follow the instructions for migrating to networkpolicy [1].

3. Label the routers' namespace [2].

4. Install the "allow-from-default-namespace" NetworkPolicy sample, copy-pasted from the docs [2].



[1]  https://docs.openshift.com/container-platform/3.11/install_config/configuring_sdn.html#migrating-between-sdn-plugins

[2] https://docs.openshift.com/container-platform/3.11/admin_guide/managing_networking.html#admin-guide-networking-networkpolicy



Actual results:

From a router Pod, I cannot connect to my application Pod.


Expected results:

The router Pod should be able to connect to the application Pod.


Additional info:

On the node hosting my application Pod, tcpdump confirms that the TCP handshake is broken: the SYN arrives but gets no response, and it is not forwarded to the Pod.

Dumping the OVS flows from table 80, I get something like this:

cookie=0x0 table=80, n_packets=15691,  n_bytes=1345863, priority=300,ip,new_src=10.254.4.1 action=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0xce730,reg1=0xa40d05 action=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0xa40d05,reg1=0xa40d05 action=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x737782 action=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=108076, n_bytes=9242321, priority=50,reg1=0xff02b action=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=1998, n_bytes=348855, priority=60, reg1=0x82cb64 actions=output:NXM_NX_REG2[]
cookie=0x0 table=80, n_packets=1084, n_bytes=80096, priority=0 action=drop


The second and third rules listed here seem to allow traffic from 0xce730 (netid for my routers project, which is not the default one) to 0xa40d05 (netid for my app project), and from 0xa40d05 to 0xa40d05.
Note that when reproducing on OKD, my routers are in the default namespace / 0x0. The customer setup is a little more complicated, although I doubt that is relevant here.

Anyway, when generating traffic from my routers, I can see the drop rule counters being incremented, while the counters of the second rule stay at 0.
Generating traffic within my namespace, the third rule properly matches packets. Our issue only involves cross-namespace communications.
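
As a minimal sketch of how the counters above can be read, the following Python snippet parses three of the quoted table 80 flows and decodes the register values. It assumes, following the reading given above, that reg0 holds the source netid and reg1 the destination netid; the helper names `_field` and `parse_flow` are made up for this illustration.

```python
import re

# Three of the table 80 flows quoted above, copied verbatim from the dump.
FLOWS = [
    "cookie=0x0 table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0xce730,reg1=0xa40d05 action=output:NXM_NX_REG2[]",
    "cookie=0x0 table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0xa40d05,reg1=0xa40d05 action=output:NXM_NX_REG2[]",
    "cookie=0x0 table=80, n_packets=1084, n_bytes=80096, priority=0 action=drop",
]


def _field(line, name, base=10):
    """Return the integer value of `name=value` in a flow line, or None."""
    match = re.search(rf"\b{name}=(\w+)", line)
    return int(match.group(1), base) if match else None


def parse_flow(line):
    """Summarise one flow: priority, hit count, and the netids in reg0/reg1."""
    return {
        "priority": _field(line, "priority"),
        "packets": _field(line, "n_packets"),
        "src_netid": _field(line, "reg0", 16),  # assumed: reg0 = source netid
        "dst_netid": _field(line, "reg1", 16),  # assumed: reg1 = destination netid
        "drop": line.rstrip().endswith("drop"),
    }


for flow in map(parse_flow, FLOWS):
    print(flow)
```

A cross-namespace allow rule with zero packets next to a drop rule whose counter keeps growing suggests the traffic is not arriving with the source netid the rule expects.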


I'm pretty sure that setup should work - or at least it would have, had I deployed my cluster with networkpolicy to begin with.


Any help would be much appreciated.


Thanks,

Regards.

Comment 2 Samuel 2019-09-03 14:15:02 UTC

I see a needinfo flag: is there anything I can provide you with? What do you need?

thanks

Comment 3 Casey Callendrello 2019-09-03 14:31:14 UTC

I can't discuss customer cases in public bugzilla comments - please ensure you have the correct account permissions.

Comment 6 Samuel 2019-09-03 14:47:06 UTC

As a partner, I don't see private messages here.

I would see them on https://access.redhat.com/support/cases/#/case/02458754 though.

Thanks

Comment 7 David Caldwell 2019-09-04 10:17:51 UTC

The cluster has been upgraded to 3.11.135. Same issue is present.

Comment 8 Weibin Liang 2019-09-05 13:05:56 UTC

I reproduced the same issue in 3.11.98 when deploying a networkpolicy cluster directly.
I will work with jtanenba to debug this issue.

Comment 10 Weibin Liang 2019-09-05 14:26:03 UTC

Tested in the latest v3.11.142 cluster with networkpolicy installed directly; I see the same failure.

So the issue is not about migrating from multitenant to networkpolicy.

I checked other simple networkpolicy tests, and they passed.

It looks like the namespace selector in networkpolicy does not function correctly.

Comment 12 Weibin Liang 2019-09-05 14:41:09 UTC

(In reply to Weibin Liang from comment #10)

Correction: it looks like namespace + pod selector in networkpolicy does not function correctly.

Comment 14 Weibin Liang 2019-09-05 15:26:30 UTC

(In reply to Weibin Liang from comment #12)

namespace + pod selector was added after 3.11, but namespace selector should work in 3.11

Comment 16 Weibin Liang 2019-09-05 20:14:36 UTC

Please disregard my comments 10 and 12; sorry for the confusion.

The latest test results show that the problem only happens when migrating from multitenant to networkpolicy (tested in v3.11.141).

Comment 17 zhaozhanqi 2019-09-06 02:02:26 UTC

namespace + pod selector should be added from 4.0, so it should be worked in 3.11 I remember.

Comment 18 zhaozhanqi 2019-09-06 02:03:21 UTC

(In reply to zhaozhanqi from comment #17)
> namespace + pod selector should be added from 4.0, so it should be worked in
> 3.11 I remember.

sorry, typo 'it should NOT be worked in 3.11'

Comment 20 Samuel 2019-09-06 10:22:45 UTC

(In reply to zhaozhanqi from comment #17)
> namespace + pod selector should be added from 4.0, so it should be worked in
> 3.11 I remember.



Indeed, the following snippet, taken from the k8s docs, won't work on OCP 3.x:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-from-default-namespace
spec:
  podSelector:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
      podSelector:
        matchLabels:
          name: pod-in-default-namespace


And I think that's what our doc means when it says podSelector cannot be used in combination with namespaceSelector.

Now, I'm pretty sure the following should work, as it is mentioned in 3.11 docs - and I already used it with other customers running 3.10+:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-from-default-namespace
spec:
  podSelector:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default


Regards.

Comment 21 David Caldwell 2019-09-06 10:35:01 UTC

@Samuel
zhaozhanqi corrected that to 'it should NOT be worked in 3.11'

Comment 22 Samuel 2019-09-06 11:17:45 UTC

Understood. I just want to make sure it is clear that the sample we're having issues with should work.

Despite using podSelector alongside namespaceSelector, this is a case that should work in 3.11.

Comment 24 Brian J. Beaudoin 2019-09-07 00:28:11 UTC

This is working for me as expected in OpenShift 3.11.141. The API schema in Comment #20 is incorrect. The `spec` object should look like this:

spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
    - podSelector:
        matchLabels:
          name: pod-in-this-namespace

From what I understand, the "from" rules are not evaluated as "AND" but as "OR". The podSelector is not related to the namespaceSelector, so "pod-in-default-namespace" is really "pod-in-this-namespace". Either the namespaceSelector or the podSelector may match; the first rule that matches in the from block will allow the traffic.

The documentation states "ingress: Each NetworkPolicy may include a list of whitelist ingress rules."
https://kubernetes.io/docs/concepts/services-networking/network-policies/#the-networkpolicy-resource


Blocks evaluated together (AND logic), within a single ingress rule:

- from
- ports

Blocks evaluated separately (OR logic), as entries of a from list:

- ingress.ipBlock
- ingress.namespaceSelector
- ingress.podSelector
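
The OR evaluation of `from` entries can be sketched in a few lines of Python. This is a simplified model, not the openshift-sdn implementation: selectors are flattened to plain label dictionaries (standing in for `matchLabels`), ports are ignored, and an entry combining namespaceSelector and podSelector is deliberately left out because, per this thread, that form is not expected to work on 3.11. All function names are hypothetical, and the `name=default` label on the default namespace is an assumption.

```python
def selector_matches(selector, labels):
    """An empty selector ({}) matches anything; otherwise all pairs must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_allows(peer, policy_ns, src_ns, src_ns_labels, src_pod_labels):
    """Evaluate a single entry of an ingress `from` list."""
    if "namespaceSelector" in peer:
        # Selects every Pod in the namespaces carrying these labels.
        return selector_matches(peer["namespaceSelector"], src_ns_labels)
    if "podSelector" in peer:
        # A bare podSelector only looks at Pods in the policy's own namespace.
        return src_ns == policy_ns and selector_matches(peer["podSelector"], src_pod_labels)
    return False


def ingress_allows(peers, policy_ns, src_ns, src_ns_labels, src_pod_labels):
    """Entries of `from` are ORed: a single matching peer admits the traffic."""
    return any(
        peer_allows(peer, policy_ns, src_ns, src_ns_labels, src_pod_labels)
        for peer in peers
    )


# The two peers of the allow-from-default-namespace policy in this comment.
PEERS = [
    {"namespaceSelector": {"name": "default"}},
    {"podSelector": {"router": "foo"}},
]

# Router Pod in `default`; the namespace label name=default is assumed here.
print(ingress_allows(PEERS, "openshift-console", "default",
                     {"name": "default"}, {"router": "router"}))
# Pod from an unlabeled namespace: no peer matches.
print(ingress_allows(PEERS, "openshift-console", "openshift-web-console",
                     {}, {"webconsole": "true"}))
```

In this model the router Pod is admitted by the namespaceSelector entry alone, even though its labels do not match `router: foo`.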

I tested the following in the openshift-console project to show the podSelector, when specified properly, does not block traffic and was treated with "OR" when the namespace was defined. (Note, the podSelector is treated as being within the current namespace, not within the aforementioned namespace).

apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  creationTimestamp: 2019-09-06T22:15:23Z
  generation: 15
  name: allow-from-default-namespace
  namespace: openshift-console
  resourceVersion: "365202"
  selfLink: /apis/extensions/v1beta1/namespaces/openshift-console/networkpolicies/allow-from-default-namespace
  uid: d28fff2c-d0f3-11e9-94eb-000c2924178d
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
    - podSelector:
        matchLabels:
          router: foo
    ports:
    - port: 8443
      protocol: TCP
  podSelector:
    matchLabels:
      app: openshift-console
  policyTypes:
  - Ingress
[cloud-user@master1 ~]$ oc get -o yaml networkpolicy
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: 2019-09-06T22:15:23Z
    generation: 15
    name: allow-from-default-namespace
    namespace: openshift-console
    resourceVersion: "365202"
    selfLink: /apis/extensions/v1beta1/namespaces/openshift-console/networkpolicies/allow-from-default-namespace
    uid: d28fff2c-d0f3-11e9-94eb-000c2924178d
  spec:
    ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: default
      - podSelector:
          matchLabels:
            router: foo
      ports:
      - port: 8443
        protocol: TCP
    podSelector:
      matchLabels:
        app: openshift-console
    policyTypes:
    - Ingress
- apiVersion: extensions/v1beta1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: 2019-09-06T22:15:58Z
    generation: 1
    name: allow-same-namespace
    namespace: openshift-console
    resourceVersion: "359411"
    selfLink: /apis/extensions/v1beta1/namespaces/openshift-console/networkpolicies/allow-same-namespace
    uid: e6f77d15-d0f3-11e9-94eb-000c2924178d
  spec:
    ingress:
    - from:
      - podSelector: {}
    podSelector: {}
    policyTypes:
    - Ingress
- apiVersion: extensions/v1beta1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: 2019-09-06T22:17:54Z
    generation: 1
    name: deny-by-default
    namespace: openshift-console
    resourceVersion: "359617"
    selfLink: /apis/extensions/v1beta1/namespaces/openshift-console/networkpolicies/deny-by-default
    uid: 2c510d23-d0f4-11e9-94eb-000c2924178d
  spec:
    podSelector: {}
    policyTypes:
    - Ingress
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Simple testing:

[cloud-user@master1 ~]$ oc get pods -n default -l router=router
NAME             READY     STATUS    RESTARTS   AGE
router-1-d2rkq   1/1       Running   0          2d

[cloud-user@master1 ~]$ oc get pods -n default -l router=foo
No resources found.

[cloud-user@master1 ~]$ oc -n default rsh dc/router curl --insecure --head https://console.openshift-console.endpoints:8443
HTTP/1.1 200 OK
[...]

[cloud-user@master1 ~]$ oc -n default rsh dc/registry-console curl --insecure --head https://console.openshift-console.endpoints:8443
HTTP/1.1 200 OK
[...]

[cloud-user@master1 ~]$ oc get -n default -o jsonpath='{.metadata.labels}' pod router-1-d2rkq
map[deployment:router-1 deploymentconfig:router router:router]

[cloud-user@master1 ~]$ oc get -n default -o jsonpath='{.metadata.labels}' pod registry-console-1-v9275
map[deployment:registry-console-1 deploymentconfig:registry-console name:registry-console]


Negative testing with confirmation, also showing the effect of labeling the source namespace:

[cloud-user@master1 ~]$ oc -n openshift-web-console rsh deployment/webconsole curl --insecure --head https://console.openshift-console.endpoints:8443 --verbose
* About to connect() to console.openshift-console.endpoints port 8443 (#0)
*   Trying 10.128.0.164...
* Connection timed out
* Failed connect to console.openshift-console.endpoints:8443; Connection timed out
* Closing connection 0
curl: (7) Failed connect to console.openshift-console.endpoints:8443; Connection timed out
command terminated with exit code 7

[cloud-user@master1 ~]$ oc label namespace openshift-web-console name=default
namespace/openshift-web-console labeled

[cloud-user@master1 ~]$ oc -n openshift-web-console rsh deployment/webconsole curl --insecure --head https://console.openshift-console.endpoints:8443
HTTP/1.1 200 OK
[...]

[cloud-user@master1 ~]$ oc get -n openshift-web-console -o jsonpath='{.metadata.labels}' pod webconsole-7f7f679596-zngg5
map[app:openshift-web-console pod-template-hash:3939235152 webconsole:true]

Comment 25 Samuel 2019-09-09 10:22:03 UTC

So, thanks to @bbeaudoi's suggestion on rocketchat, ...

Now that I start my routers without hostnetwork, my initial networkpolicies work just fine, as expected, without requiring any labels on the default namespace.


So, basically, when a Pod using hostnetwork reaches something in the SDN, OVS considers that its traffic comes from an unknown netid, which translates to netid 0.


Now, say you:
- deploy a router (or any other hostnetwork-based Pod) in a Project whose netid is non-0
- want to set up cross-namespace networkpolicies allowing that hostnetwork-based Pod to communicate with some application

Then there is no use in labeling the router Pod namespace.
Instead, we should label the default namespace, or any namespace whose netid is 0. Which isn't really intuitive, ...
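
The behaviour described above can be written down as a minimal Python sketch. It only models the observation made in this comment (hostnetwork traffic shows up as netid 0) and is not taken from the openshift-sdn code; `source_netid` and `admitted` are hypothetical helper names, and the routers netid is the one quoted in the bug description.

```python
DEFAULT_NETID = 0  # netid of the default namespace


def source_netid(project_netid, host_network):
    """Netid that OVS attributes to a Pod's traffic, as described above."""
    return DEFAULT_NETID if host_network else project_netid


def admitted(allowed_netids, project_netid, host_network):
    """True if a policy admitting `allowed_netids` lets this Pod in."""
    return source_netid(project_netid, host_network) in allowed_netids


ROUTERS_NETID = 0xCE730  # routers project netid from the bug description

# Policy selecting the routers namespace: the hostnetwork router is not matched.
print(admitted({ROUTERS_NETID}, ROUTERS_NETID, host_network=True))
# Policy selecting a netid 0 namespace: the same router is matched.
print(admitted({DEFAULT_NETID}, ROUTERS_NETID, host_network=True))
# Without hostnetwork, selecting the routers namespace works.
print(admitted({ROUTERS_NETID}, ROUTERS_NETID, host_network=False))
```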


Now, say we deploy several routers in different Projects, trying to segregate traffic using networkpolicies and host projects in "groups" that should not, in any way, see each other.
Then we should get rid of hostNetwork.
An alternative could be based on NodePort Services, plus EgressNetworkPolicies preventing Pods in our SDN from bypassing customers' corporate firewalls.


Not sure there is an actual bug, then. Although that's something our docs could clarify, hopefully.



Thanks again to Brian

Comment 26 Jacob Tanenbaum 2019-09-09 15:03:28 UTC

Agreed, this isn't a bug; we should update our docs.

Comment 27 Jason Boxman 2019-09-09 15:15:56 UTC

I've cloned this bug and created a docs bug for this issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1750429

Thanks.

Comment 28 Samuel 2019-09-09 16:08:46 UTC

I might have been too soft. And sure, we can fix the doc, at the very least.


But then again, we found here that Pods using hostnetwork enter the SDN with netid 0, regardless of their Project.
How isn't this a bug?!


Consider the following:

os_sdn_network_plugin_name: redhat/openshift-ovs-networkpolicy
openshift_additional_projects:
  routers-mgmt:
    default_node_selector: environment=mgmt
  routers-dev:
    default_node_selector: environment=dev
  routers-prod:
    default_node_selector: environment=prod
  routers-stage:
    default_node_selector: environment=stage
openshift_hosted_routers:
- name: routers-mgmt
  certificate:
    certfile: xxx
    keyfile: xxx
    cafile: xxx
  replicas: "{{ groups['ingress-mgmt'] | length }}"
  serviceaccount: router
  namespace: routers-mgmt
  edits:
  - action: append
    key: spec.template.spec.containers[0].env
    value:
      name: ROUTE_LABELS
      value: environment=mgmt
  selector: environment=mgmt,node-role.kubernetes.io/ingress=true
[...]
- name: routers-dev
[...]
- name: routers-stage
[...]
- name: routers-prod
[...]



The customer was expecting its production routers to have exclusive access to prod apps, stage to stage, ... and so on.
And I'm just here deploying the cluster; there have been enough architects already coming up with this, ...


Now, regardless of how many router projects we set up, everything goes through netid 0.
networkpolicies matching labels on routers project won't work
networkpolicies matching labels on default namespace would open access for any hostnetwork Pod.
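
To make the consequence for the four-router layout concrete, here is a small Python sketch. The netids assigned to the router projects are invented for the example (the real values are not given in this report), and the rule that hostNetwork traffic is seen as netid 0 is the behaviour observed in this thread, not something read from the SDN source.

```python
# Hypothetical netids for the router projects from the inventory above.
ROUTER_PROJECTS = {
    "routers-mgmt": 0x1001,
    "routers-dev": 0x1002,
    "routers-stage": 0x1003,
    "routers-prod": 0x1004,
}


def seen_netid(project_netid, host_network):
    """hostNetwork traffic is attributed to netid 0, whatever the project."""
    return 0 if host_network else project_netid


def routers_admitted(allowed_netids, host_network):
    """Names of the router projects whose Pods a policy would let in."""
    return sorted(
        name
        for name, netid in ROUTER_PROJECTS.items()
        if seen_netid(netid, host_network) in allowed_netids
    )


prod_only = {ROUTER_PROJECTS["routers-prod"]}
print(routers_admitted(prod_only, host_network=True))   # policy on routers-prod
print(routers_admitted({0}, host_network=True))         # policy on netid 0
print(routers_admitted(prod_only, host_network=False))  # routers off hostNetwork
```

With hostNetwork, the policy either admits none of the routers or all of them; only without hostNetwork does the prod policy single out the prod routers.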


There's no doc fixing that could excuse this. There is a bug to be fixed.
OVS needs to know which namespace is sending traffic, regardless of how Pods are configured.
Otherwise, networkpolicies are just useless.

Comment 29 Weibin Liang 2019-09-09 20:15:00 UTC

Created attachment 1613330 [details]
Testing passed

Comment 30 Weibin Liang 2019-09-09 20:15:55 UTC

Created attachment 1613331 [details]
Testing failed

Comment 31 Weibin Liang 2019-09-09 20:16:33 UTC

Hi Samuel,

I ran the same networkpolicy testing steps in two clusters: one with openshift-ovs-networkpolicy installed from the beginning, and one migrated from multitenant to openshift-ovs-networkpolicy. Curl from a router to an application pod fails in the migrated cluster.

Could you check my two attached logs to see if my testing steps are similar to the ones our customer used? Thanks!

Comment 33 Samuel 2019-09-10 05:46:17 UTC

Indeed, that's interesting.

Last week, I would have said yes, that's pretty much what we did with the customer.
And that's exactly how I reproduced that issue on my own OKD.


Now, as of last weekend, I think there's something more.
Actually, the customer is not using the default namespace.
Our routers are in dedicated Projects (routers-dev, routers-stage, ...) which have their own netid.
And it turns out that when a Pod uses hostNetwork, then OVS assumes that traffic belongs to netid 0, regardless of its namespace.
So installing a policy in projX allowing traffic from routerX won't work. While installing a policy in projX allowing traffic from the default namespace would allow in traffic from any hostnetwork Pod (all my routers, but also etcd, ...)


Regarding the issue you've reproduced, I'm still unsure how to solve it.
Best I could say, is try and reboot everything. Somehow, we went through this here. I couldn't tell how for sure.

But with your routers in the default namespace, they already belong to netid 0.
Policies matching a label on your default namespace should then appear to work, until you try setting up routers in non-default namespaces.
As of yesterday, our networkpolicies work "as expected": once we redeployed our routers without hostnetwork and exposed them with a NodePort Service, all our policies started working as planned.


Thanks for looking it up,

Regards.

Comment 38 Ben Bennett 2020-05-08 18:37:28 UTC

Closing because there is a docs bug tracking it https://bugzilla.redhat.com/show_bug.cgi?id=1750429