Bug 1897073
Summary: | [OCP 4.5] wrong netid assigned to OpenShift projects/namespaces | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Angelo Gabrieli <agabriel> |
Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
Networking sub component: | openshift-sdn | QA Contact: | Arti Sood <asood> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | low | |
Priority: | low | CC: | anusaxen, bbennett, javier.ordax, jdesousa, jechen |
Version: | 4.5 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause:
N/A
Consequence:
OpenShift SDN incorrectly logged "unable to allocate netid 1: provided netid is not in the valid range" for namespaces with netid 1.
Fix:
Don't log anything for netids below 10 (the reserved VNID range).
Result:
The line is no longer logged.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:32:41 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Angelo Gabrieli
2020-11-12 08:54:56 UTC
> The cluster looks to be installed with the default NetworkPolicy plugin and with no NetworkPolicy rules:

No, it really looks like it's installed with openshift-sdn in Multitenant mode, in which case this behavior is expected.

> 2020-09-10T10:37:48.024219635Z E0910 10:37:48.024139       1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range

(We should fix this error message, though. It shouldn't be logging that.)

Hi Dan,

Exactly my thoughts. I spoke with Angelo about this and it is effectively Multitenant:

```yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-02-20T15:21:10Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 9fd5b0f2-53f4-11ea-9964-005056ba4c5e
spec:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      mode: Multitenant
      mtu: 1450
      vxlanPort: 4789
    type: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
```

I have a PR to fix the error message.

Angelo, regarding how to solve the communication issues, see:
https://docs.openshift.com/container-platform/4.5/networking/openshift_sdn/multitenant-isolation.html#nw-multitenant-joining_multitenant-isolation

Except for the error message, everything is expected behavior.

Hi @jdesousa, where did you get that network configuration from? During the live session we had with the customer, support asked the customer to gather the network configuration, and Multitenant mode is not there. The terminal output was exported and is attached to the Red Hat case.

```console
oc get network cluster -o yaml
```

```yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-02-20T15:21:10Z"
  generation: 2
  name: cluster
  resourceVersion: "1827"
  selfLink: /apis/config.openshift.io/v1/networks/cluster
  uid: 9fb72218-53f4-11ea-9964-005056ba4c5e
spec:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
status:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
```

Sorry, I have just realized we are not checking the same object. I will review again.

That YAML output was provided by Angelo Gabrieli in a private conversation in Slack about 2 hours ago.

Steps to reproduce:

1. Create a 4.5.22 cluster in Multitenant mode with the flexy template
   https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_5/ipi-on-aws/versioned-installer-multitenant-ci

   ```console
   oc version
   Client Version: 4.6.6
   Server Version: 4.5.22
   Kubernetes Version: v1.18.3+616db59

   oc get clusternetwork
   NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
   default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

   oc get network.operator -o yaml | grep mode
         f:mode: {}
       mode: Multitenant
   ```

2. Create a new project and a pod in the new project:

   ```console
   oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/pod-for-ping.json
   ```

3. Check the logs of all the sdn-controller-* pods in the openshift-sdn project for the error messages (a helper loop is sketched below).
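For step 3, a loop like the following checks every controller pod at once. This is a convenience sketch, not part of the original report; it assumes the sdn-controller pods carry the `app=sdn-controller` label. The reporter's actual output follows.

```sh
# Sketch: grep each sdn-controller pod's log for the spurious error.
# Assumes the pods are labeled app=sdn-controller; verify with
# `oc -n openshift-sdn get pods --show-labels` first.
for pod in $(oc -n openshift-sdn get pods -l app=sdn-controller -o name); do
  echo "== ${pod}"
  oc -n openshift-sdn logs "${pod}" | grep 'unable to allocate'
done
```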
```console
oc logs sdn-controller-7tbwk
oc logs sdn-controller-tmq6x | grep 'unable to allocate'
E1209 13:52:24.752288       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.752315       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.753518       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.754645       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.755819       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.756887       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.758044       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.759192       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.760360       1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
```

On 4.7 (registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-12-09-112139) I am unable to reproduce the errors from the sdn-controller pods.

Using the steps above, asood was able to reproduce the bug on 4.5, and then I was able to use the same steps to verify that the bug is no longer present in 4.7.

```console
❯ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.dbrahane-4-7-1209-multitenant.qe.devcluster.openshift.com:6443".
❯ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
sdn-controller-7r85b   1/1     Running   0          164m
sdn-controller-b6n95   1/1     Running   0          164m
sdn-controller-rcwql   1/1     Running   0          164m
❯ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/pod-for-ping.json
❯ oc logs sdn-controller-b6n95 | grep "unable to allocate"
❯ oc logs sdn-controller-rcwql | grep "unable to allocate"
❯ oc logs sdn-controller-7r85b | grep "unable to allocate"
```

Hi, the target release for this bug is 4.7, so I think this should be verified. The fix was merged in master last week, so it is expected that the problem is still reproducible in 4.5 and 4.6.

Comment #8 is the reproduction on 4.5.22 and comment #9 is the verification on 4.7. Still waiting for the fix in 4.5 and 4.6.

Marking it verified as per comment #9.

Hi Arti,

We normally don't backport anything unless it has an impact. Is there any real impact? Has any customer requested it? Don't get me wrong: if someone needs the backport I am happy to do it, but I don't think it has any actual impact, because the only customer that I know has complained about it is already aware that they can ignore it.

Hi Juan,

I discussed this with Anurag, and as per your comment there is no actual impact known for now, but we are not sure about the performance impact of this logging in a large-scale cluster at this point. We can mark it verified for now and open a new bug for prior releases if necessary.

Hi Arti,

It's just printing 7 lines (once per netnamespace with netid 1) every time a project is created and when the cluster is started. I don't think it's worth investigating. Anyway, if you still think it may be a problem, I'll do an automatic cherry-pick; the change is not complex enough to be worth anybody's time checking whether it can actually decrease performance.

Hi Juan,

Let's do it. As I found out, there may be some customers who continue to use 4.5 and 4.6. Thank you!

Thanks for the clarification, Juan, and thanks for verifying it, Arti and Dan.
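As context for the "7 lines (once per netnamespace with netid 1)" estimate above: on an affected Multitenant cluster you can list exactly which NetNamespaces carry netid 1, and hence how many of these lines to expect. This is a sketch, assuming the default `oc get netnamespaces` column layout (NAME, NETID, EGRESS IPS):

```sh
# Print NetNamespaces whose NETID column is 1; each one produces one of
# the spurious "unable to allocate netid 1" lines at controller startup
# and whenever a project is created.
oc get netnamespaces --no-headers | awk '$2 == 1 {print $1}'
```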
Agree with Juan: this is not recurring; it's just 7 lines at the start, when the namespaces with netid 1 get created during cluster bring-up (in Multitenant mode). Juan, I am not sure whether customers like Verizon are using Multitenant mode in 4.6; if not, we may not need the backport.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633