Description of problem:

An OpenShift 4.5.7 cluster shows several projects/namespaces with `netid` 0 or 1:

$ oc get netnamespace
NAME                                               NETID      EGRESS IPS
default                                            0
int-centrodesarrollo-test1                         8524677
int-ci-gen                                         4117769
int-cs-gen                                         16362159
int-ga-gen                                         2119115
int-in-gen                                         3605134
int-iv-gen                                         4778414
int-ps-gen                                         2483765
int-tg-gen                                         11557281
kafkaop                                            16478583
knative-eventing                                   9187314
kube-node-lease                                    15505980
kube-public                                        15191955
kube-system                                        1
n14                                                12880083
nfsprovider                                        6780911
ns1                                                4058331
ns6                                                9196549
openshift                                          15365180
openshift-ansible-service-broker                   1
openshift-apiserver                                1
openshift-apiserver-operator                       141009
openshift-authentication                           1
openshift-authentication-operator                  1
openshift-cloud-credential-operator                4980348
openshift-cluster-machine-approver                 3125753
openshift-cluster-node-tuning-operator             6450632
openshift-cluster-samples-operator                 9557112
openshift-cluster-storage-operator                 12893962
openshift-cluster-version                          8311817
openshift-config                                   8959015
openshift-config-managed                           2853628
openshift-config-operator                          3491444
openshift-console                                  14759098
openshift-console-operator                         8578226
openshift-controller-manager                       1945164
openshift-controller-manager-operator              8017439
openshift-dns                                      0
openshift-dns-operator                             2204318
openshift-etcd                                     1
openshift-etcd-operator                            9199636
openshift-image-registry                           0
openshift-infra                                    1638378
openshift-ingress                                  0
openshift-ingress-operator                         5734481
openshift-insights                                 9425718
openshift-kni-infra                                14792119
openshift-kube-apiserver                           0
openshift-kube-apiserver-operator                  5610822
openshift-kube-controller-manager                  2745533
openshift-kube-controller-manager-operator         13642800
openshift-kube-scheduler                           13717797
openshift-kube-scheduler-operator                  14716170
openshift-kube-storage-version-migrator            4753197
openshift-kube-storage-version-migrator-operator   6148817
openshift-logging                                  4788062
openshift-machine-api                              3871079
openshift-machine-config-operator                  651516
openshift-marketplace                              653779
openshift-monitoring                               0
openshift-multus                                   15210559
openshift-network-operator                         1074590
openshift-node                                     13725627
openshift-openstack-infra                          1587789
openshift-operator-lifecycle-manager               0
openshift-operators                                3839779
openshift-operators-redhat                         15653470
openshift-ovirt-infra                              666156
openshift-pipelines                                5088107
openshift-sdn                                      6504235
openshift-service-ca                               2977171
openshift-service-ca-operator                      11833620
openshift-service-catalog-apiserver                1
openshift-service-catalog-controller-manager       1
openshift-service-catalog-removed                  16192306
openshift-template-service-broker                  1
openshift-user-workload-monitoring                 0
openshift-vsphere-infra                            558532
test123                                            3502734
test3                                              3626688
testelastic                                        16607574
testlog                                            184178

This is breaking the network communication between projects/namespaces. A curl from a pod in `testelastic` to a pod IP in `openshift-logging` hangs and has to be interrupted:

$ oc project openshift-logging
Now using project "openshift-logging" on server "https://api.xxx.xxx:6443".
$ oc get pods
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-logging-operator-dbf7dd956-8kgbl   1/1     Running   0          7h14m
test-5f464fd59-bx5wv                       1/1     Running   0          7h14m
$ oc get pods -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE      NOMINATED NODE   READINESS GATES
cluster-logging-operator-dbf7dd956-8kgbl   1/1     Running   0          7h14m   150.129.2.8    gxxx751   <none>           <none>
test-5f464fd59-bx5wv                       1/1     Running   0          7h14m   150.129.2.12   gxxx751   <none>           <none>
$ oc project testelastic
Now using project "testelastic" on server "https://api.xxx.xxx:6443".
$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
test-5f464fd59-4gz2x   1/1     Running   0          7h14m
$ oc rsh test-5f464fd59-4gz2x
$ bash
groups: cannot find name for group ID 1000660000
1000660000@test-5f464fd59-4gz2x:/app$ curl 150.129.2.12:8081
^C
1000660000@test-5f464fd59-4gz2x:/app$ exit
exit
$ exit
command terminated with exit code 130
$

Logs from the SDN controller pods:

~~~
2020-09-08T17:46:07.106335261Z E0908 17:46:07.106267 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.106335261Z E0908 17:46:07.106310 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.110851209Z E0908 17:46:07.110790 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.111932021Z E0908 17:46:07.111814 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.113482715Z E0908 17:46:07.113449 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.114645345Z E0908 17:46:07.114606 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.115780102Z E0908 17:46:07.115749 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.11688395Z E0908 17:46:07.116837 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:46:07.119312517Z E0908 17:46:07.118858 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
...
2020-09-08T17:12:05.780595325Z E0908 17:12:05.780526 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.780595325Z E0908 17:12:05.780567 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.782082454Z E0908 17:12:05.781977 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.783292856Z E0908 17:12:05.783260 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.784389495Z E0908 17:12:05.784340 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.785480932Z E0908 17:12:05.785434 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.786629329Z E0908 17:12:05.786589 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.787882832Z E0908 17:12:05.787646 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-08T17:12:05.788770108Z E0908 17:12:05.788733 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
...
2020-09-10T10:37:48.024219635Z E0910 10:37:48.024139 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.024219635Z E0910 10:37:48.024182 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.025852587Z E0910 10:37:48.025693 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.027191166Z E0910 10:37:48.027158 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.028888314Z E0910 10:37:48.028852 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.030073973Z E0910 10:37:48.030043 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.032319094Z E0910 10:37:48.032272 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.034038221Z E0910 10:37:48.033983 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
2020-09-10T10:37:48.036703562Z E0910 10:37:48.036582 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range
~~~

The cluster looks installed with the NetworkPolicy default plugin and with no NetworkPolicy rules:

$ oc get networkpolicy -A
No resources found
$

Version-Release number of selected component (if applicable):

OpenShift 4.5.7
SDN pod image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:14af12f0b04ffdce223dd61d4f9ea68380c15fc6ccf6b0970a1e6647ad29f07a

How reproducible:

The customer experienced this issue after upgrading from 4.3 to 4.4 and finally to 4.5 (the cluster was originally installed as 4.2).

Steps to Reproduce:
1.
2.
3.

Actual results:

Network communication is broken between projects/namespaces.

Expected results:

Network communication is allowed between projects/namespaces.

Additional info:
> The cluster looks installed with NetworkPolicy default plugin and with no NetworkPolicy rules:

No, it really looks like it's installed with openshift-sdn in Multitenant mode, in which case this behavior is expected.

> 2020-09-10T10:37:48.024219635Z E0910 10:37:48.024139 1 vnids.go:292] unable to allocate netid 1: provided netid is not in the valid range

(We should fix this error message, though. It shouldn't be logging that.)
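For reference, a quick way to confirm which openshift-sdn mode a cluster is actually running. This is a minimal sketch; the same checks appear in the reproduction steps later in this bug:

~~~
# The legacy ClusterNetwork object reports the plugin name directly;
# "redhat/openshift-ovs-multitenant" indicates Multitenant mode.
oc get clusternetwork default

# The network operator configuration carries the mode explicitly
# under spec.defaultNetwork.openshiftSDNConfig.mode.
oc get network.operator.openshift.io cluster -o yaml | grep -A 4 openshiftSDNConfig
~~~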
Hi Dan,

Exactly my thoughts. I spoke with Angelo about this and it's effectively Multitenant:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-02-20T15:21:10Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 9fd5b0f2-53f4-11ea-9964-005056ba4c5e
spec:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      mode: Multitenant
      mtu: 1450
      vxlanPort: 4789
    type: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16

I have a PR to fix the error message.
Angelo,

Regarding how to solve the communication issues, see:

https://docs.openshift.com/container-platform/4.5/networking/openshift_sdn/multitenant-isolation.html#nw-multitenant-joining_multitenant-isolation

Except for the error message, everything here is expected behavior.
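For completeness, a minimal sketch of the commands from that doc page, using the two project names from the description as the example (note that `oc adm pod-network` only has an effect with the Multitenant plugin):

~~~
# Join the pod networks of the two projects so their pods can reach each other:
oc adm pod-network join-projects --to=openshift-logging testelastic

# Alternatively, make a project global so it can communicate with all projects:
oc adm pod-network make-projects-global openshift-logging
~~~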
Hi,

@jdesousa, where did you get that network configuration from? During the live session we had with the customer, support asked the customer to gather the network configuration, and Multitenant mode does not appear there. The terminal output was exported and is attached to the Red Hat case.

oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-02-20T15:21:10Z"
  generation: 2
  name: cluster
  resourceVersion: "1827"
  selfLink: /apis/config.openshift.io/v1/networks/cluster
  uid: 9fb72218-53f4-11ea-9964-005056ba4c5e
spec:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
status:
  clusterNetwork:
  - cidr: 150.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
Sorry, I have just realized we are not checking the same object. I will review again.
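For anyone else following along: the two comments above queried different resources. There is a `Network` object named `cluster` in both the config and the operator API groups, and only the operator one carries the SDN mode. A minimal sketch of how to look at both:

~~~
# User-facing cluster network configuration (has no openshift-sdn mode field):
oc get network.config.openshift.io cluster -o yaml

# Operator-level configuration, which includes spec.defaultNetwork.openshiftSDNConfig.mode:
oc get network.operator.openshift.io cluster -o yaml
~~~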
That yaml output was provided by Angelo Gabrieli in a private conversation in Slack about 2 hours ago.
Steps to reproduce:

1. Create a 4.5.22 cluster in Multitenant mode with the flexy template:
https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_5/ipi-on-aws/versioned-installer-multitenant-ci

oc version
Client Version: 4.6.6
Server Version: 4.5.22
Kubernetes Version: v1.18.3+616db59

oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

oc get network.operator -o yaml | grep mode
          f:mode: {}
      mode: Multitenant

2. Create a new project and a pod in the new project:

oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/pod-for-ping.json

3. Check the logs of all the sdn-controller-* pods in the openshift-sdn project for the error messages:

oc logs sdn-controller-7tbwk
oc logs sdn-controller-tmq6x | grep 'unable to allocate'
E1209 13:52:24.752288 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.752315 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.753518 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.754645 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.755819 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.756887 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.758044 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.759192 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
E1209 13:52:24.760360 1 vnids.go:298] unable to allocate netid 1: provided netid is not in the valid range
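To sweep the logs of every sdn-controller pod in one pass, something like this loop can be used. This is a sketch; it assumes the controller pods carry the `app=sdn-controller` label, which may differ by release:

~~~
for pod in $(oc -n openshift-sdn get pods -l app=sdn-controller -o name); do
  echo "== ${pod} =="
  oc -n openshift-sdn logs "${pod}" | grep 'unable to allocate netid'
done
~~~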
On 4.7 (registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-12-09-112139) I am unable to reproduce the errors from the sdn-controller pods. Using the following steps, asood was able to reproduce the bug on 4.5, and I was then able to use the same steps to verify that the bug is no longer present in 4.7.

❯ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.dbrahane-4-7-1209-multitenant.qe.devcluster.openshift.com:6443".
❯ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
sdn-controller-7r85b   1/1     Running   0          164m
sdn-controller-b6n95   1/1     Running   0          164m
sdn-controller-rcwql   1/1     Running   0          164m
❯ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/pod-for-ping.json
❯ oc logs sdn-controller-b6n95 | grep "unable to allocate"
❯ oc logs sdn-controller-rcwql | grep "unable to allocate"
❯ oc logs sdn-controller-7r85b | grep "unable to allocate"
Hi, the target release for this bug is 4.7, so I think this should be verified. The fix was merged in master last week, so it's expected that the problem is still reproducible in 4.5 and 4.6.
Comment #8 is the reproduction on 4.5.22 and comment #9 is the verification on 4.7. Still waiting for the fix in 4.5 and 4.6. Marking it verified as per comment #9.
Hi Arti,

We normally don't backport anything unless it has an impact. Is there any real impact? Has any customer requested it? Don't get me wrong, if someone needs the backport I'm happy to do it, but I don't think it has any actual impact, because the only customer I know of that has complained about it is already aware that they can ignore it.
Hi Juan,

I discussed this with Anurag. As per your comment, there is no known actual impact for now, but we are not sure about the performance impact of this logging in a large-scale cluster at this point. We can mark it verified for now and open a new bug for prior releases if necessary.
Hi Arti,

It's just printing 7 lines (once per netnamespace with netid 1) every time a project is created and when the cluster is started. I don't think it's worth investigating. Anyway, if you still think it may be a problem, I'll do an automatic cherry-pick; the fix is not complex enough to be worth anybody's time checking whether it can actually decrease performance.
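If it helps size the impact before deciding on a backport, a rough count of the message per controller pod can be taken like this (a sketch; `<sdn-controller-pod>` is a placeholder for one of the pod names shown earlier):

~~~
oc -n openshift-sdn logs <sdn-controller-pod> | grep -c 'unable to allocate netid 1'
~~~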
Hi Juan,

Let's do it. As I found out, there may be some customers who continue to use 4.5 and 4.6. Thank you!
Thanks for the clarification, Juan, and thanks for verifying it, Arti and Dan. I agree with Juan: this is not recurring, and it's just 7 lines at startup, when the namespaces with netid 1 get created during cluster bring-up (in Multitenant mode). Juan, I am not sure whether customers like Verizon are using Multitenant mode in 4.6; if not, we may not need the backport.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633