1915885 – Kuryr doesn't support workers running on multiple subnets

Summary: Kuryr doesn't support workers running on multiple subnets

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	All
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Michał Dulko
QA Contact:	GenadiC
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-01-13 16:09 UTC by Michał Dulko
Modified:	2021-02-24 15:53 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:52:44 UTC
Target Upstream Version:
Embargoed:

Attachments
kuryr test results after deploying worker on separate network (1.37 MB, application/gzip) 2021-02-01 15:58 UTC, rlobillo

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 949	None	closed	Bug 1915885: Kuryr: Support multiple nodes subnets	2021-02-09 12:26:30 UTC
Github	openshift kuryr-kubernetes pull 438	None	closed	Bug 1915885: Support multiple nodes subnets	2021-02-09 12:26:31 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:53:00 UTC

Description Michał Dulko 2021-01-13 16:09:48 UTC

Description of problem:
In 4.7, OCP on OSP gained the ability to run worker nodes on separate subnets [1]. Kuryr does not support this, because it assumes a single subnet for all the nodes of the OpenShift cluster.

[1] https://issues.redhat.com/browse/OSASINFRA-2087

Version-Release number of selected component (if applicable):


How reproducible: Always


Steps to Reproduce:
1. Add some nodes on a separate subnet as described here: https://issues.redhat.com/browse/OSASINFRA-2094?focusedCommentId=15355542&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15355542
2. Run some pods on those nodes.

Actual results:
The pods have no connectivity (most likely they get stuck in ContainerCreating because kuryr-cni is unable to connect them).

Expected results:
Nodes work normally.

Additional info:

Comment 2 rlobillo 2021-01-29 13:22:59 UTC

FailedQA on OCP4.7.0-0.nightly-2021-01-28-060516 over OSP13 (2021-01-20.1) with Amphora provider.

0. Install OCP cluster with IPI normally.

1. Create new subnet on machine network:
$ export CLUSTER_NAME=`jq -r .infraID ostest/metadata.json`
$ openstack subnet create --gateway 192.168.123.1 --subnet-range 192.168.123.0/24 --allocation-pool start=192.168.123.15,end=192.168.123.254 --network ${CLUSTER_NAME}-openshift --tag openshiftClusterID=${CLUSTER_NAME} ${CLUSTER_NAME}-additional
$ openstack router add subnet ${CLUSTER_NAME}-external-router ${CLUSTER_NAME}-additional

2. Edit infraID in security-groups-additional-network.yaml (https://gist.github.com/rlobillo/9a4b549d8ecfb8f7a0feae3fb2081e82) and run:
$ ansible-playbook security-groups-additional-network.yaml

# Note: From Martin's security-groups-additional-network.yaml, I needed to remove the --tag in the command "openstack security group set", due to "error: unrecognized arguments"

3. Edit <CLUSTER_NAME> in additional_subnet.yaml (https://gist.github.com/rlobillo/1540f8dc6790aa2c0e060e8bc49d29e9) and run:
$ oc apply -f additional_machineset.yaml

4. Wait until worker is created:

$ openstack server list
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+
| ID                                   | Name                                   | Status | Networks                              | Image              | Flavor    |
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+
| 201a4839-04c4-4cdd-b2b7-b3059693f035 | ostest-p7sj6-worker-0-additional-d4vt2 | ACTIVE | ostest-p7sj6-openshift=192.168.123.36 | ostest-p7sj6-rhcos | m4.xlarge |
| 4784e036-a860-4ced-9a8b-0d95549032d7 | ostest-p7sj6-worker-0-vwgtx            | ACTIVE | ostest-p7sj6-openshift=10.196.1.67    | ostest-p7sj6-rhcos | m4.xlarge |
| 736249c1-d2ad-47d0-9a45-6ddd2440587c | ostest-p7sj6-master-2                  | ACTIVE | ostest-p7sj6-openshift=10.196.2.54    | ostest-p7sj6-rhcos | m4.xlarge |
| aaf30364-5bda-4413-9a19-0143158ab70c | ostest-p7sj6-master-1                  | ACTIVE | ostest-p7sj6-openshift=10.196.2.49    | ostest-p7sj6-rhcos | m4.xlarge |
| 4955bbec-c23f-4de5-9e7d-8280ab59d813 | ostest-p7sj6-master-0                  | ACTIVE | ostest-p7sj6-openshift=10.196.3.171   | ostest-p7sj6-rhcos | m4.xlarge |
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+

$ oc get nodes -o wide
NAME                                     STATUS   ROLES    AGE     VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ostest-p7sj6-master-0                    Ready    master   5h20m   v1.20.0+4b40bb4   10.196.3.171     <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-master-1                    Ready    master   5h20m   v1.20.0+4b40bb4   10.196.2.49      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-master-2                    Ready    master   5h19m   v1.20.0+4b40bb4   10.196.2.54      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-worker-0-additional-d4vt2   Ready    worker   73m     v1.20.0+4b40bb4   192.168.123.36   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-worker-0-vwgtx              Ready    worker   5h5m    v1.20.0+4b40bb4   10.196.1.67      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43

^ We have 2 workers, one running on subnet ostest-p7sj6-nodes and the other running on subnet ostest-p7sj6-additional.
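
The node placement above can be double-checked with a small sketch that maps each INTERNAL-IP from the listing to a subnet. This is only an illustration, not part of the original verification: the 192.168.123.0/24 range comes from step 1, while the 10.196.0.0/16 range for ostest-p7sj6-nodes is an assumption inferred from the node addresses, and the helper name subnet_of is hypothetical.

```python
import ipaddress

# 192.168.123.0/24 is the range created in step 1.
# 10.196.0.0/16 is an ASSUMPTION inferred from the node addresses;
# the report does not state the real range of ostest-p7sj6-nodes.
SUBNETS = {
    "ostest-p7sj6-nodes": ipaddress.ip_network("10.196.0.0/16"),
    "ostest-p7sj6-additional": ipaddress.ip_network("192.168.123.0/24"),
}

# INTERNAL-IP values copied from the `oc get nodes -o wide` output.
NODES = {
    "ostest-p7sj6-master-0": "10.196.3.171",
    "ostest-p7sj6-master-1": "10.196.2.49",
    "ostest-p7sj6-master-2": "10.196.2.54",
    "ostest-p7sj6-worker-0-additional-d4vt2": "192.168.123.36",
    "ostest-p7sj6-worker-0-vwgtx": "10.196.1.67",
}


def subnet_of(ip):
    """Return the name of the first subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return None


placement = {node: subnet_of(ip) for node, ip in NODES.items()}
for node, subnet in placement.items():
    print(f"{node}: {subnet}")
```

Any component that assumes one subnet for all nodes would see only the first group; the additional worker is the single node that falls outside it.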

5. Pods are successfully created on different workers:

$ oc get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP              NODE                                     NOMINATED NODE   READINESS GATES                            
demo    1/1     Running   0          9m43s   10.128.60.159   ostest-p7sj6-worker-0-additional-d4vt2   <none>           <none>                                     
demo2   1/1     Running   0          4m31s   10.128.60.96    ostest-p7sj6-worker-0-vwgtx              <none>           <none>                                     

But it is not possible to log into the pod running on the additional worker:

$ oc rsh pod/demo
Error from server: error dialing backend: dial tcp 192.168.123.36:10250: i/o timeout


However, it is possible to connect to the pod on the regular worker and, from there, even ping the pod on the additional worker:

$ oc rsh pod/demo2
~ $ ping 10.128.60.159
PING 10.128.60.159 (10.128.60.159) 56(84) bytes of data.
64 bytes from 10.128.60.159: icmp_seq=1 ttl=64 time=1.35 ms
64 bytes from 10.128.60.159: icmp_seq=2 ttl=64 time=0.314 ms
^C
--- 10.128.60.159 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.314/0.831/1.349/0.517 ms
~ $ 

As a consequence, kuryr_tempest_plugin.tests.scenario.test_cross_ping.TestCrossPingScenario.test_pod_pod_ping fails when the pods are created on different workers.

Comment 3 Michał Dulko 2021-02-01 15:43:28 UTC

Back to ON_QA; the problem was just a missing SG rule.

Comment 4 rlobillo 2021-02-01 15:57:45 UTC

Verified on OCP4.7.0-0.nightly-2021-01-28-060516 over OSP13 (2021-01-20.1) with Amphora provider.

There was a missing SG rule:

openstack security group rule create --dst-port 10250 --ingress --protocol tcp --remote-ip 10.196.0.0/16 ostest-p7sj6-worker-additional

After creating it, all NP, conformance and tempest tests passed (logs attached).
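
Why this rule matters for the earlier "dial tcp 192.168.123.36:10250: i/o timeout" error can be sketched as follows. The model is a deliberate simplification (protocol, destination port, and remote CIDR only) and does not reproduce Neutron's full security group semantics; the IngressRule class is hypothetical. The rule values are copied from the command above and the source addresses are the master INTERNAL-IPs from Comment 2.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Simplified stand-in for one security group ingress rule."""
    protocol: str
    dst_port: int
    remote_cidr: str

    def allows(self, protocol, dst_port, src_ip):
        # Traffic matches only if protocol, port, and source range all match.
        return (
            protocol == self.protocol
            and dst_port == self.dst_port
            and ipaddress.ip_address(src_ip)
            in ipaddress.ip_network(self.remote_cidr)
        )


# Values taken from the `openstack security group rule create` command.
KUBELET_RULE = IngressRule(protocol="tcp", dst_port=10250,
                           remote_cidr="10.196.0.0/16")

# Master node addresses from the `oc get nodes -o wide` output.
MASTERS = ["10.196.3.171", "10.196.2.49", "10.196.2.54"]

allowed = {ip: KUBELET_RULE.allows("tcp", 10250, ip) for ip in MASTERS}
for ip, ok in allowed.items():
    print(f"{ip} -> 192.168.123.36:10250 allowed: {ok}")
```

In this model every master address matches the rule, which is consistent with `oc rsh` to the pod on the additional worker working once the rule exists.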

Comment 5 rlobillo 2021-02-01 15:58:44 UTC

Created attachment 1752885
kuryr test results after deploying worker on separate network

Comment 8 errata-xmlrpc 2021-02-24 15:52:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
