Description of problem:
In OCP 4.7 on OSP we got the ability to run worker nodes in separate subnets [1]. This is not supported by Kuryr, which assumes a single subnet for all the nodes of the OpenShift cluster.

[1] https://issues.redhat.com/browse/OSASINFRA-2087

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Add some nodes on a separate subnet as described here: https://issues.redhat.com/browse/OSASINFRA-2094?focusedCommentId=15355542&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15355542
2. Run some pods on those nodes.

Actual results:
The pods won't have connectivity (most likely they will be stuck in ContainerCreating, with kuryr-cni unable to connect them).

Expected results:
The nodes work normally.

Additional info:
FailedQA on OCP4.7.0-0.nightly-2021-01-28-060516 over OSP13 (2021-01-20.1) with the Amphora provider.

0. Install an OCP cluster with IPI normally.

1. Create a new subnet on the machine network:

$ export CLUSTER_NAME=`jq -r .infraID ostest/metadata.json`
$ openstack subnet create --gateway 192.168.123.1 --subnet-range 192.168.123.0/24 --allocation-pool start=192.168.123.15,end=192.168.123.254 --network ${CLUSTER_NAME}-openshift --tag openshiftClusterID=${CLUSTER_NAME} ${CLUSTER_NAME}-additional
$ openstack router add subnet ${CLUSTER_NAME}-external-router ${CLUSTER_NAME}-additional

2. Edit infraID in security-groups-additional-network.yaml (https://gist.github.com/rlobillo/9a4b549d8ecfb8f7a0feae3fb2081e82) and run:

$ ansible-playbook security-groups-additional-network.yaml

# Note: From Martin's security-groups-additional-network.yaml, I needed to remove the --tag option from the "openstack security group set" command, due to "error: unrecognized arguments".

3. Edit <CLUSTER_NAME> in additional_subnet.yaml (https://gist.github.com/rlobillo/1540f8dc6790aa2c0e060e8bc49d29e9) and run:

$ oc apply -f additional_machineset.yaml

4.
Wait until the worker is created:

$ openstack server list
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+
| ID                                   | Name                                   | Status | Networks                              | Image              | Flavor    |
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+
| 201a4839-04c4-4cdd-b2b7-b3059693f035 | ostest-p7sj6-worker-0-additional-d4vt2 | ACTIVE | ostest-p7sj6-openshift=192.168.123.36 | ostest-p7sj6-rhcos | m4.xlarge |
| 4784e036-a860-4ced-9a8b-0d95549032d7 | ostest-p7sj6-worker-0-vwgtx            | ACTIVE | ostest-p7sj6-openshift=10.196.1.67    | ostest-p7sj6-rhcos | m4.xlarge |
| 736249c1-d2ad-47d0-9a45-6ddd2440587c | ostest-p7sj6-master-2                  | ACTIVE | ostest-p7sj6-openshift=10.196.2.54    | ostest-p7sj6-rhcos | m4.xlarge |
| aaf30364-5bda-4413-9a19-0143158ab70c | ostest-p7sj6-master-1                  | ACTIVE | ostest-p7sj6-openshift=10.196.2.49    | ostest-p7sj6-rhcos | m4.xlarge |
| 4955bbec-c23f-4de5-9e7d-8280ab59d813 | ostest-p7sj6-master-0                  | ACTIVE | ostest-p7sj6-openshift=10.196.3.171   | ostest-p7sj6-rhcos | m4.xlarge |
+--------------------------------------+----------------------------------------+--------+---------------------------------------+--------------------+-----------+

$ oc get nodes -o wide
NAME                                     STATUS   ROLES    AGE     VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ostest-p7sj6-master-0                    Ready    master   5h20m   v1.20.0+4b40bb4   10.196.3.171     <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-master-1                    Ready    master   5h20m   v1.20.0+4b40bb4   10.196.2.49      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-master-2                    Ready    master   5h19m   v1.20.0+4b40bb4   10.196.2.54      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-worker-0-additional-d4vt2   Ready    worker   73m     v1.20.0+4b40bb4   192.168.123.36   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43
ostest-p7sj6-worker-0-vwgtx              Ready    worker   5h5m    v1.20.0+4b40bb4   10.196.1.67      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101272343-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gita1ab08a.el8.43

^ We have 2 workers: one running on subnet ostest-p7sj6-nodes and the other on subnet ostest-p7sj6-additional.

5. Pods are successfully created on the different workers:

$ oc get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP              NODE                                     NOMINATED NODE   READINESS GATES
demo    1/1     Running   0          9m43s   10.128.60.159   ostest-p7sj6-worker-0-additional-d4vt2   <none>           <none>
demo2   1/1     Running   0          4m31s   10.128.60.96    ostest-p7sj6-worker-0-vwgtx              <none>           <none>

But it is not possible to log into the pod running on the additional worker:

$ oc rsh pod/demo
Error from server: error dialing backend: dial tcp 192.168.123.36:10250: i/o timeout

However, it is possible to connect to the pod on the regular worker, and even to ping the pod on the additional worker from it:

$ oc rsh pod/demo2
~ $ ping 10.128.60.159
PING 10.128.60.159 (10.128.60.159) 56(84) bytes of data.
64 bytes from 10.128.60.159: icmp_seq=1 ttl=64 time=1.35 ms
64 bytes from 10.128.60.159: icmp_seq=2 ttl=64 time=0.314 ms
^C
--- 10.128.60.159 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.314/0.831/1.349/0.517 ms
~ $

As a consequence, kuryr_tempest_plugin.tests.scenario.test_cross_ping.TestCrossPingScenario.test_pod_pod_ping fails when the pods are created on different workers.
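The "i/o timeout" above indicates that the kubelet port (10250/tcp) on the additional worker is unreachable from the control plane, while pod-to-pod traffic still works. A minimal sketch of a TCP reachability probe that can be run from a master node (the worker IP 192.168.123.36 is taken from this reproduction; adjust for your cluster):

```shell
# Probe a TCP port with a short timeout; returns 0 only if the connect succeeds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are needed on RHCOS.
check_port() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Against the additional worker's kubelet (IP from this report):
# check_port 192.168.123.36 10250 && echo "kubelet reachable" || echo "timeout/refused"
```

A timeout here, combined with working pings between pods, points at a security group rule blocking the kubelet port rather than a routing problem.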
Back to ON_QA, the problem was just a missing SG rule.
Verified on OCP4.7.0-0.nightly-2021-01-28-060516 over OSP13 (2021-01-20.1) with the Amphora provider.

There was a missing SG rule:

$ openstack security group rule create --dst-port 10250 --ingress --protocol tcp --remote-ip 10.196.0.0/16 ostest-p7sj6-worker-additional

After creating it, all NP, conformance and tempest tests passed (logs attached).
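The rule above hardcodes this cluster's infraID and machine CIDR. A sketch that renders the same rule for an arbitrary cluster and prints it for review before running; CLUSTER_NAME and MACHINE_CIDR are placeholder values here (on a real cluster they come from metadata.json and the install-config machineNetwork):

```shell
# Assumed inputs; on a real cluster:
#   CLUSTER_NAME=$(jq -r .infraID ostest/metadata.json)
#   MACHINE_CIDR is networking.machineNetwork from install-config.yaml
CLUSTER_NAME="ostest-p7sj6"
MACHINE_CIDR="10.196.0.0/16"

# Allow kubelet traffic (10250/tcp) from the machine network into the
# additional workers' security group (group name as used in this report).
CMD="openstack security group rule create --dst-port 10250 --ingress --protocol tcp --remote-ip ${MACHINE_CIDR} ${CLUSTER_NAME}-worker-additional"
echo "${CMD}"
# Review the printed command, then execute it against the cloud.
```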
Created attachment 1752885 [details] kuryr test results after deploying worker on separate network
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633