Bug 1763936
Summary: | [sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes] | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Wei Duan <wduan> | |
Component: | Installer | Assignee: | Jeremiah Stuever <jstuever> | |
Installer sub component: | openshift-installer | QA Contact: | zhaozhanqi <zzhao> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | gpei, hekumar, hongkliu, slowrie, wking, zzhao | |
Version: | 4.3.0 | Keywords: | Reopened | |
Target Milestone: | --- | |||
Target Release: | 4.4.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: AWS IPI was missing security group rules allowing traffic from control plane hosts to workers on TCP and UDP ports 30000 - 32767.
Similarly, AWS UPI and GCP UPI jobs did not have the proper network ACLs applied in all situations. This was limited in scope to the CI jobs, so there is no need to include it in release notes; I'm just explaining why there are three PRs attached to this bug.
Consequence: Newly introduced OVN Networking components would not work properly in clusters lacking these security group rules.
Fix: For existing clusters, add security group rules allowing traffic from control plane hosts to workers on TCP and UDP ports 30000 - 32767.
Result: OVN Networking components will work properly.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1779469 1784594 1798176 | Environment: | ||
Last Closed: | 2020-05-04 11:14:34 UTC | Type: | Bug | |
Bug Blocks: | 1779469, 1784594, 1798176 |
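The Fix in the Doc Text above amounts to two ingress rules on the worker security group, one for TCP and one for UDP. The sketch below is only an illustration of that idea, not the installer's actual change: it prints the `aws ec2 authorize-security-group-ingress` commands one might run by hand rather than executing them, and the function name and both security group IDs are hypothetical placeholders.

```shell
#!/bin/sh
# Print (do not run) AWS CLI commands that would open the NodePort range
# 30000-32767 from the control plane security group to the worker security
# group, once for TCP and once for UDP.
# Assumption: both group IDs are hypothetical placeholders chosen by the caller.
nodeport_rule_cmds() {
    worker_sg="$1"   # security group attached to the worker nodes
    master_sg="$2"   # security group attached to the control plane nodes
    for proto in tcp udp; do
        printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol %s --port 30000-32767 --source-group %s\n' \
            "$worker_sg" "$proto" "$master_sg"
    done
}

# Example with made-up group IDs.
nodeport_rule_cmds sg-worker-example sg-master-example
```

Printing the commands instead of running them keeps the sketch safe to try; the output can be reviewed and then pasted into a shell with suitable AWS credentials.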
Description
Wei Duan
2019-10-22 02:17:38 UTC
Yup, this is a known bug. We've already filed a fix skipping this test, and we have a known issue. It may not be fixed for 4.3, but we'll try.

*** Bug 1763941 has been marked as a duplicate of this bug. ***

I added a link to the skipping PR, but it missed things like we're seeing in 4.3's aws-upi job today [1,2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/373/build-log.txt | grep '^failed: '
failed: (5m38s) 2019-11-21T23:24:14 "[sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (4m43s) 2019-11-21T23:24:31 "[sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (9m15s) 2019-11-21T23:38:14 "[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (9m21s) 2019-11-21T23:42:57 "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]"

Skips have `\[sig-network\] Networking Granular Checks: Services should function for endpoint-Service`, but are lacking 'node-Service: udp', 'node-Service: http', and 'pod-service: udp' variants. Failure counts by job for each of those over the past 24h:

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+node-Service:+udp' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
1 pull-ci-openshift-origin-master-e2e-aws-fips
1 rehearse-5308-pull-ci-openshift-installer-master-e2e-aws-proxy
1 release-openshift-ocp-installer-e2e-gcp-4.3
1 release-openshift-ocp-installer-e2e-openstack-4.3
1 release-openshift-ocp-installer-e2e-openstack-4.4
3 release-openshift-ocp-installer-e2e-aws-proxy-4.3
4 release-openshift-ocp-installer-e2e-aws-upi-4.3
4 release-openshift-ocp-installer-e2e-aws-upi-4.4
$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+node-Service:+http' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
1 endurance-e2e-aws-4.3
1 pull-ci-openshift-installer-master-e2e-aws-fips
1 pull-ci-openshift-machine-config-operator-master-e2e-aws
1 release-openshift-ocp-installer-e2e-aws-upi-4.3
1 release-openshift-ocp-installer-e2e-openstack-4.4
1 release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.3
1 release-openshift-origin-installer-e2e-gcp-compact-4.3
$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+pod-Service:+udp' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
1 rehearse-5308-pull-ci-openshift-installer-master-e2e-aws-proxy
3 release-openshift-ocp-installer-e2e-aws-proxy-4.3
4 release-openshift-ocp-installer-e2e-aws-upi-4.4
5 release-openshift-ocp-installer-e2e-aws-upi-4.3

[1]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-e2e-aws-upi-4.3&search=failed:%20.*Networking%20Granular%20Checks:%20Services%20should%20function%20for%20.*-Service
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/373

*** Bug 1763940 has been marked as a duplicate of this bug. ***

*** Bug 1767946 has been marked as a duplicate of this bug. ***

Ricardo was looking into this over in [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1767946#c2

This is currently causing ~10% of all ^release-.*4.3$ errors over the past 12h [1,2,3,4,5,6,7,8]. Bumping the severity/priority.

[1]: https://search.svc.ci.openshift.org/chart?maxAge=12h&name=^release-.*4.3$&search=failed:.*Networking%20Granular%20Checks:%20Services%20should%20function%20for
[2]: https://search.svc.ci.openshift.org/?name=^release.*4.3$&search=failed%3A.*Networking+Granular+Checks%3A+Services+should+function+for&maxAge=12h&context=0&type=all
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.3/32
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/381
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.3/40
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/382
[7]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/226
[8]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/383

I'm afraid I'm unable to repro this. I ran the endpoint-Service udp test 5-6 times in a row, and they all passed:

<snip>
STEP: Performing setup for networking test in namespace e2e-nettest-6329
STEP: creating a selector
STEP: Creating the service pods in kubernetes
Dec 10 18:49:56.700: INFO: Waiting up to 10m0s for all (but 100) nodes to be schedulable
STEP: Creating test pods
STEP: Getting node addresses
Dec 10 18:50:36.925: INFO: Waiting up to 10m0s for all (but 100) nodes to be schedulable
STEP: Creating the service on top of the pods in kubernetes
Dec 10 18:50:38.272: INFO: Service node-port-service in namespace e2e-nettest-6329 found.
Dec 10 18:50:38.762: INFO: Service session-affinity-service in namespace e2e-nettest-6329 found.
STEP: dialing(udp) netserver-0 (endpoint) --> 172.30.228.56:90 (config.clusterIP)
Dec 10 18:50:39.062: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:40.522: INFO: Waiting for endpoints: map[netserver-0:{} netserver-2:{}]
Dec 10 18:50:42.673: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:44.938: INFO: Waiting for endpoints: map[netserver-0:{} netserver-2:{}]
Dec 10 18:50:47.159: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:48.345: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:50.538: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:51.973: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:54.300: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:55.697: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:57.848: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:59.122: INFO: Waiting for endpoints: map[]
STEP: dialing(udp) netserver-0 (endpoint) --> 10.0.130.166:32012 (nodeIP)
Dec 10 18:50:59.273: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:00.567: INFO: Waiting for endpoints: map[netserver-1:{} netserver-2:{}]
Dec 10 18:51:02.826: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:04.130: INFO: Waiting for endpoints: map[netserver-2:{}]
Dec 10 18:51:06.281: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:07.532: INFO: Waiting for endpoints: map[]
[AfterEach] [sig-network] Networking
/home/ricky/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Dec 10 18:51:07.532: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-nettest-6329" for this suite.
Dec 10 18:51:08.146: INFO: Running AfterSuite actions on all nodes
Dec 10 18:51:08.146: INFO: Running AfterSuite actions on node 1
passed: (1m22s) 2019-12-10T17:51:08 "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]"
1 pass, 0 skip (1m22s)
</snip>

I'm unable to reproduce; this must be an environmental issue on CI. I ran the test 100 times on a 4.4 cluster:

[ricky@ricky-laptop origin]$ for i in {1..100}; do _output/local/bin/linux/amd64/openshift-tests run-test "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]" ; echo $? >> /tmp/test.out ; done
[ricky@ricky-laptop origin]$ grep -c 0 /tmp/test.out
100

It succeeds 100%.

I checked the same build with 4.3.0-0.nightly-2019-12-23-235118: aws [1] works well, but openstack [2] still failed.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/554
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/658

I also checked that gcp and vsphere are working well, so I guess it is an openstack issue.

Verified this bug according to comments 17 and 18.

The fix for this bug was originally to disable tests; however, those tests were identifying valid problems. Moving this over to the Installer component and we'll track fixes to the AWS IPI, AWS UPI, and GCP UPI CI jobs that should resolve these.

Additional changes necessary.

Met the same issue in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/297

Let's split a separate bug out for that. We really probably should've done one bug per provider from the start, but ovirt comes from another team. I'll clone and assign. Also, since oVirt is new to 4.4, there's no need to backport beyond that.

Moving this bug to verified, since another bug, 1798176, is tracking the issue from comment 28.

Hit the same issue in the gcp-compact CI. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.3/115

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
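The per-job failure counts quoted earlier in this thread come from a `sed | sort | uniq -c | sort -n` chain applied to the keys returned by the CI search. As a minimal sketch of what that chain does, the block below applies the same `sed` expression to a few job-run URLs taken from this bug. It assumes the search keys are job-run URLs ending in `<job-name>/<run-number>`, and the function name `count_by_job` is a hypothetical helper, not part of any CI tooling.

```shell
#!/bin/sh
# Reduce job-run URLs (one per line on stdin) to per-job counts,
# mirroring the sed | sort | uniq -c | sort -n steps used in this thread.
# Assumption: each input line ends in <job-name>/<run-number>.
count_by_job() {
    # Keep only the path segment before the trailing run number,
    # then count how often each job name occurs, smallest count first.
    sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
}

# Sample input: three run URLs quoted in this bug.
printf '%s\n' \
    'https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/381' \
    'https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/382' \
    'https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/226' \
    | count_by_job
```

With this sample the aws-proxy job should be listed once and the aws-upi job twice; the exact column padding depends on the local `uniq` implementation.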