Bug 1763936

Summary: [sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
Product: OpenShift Container Platform
Reporter: Wei Duan <wduan>
Component: Installer
Assignee: Jeremiah Stuever <jstuever>
Installer sub component: openshift-installer
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: gpei, hekumar, hongkliu, slowrie, wking, zzhao
Version: 4.3.0
Keywords: Reopened
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: AWS IPI was missing security group rules allowing traffic from control plane hosts to workers on TCP and UDP ports 30000 - 32767. Similarly, AWS UPI and GCP UPI jobs did not have the proper network ACLs applied in all situations. That part was limited in scope to the CI jobs, so there is no need to include it in release notes; I'm only explaining why there are three PRs attached to this bug.
Consequence: Newly introduced OVN Networking components would not work properly in clusters lacking these security group rules.
Fix: For existing clusters, add security group rules allowing traffic from control plane hosts to workers on TCP and UDP ports 30000 - 32767.
Result: OVN Networking components will work properly.
Story Points: ---
Clone Of:
: 1779469 1784594 1798176
Environment:
Last Closed: 2020-05-04 11:14:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1779469, 1784594, 1798176

Description Wei Duan 2019-10-22 02:17:38 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/139


[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes] 

fail [k8s.io/kubernetes/test/e2e/framework/networking_utils.go:250]: Oct 21 21:49:55.346: Failed to find expected endpoints:
Tries 66
Command curl -g -q -s 'http://10.129.0.253:8080/dial?request=hostName&protocol=udp&host=10.0.59.21&port=32196&tries=1'
retrieved map[]
expected map[netserver-0:{} netserver-1:{} netserver-2:{} netserver-3:{} netserver-4:{} netserver-5:{}]
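
The dial URL in the failing command targets a node IP on port 32196, which falls inside the default Kubernetes NodePort range (30000-32767). A minimal Python sketch, using only the standard library, of pulling the target out of such a URL and checking the port. The function names `dial_target` and `is_nodeport` are illustrative and are not part of the e2e framework:

```python
from urllib.parse import urlsplit, parse_qs

# Default Kubernetes NodePort range; also the range named in this bug's Doc Text.
NODEPORT_MIN, NODEPORT_MAX = 30000, 32767


def dial_target(url):
    """Return (protocol, host, port) from a /dial URL like the one above."""
    query = parse_qs(urlsplit(url).query)
    return query["protocol"][0], query["host"][0], int(query["port"][0])


def is_nodeport(port):
    """True when the port lies inside the NodePort range."""
    return NODEPORT_MIN <= port <= NODEPORT_MAX


url = (
    "http://10.129.0.253:8080/dial?request=hostName&protocol=udp"
    "&host=10.0.59.21&port=32196&tries=1"
)
protocol, host, port = dial_target(url)
print(protocol, host, port, is_nodeport(port))
```

A NodePort target is consistent with the cause recorded in the Doc Text field, where traffic to that port range was not allowed between hosts.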

Comment 2 Casey Callendrello 2019-10-22 12:17:27 UTC
Yup, this is a known bug. We've already filed a fix skipping this test, and we have a known issue. It may not be fixed for 4.3, but we'll try.

Comment 3 Casey Callendrello 2019-10-22 12:22:34 UTC
*** Bug 1763941 has been marked as a duplicate of this bug. ***

Comment 6 W. Trevor King 2019-11-22 00:30:42 UTC
I added a link to the skipping PR, but it missed failures like the ones we're seeing in 4.3's aws-upi job today [1,2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/373/build-log.txt | grep '^failed: '
failed: (5m38s) 2019-11-21T23:24:14 "[sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (4m43s) 2019-11-21T23:24:31 "[sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (9m15s) 2019-11-21T23:38:14 "[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]"
failed: (9m21s) 2019-11-21T23:42:57 "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]"

Skips have:

  \[sig-network\] Networking Granular Checks: Services should function for endpoint-Service`,

but are lacking the 'node-Service: udp', 'node-Service: http', and 'pod-Service: udp' variants.  Failure counts by job for each of those over the past 24h:
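
To illustrate why the other variants still run, here is a minimal Python sketch that applies the skip pattern quoted above to the four test names. It assumes the pattern is used as an unanchored regular-expression search against the full test name; how the test suite actually applies its skip list is not shown in this report:

```python
import re

# Skip pattern as quoted above. Assumption: it is applied as an unanchored
# regex search against the test name.
SKIP = re.compile(
    r"\[sig-network\] Networking Granular Checks: "
    r"Services should function for endpoint-Service"
)

PREFIX = "[sig-network] Networking Granular Checks: Services should function for "
VARIANTS = (
    "endpoint-Service: udp",
    "node-Service: udp",
    "node-Service: http",
    "pod-Service: udp",
)
names = [PREFIX + variant for variant in VARIANTS]


def skipped(name):
    """True when the skip pattern matches the test name."""
    return SKIP.search(name) is not None


for name in names:
    print("skip" if skipped(name) else "run ", name)
```

Only the endpoint-Service name matches, so the node-Service and pod-Service tests are still executed.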

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+node-Service:+udp' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
      1 pull-ci-openshift-origin-master-e2e-aws-fips
      1 rehearse-5308-pull-ci-openshift-installer-master-e2e-aws-proxy
      1 release-openshift-ocp-installer-e2e-gcp-4.3
      1 release-openshift-ocp-installer-e2e-openstack-4.3
      1 release-openshift-ocp-installer-e2e-openstack-4.4
      3 release-openshift-ocp-installer-e2e-aws-proxy-4.3
      4 release-openshift-ocp-installer-e2e-aws-upi-4.3
      4 release-openshift-ocp-installer-e2e-aws-upi-4.4

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+node-Service:+http' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
      1 endurance-e2e-aws-4.3
      1 pull-ci-openshift-installer-master-e2e-aws-fips
      1 pull-ci-openshift-machine-config-operator-master-e2e-aws
      1 release-openshift-ocp-installer-e2e-aws-upi-4.3
      1 release-openshift-ocp-installer-e2e-openstack-4.4
      1 release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.3
      1 release-openshift-origin-installer-e2e-gcp-compact-4.3

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&name=.&search=failed:+.*Networking+Granular+Checks:+Services+should+function+for+pod-Service:+udp' | jq -r '. | keys[]' | sed 's|.*/\([^/]*\)/[0-9]*|\1|' | sort | uniq -c | sort -n
      1 rehearse-5308-pull-ci-openshift-installer-master-e2e-aws-proxy
      3 release-openshift-ocp-installer-e2e-aws-proxy-4.3
      4 release-openshift-ocp-installer-e2e-aws-upi-4.4
      5 release-openshift-ocp-installer-e2e-aws-upi-4.3

[1]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-e2e-aws-upi-4.3&search=failed:%20.*Networking%20Granular%20Checks:%20Services%20should%20function%20for%20.*-Service
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/373
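
For readers less familiar with the `jq | sed | sort | uniq -c | sort -n` pipelines above, the same tally can be sketched in Python. This assumes, as the `jq -r '. | keys[]'` step suggests, that the search response is a JSON object whose keys are run URLs ending in `<job>/<run number>`. The sample URLs below are hypothetical apart from run 373:

```python
import re
from collections import Counter


def job_name(run_url):
    # Mirrors sed 's|.*/\([^/]*\)/[0-9]*|\1|': keep the path segment
    # immediately before the trailing numeric run ID.
    return re.sub(r".*/([^/]*)/[0-9]*", r"\1", run_url, count=1)


def failures_by_job(run_urls):
    # Mirrors `sort | uniq -c | sort -n`: count per job, lowest count first.
    counts = Counter(job_name(url) for url in run_urls)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sample keys; only run 373 appears in this report.
base = "https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/"
sample = [
    base + "release-openshift-ocp-installer-e2e-aws-upi-4.3/373",
    base + "release-openshift-ocp-installer-e2e-aws-upi-4.3/372",
    base + "release-openshift-ocp-installer-e2e-gcp-4.3/100",
]
for job, count in failures_by_job(sample):
    print(f"{count:7d} {job}")
```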

Comment 7 W. Trevor King 2019-11-23 05:44:17 UTC
*** Bug 1763940 has been marked as a duplicate of this bug. ***

Comment 8 W. Trevor King 2019-11-23 05:45:46 UTC
*** Bug 1767946 has been marked as a duplicate of this bug. ***

Comment 9 W. Trevor King 2019-11-23 05:46:59 UTC
Ricardo was looking into this over in [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1767946#c2

Comment 11 Ricardo Carrillo Cruz 2019-12-10 17:53:48 UTC
I'm afraid I'm unable to reproduce this.

I ran the endpoint-Service udp test 5-6 times in a row and every run passed:

<snip>
STEP: Performing setup for networking test in namespace e2e-nettest-6329
STEP: creating a selector
STEP: Creating the service pods in kubernetes
Dec 10 18:49:56.700: INFO: Waiting up to 10m0s for all (but 100) nodes to be schedulable
STEP: Creating test pods
STEP: Getting node addresses
Dec 10 18:50:36.925: INFO: Waiting up to 10m0s for all (but 100) nodes to be schedulable
STEP: Creating the service on top of the pods in kubernetes
Dec 10 18:50:38.272: INFO: Service node-port-service in namespace e2e-nettest-6329 found.
Dec 10 18:50:38.762: INFO: Service session-affinity-service in namespace e2e-nettest-6329 found.
STEP: dialing(udp) netserver-0 (endpoint) --> 172.30.228.56:90 (config.clusterIP)
Dec 10 18:50:39.062: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:40.522: INFO: Waiting for endpoints: map[netserver-0:{} netserver-2:{}]
Dec 10 18:50:42.673: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:44.938: INFO: Waiting for endpoints: map[netserver-0:{} netserver-2:{}]
Dec 10 18:50:47.159: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:48.345: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:50.538: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:51.973: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:54.300: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:55.697: INFO: Waiting for endpoints: map[netserver-0:{}]
Dec 10 18:50:57.848: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=172.30.228.56&port=90&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:50:59.122: INFO: Waiting for endpoints: map[]
STEP: dialing(udp) netserver-0 (endpoint) --> 10.0.130.166:32012 (nodeIP)
Dec 10 18:50:59.273: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:00.567: INFO: Waiting for endpoints: map[netserver-1:{} netserver-2:{}]
Dec 10 18:51:02.826: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:04.130: INFO: Waiting for endpoints: map[netserver-2:{}]
Dec 10 18:51:06.281: INFO: ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.128.2.207:8080/dial?request=hostName&protocol=udp&host=10.0.130.166&port=32012&tries=1'] Namespace:e2e-nettest-6329 PodName:host-test-container-pod ContainerName:agnhost Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false}
Dec 10 18:51:07.532: INFO: Waiting for endpoints: map[]
[AfterEach] [sig-network] Networking
  /home/ricky/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Dec 10 18:51:07.532: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-nettest-6329" for this suite.
Dec 10 18:51:08.146: INFO: Running AfterSuite actions on all nodes
Dec 10 18:51:08.146: INFO: Running AfterSuite actions on node 1

passed: (1m22s) 2019-12-10T17:51:08 "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]"

1 pass, 0 skip (1m22s)
</snip>

Comment 16 Ricardo Carrillo Cruz 2019-12-20 09:38:56 UTC
I'm unable to reproduce this, so it must be an environmental issue on CI.
I ran the test 100 times on a 4.4 cluster:

[ricky@ricky-laptop origin]$ for i in {1..100}; do _output/local/bin/linux/amd64/openshift-tests run-test "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]" ; echo $? >> /tmp/test.out ; done

[ricky@ricky-laptop origin]$ grep -c 0 /tmp/test.out
100

It succeeds 100% of the time.
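
One caveat with the tally above: `grep -c 0` counts every line that contains a `0`, so a non-zero exit status such as 10 would also be counted as a pass. A minimal sketch of a stricter tally in Python, assuming the same one-exit-status-per-line format as /tmp/test.out. The function name and the sample data are illustrative:

```python
def count_passes(lines):
    """Count lines that are exactly the exit status 0."""
    return sum(1 for line in lines if line.strip() == "0")


# Hypothetical sample: two passes and two failures, one of which contains a 0.
sample = ["0\n", "0\n", "10\n", "1\n"]
print(count_passes(sample))
```

With 100 lines that each read `0`, both approaches give 100, so the result reported here is unaffected.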

Comment 17 zhaozhanqi 2019-12-24 03:18:55 UTC
I checked the same build with 4.3.0-0.nightly-2019-12-23-235118.

AWS [1] works well, but OpenStack [2] still fails.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/554
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/658

Comment 18 zhaozhanqi 2019-12-24 03:25:39 UTC
I also checked GCP and vSphere, and both are working well, so I guess this is an OpenStack issue.

Comment 19 zhaozhanqi 2020-01-06 09:21:49 UTC
Verified this bug according to comments 17 and 18.

Comment 20 Scott Dodson 2020-01-27 20:26:54 UTC
The fix for this bug was originally to disable tests; however, those tests were identifying valid problems. Moving this over to the Installer component, and we'll track fixes to the AWS IPI, AWS UPI, and GCP UPI CI jobs that should resolve these.
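
The rules described in this bug's Doc Text (control plane to workers on TCP and UDP ports 30000 - 32767) can be sketched as plain data. This is an illustration in Python only: the dictionary keys and the group names `master-sg` and `worker-sg` are hypothetical and do not reflect the installer's actual resource definitions:

```python
NODEPORT_RANGE = (30000, 32767)


def nodeport_ingress_rules(source_group, target_group):
    """Build one ingress rule per protocol covering the NodePort range."""
    low, high = NODEPORT_RANGE
    return [
        {
            "protocol": protocol,
            "from_port": low,
            "to_port": high,
            "source": source_group,  # traffic originates from control plane hosts
            "target": target_group,  # and is allowed into the workers
        }
        for protocol in ("tcp", "udp")
    ]


def allows(rules, protocol, port):
    """True when some rule permits the given protocol and port."""
    return any(
        rule["protocol"] == protocol
        and rule["from_port"] <= port <= rule["to_port"]
        for rule in rules
    )


rules = nodeport_ingress_rules("master-sg", "worker-sg")
# 32196 is the NodePort from the failing dial command in the description.
print(allows(rules, "udp", 32196))
```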

Comment 22 Scott Dodson 2020-01-30 17:22:25 UTC
Additional changes necessary.

Comment 29 Scott Dodson 2020-02-04 18:12:46 UTC
Let's split a separate bug out for that. We probably should've done one bug per provider from the start, but oVirt comes from another team. I'll clone and assign. Also, since oVirt is new to 4.4, there's no need to backport beyond that.

Comment 30 zhaozhanqi 2020-02-05 07:16:20 UTC
Moving this bug to verified, since bug 1798176 is now tracking the issue from comment 28.

Comment 33 errata-xmlrpc 2020-05-04 11:14:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581