Bug 1779469
Summary: | Regression: [sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Installer | Assignee: | Jeremiah Stuever <jstuever> |
Installer sub component: | openshift-installer | QA Contact: | zhaozhanqi <zzhao> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high |
Priority: | high | CC: | bbennett, gzaidman, hekumar, nagrawal, ricarril, wduan, wking, zzhao |
Version: | 4.3.0 | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | 4.3.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: |
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: AWS IPI and OSP UPI were missing security group rules allowing bi-directional traffic between control plane hosts and workers on TCP and UDP ports 30000 - 32767.
Similarly, the AWS UPI and GCP UPI jobs did not have the proper network ACLs applied in all situations. That gap was limited in scope to the CI jobs, so it does not need to go into the release notes; it is mentioned here only to explain why there are three PRs attached to this bug.
Consequence: Newly introduced OVN Networking components would not work properly in clusters lacking these security group rules.
Fix: For existing clusters, add security group rules allowing bi-directional traffic between control plane and workers on TCP and UDP ports 30000 - 32767 (an illustrative sketch of such rules follows the metadata fields below).
Result: OVN Networking components will work properly.
|
Story Points: | --- |
Clone Of: | 1763936 | Environment: | |
Last Closed: | 2020-02-19 05:39:53 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Bug Depends On: | 1763936 | ||
Bug Blocks: |
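To illustrate the Fix described in the Doc Text above, here is a minimal AWS CLI sketch of the missing rules. The security group IDs (sg-MASTERS for the control plane group, sg-WORKERS for the worker group) are placeholders, and the sketch shows the intent of the change rather than the installer's actual implementation.

# Allow node-port traffic (TCP and UDP 30000-32767) in both directions between
# the control plane and worker security groups; sg-MASTERS / sg-WORKERS are placeholders.
for proto in tcp udp; do
  # control plane -> workers
  aws ec2 authorize-security-group-ingress \
    --group-id sg-WORKERS --protocol "$proto" --port 30000-32767 --source-group sg-MASTERS
  # workers -> control plane
  aws ec2 authorize-security-group-ingress \
    --group-id sg-MASTERS --protocol "$proto" --port 30000-32767 --source-group sg-WORKERS
done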
Description
Clayton Coleman
2019-12-04 03:23:55 UTC
I'd like a clear understanding of why this was believed to be a flake. Is it a flake upstream? I see this flaking when run individually, which I find highly suspect.

I'm unable to reproduce; this must be an environmental issue in CI. I ran the test 100 times on a 4.4 cluster:

[ricky@ricky-laptop origin]$ for i in {1..100}; do _output/local/bin/linux/amd64/openshift-tests run-test "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]" ; echo $? >> /tmp/test.out ; done
[ricky@ricky-laptop origin]$ grep -c 0 /tmp/test.out
100

It succeeds 100% of the time.

*** Bug 1784594 has been marked as a duplicate of this bug. ***

At oVirt we see the same network test failing constantly on our CI runs, for example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/4340/rehearse-4340-release-openshift-ocp-installer-e2e-ovirt-4.4/35 As far as I understood, for this bug and bug 1763936 we didn't find the reason for the problem, we just skipped the tests.

On oVirt we are seeing some of the tests from:

[sig-network] Networking Granular Checks: Services should function for node-Service
[sig-network] Networking Granular Checks: Services should function for pod-Service
[sig-network] Networking Granular Checks: Services should function for endpoint-Service

failing on each run of conformance. When we tried to debug the issue we saw that the test starts 4 pods on the workers: 1 pod is the caller, and 3 pods are behind a NodePort service (there are 2 workers on oVirt). When the caller pod on worker 1 tries to communicate with the service via the NodePort and the service happens to select a pod that is also running on worker 1, we get a failure. If the service selects a pod on worker 2, the test passes. This happens on OpenStack (most likely the same issue) and rarely on AWS. We were able to reproduce this issue on an oVirt environment.

Here are a few examples of recent conformance runs that hit the issue:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/222
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/223
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/226
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/228
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/229
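For illustration, a hedged sketch of the topology described above: a few UDP backends behind a NodePort Service plus a client pod probing a worker's NodePort. The names (udp-echo, udp-client), the agnhost image tag, and the port are hypothetical, this is not the conformance test's own code, and it assumes a recent kubectl.

# Hypothetical repro sketch: 3 UDP backends behind a NodePort Service, plus a client
# pod sending a "hostname" datagram to a worker's InternalIP on the allocated NodePort.
kubectl create deployment udp-echo --image=k8s.gcr.io/e2e-test-images/agnhost:2.21 -- /agnhost netexec --udp-port=8081
kubectl scale deployment udp-echo --replicas=3
kubectl expose deployment udp-echo --type=NodePort --protocol=UDP --port=8081
NODE_IP=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(kubectl get svc udp-echo -o jsonpath='{.spec.ports[0].nodePort}')
# The failure described above occurs when the backend chosen for the probe lands on the
# same worker as the client; a reply containing the backend's hostname means success.
kubectl run udp-client --rm -it --restart=Never --image=busybox -- sh -c "echo hostname | nc -u -w 3 $NODE_IP $NODE_PORT"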
The network stress test is continuously red on these tests: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4&sort-by-failures=

At this point, having a fairly consistent reproducer for:

[It] [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: udp [Skipped:openstack] [Suite:openshift/conformance/parallel] [Suite:k8s]

that fails completely on the stress test, AND knowing that it is related to a node talking to a node service (from the same node), highlights that this is likely a real issue. Moving priority to high, given that this may mean same-node service connectivity is broken for host services. Or, if the issue is that we have no plan to support UDP connectivity from host A back to a NodePort service on host A, then we need to fix the test and explain why (unlikely, high bar).
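To exercise the "host A back to a NodePort on host A" path raised here, a hedged sketch: a hostNetwork pod pinned to one worker probes that same worker's InternalIP over UDP. It reuses the hypothetical udp-echo service from the earlier sketch and is not part of the stress suite itself.

# Hypothetical same-node check: a hostNetwork pod on a worker sends a UDP probe to
# that worker's own InternalIP on the NodePort (udp-echo is from the sketch above).
NODE=$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')
NODE_IP=$(oc get node "$NODE" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(oc get svc udp-echo -o jsonpath='{.spec.ports[0].nodePort}')
oc run hostnet-probe --rm -it --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true,"nodeName":"'"$NODE"'"}}' \
  -- sh -c "echo hostname | nc -u -w 3 $NODE_IP $NODE_PORT"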
I ran the stress test suite on an OVN cluster and it worked fine:

Feb 03 15:52:56.757 I ns/openshift-etcd-operator deployment/etcd-operator Updated ConfigMap/etcd-ca-bundle -n openshift-etcd-operator: cause by changes in data.ca-bundle.crt (27 times)
Feb 03 15:58:56.211 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-2 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-2 (16 times)
Feb 03 15:58:56.400 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2a-8j2lv Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2a-8j2lv (19 times)
Feb 03 15:58:56.554 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2b-b2wb5 Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2b-b2wb5 (18 times)
Feb 03 15:58:56.737 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2c-fwcxk Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2c-fwcxk (19 times)
Feb 03 15:58:57.758 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-0 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-0 (16 times)
Feb 03 15:58:58.703 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-1 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-1 (16 times)
Feb 03 15:59:30.885 I ns/openshift-etcd-operator deployment/etcd-operator Updated ConfigMap/etcd-ca-bundle -n openshift-etcd-operator: cause by changes in data.ca-bundle.crt (29 times)
Feb 03 16:04:31.941 E kube-apiserver Kube API started failing: Get https://api.rcarrillocruz-ovn-granular.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: context deadline exceeded
Feb 03 16:04:31.941 I openshift-apiserver OpenShift API started failing: Get https://api.rcarrillocruz-ovn-granular.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 03 16:04:33.513 - 15s E kube-apiserver Kube API is not responding to GET requests
Feb 03 16:04:33.513 - 15s E openshift-apiserver OpenShift API is not responding to GET requests

825 pass, 15 skip (51m25s)

Will do a few more runs, hoping to hit it at some point.

(In reply to Ricardo Carrillo Cruz from comment #13)
> I ran the stress test suite on an OVN cluster and it worked fine:
> [...]
> Will do a few more runs, hoping to hit it at some point.

As I said above, the problem is easily reproducible on oVirt. We are seeing this problem with most tests that relate to NodePort. We are able to create a debugging environment, meaning deploy an oVirt cluster and pause the test in the middle, so we will be able to see exactly what happens. Can we schedule a debug session this week and pin down this problem?

It's not needed yet. This appears not to be platform specific, as the stress test shows issues in CI runs on AWS (even though my tests against AWS haven't shown any issues so far). I'll try to reach Clayton to find out more about how this stress test works.

Checked https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/790 ; moving this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0492