Bug 1779469
Summary: | Regression: [sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Installer | Assignee: | Jeremiah Stuever <jstuever> |
Installer sub component: | openshift-installer | QA Contact: | zhaozhanqi <zzhao> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high |
Priority: | high | CC: | bbennett, gzaidman, hekumar, nagrawal, ricarril, wduan, wking, zzhao |
Version: | 4.3.0 | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | 4.3.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: |
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: AWS IPI and OSP UPI were missing security group rules allowing bi-directional traffic between control plane hosts and workers on TCP and UDP ports 30000 - 32767.
Similarly, the AWS UPI and GCP UPI jobs did not have the proper network ACLs applied in all situations. That gap was limited in scope to the CI jobs, so it does not need to go into the release notes; it is mentioned here only to explain why there are three PRs attached to this bug.
Consequence: Newly introduced OVN Networking components would not work properly in clusters lacking these security group rules.
Fix: For existing clusters, add security group rules allowing bi-directional traffic between control plane and workers on TCP and UDP ports 30000 - 32767 (an illustrative sketch of such rules follows the metadata fields below).
Result: OVN Networking components will work properly.
|
Story Points: | --- |
Clone Of: | 1763936 | Environment: | |
Last Closed: | 2020-02-19 05:39:53 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Bug Depends On: | 1763936 | ||
Bug Blocks: |
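To illustrate the Fix described in the Doc Text above, here is a minimal AWS CLI sketch of the missing rules. The security group IDs (sg-MASTERS for the control plane group, sg-WORKERS for the worker group) are placeholders, and the sketch shows the intent of the change rather than the installer's actual implementation.

# Allow node-port traffic (TCP and UDP 30000-32767) in both directions between
# the control plane and worker security groups; sg-MASTERS / sg-WORKERS are placeholders.
for proto in tcp udp; do
  # control plane -> workers
  aws ec2 authorize-security-group-ingress \
    --group-id sg-WORKERS --protocol "$proto" --port 30000-32767 --source-group sg-MASTERS
  # workers -> control plane
  aws ec2 authorize-security-group-ingress \
    --group-id sg-MASTERS --protocol "$proto" --port 30000-32767 --source-group sg-WORKERS
done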
Description
Clayton Coleman
2019-12-04 03:23:55 UTC
I'd like a clear understanding of why this was believed to be a flake. Is it a flake upstream? I see this flaking when run individually, which I find highly suspect.

I'm unable to reproduce; this must be an environmental issue in CI. I ran the test 100 times on a 4.4 cluster:

[ricky@ricky-laptop origin]$ for i in {1..100}; do _output/local/bin/linux/amd64/openshift-tests run-test "[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]" ; echo $? >> /tmp/test.out ; done
[ricky@ricky-laptop origin]$ grep -c 0 /tmp/test.out
100

It succeeds 100% of the time.

*** Bug 1784594 has been marked as a duplicate of this bug. ***

At oVirt we see the same network test failing constantly on our CI runs, for example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/4340/rehearse-4340-release-openshift-ocp-installer-e2e-ovirt-4.4/35 As far as I understood, for this bug and bug 1763936 we didn't find the reason for the problem, we just skipped the tests.

On oVirt we are seeing some of the tests from:

[sig-network] Networking Granular Checks: Services should function for node-Service
[sig-network] Networking Granular Checks: Services should function for pod-Service
[sig-network] Networking Granular Checks: Services should function for endpoint-Service

failing on each run of conformance. When we tried to debug the issue we saw that the test starts 4 pods on the workers: 1 pod is the caller, and 3 pods are behind a NodePort service (there are 2 workers on oVirt). When the caller pod on worker 1 tries to communicate with the service via the NodePort and the service happens to select a pod that is also running on worker 1, we get a failure. If the service selects a pod on worker 2, the test passes. This happens on OpenStack (most likely the same issue) and rarely on AWS. We were able to reproduce this issue on an oVirt environment.

Here are a few examples of recent conformance runs that hit the issue:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/222
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/223
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/226
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/228
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.4/229
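For illustration, a hedged sketch of the topology described above: a few UDP backends behind a NodePort Service plus a client pod probing a worker's NodePort. The names (udp-echo, udp-client), the agnhost image tag, and the port are hypothetical, this is not the conformance test's own code, and it assumes a recent kubectl.

# Hypothetical repro sketch: 3 UDP backends behind a NodePort Service, plus a client
# pod sending a "hostname" datagram to a worker's InternalIP on the allocated NodePort.
kubectl create deployment udp-echo --image=k8s.gcr.io/e2e-test-images/agnhost:2.21 -- /agnhost netexec --udp-port=8081
kubectl scale deployment udp-echo --replicas=3
kubectl expose deployment udp-echo --type=NodePort --protocol=UDP --port=8081
NODE_IP=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(kubectl get svc udp-echo -o jsonpath='{.spec.ports[0].nodePort}')
# The failure described above occurs when the backend chosen for the probe lands on the
# same worker as the client; a reply containing the backend's hostname means success.
kubectl run udp-client --rm -it --restart=Never --image=busybox -- sh -c "echo hostname | nc -u -w 3 $NODE_IP $NODE_PORT"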
The network stress test is continuously red on these tests: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4&sort-by-failures=

At this point, having a fairly consistent reproducer for:

[It] [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: udp [Skipped:openstack] [Suite:openshift/conformance/parallel] [Suite:k8s]

that fails completely on the stress test, AND knowing that it is related to a node talking to a node service (from the same node), highlights that this is likely a real issue. Moving priority to high, given that this may mean same-node service connectivity is broken for host services. Or, if the issue is that we have no plan to support UDP connectivity from host A back to a NodePort service on host A, then we need to fix the test and explain why (unlikely, high bar).
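To exercise the "host A back to a NodePort on host A" path raised here, a hedged sketch: a hostNetwork pod pinned to one worker probes that same worker's InternalIP over UDP. It reuses the hypothetical udp-echo service from the earlier sketch and is not part of the stress suite itself.

# Hypothetical same-node check: a hostNetwork pod on a worker sends a UDP probe to
# that worker's own InternalIP on the NodePort (udp-echo is from the sketch above).
NODE=$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')
NODE_IP=$(oc get node "$NODE" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(oc get svc udp-echo -o jsonpath='{.spec.ports[0].nodePort}')
oc run hostnet-probe --rm -it --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true,"nodeName":"'"$NODE"'"}}' \
  -- sh -c "echo hostname | nc -u -w 3 $NODE_IP $NODE_PORT"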
I ran the stress test suite on an OVN cluster and it worked fine:

Feb 03 15:52:56.757 I ns/openshift-etcd-operator deployment/etcd-operator Updated ConfigMap/etcd-ca-bundle -n openshift-etcd-operator: cause by changes in data.ca-bundle.crt (27 times)
Feb 03 15:58:56.211 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-2 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-2 (16 times)
Feb 03 15:58:56.400 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2a-8j2lv Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2a-8j2lv (19 times)
Feb 03 15:58:56.554 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2b-b2wb5 Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2b-b2wb5 (18 times)
Feb 03 15:58:56.737 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2c-fwcxk Updated machine rcarrillocruz-ovn-gra-z8rb7-worker-us-east-2c-fwcxk (19 times)
Feb 03 15:58:57.758 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-0 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-0 (16 times)
Feb 03 15:58:58.703 I ns/openshift-machine-api machine/rcarrillocruz-ovn-gra-z8rb7-master-1 Updated machine rcarrillocruz-ovn-gra-z8rb7-master-1 (16 times)
Feb 03 15:59:30.885 I ns/openshift-etcd-operator deployment/etcd-operator Updated ConfigMap/etcd-ca-bundle -n openshift-etcd-operator: cause by changes in data.ca-bundle.crt (29 times)
Feb 03 16:04:31.941 E kube-apiserver Kube API started failing: Get https://api.rcarrillocruz-ovn-granular.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: context deadline exceeded
Feb 03 16:04:31.941 I openshift-apiserver OpenShift API started failing: Get https://api.rcarrillocruz-ovn-granular.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 03 16:04:33.513 - 15s E kube-apiserver Kube API is not responding to GET requests
Feb 03 16:04:33.513 - 15s E openshift-apiserver OpenShift API is not responding to GET requests

825 pass, 15 skip (51m25s)

Will do a few more runs, hoping to hit it at some point.

(In reply to Ricardo Carrillo Cruz from comment #13)
> I ran the stress test suite on an OVN cluster and it worked fine:
> [...]
> Will do a few more runs, hoping to hit it at some point.

As I said above, the problem is easily reproducible on oVirt. We are seeing this problem with most tests that relate to NodePort. We are able to create a debugging environment, meaning deploy an oVirt cluster and pause the test in the middle, so we will be able to see exactly what happens. Can we schedule a debug session this week and pin down this problem?

It's not needed yet. This appears not to be platform specific, as the stress test shows issues in CI runs on AWS (even though my tests against AWS haven't shown any issues so far). I'll try to reach Clayton to find out more about how this stress test works.

Checked https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/790 ; moving this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0492