Bug 1846875 - Network setup test high failure rate
Summary: Network setup test high failure rate
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Deep Mistry
QA Contact: Barry Donahue
URL:
Whiteboard:
Duplicates: 1886940
Depends On:
Blocks:
 
Reported: 2020-06-15 07:51 UTC by Rafael Fonseca
Modified: 2021-07-27 22:33 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-network] Networking Granular Checks: Services should function for pod-Service(hostNetwork): udp
Last Closed: 2021-07-27 22:32:23 UTC
Target Upstream Version:
Embargoed:
dmistry: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1763936 0 high CLOSED [sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformanc... 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1829498 0 high CLOSED [sig-network] Networking Granular Checks: Services - Multiple tests failing on e2e-aws-sdn-network-stress-4.4 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1868106 0 unspecified CLOSED glibc: Transaction ID collisions cause slow DNS lookups in getaddrinfo 2023-12-15 18:48:20 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:33:11 UTC

Description Rafael Fonseca 2020-06-15 07:51:45 UTC
Description of problem: Network tests time out at a suspiciously high rate.


Version-Release number of selected component (if applicable): 4.3, 4.4, 4.5


How reproducible: Always


Steps to Reproduce: Run the tests listed below in OpenShift CI


Actual results: Unable to connect/talk to the internet: Get http://google.com: dial tcp: i/o timeout


Expected results:


Additional info: https://issues.redhat.com/browse/MULTIARCH-188

4.3.z tests:
[sig-network] Networking Granular Checks: Services should function for client IP based session affinity: http [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for client IP based session affinity: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for pod-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should update endpoints: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should update endpoints: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:azure] 

4.4 tests:
[sig-network] Networking [Top Level] [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should be able to handle large requests: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for pod-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should update endpoints: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should update endpoints: udp [Suite:openshift/conformance/parallel] [Suite:k8s]

4.5 tests:
[sig-network] Internal connectivity for TCP and UDP on ports 9000-9999 is allowed [Suite:openshift/conformance/parallel]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: http [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: udp [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Pods should function for node-pod communication: http [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should be able to handle large requests: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should be able to handle large requests: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 2 Rashmi Gottipati 2020-07-10 21:19:06 UTC
This bug is still occurring. Here's a link to the latest failed job:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1281589179389579264

Comment 3 Dan Li 2020-07-21 15:51:31 UTC
Hi Rafael, will someone from your QE team have the bandwidth to look at this bug during the current release? If not, I would like to set the "UpcomingSprint" tag.

Comment 4 Dan Li 2020-07-21 15:56:55 UTC
(In reply to Dan Li from comment #3)
> Hi Rafael, will someone from your QE team have the bandwidth to look at this
> bug during the current release? If not, I would like to set the
> "UpcomingSprint" tag

By "current release" I meant before August 1.

Comment 5 Rafael Fonseca 2020-07-21 19:14:56 UTC
I started investigating this issue. Moving it to Assigned.

Comment 6 Dan Li 2020-07-21 19:28:58 UTC
Setting the Target Release to 4.5.z per Rafael's Comment 5. Please feel free to change it if necessary.

Comment 8 Dan Li 2020-07-21 20:02:57 UTC
Hi Rafael, can we add a Severity and Target Release per Eric's Comment 7?

Comment 9 Rafael Fonseca 2020-07-21 20:14:19 UTC
Sure. I'm setting it to Low because:
 1) the respective tests are currently disabled and do not impact CI;
 2) this issue could be related to how our CI clusters are set up. It shouldn't be a problem for customer deployments since they have control over their own networking.

Comment 11 Dan Li 2020-07-21 21:51:24 UTC
Setting the Target Release to 4.6.0 as there is no "Depends On" bug in 4.6. Hopefully it will be correct this time :) Please feel free to change it if necessary.

Comment 12 Rafael Fonseca 2020-07-27 15:22:14 UTC
I couldn't reproduce the Google TCP i/o timeout in my local CI run, but one of the network issues is related to https://github.com/kubernetes/kubernetes/pull/92193 (based on https://github.com/kubernetes/kubernetes/issues/88986). As of now, it's unclear whether this issue affects bare-metal installs. The Kubernetes links above contain a workaround that can be used until the fix is released.

Comment 13 Dan Li 2020-07-30 20:42:07 UTC
Hi Rafael, will this bug be closed before next Monday? If you are still working on it, can we add the "UpcomingSprint" label?

Comment 14 Rafael Fonseca 2020-07-30 22:36:55 UTC
This is flaky in nature, so it's better to add the label and keep monitoring the CI runs.

Comment 15 Rafael Fonseca 2020-08-16 10:02:15 UTC
Bug 1868106 could also be contributing to the timeouts.

Comment 16 Dan Li 2020-08-18 15:57:20 UTC
Hi Rafael, will this bug be closed before the end of this week? If you are still working on it, can we add the "UpcomingSprint" label?

Comment 17 Rafael Fonseca 2020-08-18 16:11:38 UTC
I don't think so. Feel free to add the label.

Comment 18 Dan Li 2020-09-08 18:12:55 UTC
Hi Rafael, will this bug be resolved before the end of the sprint this week? If you are still working on it, I would like to add the "UpcomingSprint" label.

Comment 19 Rafael Fonseca 2020-09-09 13:14:50 UTC
Still working on this. There is some investigation going on on the glibc side.

Comment 20 Dan Li 2020-09-28 15:10:26 UTC
Hi Rafael, do you think this bug will be resolved before the end of this sprint (October 3rd)? If not, I would like to add the "UpcomingSprint" label.

Comment 21 Rafael Fonseca 2020-10-06 12:43:53 UTC
This test "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]" is supposed to be excluded as per https://github.com/kubernetes/kubernetes/issues/95185

Comment 22 Dan Li 2020-10-13 22:33:19 UTC
Hi Rafael, since this bug was reported in 4.3.z and OCP 4.3 will go end-of-life next week post 4.6 release, should we close this bug? Alternatively, we can re-target this bug to 4.4, 4.5, or 4.6 if the bug is ongoing.

Comment 24 Dan Li 2020-10-19 19:49:03 UTC
Hi Rafael, will this bug be resolved before the end of this sprint (Oct 24th)? If not, can we add "UpcomingSprint"?

Comment 26 Dan Li 2020-10-27 14:38:38 UTC
Hi @Rafael, can we assign this bug a Target Release (4.6.z or 4.7)? 4.6 was just released and is no longer a valid Target Release.

Comment 27 Dan Li 2020-11-09 16:53:26 UTC
Changing the assignee to Rafael per Comment 5

Hi Rafael, will this bug be resolved before the end of this sprint (Nov 14th)? If not, can we add "UpcomingSprint"?

Comment 28 Dan Li 2020-11-30 18:24:48 UTC
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Dec 5th)? If not, can we add "UpcomingSprint"?

Comment 29 Dan Li 2020-12-15 18:11:56 UTC
Hi Rafael, I am doing this exercise one week early because most people are out next week. 

1. Do you think this bug will be resolved before the end of this sprint (December 26th)? If not, I'd like to add "UpcomingSprint"
2. Do you think this bug's Target Release is still 4.7.0? If it does not target 4.7, can we set it to blank value "---"?

Comment 30 Nick Hale 2021-01-08 00:40:47 UTC
Bug-watcher here -- I'm seeing this crop up again in some 4.7 jobs; e.g. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.7/1346788604482949120

Comment 31 Jeremy Poulin 2021-01-08 02:42:57 UTC
Hi!

These tests are disabled on 4.7 as of yesterday evening (after that run).
The failures occur because the OpenShift 3.11 Prow build farm has DNS outages during peak hours. We raised this with DPTP, but they said that we are better off upgrading to one of the 4.x build farms. Since we don't want to do a platform migration this close to release, we've disabled the tests until the migration is complete.
https://github.com/openshift/release/pull/14662
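
A minimal sketch, not taken from the CI configuration discussed in this bug, of how a connectivity check could report whether a failure happened during DNS resolution or during the TCP connect, a distinction that the bare "dial tcp: i/o timeout" message does not make obvious. The function name probe and its resolve/connect parameters are hypothetical and exist only for illustration; the sketch is in Python, not the Go used by the e2e tests.

```python
import socket


def probe(host, port=80, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return the stage that failed: 'dns', 'connect-timeout', 'connect', or 'ok'.

    resolve and connect are injectable (an assumption of this sketch) so the
    logic can be exercised without real network access.
    """
    # Stage 1: name resolution only. A resolver outage surfaces here
    # as socket.gaierror, before any TCP connection is attempted.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Stage 2: TCP connect to the first resolved address. Passing the
    # numeric address means this step no longer depends on DNS.
    address = infos[0][4][:2]
    try:
        conn = connect(address, timeout=timeout)
    except socket.timeout:
        # Must be caught before OSError, which is its base class.
        return "connect-timeout"
    except OSError:
        return "connect"
    conn.close()
    return "ok"
```

Running something like probe("google.com") from inside a test pod during a failing window would indicate whether the resolver or the network path is the more likely culprit.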

Comment 32 Dan Li 2021-01-12 19:23:48 UTC
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Jan 16th)? If not, can we add "UpcomingSprint"?

Comment 33 Dan Li 2021-02-01 15:07:59 UTC
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Feb 6th)? If not, can we set the "Reviewed-in-Sprint" flag to "+"?

Comment 34 Dan Li 2021-02-18 20:00:42 UTC
Hi Rafael, since 4.4 will go end of support after 4.7 GA, can we either re-target the version of this bug to a later release, or close out this bug as its reported version will no longer be in support?

Comment 35 Rafael Fonseca 2021-02-18 21:32:22 UTC
Moving to 4.8 since this might still be resolved by the cluster migration in CI.

Comment 36 Jeremy Poulin 2021-02-20 20:42:09 UTC
*** Bug 1886940 has been marked as a duplicate of this bug. ***

Comment 37 Dan Li 2021-02-22 15:55:41 UTC
Hi Deep, do you think this bug will be resolved before the end of the sprint (Feb. 26th)? If not, can we set "Reviewed-in-Sprint" flag?

Comment 38 Deep Mistry 2021-03-04 16:23:44 UTC
https://github.com/openshift/release/pull/16331

Comment 39 Dan Li 2021-03-15 16:38:37 UTC
Hi Deep, do you think this bug will be resolved before the end of the sprint (Mar 20th)? If not, can we set "Reviewed-in-Sprint" flag?

Comment 40 Deep Mistry 2021-03-15 17:45:34 UTC
This bug is resolved and waiting for the PR to merge.

Comment 41 Dan Li 2021-03-22 12:40:10 UTC
Hi Deep, I'm going through the bugs to triage and see that your PR 16331 has merged. Should this bug be at ON_QA at this point? If it's still at POST (the PR hasn't merged), then let's add "Reviewed-in-Sprint" for the past sprint before the bot resets the flag later this week.

Comment 42 Dan Li 2021-03-22 12:50:03 UTC
Changing to ON_QA after discussion with Deep as the PR has merged.

Comment 44 Deep Mistry 2021-06-17 14:08:33 UTC
The Networking Granular Checks tests are passing in CI.
Latest CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.8/1405495486701375488

Comment 45 Dan Li 2021-06-17 14:10:49 UTC
Marking as VERIFIED per Deep's Comment 44

Comment 47 errata-xmlrpc 2021-07-27 22:32:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

