Description of problem:
network timeouts at a suspiciously high failure rate

Version-Release number of selected component (if applicable):
4.3, 4.4, 4.5

How reproducible:
Always

Steps to Reproduce:
Openshift CI

Actual results:
Unable to connect/talk to the internet:
Get http://google.com: dial tcp: i/o timeout

Expected results:

Additional info:
https://issues.redhat.com/browse/MULTIARCH-188

4.3.z tests:
[sig-network] Networking Granular Checks: Services should function for client IP based session affinity: http [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for client IP based session affinity: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:Network/OVNKubernetes]
[sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for pod-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should update endpoints: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should update endpoints: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s] [Skipped:azure]

4.4 tests:
[sig-network] Networking [Top Level] [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should be able to handle large requests: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for pod-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should update endpoints: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should update endpoints: udp [Suite:openshift/conformance/parallel] [Suite:k8s]

4.5 tests:
[sig-network] Internal connectivity for TCP and UDP on ports 9000-9999 is allowed [Suite:openshift/conformance/parallel]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: http [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: udp [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Pods should function for node-pod communication: http [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should be able to handle large requests: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should be able to handle large requests: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: http [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
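To put a number on the "suspiciously high failure rate" in the description, a reproduction could repeat the failing request and count how often it fails. The following is a minimal Python sketch under that assumption. `http_probe` and `failure_rate` are hypothetical helper names (they are not part of the e2e suite), the URL mirrors the one in the error message, and the timeout and attempt count are arbitrary choices.

```python
import http.client
import urllib.request


def http_probe(url="http://google.com", timeout=10.0):
    """Return True if a GET to `url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, socket timeouts and DNS errors are all OSError subclasses.
        return False


def failure_rate(probe, attempts=20):
    """Call `probe` `attempts` times and return the fraction that failed."""
    failures = sum(1 for _ in range(attempts) if not probe())
    return failures / attempts
```

Running `failure_rate(http_probe)` from inside a pod on an affected cluster and again from a known-good cluster would give two rates to compare. The probe is passed in as a callable so the counting logic can be exercised without network access.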
This bug is still occurring. Here's a link to the latest job that failed - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/1281589179389579264
Hi Rafael, will someone from your QE team have the bandwidth to look at this bug during the current release? If not, I would like to set the "UpcomingSprint" tag
(In reply to Dan Li from comment #3)
> Hi Rafael, will someone from your QE team have the bandwidth to look at this
> bug during the current release? If not, I would like to set the
> "UpcomingSprint" tag

By "current release" I meant before August 1
I started investigating this issue. Moving it to Assigned.
Setting the Target Release to 4.5.z per Rafael's Comment 5. Please feel free to change if necessary
Hi Rafael - can we add a Severity and Target Release per Eric's Comment 7?
Sure. I'm setting it to Low because: 1) the respective tests are currently disabled and do not impact CI; 2) this issue could be related to how our CI clusters are set up. It shouldn't be a problem for customer deployments since they have control over their own networking.
Setting the Target Release to 4.6.0 as there is no "Depends On" bug in 4.6. Hopefully it will be correct this time :) Please feel free to change if necessary
I couldn't reproduce the google tcp i/o timeout in my local CI run, but one of the network issues is related to https://github.com/kubernetes/kubernetes/pull/92193 (based on https://github.com/kubernetes/kubernetes/issues/88986). As of now, it's unclear whether this issue affects baremetal installs. The kubernetes links above contain a workaround that can be used until the fix is released.
Hi Rafael, will this bug be closed before next Monday? If you are still working on it, can we add the "UpcomingSprint" label?
This failure is flaky in nature, so it is better to add the label and keep monitoring the CI runs.
Issue #1868106 could also be contributing to the timeouts.
Hi Rafael, will this bug be closed before the end of this week? If you are still working on it, can we add "UpcomingSprint" label?
I don't think so. Feel free to add the label.
Hi Rafael, will this bug be resolved before the end of the sprint this week? If you are still working on it, I would like to add the "UpcomingSprint" label.
Still working on this. There is some investigation going on on the glibc side of things.
Hi Rafael, do you think this bug will be resolved before the end of this sprint October 3rd? If not, I would like to add "UpcomingSprint" label.
This test "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]" is supposed to be excluded as per https://github.com/kubernetes/kubernetes/issues/95185
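The test names in this report carry bracketed tags such as `[Skipped:azure]` and `[Skipped:Network/OVNKubernetes]`. As a rough illustration of how a name-based exclusion could work, the Python sketch below parses those tags and applies an extra substring exclude list. This is an assumption about the mechanism, not the actual openshift-tests or Kubernetes e2e implementation. `skipped_for` and `should_run` are hypothetical names.

```python
import re

# Matches every [bracketed] tag in a test name.
TAG_RE = re.compile(r"\[([^\]]+)\]")


def skipped_for(test_name):
    """Return the set of values found in [Skipped:...] tags of a test name."""
    return {
        tag.split(":", 1)[1]
        for tag in TAG_RE.findall(test_name)
        if tag.startswith("Skipped:")
    }


def should_run(test_name, platform, exclude_substrings=()):
    """Decide whether a test runs on `platform` (hypothetical rule set)."""
    if platform in skipped_for(test_name):
        return False
    # Extra excludes, e.g. tests that need outbound Internet access.
    return not any(s in test_name for s in exclude_substrings)
```

With this sketch, the Internet connection test would still be selected on any platform other than `azure` unless it is also listed in `exclude_substrings`, which is consistent with it showing up in the failing lists above.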
Hi Rafael, since this bug was reported in 4.3.z and OCP 4.3 will go end-of-life next week post 4.6 release, should we close this bug? Alternatively, we can re-target this bug to 4.4, 4.5, or 4.6 if the bug is ongoing.
Hi Rafael, will this bug be resolved before the end of this sprint (Oct 24th)? If not, can we add "UpcomingSprint"?
Hi @Rafael, can we assign this bug a Target Release (4.6.z or 4.7)? 4.6 just released and is no longer a valid Target Release.
Changing the assignee to Rafael per Comment 5.
Hi Rafael, will this bug be resolved before the end of this sprint (Nov 14th)? If not, can we add "UpcomingSprint"?
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Dec 5th)? If not, can we add "UpcomingSprint"?
Hi Rafael, I am doing this exercise one week early because most people are out next week.
1. Do you think this bug will be resolved before the end of this sprint (December 26th)? If not, I'd like to add "UpcomingSprint".
2. Do you think this bug's Target Release is still 4.7.0? If it does not target 4.7, can we set it to blank value "---"?
Bug-watcher here -- I'm seeing this crop up again in some 4.7 jobs; e.g. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.7/1346788604482949120
Hi! These tests are disabled on 4.7 as of yesterday evening (after that run). The failures occur because the OpenShift 3.11 Prow build farm has DNS outages during peak hours. We raised this with DPTP, but they said that we are better off upgrading to one of the 4.x build farms. Since we don't want to do a platform migration this close to release, we've disabled the tests until the migration is complete. https://github.com/openshift/release/pull/14662
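One way to check whether a given timeout comes from name resolution rather than from the TCP connect is to time the two stages separately. The Python sketch below does that under stated assumptions. `classify_probe` is a hypothetical helper, and the `resolve` and `connect` parameters exist only so the stages can be swapped out. Note that `socket.getaddrinfo` takes no timeout argument, so the DNS stage is bounded by the system resolver's own settings.

```python
import socket
import time


def classify_probe(host, port=80, timeout=5.0,
                   resolve=socket.getaddrinfo,
                   connect=socket.create_connection):
    """Return (stage, seconds), where stage is 'ok', 'dns', or 'connect'."""
    start = time.monotonic()
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except OSError:
        # socket.gaierror (resolution failure) is an OSError subclass.
        return "dns", time.monotonic() - start
    # Connect to the first resolved IP so this stage issues no new DNS query.
    addr = infos[0][4][:2]
    try:
        conn = connect(addr, timeout=timeout)
    except OSError:
        return "connect", time.monotonic() - start
    conn.close()
    return "ok", time.monotonic() - start
```

Logging the returned stage across a day of CI runs would show whether failures cluster in the `dns` stage during peak hours, which is what the build farm explanation above predicts.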
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Jan 16th)? If not, can we add "UpcomingSprint"?
Hi Rafael, do you think this bug will be resolved before the end of this sprint (Feb 6th)? If not, can we set the "Reviewed-in-Sprint" flag to "+"?
Hi Rafael, since 4.4 will go end of support after 4.7 GA, can we either re-target the version of this bug to a later release, or close out this bug as its reported version will no longer be in support?
Moving to 4.8 since this might still be resolved by the cluster migration in CI.
*** Bug 1886940 has been marked as a duplicate of this bug. ***
Hi Deep, do you think this bug will be resolved before the end of the sprint (Feb. 26th)? If not, can we set "Reviewed-in-Sprint" flag?
https://github.com/openshift/release/pull/16331
Hi Deep, do you think this bug will be resolved before the end of the sprint (Mar 20th)? If not, can we set "Reviewed-in-Sprint" flag?
This bug is resolved and waiting for the PR to merge.
Hi Deep, I'm going through the bugs to triage and see that your PR 16331 has merged. Should this bug be at ON_QA at this point? If it's still at POST (PR hasn't merged), then let's add "Reviewed-in-Sprint" for the past sprint before the bot resets the flag later this week.
Changing to ON_QA after discussion with Deep as the PR has merged.
Networking Granular related tests are passing in CI. Latest CI run -> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.8/1405495486701375488
Marking as VERIFIED per Deep's Comment 44
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438