Bug 2076809

Summary: metal OVN: NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should ...: 80: connect: no route to host
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Networking Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: bbennett, sippy, vpickard
Version: 4.8 Keywords: Regression
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-04-22 15:37:53 UTC Type: Bug

Description W. Trevor King 2022-04-19 22:42:14 UTC
Several NetworkPolicy test cases recently died [1], with 4.8.0-0.nightly-2022-03-29-081438 passing [2] and 4.8.0-0.nightly-2022-03-31-201158 failing [3].

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1509625252559196160/build-log.txt | grep -A8 'Failing tests'
Failing tests:

[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow egress access to server in CIDR block [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce egress policy allowing traffic to a server in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce multiple egress policies with egress allow-all policy taking precedence [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policies to check ingress and egress policies can be controlled independently based on PodSelector [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should ensure an IP overlapping both IPBlock.CIDR and IPBlock.Except is allowed [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should work with Ingress,Egress specified together [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

Failure mode for the test cases seems to be consistently:

  OTHER: dial tcp [fd02::fc1e]:80: connect: no route to host

Bunch of jobs hitting this:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=NetworkPolicy+between+server+and+client+should' | grep 'failures match' | grep -v pull-ci- | sort
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 13 runs, 85% failed, 9% of failures match = 8% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-s390x (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x (all) - 8 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-ovn-kubernetes-release-4.10-e2e-ibmcloud-ipi-ovn-periodic (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-rt-upgrade (all) - 16 runs, 69% failed, 27% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 275 runs, 74% failed, 0% of failures match = 0% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 90 runs, 80% failed, 11% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure (all) - 8 runs, 13% failed, 100% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-aws-network-stress (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-gcp-ovn (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-ovn-network-stress (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-ovn (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-compact (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-ovn-ipv6 (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.4-e2e-aws (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-compact (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-compact (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-release-master-okd-4.11-e2e-vsphere (all) - 32 runs, 91% failed, 72% of failures match = 66% impact
periodic-ci-openshift-release-master-okd-4.6-e2e-aws (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.9-e2e-vsphere (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-ovn (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ccm-install (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-ovn (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-techpreview-parallel (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.8-e2e-openstack-parallel (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-origin-installer-e2e-azure-compact-4.4 (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-origin-installer-e2e-gcp-compact-4.4 (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
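
To narrow a long summary like the one above, the lines can be parsed and filtered. This is only a rough sketch: the line layout is copied from the output shown here, while the helper names (parse_summary, jobs_at_full_impact) and the 100% threshold are illustrative assumptions, not part of search.ci.

```python
import re

# Layout assumed from the summary lines above:
#   <job> (all) - N runs, X% failed, Y% of failures match = Z% impact
SUMMARY_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_summary(line):
    """Return the fields of one summary line as a dict, or None if it does not match."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    row = {"job": m.group("job")}
    for key in ("runs", "failed", "match", "impact"):
        row[key] = int(m.group(key))
    return row


def jobs_at_full_impact(lines):
    """Hypothetical filter: sorted names of jobs whose reported impact is 100%."""
    rows = (parse_summary(line) for line in lines)
    return sorted(r["job"] for r in rows if r and r["impact"] == 100)
```

Feeding the output above through jobs_at_full_impact would list only the jobs where every run matched, which is a quick way to spot the consistently failing ones.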

Confirming the ":80: connect: no route to host" context:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&maxMatches=10&type=junit&context=10&search=NetworkPolicy+between+server+and+client+should' | jq -r 'to_entries[] | select(.value | tostring | contains(":80: connect: no route to host")).key' | sort | sed 's|.*/\([^/]*\)/[0-9]*$|\1|' | sort | uniq -c
      1 periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-compact
      3 periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-ovn-ipv6
      1 periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-compact
      3 periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6
      1 periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-compact
      3 periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6
      2 pull-ci-openshift-cluster-etcd-operator-release-4.10-e2e-metal-ipi-ovn-ipv6
     23 pull-ci-openshift-cluster-network-operator-release-4.10-e2e-metal-ipi-ovn-ipv6
     27 pull-ci-openshift-cluster-network-operator-release-4.8-e2e-metal-ipi-ovn-ipv6
     27 pull-ci-openshift-cluster-network-operator-release-4.8-e2e-metal-ipi-ovn-ipv6-ipsec
     16 pull-ci-openshift-cluster-network-operator-release-4.9-e2e-metal-ipi-ovn-ipv6
      1 pull-ci-openshift-installer-release-4.10-e2e-metal-ipi-ovn-ipv6
     12 pull-ci-openshift-kubernetes-release-4.8-e2e-metal-ipi-ovn-ipv6
      1 pull-ci-openshift-machine-config-operator-release-4.8-e2e-metal-ipi-ovn-ipv6
      1 pull-ci-openshift-machine-config-operator-release-4.9-e2e-metal-ipi-ovn-ipv6
      2 pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-metal-ipi-ovn-ipv6
      1 pull-ci-openshift-router-release-4.10-e2e-metal-ipi-ovn-ipv6

So at least 4.8 through 4.10, and possibly other versions, are affected.  All of these context hits are metal, most are OVN IPv6, and some are compact.
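
For anyone without jq handy, the per-job tally above can be approximated in Python. This is a sketch under assumptions: it takes the already-downloaded search response as a dict keyed by job-run URL (as the jq to_entries/.key usage implies), the shape of each value is not relied on beyond being JSON-serializable, and the function names are hypothetical.

```python
import json
import re
from collections import Counter

NEEDLE = ":80: connect: no route to host"


def job_name(run_url):
    """Mirror the sed expression: keep the path segment before the trailing run ID."""
    return re.sub(r".*/([^/]*)/[0-9]*$", r"\1", run_url)


def count_hits(search_results, needle=NEEDLE):
    """Count, per job, the runs whose serialized match data contains the needle."""
    hits = Counter()
    for run_url, value in search_results.items():
        # Like jq's `tostring | contains(...)`: search the JSON text of the value.
        if needle in json.dumps(value):
            hits[job_name(run_url)] += 1
    return hits
```

Calling count_hits on the parsed response and printing the counter sorted by key should give roughly the same table as the `sort | uniq -c` pipeline.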

Diffing the good and bad 4.8 nightlies:

$ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1509595459935539200/artifacts/release/artifacts/release-images-latest
$ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1509625252559196160/artifacts/release/artifacts/release-images-latest
$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s "${REF_A}" | jq -r "${JQ}") <(curl -s "${REF_B}" | jq -r "${JQ}")
--- /dev/fd/63  2022-04-19 09:10:23.703841396 -0700
+++ /dev/fd/62  2022-04-19 09:10:23.708841396 -0700
@@ -92 +92 @@
-machine-config-operator https://github.com/openshift/machine-config-operator/commit/3238e971006b3d9ba002122a767fa7654e86928f
+machine-config-operator https://github.com/openshift/machine-config-operator/commit/b8e1b11cc4e01d9ffe4e66a432c14d70e59838a6
@@ -111 +111 @@
-openstack-machine-controllers https://github.com/openshift/cluster-api-provider-openstack/commit/eb8656e9dfb4e6945ba8a7776433bc19e9397ba8
+openstack-machine-controllers https://github.com/openshift/cluster-api-provider-openstack/commit/77840b9a431880b15ee05d4a3f327b7ff2a682e8
@@ -118 +118 @@
-ovn-kubernetes https://github.com/openshift/ovn-kubernetes/commit/7976c7975b550571e6ff24158411633022122692
+ovn-kubernetes https://github.com/openshift/ovn-kubernetes/commit/aa2c3f4048ba38f119c312b26d8682107a9b1dc6

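Of the three components that changed, ovn-kubernetes is the likeliest source of a networking regression, so [4] would be the suspicious OVN changes.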
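
The same comparison can be sketched in Python for cases where process substitution or jq is unavailable. It assumes the two release-images-latest documents have already been fetched and parsed into dicts with the spec.tags[] layout the jq filter above relies on; the function names are hypothetical.

```python
SOURCE_KEY = "io.openshift.build.source-location"
COMMIT_KEY = "io.openshift.build.commit.id"


def tag_commits(image_stream):
    """Map each tag name to '<source-location>/commit/<commit-id>'."""
    commits = {}
    for tag in image_stream.get("spec", {}).get("tags", []):
        annotations = tag.get("annotations") or {}
        commits[tag["name"]] = "{}/commit/{}".format(
            annotations.get(SOURCE_KEY, ""), annotations.get(COMMIT_KEY, "")
        )
    return commits


def changed_components(good, bad):
    """Return {tag name: (commit URL in good, commit URL in bad)} for tags that differ."""
    a, b = tag_commits(good), tag_commits(bad)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Run against the passing and failing payloads, this should surface the same three tags as the diff output, with tags present in only one payload reported with None on the other side.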

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1509595459935539200
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1509625252559196160
[4]: https://github.com/openshift/ovn-kubernetes/compare/7976c7975b550571e6ff24158411633022122692...aa2c3f4048ba38f119c312b26d8682107a9b1dc6

Comment 1 W. Trevor King 2022-04-19 22:45:17 UTC
Per [1], we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/update/update-blocker-lifecycle/README.md#impact-statement-request

Comment 2 Jacob Tanenbaum 2022-04-22 15:37:53 UTC

*** This bug has been marked as a duplicate of bug 2077370 ***

Comment 3 W. Trevor King 2022-04-27 22:24:00 UTC
Impact-statement request has moved to [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2074839#c1