Bug 1968021

Summary: network-check-source deployment does not complete during upgrade
Product: OpenShift Container Platform Reporter: jamo luhrsen <jluhrsen>
Component: Networking Assignee: Andrew Stoycos <astoycos>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED WONTFIX Docs Contact:
Severity: high    
Priority: low CC: vpickard
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-15 17:17:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description jamo luhrsen 2021-06-04 17:59:30 UTC
Description of problem:

The 4.7->4.8 OVN upgrade job fails in "Cluster should remain functional during upgrade", which
complains that the network-check-source deployment does not finish deploying. Full test log error:

  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-ry63bg7k/release@sha256:e469b0e52d9e030a83b6311447a37264ca49b330ba67488c58a0d66b085fac85
Unexpected error:
    <*errors.errorString | 0xc00095a8e0>: {
        s: "ClusterOperators did not settle: \nclusteroperator/image-registry is Progressing for 15m27.554551071s because \"Progressing: The deployment has not completed\"\n\tclusteroperator/network is Progressing for 4m21.554559551s because \"Deployment \\\"openshift-network-diagnostics/network-check-source\\\" is not available (awaiting 1 nodes)\"",
    }
    ClusterOperators did not settle: 
    clusteroperator/image-registry is Progressing for 15m27.554551071s because "Progressing: The deployment has not completed"
    	clusteroperator/network is Progressing for 4m21.554559551s because "Deployment \"openshift-network-diagnostics/network-check-source\" is not available (awaiting 1 nodes)"
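
For triage across several runs, the "did not settle" message above can be split into one record per operator. This is only a rough sketch: the regex is inferred from the single message quoted in this report, and parse_unsettled is a hypothetical helper, not something taken from the origin test suite.

```python
import re

# Pattern inferred from the one message quoted in this bug; other runs
# may format the line differently.
UNSETTLED = re.compile(
    r'clusteroperator/(?P<name>[\w-]+) is (?P<condition>\w+) '
    r'for (?P<duration>\S+) because "(?P<reason>.*)"$'
)


def parse_unsettled(message):
    """Return one dict per 'clusteroperator/...' line found in the message."""
    results = []
    for line in message.splitlines():
        match = UNSETTLED.search(line.strip())
        if match:
            results.append(match.groupdict())
    return results


# The message from this report, as it appears in the expanded test output.
sample = (
    'ClusterOperators did not settle: \n'
    'clusteroperator/image-registry is Progressing for 15m27.554551071s '
    'because "Progressing: The deployment has not completed"\n'
    '\tclusteroperator/network is Progressing for 4m21.554559551s because '
    '"Deployment \\"openshift-network-diagnostics/network-check-source\\" '
    'is not available (awaiting 1 nodes)"'
)

for op in parse_unsettled(sample):
    print(op["name"], op["condition"], op["duration"])
```

Each record keeps the operator name, the condition, the duration string, and the quoted reason, so runs can be grouped by which operator failed to settle.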


This job has many other failures that all point to networking, such as routes not coming up or
OVS port bindings timing out. They may all be related. The port binding bug is:
  https://bugzilla.redhat.com/show_bug.cgi?id=1968009

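The network-check-source pod log shows a 'no route to host' error when the check-endpoints
container tries to reach the API service (172.30.0.1:443) to load the
extension-apiserver-authentication configmap: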

  F0604 02:43:42.814512       1 cmd.go:129] unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400613589785513984/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-network-diagnostics_network-check-source-dbdfd5479-vtmx4_check-endpoints.log
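
To tell "no route to host" apart from a refused connection or a timeout when reproducing this, a plain TCP probe is enough. The sketch below is an assumption about how one might check it by hand and is not part of check-endpoints; probe and classify are hypothetical names. The "no route to host" text in the Go log corresponds to the EHOSTUNREACH errno.

```python
import errno
import socket


def classify(exc):
    """Map an OSError from connect() to a short label."""
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    if exc.errno == errno.ENETUNREACH:
        return "network unreachable"
    return f"other error: {exc}"


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and return a label describing the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # Must be caught before OSError, which it subclasses.
        return "timeout"
    except OSError as exc:
        return classify(exc)
```

Run from a pod on the affected node, probe("172.30.0.1", 443) would be expected to return "no route to host" if the failure matches the log line above.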

