Created attachment 1760838 [details]
must-gather information

Description of problem:

Hello, I am upgrading a healthy OCP 4.6.18 cluster to OCP 4.7.0, but the upgrade fails because several critical Cluster Operators go degraded.

Version-Release number of selected component (if applicable):

Current OCP version: 4.6.18
Target upgrade version: 4.7.0
OCP cluster setup method: bare-metal UPI
OCP cluster networking environment: completely disconnected! The network cannot reach any Internet resource. I have set up a disconnected registry server for this OCP cluster. The environment has been proven to work, because the OCP 4.6.18 install completed successfully.

How reproducible:

Upgrade the current OCP 4.6.18 cluster to OCP 4.7.0.

Steps to Reproduce:

1. Upgrade the current OCP 4.6.18 cluster to OCP 4.7.0:

# oc adm upgrade --allow-explicit-upgrade \
    --allow-upgrade-with-warnings=true \
    --force=true \
    --to-image=ocp-1.registry.example.internal:5001/ocp4/openshift4:4.7.0

2. Gather upgrade output:

(1) Status of all nodes
==============================
# oc get nodes
==============================
NAME                                       STATUS   ROLES           AGE   VERSION
master-01.cluster2.ocp4.example.internal   Ready    master,worker   14h   v1.20.0+ba45583
master-02.cluster2.ocp4.example.internal   Ready    master,worker   14h   v1.20.0+ba45583
master-03.cluster2.ocp4.example.internal   Ready    master,worker   13h   v1.20.0+ba45583

(2) Status of all ClusterOperators
==============================
# oc get co
==============================
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0     False       False         True       35m
baremetal                                  4.7.0     True        False         False      14h
cloud-credential                           4.7.0     True        False         False      16h
cluster-autoscaler                         4.7.0     True        False         False      16h
config-operator                            4.7.0     True        False         False      16h
console                                    4.7.0     False       False         True       98m
csi-snapshot-controller                    4.7.0     True        False         False      126m
dns                                        4.7.0     True        False         False      15h
etcd                                       4.7.0     True        False         False      15h
image-registry                             4.7.0     True        False         True       15h
ingress                                    4.7.0     True        False         False      122m
insights                                   4.7.0     True        False         False      16h
kube-apiserver                             4.7.0     True        False         False      15h
kube-controller-manager                    4.7.0     True        False         False      16h
kube-scheduler                             4.7.0     True        False         False      16h
kube-storage-version-migrator              4.7.0     True        False         False      13h
machine-api                                4.7.0     True        False         False      16h
machine-approver                           4.7.0     True        False         False      16h
machine-config                             4.7.0     True        False         False      13h
marketplace                                4.7.0     True        False         False      128m
monitoring                                 4.7.0     False       True          True       13h
network                                    4.7.0     True        False         False      14h
node-tuning                                4.7.0     True        False         False      14h
openshift-apiserver                        4.7.0     False       False         False      14h
openshift-controller-manager               4.7.0     True        False         False      14h
openshift-samples                          4.7.0     False       False         False      128m
operator-lifecycle-manager                 4.7.0     True        False         False      16h
operator-lifecycle-manager-catalog         4.7.0     True        False         False      16h
operator-lifecycle-manager-packageserver   4.7.0     True        False         False      4s
service-ca                                 4.7.0     True        False         False      16h
storage                                    4.7.0     True        False         False      16h

Actual results:

The upgrade is stuck, with PROGRESSING permanently True:
==============================
# oc get clusterversion
==============================
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.18    True        True          15h     Working towards 4.7.0: 8 of 668 done (1% complete)

==============================
# oc get mcp
==============================
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0e3d4f0417b31f214ce4a4ad13e9abc1   True      False      False      3              3                   3                     0                      16h
worker   rendered-worker-330381a45417785c52a234b02b7a8e83   True      False      False      0              0                   0                     0                      16h
==============================
# oc describe co authentication openshift-apiserver
==============================
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-03-04T15:43:48Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-03-04T15:43:48Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         authentication-operator
    Operation:       Update
    Time:            2021-03-04T17:14:05Z
  Resource Version:  515482
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:               a601ca64-6d29-4f38-bfaa-b4f487c3462e
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-03-04T17:47:07Z
    Message:               OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.129.0.58:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
                           RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
                           OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
                           OAuthServerDeploymentDegraded: Unable to get "openshift-browser-client" bootstrapped OAuth client: the server is currently unable to handle the request (post oauthclients.oauth.openshift.io)
    Reason:                OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_GetFailed::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::Route_FailedCreate
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-03-05T08:03:09Z
    Message:               OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz" not successful yet
    Reason:                OAuthVersionRoute_WaitingForRoute
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-03-05T07:23:50Z
    Message:               APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz" failed: dial tcp: i/o timeout
                           OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    Reason:                APIServices_Error::OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-03-04T15:48:06Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:   infrastructures
    Group:      config.openshift.io
    Name:       cluster
    Resource:   oauths
    Group:      route.openshift.io
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   routes
    Group:
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   services
    Group:
    Name:       openshift-config
    Resource:   namespaces
    Group:
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:
    Name:       openshift-authentication
    Resource:   namespaces
    Group:
    Name:       openshift-authentication-operator
    Resource:   namespaces
    Group:
    Name:       openshift-ingress
    Resource:   namespaces
    Group:
    Name:       openshift-oauth-apiserver
    Resource:   namespaces
  Versions:
    Name:     oauth-apiserver
    Version:  4.7.0
    Name:     operator
    Version:  4.7.0
    Name:     oauth-openshift
    Version:  4.7.0_openshift
Events:  <none>


Name:         openshift-apiserver
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-03-04T15:43:48Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-03-04T15:43:48Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         cluster-openshift-apiserver-operator
    Operation:       Update
    Time:            2021-03-04T17:08:16Z
  Resource Version:  515436
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:               27ff2b81-937d-4789-9449-b87ab49c7573
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-03-05T05:57:35Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2021-03-04T17:23:16Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-03-04T17:57:40Z
    Message:               APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
                           APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    Reason:                APIServices_Error
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-03-04T15:48:32Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   openshiftapiservers
    Group:
    Name:       openshift-config
    Resource:   namespaces
    Group:
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:
    Name:       openshift-apiserver-operator
    Resource:   namespaces
    Group:
    Name:       openshift-apiserver
    Resource:   namespaces
    Group:
    Name:       openshift-etcd-operator
    Resource:   namespaces
    Group:
    Name:       host-etcd-2
    Namespace:  openshift-etcd
    Resource:   endpoints
    Group:      controlplane.operator.openshift.io
    Name:
    Namespace:  openshift-apiserver
    Resource:   podnetworkconnectivitychecks
    Group:      apiregistration.k8s.io
    Name:       v1.apps.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.authorization.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.build.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.image.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.project.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.quota.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.route.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.security.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.template.openshift.io
    Resource:   apiservices
  Versions:
    Name:     operator
    Version:  4.7.0
    Name:     openshift-apiserver
    Version:  4.7.0
Events:  <none>

==============================
# oc adm must-gather --image=ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest
==============================
[must-gather      ] OUT Using must-gather plugin-in image: ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-f5zbr created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jb6xk created
[must-gather      ] OUT pod for plug-in image ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest created
[must-gather-96kqh] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-
[must-gather-96kqh] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-
[must-gather-96kqh] POD Gathering data for ns/openshift-cluster-version...
[must-gather-96kqh] POD Gathering data for ns/default...
[must-gather-96kqh] POD Gathering data for ns/openshift...
[must-gather-96kqh] POD Gathering data for ns/kube-system...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd...
[must-gather-96kqh] POD Gathering data for ns/openshift-kni-infra...
[must-gather-96kqh] POD Wrote inspect data to must-gather.
[must-gather-96kqh] POD error: errors ocurred while gathering data:
[must-gather-96kqh] POD [the server doesn't have a resource type "deploymentconfigs", the server doesn't have a resource type "imagestreams"]
[must-gather-96kqh] POD Gathering data for ns/openshift-config...
[must-gather-96kqh] POD Gathering data for ns/openshift-config-managed...
[must-gather-96kqh] POD Gathering data for ns/openshift-authentication...
[must-gather-96kqh] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress...
[must-gather-96kqh] POD Gathering data for ns/openshift-oauth-apiserver...
[must-gather-96kqh] POD Gathering data for ns/openshift-machine-api...
[must-gather-96kqh] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-config-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-console-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-console...
[must-gather-96kqh] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-dns-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-dns...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd...
[must-gather-96kqh] POD Gathering data for ns/openshift-image-registry...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress-canary...
[must-gather-96kqh] POD Gathering data for ns/openshift-insights...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-apiserver-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-apiserver...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-controller-manager...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-controller-manager-operator...
[must-gather-96kqh] OUT gather logs unavailable: unexpected EOF
[must-gather-96kqh] OUT waiting for gather to complete
[must-gather-96kqh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jb6xk deleted
[must-gather      ] OUT namespace/openshift-must-gather-f5zbr deleted
error: gather never finished for pod must-gather-96kqh: timed out waiting for the condition

Expected results:

OCP 4.6.18 upgrades to OCP 4.7.0 successfully.

Additional info:

I have attached all of the must-gather information from my cluster to this bug.
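For reference, the failing conditions above can be pulled for all ClusterOperators at once instead of describing each one. A minimal sketch, assuming jq is available on the bastion host (jq is not part of oc):

# Print every ClusterOperator condition that indicates trouble
# (Degraded=True or Available=False) together with its message.
oc get clusteroperators -o json | jq -r '
  .items[]
  | . as $co
  | .status.conditions[]?
  | select((.type == "Degraded" and .status == "True")
        or (.type == "Available" and .status == "False"))
  | "\($co.metadata.name)\t\(.type)=\(.status)\t\(.message // "-")"'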
I use HAProxy as the OCP load balancer; the configuration is as follows:

[root@lb-01 ~]# cat /etc/haproxy/haproxy.cfg
# Global settings
#---------------------------------------------------------------------
global
    maxconn     20000
    log         /dev/log local0 info
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
#    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          300s
    timeout server          300s
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 20000

listen stats
    bind 192.168.30.29:9000
    mode http
    stats enable
    stats uri /

frontend cluster2-ocp4-api-server-frontend
    bind 192.168.30.32:6443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-api-server-backend

frontend cluster2-ocp4-machine-config-server-frontend
    bind 192.168.30.32:22623
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-machine-config-server-backend

frontend cluster2-ocp4-ingress-http-frontend
    bind 192.168.30.32:80
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-http-backend

frontend cluster2-ocp4-ingress-https-frontend
    bind 192.168.30.32:443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-https-backend

backend cluster2-ocp4-api-server-backend
    option httpchk GET /readyz HTTP/1.0
    option log-health-checks
    balance roundrobin
    mode tcp
    server bootstrap bootstrap.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-01 master-01.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-02 master-02.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-03 master-03.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3

backend cluster2-ocp4-machine-config-server-backend
    balance roundrobin
    mode tcp
    server bootstrap bootstrap.ocp4.example.internal:22623 check
    server master-01 master-01.cluster2.ocp4.example.internal:22623 check
    server master-02 master-02.cluster2.ocp4.example.internal:22623 check
    server master-03 master-03.cluster2.ocp4.example.internal:22623 check

backend cluster2-ocp4-ingress-http-backend
    balance source
    mode tcp
    server master-01 master-01.cluster2.ocp4.example.internal:80 check
    server master-02 master-02.cluster2.ocp4.example.internal:80 check
    server master-03 master-03.cluster2.ocp4.example.internal:80 check

backend cluster2-ocp4-ingress-https-backend
    balance source
    mode tcp
    server master-01 master-01.cluster2.ocp4.example.internal:443 check
    server master-02 master-02.cluster2.ocp4.example.internal:443 check
    server master-03 master-03.cluster2.ocp4.example.internal:443 check
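For completeness, HAProxy's own view of backend health can be read from the stats socket configured above. A sketch, assuming socat is installed on lb-01 (field 18 of the "show stat" CSV is the UP/DOWN status):

# Print proxy name, server name, and health status for every backend server
[root@lb-01 ~]# echo "show stat" | socat stdio /var/lib/haproxy/stats | awk -F, '{print $1, $2, $18}'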
[root@support ~]# OPERATOR_POD=$(oc get po -n openshift-apiserver-operator -o name | grep -o "[^/]*$")
[root@support ~]# OPERAND_PODS_IP=$(oc get po -n openshift-apiserver -o wide --no-headers | awk '{print $6}')
[root@support ~]#
[root@support ~]# OPERAND_PODS_IP=$(echo $OPERAND_PODS_IP)
[root@support ~]#
[root@support ~]# oc rsh -n openshift-apiserver-operator $OPERATOR_POD bash -c "for i in $OPERAND_PODS_IP; "'do echo "curl $i:"; curl -k --connect-timeout 10 https://$i:8443/healthz; echo; done'
curl 10.130.0.46:
curl: (28) Connection timed out after 10000 milliseconds

curl 10.128.0.40:
ok

curl 10.129.0.55:
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
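To help separate a pod-network (VXLAN) problem from a host-level one, the same health endpoint can also be probed from a node's host network. A sketch only, with the node name and pod IP copied from the outputs above:

# Probe an openshift-apiserver pod IP from a master's host network;
# if this succeeds while the pod-to-pod curl above times out for pods
# on other nodes, cross-node overlay (VXLAN) traffic is the likely culprit.
oc debug node/master-01.cluster2.ocp4.example.internal -- \
    chroot /host curl -k --connect-timeout 10 https://10.129.0.55:8443/healthz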
# oc get network -o yaml | grep network
          f:networkType: {}
          f:networkType: {}
      manager: cluster-network-operator
    selfLink: /apis/config.openshift.io/v1/networks/cluster
    networkType: OpenShiftSDN
    networkType: OpenShiftSDN
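(The same information can be read without grep; a cleaner check via jsonpath against the cluster network config:)

# Show only the configured SDN plugin type
oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'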
oc get pod -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-26hkw              1/1     Running   0          17h
ovs-jll4b              1/1     Running   0          17h
ovs-pzjxm              1/1     Running   0          17h
sdn-24qx5              2/2     Running   0          17h   <------ log attached
sdn-2gz8l              2/2     Running   0          17h   <------ log attached
sdn-65wd2              2/2     Running   0          17h   <------ log attached
sdn-controller-jxrvr   1/1     Running   0          17h
sdn-controller-k6kdz   1/1     Running   0          17h
sdn-controller-p9jcs   1/1     Running   0          17h
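(For anyone reproducing: per-pod SDN logs like the attached ones can be collected along these lines; the container name "sdn" is the standard one in these pods:)

for p in sdn-24qx5 sdn-2gz8l sdn-65wd2; do
    oc logs -n openshift-sdn "$p" -c sdn > "$p.log"
done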
# oc get podnetworkconnectivitycheck network-check-source-master-02-to-openshift-apiserver-endpoint-master-01 -n openshift-network-diagnostics -o yaml
apiVersion: controlplane.operator.openshift.io/v1alpha1
kind: PodNetworkConnectivityCheck
metadata:
  creationTimestamp: "2021-03-04T17:53:16Z"
  generation: 2
  managedFields:
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:sourcePod: {}
        f:targetEndpoint: {}
        f:tlsClientCert:
          .: {}
          f:name: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-03-04T17:53:16Z"
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
        f:failures: {}
        f:outages: {}
    manager: cluster-network-check-endpoints
    operation: Update
    time: "2021-03-04T18:00:26Z"
  name: network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  namespace: openshift-network-diagnostics
  resourceVersion: "610891"
  selfLink: /apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-network-diagnostics/podnetworkconnectivitychecks/network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  uid: 3a841d54-28e2-4827-9eb3-c027605333c7
spec:
  sourcePod: network-check-source-7b56ddbc7b-8f5t6
  targetEndpoint: 10.129.0.55:8443
  tlsClientCert:
    name: ""
status:
  conditions:
  - lastTransitionTime: "2021-03-04T18:00:26Z"
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    status: "False"
    type: Reachable
  failures:
  - latency: 10.00057387s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:54:06Z"
  - latency: 10.00390748s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:53:06Z"
  - latency: 10.00098655s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:52:06Z"
  - latency: 10.000545401s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:51:06Z"
  - latency: 10.000161853s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:50:06Z"
  - latency: 10.000235184s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:49:06Z"
  - latency: 10.000348697s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:48:06Z"
  - latency: 10.000433986s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:47:06Z"
  - latency: 10.000154293s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:46:06Z"
  - latency: 10.000199761s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:45:06Z"
  outages:
  - end: null
    endLogs:
    - latency: 10.00057387s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:54:06Z"
    - latency: 10.00390748s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:53:06Z"
    - latency: 10.00098655s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:52:06Z"
    - latency: 10.000545401s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:51:06Z"
    - latency: 10.000161853s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:50:06Z"
    message: Connectivity outage detected at 2021-03-04T17:54:16.768114443Z
    start: "2021-03-04T17:54:16Z"
    startLogs:
    - latency: 10.003812744s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T05:54:06Z"
    - latency: 10.000270641s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.3:8443: dial tcp 10.129.0.3:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-04T17:54:16Z"
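Rather than reading each object, all connectivity checks can be summarized at once. A sketch, again assuming jq:

# List every source->target pair that is currently unreachable;
# failures only between specific node pairs point at the overlay network.
oc get podnetworkconnectivitycheck -n openshift-network-diagnostics -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[]?; .type == "Reachable" and .status == "False"))
      | .metadata.name'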
Created attachment 1760860 [details] sdn log
Sounds somewhat like bug 1931997, although this isn't my space, so that's mostly based on counts of similar words ;).
@welin This bug is marked with urgent severity and the "upgrades" keyword, so we need to know which platform you saw the bug on ASAP, so that we can decide whether it is a duplicate or a new upgrade blocker. Thanks.
I can see this exact same behaviour on 4.7.3 and 4.7.4 fresh installs (no update) on UPI VMware VMs when using OpenShiftSDN. "Same behaviour" means that I ran all of the commands shown in this bug ticket and got all the same results; wait-for install-complete never finishes.

I did a couple of tests with the exact same infrastructure, varying only the cluster version and the chosen SDN:

* Using OVN instead of OpenShiftSDN with 4.7.x results in a working cluster.
* Using OpenShiftSDN with 4.6.x results in a working cluster.

So in my case it seems to be just OpenShiftSDN + 4.7.x that does not work. Let me know if you are interested in any specific logs etc.
@gerd.oberlechner What is the VMware HW version on your UPI infrastructure? Please take a look at https://access.redhat.com/solutions/5896081. The KCS article talks about upgrades, but the issue at hand applies to fresh installs as well. Please let us know if the workaround helps. The issue will be addressed in OCP 4.7.5 and later.
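For reference, the node-level workaround in that KCS article amounts to disabling VXLAN (UDP tunnel) hardware offload on each node's uplink NIC. A sketch only; the interface name ens192 is an assumption, and the KCS article is the authoritative procedure:

# Run on each node (e.g. via: oc debug node/<node> -- chroot /host /bin/bash)
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
# Not persistent across reboots; the KCS article describes making it permanent.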
Thank you for the advice. The VMware HW version is 15 (ESXi 6.7 U3 - 16316930). I tried 4.7.5, 4.7.6 and also 4.7.7, but all of those still show the same behaviour. I followed the hint in the KCS article and disabled vxlan offload on all nodes, and the cluster instantly became responsive and answered all kinds of queries quickly.
(In reply to Gerd Oberlechner from comment #18)
> Thank you for the advice. The VMware HW version is 15 (ESXi 6.7 U3 -
> 16316930). I tried 4.7.5, 4.7.6 and also 4.7.7, but all of those still
> show the same behaviour. I followed the hint in the KCS article and
> disabled vxlan offload on all nodes, and the cluster instantly became
> responsive and answered all kinds of queries quickly.

Gerd, vxlan offload issues were not fully resolved until 4.7.11, fixed via https://bugzilla.redhat.com/show_bug.cgi?id=1956749. This bug started off not being specific to vSphere, so if you still suspect vxlan offload issues on vSphere, let's make sure we have a separate bug to root-cause that.
Every outstanding occurrence of this appears to be tied to vxlan offload problems with vSphere, so marking this as a dupe of 1956749.
*** This bug has been marked as a duplicate of bug 1956749 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days