Bug 1935591 - Multiple Cluster Operators failed when upgrading OCP 4.6.18 to OCP 4.7.0
Summary: Multiple Cluster Operators failed when upgrading OCP 4.6.18 to OCP 4.7.0
Keywords:
Status: CLOSED DUPLICATE of bug 1956749
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Yu Qi Zhang
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-05 08:41 UTC by kevin
Modified: 2024-10-01 17:37 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-20 15:49:00 UTC
Target Upstream Version:
Embargoed:
Flags: skrenger: needinfo-


Attachments:
must-gather information (1.54 KB, application/gzip), attached 2021-03-05 08:41 UTC by kevin
sdn log (358.47 KB, application/zip), attached 2021-03-05 11:01 UTC by kevin

Description kevin 2021-03-05 08:41:16 UTC
Created attachment 1760838 [details]
must-gather information

Description of problem:

Hello

I am upgrading a working OCP 4.6.18 cluster to OCP 4.7.0, but the upgrade has failed because multiple critical Cluster Operators are failing.

Version-Release number of selected component (if applicable):

Current OCP version: 4.6.18
Target upgrade version: 4.7.0

OCP Cluster setup method:

Bare-Metal UPI

OCP Cluster Networking Environment

Completely disconnected network environment; the cluster cannot access any Internet resources.
I have set up a disconnected registry server for this OCP cluster installation.
The environment is known to work well, because the OCP 4.6.18 installation completed successfully.

How reproducible:

Upgrade the current OCP 4.6.18 cluster to OCP 4.7.0.


Steps to Reproduce:

1. Upgrade the current OCP 4.6.18 cluster to OCP 4.7.0:

# oc adm upgrade --allow-explicit-upgrade \
  --allow-upgrade-with-warnings=true \
  --force=true \
  --to-image=ocp-1.registry.example.internal:5001/ocp4/openshift4:4.7.0

2. Gather the upgrade output:

(1) Status of All Nodes
==============================
# oc get nodes
==============================
NAME                                       STATUS   ROLES           AGE   VERSION
master-01.cluster2.ocp4.example.internal   Ready    master,worker   14h   v1.20.0+ba45583
master-02.cluster2.ocp4.example.internal   Ready    master,worker   14h   v1.20.0+ba45583
master-03.cluster2.ocp4.example.internal   Ready    master,worker   13h   v1.20.0+ba45583

(2) Status of All ClusterOperators
==============================
# oc get co
==============================
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0     False       False         True       35m
baremetal                                  4.7.0     True        False         False      14h
cloud-credential                           4.7.0     True        False         False      16h
cluster-autoscaler                         4.7.0     True        False         False      16h
config-operator                            4.7.0     True        False         False      16h
console                                    4.7.0     False       False         True       98m
csi-snapshot-controller                    4.7.0     True        False         False      126m
dns                                        4.7.0     True        False         False      15h
etcd                                       4.7.0     True        False         False      15h
image-registry                             4.7.0     True        False         True       15h
ingress                                    4.7.0     True        False         False      122m
insights                                   4.7.0     True        False         False      16h
kube-apiserver                             4.7.0     True        False         False      15h
kube-controller-manager                    4.7.0     True        False         False      16h
kube-scheduler                             4.7.0     True        False         False      16h
kube-storage-version-migrator              4.7.0     True        False         False      13h
machine-api                                4.7.0     True        False         False      16h
machine-approver                           4.7.0     True        False         False      16h
machine-config                             4.7.0     True        False         False      13h
marketplace                                4.7.0     True        False         False      128m
monitoring                                 4.7.0     False       True          True       13h
network                                    4.7.0     True        False         False      14h
node-tuning                                4.7.0     True        False         False      14h
openshift-apiserver                        4.7.0     False       False         False      14h
openshift-controller-manager               4.7.0     True        False         False      14h
openshift-samples                          4.7.0     False       False         False      128m
operator-lifecycle-manager                 4.7.0     True        False         False      16h
operator-lifecycle-manager-catalog         4.7.0     True        False         False      16h
operator-lifecycle-manager-packageserver   4.7.0     True        False         False      4s
service-ca                                 4.7.0     True        False         False      16h
storage                                    4.7.0     True        False         False      16h
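
A quick way to list only the operators that are not healthy, instead of scanning the whole table (a sketch; it assumes jq is available on the workstation running oc):

oc get clusteroperators -o json \
  | jq -r '.items[]
           | select(any(.status.conditions[]?;
                        (.type=="Available" and .status=="False") or
                        (.type=="Degraded"  and .status=="True")))
           | .metadata.name'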


Actual results:

The upgrade never completes; the ClusterVersion stays in Progressing=True:
==============================
# oc get clusterversion
==============================
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.18    True        True          15h     Working towards 4.7.0: 8 of 668 done (1% complete)

==============================
# oc get mcp
==============================
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0e3d4f0417b31f214ce4a4ad13e9abc1   True      False      False      3              3                   3                     0                      16h
worker   rendered-worker-330381a45417785c52a234b02b7a8e83   True      False      False      0              0                   0                     0                      16h

==============================
# oc describe co authentication openshift-apiserver
==============================
Name:         authentication
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-03-04T15:43:48Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-03-04T15:43:48Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         authentication-operator
    Operation:       Update
    Time:            2021-03-04T17:14:05Z
  Resource Version:  515482
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:               a601ca64-6d29-4f38-bfaa-b4f487c3462e
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-03-04T17:47:07Z
    Message:               OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.129.0.58:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
OAuthServerDeploymentDegraded: Unable to get "openshift-browser-client" bootstrapped OAuth client: the server is currently unable to handle the request (post oauthclients.oauth.openshift.io)
    Reason:                OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_GetFailed::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::Route_FailedCreate
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-03-05T08:03:09Z
    Message:               OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz" not successful yet
    Reason:                OAuthVersionRoute_WaitingForRoute
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-03-05T07:23:50Z
    Message:               APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz" failed: dial tcp: i/o timeout
OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.cluster2.ocp4.example.internal/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    Reason:                APIServices_Error::OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-03-04T15:48:06Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   authentications
    Group:      config.openshift.io
    Name:       cluster
    Resource:   authentications
    Group:      config.openshift.io
    Name:       cluster
    Resource:   infrastructures
    Group:      config.openshift.io
    Name:       cluster
    Resource:   oauths
    Group:      route.openshift.io
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   routes
    Group:      
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   services
    Group:      
    Name:       openshift-config
    Resource:   namespaces
    Group:      
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:      
    Name:       openshift-authentication
    Resource:   namespaces
    Group:      
    Name:       openshift-authentication-operator
    Resource:   namespaces
    Group:      
    Name:       openshift-ingress
    Resource:   namespaces
    Group:      
    Name:       openshift-oauth-apiserver
    Resource:   namespaces
  Versions:
    Name:     oauth-apiserver
    Version:  4.7.0
    Name:     operator
    Version:  4.7.0
    Name:     oauth-openshift
    Version:  4.7.0_openshift
Events:       <none>
Name:         openshift-apiserver
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-03-04T15:43:48Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-03-04T15:43:48Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         cluster-openshift-apiserver-operator
    Operation:       Update
    Time:            2021-03-04T17:08:16Z
  Resource Version:  515436
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  UID:               27ff2b81-937d-4789-9449-b87ab49c7573
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-03-05T05:57:35Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2021-03-04T17:23:16Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-03-04T17:57:40Z
    Message:               APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    Reason:                APIServices_Error
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-03-04T15:48:32Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   openshiftapiservers
    Group:      
    Name:       openshift-config
    Resource:   namespaces
    Group:      
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:      
    Name:       openshift-apiserver-operator
    Resource:   namespaces
    Group:      
    Name:       openshift-apiserver
    Resource:   namespaces
    Group:      
    Name:       openshift-etcd-operator
    Resource:   namespaces
    Group:      
    Name:       host-etcd-2
    Namespace:  openshift-etcd
    Resource:   endpoints
    Group:      controlplane.operator.openshift.io
    Name:       
    Namespace:  openshift-apiserver
    Resource:   podnetworkconnectivitychecks
    Group:      apiregistration.k8s.io
    Name:       v1.apps.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.authorization.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.build.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.image.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.project.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.quota.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.route.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.security.openshift.io
    Resource:   apiservices
    Group:      apiregistration.k8s.io
    Name:       v1.template.openshift.io
    Resource:   apiservices
  Versions:
    Name:     operator
    Version:  4.7.0
    Name:     openshift-apiserver
    Version:  4.7.0
Events:       <none>

==============================
# oc adm must-gather --image=ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest
==============================
[must-gather      ] OUT Using must-gather plugin-in image: ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-f5zbr created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jb6xk created
[must-gather      ] OUT pod for plug-in image ocp-2.registry.example.internal:5010/openshift/origin-must-gather:latest created
[must-gather-96kqh] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-
[must-gather-96kqh] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-
[must-gather-96kqh] POD Gathering data for ns/openshift-cluster-version...
[must-gather-96kqh] POD Gathering data for ns/default...
[must-gather-96kqh] POD Gathering data for ns/openshift...
[must-gather-96kqh] POD Gathering data for ns/kube-system...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd...
[must-gather-96kqh] POD Gathering data for ns/openshift-kni-infra...
[must-gather-96kqh] POD Wrote inspect data to must-gather.
[must-gather-96kqh] POD error: errors ocurred while gathering data:
[must-gather-96kqh] POD     [the server doesn't have a resource type "deploymentconfigs", the server doesn't have a resource type "imagestreams"]
[must-gather-96kqh] POD Gathering data for ns/openshift-config...
[must-gather-96kqh] POD Gathering data for ns/openshift-config-managed...
[must-gather-96kqh] POD Gathering data for ns/openshift-authentication...
[must-gather-96kqh] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress...
[must-gather-96kqh] POD Gathering data for ns/openshift-oauth-apiserver...
[must-gather-96kqh] POD Gathering data for ns/openshift-machine-api...
[must-gather-96kqh] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-config-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-console-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-console...
[must-gather-96kqh] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-dns-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-dns...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-etcd...
[must-gather-96kqh] POD Gathering data for ns/openshift-image-registry...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-ingress-canary...
[must-gather-96kqh] POD Gathering data for ns/openshift-insights...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-apiserver-operator...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-apiserver...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-controller-manager...
[must-gather-96kqh] POD Gathering data for ns/openshift-kube-controller-manager-operator...
[must-gather-96kqh] OUT gather logs unavailable: unexpected EOF
[must-gather-96kqh] OUT waiting for gather to complete
[must-gather-96kqh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jb6xk deleted
[must-gather      ] OUT namespace/openshift-must-gather-f5zbr deleted
error: gather never finished for pod must-gather-96kqh: timed out waiting for the condition
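
When must-gather itself cannot finish (the aggregated OpenShift APIs are unavailable here), individual namespaces can still be collected directly with oc adm inspect; a minimal sketch, where the namespace list and destination directory are only examples:

oc adm inspect ns/openshift-apiserver ns/openshift-oauth-apiserver ns/openshift-sdn \
  --dest-dir=./inspect-data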


Expected results:

OCP 4.6.18 should upgrade to OCP 4.7.0 successfully.

Additional info:

I have attached the must-gather information from my cluster to this bug.

Comment 1 kevin 2021-03-05 08:46:04 UTC
I use HAProxy as the OCP load balancer; the configuration is as follows:

[root@lb-01 ~]# cat /etc/haproxy/haproxy.cfg


# Global settings
#---------------------------------------------------------------------
global
    maxconn     20000
    log         /dev/log local0 info
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
#    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          300s
    timeout server          300s
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 20000

listen stats
    bind 192.168.30.29:9000
    mode http
    stats enable
    stats uri /

frontend cluster2-ocp4-api-server-frontend
    bind 192.168.30.32:6443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-api-server-backend

frontend cluster2-ocp4-machine-config-server-frontend
    bind 192.168.30.32:22623
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-machine-config-server-backend

frontend cluster2-ocp4-ingress-http-frontend
    bind 192.168.30.32:80
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-http-backend

frontend cluster2-ocp4-ingress-https-frontend
    bind 192.168.30.32:443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-https-backend

backend cluster2-ocp4-api-server-backend
    option  httpchk GET /readyz HTTP/1.0
    option  log-health-checks
    balance roundrobin
    mode tcp
    server     bootstrap bootstrap.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server     master-01 master-01.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server     master-02 master-02.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server     master-03 master-03.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3

backend cluster2-ocp4-machine-config-server-backend
    balance roundrobin
    mode tcp
    server     bootstrap bootstrap.ocp4.example.internal:22623 check
    server     master-01 master-01.cluster2.ocp4.example.internal:22623 check
    server     master-02 master-02.cluster2.ocp4.example.internal:22623 check
    server     master-03 master-03.cluster2.ocp4.example.internal:22623 check

backend cluster2-ocp4-ingress-http-backend
    balance source
    mode tcp
    server     master-01 master-01.cluster2.ocp4.example.internal:80 check
    server     master-02 master-02.cluster2.ocp4.example.internal:80 check
    server     master-03 master-03.cluster2.ocp4.example.internal:80 check

backend cluster2-ocp4-ingress-https-backend
    balance source
    mode tcp
    server     master-01 master-01.cluster2.ocp4.example.internal:443 check
    server     master-02 master-02.cluster2.ocp4.example.internal:443 check
    server     master-03 master-03.cluster2.ocp4.example.internal:443 check

Comment 2 kevin 2021-03-05 09:37:22 UTC
[root@support ~]# OPERATOR_POD=$(oc get po -n openshift-apiserver-operator -o name | grep -o "[^/]*$")
[root@support ~]# OPERAND_PODS_IP=$(oc get po -n openshift-apiserver -o wide --no-headers | awk '{print $6}')
[root@support ~]# 
[root@support ~]# OPERAND_PODS_IP=$(echo $OPERAND_PODS_IP)
[root@support ~]# 
[root@support ~]# oc rsh -n openshift-apiserver-operator $OPERATOR_POD bash -c "for i in $OPERAND_PODS_IP; "'do echo "curl $i:"; curl -k --connect-timeout 10 https://$i:8443/healthz; echo; done'
curl 10.130.0.46:
curl: (28) Connection timed out after 10000 milliseconds
curl 10.128.0.40:
ok
curl 10.129.0.55:
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
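
The same check in a more readable form (a sketch equivalent to the one-liner above; like that one-liner, it relies on curl being present in the operator image):

OPERATOR_POD=$(oc get pod -n openshift-apiserver-operator -o name | head -n1)
for ip in $(oc get pod -n openshift-apiserver -o wide --no-headers | awk '{print $6}'); do
  echo "curl ${ip}:"
  oc rsh -n openshift-apiserver-operator "${OPERATOR_POD}" \
    curl -k --connect-timeout 10 "https://${ip}:8443/healthz"
  echo
done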

Comment 3 kevin 2021-03-05 09:39:31 UTC
# oc get network -o yaml | grep network
          f:networkType: {}
          f:networkType: {}
      manager: cluster-network-operator
    selfLink: /apis/config.openshift.io/v1/networks/cluster
    networkType: OpenShiftSDN
    networkType: OpenShiftSDN
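
A more direct way to confirm the configured SDN type (a sketch):

oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'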

Comment 4 kevin 2021-03-05 10:47:04 UTC
oc get pod -n openshift-sdn 
NAME                   READY   STATUS    RESTARTS   AGE
ovs-26hkw              1/1     Running   0          17h
ovs-jll4b              1/1     Running   0          17h
ovs-pzjxm              1/1     Running   0          17h
sdn-24qx5              2/2     Running   0          17h <------ log included in the attachment
sdn-2gz8l              2/2     Running   0          17h <------ log included in the attachment
sdn-65wd2              2/2     Running   0          17h <------ log included in the attachment
sdn-controller-jxrvr   1/1     Running   0          17h
sdn-controller-k6kdz   1/1     Running   0          17h
sdn-controller-p9jcs   1/1     Running   0          17h
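
The attached sdn container logs can be captured per pod along these lines (a sketch; it assumes the relevant container in these pods is named sdn, with kube-rbac-proxy as the sidecar):

for pod in sdn-24qx5 sdn-2gz8l sdn-65wd2; do
  oc logs -n openshift-sdn "${pod}" -c sdn > "${pod}.log"
done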

Comment 5 kevin 2021-03-05 10:59:21 UTC
# oc get podnetworkconnectivitycheck network-check-source-master-02-to-openshift-apiserver-endpoint-master-01 -n openshift-network-diagnostics -o yaml
apiVersion: controlplane.operator.openshift.io/v1alpha1
kind: PodNetworkConnectivityCheck
metadata:
  creationTimestamp: "2021-03-04T17:53:16Z"
  generation: 2
  managedFields:
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:sourcePod: {}
        f:targetEndpoint: {}
        f:tlsClientCert:
          .: {}
          f:name: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-03-04T17:53:16Z"
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
        f:failures: {}
        f:outages: {}
    manager: cluster-network-check-endpoints
    operation: Update
    time: "2021-03-04T18:00:26Z"
  name: network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  namespace: openshift-network-diagnostics
  resourceVersion: "610891"
  selfLink: /apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-network-diagnostics/podnetworkconnectivitychecks/network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  uid: 3a841d54-28e2-4827-9eb3-c027605333c7
spec:
  sourcePod: network-check-source-7b56ddbc7b-8f5t6
  targetEndpoint: 10.129.0.55:8443
  tlsClientCert:
    name: ""
status:
  conditions:
  - lastTransitionTime: "2021-03-04T18:00:26Z"
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    status: "False"
    type: Reachable
  failures:
  - latency: 10.00057387s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:54:06Z"
  - latency: 10.00390748s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:53:06Z"
  - latency: 10.00098655s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:52:06Z"
  - latency: 10.000545401s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:51:06Z"
  - latency: 10.000161853s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:50:06Z"
  - latency: 10.000235184s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:49:06Z"
  - latency: 10.000348697s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:48:06Z"
  - latency: 10.000433986s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:47:06Z"
  - latency: 10.000154293s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:46:06Z"
  - latency: 10.000199761s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:45:06Z"
  outages:
  - end: null
    endLogs:
    - latency: 10.00057387s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:54:06Z"
    - latency: 10.00390748s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:53:06Z"
    - latency: 10.00098655s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:52:06Z"
    - latency: 10.000545401s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:51:06Z"
    - latency: 10.000161853s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:50:06Z"
    message: Connectivity outage detected at 2021-03-04T17:54:16.768114443Z
    start: "2021-03-04T17:54:16Z"
    startLogs:
    - latency: 10.003812744s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T05:54:06Z"
    - latency: 10.000270641s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.3:8443: dial tcp 10.129.0.3:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-04T17:54:16Z"

Comment 6 kevin 2021-03-05 11:01:34 UTC
Created attachment 1760860 [details]
sdn log

Comment 7 W. Trevor King 2021-03-05 23:01:35 UTC
Sounds somewhat like bug 1931997, although this isn't my space, so that's mostly based on counts of similar words ;).

Comment 12 Lalatendu Mohanty 2021-03-30 19:57:45 UTC
@welin this bug is marked with urgent severity and with the "upgrades" keyword, so we would like to know which platform you saw the bug on as soon as possible, so that we can decide whether it is a duplicate issue or a new upgrade blocker. Thanks.

Comment 13 Gerd Oberlechner 2021-04-04 20:57:30 UTC
I can see this exact same behaviour on 4.7.3 and 4.7.4 fresh installs (no update) on UPI VMware VMs when using OpenShiftSDN. "Same behaviour" means that I ran all the commands listed in this bug ticket and got the same results. wait-for install-complete never finishes.

I did a couple of tests with the exact same infrastructure but varied the cluster version and the chosen SDN:
* Using OVN instead of OpenShiftSDN with 4.7.x results in a working cluster.
* Using OpenShiftSDN with 4.6.x results in a working cluster.

So it seems that in my case it is just OpenShiftSDN + 4.7.x that is not working.

Let me know if you are interested in any specific logs, etc.

Comment 17 Aniket Bhat 2021-04-16 14:41:36 UTC
@gerd.oberlechner what is the VMware HW version on your UPI infrastructure?

Please take a look at: https://access.redhat.com/solutions/5896081. The KCS article talks about upgrades, but the issue at hand is applicable to fresh installs as well. Please let us know if the workaround helps.

The issue will be addressed in OCP 4.7.5 and later.

Comment 18 Gerd Oberlechner 2021-04-19 15:28:31 UTC
Thank you for the advice.
The VMware HW version is 15 (ESX 6.7 U3 - 16316930).
I tried 4.7.5, 4.7.6, and also 4.7.7, but all of those still show the same behaviour.
I followed the hint in the KCS article and disabled vxlan offload on all nodes; the cluster became responsive instantly and answered all kinds of queries quickly.
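
For reference, vxlan (UDP tunnel) offload can be disabled on a node for testing along these lines; this is a sketch only: the interface name ens192 is an assumption, the change does not persist across reboots, and the KCS article above remains the authoritative procedure:

oc debug node/master-01.cluster2.ocp4.example.internal -- chroot /host \
  ethtool -K ens192 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off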

Comment 27 Scott Dodson 2021-08-20 15:39:19 UTC
(In reply to Gerd Oberlechner from comment #18)
> Thank you for the advice.
> The VMware HW version is 15 (ESX 6.7 U3 - 16316930).
> I tried 4.7.5, 4.7.6, and also 4.7.7, but all of those still show the same
> behaviour.
> I followed the hint in the KCS article and disabled vxlan offload on all
> nodes; the cluster became responsive instantly and answered all kinds of
> queries quickly.

Gerd,

vxlan offload issues were not fully resolved until 4.7.11, fixed via https://bugzilla.redhat.com/show_bug.cgi?id=1956749. This bug started off not being specific to vSphere, so if you still suspect vxlan offload issues on vSphere, let's make sure we have a separate bug to root-cause that.

Comment 28 Scott Dodson 2021-08-20 15:48:43 UTC
Every outstanding occurrence of this appears to be tied to vxlan offload problems with vSphere, so I'm marking this as a duplicate of bug 1956749.

Comment 29 Scott Dodson 2021-08-20 15:49:00 UTC

*** This bug has been marked as a duplicate of bug 1956749 ***

Comment 31 Red Hat Bugzilla 2023-09-15 01:02:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

