Bug 1935591

| Field | Value |
|---|---|
| Summary | Multiple Cluster Operators failed when OCP 4.6.18 upgrade to OCP 4.7.0 ! |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Version | 4.7 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | urgent |
| Keywords | Upgrades |
| Reporter | kevin <welin> |
| Assignee | Yu Qi Zhang <jerzhang> |
| QA Contact | Jian Zhang <jiazha> |
| CC | alexisph, anbhat, aos-bugs, bretm, chdeshpa, gerd.oberlechner, jokerman, llopezmo, lmohanty, sdodson, shishika, skrenger, wking, yhe, zzhao |
| Flags | skrenger: needinfo- |
| Type | Bug |
| Last Closed | 2021-08-20 15:49:00 UTC |
| Attachments | sdn log (attachment 1760860) |
Description (kevin, 2021-03-05 08:41:16 UTC)
I use HAProxy as the OCP load balancer, configured as follows:

```
[root@lb-01 ~]# cat /etc/haproxy/haproxy.cfg
# Global settings
#---------------------------------------------------------------------
global
    maxconn     20000
    log         /dev/log local0 info
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    # option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          300s
    timeout server          300s
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 20000

listen stats
    bind 192.168.30.29:9000
    mode http
    stats enable
    stats uri /

frontend cluster2-ocp4-api-server-frontend
    bind 192.168.30.32:6443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-api-server-backend

frontend cluster2-ocp4-machine-config-server-frontend
    bind 192.168.30.32:22623
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-machine-config-server-backend

frontend cluster2-ocp4-ingress-http-frontend
    bind 192.168.30.32:80
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-http-backend

frontend cluster2-ocp4-ingress-https-frontend
    bind 192.168.30.32:443
    mode tcp
    option tcplog
    default_backend cluster2-ocp4-ingress-https-backend

backend cluster2-ocp4-api-server-backend
    option httpchk GET /readyz HTTP/1.0
    option log-health-checks
    balance roundrobin
    mode tcp
    server bootstrap bootstrap.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-01 master-01.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-02 master-02.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
    server master-03 master-03.cluster2.ocp4.example.internal:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3

backend cluster2-ocp4-machine-config-server-backend
    balance roundrobin
    mode tcp
    server bootstrap bootstrap.ocp4.example.internal:22623 check
    server master-01 master-01.cluster2.ocp4.example.internal:22623 check
    server master-02 master-02.cluster2.ocp4.example.internal:22623 check
    server master-03 master-03.cluster2.ocp4.example.internal:22623 check

backend cluster2-ocp4-ingress-http-backend
    balance source
    mode tcp
    server master-01 master-01.cluster2.ocp4.example.internal:80 check
    server master-02 master-02.cluster2.ocp4.example.internal:80 check
    server master-03 master-03.cluster2.ocp4.example.internal:80 check

backend cluster2-ocp4-ingress-https-backend
    balance source
    mode tcp
    server master-01 master-01.cluster2.ocp4.example.internal:443 check
    server master-02 master-02.cluster2.ocp4.example.internal:443 check
    server master-03 master-03.cluster2.ocp4.example.internal:443 check
```

Health checks from the openshift-apiserver operator pod against the operand pods show that only one pod answers; the other two time out:

```
[root@support ~]# OPERATOR_POD=$(oc get po -n openshift-apiserver-operator -o name | grep -o "[^/]*$")
[root@support ~]# OPERAND_PODS_IP=$(oc get po -n openshift-apiserver -o wide --no-headers | awk '{print $6}')
[root@support ~]# OPERAND_PODS_IP=$(echo $OPERAND_PODS_IP)
[root@support ~]# oc rsh -n openshift-apiserver-operator $OPERATOR_POD bash -c "for i in $OPERAND_PODS_IP; "'do echo "curl $i:"; curl -k --connect-timeout 10 https://$i:8443/healthz; echo; done'
curl 10.130.0.46:
curl: (28) Connection timed out after 10000 milliseconds

curl 10.128.0.40:
ok

curl 10.129.0.55:
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
```

The cluster uses OpenShiftSDN:

```
# oc get network -o yaml | grep network
f:networkType: {}
f:networkType: {}
manager: cluster-network-operator
selfLink: /apis/config.openshift.io/v1/networks/cluster
networkType: OpenShiftSDN
networkType: OpenShiftSDN
```

```
# oc get pod -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-26hkw              1/1     Running   0          17h
ovs-jll4b              1/1     Running   0          17h
ovs-pzjxm              1/1     Running   0          17h
sdn-24qx5              2/2     Running   0          17h   <------ log in attachment
sdn-2gz8l              2/2     Running   0          17h   <------ log in attachment
sdn-65wd2              2/2     Running   0          17h   <------ log in attachment
sdn-controller-jxrvr   1/1     Running   0          17h
sdn-controller-k6kdz   1/1     Running   0          17h
sdn-controller-p9jcs   1/1     Running   0          17h
```

The network diagnostics checks report TCP connect timeouts to the unreachable openshift-apiserver pod:

```
# oc get podnetworkconnectivitycheck network-check-source-master-02-to-openshift-apiserver-endpoint-master-01 -n openshift-network-diagnostics -o yaml
apiVersion: controlplane.operator.openshift.io/v1alpha1
kind: PodNetworkConnectivityCheck
metadata:
  creationTimestamp: "2021-03-04T17:53:16Z"
  generation: 2
  managedFields:
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:sourcePod: {}
        f:targetEndpoint: {}
        f:tlsClientCert:
          .: {}
          f:name: {}
    manager: cluster-network-operator
    operation: Update
    time: "2021-03-04T17:53:16Z"
  - apiVersion: controlplane.operator.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
        f:failures: {}
        f:outages: {}
    manager: cluster-network-check-endpoints
    operation: Update
    time: "2021-03-04T18:00:26Z"
  name: network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  namespace: openshift-network-diagnostics
  resourceVersion: "610891"
  selfLink: /apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-network-diagnostics/podnetworkconnectivitychecks/network-check-source-master-02-to-openshift-apiserver-endpoint-master-01
  uid: 3a841d54-28e2-4827-9eb3-c027605333c7
spec:
  sourcePod: network-check-source-7b56ddbc7b-8f5t6
  targetEndpoint: 10.129.0.55:8443
  tlsClientCert:
    name: ""
status:
  conditions:
  - lastTransitionTime: "2021-03-04T18:00:26Z"
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    status: "False"
    type: Reachable
  failures:
  - latency: 10.00057387s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:54:06Z"
  - latency: 10.00390748s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:53:06Z"
  - latency: 10.00098655s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:52:06Z"
  - latency: 10.000545401s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:51:06Z"
  - latency: 10.000161853s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:50:06Z"
  - latency: 10.000235184s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:49:06Z"
  - latency: 10.000348697s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:48:06Z"
  - latency: 10.000433986s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:47:06Z"
  - latency: 10.000154293s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:46:06Z"
  - latency: 10.000199761s
    message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2021-03-05T10:45:06Z"
  outages:
  - end: null
    endLogs:
    - latency: 10.00057387s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:54:06Z"
    - latency: 10.00390748s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:53:06Z"
    - latency: 10.00098655s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:52:06Z"
    - latency: 10.000545401s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:51:06Z"
    - latency: 10.000161853s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T10:50:06Z"
    message: Connectivity outage detected at 2021-03-04T17:54:16.768114443Z
    start: "2021-03-04T17:54:16Z"
    startLogs:
    - latency: 10.003812744s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.55:8443: dial tcp 10.129.0.55:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-05T05:54:06Z"
    - latency: 10.000270641s
      message: 'openshift-apiserver-endpoint-master-01: failed to establish a TCP connection to 10.129.0.3:8443: dial tcp 10.129.0.3:8443: i/o timeout'
      reason: TCPConnectError
      success: false
      time: "2021-03-04T17:54:16Z"
```

Created attachment 1760860 [details]
sdn log
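
For triage of similar clusters, the connectivity checks shown above can be scanned in bulk rather than one object at a time. A minimal sketch, assuming cluster-admin access with `oc` and that `jq` is installed on the workstation (neither is part of the original report); it relies only on the `Reachable` condition and `targetEndpoint` fields visible in the YAML above:

```bash
# List every PodNetworkConnectivityCheck whose Reachable condition is False,
# i.e. endpoints the network diagnostics pods could not reach, with the target endpoint.
oc get podnetworkconnectivitycheck -n openshift-network-diagnostics -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[]?; .type == "Reachable" and .status == "False"))
      | "\(.metadata.name)\t\(.spec.targetEndpoint)"'
```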
Sounds somewhat like bug 1931997, although this isn't my space, so that's mostly based on counts of similar words ;).

@welin, this bug is marked with urgent severity and the "Upgrades" keyword, so we need to know as soon as possible on which platform you saw it, so that we can decide whether it is a duplicate or a new upgrade blocker. Thanks.

I can see this exact same behaviour on 4.7.3 and 4.7.4 fresh installs (no update) on UPI VMware VMs when using OpenShiftSDN. "Same behaviour" means that I ran all the commands listed in this bug ticket and got the same results, and `wait-for install-complete` never finishes. I did a couple of tests on the exact same infrastructure, varying the cluster version and the chosen SDN:

* using OVN instead of OpenShiftSDN with 4.7.x results in a working cluster
* using OpenShiftSDN with 4.6.x results in a working cluster

So in my case it seems to be just OpenShiftSDN + 4.7.x that is not working. Let me know if you are interested in any specific logs.

@gerd.oberlechner, what is the VMware HW version on your UPI infrastructure? Please take a look at https://access.redhat.com/solutions/5896081. The KCS article talks about upgrades, but the issue at hand is applicable to fresh installs as well. Please let us know if the workaround helps (a sketch of what that offload change typically looks like is included at the end of this report). The issue will be addressed in OCP 4.7.5 and later.

Thank you for the advice. The VMware HW version is 15 (ESX 6.7 U3 - 16316930). I tried 4.7.5, 4.7.6 and also 4.7.7, but all of those still show the same behaviour. I followed the hint in the KCS article and disabled vxlan offload on all nodes, and the cluster instantly became responsive and answered all kinds of queries quickly.

(In reply to Gerd Oberlechner from comment #18)
> I followed the hint in the KCS article and disabled vxlan offload on all
> nodes, and the cluster instantly became responsive and answered all kinds
> of queries quickly.

Gerd, vxlan offload issues were not fully resolved until 4.7.11, fixed via https://bugzilla.redhat.com/show_bug.cgi?id=1956749. This bug started off not being specific to vSphere, so if you still suspect vxlan offload issues on vSphere, let's make sure we have a separate bug to root-cause that.

Every outstanding occurrence of this appears to be tied to vxlan offload problems with vSphere, so marking this as a duplicate of bug 1956749.

*** This bug has been marked as a duplicate of bug 1956749 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
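
For reference, the workaround mentioned above ("disabled vxlan offload on all nodes") is typically applied per node with ethtool. A minimal sketch, assuming the VMXNET3 interface is named ens192 and that the NIC exposes the UDP tunnel offload features named below (both the interface name and feature names are assumptions; check the `ethtool -k` output on the actual nodes, and note the change does not survive a reboot unless it is applied persistently, e.g. via a MachineConfig):

```bash
# Inspect the current UDP tunnel (VXLAN) offload state on the node interface.
ethtool -k ens192 | grep tx-udp_tnl

# Disable VXLAN hardware offload on that interface (interface name is an assumption).
ethtool -K ens192 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off
```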