Created attachment 1710313 [details]
tcp dump for TLS handshake error

Description of problem:

We are installing IPI OCP 4.4.14/4.6.0 (the 4.6 version for "user-defined routing instead of load balancer") as a private cluster on Azure. Just after installation the platform seems stable, but after some time (days) the openshift-apiserver becomes unavailable. The problems are the same on both versions.

➤ oc get machines -A
NAMESPACE               NAME                                     PHASE     TYPE              REGION       ZONE   AGE
openshift-machine-api   oaz-dev-tnhr6-master-0                   Running   Standard_D4s_v3   westeurope   2      10d
openshift-machine-api   oaz-dev-tnhr6-master-1                   Running   Standard_D4s_v3   westeurope   1      10d
openshift-machine-api   oaz-dev-tnhr6-master-2                   Running   Standard_D4s_v3   westeurope   3      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope1-n8jk7   Running   Standard_D4s_v3   westeurope   1      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope2-2vzp4   Running   Standard_D4s_v3   westeurope   2      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope3-gblt6   Running   Standard_D4s_v3   westeurope   3      10d

➤ oc get co | sed -n '1p;/openshift-apiserver/p'
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.6.0-0.nightly-2020-07-24-111750   False       False         False      34m

➤ oc get events -n openshift-apiserver-operator
LAST SEEN   TYPE      REASON                    OBJECT                                    MESSAGE
50s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "image.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
4s          Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "project.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
30s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "quota.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
19s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
35s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "security.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
2m7s        Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "template.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
24s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "route.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
14s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "authorization.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
9s          Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "build.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
86s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "oauth.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
40m         Normal    OperatorStatusChanged     deployment/openshift-apiserver-operator   Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: \"build.openshift.io.v1\" is not

# api-resources provided by openshift-apiserver are unavailable
➤ oc api-resources 1>/dev/null
error: unable to retrieve the complete list of server APIs: authorization.openshift.io/v1: the server is currently unable to handle the request, oauth.openshift.io/v1: the server is currently unable to handle the request, packages.operators.coreos.com/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, security.openshift.io/v1: the server is currently unable to handle the request

➤ oc logs -n openshift-apiserver deploy/apiserver -c openshift-apiserver
# this log is full of messages like the following; I am not sure whether they are the cause or the result
I0804 10:46:44.984624       1 log.go:181] http: TLS handshake error from 10.128.0.1:53288: read tcp 10.130.0.8:8443->10.128.0.1:53288: read: connection timed out
I0804 10:46:44.984795       1 log.go:181] http: TLS handshake error from 10.128.0.1:53212: read tcp 10.130.0.8:8443->10.128.0.1:53212: read: connection timed out
I0804 10:46:44.984837       1 log.go:181] http: TLS handshake error from 10.128.0.1:48688: read tcp 10.130.0.8:8443->10.128.0.1:48688: read: connection timed out
I0804 10:46:44.984865       1 log.go:181] http: TLS handshake error from 10.128.0.1:48840: read tcp 10.130.0.8:8443->10.128.0.1:48840: read: connection timed out
I0804 10:46:44.984891       1 log.go:181] http: TLS handshake error from 10.128.0.1:48770: read tcp 10.130.0.8:8443->10.128.0.1:48770: read: connection timed out
I0804 10:46:44.984919       1 log.go:181] http: TLS handshake error from 10.128.0.1:48626: read tcp 10.130.0.8:8443->10.128.0.1:48626: read: connection timed out
I0804 10:46:44.984945       1 log.go:181] http: TLS handshake error from 10.128.0.1:53370: read tcp 10.130.0.8:8443->10.128.0.1:53370: read: connection timed out
I0804 10:46:46.041491       1 log.go:181] http: TLS handshake error from 10.128.0.1:43762: EOF
I0804 10:46:46.041594       1 log.go:181] http: TLS handshake error from 10.128.0.1:43740: EOF
I0804 10:46:46.041642       1 log.go:181] http: TLS handshake error from 10.128.0.1:43750: EOF

# with a closer look I have discovered the pod behind these error logs: openshift-kube-apiserver/kube-apiserver-oaz-dev-tnhr6-master-0/kube-apiserver-check-endpoints

When I force the nodes to restart, the problem disappears for something like 18-20 hours. Another strange thing is that not all pods of the openshift-apiserver deployment are affected (we have 3 master nodes, and 1 or 2 are affected at once).

Version-Release number of selected component (if applicable):
4.4.14
4.6.0-0.nightly-2020-07-24-111750

How reproducible:
Every time.

Steps to Reproduce:
1. IPI install as a private cluster into the Azure cloud and wait.
2.
3.

Actual results:
Generally not working.

Expected results:
Working.

Additional info:
The "log spam" caused by check-endpoints was resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1855284. I would suggest trying a newer build. There is not enough info here to debug the OpenShiftAPICheckFailed events; please provide must-gather output if this is still reproducible.
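For reference, collecting the must-gather and checking which aggregated APIs are currently unavailable would look roughly like this (the --dest-dir path is only an example, adjust as needed):

$ oc adm must-gather --dest-dir=./must-gather
# list APIService objects that are not reporting Available=True
$ oc get apiservices | grep -v True

Then attach the resulting must-gather archive to this bug.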
Solved in version 4.6.0-0.nightly-2020-08-18-070534 (maybe earlier, but this version is working). I no longer have access to 4.6.0-0.nightly-2020-07-24-111750, so I cannot provide a must-gather.
Hi,
I have the same problem on OpenShift 4.5.20, IPI installation as a private cluster on Azure. For the moment I have forced my three master nodes to restart, but that is only a workaround. Any suggestions?

Thanks in advance
Best regards
Vincenzo Marzario
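For reference, a graceful per-node restart along the lines of the workaround above would look roughly like this (the node name is a placeholder, and the drain flags may differ slightly between 4.x releases):

$ oc adm cordon <master-node>
$ oc adm drain <master-node> --ignore-daemonsets --delete-local-data --force
$ oc debug node/<master-node> -- chroot /host systemctl reboot
# wait for the node to report Ready again, then
$ oc adm uncordon <master-node>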
Try changing the NICs of the master nodes to accelerated networking and adding read caching on the OS disks, both on the master and the worker nodes. Or, perhaps a better solution, upgrade to 4.6.
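In case it is useful, with the Azure CLI those changes would look roughly like the following (resource group, VM and NIC names are placeholders; the VM normally has to be deallocated before accelerated networking can be enabled on its NIC):

# stop/deallocate the VM first
$ az vm deallocate --resource-group <rg> --name <master-vm>
# enable accelerated networking on the node's NIC
$ az network nic update --resource-group <rg> --name <master-nic> --accelerated-networking true
# enable read caching on the OS disk
$ az vm update --resource-group <rg> --name <master-vm> --set storageProfile.osDisk.caching=ReadOnly
$ az vm start --resource-group <rg> --name <master-vm>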
Hi,
I can't upgrade to 4.6 because of constraints with other technologies. I'll try your tip.

Thanks
Vincenzo Marzario