Created attachment 1710313 [details]
tcp dump for TLS handshake error

Description of problem:

We are installing IPI OCP 4.4.14/4.6.0 (the 4.6 version for "user-defined routing instead of load balancer") as a private cluster on Azure. Just after installation the platform seems stable, but after some time (days) the openshift-apiserver becomes unavailable. The problems are the same on both versions.

➤ oc get machines -A
NAMESPACE               NAME                                     PHASE     TYPE              REGION       ZONE   AGE
openshift-machine-api   oaz-dev-tnhr6-master-0                   Running   Standard_D4s_v3   westeurope   2      10d
openshift-machine-api   oaz-dev-tnhr6-master-1                   Running   Standard_D4s_v3   westeurope   1      10d
openshift-machine-api   oaz-dev-tnhr6-master-2                   Running   Standard_D4s_v3   westeurope   3      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope1-n8jk7   Running   Standard_D4s_v3   westeurope   1      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope2-2vzp4   Running   Standard_D4s_v3   westeurope   2      10d
openshift-machine-api   oaz-dev-tnhr6-worker-westeurope3-gblt6   Running   Standard_D4s_v3   westeurope   3      10d

➤ oc get co | sed -n '1p;/openshift-apiserver/p'
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.6.0-0.nightly-2020-07-24-111750   False       False         False      34m

➤ oc get events -n openshift-apiserver-operator
LAST SEEN   TYPE      REASON                    OBJECT                                    MESSAGE
50s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "image.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
4s          Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "project.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
30s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "quota.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
19s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
35s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "security.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
2m7s        Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "template.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
24s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "route.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
14s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "authorization.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
9s          Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "build.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
86s         Warning   OpenShiftAPICheckFailed   deployment/openshift-apiserver-operator   "oauth.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
40m         Normal    OperatorStatusChanged     deployment/openshift-apiserver-operator   Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: \"build.openshift.io.v1\" is not

# api-resources provided by openshift-apiserver are unavailable
➤ oc api-resources 1>/dev/null
error: unable to retrieve the complete list of server APIs: authorization.openshift.io/v1: the server is currently unable to handle the request, oauth.openshift.io/v1: the server is currently unable to handle the request, packages.operators.coreos.com/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, security.openshift.io/v1: the server is currently unable to handle the request

➤ oc logs -n openshift-apiserver deploy/apiserver -c openshift-apiserver
# this log is full of messages like the following; I am not sure whether they are the cause or the result
I0804 10:46:44.984624       1 log.go:181] http: TLS handshake error from 10.128.0.1:53288: read tcp 10.130.0.8:8443->10.128.0.1:53288: read: connection timed out
I0804 10:46:44.984795       1 log.go:181] http: TLS handshake error from 10.128.0.1:53212: read tcp 10.130.0.8:8443->10.128.0.1:53212: read: connection timed out
I0804 10:46:44.984837       1 log.go:181] http: TLS handshake error from 10.128.0.1:48688: read tcp 10.130.0.8:8443->10.128.0.1:48688: read: connection timed out
I0804 10:46:44.984865       1 log.go:181] http: TLS handshake error from 10.128.0.1:48840: read tcp 10.130.0.8:8443->10.128.0.1:48840: read: connection timed out
I0804 10:46:44.984891       1 log.go:181] http: TLS handshake error from 10.128.0.1:48770: read tcp 10.130.0.8:8443->10.128.0.1:48770: read: connection timed out
I0804 10:46:44.984919       1 log.go:181] http: TLS handshake error from 10.128.0.1:48626: read tcp 10.130.0.8:8443->10.128.0.1:48626: read: connection timed out
I0804 10:46:44.984945       1 log.go:181] http: TLS handshake error from 10.128.0.1:53370: read tcp 10.130.0.8:8443->10.128.0.1:53370: read: connection timed out
I0804 10:46:46.041491       1 log.go:181] http: TLS handshake error from 10.128.0.1:43762: EOF
I0804 10:46:46.041594       1 log.go:181] http: TLS handshake error from 10.128.0.1:43740: EOF
I0804 10:46:46.041642       1 log.go:181] http: TLS handshake error from 10.128.0.1:43750: EOF

# with a closer look I have discovered the pod behind these error logs: openshift-kube-apiserver/kube-apiserver-oaz-dev-tnhr6-master-0/kube-apiserver-check-endpoints

When I force the nodes to restart, the problem disappears for something like 18-20 hours. Another strange thing is that not all pods of the openshift-apiserver deployment are affected (we have 3 master nodes, and 1 or 2 are affected at once).

Version-Release number of selected component (if applicable):
4.4.14
4.6.0-0.nightly-2020-07-24-111750

How reproducible:
Every time.

Steps to Reproduce:
1. IPI install as a private cluster into the Azure cloud and wait.
2.
3.

Actual results:
Generally not working.

Expected results:
Working.

Additional info:
The "log spam" caused by check-endpoints was resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1855284. I would suggest trying a newer build. There is not enough info here to debug the OpenShiftAPICheckFailed events; please provide must-gather output if this is still reproducible.
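For reference, collecting the must-gather and checking which aggregated APIs are currently unavailable would look roughly like this (the --dest-dir path is only an example, adjust as needed):

$ oc adm must-gather --dest-dir=./must-gather
# list APIService objects that are not reporting Available=True
$ oc get apiservices | grep -v True

Then attach the resulting must-gather archive to this bug.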
Solved in version 4.6.0-0.nightly-2020-08-18-070534 (maybe earlier, but this version is working). I no longer have access to 4.6.0-0.nightly-2020-07-24-111750, so I cannot provide a must-gather.
Hi,
I have the same problem on OpenShift 4.5.20, IPI installation as a private cluster on Azure. For the moment I have forced my three master nodes to restart, but that is only a workaround. Any suggestions?

Thanks in advance
Best regards
Vincenzo Marzario
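For reference, a graceful per-node restart along the lines of the workaround above would look roughly like this (the node name is a placeholder, and the drain flags may differ slightly between 4.x releases):

$ oc adm cordon <master-node>
$ oc adm drain <master-node> --ignore-daemonsets --delete-local-data --force
$ oc debug node/<master-node> -- chroot /host systemctl reboot
# wait for the node to report Ready again, then
$ oc adm uncordon <master-node>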
Try changing the NICs of the master nodes to accelerated networking and adding read caching on the OS disks, both on the master and the worker nodes. Or, perhaps a better solution, upgrade to 4.6.
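In case it is useful, with the Azure CLI those changes would look roughly like the following (resource group, VM and NIC names are placeholders; the VM normally has to be deallocated before accelerated networking can be enabled on its NIC):

# stop/deallocate the VM first
$ az vm deallocate --resource-group <rg> --name <master-vm>
# enable accelerated networking on the node's NIC
$ az network nic update --resource-group <rg> --name <master-nic> --accelerated-networking true
# enable read caching on the OS disk
$ az vm update --resource-group <rg> --name <master-vm> --set storageProfile.osDisk.caching=ReadOnly
$ az vm start --resource-group <rg> --name <master-vm>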
Hi,
I can't upgrade to 4.6 because of constraints with other technologies. I'll try your tip.

Thanks
Vincenzo Marzario