test "Kubernetes APIs remain available" is failing frequently in CI; see search results: https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Kubernetes+APIs+remain+available
Needs investigation into non-GCP causes (the GCP failures were caused by gcp-route exiting before kube-apiserver).
I looked into this test run to investigate the issue: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/773/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-upgrade/1383

I can see the following errors in the test log:

May 06 15:24:58.720 E kube-apiserver Kube API started failing: Get https://api.ci-op-7cwyjx04-06a72.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=15s: dial tcp 54.190.189.72:6443: connect: connection refused
May 06 15:24:59.553 E kube-apiserver Kube API is not responding to GET requests
May 06 15:24:59.637 I kube-apiserver Kube API started responding to GET requests

After looking into the artifacts directory, I could not find logs for one kube-apiserver pod covering the above timestamps:

$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver.log
Copying system trust bundle
Flag --openshift-config has been deprecated, to be removed
I0506 15:27:30.698893       1 feature_gate.go:244] feature gates: &{map[APIPriorityAndFairness:true]}

$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver_previous.log
$

There are NO previous logs that tell us anything about the API server before 15:27:30. I did find this log statement in the etcd logs on the same node around the same time:

$ grep osutil openshift-etcd_etcd-ip-10-0-157-201.us-west-2.compute.internal_etcd.log
2020-05-06 15:24:57.894871 N | pkg/osutil: received terminated signal, shutting down...

So I am guessing the API server was not running on this node around this time, hence the "connection refused" errors. I tried to establish that this apiserver pod was not running, but was not successful; I would be interested to learn if someone can teach me how to do that. And if that is the case, why is the load balancer still sending requests to this API server?
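For what it's worth, here is a rough sketch of how one might try to confirm from the CI artifacts alone that the pod was down in that window. This assumes the run's artifacts include an events.json dump and per-node journals; the exact file paths below are illustrative, not confirmed paths from this job:

# Pod lifecycle events (Killing/Created/Started) for the apiserver pod on that node:
$ jq -r '.items[]
    | select(.involvedObject.name == "kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal")
    | [.lastTimestamp, .reason, .message] | @tsv' events.json

# Kubelet/CRI-O records of the container stopping and starting, from the node journal:
$ zgrep 'kube-apiserver' nodes/ip-10-0-157-201.us-west-2.compute.internal/journal.gz | grep 'May 06 15:2[4-7]'

On the load balancer question: if the process really did exit around 15:24:57, one plausible explanation is that the AWS NLB only polls its health check on :6443 at an interval, so there is a short window after the process terminates and before the target is marked unhealthy during which new connections are still routed to the node and are refused. That would be consistent with the roughly one second of failures seen here, but I have not confirmed it from the artifacts.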
Closing until we see this again with more logs.
What do you mean, "see it again"? This is failing consistently:
https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests
https://search.apps.build01.ci.devcluster.openshift.com/?search=Kubernetes+APIs+remain+available&maxAge=168h&context=1&type=junit&name=release.*4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job
This is high but not urgent priority; it's fine if we investigate in 4.6. However, API outage during upgrade is a p0 issue for customers from a prioritization perspective, so this should take priority over feature work.
We track this in bug 1845411, by platform as sub-BZs, and by version for critical cases like release blockers.

*** This bug has been marked as a duplicate of bug 1845411 ***