Bug 1828861 - Kubernetes APIs remain available
Summary: Kubernetes APIs remain available
Keywords:
Status: CLOSED DUPLICATE of bug 1845411
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-28 13:38 UTC by Ben Parees
Modified: 2020-06-18 10:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-18 10:20:48 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2020-04-28 13:38:42 UTC
test:
Kubernetes APIs remain available 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Kubernetes+APIs+remain+available

Comment 1 Clayton Coleman 2020-04-28 13:50:55 UTC
Needs investigation of non-GCP causes (on GCP it is caused by gcp-route exiting before kube-apiserver).

Comment 2 Venkata Siva Teja Areti 2020-05-06 20:46:22 UTC
I looked at this test run to investigate the issue:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/773/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-upgrade/1383

I can see the following errors in the test log:

May 06 15:24:58.720 E kube-apiserver Kube API started failing: Get https://api.ci-op-7cwyjx04-06a72.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=15s: dial tcp 54.190.189.72:6443: connect: connection refused
May 06 15:24:59.553 E kube-apiserver Kube API is not responding to GET requests
May 06 15:24:59.637 I kube-apiserver Kube API started responding to GET requests

After looking through the artifacts directory, I could not find logs for one kube-apiserver pod covering the above timestamps:

$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver.log
Copying system trust bundle
Flag --openshift-config has been deprecated, to be removed
I0506 15:27:30.698893       1 feature_gate.go:244] feature gates: &{map[APIPriorityAndFairness:true]}
$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver_previous.log 
$

There are NO previous logs that tell us anything about the API server. I found this log statement in the etcd logs on the same node around the same time:

$ grep osutil openshift-etcd_etcd-ip-10-0-157-201.us-west-2.compute.internal_etcd.log
2020-05-06 15:24:57.894871 N | pkg/osutil: received terminated signal, shutting down...

So I am guessing the API server was not running on this node around this time, hence the "connection refused" error. I tried to establish that this apiserver pod was not running but was not successful; I would be interested to learn how to do that (a possible approach is sketched below).
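One possible way to check (a sketch, not taken from this run's artifacts; it assumes a live cluster and reuses the node name and the standard openshift-kube-apiserver namespace from above):

$ # was a kube-apiserver static pod scheduled and running on that node?
$ oc get pods -n openshift-kube-apiserver -o wide | grep ip-10-0-157-201
$ # events for that pod (container kills/restarts around the timestamp)
$ oc get events -n openshift-kube-apiserver --field-selector involvedObject.name=kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal
$ # logs from the previous container instance, if one existed
$ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal -c kube-apiserver --previous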

If that is the case, why is the load balancer still sending requests to this API server?
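For the load balancer question, one thing worth checking would be whether the apiserver's readiness endpoint was still reporting healthy around the shutdown (a sketch; /readyz is the standard kube-apiserver readiness endpoint, and whether the external load balancer's health check actually uses it is an assumption here; the hostname is the one from the error above):

$ # readiness as seen through the load-balanced API endpoint
$ curl -k "https://api.ci-op-7cwyjx04-06a72.origin-ci-int-aws.dev.rhcloud.com:6443/readyz?verbose"
$ # readiness as seen via an authenticated client
$ oc get --raw '/readyz?verbose'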

Comment 3 Stefan Schimanski 2020-05-19 11:32:14 UTC
Closing until we see this again with more logs.

Comment 6 Clayton Coleman 2020-05-20 14:40:07 UTC
This is high but not urgent priority - it's fine if we investigate in 4.6. However, an outage during upgrade is a p0 issue for customers from a prioritization perspective, so this should take priority over feature work.

Comment 7 Stefan Schimanski 2020-06-18 10:20:48 UTC
We track this in bug 1845411, by platform as sub-BZs, and by version for critical cases like release blockers.

*** This bug has been marked as a duplicate of bug 1845411 ***

