test "Kubernetes APIs remain available" is failing frequently in CI; see search results: https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Kubernetes+APIs+remain+available
Needs investigation into non-GCP causes (the GCP failures were caused by gcp-route exiting before kube-apiserver).
I looked into this test run to investigate the issue: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/773/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-upgrade/1383

I can see the following errors in the test log:

May 06 15:24:58.720 E kube-apiserver Kube API started failing: Get https://api.ci-op-7cwyjx04-06a72.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=15s: dial tcp 54.190.189.72:6443: connect: connection refused
May 06 15:24:59.553 E kube-apiserver Kube API is not responding to GET requests
May 06 15:24:59.637 I kube-apiserver Kube API started responding to GET requests

After looking into the artifacts directory, I could not find logs for one kube-apiserver pod covering the above timestamps:

$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver.log
Copying system trust bundle
Flag --openshift-config has been deprecated, to be removed
I0506 15:27:30.698893       1 feature_gate.go:244] feature gates: &{map[APIPriorityAndFairness:true]}

$ head -n3 openshift-kube-apiserver_kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal_kube-apiserver_previous.log
$

There are NO previous logs that tell us anything about the API server before 15:27:30. I did find this log statement in the etcd logs on the same node around the same time:

$ grep osutil openshift-etcd_etcd-ip-10-0-157-201.us-west-2.compute.internal_etcd.log
2020-05-06 15:24:57.894871 N | pkg/osutil: received terminated signal, shutting down...

So I am guessing the API server was not running on this node around this time, hence the "connection refused" errors. I tried to establish that this apiserver pod was not running, but was not successful; I would be interested to learn if someone can teach me how to do that. And if that is the case, why is the load balancer still sending requests to this API server?
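For what it's worth, here is a rough sketch of how one might try to confirm from the CI artifacts alone that the pod was down in that window. This assumes the run's artifacts include an events.json dump and per-node journals; the exact file paths below are illustrative, not confirmed paths from this job:

# Pod lifecycle events (Killing/Created/Started) for the apiserver pod on that node:
$ jq -r '.items[]
    | select(.involvedObject.name == "kube-apiserver-ip-10-0-157-201.us-west-2.compute.internal")
    | [.lastTimestamp, .reason, .message] | @tsv' events.json

# Kubelet/CRI-O records of the container stopping and starting, from the node journal:
$ zgrep 'kube-apiserver' nodes/ip-10-0-157-201.us-west-2.compute.internal/journal.gz | grep 'May 06 15:2[4-7]'

On the load balancer question: if the process really did exit around 15:24:57, one plausible explanation is that the AWS NLB only polls its health check on :6443 at an interval, so there is a short window after the process terminates and before the target is marked unhealthy during which new connections are still routed to the node and are refused. That would be consistent with the roughly one second of failures seen here, but I have not confirmed it from the artifacts.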
Closing until we see this again with more logs.
What do you mean, "see it again"? This is failing consistently:
https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests
https://search.apps.build01.ci.devcluster.openshift.com/?search=Kubernetes+APIs+remain+available&maxAge=168h&context=1&type=junit&name=release.*4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job
This is high but not urgent priority; it's fine if we investigate in 4.6. However, API outage during upgrade is a p0 issue for customers from a prioritization perspective, so this should take priority over feature work.
We track this in bug 1845411, by platform as sub-BZs, and by version for critical cases like release blockers.

*** This bug has been marked as a duplicate of bug 1845411 ***