Bug 1889900

Summary: OpenShift API stops responding during upgrade
Product: OpenShift Container Platform
Reporter: Fabian von Feilitzsch <fabian>
Component: openshift-apiserver
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: low
Version: 4.7
CC: aos-bugs, bruce_link, jsafrane, mf.flip, mfojtik, wking, wlewis
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: trt LifecycleStale
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-27 17:54:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Fabian von Feilitzsch 2020-10-20 20:54:04 UTC
Description of problem:
OpenShift API was unreachable for 2h15m3s of 2h33m31s (88%) of upgrade

Version-Release number of selected component (if applicable):
4.7

Example failing job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1318545558922596352

It looks like this may be preventing the kube-apiserver from upgrading, as the `kube-apiserver-check-endpoints` pod is crashlooping with the following failure: 

E1020 16:41:10.647385       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I1020 16:41:10.870356       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

In the kube-apiserver logs, the podnetworkconnectivitychecks API appears to be repeatedly added to and removed from the OpenAPI spec, with the following messages looping:

I1020 16:43:56.145475      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1020 16:43:56.890503      18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
I1020 16:43:57.215968      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.215968      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.216106      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.216148      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:58.646013      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1020 16:44:00.249165      18 aggregator.go:229] Finished OpenAPI spec generation after 1.603120881s
I1020 16:44:02.228507      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1020 16:44:03.845507      18 aggregator.go:229] Finished OpenAPI spec generation after 1.616970005s
I1020 16:44:04.471745      18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
I1020 16:44:04.527616      18 client.go:360] parsed scheme: "endpoint"
I1020 16:44:04.527680      18 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.134.149:2379  <nil> 0 <nil>} {https://10.0.178.80:2379  <nil> 0 <nil>} {https://10.0.252.43:2379  <nil> 0 <nil>} {https://localhost:2379  <nil> 0 <nil>}]
I1020 16:44:04.537229      18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1020 16:44:04.557550      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
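
A quick way to cross-check whether the CRD is actually registered while check-endpoints is failing is something like the following (a sketch using standard oc commands; nothing here is specific to this particular job):

$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io
$ oc api-resources --api-group=controlplane.operator.openshift.io
$ oc get pods -n openshift-kube-apiserver

If the CRD object exists but the resource intermittently disappears from api-resources, that would match the "server could not find the requested resource" errors above.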

Comment 1 Jan Safranek 2020-10-21 10:33:46 UTC
Bumping severity; OCP 4.7 releases are not being promoted due to this bug.

Comment 2 Michal Fojtik 2020-11-20 11:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Roman Kravtsov 2020-12-15 17:58:16 UTC
During the upgrade from OKD 4.5 to 4.6, I ran into a similar problem. 

Description
After the cluster upgrade I can't open the OpenShift console. The OAuth service is not working.

The events contain the message "APIService check reported "oauth.openshift.io.v1" is not ready: 503".

Some commands stop working:

$ ./oc get route --all-namespaces
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)

Some operators are in a DEGRADED state:

$ ./oc get co | awk '$5 == "True" || $5 == "DEGRADED"'
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.6.0-0.okd-2020-12-12-135354   False       True          True       24h
console          4.6.0-0.okd-2020-12-12-135354   False       False         True       23h
image-registry   4.6.0-0.okd-2020-12-12-135354   True        False        True       20h
monitoring

Version
4.5.0-0.okd-2020-10-15-235428
upgraded to
4.6.0-0.okd-2020-11-27-200126
4.6.0-0.okd-2020-12-12-135354


The problem is described in more detail here: https://github.com/openshift/okd/issues/395

I can provide any additional information if required.
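
In case it is useful, here is a sketch of how to see which aggregated API services are reporting unavailable (standard oc commands; the group/version names come from the errors above, and the namespaces are the usual ones for the aggregated apiservers in 4.6):

$ oc get apiservices | grep False
$ oc get apiservice v1.oauth.openshift.io -o yaml
$ oc get apiservice v1.route.openshift.io -o yaml
$ oc get pods -n openshift-apiserver
$ oc get pods -n openshift-oauth-apiserver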

Comment 4 Michal Fojtik 2020-12-15 18:10:17 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 5 Bruce Link 2020-12-15 21:02:19 UTC
We are having the same issue as mf.flip.

As above, 

$ oc get route --all-namespaces
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)

$ oc get co | awk '$5 == "True" || $5 == "DEGRADED"'
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.okd-2020-12-12-135354   False       True          True       19h
console                                    4.6.0-0.okd-2020-12-12-135354   False       False         True       13h
image-registry                             4.6.0-0.okd-2020-12-12-135354   False       True          True       18h
monitoring                                 4.6.0-0.okd-2020-12-12-135354   False       True          True       19h

Looking at the unavailable operators additionally shows:

openshift-apiserver                        4.6.0-0.okd-2020-12-12-135354   False       False         False      19h
operator-lifecycle-manager-packageserver   4.6.0-0.okd-2020-12-12-135354   False       True          False      13s

Further information, including a must-gather report, is available.
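
For reference, a sketch of how the extra detail can be pulled for the operators listed above (standard oc invocations; the destination directory is only an example):

$ oc describe co openshift-apiserver
$ oc describe co operator-lifecycle-manager-packageserver
$ oc adm must-gather --dest-dir=./must-gather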

Comment 6 Michal Fojtik 2021-01-14 21:38:47 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 7 Luis Sanchez 2021-01-27 17:54:18 UTC

*** This bug has been marked as a duplicate of bug 1912820 ***