Bug 1889900 - OpenShift API stops responding during upgrade [NEEDINFO]
Summary: OpenShift API stops responding during upgrade
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard: trt LifecycleStale
Depends On:
Blocks:
 
Reported: 2020-10-20 20:54 UTC by Fabian von Feilitzsch
Modified: 2020-11-20 11:12 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Flags: mfojtik: needinfo?



Description Fabian von Feilitzsch 2020-10-20 20:54:04 UTC
Description of problem:
The OpenShift API was unreachable for 2h15m3s of a 2h33m31s upgrade (88% of the run).

Version-Release number of selected component (if applicable):
4.7

Example failing job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1318545558922596352

It looks like this may be preventing the kube-apiserver from upgrading, as the `kube-apiserver-check-endpoints` pod is crashlooping with the following failure: 

E1020 16:41:10.647385       1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I1020 16:41:10.870356       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
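
A quick way to confirm whether the resource the check-endpoints controller is asking for is actually registered at a given moment (hypothetical diagnostic commands against an affected cluster, not output from the job above):

  # Is the CRD present, and does discovery list the resource for its group?
  oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io
  oc api-resources --api-group=controlplane.operator.openshift.io

If the second command does not list podnetworkconnectivitychecks, the "server could not find the requested resource" error above is the expected client-side symptom.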

In the kube-apiserver logs, the podnetworkconnectivitychecks API appears to be repeatedly added to and removed from the OpenAPI spec, with the following messages looping:

I1020 16:43:56.145475      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
I1020 16:43:56.890503      18 controller.go:172] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io was removed
I1020 16:43:57.215968      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.215968      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.216106      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:57.216148      18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I1020 16:43:58.646013      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1020 16:44:00.249165      18 aggregator.go:229] Finished OpenAPI spec generation after 1.603120881s
I1020 16:44:02.228507      18 aggregator.go:226] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated
I1020 16:44:03.845507      18 aggregator.go:229] Finished OpenAPI spec generation after 1.616970005s
I1020 16:44:04.471745      18 controller.go:189] Updating CRD OpenAPI spec because podnetworkconnectivitychecks.controlplane.operator.openshift.io changed
I1020 16:44:04.527616      18 client.go:360] parsed scheme: "endpoint"
I1020 16:44:04.527680      18 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.134.149:2379  <nil> 0 <nil>} {https://10.0.178.80:2379  <nil> 0 <nil>} {https://10.0.252.43:2379  <nil> 0 <nil>} {https://localhost:2379  <nil> 0 <nil>}]
I1020 16:44:04.537229      18 store.go:1378] Monitoring podnetworkconnectivitychecks.controlplane.operator.openshift.io count at <storage-prefix>//controlplane.operator.openshift.io/podnetworkconnectivitychecks
I1020 16:44:04.557550      18 cacher.go:402] cacher (*unstructured.Unstructured): initialized
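
To observe the add/remove cycle from the client side, one could poll discovery for that group/version in a loop (a rough sketch; the group/version controlplane.operator.openshift.io/v1alpha1 is inferred from the reflector error above, and this is not how the CI disruption monitor measures availability):

  while true; do
    if oc get --raw /apis/controlplane.operator.openshift.io/v1alpha1 >/dev/null 2>&1; then
      echo "$(date -u +%H:%M:%S) served"
    else
      echo "$(date -u +%H:%M:%S) not served"
    fi
    sleep 2
  done

Output alternating between "served" and "not served" would match the CRD OpenAPI controller repeatedly removing and re-adding the resource in the kube-apiserver logs above.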

Comment 1 Jan Safranek 2020-10-21 10:33:46 UTC
Bumping severity; OCP 4.7 releases are not being promoted due to this bug.

Comment 2 Michal Fojtik 2020-11-20 11:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem was resolved, was a duplicate of something else, or became less pressing for some reason, or maybe it's still relevant but simply hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. Useful information includes, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to the Keywords field if you think this bug should never be marked as stale; please consult the bug assignee before doing so.

