1690043 – APIServer should return a structured error and retry-after for graceful shutdown errors


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Michal Fojtik
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1690167
Depends On:
Blocks:

Reported:	2019-03-18 16:10 UTC by Clayton Coleman
Modified:	2019-06-04 10:46 UTC
CC List:	7 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:46:02 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:46:10 UTC

Description Clayton Coleman 2019-03-18 16:10:28 UTC

CMO reported a hard error (failing=true) on its cluster operator, but this is an error it should handle by ignoring and retrying:

Mar 18 14:54:32.602 E clusteroperator/monitoring changed Failing to True: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus ClusterRoleBinding failed: updating ClusterRoleBinding object failed: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (put clusterrolebindings.rbac.authorization.k8s.io prometheus-k8s)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5804

Depends on https://github.com/kubernetes/kubernetes/pull/75368 which should make it automatic.

For 4.1 we want the server to return a structured error and have client stacks retry gracefully on that error, to minimize the churn caused by API restarts. Blocks GA.

Comment 1 Clayton Coleman 2019-03-18 16:10:51 UTC

Related to https://bugzilla.redhat.com/show_bug.cgi?id=1684547

Need to ensure all components are protected.

Comment 2 Michal Fojtik 2019-03-19 09:06:23 UTC

*** Bug 1690167 has been marked as a duplicate of this bug. ***

Comment 3 Michal Fojtik 2019-03-19 10:12:07 UTC

To mitigate: https://github.com/openshift/origin/pull/22355

Stefan believes we have a bug in the shutdown order, so we still need to look at that. The pick above should make the error less disruptive.

Comment 5 zhou ying 2019-04-08 09:29:44 UTC

Errors are still visible in: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6519/artifacts/e2e-aws/junit/junit_e2e_20190406-222605.xml

Comment 7 Michal Fojtik 2019-04-09 11:29:34 UTC

To match with upstream: https://github.com/openshift/origin/pull/22511

Comment 8 zhou ying 2019-04-10 02:38:10 UTC

No 'apiserver is shutting down' error, but there is a related error: ClusterOperatorNotAvailable: Cluster operator openshift-apiserver has not yet reported success. Not sure whether it is the same issue.

Comment 9 Michal Fojtik 2019-04-10 18:02:44 UTC

That is a different error, and it was fixed today.

Comment 10 zhou ying 2019-04-11 07:59:35 UTC

Checked the latest e2e test logs: no 'apiserver is shutting down' error. Will verify this.

Comment 12 errata-xmlrpc 2019-06-04 10:46:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
