1666936 – Improve cluster operator errors when wait.Poll times out

Bug 1666936 - Improve cluster operator errors when wait.Poll times out

Summary: Improve cluster operator errors when wait.Poll times out

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Frederic Branczyk
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-01-17 01:59 UTC by Robert Bost
Modified:	2022-03-13 16:46 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-01-30 17:24:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3813031	0	None	None	None	2019-01-17 01:59:22 UTC
Red Hat Knowledge Base (Solution)	3813211	0	None	None	None	2019-01-17 01:59:22 UTC

Description Robert Bost 2019-01-17 01:59:22 UTC

Description of problem: When a timeout occurs while checking for a resource, the error message could be improved to point the cluster admin in the right direction. Example errors:

E0116 18:08:52.185756       1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: timed out waiting for the condition
E0110 14:10:47.279365       1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: timed out waiting for the condition

Both of the errors above stem from a timeout in a wait.Poll() call, whose error is quite generic. It would be helpful to include more detail than "timed out waiting for the condition": what is the condition, and where can the cluster admin look to troubleshoot the issue?

NOTE: Both issues above have been addressed in the customer clusters; this bz only requests an improvement to the error messages.

Version-Release number of selected component (if applicable): OCP 3.11

Additional info:
I have two potential recommendations:
1) Add an option for a custom timeout message to wait.Poll and the other wait methods. I don't think this is feasible, having grepped the codebase for references to ErrWaitTimeout :(
2) Check the return value of wait.Poll in methods like WaitForRouteReady and WaitForPrometheus, and include some extra detail about the failure.

Comment 1 Frederic Branczyk 2019-01-17 10:03:27 UTC

This makes a lot of sense. Can you turn this into an RFE so our PM can prioritize this?

Comment 2 Robert Bost 2019-01-30 17:24:10 UTC

RFE filed here: https://jira.coreos.com/browse/RFE-13 I'll go ahead and close out this bz as deferred; hope that is appropriate.
