Bug 1666936 - Improve cluster operator errors when wait.Poll times out
Summary: Improve cluster operator errors when wait.Poll times out
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
medium
unspecified
Target Milestone: ---
Target Release: ---
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-01-17 01:59 UTC by Robert Bost
Modified: 2022-03-13 16:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-30 17:24:10 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3813031 0 None None None 2019-01-17 01:59:22 UTC
Red Hat Knowledge Base (Solution) 3813211 0 None None None 2019-01-17 01:59:22 UTC

Description Robert Bost 2019-01-17 01:59:22 UTC
Description of problem: When a timeout occurs while waiting for a resource, the resulting error could be improved so that it points the cluster admin toward the cause. Examples of the current errors:

E0116 18:08:52.185756       1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: timed out waiting for the condition
E0110 14:10:47.279365       1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: timed out waiting for the condition

Both of the errors above stem from a timeout in a wait.Poll() call, whose error is quite generic. It would be helpful to include more detail than "timed out waiting for the condition": what is the condition, and where can the cluster admin look to troubleshoot the issue?

NOTE: Both issues above have been addressed in the customer clusters. This bz only requests an improvement to the error message.

Version-Release number of selected component (if applicable): OCP 3.11

Additional info:
I have two potential recommendations:
1) Add an option for a custom timeout message to wait.Poll and the other wait methods. I don't think this is feasible after grepping the codebase for references to ErrWaitTimeout :(
2) Check the return value of wait.Poll in methods like WaitForRouteReady and WaitForPrometheus, and include some extra detail about the failure in the returned error.

Comment 1 Frederic Branczyk 2019-01-17 10:03:27 UTC
This makes a lot of sense. Can you turn this into an RFE so our PM can prioritize this?

Comment 2 Robert Bost 2019-01-30 17:24:10 UTC
RFE filed here: https://jira.coreos.com/browse/RFE-13. I'll go ahead and close this bz as deferred; I hope that is appropriate.

