Description of problem:
When a timeout occurs while checking for a resource, the error message could be improved to point the cluster admin in the right direction.

Examples of errors:

E0116 18:08:52.185756 1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: timed out waiting for the condition

E0110 14:10:47.279365 1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: timed out waiting for the condition

Both of the errors above stem from a timeout in a wait.Poll() call, whose error is quite generic. It would be helpful to include more detail than "timed out waiting for the condition": what is the condition, and where can the cluster admin look to troubleshoot the issue?

NOTE: Both issues above have been addressed in customer clusters; this bz is only to request an improvement to the error message.

Version-Release number of selected component (if applicable):
OCP 3.11

Additional info:
I have two potential recommendations:
1) Add an option for a custom timeout message in wait.Poll and the other wait methods. I suspect this is not feasible, after grepping the codebase for references to ErrWaitTimeout :(
2) Check the return value of wait.Poll in methods like WaitForRouteReady and WaitForPrometheus, and include some extra detail about the failure.
This makes a lot of sense. Can you turn this into an RFE so our PM can prioritize this?
RFE filed here: https://jira.coreos.com/browse/RFE-13
I'll go ahead and close out this bz as deferred; I hope that is appropriate.