Bug 1694220
| Summary: | Monitoring operator failed to upgrade | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Monitoring | Assignee: | lserven |
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.1.0 | CC: | anpicker, erooth, fbranczy, minden, mloibl, surbania |
| Target Milestone: | --- | ||
| Target Release: | 4.1.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-11 09:36:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Ben Parees
2019-03-29 19:53:33 UTC
Related operator failed to upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694216
Related operator failed to upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694222

The logs show that the cluster-monitoring-operator was in the middle of rolling out. No errors have happened; it was simply still in the middle of things. Should we close this as not a bug?

The implication is *something* took longer than we seem to expect to upgrade/roll out, so where I'd want to see the investigation go is:

1) Is our timeout for upgrade too low?

2a) Did something earlier in the upgrade process take an excessive amount of time? (Do we need the CVO to start reporting more information about how long it waited for various upgrades to complete so we can chase this more easily?)

2b) Was that because something deeper in the stack was unstable? (e.g. an operator took 20 minutes to upgrade because the kube-apiserver was unavailable?)

Those questions are why I linked this bug to the other two above... we can probably close out two of them as dupes of the third, but we should use one to chase the above questions, imho.

1) Maybe yes. I looked at this in particular, and the log line that says monitoring didn't finish happens at 08:49:27.934:

```
Mar 29 08:49:27.934: INFO: cluster upgrade is failing: Cluster operator monitoring is still updating
```

The cluster-monitoring-operator finishes at 08:49:29.971763 (roughly two seconds afterwards); this is the excerpt from its logs:

```
I0329 08:49:29.971763 1 operator.go:314] Updating ClusterOperator status to done.
```

That is aligned with the ClusterOperator object, which indicates it finished at the same time. In total, this upgrade of the cluster-monitoring stack took roughly a minute and a half (it started at 08:47:59.767848):

```
I0329 08:47:59.767848 1 operator.go:298] Updating ClusterOperator status to in progress.
```

Given the number of components that have to be upgraded, I'd deem this an appropriate amount of time.

2a) To be able to answer this for other components, I do think the CVO should have statistics on each component; it's hard for me to now troubleshoot _which_ component may have taken too long, causing the cluster-monitoring-operator to not finish early enough.

2b) I couldn't spot any errors in the cluster-monitoring-operator logs, so it doesn't seem like a core control plane component was unstable.

Given all of the above, I'd deem this not a cluster-monitoring-operator issue, but as I said in 2a, it's very hard to now say what _did_ take too long.

I've looked into a number of the occurrences of

```
Mar 29 08:49:27.934: INFO: cluster upgrade is failing: Cluster operator monitoring is still updating
```

and have yet to find non-transient errors (I searched through roughly 10 failures that contained this message). Most of them are related to the controller-manager being unavailable, therefore DaemonSets/StatefulSets/Deployments are not progressing. For a number of these failures, controller-manager alerts were firing indicating it was not available.

Closing this as not a bug, as we haven't found any evidence of the cluster-monitoring-operator having done something incorrect, but we will keep an eye out for it.
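
As a quick cross-check of the "roughly a minute and a half" figure, the two klog timestamps above ("status to in progress" at 08:47:59.767848, "status to done" at 08:49:29.971763) are about 90 seconds apart. Below is a minimal Go sketch of that arithmetic, not part of the original bug; it assumes the year 2019 and UTC, since klog lines carry neither.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// klog timestamps ("I0329 08:47:59.767848 ...") include month, day and time
	// but no year or time zone; assume 2019 and UTC to match the bug's dates.
	const layout = "2006 0102 15:04:05.000000"

	start, err := time.Parse(layout, "2019 0329 08:47:59.767848") // status set to in progress
	if err != nil {
		panic(err)
	}
	done, err := time.Parse(layout, "2019 0329 08:49:29.971763") // status set to done
	if err != nil {
		panic(err)
	}

	// Elapsed rollout time of the cluster-monitoring stack.
	fmt.Println(done.Sub(start)) // prints 1m30.203915s
}
```

The result, 1m30.203915s, is consistent with the operator finishing roughly two seconds after the 08:49:27.934 "still updating" check in the upgrade test.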