Bug 1694220
| Summary: | Monitoring operator failed to upgrade | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Monitoring | Assignee: | lserven |
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.1.0 | CC: | anpicker, erooth, fbranczy, minden, mloibl, surbania |
| Target Milestone: | --- | ||
| Target Release: | 4.1.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-11 09:36:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Ben Parees
2019-03-29 19:53:33 UTC
Related operator failed to upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694216
Related operator failed to upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694222

The logs show that the cluster-monitoring-operator was in the middle of rolling out. No errors have happened; it was simply still in the middle of things. Should we close this as not a bug?

The implication is *something* took longer than we seem to expect to upgrade/roll out, so where I'd want to see the investigation go is:

1) Is our timeout for upgrade too low?

2a) Did something earlier in the upgrade process take an excessive amount of time? (Do we need the CVO to start reporting more information about how long it waited for various upgrades to complete so we can chase this more easily?)

2b) Was that because something deeper in the stack was unstable? (e.g. an operator took 20 minutes to upgrade because the kube-apiserver was unavailable?)

Those questions are why I linked this bug to the other two above... we can probably close out two of them as dupes of the third, but we should use one to chase the above questions, imho.

1) Maybe yes. I looked at this in particular, and the log line that says monitoring didn't finish happens at 08:49:27.934:

```
Mar 29 08:49:27.934: INFO: cluster upgrade is failing: Cluster operator monitoring is still updating
```

The cluster-monitoring-operator finishes at 08:49:29.971763 (roughly two seconds afterwards); this is the excerpt from its logs:

```
I0329 08:49:29.971763 1 operator.go:314] Updating ClusterOperator status to done.
```

That is aligned with the ClusterOperator object, which indicates it finished at the same time. In total, this upgrade of the cluster-monitoring stack took roughly a minute and a half (it started at 08:47:59.767848):

```
I0329 08:47:59.767848 1 operator.go:298] Updating ClusterOperator status to in progress.
```

Given the number of components that have to be upgraded, I'd deem this an appropriate amount of time.

2a) To be able to answer this for other components, I do think the CVO should have statistics on each component; it's hard for me to now troubleshoot _which_ component may have taken too long, causing the cluster-monitoring-operator to not finish early enough.

2b) I couldn't spot any errors in the cluster-monitoring-operator logs, so it doesn't seem like a core control plane component was unstable.

Given all of the above, I'd deem this not a cluster-monitoring-operator issue, but as I said in 2a, it's very hard to now say what _did_ take too long.

I've looked into a number of the occurrences of

```
Mar 29 08:49:27.934: INFO: cluster upgrade is failing: Cluster operator monitoring is still updating
```

and have yet to find non-transient errors (I searched through roughly 10 failures that contained this message). Most of them are related to the controller-manager being unavailable, therefore DaemonSets/StatefulSets/Deployments are not progressing. For a number of these failures, controller-manager alerts were firing indicating it was not available.

Closing this as not a bug, as we haven't found any evidence of the cluster-monitoring-operator having done something incorrect, but we will keep an eye out for it.
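
As a quick cross-check of the "roughly a minute and a half" figure, the two klog timestamps above ("status to in progress" at 08:47:59.767848, "status to done" at 08:49:29.971763) are about 90 seconds apart. Below is a minimal Go sketch of that arithmetic, not part of the original bug; it assumes the year 2019 and UTC, since klog lines carry neither.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// klog timestamps ("I0329 08:47:59.767848 ...") include month, day and time
	// but no year or time zone; assume 2019 and UTC to match the bug's dates.
	const layout = "2006 0102 15:04:05.000000"

	start, err := time.Parse(layout, "2019 0329 08:47:59.767848") // status set to in progress
	if err != nil {
		panic(err)
	}
	done, err := time.Parse(layout, "2019 0329 08:49:29.971763") // status set to done
	if err != nil {
		panic(err)
	}

	// Elapsed rollout time of the cluster-monitoring stack.
	fmt.Println(done.Sub(start)) // prints 1m30.203915s
}
```

The result, 1m30.203915s, is consistent with the operator finishing roughly two seconds after the 08:49:27.934 "still updating" check in the upgrade test.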