Bug 1837832

Summary:	From a 4.3.18 -> 4.3.19 update: Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Monitoring	Assignee:	Lili Cosic <lcosic>
Status:	CLOSED NOTABUG	QA Contact:	Junqi Zhao <juzhao>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	4.3.z	CC:	alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, travi
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-05-25 08:12:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-05-20 05:52:43 UTC

From an Insights tarball from a 4.3.18 -> 4.3.19 update:

$ tar -xOz config/clusteroperator/monitoring <20200519062637-32ad8cfe89fd45ddb28f2eda2c34936d | jq -r '.status.conditions[] | "  " + .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + " " +.message'
  2020-05-19T02:34:40Z Available=True RollOutDone Successfully rolled out the stack.
  2020-05-19T02:34:40Z Progressing=False  
  2020-05-19T02:34:40Z Degraded=False  
  2020-05-19T04:26:32Z Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.

The Upgradeable=True with RollOutInProgress really sounds like it's progressing, and yet, Progressing=False.  Also, Upgradeable=True plus a "Please wait" message is pretty odd.  If you wanted folks to wait, I'd expect Upgradeable=False.  Possibly the reason and message are just not getting reset to some "all is well" placeholders when the transition completes?  Also, the timestamps on the conditions are all well before the 4.3.18 -> 4.3.19 update itself.  From a later must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)' | head -n2
2020-05-19T06:32:38Z null 4.3.19 Partial true
2020-05-05T22:08:18Z 2020-05-05T23:35:40Z 4.3.18 Completed true

So it's not clear to me why the monitoring operator would be poking around with conditions at 04:26:32Z, about two hours before the update started at 06:32:38Z.  Possibly in response to an autoscaler or other node activity.
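A rough sketch of how the mismatch described above could be spotted mechanically. This is not part of the operator or of any OpenShift tooling: the function name find_mismatches and the "InProgress" substring heuristic are assumptions made for illustration, and the condition data is copied from the jq output earlier in this description.

```python
import json

# Conditions copied from the Insights tarball output above.
# Progressing and Degraded had empty reason and message fields.
CLUSTER_OPERATOR = json.loads("""
{"status": {"conditions": [
  {"lastTransitionTime": "2020-05-19T02:34:40Z", "type": "Available",
   "status": "True", "reason": "RollOutDone",
   "message": "Successfully rolled out the stack."},
  {"lastTransitionTime": "2020-05-19T02:34:40Z", "type": "Progressing",
   "status": "False", "reason": "", "message": ""},
  {"lastTransitionTime": "2020-05-19T02:34:40Z", "type": "Degraded",
   "status": "False", "reason": "", "message": ""},
  {"lastTransitionTime": "2020-05-19T04:26:32Z", "type": "Upgradeable",
   "status": "True", "reason": "RollOutInProgress",
   "message": "Rollout of the monitoring stack is in progress. Please wait until it finishes."}
]}}
""")


def find_mismatches(cluster_operator):
    """Return human-readable notes about conditions that contradict each other.

    Hypothetical helper: the only rule checked is Upgradeable=True carrying an
    in-progress reason while Progressing=False.
    """
    conditions = {c["type"]: c for c in cluster_operator["status"]["conditions"]}
    upgradeable = conditions.get("Upgradeable")
    progressing = conditions.get("Progressing")
    problems = []
    if upgradeable and progressing:
        looks_in_progress = "InProgress" in upgradeable.get("reason", "")
        if (upgradeable["status"] == "True"
                and progressing["status"] == "False"
                and looks_in_progress):
            problems.append(
                "Upgradeable=True has reason %r but Progressing=False"
                % upgradeable["reason"]
            )
    return problems


if __name__ == "__main__":
    for problem in find_mismatches(CLUSTER_OPERATOR):
        print(problem)
```

Run against the conditions above, the check reports a single mismatch for the Upgradeable condition.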

Comment 2 Sergiusz Urbaniak 2020-05-20 11:35:48 UTC

As per: https://coreos.slack.com/archives/C0VMT03S5/p1589961556398800?thread_ts=1589952447.394700&cid=C0VMT03S5

> lili  I understood we should not be setting Upgradeable=False ?

Can you advise Trevor?

Setting severity to low until this is clarified.

Comment 3 W. Trevor King 2020-05-20 14:17:37 UTC

Sounds like monitoring doesn't have anything that would call for Upgradeable=False and "you can't bump minor version 4.y -> 4.(y+1) because $THIS would break".  So the fix is probably to pick a reason ("AsExpected" or similar) and message ("This is fine" or similar) and always set those instead of the current "RollOutInProgress" and "Rollout of the monitoring stack is in progress. Please wait until it finishes".  No functional impact, so low priority is appropriate, but it seems like a straightforward fix, and folks like me with my admin hat on would be less confused once the reason/message makes sense with the Upgradeable=True type/status.
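A minimal sketch of the idea proposed above, written in Python purely for illustration. It is not the operator's actual code: the function name upgradeable_condition and the message text are hypothetical, and only the "AsExpected" reason comes from this comment. It also assumes the usual Kubernetes convention that lastTransitionTime changes only when the status changes.

```python
from datetime import datetime, timezone


def upgradeable_condition(previous=None, now=None):
    """Build an Upgradeable=True condition with a fixed reason and message.

    Hypothetical helper: the reason and message never depend on rollout
    state, so they cannot go stale the way "RollOutInProgress" did.
    """
    now = now or datetime.now(timezone.utc)
    condition = {
        "type": "Upgradeable",
        "status": "True",
        "reason": "AsExpected",  # reason suggested in this comment
        "message": "Nothing in monitoring blocks a minor-version update.",  # placeholder text
    }
    if previous is not None and previous.get("status") == condition["status"]:
        # Status is unchanged, so keep the existing transition timestamp.
        condition["lastTransitionTime"] = previous["lastTransitionTime"]
    else:
        condition["lastTransitionTime"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return condition


if __name__ == "__main__":
    stale = {
        "type": "Upgradeable",
        "status": "True",
        "reason": "RollOutInProgress",
        "message": "Rollout of the monitoring stack is in progress. Please wait until it finishes.",
        "lastTransitionTime": "2020-05-19T04:26:32Z",
    }
    print(upgradeable_condition(previous=stale))
```

Given the stale condition from the description, the result keeps the 04:26:32Z timestamp but replaces the confusing reason and message.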

Comment 4 Lili Cosic 2020-05-25 08:12:13 UTC

We want to modify this from 4.6 onwards, and created a task so we don't forget: https://issues.redhat.com/browse/MON-1126. Closing as agreed on Slack.