Bug 1837832

Summary: From a 4.3.18 -> 4.3.19 update: Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Monitoring Assignee: Lili Cosic <lcosic>
Status: CLOSED NOTABUG QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.3.z CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, travi
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-05-25 08:12:30 UTC Type: Bug

Description W. Trevor King 2020-05-20 05:52:43 UTC
From an Insights tarball from a 4.3.18 -> 4.3.19 update:

$ tar -xOz config/clusteroperator/monitoring <20200519062637-32ad8cfe89fd45ddb28f2eda2c34936d | jq -r '.status.conditions[] | "  " + .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + " " +.message'
  2020-05-19T02:34:40Z Available=True RollOutDone Successfully rolled out the stack.
  2020-05-19T02:34:40Z Progressing=False  
  2020-05-19T02:34:40Z Degraded=False  
  2020-05-19T04:26:32Z Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.
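
The conditions above can also be cross-checked against each other in code. The following is only a rough Python sketch: the function name `find_stale_upgradeable` and both heuristics (an "InProgress" reason while Progressing=False, and a "wait" message while Upgradeable=True) are assumptions made for illustration, not part of any OpenShift tool. The condition data is copied from the jq output above.

```python
# Conditions copied from the jq output above (Insights tarball).
CONDITIONS = [
    {"type": "Available", "status": "True", "reason": "RollOutDone",
     "message": "Successfully rolled out the stack.",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Progressing", "status": "False", "reason": "", "message": "",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Degraded", "status": "False", "reason": "", "message": "",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Upgradeable", "status": "True", "reason": "RollOutInProgress",
     "message": "Rollout of the monitoring stack is in progress. "
                "Please wait until it finishes.",
     "lastTransitionTime": "2020-05-19T04:26:32Z"},
]


def find_stale_upgradeable(conditions):
    """Return complaints about an Upgradeable condition that looks stale.

    Hypothetical heuristics, chosen only for this illustration:
    - a reason containing "InProgress" is suspicious when Progressing=False;
    - a message asking to "wait" is suspicious when Upgradeable=True.
    """
    by_type = {c["type"]: c for c in conditions}
    upgradeable = by_type.get("Upgradeable")
    progressing = by_type.get("Progressing")
    problems = []
    if upgradeable is None:
        return problems
    if (progressing is not None
            and progressing["status"] == "False"
            and "InProgress" in upgradeable.get("reason", "")):
        problems.append("reason claims progress but Progressing=False")
    if (upgradeable["status"] == "True"
            and "wait" in upgradeable.get("message", "").lower()):
        problems.append("Upgradeable=True but message asks to wait")
    return problems


if __name__ == "__main__":
    for problem in find_stale_upgradeable(CONDITIONS):
        print(problem)
```

Run against the data above, both heuristics fire; a condition with a neutral reason and message would pass cleanly.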

Upgradeable=True with a RollOutInProgress reason really sounds like it's progressing, and yet Progressing=False.  Also, Upgradeable=True plus a "Please wait" message is pretty odd.  If you wanted folks to wait, I'd expect Upgradeable=False.  Possibly the reason and message are just not getting reset to some "all is well" placeholders when the transition completes?  Also, the timestamps on the conditions are all well before the 4.3.18 -> 4.3.19 update itself.  From a later must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)' | head -n2
2020-05-19T06:32:38Z null 4.3.19 Partial true
2020-05-05T22:08:18Z 2020-05-05T23:35:40Z 4.3.18 Completed true

So it's not clear to me why the monitoring operator would be poking around with conditions at 04:26:32Z, about two hours before the update started at 06:32:38Z.  Possibly in response to an autoscaler or other node activity.
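
The timing gap can be made explicit by comparing each condition's lastTransitionTime with the startedTime of the 4.3.19 history entry. This is a small Python sketch using only the timestamps quoted in this report; the helper names `parse_utc` and `transitions_before` are made up for the example.

```python
from datetime import datetime, timezone

# lastTransitionTime values from the clusteroperator conditions in this report.
TRANSITIONS = {
    "Available": "2020-05-19T02:34:40Z",
    "Progressing": "2020-05-19T02:34:40Z",
    "Degraded": "2020-05-19T02:34:40Z",
    "Upgradeable": "2020-05-19T04:26:32Z",
}

# startedTime of the 4.3.19 entry in the ClusterVersion history.
UPDATE_STARTED = "2020-05-19T06:32:38Z"


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp into an aware datetime."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def transitions_before(transitions, update_started):
    """Map condition type to how long before the update it last changed.

    Conditions that changed at or after the update start are left out.
    """
    start = parse_utc(update_started)
    early = {}
    for cond_type, stamp in transitions.items():
        changed = parse_utc(stamp)
        if changed < start:
            early[cond_type] = start - changed
    return early


if __name__ == "__main__":
    for cond_type, lead in transitions_before(TRANSITIONS, UPDATE_STARTED).items():
        print(f"{cond_type} last changed {lead} before the update started")
```

With these inputs every condition predates the update: Upgradeable by 2:06:06 and the other three by 3:57:58.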

Comment 2 Sergiusz Urbaniak 2020-05-20 11:35:48 UTC
As per: https://coreos.slack.com/archives/C0VMT03S5/p1589961556398800?thread_ts=1589952447.394700&cid=C0VMT03S5

> lili  I understood we should not be setting Upgradeable=False ?

Can you advise, Trevor?

Until clarified, setting low severity.

Comment 3 W. Trevor King 2020-05-20 14:17:37 UTC
Sounds like monitoring doesn't have anything that would call for Upgradeable=False and "you can't bump minor version 4.y -> 4.(y+1) because $THIS would break".  So the fix is probably to pick a reason ("AsExpected" or similar) and message ("This is fine" or similar) and always set those instead of the current "RollOutInProgress" and "Rollout of the monitoring stack is in progress. Please wait until it finishes".  No functional impact, so low priority is appropriate, but it seems like a straightforward fix, and folks like me with my admin hat on would be less confused once the reason/message makes sense with the Upgradeable=True type/status.
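
To illustrate the suggestion above, here is a minimal Python sketch of "always set a fixed reason and message". It is not the operator's actual code: the function name `upgradeable_condition` is hypothetical, the reason and message strings are the placeholder wording from this comment, and keeping lastTransitionTime unchanged while the status stays the same is an assumed convention.

```python
def upgradeable_condition(previous, now):
    """Build an Upgradeable=True condition with a fixed reason and message.

    previous: the existing Upgradeable condition as a dict, or None.
    now: the current time as an RFC 3339 string, supplied by the caller.
    """
    condition = {
        "type": "Upgradeable",
        "status": "True",
        "reason": "AsExpected",     # placeholder reason from the comment above
        "message": "This is fine",  # placeholder message from the comment above
    }
    # Assumption: the transition time only moves when the status changes,
    # so rewriting a stale reason/message alone keeps the old timestamp.
    if previous is not None and previous.get("status") == condition["status"]:
        condition["lastTransitionTime"] = previous["lastTransitionTime"]
    else:
        condition["lastTransitionTime"] = now
    return condition


if __name__ == "__main__":
    stale = {
        "type": "Upgradeable",
        "status": "True",
        "reason": "RollOutInProgress",
        "message": "Rollout of the monitoring stack is in progress. "
                   "Please wait until it finishes.",
        "lastTransitionTime": "2020-05-19T04:26:32Z",
    }
    print(upgradeable_condition(stale, "2020-05-20T14:17:37Z"))
```

Because the status never leaves True in this sketch, only the reason and message are rewritten for an existing condition.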

Comment 4 Lili Cosic 2020-05-25 08:12:13 UTC
We want to modify this from 4.6 onwards; created a task so we don't forget: https://issues.redhat.com/browse/MON-1126. Closing as agreed on Slack.