Bug 1837832 - From a 4.3.18 -> 4.3.19 update: Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-20 05:52 UTC by W. Trevor King
Modified: 2020-05-25 08:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-25 08:12:30 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-05-20 05:52:43 UTC
From an Insights tarball from a 4.3.18 -> 4.3.19 update:

$ tar -xOz config/clusteroperator/monitoring <20200519062637-32ad8cfe89fd45ddb28f2eda2c34936d | jq -r '.status.conditions[] | "  " + .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + " " +.message'
  2020-05-19T02:34:40Z Available=True RollOutDone Successfully rolled out the stack.
  2020-05-19T02:34:40Z Progressing=False  
  2020-05-19T02:34:40Z Degraded=False  
  2020-05-19T04:26:32Z Upgradeable=True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.

The Upgradeable=True with RollOutInProgress really sounds like it's progressing, and yet, Progressing=False.  Also, Upgradeable=True plus a "Please wait" message is pretty odd.  If you wanted folks to wait, I'd expect Upgradeable=False.  Possibly the reason and message are just not getting reset to some "all is well" placeholders when the transition completes?  Also, the timestamps on the conditions are all well before the 4.3.18 -> 4.3.19 update itself.  From a later must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)' | head -n2
2020-05-19T06:32:38Z null 4.3.19 Partial true
2020-05-05T22:08:18Z 2020-05-05T23:35:40Z 4.3.18 Completed true

So not clear to me why the monitoring operator would be poking around with conditions at 04:26:32Z.  Possibly in response to an autoscaler or other node activity.
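
Both observations above can be checked mechanically. The following is a minimal Python sketch, not part of the monitoring operator or any OpenShift tooling: the function names are hypothetical, and the condition data is copied by hand from the jq output above (the empty reason and message fields are simply omitted).

```python
from datetime import datetime, timezone

# Conditions copied from the Insights tarball output above.
CONDITIONS = [
    {"type": "Available", "status": "True", "reason": "RollOutDone",
     "message": "Successfully rolled out the stack.",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Progressing", "status": "False",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Degraded", "status": "False",
     "lastTransitionTime": "2020-05-19T02:34:40Z"},
    {"type": "Upgradeable", "status": "True", "reason": "RollOutInProgress",
     "message": "Rollout of the monitoring stack is in progress. "
                "Please wait until it finishes.",
     "lastTransitionTime": "2020-05-19T04:26:32Z"},
]


def parse_time(value):
    """Parse a UTC timestamp such as 2020-05-19T02:34:40Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def has_stale_rollout_reason(conditions):
    """True when Upgradeable=True still reports RollOutInProgress
    although Progressing=False (hypothetical helper)."""
    by_type = {c["type"]: c for c in conditions}
    upgradeable = by_type.get("Upgradeable", {})
    progressing = by_type.get("Progressing", {})
    return (
        upgradeable.get("status") == "True"
        and upgradeable.get("reason") == "RollOutInProgress"
        and progressing.get("status") == "False"
    )


def transitions_before(conditions, started_time):
    """Types of conditions whose last transition predates started_time
    (hypothetical helper)."""
    start = parse_time(started_time)
    return [c["type"] for c in conditions
            if parse_time(c["lastTransitionTime"]) < start]
```

Calling `has_stale_rollout_reason(CONDITIONS)` flags the mismatch between the Upgradeable reason and Progressing=False, and `transitions_before(CONDITIONS, "2020-05-19T06:32:38Z")` lists every condition type, since all four transitions predate the startedTime of the 4.3.19 update.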

Comment 2 Sergiusz Urbaniak 2020-05-20 11:35:48 UTC
As per: https://coreos.slack.com/archives/C0VMT03S5/p1589961556398800?thread_ts=1589952447.394700&cid=C0VMT03S5

> lili  I understood we should not be setting Upgradeable=False ?

Can you advise, Trevor?

Until clarified, setting low severity.

Comment 3 W. Trevor King 2020-05-20 14:17:37 UTC
Sounds like monitoring doesn't have anything that would call for Upgradeable=False and "you can't bump minor version 4.y -> 4.(y+1) because $THIS would break".  So the fix is probably to pick a reason ("AsExpected" or similar) and message ("This is fine" or similar) and always set those instead of the current "RollOutInProgress" and "Rollout of the monitoring stack is in progress. Please wait until it finishes".  No functional impact, so low priority is appropriate, but it seems like a straightforward fix, and folks like me with my admin hat on would be less confused once the reason/message makes sense with the Upgradeable=True type/status.
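
For illustration only, the suggested behavior could look roughly like the Python sketch below. This is an assumption-laden sketch, not the operator's actual code (the operator is a separate codebase): `normalize_upgradeable` is a hypothetical name, and "AsExpected" / "This is fine" are just the placeholder reason and message floated in this comment.

```python
def normalize_upgradeable(condition):
    """Return a copy of a condition dict; when it is Upgradeable=True,
    overwrite reason/message with neutral placeholders (hypothetical helper)."""
    updated = dict(condition)
    if updated.get("type") == "Upgradeable" and updated.get("status") == "True":
        # Placeholder values suggested in the comment above, not confirmed names.
        updated["reason"] = "AsExpected"
        updated["message"] = "This is fine"
    return updated
```

Other condition types, and an Upgradeable condition with any status other than True, pass through unchanged, so a real blocker would still keep its own reason and message.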

Comment 4 Lili Cosic 2020-05-25 08:12:13 UTC
We want to modify this from 4.6 onwards; created a task so we don't forget: https://issues.redhat.com/browse/MON-1126. Closing as agreed on Slack.

