Bug 1954515
| Summary: | The `csv_upgrade_count` metric value is inaccurate | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> |
| Component: | OLM | Assignee: | Anik <anbhatta> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | anbhatta, ankithom, davegord, kuiwang, tbuskey |
| Version: | 4.8 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-01 14:31:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Jian Zhang
2021-04-28 10:17:37 UTC
@jiazha it looks like the problem is that the counter is incremented while the replaced CSV is being transitioned from the Succeeded phase to Replaced. This is problematic because, if the controller fails to transition the CSV from state A to state B while reconciling it, the job is re-queued, i.e. the controller sees the CSV in the Succeeded phase again and increments the counter again. That is why you're seeing the incorrect value for `csv_upgrade_count`.
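The double counting described above can be reproduced with a minimal Go sketch. This is not OLM's code: the `CSV` type, `flakyUpdater`, and both reconcile functions are hypothetical names introduced only for illustration. The second variant, which increments only after the transition is persisted, is one possible way to avoid the problem and is not something proposed in this bug.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a simplified stand-in for a CSV phase.
type Phase string

const (
	Succeeded Phase = "Succeeded"
	Replaced  Phase = "Replaced"
)

// CSV is a hypothetical, stripped-down model of a ClusterServiceVersion.
type CSV struct {
	Name  string
	Phase Phase
}

// errUpdate simulates a failed status update.
var errUpdate = errors.New("status update failed")

// flakyUpdater returns an update function whose first `failures` calls fail.
func flakyUpdater(failures int) func(*CSV, Phase) error {
	return func(c *CSV, p Phase) error {
		if failures > 0 {
			failures--
			return errUpdate
		}
		c.Phase = p
		return nil
	}
}

// reconcileEager increments the counter before the transition is persisted,
// so every retry of the same CSV is counted again.
func reconcileEager(c *CSV, update func(*CSV, Phase) error, count *int) error {
	if c.Phase != Succeeded {
		return nil
	}
	*count++
	return update(c, Replaced)
}

// reconcileAfterCommit increments only once the transition has succeeded.
func reconcileAfterCommit(c *CSV, update func(*CSV, Phase) error, count *int) error {
	if c.Phase != Succeeded {
		return nil
	}
	if err := update(c, Replaced); err != nil {
		return err
	}
	*count++
	return nil
}

// runUntilDone re-queues one CSV until reconcile returns nil and
// reports the final counter value.
func runUntilDone(reconcile func(*CSV, func(*CSV, Phase) error, *int) error, failures int) int {
	c := &CSV{Name: "example.v1", Phase: Succeeded}
	update := flakyUpdater(failures)
	count := 0
	for reconcile(c, update, &count) != nil {
		// A failed reconcile is re-queued and retried.
	}
	return count
}

func main() {
	fmt.Println("eager:", runUntilDone(reconcileEager, 2))
	fmt.Println("after commit:", runUntilDone(reconcileAfterCommit, 2))
}
```

With two simulated update failures, the eager variant reports 3 upgrades for a single CSV, while the after-commit variant reports 1.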
While that is a problem, I'm of the opinion that this metric should be removed altogether. The semantics of this metric aren't well defined to begin with, and if I am not wrong, nobody really uses it either. The metrics actually in use today are `csv_succeeded` and `csv_abnormal`. Could you describe how you were using this metric and how you would be affected if it were removed?
If this information, i.e. how many CSVs have been upgraded in the cluster, really needs to be surfaced, then our best bet would be to repurpose the `csv_succeeded` metric to include a `replaces` label.
That is, the labels today are {NAMESPACE_LABEL, NAME_LABEL, VERSION_LABEL}, and the set can be extended to include a REPLACES label. Getting the count of upgrades that took place in the cluster would then look like `count(csv_succeeded{replaces!=""})`.
However, the caveat here is that we'd have to justify including the label, i.e. answering the question "how many CSVs have been upgraded in this cluster" would have to be really necessary to warrant an increase in the cardinality of this metric (there's a limit to the amount of data that we want to export out of each OpenShift cluster and aggregate in our telemeters).
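To make the label proposal concrete, here is a small Go sketch of what counting series with a non-empty `replaces` label amounts to. The `Series` type and the sample data are hypothetical and exist only for illustration. They do not reflect how OLM or Prometheus store series.

```go
package main

import "fmt"

// Series is a hypothetical model of one csv_succeeded time series,
// identified by its label set.
type Series struct {
	Namespace string
	Name      string
	Version   string
	Replaces  string // empty when the CSV did not replace another one
}

// countUpgrades returns the number of series with a non-empty replaces label.
func countUpgrades(series []Series) int {
	n := 0
	for _, s := range series {
		if s.Replaces != "" {
			n++
		}
	}
	return n
}

// sampleSeries returns made-up data: two upgraded CSVs and one fresh install.
func sampleSeries() []Series {
	return []Series{
		{Namespace: "ns-a", Name: "example.v2", Version: "2.0.0", Replaces: "example.v1"},
		{Namespace: "ns-b", Name: "other.v1", Version: "1.0.0"},
		{Namespace: "ns-c", Name: "third.v3", Version: "3.0.0", Replaces: "third.v2"},
	}
}

func main() {
	fmt.Println(countUpgrades(sampleSeries()))
}
```

Each distinct `Replaces` value creates a separate series, which is where the cardinality cost mentioned above comes from.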
Closing this as WONTFIX since this is fairly low impact. Please re-open if the assessment is incorrect.