Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1954515

Summary: The `csv_upgrade_count` metric value is inaccurate
Product: OpenShift Container Platform Reporter: Jian Zhang <jiazha>
Component: OLMAssignee: Anik <anbhatta>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED WONTFIX Docs Contact:
Severity: low    
Priority: low CC: anbhatta, ankithom, davegord, kuiwang, tbuskey
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-01 14:31:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jian Zhang 2021-04-28 10:17:37 UTC
Description of problem:
After upgrading an operator once, the `csv_upgrade_count` metric value changed to 3 from 0.

Version-Release number of selected component (if applicable):
Cluster version is 4.8.0-0.nightly-2021-04-26-151924

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.8.
[jzhang@dhcp-140-36 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-26-151924   True        False         5h13m   Cluster version is 4.8.0-0.nightly-2021-04-26-151924

2. Create two routes so that the user can access the OLM metric services.
[jzhang@dhcp-140-36 ~]$ oc project openshift-operator-lifecycle-manager
Now using project "openshift-operator-lifecycle-manager" on server "https://api.jiazha28.qe.devcluster.openshift.com:6443".
[jzhang@dhcp-140-36 ~]$ oc get svc
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
catalog-operator-metrics   ClusterIP   172.30.101.161   <none>        8081/TCP   145m
olm-operator-metrics       ClusterIP   172.30.130.76    <none>        8081/TCP   145m
packageserver-service      ClusterIP   172.30.61.171    <none>        5443/TCP   111m

[jzhang@dhcp-140-36 ~]$ oc create route reencrypt catalog-metrics --service catalog-operator-metrics
route.route.openshift.io/catalog-metrics created
[jzhang@dhcp-140-36 ~]$ oc create route reencrypt olm-metrics --service olm-operator-metrics
route.route.openshift.io/olm-metrics created

[jzhang@dhcp-140-36 ~]$ oc get route
NAME              HOST/PORT                                                                                        PATH   SERVICES                   PORT            TERMINATION   WILDCARD
catalog-metrics   catalog-metrics-openshift-operator-lifecycle-manager.apps.jiazha28.qe.devcluster.openshift.com          catalog-operator-metrics   https-metrics   reencrypt     None
olm-metrics       olm-metrics-openshift-operator-lifecycle-manager.apps.jiazha28.qe.devcluster.openshift.com              olm-operator-metrics       https-metrics   reencrypt     None

3. Check the `csv_upgrade_count` metric, the value is 0, as expected.
[jzhang@dhcp-140-36 ~]$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://olm-metrics-openshift-operator-lifecycle-manager.apps.jiazha28.qe.devcluster.openshift.com/metrics | grep -E "^csv"
csv_count 1
csv_succeeded{name="packageserver",namespace="openshift-operator-lifecycle-manager",version="0.17.0"} 1
csv_upgrade_count 0

4, Subscribe to etcdoperator.v0.9.2 operator, and set automatic upgrade. Like:
1) First, we need to create an OperatorGroup first.
mac:~ jianzhang$ cat og.yaml 
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-og
  namespace: default
spec:
  targetNamespaces:
  - default
mac:~ jianzhang$ oc get og -n default
NAME      AGE
test-og   5h17m
2) And then, create a subscription. Like below:
mac:~ jianzhang$ cat sub-43-upgrade.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-upgrade-test
  namespace: default
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
mac:~ jianzhang$ oc get sub -n default
NAME                PACKAGE   SOURCE                CHANNEL
etcd-upgrade-test   etcd      community-operators   singlenamespace-alpha

Now, we can see this operator is upgrading to etcdoperator.v0.9.4 version.
[jzhang@dhcp-140-36 ~]$ oc get sub -n default
NAME        PACKAGE   SOURCE                CHANNEL
etcd-test   etcd      community-operators   singlenamespace-alpha
[jzhang@dhcp-140-36 ~]$ oc get ip -n default
NAME            CSV                   APPROVAL    APPROVED
install-gr6lz   etcdoperator.v0.9.2   Automatic   true
[jzhang@dhcp-140-36 ~]$ oc get csv -n default
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded
[jzhang@dhcp-140-36 ~]$ oc get ip -n default
NAME            CSV                   APPROVAL    APPROVED
install-d99v6   etcdoperator.v0.9.4   Automatic   true
install-gr6lz   etcdoperator.v0.9.2   Automatic   true
[jzhang@dhcp-140-36 ~]$ oc get csv -A
NAMESPACE                              NAME                  DISPLAY          VERSION   REPLACES              PHASE
default                                etcdoperator.v0.9.4   etcd             0.9.4     etcdoperator.v0.9.2   Succeeded
openshift-operator-lifecycle-manager   packageserver         Package Server   0.17.0                          Succeeded

5, Check the `csv_upgrade_count` metric again.

Actual results:
The value is 3, but I only upgrade once, it should be 1.
 
[jzhang@dhcp-140-36 ~]$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://olm-metrics-openshift-operator-lifecycle-manager.apps.jiazha28.qe.devcluster.openshift.com/metrics | grep -E "^csv"
csv_count 2
csv_succeeded{name="etcdoperator.v0.9.4",namespace="default",version="0.9.4"} 1
csv_succeeded{name="packageserver",namespace="openshift-operator-lifecycle-manager",version="0.17.0"} 1
csv_upgrade_count 3

Expected results:
Only etcdoperator.v0.9.2 upgraded to etcdoperator.v0.9.4, upgrade once, the "csv_upgrade_count" should be 1, not 3.

Additional info:
[jzhang@dhcp-140-36 ~]$ oc get nodes
NAME                                         STATUS   ROLES           AGE     VERSION
ip-10-0-134-117.us-east-2.compute.internal   Ready    master,worker   5h43m   v1.21.0-rc.0+6143dea
ip-10-0-175-14.us-east-2.compute.internal    Ready    master,worker   5h43m   v1.21.0-rc.0+6143dea
ip-10-0-219-252.us-east-2.compute.internal   Ready    master,worker   5h43m   v1.21.0-rc.0+6143dea

Comment 1 Anik 2021-05-25 22:22:10 UTC
@jiazha looks like the problem is that the counter is increased during the process of transitioning the replaced CSV's state from Succeeded to Replaced. That is problematic for the reason that when the controller is reconciling the CSV, if the controller failed to transition the CSV's state from state A->B, that job is re-queued, i.e the controller sees the CSV in the Succeeded phase again, and increases the counter, again. Which is why you're seeing the incorrect value for `csv_upgrade_count`

While it is problematic, I'm of the opinion that this metric should be removed altogether. The semantics of this metric isn't really well defined to begin with, and if I am not wrong this metric isn't really used by anyone either. The metrics really in use everywhere today are the `csv_succeeded` and `csv_abnormal` metrics. Is it possible to provide a picture of how you were using this metric/how you will be affected if this metric was removed. 

If this information, i.e how many CSVs have been upgraded in the cluster really needs to be surfaced, then our best bet would be to re-purpose the `csv_succeeded` metric to include a `replaces` label. 
i.e the labels today are {NAMESPACE_LABEL, NAME_LABEL, VERSION_LABEL}, and it can be enhanced to include the REPLACES label. Getting the count of upgrades that took place in the cluster would then look like count(csv_succeeded{replaces=*}). 

However, caveat here is that we'd really have to justify the need to include the label, i.e it'd have to be really necessary to answer the question "how many CSVs have been upgraded in this cluster" to warrant an increase in the cardinality of this metric(there's limitation to the amount of data that we want to export out of each openshift cluster and aggregate them in our telemeters).

Comment 2 Anik 2021-06-01 14:31:59 UTC
Closing this as WONTFIX since this is fairly low impact. Please re-open if the assessment is incorrect