test: [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics

A common job result:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.5/1296868367386284032

Includes the following error:

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "cluster_installer{type!=\"\",invoker!=\"\"}": {
            s: "promQL query: cluster_installer{type!=\"\",invoker!=\"\"} had reported incorrect results:\n[]",
        },
    }
to be empty
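For context, the assertion in helpers.go amounts to: run the promQL query cluster_installer{type!="",invoker!=""} and fail if it returns no samples, i.e. if the CVO exports cluster_installer without non-empty 'type' and 'invoker' labels. A minimal sketch of that logic (not the actual origin test code; check_required_labels is a hypothetical helper that applies the label filter client-side to a Prometheus /api/v1/query response):

```python
import json

def check_required_labels(api_response: str, required=("type", "invoker")):
    """Return a list of error strings (empty list == test passes)."""
    data = json.loads(api_response)
    results = data.get("data", {}).get("result", [])
    # Keep only samples where every required label is present and non-empty,
    # mimicking the server-side matchers type!="" and invoker!="".
    matching = [
        r for r in results
        if all(r["metric"].get(label) for label in required)
    ]
    if not matching:
        # Mirrors the error string seen in the CI failure above.
        return ['promQL query: cluster_installer{type!="",invoker!=""} '
                'had reported incorrect results:\n[]']
    return []

# Metric exported without the labels -> reproduces the failure:
bad = json.dumps({"data": {"result": [
    {"metric": {"__name__": "cluster_installer"}, "value": [0, "1"]}]}})
# Metric carrying both labels -> the check passes:
good = json.dumps({"data": {"result": [
    {"metric": {"__name__": "cluster_installer",
                "type": "openshift-install", "invoker": "user"},
     "value": [0, "1"]}]}})
```

The flake described below is exactly the first case: the CVO intermittently exposes cluster_installer with empty labels, so the filtered result set is empty.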
Created attachment 1712315 [details]
Prometheus graph for cluster_installer metric

The failed job [1] is an issue with the cluster-version operator not reporting the cluster_installer metric with the expected type and invoker labels. Looking at the Prometheus data dump, I can see that most of the time CVO doesn't expose the metric with the expected labels (see attached screenshot), hence the test fails randomly. The same issue was already discovered while investigating another bug [2] and, quickly looking at the last failed jobs with the same test, other failures report only ([3] for instance).

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.5/1296868367386284032
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1852919#c12
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1296824997049798656
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=168h&context=0&type=junit&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics' | grep 'failures match' | sort
endurance-e2e-aws-4.5 - 5 runs, 100% failed, 40% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 143 runs, 24% failed, 21% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 138 runs, 30% failed, 7% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.5 - 7 runs, 100% failed, 14% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 81 runs, 79% failed, 2% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 136 runs, 55% failed, 3% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 30 runs, 83% failed, 4% of failures match
rehearse-11093-pull-ci-openshift-cluster-network-operator-master-e2e-vsphere - 2 runs, 50% failed, 100% of failures match
...
rehearse-11247-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 5 runs, 20% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5 - 28 runs, 46% failed, 8% of failures match
release-openshift-ocp-installer-e2e-aws-4.5 - 33 runs, 33% failed, 36% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 110 runs, 55% failed, 7% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.5 - 14 runs, 43% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 62 runs, 71% failed, 2% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.6 - 63 runs, 95% failed, 2% of failures match
release-openshift-ocp-installer-e2e-azure-4.5 - 23 runs, 61% failed, 7% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 23 runs, 17% failed, 25% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 62 runs, 90% failed, 4% of failures match
release-openshift-ocp-installer-e2e-metal-compact-4.5 - 14 runs, 7% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 38 runs, 39% failed, 7% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 67 runs, 94% failed, 2% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.5 - 16 runs, 31% failed, 40% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 30 runs, 90% failed, 4% of failures match
release-openshift-origin-installer-e2e-aws-compact-4.5 - 4 runs, 25% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 102 runs, 72% failed, 3% of failures match
release-openshift-origin-installer-launch-gcp - 465 runs, 53% failed, 0% of failures match

Looks like 4.5 is being hit especially hard by this.
Hah, looking at just the past day:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&context=0&type=junit&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 20 runs, 20% failed, 25% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 10 runs, 100% failed, 10% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack - 11 runs, 100% failed, 9% of failures match
pull-ci-openshift-sdn-master-e2e-gcp - 8 runs, 25% failed, 50% of failures match
pull-ci-openshift-sriov-network-device-plugin-master-e2e-aws - 3 runs, 33% failed, 100% of failures match
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 7 runs, 57% failed, 50% of failures match
rehearse-11148-pull-ci-openshift-machine-config-operator-release-4.6-e2e-vsphere - 3 runs, 33% failed, 100% of failures match
rehearse-11222-pull-ci-openshift-installer-master-e2e-aws-upi - 2 runs, 50% failed, 100% of failures match
rehearse-11247-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 5 runs, 20% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 14 runs, 93% failed, 8% of failures match

So someone did something that fixed this; great, except maybe on those OVN and machine-os-content jobs.
https://github.com/openshift/release/pull/11333 should help with debugging remaining occurrences.
As per comment #5, setting severity to low.
This test fails in e2e-metal-ipi as of today: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/1301924213296205824
Still dunno what's going on here. Hopefully we'll figure out our higher-priority bugs and get down to this one next sprint.
Today it failed in e2e-aws: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/612/pull-ci-openshift-cluster-image-registry-operator-release-4.5-e2e-aws/1309107688537329664
Comment 9 is still current.
Still working on higher-priority bugs.
This happens relatively frequently which yields a lot of noise in CI jobs but has no clear customer impact, so increasing priority slightly but leaving severity at low.
We still see this in CI: https://search.ci.openshift.org/?search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job , hence moving to medium severity.
Verified this bug with 4.8.0-0.nightly-2021-03-22-104536, PASS. Triggered an IPI install; 'invoker' and 'type' are set correctly:

[root@preserve-jialiu-ansible ~]# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=cluster_installer' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cluster_installer",
          "endpoint": "metrics",
          "instance": "10.0.223.119:9099",
          "invoker": "user",
          "job": "cluster-version-operator",
          "namespace": "openshift-cluster-version",
          "pod": "cluster-version-operator-85699c5bf8-njf4t",
          "service": "cluster-version-operator",
          "type": "openshift-install",
          "version": "v4.8.0"
        },
        "value": [
          1616488052.303,
          "1"
        ]
      }
    ]
  }
}

[root@preserve-jialiu-ansible ~]# oc get cm openshift-install-manifests -n openshift-config -o json
{
  "apiVersion": "v1",
  "data": {
    "invoker": "user",
    "version": "v4.8.0"
  },
  "kind": "ConfigMap",
  "metadata": {
    "creationTimestamp": "2021-03-23T07:16:19Z",
    "managedFields": [
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:data": {
            ".": {},
            "f:invoker": {},
            "f:version": {}
          }
        },
        "manager": "cluster-bootstrap",
        "operation": "Update",
        "time": "2021-03-23T07:16:19Z"
      }
    ],
    "name": "openshift-install-manifests",
    "namespace": "openshift-config",
    "resourceVersion": "1503",
    "uid": "ee3f8244-ef4f-424f-b402-e221beffb5fd"
  }
}

[root@preserve-jialiu-ansible ~]# oc get cm openshift-install -n openshift-config -o json
{
  "apiVersion": "v1",
  "data": {
    "invoker": "user",
    "version": "v4.8.0"
  },
  "kind": "ConfigMap",
  "metadata": {
    "creationTimestamp": "2021-03-23T07:16:19Z",
    "managedFields": [
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:data": {
            ".": {},
            "f:invoker": {},
            "f:version": {}
          }
        },
        "manager": "cluster-bootstrap",
        "operation": "Update",
        "time": "2021-03-23T07:16:19Z"
      }
    ],
    "name": "openshift-install",
    "namespace": "openshift-config",
    "resourceVersion": "1510",
    "uid": "b348a3f2-d731-43d5-80f2-5f56a143cb56"
  }
}

Also searched https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics+&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and searched for "fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected" on that page; no 4.8 failures due to this error.
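The verification above cross-checks the metric against the openshift-install ConfigMap in openshift-config by eye; the metric's 'invoker' label appears to mirror the ConfigMap's data.invoker. A small sketch of that consistency check (metric_matches_configmap is a hypothetical helper; the sample values are taken from the verification output above):

```python
def metric_matches_configmap(metric_labels: dict, configmap_data: dict) -> bool:
    """Fixed behaviour, as observed above: 'type' and 'invoker' are both
    non-empty, and 'invoker' matches the ConfigMap data written by the
    installer into the openshift-config namespace."""
    return (
        bool(metric_labels.get("type"))
        and bool(metric_labels.get("invoker"))
        and metric_labels.get("invoker") == configmap_data.get("invoker")
    )

# Values from the Prometheus query and ConfigMap output above:
labels = {"type": "openshift-install", "invoker": "user", "version": "v4.8.0"}
cm_data = {"invoker": "user", "version": "v4.8.0"}
```

With the pre-fix behaviour (empty labels), the same check returns False, which is what the e2e test was flagging.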
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438