Bug 1871303
| Summary: | [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Maru Newby <mnewby> | ||||
| Component: | Cluster Version Operator | Assignee: | Jack Ottofaro <jack.ottofaro> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4.5 | CC: | akiselev, alegrand, anpicker, aos-bugs, astoycos, erooth, jack.ottofaro, jokerman, kakkoyun, lcosic, lmohanty, obulatov, pkrupa, spasquie, surbania, vrutkovs, wking, yanyang | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.8.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Previously, CI jobs could fail because the installer metric was generated with invoker set to "" due to a race condition at CVO startup. The referenced PR fixes this.
|
Story Points: | --- | ||||
| Clone Of: | Environment: |
[sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics
|
|||||
| Last Closed: | 2021-07-27 22:32:47 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Maru Newby
2020-08-22 02:50:32 UTC
Created attachment 1712315: Prometheus graph for cluster_installer metric

The failed job [1] is an issue with the cluster version operator that doesn't report the cluster_installer metrics with the expected type and invoker labels. Looking at the Prometheus data dump, I can see that most of the time, CVO doesn't expose the metric with the expected labels (see attached screenshot) hence it randomly fails the test. The same issue was already discovered while investigating another bug [2] and quickly looking at the last failed jobs with the same test, other failures report only ([3] for instance).

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.5/1296868367386284032
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1852919#c12
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1296824997049798656

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=168h&context=0&type=junit&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics' | grep 'failures match' | sort
endurance-e2e-aws-4.5 - 5 runs, 100% failed, 40% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 143 runs, 24% failed, 21% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 138 runs, 30% failed, 7% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.5 - 7 runs, 100% failed, 14% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 81 runs, 79% failed, 2% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 136 runs, 55% failed, 3% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 30 runs, 83% failed, 4% of failures match
rehearse-11093-pull-ci-openshift-cluster-network-operator-master-e2e-vsphere - 2 runs, 50% failed, 100% of failures match
...
rehearse-11247-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 5 runs, 20% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.5 - 28 runs, 46% failed, 8% of failures match
release-openshift-ocp-installer-e2e-aws-4.5 - 33 runs, 33% failed, 36% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 110 runs, 55% failed, 7% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.5 - 14 runs, 43% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 62 runs, 71% failed, 2% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.6 - 63 runs, 95% failed, 2% of failures match
release-openshift-ocp-installer-e2e-azure-4.5 - 23 runs, 61% failed, 7% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 23 runs, 17% failed, 25% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 62 runs, 90% failed, 4% of failures match
release-openshift-ocp-installer-e2e-metal-compact-4.5 - 14 runs, 7% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 38 runs, 39% failed, 7% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 67 runs, 94% failed, 2% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.5 - 16 runs, 31% failed, 40% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 30 runs, 90% failed, 4% of failures match
release-openshift-origin-installer-e2e-aws-compact-4.5 - 4 runs, 25% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 102 runs, 72% failed, 3% of failures match
release-openshift-origin-installer-launch-gcp - 465 runs, 53% failed, 0% of failures match

Looks like 4.5 is being hit especially hard by this.
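To compare jobs in the listing above, it can help to turn the two percentages on each line into a rough count of runs that hit this test failure. The following is a minimal sketch, not part of the CI tooling; `LINE_RE` and `estimate_matching_runs` are hypothetical names introduced here, and the result is only an estimate because the search page reports rounded percentages.

```python
import re

# Matches one line of the search output, for example:
#   release-openshift-ocp-installer-e2e-aws-4.5 - 33 runs, 33% failed, 36% of failures match
LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def estimate_matching_runs(line):
    """Return (job, runs, estimated matching runs), or None for non-matching lines.

    The estimate is runs * failed% * match%, rounded to the nearest integer.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. the "..." elision lines in the listing
    runs = int(m["runs"])
    failed_runs = runs * int(m["failed"]) / 100
    matching_runs = failed_runs * int(m["match"]) / 100
    return m["job"], runs, round(matching_runs)


if __name__ == "__main__":
    sample = [
        "release-openshift-ocp-installer-e2e-aws-4.5 - 33 runs, 33% failed, 36% of failures match",
        "release-openshift-ocp-installer-e2e-metal-compact-4.5 - 14 runs, 7% failed, 100% of failures match",
        "...",
    ]
    for entry in sample:
        print(estimate_matching_runs(entry))
```

For the e2e-aws-4.5 line this works out to 33 × 0.33 × 0.36 ≈ 3.9, so roughly 4 of the 33 runs failed on this test.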
Hah, looking at just the past day:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&context=0&type=junit&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 20 runs, 20% failed, 25% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 10 runs, 100% failed, 10% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack - 11 runs, 100% failed, 9% of failures match
pull-ci-openshift-sdn-master-e2e-gcp - 8 runs, 25% failed, 50% of failures match
pull-ci-openshift-sriov-network-device-plugin-master-e2e-aws - 3 runs, 33% failed, 100% of failures match
pull-ci-openshift-sriov-network-operator-master-e2e-aws - 7 runs, 57% failed, 50% of failures match
rehearse-11148-pull-ci-openshift-machine-config-operator-release-4.6-e2e-vsphere - 3 runs, 33% failed, 100% of failures match
rehearse-11222-pull-ci-openshift-installer-master-e2e-aws-upi - 2 runs, 50% failed, 100% of failures match
rehearse-11247-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 5 runs, 20% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 14 runs, 93% failed, 8% of failures match

So someone did something that fixed this; great, except maybe on those OVN and machine-os-content jobs.

https://github.com/openshift/release/pull/11333 should help with debugging remaining occurrences.

As per comment #5, setting sev as low.

This test fails in e2e-metal-ipi as of today: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/1301924213296205824

Still dunno what's going on here. Hopefully we'll figure out our higher-priority bugs and get down to this one next sprint.
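The failure mode this report describes, a cluster_installer sample whose type or invoker label is missing or empty, can be expressed as a small check on a Prometheus instant-query response. This is a hedged sketch only: `installer_labels_ok` is a hypothetical helper, not the logic of the e2e test itself, and it assumes the standard Prometheus HTTP API response shape (`status`, `data.result[].metric`).

```python
import json


def installer_labels_ok(response_text):
    """Return True if every cluster_installer sample has non-empty type and invoker labels.

    response_text is the JSON body returned by a Prometheus instant query
    for cluster_installer. An unsuccessful or empty result counts as a failure.
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        return False
    samples = body.get("data", {}).get("result", [])
    if not samples:
        return False
    for sample in samples:
        labels = sample.get("metric", {})
        # An empty string is treated the same as a missing label.
        if not labels.get("type") or not labels.get("invoker"):
            return False
    return True


if __name__ == "__main__":
    good = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"__name__": "cluster_installer",
                        "type": "openshift-install", "invoker": "user"},
             "value": [0, "1"]},
        ]},
    })
    print(installer_labels_ok(good))
```

Treating an empty label as a failure mirrors the symptom in the Doc Text, where the metric was generated with invoker set to "".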
Today it failed in e2e-aws: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/612/pull-ci-openshift-cluster-image-registry-operator-release-4.5-e2e-aws/1309107688537329664

Comment 9 is still current. Still working on higher-priority bugs.

This happens relatively frequently, which yields a lot of noise in CI jobs but has no clear customer impact, so increasing priority slightly but leaving severity at low.

We still see this in CI https://search.ci.openshift.org/?search=%5C%5Bsig-instrumentation%5C%5D+Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job , hence moving to medium sev.

Verified this bug with 4.8.0-0.nightly-2021-03-22-104536, PASS.

Triggered an IPI install; 'invoker' and 'type' are set correctly.

[root@preserve-jialiu-ansible ~]# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=cluster_installer' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cluster_installer",
          "endpoint": "metrics",
          "instance": "10.0.223.119:9099",
          "invoker": "user",
          "job": "cluster-version-operator",
          "namespace": "openshift-cluster-version",
          "pod": "cluster-version-operator-85699c5bf8-njf4t",
          "service": "cluster-version-operator",
          "type": "openshift-install",
          "version": "v4.8.0"
        },
        "value": [
          1616488052.303,
          "1"
        ]
      }
    ]
  }
}

[root@preserve-jialiu-ansible ~]# oc get cm openshift-install-manifests -n openshift-config -o json
{
  "apiVersion": "v1",
  "data": {
    "invoker": "user",
    "version": "v4.8.0"
  },
  "kind": "ConfigMap",
  "metadata": {
    "creationTimestamp": "2021-03-23T07:16:19Z",
    "managedFields": [
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:data": {
            ".": {},
            "f:invoker": {},
            "f:version": {}
          }
        },
        "manager": "cluster-bootstrap",
        "operation": "Update",
        "time": "2021-03-23T07:16:19Z"
      }
    ],
    "name": "openshift-install-manifests",
    "namespace": "openshift-config",
    "resourceVersion": "1503",
    "uid": "ee3f8244-ef4f-424f-b402-e221beffb5fd"
  }
}

[root@preserve-jialiu-ansible ~]# oc get cm openshift-install -n openshift-config -o json
{
  "apiVersion": "v1",
  "data": {
    "invoker": "user",
    "version": "v4.8.0"
  },
  "kind": "ConfigMap",
  "metadata": {
    "creationTimestamp": "2021-03-23T07:16:19Z",
    "managedFields": [
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:data": {
            ".": {},
            "f:invoker": {},
            "f:version": {}
          }
        },
        "manager": "cluster-bootstrap",
        "operation": "Update",
        "time": "2021-03-23T07:16:19Z"
      }
    ],
    "name": "openshift-install",
    "namespace": "openshift-config",
    "resourceVersion": "1510",
    "uid": "b348a3f2-d731-43d5-80f2-5f56a143cb56"
  }
}

Also searched https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+important+platform+topology+metrics+&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and looked for "fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected" on the page; no 4.8 failure is due to this error.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438