Bug 1853007 - Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1854126 1855092
Depends On:
Blocks:
 
Reported: 2020-07-01 17:55 UTC by Corey Daley
Modified: 2020-10-27 16:12 UTC
CC List: 18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Last Closed: 2020-10-27 16:11:46 UTC
Target Upstream Version:
Embargoed:




Links:
* GitHub: openshift/prometheus-operator pull 79 (closed) - Bug 1853007: Revert "Bug 1806541: Update prometheus-operator to 0.40.0" (last updated 2020-11-11 17:37:39 UTC)
* Red Hat Product Errata: RHBA-2020:4196 (last updated 2020-10-27 16:12:05 UTC)

Comment 1 W. Trevor King 2020-07-02 04:51:35 UTC
Example job failed this test-case with:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:3526]: Unexpected error:
    <*errors.errorString | 0xc000275410>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

which is not a very useful error message. On the other hand, 4.2 is very close to EOL, so it's not clear to me how much effort is worth putting into improving the test suite. If this is still failing in newer releases, it might be worth improving the error message to make it more obvious what went wrong. Stdout from the test-case includes:

Jul  1 09:08:31.042: INFO: Creating new exec pod
Jul  1 09:13:31.094: INFO: Unexpected error occurred: timed out waiting for the condition
...
Jul  1 09:13:31.101: INFO: POD           NODE                                                      PHASE    GRACE  CONDITIONS
Jul  1 09:13:31.101: INFO: execpodtm9mn  ci-op--xl4zs-w-c-vd6mz.c.openshift-gce-devel-ci.internal  Pending         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-07-01 09:08:31 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-07-01 09:08:31 +0000 UTC ContainersNotReady containers with unready status: [exec]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-07-01 09:08:31 +0000 UTC ContainersNotReady containers with unready status: [exec]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-07-01 09:08:31 +0000 UTC  }]

so the test should probably report an error like "timed out waiting for exec pod", with bonus points for summarizing the pod status so we can see why it isn't ready yet. I'm pretty sure this is not a monitoring failure for this cluster; maybe the node folks would be able to provide more guidance about why exec took so long to get going? There might be something in the node's kubelet logs in the artifacts.
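
For anyone triaging a similar hang, a rough sketch of how to see why an exec pod never went Ready on a live cluster (the pod name below is the one from this particular job and is only illustrative; substitute your own namespace and pod):

$ # Dump the pod's conditions, one per line, to see which gate is stuck
$ oc get pod execpodtm9mn -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason} {.message}{"\n"}{end}'
$ # The events section usually says why the container never started
$ oc describe pod execpodtm9mn | sed -n '/^Events:/,$p'

If the events are inconclusive, the node's kubelet journal (available in the CI artifacts, or via oc adm node-logs if the cluster is still up) is the next place to look.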

Comment 2 Simon Pasquier 2020-07-03 09:23:16 UTC
From https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/1278954465306611712, the telemeter client logs show failures due to rate limits:

level=error caller=forwarder.go:268 ts=2020-07-03T08:23:40.726596188Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:24:41.404550716Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:25:42.051516075Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:26:42.700226791Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:27:43.307409111Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:28:43.914615154Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:29:44.480537147Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:30:45.094740177Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:31:45.694823334Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:32:46.249462758Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:33:46.789579718Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
level=error caller=forwarder.go:268 ts=2020-07-03T08:34:47.416487769Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 429:\nrate limited, please try again later\n"
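
On a live cluster, a quick way to check whether the telemeter client is hitting the same rate limit (the deployment and container names below are the openshift-monitoring defaults; verify them on your cluster):

$ # Count rate-limit errors in the telemeter-client logs
$ oc -n openshift-monitoring logs deploy/telemeter-client -c telemeter-client | grep -c 'rate limited'

A non-zero count means the client is failing to exchange its token, so no telemetry reaches cloud.openshift.com and the test times out.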

Comment 8 Simon Pasquier 2020-07-06 13:58:59 UTC
*** Bug 1854126 has been marked as a duplicate of this bug. ***

Comment 15 W. Trevor King 2020-07-07 22:16:46 UTC
Slightly different CI symptoms from a 4.5.0-0.nightly-2020-07-07-210042 job [1]:

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:2>: {
        "metricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1": {
            s: "promQL query: metricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1 had reported incorrect results:\n[]",
        },
        "federate_samples{job=\"telemeter-client\"} >= 10": {
            s: "promQL query: federate_samples{job=\"telemeter-client\"} >= 10 had reported incorrect results:\n[]",
        },
    }
to be empty

These symptoms were also reported in bug 1854126, which was closed as a duplicate of this bug. I'm recording them in a public comment here so that CI search can index them.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1280608601861263360
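
To reproduce these checks by hand, the same expressions can be run against the in-cluster monitoring stack, for example through the thanos-querier route (the route name and API path below are the usual defaults on 4.5+; adjust if your cluster differs):

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ # Each query should return at least one series; an empty result ([]) reproduces the failure above
$ curl -skH "Authorization: Bearer $TOKEN" --data-urlencode 'query=federate_samples{job="telemeter-client"} >= 10' "https://$HOST/api/v1/query"
$ curl -skH "Authorization: Bearer $TOKEN" --data-urlencode 'query=metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1' "https://$HOST/api/v1/query"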

Comment 16 W. Trevor King 2020-07-07 22:19:25 UTC
Raising to urgent based on:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?search=promQL%20query:%20.*metricsclient_request_send.*telemeter-client.*had%20reported%20incorrect%20results&name=release-openshift-ocp' | grep 'failures match'
release-openshift-ocp-installer-e2e-aws-upi-4.2 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.2 - 2 runs, 50% failed, 300% of failures match
release-openshift-ocp-installer-e2e-metal-4.2 - 2 runs, 50% failed, 400% of failures match
release-openshift-ocp-installer-e2e-aws-4.5 - 22 runs, 91% failed, 80% of failures match
release-openshift-ocp-installer-e2e-aws-4.2 - 2 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.5 - 5 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.5 - 5 runs, 100% failed, 140% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 5 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-metal-4.5 - 5 runs, 100% failed, 140% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.5 - 6 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.3 - 12 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 14 runs, 64% failed, 167% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.4 - 1 runs, 100% failed, 200% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 6 runs, 33% failed, 50% of failures match
release-openshift-ocp-installer-e2e-azure-4.3 - 2 runs, 100% failed, 150% of failures match
release-openshift-ocp-installer-e2e-metal-4.3 - 2 runs, 100% failed, 200% of failures match
release-openshift-ocp-installer-e2e-gcp-4.3 - 2 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.5 - 6 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-metal-4.6 - 2 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-metal-compact-4.6 - 2 runs, 100% failed, 150% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 2 runs, 100% failed, 150% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-proxy-4.2 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.2 - 2 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.3 - 1 runs, 100% failed, 300% of failures match
release-openshift-ocp-installer-e2e-aws-4.4 - 4 runs, 50% failed, 400% of failures match
release-openshift-ocp-installer-e2e-azure-4.4 - 2 runs, 50% failed, 300% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.4 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.4 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.4 - 2 runs, 100% failed, 150% of failures match
release-openshift-ocp-installer-e2e-openstack-4.4 - 3 runs, 100% failed, 67% of failures match
release-openshift-ocp-installer-e2e-azure-4.5 - 5 runs, 100% failed, 60% of failures match
release-openshift-ocp-installer-e2e-openstack-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.3 - 1 runs, 100% failed, 200% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.3 - 1 runs, 100% failed, 200% of failures match
release-openshift-ocp-installer-e2e-aws-4.3 - 2 runs, 50% failed, 400% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.3 - 2 runs, 50% failed, 200% of failures match
release-openshift-ocp-installer-e2e-gcp-4.4 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-metal-4.4 - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.4 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-proxy-4.4 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 12 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-azure-4.2 - 2 runs, 50% failed, 100% of failures match

Comment 18 Simon Pasquier 2020-07-08 08:12:24 UTC
It looks like the recent failures have a different cause. Looking at a recent failed job [1], the prometheus-operator logs [2] are spammed with list errors:

E0708 03:25:41.764206       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762": strconv.ParseUint: parsing "18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762/18762": invalid syntax
E0708 03:26:07.708655       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821": strconv.ParseUint: parsing "17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821": invalid syntax
E0708 03:26:13.979155       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821": strconv.ParseUint: parsing "17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821/17821": invalid syntax


In practice this means that the prometheus-operator couldn't retrieve some of the ServiceMonitors and rules, so some targets aren't configured in Prometheus and their metrics are missing. We merged a PR yesterday that updated prometheus-operator to v0.40.0, and it is probably the cause of these failures. We have a PR in flight [3] to revert this change. A quick way to check for the same symptom is sketched after the links below.

[1] https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1280688004767158272
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1280688004767158272/artifacts/e2e-azure/pods/openshift-monitoring_prometheus-operator-55d4df5b55-9m4bm_prometheus-operator.log
[3] https://github.com/openshift/prometheus-operator/pull/79
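
For reference, a rough way to check whether a cluster is hitting this symptom (names below are the openshift-monitoring defaults; verify them locally):

$ # Reflector list errors in the operator logs point at the resourceVersion parsing problem
$ oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator | grep -c 'Failed to list'
$ # If the count is non-zero, expect missing targets in Prometheus even though the ServiceMonitors exist
$ oc -n openshift-monitoring get servicemonitors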

Comment 19 Simon Pasquier 2020-07-08 15:37:47 UTC
https://github.com/openshift/prometheus-operator/pull/79, which reverts the bump to prometheus-operator v0.40.0, has been merged, and the test shouldn't fail anymore on 4.6.

Comment 20 Nick Hale 2020-07-08 23:31:15 UTC
*** Bug 1855092 has been marked as a duplicate of this bug. ***

Comment 21 Simon Pasquier 2020-07-09 09:43:38 UTC
I've checked the CI failures matching this test and couldn't find any relevant ones in the last 24 hours.

Comment 25 Simon Pasquier 2020-07-13 10:51:07 UTC
Looking at recent failures of this test, I've found only 3 occurrences:
* https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.3/1282437752754802688: more than 307 of 2107 tests failed, so I assume the cluster under test wasn't healthy and it's hard to conclude anything from it.
* https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/periodic-ci-operator-framework-operator-lifecycle-managment-rhoperator-metric-e2e-aws-olm-release-4.4-daily/1282555527653494784: the test didn't fail, but its name appears in the logs.
* https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424: one of the Prometheus instances had the PrometheusNotIngestingSamples alert firing, which means that it wasn't ingesting any samples. The telemeter-client in turn didn't retrieve any metrics, as shown in its logs (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424/artifacts/e2e-aws/pods/openshift-monitoring_telemeter-client-56dbd488b5-2rxsl_telemeter-client.log). We already have bug 1845561 tracking this issue; a quick way to check for that alert is sketched after this list.
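
Using the same $TOKEN and $HOST as in the sketch under comment 15, one can check for that alert and for ingestion directly (the second expression is roughly what the upstream rule evaluates; check your cluster's rule files for the exact form):

$ curl -skH "Authorization: Bearer $TOKEN" --data-urlencode 'query=ALERTS{alertname="PrometheusNotIngestingSamples",alertstate="firing"}' "https://$HOST/api/v1/query"
$ curl -skH "Authorization: Bearer $TOKEN" --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])' "https://$HOST/api/v1/query"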

Comment 26 Simon Pasquier 2020-07-13 10:55:15 UTC
We also have bug 1855325, which tracks another reason why the test could fail. But since it isn't the same cause as described in this bug and it happens rarely, it has a different severity/priority.

Comment 29 errata-xmlrpc 2020-10-27 16:11:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

