Bug 1855325 - [Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Summary: [Feature:Prometheus][Conformance] Prometheus when installed on the cluster [T...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-09 14:55 UTC by Varsha
Modified: 2021-02-24 15:15 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Last Closed: 2021-02-24 15:13:57 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
GitHub openshift/origin pull 25496 (closed): Bug 1855325: Move checking telemetry data sending to later stages (last updated 2021-01-13 09:47:55 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:15:16 UTC)

Description Varsha 2020-07-09 14:55:29 UTC
test:
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5BFeature%3APrometheus%5C%5D%5C%5BConformance%5C%5D+Prometheus+when+installed+on+the+cluster+%5C%5BTop+Level%5C%5D+%5C%5BFeature%3APrometheus%5C%5D%5C%5BConformance%5C%5D+Prometheus+when+installed+on+the+cluster+should+report+telemetry+if+a+cloud%5C.openshift%5C.com+token+is+present



Link to the job which is failing: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816

Snippet of failure logs:

fail [github.com/openshift/origin/test/extended/builds/valuefrom.go:46]: Unexpected error:
    <*util.ExitError | 0xc001945350>: {
        Cmd: "oc --namespace=e2e-test-build-valuefrom-jllvb --kubeconfig=/tmp/configfile256536350 create -f /tmp/fixture-testdata-dir123358715/test/extended/testdata/builds/valuefrom/test-is.json --validate=false",
        StdErr: "error: error when creating \"/tmp/fixture-testdata-dir123358715/test/extended/testdata/builds/valuefrom/test-is.json\": Post https://api.ci-op-zt1lsbbl-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/e2e-test-build-valuefrom-jllvb/imagestreams: EOF",
        ExitError: {
            ProcessState: {
                pid: 4991,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 205566},
                    Stime: {Sec: 0, Usec: 105962},
                    Maxrss: 75936,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 12720,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 888,
                    Nivcsw: 20,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1


Jul 09 11:41:30.684 E ns/openshift-kube-controller-manager pod/kube-controller-manager-control-plane-0 node/control-plane-0 container=cluster-policy-controller container exited with code 255 (Error): I0709 11:41:26.958277       1 cert_rotation.go:137] Starting client certificate rotation controller\nI0709 11:41:26.966046       1 policy_controller.go:41] Starting controllers on 0.0.0.0:10357 (31debebc)\nI0709 11:41:26.967891       1 standalone_apiserver.go:103] Started health checks at 0.0.0.0:10357\nF0709 11:41:26.969002       1 standalone_apiserver.go:119] listen tcp 0.0.0.0:10357: bind: address already in use\n
Jul 09 11:43:55.182 E ns/openshift-kube-apiserver pod/kube-apiserver-control-plane-0 node/control-plane-0 container=setup init container exited with code 124 (Error): ................................................................................
Jul 09 11:44:06.597 E ns/openshift-console pod/console-7dbc679bf6-cbxd8 node/control-plane-1 container=console container exited with code 2 (Error): 2020-07-09T11:38:28Z cmd/main: cookies are secure!\n2020-07-09T11:38:28Z cmd/main: Binding to [::]:8443...\n2020-07-09T11:38:28Z cmd/main: using TLS\n




One of the tests related to the following is failing:
- Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present.

A bug was already filed for the same failure against a different release: https://bugzilla.redhat.com/show_bug.cgi?id=1853007.
However, since the relevant PR is closed and the comment at https://bugzilla.redhat.com/show_bug.cgi?id=1853007#c21 expects the job not to fail, this bug is being raised for further investigation.

Comment 2 Simon Pasquier 2020-07-09 15:40:21 UTC
I suspect that the issue is bad timing between when the test is executed and when telemetry metrics are available from Prometheus.

The telemeter-client logs [1] show that it didn't retrieve any metrics at 11:40:29.8762. This is consistent with the Prometheus logs [2][3] which show that they were starting around that time.

The Prometheus dump also shows that the telemeter client sent samples to the telemetry backend after the 11:45:18 mark, while the test reported the failure at 11:45:03.430. Given that the telemeter client sends data every 4min30s and the test that checks whether telemetry data has been sent only does 5 retries at a 10-second interval (roughly a 50-second window), this would explain the failure.

The issue is probably rare and less visible in 4.6 since failing tests are retried to eliminate flakes. We should still look into making the test more predictable.
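A minimal sketch of the timing argument follows (this is not the actual origin test code; the telemetrySamplesSent helper and the use of wait.Poll are illustrative placeholders for the Prometheus query the real test performs):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// telemetrySamplesSent stands in for the real check: the actual test queries
// Prometheus for a telemeter-client metric showing that samples were pushed
// upstream. The name and logic here are illustrative only.
func telemetrySamplesSent() (bool, error) {
	return false, nil
}

func main() {
	pushInterval := 4*time.Minute + 30*time.Second // telemeter-client push interval
	pollInterval := 10 * time.Second

	// Current behaviour: 5 retries x 10s gives a ~50s window, far shorter
	// than one push interval, so the check can easily run before the first push.
	fmt.Printf("test window: %v, push interval: %v\n", 5*pollInterval, pushInterval)

	// More predictable: poll until at least one full push interval (plus
	// some slack) has elapsed before declaring failure.
	err := wait.Poll(pollInterval, pushInterval+time.Minute, func() (bool, error) {
		return telemetrySamplesSent()
	})
	if err != nil {
		fmt.Println("telemetry not observed within the window:", err)
	}
}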

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_telemeter-client-66dbfd95b7-zgv7k_telemeter-client.log
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
[3] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log

Comment 3 Ben Parees 2020-07-10 03:00:45 UTC
This was a tollbooth issue which has since been resolved. There are only a couple of flaky recent failures; most of the failures are over a day old (from before the issue was addressed) and will slowly fall off the test history.

Comment 15 errata-xmlrpc 2021-02-24 15:13:57 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

