Bug 1855325

Summary: [Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Product: OpenShift Container Platform
Reporter: Varsha <vnarsing>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: low
Docs Contact:
Version: 4.4
CC: alegrand, anpicker, bparees, erooth, kakkoyun, lcosic, pkrupa, spasquie, surbania
Target Milestone: ---
Keywords: Reopened
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: [Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
Last Closed: 2021-02-24 15:13:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Varsha 2020-07-09 14:55:29 UTC
test:
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5BFeature%3APrometheus%5C%5D%5C%5BConformance%5C%5D+Prometheus+when+installed+on+the+cluster+%5C%5BTop+Level%5C%5D+%5C%5BFeature%3APrometheus%5C%5D%5C%5BConformance%5C%5D+Prometheus+when+installed+on+the+cluster+should+report+telemetry+if+a+cloud%5C.openshift%5C.com+token+is+present



Link to the job which is failing: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816

Snippet of failure logs:

fail [github.com/openshift/origin/test/extended/builds/valuefrom.go:46]: Unexpected error:
    <*util.ExitError | 0xc001945350>: {
        Cmd: "oc --namespace=e2e-test-build-valuefrom-jllvb --kubeconfig=/tmp/configfile256536350 create -f /tmp/fixture-testdata-dir123358715/test/extended/testdata/builds/valuefrom/test-is.json --validate=false",
        StdErr: "error: error when creating \"/tmp/fixture-testdata-dir123358715/test/extended/testdata/builds/valuefrom/test-is.json\": Post https://api.ci-op-zt1lsbbl-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/e2e-test-build-valuefrom-jllvb/imagestreams: EOF",
        ExitError: {
            ProcessState: {
                pid: 4991,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 205566},
                    Stime: {Sec: 0, Usec: 105962},
                    Maxrss: 75936,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 12720,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 888,
                    Nivcsw: 20,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1


Jul 09 11:41:30.684 E ns/openshift-kube-controller-manager pod/kube-controller-manager-control-plane-0 node/control-plane-0 container=cluster-policy-controller container exited with code 255 (Error): I0709 11:41:26.958277       1 cert_rotation.go:137] Starting client certificate rotation controller\nI0709 11:41:26.966046       1 policy_controller.go:41] Starting controllers on 0.0.0.0:10357 (31debebc)\nI0709 11:41:26.967891       1 standalone_apiserver.go:103] Started health checks at 0.0.0.0:10357\nF0709 11:41:26.969002       1 standalone_apiserver.go:119] listen tcp 0.0.0.0:10357: bind: address already in use\n
Jul 09 11:43:55.182 E ns/openshift-kube-apiserver pod/kube-apiserver-control-plane-0 node/control-plane-0 container=setup init container exited with code 124 (Error): ................................................................................
Jul 09 11:44:06.597 E ns/openshift-console pod/console-7dbc679bf6-cbxd8 node/control-plane-1 container=console container exited with code 2 (Error): 2020-07-09T11:38:28Z cmd/main: cookies are secure!\n2020-07-09T11:38:28Z cmd/main: Binding to [::]:8443...\n2020-07-09T11:38:28Z cmd/main: using TLS\n




One of the tests related to the following is failing:
- Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present.

A bug was already filed for the same failure against a different release: https://bugzilla.redhat.com/show_bug.cgi?id=1853007.
However, since the relevant PR is closed and the comment https://bugzilla.redhat.com/show_bug.cgi?id=1853007#c21 expects the job not to fail, this bug is being raised for further investigation.

Comment 2 Simon Pasquier 2020-07-09 15:40:21 UTC
I suspect that the issue is bad timing between when the test is executed and when telemetry metrics are available from Prometheus.

The telemeter-client logs [1] show that it hadn't retrieved any metrics as of 11:40:29.8762. This is consistent with the Prometheus logs [2][3], which show that the Prometheus pods were still starting around that time.

The Prometheus dump also shows that the telemeter client only sent samples to the telemetry backend after the 11:45:18 mark, while the test reported the failure at 11:45:03.430. Given that the telemeter client sends data every 4m30s and the test checking whether telemetry data has been sent only does 5 retries at a 10-second interval, this would explain it.
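
To make the mismatch concrete, here is a minimal sketch (plain Go, not the test code itself) using only the intervals quoted above: the test's entire retry window is shorter than a single telemeter push interval, so a client that has only just started can push its first samples minutes after the test has already given up.

package main

import (
	"fmt"
	"time"
)

// Rough sketch of the timing described in this comment; the intervals below
// come from the comment, not from the actual test code.
func main() {
	pushInterval := 4*time.Minute + 30*time.Second // telemeter-client push cadence
	retries := 5                                   // number of test retries
	retryInterval := 10 * time.Second              // wait between retries

	testWindow := time.Duration(retries) * retryInterval
	fmt.Println("test polling window:      ", testWindow)              // 50s
	fmt.Println("telemeter push interval:  ", pushInterval)            // 4m30s
	fmt.Println("gap the test cannot cover:", pushInterval-testWindow) // 3m40s
}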

The issue is probably rare and less visible in 4.6, since failing tests are retried there to eliminate flakes. We should still look into making the test more predictable.
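
One option (a sketch only, not the actual origin test; waitForTelemetry and the telemetrySent helper are hypothetical) would be to keep the 10-second poll interval but stretch the overall timeout past one full push interval, so at least one telemeter push is guaranteed to land inside the window:

package telemetry

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForTelemetry polls until telemetrySent reports that the telemeter
// client has pushed samples, or until the timeout expires. telemetrySent is
// a hypothetical helper that would query Prometheus for evidence of a
// successful push.
func waitForTelemetry(telemetrySent func() (bool, error)) error {
	// Poll every 10s, but keep trying for longer than one 4m30s push
	// interval so a late first push no longer fails the check.
	return wait.PollImmediate(10*time.Second, 5*time.Minute, telemetrySent)
}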

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_telemeter-client-66dbfd95b7-zgv7k_telemeter-client.log
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
[3] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1281185853179170816/artifacts/e2e-vsphere-upi/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log

Comment 3 Ben Parees 2020-07-10 03:00:45 UTC
This was a tollbooth issue which has since been resolved. There are only a couple of flaky recent failures; most of the failures are over a day old (from before the issue was addressed) and will slowly fall off the test history.

Comment 15 errata-xmlrpc 2021-02-24 15:13:57 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633