Bug 1975555 - Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
Summary: Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Allen Ray
QA Contact: ge liu
URL:
Whiteboard: tag-ci LifecycleFrozen
Depends On:
Blocks:
 
Reported: 2021-06-23 22:11 UTC by Ben Parees
Modified: 2022-09-16 13:36 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1975779
Environment:
job=periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade=all
job=periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-upgrade=all
Last Closed: 2022-09-12 10:14:43 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?
bparees: needinfo-



Description Ben Parees 2021-06-23 22:11:08 UTC
Description of problem:

Test is failing in our metal ipi upgrade jobs:


https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=1&type=junit&name=metal&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade/1406325890748518400

fail [k8s.io/kubernetes.0/test/e2e/framework/pod/resource.go:483]: failed to create new exec pod in namespace: e2e-test-prometheus-dbqnp
Unexpected error:
    <*errors.errorString | 0xc0003369b0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred


Since it is the test's own exec pod that is failing to launch, as opposed to an incorrect alert firing, I'm starting this with the monitoring team; but since this test doesn't seem to fail on other platforms, you will likely want to engage the Metal team.
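For reference, the error above comes from the test framework's create-then-poll pattern for its helper pod. Below is a minimal, hypothetical sketch of that pattern (not the actual k8s.io/kubernetes e2e framework code); the image name and the ~5 minute timeout are taken from the runs quoted in this bug, everything else is illustrative:

package e2esketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// createExecPod sketches why the failure reads "failed to create new exec
// pod ... timed out waiting for the condition": the Pod object is created
// successfully, but the subsequent poll for Running never succeeds (here,
// because the image never pulls), and wait.ErrWaitTimeout carries exactly
// that message.
func createExecPod(ctx context.Context, c kubernetes.Interface, ns string) (*corev1.Pod, error) {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "execpod", Namespace: ns},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:    "exec",
				Image:   "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest",
				Command: []string{"/bin/sh", "-c", "sleep 3600"},
			}},
			RestartPolicy: corev1.RestartPolicyNever,
		},
	}
	created, err := c.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	// The ~5m2s test duration seen in these runs matches a poll timeout of
	// roughly this size expiring while the pod sits in ImagePullBackOff.
	err = wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		p, err := c.CoreV1().Pods(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return p.Status.Phase == corev1.PodRunning, nil
	})
	if err != nil {
		// err is wait.ErrWaitTimeout: "timed out waiting for the condition"
		return nil, fmt.Errorf("failed to create new exec pod in namespace: %s: %v", ns, err)
	}
	return created, nil
}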


Version-Release number of selected component (if applicable):
4.6 to 4.7 upgrade


How reproducible:
fairly consistent in our payload acceptance periodic jobs

Comment 1 Jan Fajerski 2021-06-24 08:24:17 UTC
Looking at the example run, there was an issue with pulling an image

Jun 19 20:12:21.483: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-dbqnp/execpod to worker-1
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:23 +0000 UTC - event for execpod: {multus } AddedInterface: Add eth0 [10.128.2.10/23]
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = error pinging docker registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": x509: certificate signed by unknown authority
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Error: ErrImagePull
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:25 +0000 UTC - event for execpod: {kubelet worker-1} BackOff: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:25 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Error: ImagePullBackOff
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:08:57 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = Error reading manifest latest in image-registry.openshift-image-registry.svc:5000/openshift/tools: unauthorized: authentication required

will reach out to the metal ipi team.
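For clarity on what the first kubelet error means: "error pinging docker registry ... x509: certificate signed by unknown authority" indicates the registry's TLS serving certificate was not trusted by the node at pull time. A minimal, hypothetical reproduction from inside the cluster, assuming the in-cluster registry service name from the events above, would look like this (an empty root pool stands in for a node whose trust bundle is missing the registry CA):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Empty pool: simulate a node that doesn't trust the registry's CA.
	pool := x509.NewCertPool()
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}}
	// The service DNS name only resolves from inside the cluster.
	_, err := client.Get("https://image-registry.openshift-image-registry.svc:5000/v2/")
	if err != nil {
		// Expected: ... x509: certificate signed by unknown authority
		fmt.Fprintln(os.Stderr, "registry ping failed:", err)
	}
}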

Comment 4 Jan Fajerski 2021-06-24 12:00:00 UTC
Cloned this for the image pull error. https://bugzilla.redhat.com/show_bug.cgi?id=1975779

Comment 5 Jan Fajerski 2021-06-24 12:02:20 UTC
Reassigning to etcd component. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1407794486494367744 seems to indicate that etcd isn't starting as expected.

Comment 6 Sam Batschelet 2021-06-28 19:07:59 UTC
@jfajersk can you perhaps provide a log reference for why etcd is the assignee?

Comment 7 Jan Fajerski 2021-07-02 08:57:12 UTC
(In reply to Jan Fajerski from comment #5)
> Reassigning to etcd component.
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-
> openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-
> upgrade/1407794486494367744 seems to indicate the etcd isn't starting as
> expected.

The link lists a failure like this:

: [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	0s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 23 21:20:49.315: Some cluster operators are not ready: insights (Degraded=True PeriodicGatherFailed: Source clusterconfig could not be retrieved: etcdserver: leader changed, etcdserver: request timed out)
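For context, the "[sig-arch][Early]" failure above is the conformance suite asserting that every ClusterOperator is healthy. A rough, hypothetical sketch of that kind of check (not the actual origin test, which also verifies Available/Progressing) using the dynamic client against config.openshift.io/v1 clusteroperators:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
	operators, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, co := range operators.Items {
		conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
		for _, c := range conditions {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			// In the run linked above, insights reported Degraded=True because its
			// gather hit "etcdserver: leader changed" / "request timed out".
			if cond["type"] == "Degraded" && cond["status"] == "True" {
				fmt.Printf("ClusterOperator %s is Degraded: %v\n", co.GetName(), cond["message"])
			}
		}
	}
}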

Comment 8 Michal Fojtik 2021-08-01 09:47:10 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 9 Ben Parees 2021-08-01 18:09:56 UTC
I don't know if etcd is really the right component for this bug, but the issue is definitely still occurring:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=5&type=junit&name=metal&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Recent run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-upgrade/1421441651221467136

: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]	5m2s
fail [k8s.io/kubernetes.0/test/e2e/framework/pod/resource.go:483]: failed to create new exec pod in namespace: e2e-test-prometheus-f47ld
Unexpected error:
    <*errors.errorString | 0xc00034a9b0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred


it's failing in 0.58% of metal runs
1.5% of metal 4.8 runs
10% of metal 4.7 runs


0.31% of all runs
0.61% of all 4.8 runs
1.86% of all 4.7 runs

So that would indicate it definitely is more of an issue on metal (though that likely just points to more overburdened nodes causing timeouts, rather than something that's really specific to the metal codebase).

Removing lifecycle stale.

Comment 10 Michal Fojtik 2021-08-31 18:35:03 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 18 W. Trevor King 2022-03-01 23:58:41 UTC
Query from comment 11 still turns up plenty of hits, including [1].  That's over a week old now, but we've also had a lot of trouble cutting 4.11 nightlies recently, so I wouldn't treat "4.11 periodics don't seem to have more recent failures" as a sign that this is fixed in 4.11 without doing more legwork or waiting for 4.11 nightlies to get building again [2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1494211727867252736
[2]: https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.nightly

Comment 19 Ben Parees 2022-03-02 00:34:02 UTC
The specific failure to be investigated (at least the original intent of this bug) is the case where the pod does not even launch, as indicated by this error string:

failed to create new exec pod in namespace: e2e-test-prometheus


https://search.ci.openshift.org/?search=failed+to+create+new+exec+pod+in+namespace%3A+e2e-test-prometheus&maxAge=336h&context=5&type=build-log&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

(Note that the search restricts the query to 4.10 jobs, to confirm the failure mode is still occurring in recent releases that we'd have cause to care about.)

The general test itself fails much more frequently for other reasons, which is worthy of its own bug that the monitoring team should own and drive. But the failures where the pod itself doesn't launch aren't going to be resolved by them, hence this bug, which currently seems to have landed on the etcd team (I can't speak to the validity of that assessment/ownership).

