Description of problem:

Test is failing in our metal ipi upgrade jobs:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=1&type=junit&name=metal&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade/1406325890748518400

fail [k8s.io/kubernetes.0/test/e2e/framework/pod/resource.go:483]: failed to create new exec pod in namespace: e2e-test-prometheus-dbqnp
Unexpected error:
    <*errors.errorString | 0xc0003369b0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Since it is the test's exec pod that's failing to launch, as opposed to an incorrect alert firing, I'm starting this with the monitoring team; but since this test doesn't seem to fail on other platforms, you will likely want to engage the Metal team.

Version-Release number of selected component (if applicable):
4.6 to 4.7 upgrade

How reproducible:
Fairly consistent in our payload acceptance periodic jobs
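For reference, here is a minimal sketch (not the framework's actual code) of the create-and-wait pattern behind that error message: client-go's wait.PollImmediate returns "timed out waiting for the condition" when the pod never reaches Running, and the e2e framework reports that as "failed to create new exec pod in namespace: <ns>". The namespace, pod name, and kubeconfig handling below are illustrative placeholders.

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, name := "e2e-test-prometheus-example", "execpod" // placeholders
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:    "exec",
				Image:   "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest",
				Command: []string{"sleep", "3600"},
			}},
		},
	}
	if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Poll for up to 5 minutes; an ImagePullBackOff keeps the pod Pending, so the
	// poll ends with wait.ErrWaitTimeout ("timed out waiting for the condition").
	err = wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		p, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return p.Status.Phase == corev1.PodRunning, nil
	})
	if err != nil {
		fmt.Printf("failed to create new exec pod in namespace: %s: %v\n", ns, err)
	}
}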
Looking at the example run, there was an issue with pulling an image:

Jun 19 20:12:21.483: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-dbqnp/execpod to worker-1
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:23 +0000 UTC - event for execpod: {multus } AddedInterface: Add eth0 [10.128.2.10/23]
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = error pinging docker registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": x509: certificate signed by unknown authority
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:24 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Error: ErrImagePull
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:25 +0000 UTC - event for execpod: {kubelet worker-1} BackOff: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:07:25 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Error: ImagePullBackOff
Jun 19 20:12:21.483: INFO: At 2021-06-19 20:08:57 +0000 UTC - event for execpod: {kubelet worker-1} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = Error reading manifest latest in image-registry.openshift-image-registry.svc:5000/openshift/tools: unauthorized: authentication required

Will reach out to the metal ipi team.
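For whoever picks this up, a hedged diagnostic sketch that pings the registry's /v2/ endpoint the same way the failing pull does, once with the system trust store and once with an extra CA bundle, to reproduce or rule out the "x509: certificate signed by unknown authority" symptom. Assumptions: it runs from inside the cluster so the service DNS name resolves, and /tmp/registry-ca.crt is a placeholder path for wherever the registry's serving CA has been copied.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func ping(pool *x509.CertPool) error {
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool}, // nil pool = system roots
		},
	}
	resp, err := client.Get("https://image-registry.openshift-image-registry.svc:5000/v2/")
	if err != nil {
		return err // e.g. "x509: certificate signed by unknown authority"
	}
	defer resp.Body.Close()
	fmt.Println("registry responded with", resp.Status)
	return nil
}

func main() {
	// 1. System roots only: roughly what a pull sees when the registry's
	//    serving CA has not been propagated to the node.
	if err := ping(nil); err != nil {
		fmt.Println("system trust store:", err)
	}

	// 2. With the registry CA added (path is a placeholder/assumption).
	pem, err := os.ReadFile("/tmp/registry-ca.crt")
	if err != nil {
		fmt.Println("could not read CA bundle:", err)
		return
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(pem)
	if err := ping(pool); err != nil {
		fmt.Println("with registry CA:", err)
	}
}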
Cloned this for the image pull error. https://bugzilla.redhat.com/show_bug.cgi?id=1975779
Reassigning to etcd component. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1407794486494367744 seems to indicate the etcd isn't starting as expected.
@jfajersk can you perhaps provide a log reference explaining why etcd is the assignee?
(In reply to Jan Fajerski from comment #5)
> Reassigning to etcd component.
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1407794486494367744
> seems to indicate the etcd isn't starting as expected.

The link lists a failure like this:

: [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 0s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 23 21:20:49.315: Some cluster operators are not ready: insights (Degraded=True PeriodicGatherFailed: Source clusterconfig could not be retrieved: etcdserver: leader changed, etcdserver: request timed out)
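For context, a rough sketch of the kind of check that [sig-arch][Early] test performs: list ClusterOperators and flag anything Degraded or not Available (the insights operator above would show up with the etcdserver "leader changed" / "request timed out" message). This is not the test's actual code; it uses the dynamic client against config.openshift.io/v1 so no OpenShift typed client is required.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{
		Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators",
	}
	list, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, co := range list.Items {
		conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
		for _, c := range conditions {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			t, _ := cond["type"].(string)
			s, _ := cond["status"].(string)
			msg, _ := cond["message"].(string)
			// e.g. insights: Degraded=True (... etcdserver: leader changed ...)
			if (t == "Degraded" && s == "True") || (t == "Available" && s == "False") {
				fmt.Printf("%s: %s=%s (%s)\n", co.GetName(), t, s, msg)
			}
		}
	}
}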
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.
I don't know if etcd is really the right component for this bug, but the issue is definitely still occurring:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=5&type=junit&name=metal&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Recent run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-upgrade/1421441651221467136

: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel] 5m2s
fail [k8s.io/kubernetes.0/test/e2e/framework/pod/resource.go:483]: failed to create new exec pod in namespace: e2e-test-prometheus-f47ld
Unexpected error:
    <*errors.errorString | 0xc00034a9b0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

It's failing in:
0.58% of metal runs
1.5% of metal 4.8 runs
10% of metal 4.7 runs
0.31% of all runs
0.61% of all 4.8 runs
1.86% of all 4.7 runs

So this does look like more of an issue on metal, though that likely just points to more overburdened nodes causing timeouts rather than to anything specific to the metal codebase.

Removing LifecycleStale.
Still happening:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=336h&context=5&type=junit&name=metal&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Query from comment 11 still turns up plenty of hits, including [1]. That's over a week old now, but we've also had a lot of trouble cutting 4.11 nightlies recently, so I wouldn't treat "4.11 periodics don't seem to have more recent failures" as a sign that this is fixed in 4.11 without doing more legwork or waiting for 4.11 nightlies to get building again [2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1494211727867252736
[2]: https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.nightly
The specific failure to be investigated (at least the original intent of this bug) is the case where the pod does not even launch, as indicated by this error string:

failed to create new exec pod in namespace: e2e-test-prometheus

https://search.ci.openshift.org/?search=failed+to+create+new+exec+pod+in+namespace%3A+e2e-test-prometheus&maxAge=336h&context=5&type=build-log&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

(Note that this search restricts the query to 4.10 jobs, to confirm the failure mode is still occurring in recent releases that we'd have cause to care about.)

The general test itself fails much more frequently for other reasons; that is worthy of its own bug, which the monitoring team should own and drive. But the failures where the pod itself doesn't launch aren't going to be resolved by them, hence this bug, which currently seems to have landed on the etcd team (I can't speak to the validity of that assessment/ownership).
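To make that triage concrete, a tiny illustrative helper that separates the pod-launch failures tracked here from other failures of the same test. The first sample is taken from the job output quoted above; the second is a made-up placeholder standing in for the alert-firing failure mode.

package main

import (
	"fmt"
	"strings"
)

// isExecPodLaunchFailure reports whether a JUnit failure message matches the
// specific failure mode this bug tracks.
func isExecPodLaunchFailure(failureText string) bool {
	return strings.Contains(failureText, "failed to create new exec pod in namespace: e2e-test-prometheus")
}

func main() {
	samples := []string{
		// From the job output quoted in this bug:
		"failed to create new exec pod in namespace: e2e-test-prometheus-f47ld Unexpected error: timed out waiting for the condition",
		// Placeholder standing in for the "unexpected alert firing" flavour of failure:
		"unexpected firing alerts (placeholder sample, not real job output)",
	}
	for _, s := range samples {
		fmt.Println(isExecPodLaunchFailure(s))
	}
}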