Examples from 4.3 promotion jobs [1]:

  fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:134]: Expected
      <map[string]error | len:1>: {
          "ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1": {
              s: "promQL query: ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1 had reported incorrect results: ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-kube-controller-manager-operator\", service=\"metrics\", severity=\"warning\"} => 1 @[1575338031.017]",
          },
      }
  to be empty
  ...
  failed: (7m2s) 2019-12-03T01:53:53 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

But [2] shows the kube-controller-manager-operator pod has been running since 1:20Z and was still running at gather time, so this is probably a metrics-gathering issue rather than an actual outage.

Similar issue in [3]:

  ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-console-operator\", service=\"metrics\", severity=\"warning\"}

despite a healthy console operator [4].

Hit this 15 times today (1% of all e2e failures) [5]. 4 of those were for 4.3 release jobs (6% of failures in those jobs) [6].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-6c984f44df-vf9bx/kube-controller-manager-operator-6c984f44df-vf9bx.yaml
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-console-operator/pods/console-operator-75548dd7b4-w9pg6/console-operator-75548dd7b4-w9pg6.yaml
[5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing
[6]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-.*4.3$&search=TargetDown.*firing
Maciej was investigating.
Around the time when this failed I'm seeing a bunch of:

  I1203 01:46:42.757020 1 log.go:172] http: TLS handshake error from 10.128.2.15:54968: remote error: tls: bad certificate
  I1203 01:46:59.347908 1 log.go:172] http: TLS handshake error from 10.129.2.13:38406: remote error: tls: bad certificate

in the kcm-o logs. From what I've checked, the pods behind those IPs are Prometheus pods; is it possible that they are using the wrong certs? On the other hand, I've checked other instances and it's not always kcm-o: I've also seen console-operator TargetDown, and others with an unknown error (1 had reported incorrect results: model.Vector{(*model.Sample)(0xc004126180)}). With this in mind I'll move this back to monitoring for further investigation.
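For context on those handshake errors, here is a hedged illustration (not the actual Prometheus scrape path; the CA-bundle path, service host, and port below are assumptions on my part): in-cluster scrapers validate the target's serving certificate against the service-ca bundle, so a client configured like this sketch would abort the handshake against a serving cert not signed by that CA, and the target is the side that logs "remote error: tls: bad certificate".

// Hypothetical sketch of the client side of the failing scrape. Paths and
// the target URL are illustrative, not taken from the monitoring config.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Assumed mount location of the service-ca bundle inside the scraper pod.
	caPEM, err := os.ReadFile("/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		panic("no certificates found in the CA bundle")
	}

	// Trust only the service-ca bundle, like an in-cluster scraper would.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}}

	// Against a self-signed serving cert this fails here with
	// "x509: certificate signed by unknown authority", while the target
	// logs "remote error: tls: bad certificate".
	resp, err := client.Get("https://metrics.openshift-kube-controller-manager-operator.svc:8443/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("scrape status:", resp.Status)
}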
(In reply to W. Trevor King from comment #0)
> [5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing

That^ shows three of these just in the past 24h [1,2,3]. In all three cases it was the same openshift-console-operator alert I mentioned in comment 0. So still not a high flake rate, but certainly something that we can actually reproduce, right?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/928
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/104/pull-ci-openshift-service-ca-operator-release-4.3-e2e-aws/8
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.4/795
The bug is caused by the self-sign-if-no-certs-found logic in library-go's default controller command behavior: https://github.com/openshift/cluster-authentication-operator/blob/fe70139ca94a0078f1b6a847ff9c9a366c4db0f4/vendor/github.com/openshift/library-go/pkg/controller/controllercmd/cmd.go#L195-L199

Apparently the service-ca controllers usually win the race against the console operator's pod creation and populate the secret the operator mounts. However, if they lose, self-signed certs are used for serving instead, and the metrics collector (naturally) does not trust those.

I'll try to create a fix for this in library-go to avoid any magic.
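To make that failure mode concrete, here is a minimal, hypothetical sketch of the fallback behavior (not the actual library-go code; the helper name and mount path are made up for illustration): if the mounted serving-cert secret has not been populated by service-ca yet, the process falls back to a self-signed certificate that in-cluster scrapers will not trust, so the target looks down even though the pod is healthy.

// Hypothetical sketch of "self-sign if no certs are found". Not library-go's
// implementation; it only illustrates the race described above.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"os"
	"time"
)

// loadOrSelfSign returns the serving cert from certFile/keyFile if both
// exist, and otherwise generates a throwaway self-signed certificate.
func loadOrSelfSign(certFile, keyFile string) (tls.Certificate, error) {
	if _, errCert := os.Stat(certFile); errCert == nil {
		if _, errKey := os.Stat(keyFile); errKey == nil {
			// service-ca won the race and populated the mounted secret.
			return tls.LoadX509KeyPair(certFile, keyFile)
		}
	}

	// Fallback: self-sign. Scrapers validating against the service-ca
	// bundle reject this cert, which eventually surfaces as TargetDown.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "localhost"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(24 * time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		IsCA:                  true,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

func main() {
	// Mount path is an assumption for illustration.
	cert, err := loadOrSelfSign("/var/run/secrets/serving-cert/tls.crt", "/var/run/secrets/serving-cert/tls.key")
	if err != nil {
		panic(err)
	}
	_ = cert // would feed tls.Config{Certificates: []tls.Certificate{cert}} for the metrics server
}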
Actually, there were a bunch of fixes to library-go's file observer. At least for console, which bumped its library-go dependency on the 10th after 4 months, we may want to check whether these failures appear in those jobs again, and perhaps focus on file-observer differences in the vendored library-go of their 4.3 branch.
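For reference, the file observer's role in this picture is roughly the following (a hand-rolled polling sketch under my own assumptions, not library-go's actual fileobserver API; the paths and the restart-on-change reaction are illustrative): watch the mounted cert files and trigger a restart when their content changes, so an operator that started up on the self-signed fallback picks up the real serving cert once service-ca writes it.

// Hypothetical polling-style file observer. Not library-go code; it only
// illustrates why missed file-change events would leave the bad cert in use.
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"time"
)

// hashFile returns a content hash, treating a missing file as empty.
func hashFile(path string) [32]byte {
	data, err := os.ReadFile(path)
	if err != nil {
		return [32]byte{}
	}
	return sha256.Sum256(data)
}

// observe polls the given paths and calls onChange when content changes.
func observe(paths []string, interval time.Duration, onChange func(path string)) {
	last := map[string][32]byte{}
	for _, p := range paths {
		last[p] = hashFile(p)
	}
	for {
		time.Sleep(interval)
		for _, p := range paths {
			if h := hashFile(p); h != last[p] {
				last[p] = h
				onChange(p)
			}
		}
	}
}

func main() {
	observe(
		[]string{"/var/run/secrets/serving-cert/tls.crt", "/var/run/secrets/serving-cert/tls.key"},
		10*time.Second,
		func(path string) {
			// In an operator this typically exits so the kubelet restarts the
			// pod with the freshly written certificate.
			fmt.Fprintf(os.Stderr, "observed change in %s, restarting\n", path)
			os.Exit(0)
		},
	)
}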
*** Bug 1820385 has been marked as a duplicate of this bug. ***
Found a BZ with the same symptoms, so this may still be an issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581