Bug 1779438 - Spurious TargetDown alerts for healthy pods
Summary: Spurious TargetDown alerts for healthy pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.4.0
Assignee: Standa Laznicka
QA Contact: scheng
URL:
Whiteboard:
Duplicates: 1820385
Depends On:
Blocks: 1836886 1836887
 
Reported: 2019-12-04 00:45 UTC by W. Trevor King
Modified: 2020-05-18 12:31 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: a race condition in the code that watches for file changes.
Consequence: when the mounted serving certificate appeared or changed, the change went unnoticed, so when metrics were scraped on the HTTPS endpoint, the serving certificate was not trusted by the scraper.
Fix: the race condition was removed.
Result: operators based on library-go should be able to reload the serving certificate correctly.
Clone Of:
Environment:
Last Closed: 2020-05-04 11:18:30 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift library-go pull 709 0 None closed Bug 1779438: fix race for controller commands 2021-02-04 05:03:49 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:18:49 UTC

Description W. Trevor King 2019-12-04 00:45:11 UTC
Examples from 4.3 promotion jobs [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:134]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1 had reported incorrect results: ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-kube-controller-manager-operator\", service=\"metrics\", severity=\"warning\"} => 1 @[1575338031.017]",
        },
    }
to be empty
...
failed: (7m2s) 2019-12-03T01:53:53 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

But [2] says the pod has been running since 1:20Z and is still running at gather time, so probably a metrics-gathering thing.  Similar issue in [3]:

  ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-console-operator\", service=\"metrics\", severity=\"warning\"}

despite a healthy console operator [4].  Hit this 15 times today (1% of all e2e failures) [5].  4 of those (6% of failures) were for 4.3 release jobs [6].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-6c984f44df-vf9bx/kube-controller-manager-operator-6c984f44df-vf9bx.yaml
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-console-operator/pods/console-operator-75548dd7b4-w9pg6/console-operator-75548dd7b4-w9pg6.yaml
[5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing
[6]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-.*4.3$&search=TargetDown.*firing
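For anyone checking this by hand, the e2e check boils down to the promQL query quoted above. Below is a minimal Go sketch of that query against the cluster's Prometheus API; PROM_URL and PROM_TOKEN are hypothetical environment variables (a reachable Prometheus URL and a bearer token with monitoring access), not anything the test itself uses.

// Minimal sketch (not the origin test code): query Prometheus for firing
// alerts other than Watchdog, the same check the e2e test performs.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL") // hypothetical, e.g. the prometheus-k8s route
	token := os.Getenv("PROM_TOKEN") // hypothetical bearer token with monitoring access

	query := `ALERTS{alertname!~"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"} >= 1`
	req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	// Skip verification of the router's cert purely for brevity in this sketch.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // an empty result vector means no unexpected firing alerts
}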

Comment 1 W. Trevor King 2019-12-04 00:46:48 UTC
Maciej was investigating.

Comment 2 Maciej Szulik 2019-12-05 21:20:55 UTC
Around the time when this failed I'm seeing a bunch of:

I1203 01:46:42.757020       1 log.go:172] http: TLS handshake error from 10.128.2.15:54968: remote error: tls: bad certificate
I1203 01:46:59.347908       1 log.go:172] http: TLS handshake error from 10.129.2.13:38406: remote error: tls: bad certificate

in the kcm-o logs. From what I've checked of the pods, these are Prometheus pods; is it possible that they are using the wrong certs?

On the other hand, I've checked other instances and it's not always kcm-o: I've also seen console-operator TargetDown, and others failing with an unknown error (1 had reported incorrect results: model.Vector{(*model.Sample)(0xc004126180)}).

With this in mind I'll move this back to monitoring for further investigation.
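To make the "bad certificate" errors concrete: the scrape fails when the operator's serving certificate does not chain to the service-ca bundle that Prometheus trusts. Here is a minimal Go sketch of that check, assuming hypothetical local copies of the pod's tls.crt and the injected service-ca.crt; it is not code from any of these components.

// Minimal sketch: check whether a serving certificate chains to the service-ca
// bundle. A self-signed fallback certificate fails this check, which is what
// the scraper's "tls: bad certificate" error amounts to.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	servingPEM, err := os.ReadFile("tls.crt") // serving cert mounted into the operator pod (hypothetical path)
	if err != nil {
		panic(err)
	}
	caPEM, err := os.ReadFile("service-ca.crt") // CA bundle injected by the service-ca operator (hypothetical path)
	if err != nil {
		panic(err)
	}

	block, _ := pem.Decode(servingPEM)
	if block == nil {
		panic("no PEM block in serving cert")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}

	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		panic("could not parse CA bundle")
	}

	if _, err := cert.Verify(x509.VerifyOptions{Roots: roots}); err != nil {
		fmt.Println("serving cert NOT trusted by the service CA:", err)
		return
	}
	fmt.Println("serving cert chains to the service CA")
}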

Comment 6 W. Trevor King 2020-02-11 23:21:44 UTC
(In reply to W. Trevor King from comment #0)
> [5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing

That^ shows three of these just in the past 24h [1,2,3].  In all three cases it was the same openshift-console-operator alert I mentioned in comment 0.  So still not a high flake rate, but certainly something that we can actually reproduce, right?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/928
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/104/pull-ci-openshift-service-ca-operator-release-4.3-e2e-aws/8
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.4/795

Comment 9 Standa Laznicka 2020-02-12 15:23:32 UTC
The bug is caused by the self-sign-if-no-certs-found logic of library-go's controller command default behavior - https://github.com/openshift/cluster-authentication-operator/blob/fe70139ca94a0078f1b6a847ff9c9a366c4db0f4/vendor/github.com/openshift/library-go/pkg/controller/controllercmd/cmd.go#L195-L199

Apparently the service-ca controllers usually win the race against the creation of the console operator's pod and populate the secret the operator mounts before it starts. However, if they lose, self-signed certs are used for serving instead, and the metrics collector (naturally) does not trust those.

I'll try to create a fix for this in library-go to avoid any magic.
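For illustration, the fallback pattern described above looks roughly like the following. This is a simplified sketch, not library-go's actual code; the file paths and function name are made up. The point is the branching: if the mounted secret has not been populated yet when the process starts, the self-signed path wins and nothing outside the pod trusts the resulting certificate.

// Illustrative sketch of the self-sign-if-no-certs-found fallback (simplified,
// not library-go's code): if no serving cert/key is present at startup, serve
// with a freshly generated self-signed certificate.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"os"
	"time"
)

func serverTLSConfig(certFile, keyFile string) (*tls.Config, error) {
	if _, err := os.Stat(certFile); err == nil {
		// Normal path: use the service-ca-signed cert mounted into the pod.
		cert, err := tls.LoadX509KeyPair(certFile, keyFile)
		if err != nil {
			return nil, err
		}
		return &tls.Config{Certificates: []tls.Certificate{cert}}, nil
	}

	// Fallback path: no mounted cert yet, so self-sign one. Nothing outside
	// the pod trusts this certificate, so metrics scrapes fail TLS verification.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "localhost"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		IsCA:         true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return &tls.Config{Certificates: []tls.Certificate{{Certificate: [][]byte{der}, PrivateKey: key}}}, nil
}

func main() {
	cfg, err := serverTLSConfig("tls.crt", "tls.key") // hypothetical mount paths
	if err != nil {
		panic(err)
	}
	fmt.Println("serving with", len(cfg.Certificates), "certificate(s)")
}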

Comment 10 Standa Laznicka 2020-02-12 16:40:28 UTC
Actually, there were a bunch of fixes to library-go's file observer. The console operator bumped its library-go dependency on the 10th, after 4 months, so at least for console we may want to check whether these failures still appear in those jobs, and perhaps focus on the file-observer differences in the vendored library-go of their 4.3 branch.
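For context, the file-observation pattern in question looks roughly like the polling sketch below (my own simplification, not library-go's observer; paths and names are hypothetical). The race described in this bug lives in this kind of code: if the snapshot of the files' state is taken at the wrong moment relative to the comparisons, the appearance or change of the mounted cert can be missed and the old or self-signed certificate keeps being served.

// Rough sketch of a polling file observer: hash the mounted cert files and
// react when their content appears or changes, typically by exiting so the
// process restarts with the new certificate.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"time"
)

// hashFile returns a content hash, or nil if the file does not exist yet.
func hashFile(path string) []byte {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil
	}
	sum := sha256.Sum256(data)
	return sum[:]
}

func watchFiles(paths []string, interval time.Duration, onChange func(path string)) {
	// Snapshot the initial state once, before the loop; taking it lazily or
	// concurrently with the first tick is where a change can be missed.
	seen := make(map[string][]byte, len(paths))
	for _, p := range paths {
		seen[p] = hashFile(p)
	}
	for {
		time.Sleep(interval)
		for _, p := range paths {
			if current := hashFile(p); !bytes.Equal(current, seen[p]) {
				seen[p] = current
				onChange(p)
			}
		}
	}
}

func main() {
	// Hypothetical mount paths for the serving cert secret.
	go watchFiles([]string{"tls.crt", "tls.key"}, 5*time.Second, func(p string) {
		fmt.Println("observed change in", p, "- exiting so the serving cert is reloaded")
		os.Exit(0)
	})
	select {} // block forever; the observer goroutine drives the reload
}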

Comment 11 Standa Laznicka 2020-04-06 08:13:45 UTC
*** Bug 1820385 has been marked as a duplicate of this bug. ***

Comment 12 Standa Laznicka 2020-04-06 08:42:07 UTC
Found a BZ with the same symptoms, so this may still be an issue.

Comment 17 errata-xmlrpc 2020-05-04 11:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

