Bug 1779438

Summary: Spurious TargetDown alerts for healthy pods
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: apiserver-auth
Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA
QA Contact: scheng
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: alegrand, anpicker, aos-bugs, bparees, erooth, kakkoyun, lcosic, mfojtik, mloibl, pkrupa, qiwan, slaznick, surbania
Target Milestone: ---
Keywords: Reopened
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: a race condition in the code watching for changes in files.
Consequence: when the mounted serving certificate changed or appeared, this was not noticed, so when metrics were scraped on the HTTPS endpoint, the serving certificate was not trusted by the scraper.
Fix: removed the race condition.
Result: operators based on library-go should reload the serving certificate correctly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 11:18:30 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1836886, 1836887    

Description W. Trevor King 2019-12-04 00:45:11 UTC
Examples from 4.3 promotion jobs [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:134]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1\",alertstate=\"firing\"} >= 1 had reported incorrect results: ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-kube-controller-manager-operator\", service=\"metrics\", severity=\"warning\"} => 1 @[1575338031.017]",
        },
    }
to be empty
...
failed: (7m2s) 2019-12-03T01:53:53 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

But [2] says the pod has been running since 1:20Z and is still running at gather time, so probably a metrics-gathering thing.  Similar issue in [3]:

  ALERTS{alertname=\"TargetDown\", alertstate=\"firing\", job=\"metrics\", namespace=\"openshift-console-operator\", service=\"metrics\", severity=\"warning\"}

despite a healthy console operator [4].  Hit this 15 times today (1% of all e2e failures) [5].  4 of those (6% of failures) were for 4.3 release jobs [6].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/205/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-6c984f44df-vf9bx/kube-controller-manager-operator-6c984f44df-vf9bx.yaml
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/209/artifacts/e2e-gcp/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-console-operator/pods/console-operator-75548dd7b4-w9pg6/console-operator-75548dd7b4-w9pg6.yaml
[5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing
[6]: https://search.svc.ci.openshift.org/chart?name=release-openshift-ocp-installer-.*4.3$&search=TargetDown.*firing
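
For anyone who wants to re-run the check by hand, the following is a minimal sketch of querying the in-cluster Prometheus for the same expression the e2e test asserts is empty. The route URL and bearer token below are placeholders, not values from these jobs:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// bearerTransport injects the token on every request to Prometheus.
type bearerTransport struct {
	token string
	rt    http.RoundTripper
}

func (b *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Set("Authorization", "Bearer "+b.token)
	return b.rt.RoundTrip(req)
}

func main() {
	promURL := "https://prometheus-k8s-openshift-monitoring.apps.example.com" // placeholder
	token := "sha256~REDACTED"                                                // placeholder

	client, err := api.NewClient(api.Config{
		Address: promURL,
		RoundTripper: &bearerTransport{
			token: token,
			// InsecureSkipVerify only because this is a throwaway debug sketch.
			rt: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// The same expression the e2e test asserts is empty.
	query := `ALERTS{alertname!~"Watchdog|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"} >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Println(result) // expect an empty vector on a healthy cluster
}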

Comment 1 W. Trevor King 2019-12-04 00:46:48 UTC
Maciej was investigating.

Comment 2 Maciej Szulik 2019-12-05 21:20:55 UTC
Around the time this failed, I'm seeing a bunch of:

I1203 01:46:42.757020       1 log.go:172] http: TLS handshake error from 10.128.2.15:54968: remote error: tls: bad certificate
I1203 01:46:59.347908       1 log.go:172] http: TLS handshake error from 10.129.2.13:38406: remote error: tls: bad certificate

in the kcm-o logs. From what I can tell by checking the pods, these are Prometheus pods; is it possible that they are using the wrong certs?
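
To double-check which pods own those source IPs, something like `oc get pods -A -o wide | grep 10.128.2.15` works; the sketch below (assuming a kubeconfig at ~/.kube/config and a recent client-go) does the same match programmatically:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Source IPs copied from the kcm-o log lines above.
	suspects := map[string]bool{"10.128.2.15": true, "10.129.2.13": true}

	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List pods in all namespaces and match on their pod IPs.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		if suspects[p.Status.PodIP] {
			fmt.Printf("%s -> %s/%s\n", p.Status.PodIP, p.Namespace, p.Name)
		}
	}
}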

On the other hand, I've checked other instances and it's not always kcm-o: I've seen console-operator TargetDown, and others with an unknown error (1 had reported incorrect results: model.Vector{(*model.Sample)(0xc004126180)}).

With this in mind I'll move this back to monitoring for further investigation.

Comment 6 W. Trevor King 2020-02-11 23:21:44 UTC
(In reply to W. Trevor King from comment #0)
> [5]: https://search.svc.ci.openshift.org/chart?search=TargetDown.*firing

That^ shows three of these just in the past 24h [1,2,3].  In all three cases it was the same openshift-console-operator alert I mentioned in comment 0.  So still not a high flake rate, but certainly something that we can actually reproduce, right?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/928
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_service-ca-operator/104/pull-ci-openshift-service-ca-operator-release-4.3-e2e-aws/8
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.4/795

Comment 9 Standa Laznicka 2020-02-12 15:23:32 UTC
The bug is caused by the self-sign-if-no-certs-found logic of library-go's controller command default behavior - https://github.com/openshift/cluster-authentication-operator/blob/fe70139ca94a0078f1b6a847ff9c9a366c4db0f4/vendor/github.com/openshift/library-go/pkg/controller/controllercmd/cmd.go#L195-L199

Apparently the service-ca controllers usually win the race against the console operator's pod creation and have already created the content of the secret the operator mounts. However, if they lose, self-signed certs are used for serving instead, and the metrics collector (naturally) does not trust those.
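
For illustration only (this is not the linked library-go code, just a sketch of the fallback pattern it implements): if the mounted serving cert and key are not on disk when the process starts, a self-signed certificate gets generated and served, and Prometheus, which only trusts the service-ca-signed certificate, rejects the endpoint with exactly the kind of "bad certificate" handshake errors seen in comment 2:

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"net/http"
	"os"
	"time"
)

// loadOrSelfSign mimics the fallback: prefer the mounted serving cert, but if
// it is not on disk yet (e.g. the service-ca controllers lost the race),
// generate a throwaway self-signed certificate so HTTPS can still come up.
func loadOrSelfSign(certFile, keyFile string) (tls.Certificate, error) {
	if _, errCert := os.Stat(certFile); errCert == nil {
		if _, errKey := os.Stat(keyFile); errKey == nil {
			return tls.LoadX509KeyPair(certFile, keyFile)
		}
	}

	// Fallback path: the scraper only trusts the service-ca CA, so anything
	// served from here fails with "remote error: tls: bad certificate".
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "self-signed-fallback"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return tls.Certificate{}, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return tls.X509KeyPair(certPEM, keyPEM)
}

func main() {
	// Paths are placeholders for wherever the serving-cert secret is mounted.
	cert, err := loadOrSelfSign("/var/run/secrets/serving-cert/tls.crt",
		"/var/run/secrets/serving-cert/tls.key")
	if err != nil {
		log.Fatal(err)
	}
	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	log.Fatal(server.ListenAndServeTLS("", "")) // metrics endpoint stand-in
}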

I'll try to create a fix for this in library-go to avoid any magic.

Comment 10 Standa Laznicka 2020-02-12 16:40:28 UTC
Actually, there were a bunch of fixes to library-go's file observer. So, at least for console, which bumped its library-go dependency on the 10th after 4 months, we may want to check whether these failures still appear in those jobs, and perhaps focus on file observer differences in the 4.3 branch of their repo's vendor directory.
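
For context, the class of race the Doc Text describes ("race condition in the code watching changes in files") can be pictured with a toy observer like the one below. This is an illustration only, not the actual library-go file observer:

package main

import (
	"bytes"
	"fmt"
	"os"
	"time"
)

// watch polls path once a second and calls onChange when the content differs
// from the last snapshot it took. The bug-shaped part: the baseline is taken
// on the first poll rather than at registration time, so a certificate that
// appears in the window between registration and that first poll is silently
// absorbed as the baseline and never triggers onChange.
func watch(path string, onChange func()) {
	go func() {
		var last []byte
		initialized := false
		for {
			current, _ := os.ReadFile(path) // nil while the file does not exist yet
			if !initialized {
				last = current // whatever is there *now* becomes the baseline
				initialized = true
			} else if !bytes.Equal(current, last) {
				last = current
				onChange()
			}
			time.Sleep(time.Second)
		}
	}()
}

func main() {
	watch("/tmp/tls.crt", func() { fmt.Println("serving cert changed, reloading") })
	// If /tmp/tls.crt is created right after watch() returns but before the
	// first poll, the reload callback never fires.
	time.Sleep(10 * time.Second)
}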

Comment 11 Standa Laznicka 2020-04-06 08:13:45 UTC
*** Bug 1820385 has been marked as a duplicate of this bug. ***

Comment 12 Standa Laznicka 2020-04-06 08:42:07 UTC
Found a BZ with the same symptoms, so this may still be an issue.

Comment 17 errata-xmlrpc 2020-05-04 11:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581