1950430 – CVO serves metrics over HTTP, despite a lack of consumers

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	W. Trevor King
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-16 16:21 UTC by W. Trevor King
Modified:	2021-07-27 23:01 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:01:29 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 481	0	None	open	Bug 1950430: pkg/cvo/metrics: Drop HTTP, require HTTPS for metrics access	2021-04-16 16:32:06 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:01:41 UTC

Description W. Trevor King 2021-04-16 16:21:19 UTC

We began serving metrics over HTTPS with 6132bc3 (cvo#358), which also requested that monitoring scrape us over HTTPS.  Now that all of that is in place in 4.6, we no longer need to serve over HTTP in 4.7 and later.  We should drop HTTP and serve metrics only over HTTPS.

This might also fix recent occurrences of ClusterVersionOperatorDown, where Prometheus is failing to scrape some CVO pods, despite those pods claiming to be serving metrics.  For example [1] has:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	1h15m24s
  [sig-arch] Check if alerts are firing during or after upgrade success	1h15m13s
    Apr 15 17:03:27.273: Unexpected alerts fired or pending during the upgrade:

    alert ClusterVersionOperatorDown fired for 2280 seconds with labels: {severity="critical"}
    alert TargetDown fired for 1770 seconds with labels: {job="cluster-version-operator", namespace="openshift-cluster-version", service="cluster-version-operator", severity="warning"}

The current suspicion is that the issue is a bug or incompatibility between the scraper's cmux and our metrics server's cmux; going HTTPS-only would remove our server's cmux.  Recent frequency:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterVersionOperatorDown+fired' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 76% failed, 8% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-azure (all) - 11 runs, 64% failed, 14% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
pull-ci-openshift-cluster-config-operator-master-e2e-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
...
pull-ci-openshift-router-master-e2e-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 16 runs, 63% failed, 10% of failures match = 6% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168

Comment 2 Johnny Liu 2021-04-21 09:59:19 UTC

Verified this bug with 4.8.0-0.nightly-2021-04-20-195442; it passes.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version rsh cluster-version-operator-68f556c79-n6lm8
sh-4.4# cat /proc/1/cmdline 
/usr/bin/cluster-version-operatorstart--release-image=registry.ci.openshift.org/ocp/release@sha256:a4c4f8797512b71c84ee9899fbbec6713db67d490e9b82596888f23dd1ccd9a4--enable-auto-update=false--enable-default-cluster-version=true--listen=0.0.0.0:9099--serving-cert-file=/etc/tls/serving-cert/tls.crt--serving-key-file=/etc/tls/serving-cert/tls.key--v=5

sh-4.4# /usr/bin/cluster-version-operator start --help
Starts Cluster Version Operator
<--snip-->
      --serving-cert-file string         The X.509 certificate file for serving metrics over HTTPS.  You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.
      --serving-key-file string          The X.509 key file for serving metrics over HTTPS.  You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.
<--snip-->

sh-4.4# /usr/bin/cluster-version-operator start --release-image=registry.ci.openshift.org/ocp/release@sha256:a4c4f8797512b71c84ee9899fbbec6713db67d490e9b82596888f23dd1ccd9a4 --enable-auto-update=false --enable-default-cluster-version=true --listen=0.0.0.0:9099 --v=5
I0421 09:45:40.940525      49 start.go:21] ClusterVersionOperator 4.8.0-202104201727.p0-6fdd1e0
F0421 09:45:40.940609      49 start.go:24] error: --listen was not set empty, so --serving-cert-file must be set
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00045c000, 0x71, 0xb4)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x27910e0, 0xc000000003, 0x0, 0x0, 0xc000590540, 0x202645e, 0x8, 0x18, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:975 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x27910e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1a5ceae, 0x9, 0xc000136310, 0x1, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:750 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1502
main.init.3.func1(0xc000454000, 0xc00042e140, 0x0, 0x5)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc000454000, 0xc00042e0f0, 0x5, 0x5, 0xc000454000, 0xc00042e0f0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x277d600, 0xc000000180, 0xc00005c740, 0x46eb45)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x27910e0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:417 +0xdf
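
The fatal error above comes from a start-up check that ties --listen to the serving certificate flags. A minimal Go sketch of that kind of check follows. It is not the operator's actual code: the `options` type, its field names, and the `validate` method are hypothetical, and the key-file message is an assumption modelled on the cert-file message in the log.

```go
package main

import (
	"errors"
	"fmt"
)

// options is a hypothetical stand-in for the operator's start options;
// the real struct and field names may differ.
type options struct {
	ListenAddr      string // value of --listen
	ServingCertFile string // value of --serving-cert-file
	ServingKeyFile  string // value of --serving-key-file
}

// validate rejects a non-empty listen address that lacks TLS material,
// so the metrics endpoint can never fall back to plain HTTP.
func (o options) validate() error {
	if o.ListenAddr == "" {
		// Metrics serving is disabled, so no certificate is needed.
		return nil
	}
	if o.ServingCertFile == "" {
		return errors.New("--listen was not set empty, so --serving-cert-file must be set")
	}
	if o.ServingKeyFile == "" {
		// Assumed wording; only the cert-file message appears in the log above.
		return errors.New("--listen was not set empty, so --serving-key-file must be set")
	}
	return nil
}

func main() {
	// Mirrors the manual invocation above: --listen set, no cert or key.
	o := options{ListenAddr: "0.0.0.0:9099"}
	fmt.Println(o.validate())
}
```

Failing at start-up, rather than quietly serving HTTP, makes a misconfigured deployment obvious immediately.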

sh-4.4# curl 127.0.0.1:9099
Client sent an HTTP request to an HTTPS server.

Comment 3 W. Trevor King 2021-04-21 23:58:21 UTC

Prometheus scraping is looking a lot better too:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterVersionOperatorDown+fired' | grep 'failures match' | sort
periodic-ci-openshift-release-master-okd-4.8-e2e-vsphere (all) - 11 runs, 100% failed, 9% of failures match = 9% impact

Comment 6 errata-xmlrpc 2021-07-27 23:01:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438