We began serving metrics over HTTPS with 6132bc3 (cvo#358), which also requested monitoring to scrape us over HTTPS. Now that that is all in place in 4.6, we no longer need to serve over HTTP in 4.7 and later. We should drop HTTP and only serve metrics over HTTPS.

This might also fix recent occurrences of ClusterVersionOperatorDown, where Prometheus is failing to scrape some CVO pods, despite those pods claiming to be serving metrics. For example [1] has:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 1h15m24s
  [sig-arch] Check if alerts are firing during or after upgrade success 1h15m13s
    Apr 15 17:03:27.273: Unexpected alerts fired or pending during the upgrade:

    alert ClusterVersionOperatorDown fired for 2280 seconds with labels: {severity="critical"}
    alert TargetDown fired for 1770 seconds with labels: {job="cluster-version-operator", namespace="openshift-cluster-version", service="cluster-version-operator", severity="warning"}

The current suspicion is that the issue is a bug or incompatibility between the scraper's cmux and our metric server's cmux, and going to HTTPS-only would remove our server's cmux.
Recent frequency:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterVersionOperatorDown+fired' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 76% failed, 8% of failures match = 6% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-azure (all) - 11 runs, 64% failed, 14% of failures match = 9% impact
  periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
  pull-ci-openshift-cluster-config-operator-master-e2e-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
  ...
  pull-ci-openshift-router-master-e2e-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  release-openshift-okd-installer-e2e-aws-upgrade (all) - 16 runs, 63% failed, 10% of failures match = 6% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168
Verified this bug with 4.8.0-0.nightly-2021-04-20-195442, and PASS.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version rsh cluster-version-operator-68f556c79-n6lm8
sh-4.4# cat /proc/1/cmdline
/usr/bin/cluster-version-operatorstart--release-image=registry.ci.openshift.org/ocp/release@sha256:a4c4f8797512b71c84ee9899fbbec6713db67d490e9b82596888f23dd1ccd9a4--enable-auto-update=false--enable-default-cluster-version=true--listen=0.0.0.0:9099--serving-cert-file=/etc/tls/serving-cert/tls.crt--serving-key-file=/etc/tls/serving-cert/tls.key--v=5
sh-4.4# /usr/bin/cluster-version-operator start --help
Starts Cluster Version Operator
<--snip-->
      --serving-cert-file string   The X.509 certificate file for serving metrics over HTTPS. You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.
      --serving-key-file string    The X.509 key file for serving metrics over HTTPS. You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.
<--snip-->
sh-4.4# /usr/bin/cluster-version-operator start --release-image=registry.ci.openshift.org/ocp/release@sha256:a4c4f8797512b71c84ee9899fbbec6713db67d490e9b82596888f23dd1ccd9a4 --enable-auto-update=false --enable-default-cluster-version=true --listen=0.0.0.0:9099 --v=5
I0421 09:45:40.940525 49 start.go:21] ClusterVersionOperator 4.8.0-202104201727.p0-6fdd1e0
F0421 09:45:40.940609 49 start.go:24] error: --listen was not set empty, so --serving-cert-file must be set
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00045c000, 0x71, 0xb4)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x27910e0, 0xc000000003, 0x0, 0x0, 0xc000590540, 0x202645e, 0x8, 0x18, 0x0)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:975 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x27910e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1a5ceae, 0x9, 0xc000136310, 0x1, ...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:750 +0x191
k8s.io/klog/v2.Fatalf(...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1502
main.init.3.func1(0xc000454000, 0xc00042e140, 0x0, 0x5)
    /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc000454000, 0xc00042e0f0, 0x5, 0x5, 0xc000454000, 0xc00042e0f0)
    /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x277d600, 0xc000000180, 0xc00005c740, 0x46eb45)
    /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
    /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x27910e0)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:417 +0xdf
sh-4.4# curl 127.0.0.1:9099
Client sent an HTTP request to an HTTPS server.
Looking a lot better vs. the Prom scraping too:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterVersionOperatorDown+fired' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-okd-4.8-e2e-vsphere (all) - 11 runs, 100% failed, 9% of failures match = 9% impact
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438