Bug 1801300
Summary: cluster-autoscaler metrics collection breaks after upgrade to 4.3
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: Jaspreet Kaur <jkaur>
Assignee: Vadim Rutkovsky <vrutkovs>
QA Contact: Jianwei Hou <jhou>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.2.0
Target Release: 4.4.0
Keywords: Upgrades
Hardware: Unspecified
OS: Unspecified
CC: alegrand, anpicker, aos-bugs, aprajapa, bleanhar, brad.williams, cblecker, erooth, jokerman, kakkoyun, lcosic, lmohanty, mharri, mloibl, mmasters, mrhodes, mtleilia, mwoodson, pkrupa, sdodson, syangsao, vrutkovs, wking, zhsun
Doc Type: Bug Fix
Doc Text:
    Cause: CVO didn't apply containerPort changes.
    Consequence: The new port for ClusterAutoscaler metrics was not applied to the deployment.
    Fix: CVO now applies containerPort changes as expected.
    Result: The ClusterAutoscaler deployment reports metrics on the new port.
Clones: 1802710 (view as bug list)
Bug Blocks: 1802710, 1803258
Type: Bug
Last Closed: 2020-05-13 21:57:34 UTC
Description (Jaspreet Kaur, 2020-02-10 15:27:12 UTC)
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1776725 and https://github.com/openshift/cluster-version-operator/pull/272

We are seeing this in OpenShift Dedicated (OSD) as well when upgrading from 4.2 to 4.3.0 (we are using 4.3.0). With help from Michael McCune, we saw some odd setup of the metrics port in the deployment:

------------------------------------------------------------------------------------------
# oc get deployments cluster-autoscaler-operator -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
...
        - name: WEBHOOKS_CERT_DIR
          value: /etc/cluster-autoscaler-operator/tls
        - name: WEBHOOKS_PORT
          value: "8443"
        - name: METRICS_PORT
          value: "9191"
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:702706ac38dfdd0baacaa858d40e0bb58e0c612fa58167a998f55882e7d9538c
        imagePullPolicy: IfNotPresent
        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
...
------------------------------------------------------------------------------------------

The METRICS_PORT environment variable is set to 9191, but the port labeled "metrics" is listening on 8080.

I have confirmed that we are seeing this in OpenShift Starter as well...

# oc get deployments cluster-autoscaler-operator -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "13"
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2019-10-25T16:21:05Z"
  generation: 13
  labels:
    k8s-app: cluster-autoscaler-operator
...
        - name: WEBHOOKS_CERT_DIR
          value: /etc/cluster-autoscaler-operator/tls
        - name: WEBHOOKS_PORT
          value: "8443"
        - name: METRICS_PORT
          value: "9191"
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb12244ffb6df6668be9aee4484bdcd39f8cf1c7c94c56def1138ba24b91e03
        imagePullPolicy: IfNotPresent
        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 20m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/cluster-autoscaler-operator/tls/service-ca
          name: ca-cert
          readOnly: true
        - mountPath: /etc/cluster-autoscaler-operator/tls
          name: cert
          readOnly: true
      - args:
        - --secure-listen-address=0.0.0.0:9192
        - --upstream=http://127.0.0.1:9191/
        - --tls-cert-file=/etc/tls/private/tls.crt
        - --tls-private-key-file=/etc/tls/private/tls.key
        - --config-file=/etc/kube-rbac-proxy/config-file.yaml
        - --logtostderr=true
        - --v=10
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b32fbe4ffef894aef53eb8fff509c2bf1ec1347f1ae229fa95ae615a2e514a48
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 9192
          name: metrics
          protocol: TCP
        resources: {}
...

Between 4.2 and 4.3, the "metrics" port on the "cluster-autoscaler-operator" container was deleted, and the "metrics" port on the "kube-rbac-proxy" container was added[1]. Comment 8 shows that the same deployment has both "metrics" ports defined in the same pod spec.

[1]: https://github.com/openshift/cluster-autoscaler-operator/compare/release-4.2...openshift:release-4.3#diff-b04b9fb0b15ef843b20a21cf9b14d3ddR35-L54

As Abhinav noted in https://bugzilla.redhat.com/show_bug.cgi?id=1802248#c4, it looks like bug 1802248 (concerning the ingress operator) and this bug (concerning the autoscaler operator) have fundamentally the same cause.

Just verified. Deleting the deployment and having CVO recreate it works.
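The mixed 4.2/4.3 port lists reported above are what you would see if container ports were merged by key (a union of old and new) instead of the manifest's list replacing the live one. A minimal sketch of that failure mode; the helper names are hypothetical and this is not the actual CVO resourcemerge code:

```python
def merge_ports_by_key(existing, required):
    """Buggy union-style merge: ports removed from the manifest survive."""
    merged = {(p.get("name"), p["containerPort"]): p for p in existing}
    for p in required:
        merged[(p.get("name"), p["containerPort"])] = p
    return list(merged.values())

def replace_ports(existing, required):
    """Fixed behavior: the manifest's port list wins outright."""
    return list(required)

# 4.2 live state vs. the 4.3 manifest for the cluster-autoscaler-operator
# container (values taken from the deployments quoted in this bug):
live = [{"containerPort": 8443}, {"containerPort": 8080, "name": "metrics"}]
manifest = [{"containerPort": 8443}]

buggy = merge_ports_by_key(live, manifest)
fixed = replace_ports(live, manifest)
# The union-style merge leaves the orphaned 8080 "metrics" port in place,
# which then collides with the new "metrics" port on kube-rbac-proxy.
```

Under this model the live pod ends up with two "metrics" ports in different containers, matching the symptom in the quoted deployments.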
oc delete deployment cluster-autoscaler-operator -n openshift-machine-api

This will remove the messed-up deployment, and CVO will redeploy a clean one. This clears all the alerts.

Is this a dup of bug 1783221 (4.4, VERIFIED), bug 1798049 (4.3, POST), and bug 1800346 (4.2, POST) about the CVO exploding when you remove a port?

Resetting priority/severity since my browser was "helpfully" remembering my values through a bug refresh and clobbered the updates...

@Trevor it looks like there was a large change between 4.2 and 4.3 for the cluster-autoscaler-operator:

4.2: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.2/install/07_deployment.yaml
4.3: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.3/install/07_deployment.yaml

The main thing to see is that the deployment for that operator had the explicit metrics port removed. I am not sure yet whether this is a problem that needs to be fixed; I am still learning about the change. I don't know if the port was removed because of a different change in the way metrics are scraped, or if this is a regression.

Looks like [1] (in 4.3 but not 4.2) dropped metrics from the 'ports' list (where it had been 8080) and added a 9191 -> 9192 translator to serve metrics over HTTPS. But the removed port entry is at the end of the list, so I don't expect the bug 1798049 panic to fire here. Do we have a CI job or must-gather from a cluster that exhibits this bug's breakage?

[1]: https://github.com/openshift/cluster-autoscaler-operator/commit/cebc1b062c872a136109f1916e4b25c0f7ef5ebe

(In reply to Matt Woodson from comment #13)
> Just verified. Deleting the deployment and having CVO recreate it works.
>
> oc delete deployment cluster-autoscaler-operator -n openshift-machine-api
>
> This will remove the messed up deployment, and CVO will redeploy a clean
> one. This clears all the alerts.

Just confirmed this as well on my cluster. Thanks Matt!

(In reply to W. Trevor King from comment #17)
> Looks like [1] (in 4.3 but not 4.2) dropped metrics from the 'ports' list
> (where it had been 8080) and added a 9191 -> 9192 translator to serve
> metrics over HTTPS. But the removed port entry is from the end of the list,

Ahh, OK. I was wondering about the 9192 metrics port; I thought it was just for that rbac-proxy container, but the HTTPS passthrough makes sense. Appreciate the explanation.

> so I don't expect the bug 1798049 panic to fire here. Do we have a CI job
> or must-gather from a cluster that exhibits this bug's breakage?

AFAIK we just have a handful of reported cases; not sure about CI.

Can anyone who hit this in their cluster attach CVO logs (openshift-cluster-version namespace)? Even if you've since deleted the deployment to unstick the CVO, CVO logs would still be helpful.

Created attachment 1662814 [details]
cluster version operator logs

Attached from my cluster for comment#20

Created attachment 1662816 [details]
CVO Logs for us-west-1 Starter Cluster
Attached is the CVO log for a starter cluster exhibiting both alerts (ingress and autoscaler).
(In reply to brad.williams from comment #22)
> Attached is the CVO log for a starter cluster exhibiting both alerts
> (ingress and autoscaler).

You're still seeing the alerts?

$ grep -i 'deployment.*autoscaler\|Running sync.*in state\|Result of work' cvo-logs.txt
I0212 21:36:31.946823 1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Updating at attempt 0
I0212 21:36:54.274468 1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:36:54.669223 1 request.go:538] Throttling request took 394.433696ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.069229 1 request.go:538] Throttling request took 396.448388ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.469224 1 request.go:538] Throttling request took 393.248658ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.472887 1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:41:04.273997 1 task_graph.go:611] Result of work: []
I0212 21:43:56.799939 1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Reconciling at attempt 0
I0212 21:44:03.958452 1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:44:04.053339 1 request.go:538] Throttling request took 94.542211ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.153347 1 request.go:538] Throttling request took 91.801527ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.253324 1 request.go:538] Throttling request took 94.908989ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.256794 1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:44:30.214196 1 task_graph.go:611] Result of work: []
...
I0212 22:23:52.222330 1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Reconciling at attempt 0
I0212 22:23:59.379809 1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 22:23:59.474710 1 request.go:538] Throttling request took 94.553772ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.574758 1 request.go:538] Throttling request took 96.192704ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.674712 1 request.go:538] Throttling request took 95.647392ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.678543 1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 22:24:25.631467 1 task_graph.go:611] Result of work: []

That is a single Updating round after the CVO came up, followed by a bunch of Reconciling rounds. All of them are successful. The throttling shows the CVO attempting to PUT the Deployment, so it must think the manifest is dirty (we really should log why we think a manifest is dirty), but then the CVO very quickly decides that the "updated" Deployment is satisfactory and continues through the rest of the sync round.

*** Bug 1801960 has been marked as a duplicate of this bug.
***

The more I am looking at this, the more confused I am becoming. The deployment manifests that are being posted here seem to contain a mix of 4.2 and 4.3 elements, and I am unsure how this is happening. There is a specific commit where the exposed "metrics" port on 8080 is removed; it should not exist in these deployments. Before that commit there is no kube-rbac-proxy container, so it seems odd that we have both in these manifests. As others have noted, there are now 2 ports labeled "metrics" for the cluster-autoscaler-operator, and apparently the ServiceMonitor object is directing the service monitor to use the incorrect port (although it should not exist in these deployments).

(In reply to W. Trevor King from comment #23)
> You're still seeing the alerts?
> [...]
> That is a single Updating round after the CVO came up, followed by a bunch
> of Reconciling rounds. All of them are successful. The throttling shows
> the CVO attempting to PUT the Deployment, so it must think the manifest is
> dirty (we really should log why we think a manifest is dirty), but then the
> CVO very quickly decides that the "updated" Deployment is satisfactory and
> continues through the rest of the sync round.
I just re-ran our verification steps and these alerts are still present:

ERROR [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:201 check_prometheus_alerts] - Critical alert is firing: {'labels': {'alertname': 'ClusterAutoscalerOperatorDown', 'severity': 'critical'}, 'annotations': {'message': 'cluster-autoscaler-operator has disappeared from Prometheus target discovery.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:11.800595892Z', 'value': '1e+00'}
WARNING [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:203 check_prometheus_alerts] - Alert is firing: {'labels': {'alertname': 'TargetDown', 'job': 'metrics', 'namespace': 'openshift-ingress-operator', 'service': 'metrics', 'severity': 'warning'}, 'annotations': {'message': '100% of the metrics targets in openshift-ingress-operator namespace are down.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:00.163677339Z', 'value': '1e+02'}
WARNING [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:203 check_prometheus_alerts] - Alert is firing: {'labels': {'alertname': 'TargetDown', 'job': 'cluster-autoscaler-operator', 'namespace': 'openshift-machine-api', 'service': 'cluster-autoscaler-operator', 'severity': 'warning'}, 'annotations': {'message': '100% of the cluster-autoscaler-operator targets in openshift-machine-api namespace are down.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:30.163677339Z', 'value': '1e+02'}

CVO is incorrectly applying port modifications.

In 4.2, cluster-autoscaler-operator has 2 ports - https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.2/install/07_deployment.yaml#L49-L54:

      - name: cluster-autoscaler-operator
        ...
        ports:
        - containerPort: 8443
        - name: metrics
          containerPort: 8080

In 4.3, the metrics port is moved to a different container - https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.3/install/07_deployment.yaml#L78-L79:

      - name: kube-rbac-proxy
        ...
        ports:
        - containerPort: 9192
          name: metrics
          protocol: TCP
      ...
      - name: cluster-autoscaler-operator
        ...
        ports:
        - containerPort: 8443

However, CVO applies both in the resulting deployment:

    spec:
      containers:
      - args:
        ...
        ports:
        - containerPort: 8443
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
      - args:
        ...
        name: kube-rbac-proxy
        ports:
        - containerPort: 9192
          name: metrics
          protocol: TCP

That causes the service to scrape metrics incorrectly, as the pod has 2 named ports in different containers.

This is not an upgrade blocker, though, as:
a) it can be fixed manually
b) no data loss occurs - cluster autoscaler metrics are not being collected, but existing ones are not being destroyed

Trying to fill in our new impact-statement template:
# What kind of clusters are impacted because of the bug?
All clusters upgrading from 4.2 to 4.3, because the autoscaler and machine-API operator deployments both removed ports from a container template [1].
# What cluster functionality is degraded while hitting the bug?
In both cases, the metrics port was removed. So no data is lost, but new metrics gathering from these operators will be broken until the bug is fixed.
# Does this bug result in data loss?
No. But as above, for the duration of the buggy condition, new metrics will not be collected from the two affected operators.
# Is it possible to recover the cluster from the bug?
Yes. Two ways:
a. Manually 'oc delete ...' the affected Deployment. The CVO will push a replacement with the correct ports, fixing the buggy condition. The brief outage in operator availability while the Deployment is replaced should have no adverse effect on the cluster.
b. Wait until we get a release out with a fixed CVO. Update to that release, and the new CVO will remove the orphaned ports, fixing the buggy condition.
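Before reaching for either fix, the buggy condition can be confirmed by looking for a port name claimed by more than one container in the pod template. A sketch; in practice you would feed it the `.spec.template.spec` from `oc get deployment cluster-autoscaler-operator -n openshift-machine-api -o json`, and the sample spec below just mirrors the broken deployment quoted in this bug:

```python
from collections import defaultdict

def duplicate_port_names(pod_spec):
    """Return port names that appear on more than one container."""
    seen = defaultdict(list)
    for container in pod_spec.get("containers", []):
        for port in container.get("ports", []):
            if "name" in port:
                seen[port["name"]].append(container["name"])
    return {name: owners for name, owners in seen.items() if len(owners) > 1}

# Pod spec shaped like the broken deployment in this bug (illustrative):
spec = {
    "containers": [
        {"name": "cluster-autoscaler-operator",
         "ports": [{"containerPort": 8443},
                   {"containerPort": 8080, "name": "metrics"}]},
        {"name": "kube-rbac-proxy",
         "ports": [{"containerPort": 9192, "name": "metrics"}]},
    ]
}

print(duplicate_port_names(spec))
# A healthy 4.3 deployment returns {}; the broken one flags "metrics".
```

An empty result means the Service's `targetPort: metrics` resolves unambiguously and no cleanup is needed.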
# What is the observed rate of failure we see in CI?
100% of 4.2 -> 4.3 updates? Would be good to spot-check some update CI to confirm this. From Clayton:
> we didn’t check alerts during upgrades before, we checked that they didn’t last longer than X when we test after upgrade completes. We probably need to have something that watches alerts during upgrades, like we watch for api failures
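One way to watch alerts during an upgrade, as suggested above, would be to poll Prometheus's `GET /api/v1/alerts` endpoint and flag firing critical or warning alerts. A sketch of just the filtering step; the payload shape follows the Prometheus HTTP API, and the sample data is illustrative, not taken from a real cluster:

```python
def firing_alerts(payload, severities=("critical", "warning")):
    """Pick out firing alerts worth failing an upgrade run on.

    `payload` is the decoded JSON body of Prometheus's GET /api/v1/alerts.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        a for a in alerts
        if a.get("state") == "firing"
        and a.get("labels", {}).get("severity") in severities
    ]

# Illustrative payload shaped like the verification output above:
payload = {
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "ClusterAutoscalerOperatorDown",
                    "severity": "critical"}, "state": "firing"},
        {"labels": {"alertname": "Watchdog", "severity": "none"},
         "state": "firing"},
    ]},
}

for alert in firing_alerts(payload):
    print(alert["labels"]["alertname"])
```

A watcher loop would call this on each poll and record any non-empty result against the upgrade timeline, much like the API-failure watcher mentioned in the quote.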
# What is the expected rate of the failure (%) across upgrades?
100% of 4.2 -> 4.3 updates? Would be good to spot-check some updates in Telemetry to confirm this.
[1]: $ oc adm release extract --to 4.2.18 quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64
Extracted release payload from digest sha256:283a1625e18e0b6d7f354b1b022a0aeaab5598f2144ec484faf89e1ecb5c7498 created at 2020-02-10T20:57:43Z
$ oc adm release extract --to 4.3.1 quay.io/openshift-release-dev/ocp-release:4.3.1-x86_64
Extracted release payload from digest sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3 created at 2020-02-05T12:16:31Z
$ diff -U0 <(cd 4.2.18 && grep -rl ports: | sort) <(cd 4.3.1 && grep -rl ports: | sort)
--- /dev/fd/63 2020-02-13 13:35:53.396953972 -0800
+++ /dev/fd/62 2020-02-13 13:35:53.397953984 -0800
@@ -10 +10 @@
-0000_50_cloud-credential-operator_01_deployment.yaml
+0000_50_cloud-credential-operator_05_deployment.yaml
@@ -13,0 +14 @@
+0000_50_cluster-image-registry-operator_07-operator-service.yaml
@@ -14,0 +16,4 @@
+0000_50_cluster-ingress-operator_01-service.yaml
+0000_50_cluster-ingress-operator_02-deployment.yaml
+0000_50_cluster-machine-approver_03-metrics-service.yaml
+0000_50_cluster-machine-approver_04-deployment.yaml
@@ -17,0 +23 @@
+0000_50_cluster-samples-operator_06-metricsservice.yaml
@@ -28 +34,2 @@
-0000_50_insights-operator_05-deployment.yaml
+0000_50_insights-operator_05-service.yaml
+0000_50_insights-operator_06-deployment.yaml
@@ -34,0 +42,3 @@
+0000_70_dns-operator_01-service.yaml
+0000_70_dns-operator_02-deployment.yaml
+0000_80_machine-config-operator_00_service.yaml
$ mv 4.2.18/0000_50_cloud-credential-operator_0{1,5}_deployment.yaml # rename so diff will match
$ mv 4.2.18/0000_50_insights-operator_0{5,6}-deployment.yaml
$ diff -rU3 4.2.18 4.3.1 | grep -A10 ports:
...
diff -rU3 4.2.18/0000_30_machine-api-operator_10_service.yaml 4.3.1/0000_30_machine-api-operator_10_service.yaml
...
ports:
- - name: metrics
- port: 8080
- targetPort: metrics
- protocol: TCP
+ - name: https
+ port: 8443
+ targetPort: https
...
diff -rU3 4.2.18/0000_30_machine-api-operator_11_deployment.yaml 4.3.1/0000_30_machine-api-operator_11_deployment.yaml
...
+ ports:
+ - containerPort: 8443
+ name: https
+ protocol: TCP
+ volumeMounts:
+ - name: config
+ mountPath: /etc/kube-rbac-proxy
+ - mountPath: /etc/tls/private
+ name: machine-api-operator-tls
- name: machine-api-operator
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1217dc485b55c3cfd4aae7f2e88886a45b9a868099dbb17813256a0d81a63b82
--
- ports:
- - name: metrics
- containerPort: 8080
...
diff -rU3 4.2.18/0000_50_cluster-autoscaler-operator_07_deployment.yaml 4.3.1/0000_50_cluster-autoscaler-operator_07_deployment.yaml
...
+ ports:
+ - containerPort: 9192
+ name: metrics
+ protocol: TCP
+ resources: {}
+ terminationMessagePath: /dev/termination-log
+ terminationMessagePolicy: File
+ volumeMounts:
+ - name: auth-proxy-config
+ mountPath: /etc/kube-rbac-proxy
+ readOnly: true
--
ports:
- containerPort: 8443
- - name: metrics
- containerPort: 8080
...
Upgraded from 4.3.2 to 4.4.0-0.nightly-2020-02-13-212616, not reproducible.

Verified this is fixed in 4.4.0-0.nightly-2020-02-13-212616. Alerts are not firing after upgrade, and the cluster-autoscaler-operator deployment no longer uses port 8080. Moving to verified.

To track the 4.3 status, use 1802710

Tracking down affected versions:

* 4.4.0: https://bugzilla.redhat.com/show_bug.cgi?id=1801300 https://github.com/openshift/cluster-version-operator/pull/322 which made it into 4.4.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep cluster-version
    cluster-version-operator https://github.com/openshift/cluster-version-operator 23856901003b95b559087b8e83bffdee82872b2b
  $ git --no-pager log --oneline --first-parent -6 origin/release-4.4
  2385690 (origin/release-4.4) Merge pull request #332 from openshift-cherrypick-robot/cherry-pick-328-to-release-4.4
  ...
  2df3d56 Merge pull request #322 from vrutkovs/remove-outdated-ports

* 4.3.z: https://bugzilla.redhat.com/show_bug.cgi?id=1802710 https://github.com/openshift/cluster-version-operator/pull/323 which made it into 4.3.3:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.3-x86_64 | grep cluster-version
    cluster-version-operator https://github.com/openshift/cluster-version-operator 210b4b1e6b1b7f53b5dc0d935de9c5d27058280c
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.2-x86_64 | grep cluster-version
    cluster-version-operator https://github.com/openshift/cluster-version-operator beee410fc8780e5613c09fc2690716b711747041
  $ git --no-pager log --oneline --first-parent -5 origin/release-4.3
  210b4b1 (origin/release-4.3) Merge pull request #321 from openshift-cherrypick-robot/cherry-pick-319-to-release-4.3
  5057680 Merge pull request #323 from vrutkovs/4.3-container-ports
  ...
  beee410 Merge pull request #290 from wking/no-ephemeral-storage-in-4.2-so-4.3-cannot-rely-on-it

* 4.2.z: https://bugzilla.redhat.com/show_bug.cgi?id=1802248 https://github.com/openshift/cluster-version-operator/pull/325 which made it into 4.2.21:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.21-x86_64 | grep cluster-version
    cluster-version-operator https://github.com/openshift/cluster-version-operator ccbed39b6faab201a1bafc49a7f519194d5dd548
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.20-x86_64 | grep cluster-version
    cluster-version-operator https://github.com/openshift/cluster-version-operator 9f4f04e736a0bbc61323593fbb62874570f07762
  $ git --no-pager log --oneline --first-parent -4 origin/release-4.2
  4b39863 (origin/release-4.2) Merge pull request #314 from wking/resource-merge-index-mutating-existing-4.2
  ccbed39 Merge pull request #300 from openshift-cherrypick-robot/cherry-pick-298-to-release-4.2
  a8ed501 Merge pull request #325 from vrutkovs/4.2-container-ports
  9f4f04e Merge pull request #288 from wking/no-ephemeral-storage-in-4.2

* The autoscaler and machine-API operators both removed their metrics port in 4.2 -> 4.3 [1], so 4.2 clusters which update to 4.3 < 4.3.5 will hit this.
* The ingress operator removed its metrics port in 4.1 -> 4.2 [2], so 4.1 clusters which update to 4.2 < 4.2.21 will hit this.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1801300#c29
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1802248#c3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475