Bug 1801300

Summary: cluster-autoscaler metrics collection breaks after upgrade to 4.3
Product: OpenShift Container Platform Reporter: Jaspreet Kaur <jkaur>
Component: Cluster Version Operator
Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Jianwei Hou <jhou>
Severity: high Docs Contact:
Priority: high    
Version: 4.2.0
CC: alegrand, anpicker, aos-bugs, aprajapa, bleanhar, brad.williams, cblecker, erooth, jokerman, kakkoyun, lcosic, lmohanty, mharri, mloibl, mmasters, mrhodes, mtleilia, mwoodson, pkrupa, sdodson, syangsao, vrutkovs, wking, zhsun
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: CVO didn't apply containerPort changes
Consequence: new port for ClusterAutoscaler metrics didn't get applied to the deployment
Fix: CVO now applies containerPort changes as expected
Result: ClusterAutoscaler deployment reports metrics on a new port
Story Points: ---
Clone Of:
: 1802710 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:57:34 UTC Type: Bug
Bug Blocks: 1802710, 1803258    
Attachments:
cluster version operator logs (flags: none)
CVO Logs for us-west-1 Starter Cluster (flags: none)

Description Jaspreet Kaur 2020-02-10 15:27:12 UTC
Description of problem:


When performing an upgrade from 4.2 to 4.3 in our environment, we had alerts saying the cluster-autoscaler-operator was down. We troubleshot the issue and found that the cluster-autoscaler-operator container in the deployment still declared port 8080 with the name "metrics". This has changed in 4.3: the kube-rbac-proxy container now exposes port 9192 as "metrics".

This broke Prometheus, as its target was still using 8080 and nothing is listening on that TCP port anymore.
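
For reference, the container ports actually declared on the deployment and the port the operator's Service exposes can be compared directly. These commands are only a suggestion; the Service name is assumed to match the deployment name:

$ oc -n openshift-machine-api get deployment cluster-autoscaler-operator \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.ports}{"\n"}{end}'
$ oc -n openshift-machine-api get service cluster-autoscaler-operator -o yaml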



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install the cluster autoscaler on a 4.2 cluster
2. Upgrade to 4.3

Actual results: cluster-autoscaler metrics collection is broken; Prometheus reports the target as down


Expected results: the cluster autoscaler and its metrics collection should keep working after the upgrade


Additional info:

Comment 7 Matt Woodson 2020-02-12 17:06:39 UTC
We are seeing this in OpenShift Dedicated (OSD) as well when upgrading from 4.2 to 4.3 (we are using 4.3.0).

With help from Michael McCune, we saw some odd setup of the metrics port in the deployment:

------------------------------------------------------------------------------------------ 
# oc get deployments cluster-autoscaler-operator -o yaml 

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
...
        - name: WEBHOOKS_CERT_DIR
          value: /etc/cluster-autoscaler-operator/tls
        - name: WEBHOOKS_PORT
          value: "8443"
        - name: METRICS_PORT
          value: "9191"
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:702706ac38dfdd0baacaa858d40e0bb58e0c612fa58167a998f55882e7d9538c
        imagePullPolicy: IfNotPresent
        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
...

------------------------------------------------------------------------------------------ 

The metrics port seems to be defined via the METRICS_PORT env var as 9191, but the container port labeled "metrics" is still 8080.
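
For anyone checking their own clusters, the env var and the declared ports can be pulled out side by side; the grep pattern below is just a convenience, not anything from the manifests:

# oc get deployment cluster-autoscaler-operator -n openshift-machine-api -o yaml | grep -A1 -E 'METRICS_PORT|containerPort:'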

Comment 8 brad.williams 2020-02-12 17:13:36 UTC
I have confirmed that we are seeing this in OpenShift Starter as well...

# oc get deployments cluster-autoscaler-operator -o yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "13"
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2019-10-25T16:21:05Z"
  generation: 13
  labels:
    k8s-app: cluster-autoscaler-operator
...
        - name: WEBHOOKS_CERT_DIR
          value: /etc/cluster-autoscaler-operator/tls
        - name: WEBHOOKS_PORT
          value: "8443"
        - name: METRICS_PORT
          value: "9191"
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb12244ffb6df6668be9aee4484bdcd39f8cf1c7c94c56def1138ba24b91e03
        imagePullPolicy: IfNotPresent
        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 20m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/cluster-autoscaler-operator/tls/service-ca
          name: ca-cert
          readOnly: true
        - mountPath: /etc/cluster-autoscaler-operator/tls
          name: cert
          readOnly: true
      - args:
        - --secure-listen-address=0.0.0.0:9192
        - --upstream=http://127.0.0.1:9191/
        - --tls-cert-file=/etc/tls/private/tls.crt
        - --tls-private-key-file=/etc/tls/private/tls.key
        - --config-file=/etc/kube-rbac-proxy/config-file.yaml
        - --logtostderr=true
        - --v=10
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b32fbe4ffef894aef53eb8fff509c2bf1ec1347f1ae229fa95ae615a2e514a48
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 9192
          name: metrics
          protocol: TCP
        resources: {}
...

Comment 9 Miciah Dashiel Butler Masters 2020-02-12 18:31:15 UTC
Between 4.2 and 4.3, the "metrics" port on the "cluster-autoscaler-operator" container was deleted, and the "metrics" port on the "kube-rbac-proxy" container was added[1].  Comment 8 shows that the same deployment has both "metrics" ports defined in the same pod spec.

1. https://github.com/openshift/cluster-autoscaler-operator/compare/release-4.2...openshift:release-4.3#diff-b04b9fb0b15ef843b20a21cf9b14d3ddR35-L54

As Abhinav noted in https://bugzilla.redhat.com/show_bug.cgi?id=1802248#c4, it looks like bug 1802248 (concerning the ingress operator) and this bug (concerning the autoscaler operator) have fundamentally the same cause.

Comment 13 Matt Woodson 2020-02-12 21:29:14 UTC
Just verified. Deleting the deployment and having CVO recreate it works.

oc delete deployment cluster-autoscaler-operator  -n openshift-machine-api

This will remove the messed up deployment, and CVO will redeploy a clean one.  This clears all the alerts.
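
Once the CVO has recreated the deployment (it can take a few minutes for the next sync), the following is one way to confirm the stale 8080 "metrics" port is gone; the exact commands are just a suggestion:

oc -n openshift-machine-api rollout status deployment cluster-autoscaler-operator
oc -n openshift-machine-api get deployment cluster-autoscaler-operator -o jsonpath='{.spec.template.spec.containers[*].ports}{"\n"}'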

Comment 14 W. Trevor King 2020-02-12 21:31:57 UTC
Is this a dup of bug 1783221 (4.4, VERIFIED), bug 1798049 (4.3, POST), and bug 1800346 (4.2, POST) about the CVO exploding when you remove a port?

Comment 15 W. Trevor King 2020-02-12 21:32:58 UTC
Resetting priority/severity since my browser was "helpfully" remembering my values through a bug refresh and clobbered the updates...

Comment 16 Michael McCune 2020-02-12 21:46:23 UTC
@Trevor

it looks like there was a large change between 4.2 and 4.3 for the cluster-autoscaler-operator

4.2: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.2/install/07_deployment.yaml
4.3: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.3/install/07_deployment.yaml

the main thing to see is that the deployment for that operator had the explicit metrics port removed. now, i am not sure if this is a problem that needs to be fixed just yet, i am still learning about the change. i don't know if the port was removed because of a different change in the way metrics are scraped, or if this is a regression.

Comment 17 W. Trevor King 2020-02-12 21:55:03 UTC
Looks like [1] (in 4.3 but not 4.2) dropped metrics from the 'ports' list (where it had been 8080) and added a 9191 -> 9192 translator to serve metrics over HTTPS.  But the removed port entry is at the end of the list, so I don't expect the bug 1798049 panic to fire here.  Do we have a CI job or must-gather from a cluster that exhibits this bug's breakage?

[1]: https://github.com/openshift/cluster-autoscaler-operator/commit/cebc1b062c872a136109f1916e4b25c0f7ef5ebe

Comment 18 Sam Yangsao 2020-02-12 21:59:36 UTC
(In reply to Matt Woodson from comment #13)
> Just verified. Deleting the deployment and having CVO recreate it works.
> 
> oc delete deployment cluster-autoscaler-operator  -n openshift-machine-api
> 
> This will remove the messed up deployment, and CVO will redeploy a clean
> one.  This clears all the alerts.

Just confirmed this as well on my cluster.  Thanks Matt!

Comment 19 Michael McCune 2020-02-12 22:15:14 UTC
(In reply to W. Trevor King from comment #17)
> Looks like [1] (in 4.3 but not 4.2) dropped metrics from the 'ports' list
> (where it had been 8080) and added a 9191 -> 9192 translater to serve
> metrics over HTTPS.  But the removed port entry is from the end of the list,

ahh, ok. i was wondering about the 9192 metrics port, i thought it was just for that rbac-proxy container but the https passthrough makes sense. appreciate the explanation.

> so I don't expect the bug 1798049 panic to fire here.  Do we have a CI job
> or must-gather from a cluster that exhibits this bug's breakage?
> 

afaik, we just have a handful of reported cases, not sure about CI.

Comment 20 W. Trevor King 2020-02-12 22:21:10 UTC
Can anyone who hit this in their cluster attach CVO logs (openshift-cluster-version namespace)?  Even if you've since deleted the deployment to unstick the CVO, CVO logs would still be helpful.

Comment 21 Sam Yangsao 2020-02-12 22:25:00 UTC
Created attachment 1662814 [details]
cluster version operator logs

Attached from my cluster for comment #20

Comment 22 brad.williams 2020-02-12 22:30:23 UTC
Created attachment 1662816 [details]
CVO Logs for us-west-1 Starter Cluster

Attached is the CVO log for a starter cluster exhibiting both alerts (ingress and autoscaler).

Comment 23 W. Trevor King 2020-02-12 23:22:45 UTC
(In reply to brad.williams from comment #22)
> Attached is the CVS log for a starter cluster exhibiting both alerts
> (ingress and autoscaler).

You're still seeing the alerts?

$ grep -i 'deployment.*autoscaler\|Running sync.*in state\|Result of work' cvo-logs.txt 
I0212 21:36:31.946823       1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Updating at attempt 0
I0212 21:36:54.274468       1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:36:54.669223       1 request.go:538] Throttling request took 394.433696ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.069229       1 request.go:538] Throttling request took 396.448388ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.469224       1 request.go:538] Throttling request took 393.248658ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:36:55.472887       1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:41:04.273997       1 task_graph.go:611] Result of work: []
I0212 21:43:56.799939       1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Reconciling at attempt 0
I0212 21:44:03.958452       1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:44:04.053339       1 request.go:538] Throttling request took 94.542211ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.153347       1 request.go:538] Throttling request took 91.801527ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.253324       1 request.go:538] Throttling request took 94.908989ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 21:44:04.256794       1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 21:44:30.214196       1 task_graph.go:611] Result of work: []
...
I0212 22:23:52.222330       1 sync_worker.go:471] Running sync 4.3.1 (force=true) on generation 26 in state Reconciling at attempt 0
I0212 22:23:59.379809       1 sync_worker.go:621] Running sync for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 22:23:59.474710       1 request.go:538] Throttling request took 94.553772ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.574758       1 request.go:538] Throttling request took 96.192704ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.674712       1 request.go:538] Throttling request took 95.647392ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/deployments/cluster-autoscaler-operator
I0212 22:23:59.678543       1 sync_worker.go:634] Done syncing for deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
I0212 22:24:25.631467       1 task_graph.go:611] Result of work: []

That is a single Updating round after the CVO came up, followed by a bunch of Reconciling rounds.  All of them are successful.  The throttling shows the CVO attempting to PUT the Deployment, so it must think the manifest is dirty (we really should log why we think a manifest is dirty), but then the CVO very quickly decides that the "updated" Deployment is satisfactory and continues through the rest of the sync round.

Comment 24 Sam Yangsao 2020-02-13 14:50:59 UTC
*** Bug 1801960 has been marked as a duplicate of this bug. ***

Comment 25 Michael McCune 2020-02-13 14:51:09 UTC
the more i am looking at this, the more confused i am becoming.

the deployment manifests that are being posted here seem to contain a mix of 4.2 and 4.3 elements. i am unsure how this is happening. there is a specific commit where the exposed "metrics" port on 8080 is removed; it should not exist in these deployments. before that commit there is no kube-rbac-proxy container, so it seems odd that we have both in these manifests.

as others have noted, there are now 2 ports labeled "metrics" for the cluster-autoscaler-operator, and apparently the ServiceMonitor object is directing the scrape to the incorrect port (although that port should not exist in these deployments).
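
To make the ambiguity concrete, here is a purely hypothetical Service of the shape the ServiceMonitor relies on (not the shipped manifest; the name, selector and port numbers are assumed). Its targetPort refers to the container-port name "metrics", and in the mixed deployment above that name now exists on two different containers (8080 and 9192), so it no longer identifies a single port:

apiVersion: v1
kind: Service
metadata:
  name: cluster-autoscaler-operator     # assumed for illustration
  namespace: openshift-machine-api
spec:
  selector:
    k8s-app: cluster-autoscaler-operator
  ports:
  - name: metrics
    port: 9192
    targetPort: metrics                  # resolved by container-port name, which is now duplicated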

Comment 26 brad.williams 2020-02-13 16:14:00 UTC
(In reply to W. Trevor King from comment #23)
> (In reply to brad.williams from comment #22)
> > Attached is the CVS log for a starter cluster exhibiting both alerts
> > (ingress and autoscaler).
> 
> You're still seeing the alerts?
> 
> $ grep -i 'deployment.*autoscaler\|Running sync.*in state\|Result of work'
> cvo-logs.txt 
> I0212 21:36:31.946823       1 sync_worker.go:471] Running sync 4.3.1
> (force=true) on generation 26 in state Updating at attempt 0
> I0212 21:36:54.274468       1 sync_worker.go:621] Running sync for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 21:36:54.669223       1 request.go:538] Throttling request took
> 394.433696ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:36:55.069229       1 request.go:538] Throttling request took
> 396.448388ms, request:
> PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:36:55.469224       1 request.go:538] Throttling request took
> 393.248658ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:36:55.472887       1 sync_worker.go:634] Done syncing for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 21:41:04.273997       1 task_graph.go:611] Result of work: []
> I0212 21:43:56.799939       1 sync_worker.go:471] Running sync 4.3.1
> (force=true) on generation 26 in state Reconciling at attempt 0
> I0212 21:44:03.958452       1 sync_worker.go:621] Running sync for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 21:44:04.053339       1 request.go:538] Throttling request took
> 94.542211ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:44:04.153347       1 request.go:538] Throttling request took
> 91.801527ms, request:
> PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:44:04.253324       1 request.go:538] Throttling request took
> 94.908989ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 21:44:04.256794       1 sync_worker.go:634] Done syncing for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 21:44:30.214196       1 task_graph.go:611] Result of work: []
> ...
> I0212 22:23:52.222330       1 sync_worker.go:471] Running sync 4.3.1
> (force=true) on generation 26 in state Reconciling at attempt 0
> I0212 22:23:59.379809       1 sync_worker.go:621] Running sync for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 22:23:59.474710       1 request.go:538] Throttling request took
> 94.553772ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 22:23:59.574758       1 request.go:538] Throttling request took
> 96.192704ms, request:
> PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 22:23:59.674712       1 request.go:538] Throttling request took
> 95.647392ms, request:
> GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-machine-api/
> deployments/cluster-autoscaler-operator
> I0212 22:23:59.678543       1 sync_worker.go:634] Done syncing for
> deployment "openshift-machine-api/cluster-autoscaler-operator" (168 of 494)
> I0212 22:24:25.631467       1 task_graph.go:611] Result of work: []
> 
> That is a single Updating round after the CVO came up, followed by a bunch
> of Reconciling rounds.  All of them are successful.  The throttling shows
> the CVO attempting to PUT the Deployment, so it must think the manifest is
> dirty (we really should log why we think a manifest is dirty), but then the
> CVO very quickly decides that the "updated" Deployment is satisfactory and
> continues through the rest of the sync round.

I just re-ran our verification steps and these alerts are still present:

ERROR [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:201 check_prometheus_alerts] - Critical alert is firing: {'labels': {'alertname': 'ClusterAutoscalerOperatorDown', 'severity': 'critical'}, 'annotations': {'message': 'cluster-autoscaler-operator has disappeared from Prometheus target discovery.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:11.800595892Z', 'value': '1e+00'}

WARNING [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:203 check_prometheus_alerts] - Alert is firing: {'labels': {'alertname': 'TargetDown', 'job': 'metrics', 'namespace': 'openshift-ingress-operator', 'service': 'metrics', 'severity': 'warning'}, 'annotations': {'message': '100% of the metrics targets in openshift-ingress-operator namespace are down.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:00.163677339Z', 'value': '1e+02'}

WARNING [/home/brawilli/Source/continuous-release-jobs/config/imperative/verifications/generic.py:203 check_prometheus_alerts] - Alert is firing: {'labels': {'alertname': 'TargetDown', 'job': 'cluster-autoscaler-operator', 'namespace': 'openshift-machine-api', 'service': 'cluster-autoscaler-operator', 'severity': 'warning'}, 'annotations': {'message': '100% of the cluster-autoscaler-operator targets in openshift-machine-api namespace are down.'}, 'state': 'firing', 'activeAt': '2020-02-12T21:47:30.163677339Z', 'value': '1e+02'}

Comment 27 Vadim Rutkovsky 2020-02-13 16:34:23 UTC
CVO is incorrectly applying port modifications.

In 4.2 cluster-autoscaler-operator has 2 ports - https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.2/install/07_deployment.yaml#L49-L54:

      - name: cluster-autoscaler-operator
  ...
        ports:
        - containerPort: 8443
        - name: metrics
          containerPort: 8080

In 4.3 the metrics port is moved to a different container - https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.3/install/07_deployment.yaml#L78-L79:

      - name: kube-rbac-proxy
   ...
        ports:
        - containerPort: 9192
          name: metrics
          protocol: TCP
  ...
      - name: cluster-autoscaler-operator
  ...
        ports:
        - containerPort: 8443

However, the CVO applies both in the resulting deployment:

      spec:
        containers:
        - args:
  ...
          ports:
          - containerPort: 8443
            protocol: TCP
          - containerPort: 8080
            name: metrics
            protocol: TCP
        - args:
  ...
          name: kube-rbac-proxy
          ports:
          - containerPort: 9192
            name: metrics
            protocol: TCP

That causes the service to scrape metrics from the wrong port, as the pod now has two ports named "metrics" in different containers.
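
For readers not familiar with how the CVO reconciles deployments, here is a minimal, self-contained sketch (Go, deliberately not the actual cluster-version-operator code) of why an update-or-add merge over the ports list produces exactly the mixed result above, while taking the 4.3 manifest as authoritative does not:

// Illustrative only: not the cluster-version-operator implementation.
package main

import "fmt"

type port struct {
	Name          string
	ContainerPort int32
}

// mergePorts updates entries that match on ContainerPort and appends new ones,
// but never drops entries missing from the desired manifest -- roughly the
// pre-fix behaviour, so the 4.2 "metrics" port on 8080 survives the upgrade.
func mergePorts(existing, desired []port) []port {
	out := append([]port{}, existing...)
	for _, d := range desired {
		matched := false
		for i := range out {
			if out[i].ContainerPort == d.ContainerPort {
				out[i] = d
				matched = true
			}
		}
		if !matched {
			out = append(out, d)
		}
	}
	return out
}

// replacePorts treats the desired list as authoritative -- the behaviour after
// the fix: ports removed from the manifest disappear from the deployment too.
func replacePorts(desired []port) []port {
	return append([]port{}, desired...)
}

func main() {
	existing := []port{{"", 8443}, {"metrics", 8080}} // 4.2 cluster-autoscaler-operator container
	desired := []port{{"", 8443}}                     // 4.3 manifest: metrics moved to kube-rbac-proxy
	fmt.Println("merge-only:", mergePorts(existing, desired)) // keeps the stale 8080 "metrics" port
	fmt.Println("replace:   ", replacePorts(desired))         // matches the 4.3 manifest
}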

Comment 28 Vadim Rutkovsky 2020-02-13 18:03:06 UTC
This is not an upgrade blocker, though, as:
a) it can be fixed manually
b) no data loss occurs - cluster autoscaler metrics are not being collected, but existing ones are not being destroyed

Comment 29 W. Trevor King 2020-02-13 21:54:25 UTC
Trying to fill in our new impact-statement template:

# What kind of clusters are impacted because of the bug?

All clusters upgrading from 4.2 to 4.3, because the autoscaler and machine-API operator deployments both removed ports from a container template [1].

# What cluster functionality is degraded while hitting the bug?

In both cases, the metrics port was removed.  So no data is lost, but new metrics gathering from these operators will be broken until the bug is fixed.

# Does this bug result in data loss?

No.  But as above, for the duration of the buggy condition, new metrics will not be collected from the two affected operators.

# Is it possible to recover the cluster from the bug?

Yes.  Two ways:

a. Manually 'oc delete ...' the affected Deployment.  The CVO will push a replacement with the correct ports, fixing the buggy condition.  The brief outage in operator availability while the Deployment is replaced should have no adverse effect on the cluster.
b. Wait until we get a release out with a fixed CVO.  Update to that release, and the new CVO will remove the orphaned ports, fixing the buggy condition.

# What is the observed rate of failure we see in CI?

100% of 4.2 -> 4.3 updates?  Would be good to spot-check some update CI to confirm this.  From Clayton:

> we didn’t check alerts during upgrades before, we checked that they didn’t last longer than X when we test after upgrade completes.  We probably need to have something that watches alerts during upgrades, like we watch for api failures

# What is the expected rate of the failure (%) across upgrades?

100% of 4.2 -> 4.3 updates?  Would be good to spot-check some updates in Telemetry to confirm this.

[1]: $ oc adm release extract --to 4.2.18 quay.io/openshift-release-dev/ocp-release:4.2.18-x86_64
     Extracted release payload from digest sha256:283a1625e18e0b6d7f354b1b022a0aeaab5598f2144ec484faf89e1ecb5c7498 created at 2020-02-10T20:57:43Z
     $ oc adm release extract --to 4.3.1 quay.io/openshift-release-dev/ocp-release:4.3.1-x86_64
     Extracted release payload from digest sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3 created at 2020-02-05T12:16:31Z
     $ diff -U0 <(cd 4.2.18 && grep -rl ports: | sort) <(cd 4.3.1 && grep -rl ports: | sort)
     --- /dev/fd/63	2020-02-13 13:35:53.396953972 -0800
     +++ /dev/fd/62	2020-02-13 13:35:53.397953984 -0800
     @@ -10 +10 @@
     -0000_50_cloud-credential-operator_01_deployment.yaml
     +0000_50_cloud-credential-operator_05_deployment.yaml
     @@ -13,0 +14 @@
     +0000_50_cluster-image-registry-operator_07-operator-service.yaml
     @@ -14,0 +16,4 @@
     +0000_50_cluster-ingress-operator_01-service.yaml
     +0000_50_cluster-ingress-operator_02-deployment.yaml
     +0000_50_cluster-machine-approver_03-metrics-service.yaml
     +0000_50_cluster-machine-approver_04-deployment.yaml
     @@ -17,0 +23 @@
     +0000_50_cluster-samples-operator_06-metricsservice.yaml
     @@ -28 +34,2 @@
     -0000_50_insights-operator_05-deployment.yaml
     +0000_50_insights-operator_05-service.yaml
     +0000_50_insights-operator_06-deployment.yaml
     @@ -34,0 +42,3 @@
     +0000_70_dns-operator_01-service.yaml
     +0000_70_dns-operator_02-deployment.yaml
     +0000_80_machine-config-operator_00_service.yaml
     $ mv 4.2.18/0000_50_cloud-credential-operator_0{1,5}_deployment.yaml  # rename so diff will match
     $ mv 4.2.18/0000_50_insights-operator_0{5,6}-deployment.yaml
     $ diff -rU3 4.2.18 4.3.1 | grep -A10 ports:
     ...
     diff -rU3 4.2.18/0000_30_machine-api-operator_10_service.yaml 4.3.1/0000_30_machine-api-operator_10_service.yaml
     ...
        ports:
     -  - name: metrics
     -    port: 8080
     -    targetPort: metrics
     -    protocol: TCP
     +  - name: https
     +    port: 8443
     +    targetPort: https
     ...
     diff -rU3 4.2.18/0000_30_machine-api-operator_11_deployment.yaml 4.3.1/0000_30_machine-api-operator_11_deployment.yaml
     ...
     +        ports:
     +        - containerPort: 8443
     +          name: https
     +          protocol: TCP
     +        volumeMounts:
     +        - name: config
     +          mountPath: /etc/kube-rbac-proxy
     +        - mountPath: /etc/tls/private
     +          name: machine-api-operator-tls
            - name: machine-api-operator
     -        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1217dc485b55c3cfd4aae7f2e88886a45b9a868099dbb17813256a0d81a63b82
     --
     -        ports:
     -        - name: metrics
     -          containerPort: 8080
     ...
     diff -rU3 4.2.18/0000_50_cluster-autoscaler-operator_07_deployment.yaml 4.3.1/0000_50_cluster-autoscaler-operator_07_deployment.yaml
     ...
     +        ports:
     +        - containerPort: 9192
     +          name: metrics
     +          protocol: TCP
     +        resources: {}
     +        terminationMessagePath: /dev/termination-log
     +        terminationMessagePolicy: File
     +        volumeMounts:
     +        - name: auth-proxy-config
     +          mountPath: /etc/kube-rbac-proxy
     +          readOnly: true
     --
              ports:
              - containerPort: 8443
     -        - name: metrics
     -          containerPort: 8080
     ...

Comment 33 Jianwei Hou 2020-02-14 07:15:08 UTC
Upgraded from 4.3.2 to 4.4.0-0.nightly-2020-02-13-212616, not reproducible. Verified this is fixed in 4.4.0-0.nightly-2020-02-13-212616

Alerts are not firing after the upgrade, and the cluster-autoscaler-operator deployment no longer uses port 8080. Moving to verified.

To track the 4.3 status, use bug 1802710.

Comment 35 W. Trevor King 2020-03-20 22:32:11 UTC
Tracking down affected versions:

* 4.4.0: https://bugzilla.redhat.com/show_bug.cgi?id=1801300
  https://github.com/openshift/cluster-version-operator/pull/322
  Which made it into 4.4.0-rc.0:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep cluster-version
    cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       23856901003b95b559087b8e83bffdee82872b2b
  $ git --no-pager log --oneline --first-parent -6 origin/release-4.4
  2385690 (origin/release-4.4) Merge pull request #332 from openshift-cherrypick-robot/cherry-pick-328-to-release-4.4
  ...
  2df3d56 Merge pull request #322 from vrutkovs/remove-outdated-ports

* 4.3.z: https://bugzilla.redhat.com/show_bug.cgi?id=1802710
  https://github.com/openshift/cluster-version-operator/pull/323
  Which made it into 4.3.3:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.3-x86_64 | grep cluster-version
    cluster-version-operator                      https://github.com/openshift/cluster-version-operator                      210b4b1e6b1b7f53b5dc0d935de9c5d27058280c
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.2-x86_64 | grep cluster-version
    cluster-version-operator                      https://github.com/openshift/cluster-version-operator                      beee410fc8780e5613c09fc2690716b711747041
  $ git --no-pager log --oneline --first-parent -5 origin/release-4.3
  210b4b1 (origin/release-4.3) Merge pull request #321 from openshift-cherrypick-robot/cherry-pick-319-to-release-4.3
  5057680 Merge pull request #323 from vrutkovs/4.3-container-ports
  ...
  beee410 Merge pull request #290 from wking/no-ephemeral-storage-in-4.2-so-4.3-cannot-rely-on-it

* 4.2.z: https://bugzilla.redhat.com/show_bug.cgi?id=1802248
  https://github.com/openshift/cluster-version-operator/pull/325
  Which made it into 4.2.21:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.21-x86_64 | grep cluster-version
    cluster-version-operator                      https://github.com/openshift/cluster-version-operator                      ccbed39b6faab201a1bafc49a7f519194d5dd548
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.20-x86_64 | grep cluster-version
    cluster-version-operator                      https://github.com/openshift/cluster-version-operator                      9f4f04e736a0bbc61323593fbb62874570f07762
  $ git --no-pager log --oneline --first-parent -4 origin/release-4.2
  4b39863 (origin/release-4.2) Merge pull request #314 from wking/resource-merge-index-mutating-existing-4.2
  ccbed39 Merge pull request #300 from openshift-cherrypick-robot/cherry-pick-298-to-release-4.2
  a8ed501 Merge pull request #325 from vrutkovs/4.2-container-ports
  9f4f04e Merge pull request #288 from wking/no-ephemeral-storage-in-4.2

* autoscaler and machine-API operator both removed their metrics port in 4.2 -> 4.3 [1].  So 4.2 clusters which update to 4.3 < 4.3.5 will hit this.

* ingress operator removed its metrics port in 4.1 -> 4.2 [2], so 4.1 clusters which update to 4.2 < 4.2.21 will hit this.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1801300#c29
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1802248#c3

Comment 37 errata-xmlrpc 2020-05-13 21:57:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 38 W. Trevor King 2021-04-05 17:47:18 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475