Description of problem:
We recently upgraded OpenShift Dedicated clusters from 4.2.16 to 4.3.0. After the upgrade, we saw this alert firing:

{"__name__":"ALERTS","alertname":"TargetDown","alertstate":"firing","job":"metrics","namespace":"openshift-ingress-operator","service":"metrics","severity":"warning"}

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
Very. All clusters we have upgraded (10+) have seen this issue.

Steps to Reproduce:
1. Upgrade from 4.2.16 -> 4.3.0

Additional info:

Here is the deployment:
-----------------------------------------------------------------------
# oc get deployment ingress-operator -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    config.openshift.io/inject-proxy: ingress-operator
    deployment.kubernetes.io/revision: "21"
  creationTimestamp: "2019-06-11T15:51:09Z"
  generation: 21
  name: ingress-operator
  namespace: openshift-ingress-operator
  resourceVersion: "157543261"
  selfLink: /apis/extensions/v1beta1/namespaces/openshift-ingress-operator/deployments/ingress-operator
  uid: bb60dac5-8c60-11e9-9dbf-02ad44d21d9e
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: ingress-operator
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: ingress-operator
    spec:
      containers:
      - command:
        - ingress-operator
        - start
        env:
        - name: RELEASE_VERSION
          value: 4.3.0
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:87c6ad4e9b3111fcd96eaba4857ab3bbed53d8cfda37c86be76ad61e2db4d5f2
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9107a722c76c8c8932c2a28bd4d900ece7a817be26ebd940ae9e8d7607201f01
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        ports:
        - containerPort: 60000
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/pki/ca-trust/extracted/pem
          name: trusted-ca
          readOnly: true
      - args:
        - --logtostderr
        - --secure-listen-address=:9393
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
        - --upstream=http://127.0.0.1:60000/
        - --tls-cert-file=/etc/tls/private/tls.crt
        - --tls-private-key-file=/etc/tls/private/tls.key
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7b1c8f2c94af9c767ab0f60f23f8fb5b7545fdfd7c10b1ef979b7701e495f36
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 9393
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 40Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/tls/private
          name: metrics-tls
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        beta.kubernetes.io/os: linux
        kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-operator
      serviceAccountName: ingress-operator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      volumes:
      - name: metrics-tls
        secret:
          defaultMode: 420
          secretName: metrics-tls
      - configMap:
          defaultMode: 420
          items:
          - key: ca-bundle.crt
            path: tls-ca-bundle.pem
          name: trusted-ca
        name: trusted-ca
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2019-06-11T15:51:10Z"
    lastUpdateTime: "2020-02-10T19:21:37Z"
    message: ReplicaSet "ingress-operator-87d579545" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-02-10T19:49:17Z"
    lastUpdateTime: "2020-02-10T19:49:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 21
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
[root.internal mwoodson-tmp]#
-----------------------------------------------------------------------

Here is the clusterversion:
-----------------------------------------------------------------------
# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-06-11T15:50:09Z"
    generation: 25
    name: version
    namespace: ""
    resourceVersion: "158703608"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 976625de-8c60-11e9-9dbf-02ad44d21d9e
  spec:
    channel: fast-4.3
    clusterID: de278845-2ed3-4a60-b4b5-a7ba6c47c615
    desiredUpdate:
      force: false
      image: ""
      version: 4.3.0
    overrides:
    - group: operators.coreos.com
      kind: OperatorSource
      name: redhat-operators
      namespace: openshift-marketplace
      unmanaged: true
    - group: operators.coreos.com
      kind: OperatorSource
      name: certified-operators
      namespace: openshift-marketplace
      unmanaged: true
    - group: operators.coreos.com
      kind: OperatorSource
      name: community-operators
      namespace: openshift-marketplace
      unmanaged: true
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-06-11T16:08:44Z"
      message: Done applying 4.3.0
      status: "True"
      type: Available
    - lastTransitionTime: "2020-02-10T19:57:30Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-02-10T19:57:45Z"
      message: Cluster version is 4.3.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-02-12T15:58:31Z"
      status: "True"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-02-10T19:14:12Z"
      message: Disabling ownership via cluster version overrides prevents upgrades.
        Please remove overrides before continuing.
      reason: ClusterVersionOverridesSet
      status: "False"
      type: Upgradeable
    desired:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d
      version: 4.3.0
    history:
    - completionTime: "2020-02-10T19:57:45Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d
      startedTime: "2020-02-10T19:13:58Z"
      state: Completed
      verified: true
      version: 4.3.0
    - completionTime: "2020-02-03T16:24:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:e5a6e348721c38a78d9299284fbb5c60fb340135a86b674b038500bf190ad514
      startedTime: "2020-02-03T15:28:23Z"
      state: Completed
      verified: true
      version: 4.2.16
    - completionTime: "2020-01-20T20:56:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:782b41750f3284f3c8ee2c1f8cb896896da074e362cf8a472846356d1617752d
      startedTime: "2020-01-20T20:12:56Z"
      state: Completed
      verified: true
      version: 4.2.13
    - completionTime: "2020-01-07T19:37:56Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:77ade34c373062c6a6c869e0e56ef93b2faaa373adadaac1430b29484a24d843
      startedTime: "2020-01-07T18:52:51Z"
      state: Completed
      verified: true
      version: 4.2.12
    - completionTime: "2019-12-09T03:41:08Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:f28cbabd1227352fe704a00df796a4511880174042dece96233036a10ac61639
      startedTime: "2019-12-09T02:56:00Z"
      state: Completed
      verified: true
      version: 4.2.9
    - completionTime: "2019-11-25T16:59:30Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:bac62983757570b9b8f8bc84c740782984a255c16372b3e30cfc8b52c0a187b9
      startedTime: "2019-11-25T16:18:37Z"
      state: Completed
      verified: true
      version: 4.2.7
    - completionTime: "2019-11-18T17:11:07Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:cebce35c054f1fb066a4dc0a518064945087ac1f3637fe23d2ee2b0c433d6ba8
      startedTime: "2019-11-18T16:25:59Z"
      state: Completed
      verified: true
      version: 4.2.4
    - completionTime: "2019-11-06T19:01:29Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:dc782b44cac3d59101904cc5da2b9d8bdb90e55a07814df50ea7a13071b0f5f0
      startedTime: "2019-11-06T18:13:56Z"
      state: Completed
      verified: true
      version: 4.2.2
    - completionTime: "2019-11-04T16:14:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:a68066e534c41010b3750f18d620abede491965d5b0e860f5717b626cde08e5b
      startedTime: "2019-11-04T15:22:41Z"
      state: Completed
      verified: true
      version: 4.1.21
    - completionTime: "2019-11-04T15:22:41Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:a7e97365d16d8d920fedd3684b018b780337e069deb1dd8500e866c0d6110334
      startedTime: "2019-10-21T19:17:27Z"
      state: Completed
      verified: true
      version: 4.1.20
    - completionTime: "2019-10-21T19:17:27Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:420633acf3fc7572372fe2df758152f6ab1f53a21c79a6c4b741fa0394c7df3a
      startedTime: "2019-09-30T16:51:27Z"
      state: Completed
      verified: true
      version: 4.1.18
    - completionTime: "2019-09-30T16:51:27Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-24T22:06:56Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-24T22:06:56Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-16T18:18:15Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-16T18:18:15Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-11T18:35:09Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-11T18:35:09Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-09-03T18:40:32Z"
      state: Completed
      verified: true
      version: 4.1.13
    - completionTime: "2019-09-03T18:40:32Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-08T14:36:37Z"
      state: Completed
      verified: true
      version: 4.1.9
    - completionTime: "2019-08-08T14:36:37Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3ea2648231035c1a65e8d91fa818bb225a2815bc0d6abfc35063a11eaba8659f
      startedTime: "2019-08-06T17:55:35Z"
      state: Completed
      verified: true
      version: 4.1.8
    - completionTime: "2019-08-06T17:55:35Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:c9ce7c3b1e77d6cc5ee366364e4e0c6c901556aa3f61f7bd394b5e3010a1f551
      startedTime: "2019-07-25T18:54:42Z"
      state: Completed
      verified: true
      version: 4.1.7
    - completionTime: "2019-07-25T18:54:42Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:f852f9d8c2e81a633e874e57a7d9bdd52588002a9b32fc037dba12b67cf1f8b0
      startedTime: "2019-06-27T16:19:53Z"
      state: Completed
      verified: true
      version: 4.1.3
    - completionTime: "2019-06-27T16:19:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b
      startedTime: "2019-06-20T17:03:29Z"
      state: Completed
      verified: true
      version: 4.1.2
    - completionTime: "2019-06-20T17:03:29Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6
      startedTime: "2019-06-11T15:50:56Z"
      state: Completed
      verified: false
      version: 4.1.0
    observedGeneration: 25
    versionHash: 6EOuKCSgs6s=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
-----------------------------------------------------------------------
Adding this as it _may_ be relevant. When we did the upgrade, we also got a similar alert from the autoscaler operator which is being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1802214 It could be completely unrelated, but putting it here just in case.
Here is a workaround. Working with Miciah, he suggested removing the ingress operator deployment and letting the CVO replace it.

--------------------------------------------------------------------
oc delete deployment ingress-operator -n openshift-ingress-operator
--------------------------------------------------------------------

Wait for a bit (seconds to minutes). The CVO creates a new version of the deployment and then redeploys the ingress operator pod. This fixes the alert.
The problem is that the deployment in comment 0 defines two "metrics" ports: port 60000 on the "ingress-operator" container and port 9393 on the "kube-rbac-proxy" container. The metrics endpoints resource picks the wrong port (60000), which fails.

I suspect a defect in cluster-version-operator because the release image manifest only defines the one "metrics" port:

% oc image extract quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d --only-files
% grep -e 'name: metrics' release-manifests/0000_50_cluster-ingress-operator_02-deployment.yaml
  name: metrics
  name: metrics-tls
  - name: metrics-tls
%

We did define a "metrics" port on the "ingress-operator" container in 4.1[1], but we deleted it in 4.2[2], and we added the "metrics" port on the "kube-rbac-proxy" container in 4.3[3]. It looks like CVO is blending manifests from different releases, so I am re-assigning this report to CVO.

1. https://github.com/openshift/cluster-ingress-operator/blob/release-4.1/manifests/02-deployment.yaml#L40-L42
2. https://github.com/openshift/cluster-ingress-operator/pull/266/commits/27c6abb616c1681dcda8a536e356f7bd2b35830b
3. https://github.com/openshift/cluster-ingress-operator/blob/release-4.3/manifests/02-deployment.yaml#L70-L72
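For anyone checking other clusters for the same symptom, here is a minimal sketch of how the duplicate port name could be detected. It is not part of any OpenShift tooling: the function name `duplicate_port_names` is made up for illustration, and the `pod_spec` dictionary is a hand-trimmed copy of the pod template from comment 0 rather than live `oc` output.

```python
from collections import defaultdict


def duplicate_port_names(pod_spec):
    """Return port names that are declared more than once in a pod spec.

    The result maps each duplicated name to a list of
    (container name, containerPort) tuples, in declaration order.
    """
    seen = defaultdict(list)
    for container in pod_spec.get("containers", []):
        for port in container.get("ports", []):
            name = port.get("name")
            if name:  # unnamed ports cannot collide by name
                seen[name].append((container["name"], port["containerPort"]))
    return {name: uses for name, uses in seen.items() if len(uses) > 1}


# Hand-trimmed copy of the pod template from comment 0 (illustrative only).
pod_spec = {
    "containers": [
        {
            "name": "ingress-operator",
            "ports": [{"containerPort": 60000, "name": "metrics", "protocol": "TCP"}],
        },
        {
            "name": "kube-rbac-proxy",
            "ports": [{"containerPort": 9393, "name": "metrics", "protocol": "TCP"}],
        },
    ]
}

print(duplicate_port_names(pod_spec))
```

On the trimmed spec this reports "metrics" on both containers (60000 and 9393), which matches the two definitions described above. A healthy 4.3 deployment should report no duplicates.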
looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1801300
> looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1801300

Indeed, it is similar, although in the case of cluster-autoscaler-operator, the old "metrics" port was removed[1] and the new one added[2] in the same release (4.3).

1. https://github.com/openshift/cluster-autoscaler-operator/pull/119/commits/cebc1b062c872a136109f1916e4b25c0f7ef5ebe#diff-b04b9fb0b15ef843b20a21cf9b14d3ddL53-L54
2. https://github.com/openshift/cluster-autoscaler-operator/pull/122/commits/c23349abaa66cd3247543d7f6264d079c034ac6d#
Is this a dup of bug 1783221 (4.4, VERIFIED), bug 1798049 (4.3, POST), and bug 1800346 (4.2, POST) about the CVO exploding when you remove a port?
(In reply to Miciah Dashiel Butler Masters from comment #3)
> We did define a "metrics" port on the "ingress-operator" container in
> 4.1[1], but we deleted it in 4.2[2]...

Ok, so this is the same bug with the CVO not removing container ports which is being fixed in 4.4 for bug 1801300. Vadim cloned that back to 4.3.z with bug 1802710, which should address the 4.2 -> 4.3 autoscaler issue. I'll point this one at 4.2.z to address the 4.1 -> 4.2 issue.
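To illustrate the failure mode described here (a reconcile step that never removes container ports), the following is a hypothetical sketch only. It is not the CVO's actual code; the function names and the choice to match ports by `containerPort` are assumptions made for the example.

```python
def merge_ports_additive(existing, required):
    """Hypothetical additive merge: add or update required ports, never remove.

    Ports present on the live object but absent from the manifest survive,
    which is the behavior this bug describes.
    """
    merged = [dict(port) for port in existing]
    for req in required:
        for cur in merged:
            if cur["containerPort"] == req["containerPort"]:
                cur.update(req)  # update the matching port in place
                break
        else:
            merged.append(dict(req))  # no match found, so add it
    return merged


def merge_ports_exact(existing, required):
    """Hypothetical fixed merge: the manifest is authoritative for ports."""
    return [dict(port) for port in required]


# Live 4.1-era "ingress-operator" container ports (from comment 0).
live_ports = [{"containerPort": 60000, "name": "metrics", "protocol": "TCP"}]
# The 4.2 manifest declares no ports on that container.
manifest_ports = []

print(merge_ports_additive(live_ports, manifest_ports))  # stale port survives
print(merge_ports_exact(live_ports, manifest_ports))     # stale port removed
```

Under these assumptions the additive merge leaves port 60000 in place after the 4.1 -> 4.2 upgrade, so when 4.3 adds a "metrics" port on "kube-rbac-proxy" the deployment ends up with two ports of the same name.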
*** Bug 1803258 has been marked as a duplicate of this bug. ***
Verified with an upgrade from 4.1.0-0.nightly-2020-02-13-142910 to 4.2.0-0.nightly-2020-02-19-222048; the issue has been fixed. The "metrics" port has been removed after the upgrade to 4.2, and no alerts are firing for ingress-operator.

### before upgrade
    spec:
      containers:
      - command:
        - ingress-operator
        env:
        - name: RELEASE_VERSION
          value: 4.1.0-0.nightly-2020-02-13-142910
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f830dc2fce1306140952995158875b8acf46af9dd6fc7d39a127b8a8fb13021f
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:35f4468e17c9dfb08cc068699a7ccee5e7cc617ba6cd045bdad6f88c2737a04a
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        ports:
        - containerPort: 60000
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError

### after upgrade
    spec:
      containers:
      - command:
        - ingress-operator
        env:
        - name: RELEASE_VERSION
          value: 4.2.0-0.nightly-2020-02-19-222048
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4574bb2945b3701c912224af8eaa4108760a7580b7e4519899d3c2f0ff5ed06f
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0eb862d24add2fc1b255fa8e60a6556ea53ec7a6809d17f64b079b0d804a2e80
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0614
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475