Bug 1802248 - Upgrading from 4.2.16 -> 4.3.0 causes alerts with the openshift-ingress-operator
Summary: Upgrading from 4.2.16 -> 4.3.0 causes alerts with the openshift-ingress-operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.z
Assignee: W. Trevor King
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1803258
Depends On: 1802710
Blocks:
Reported: 2020-02-12 17:42 UTC by Matt Woodson
Modified: 2020-03-04 04:51 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-04 04:51:02 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 325 None closed Bug 1802248: lib/resourcemerge: remove ports which are no longer required 2020-07-28 14:21:07 UTC
Red Hat Product Errata RHBA-2020:0614 None None None 2020-03-04 04:51:12 UTC

Description Matt Woodson 2020-02-12 17:42:32 UTC
Description of problem:

We recently upgraded OpenShift Dedicated clusters from 4.2.16 to 4.3.0.  After the upgrade, we saw the following alert firing:

{"__name__":"ALERTS","alertname":"TargetDown","alertstate":"firing","job":"metrics","namespace":"openshift-ingress-operator","service":"metrics","severity":"warning"}



Version-Release number of selected component (if applicable):

4.3.0

How reproducible:

Very.  All clusters we have upgraded (10+) have seen this issue.

Steps to Reproduce:
1. Upgrade from 4.2.16 -> 4.3.0




Additional info:


Here is the deployment:

-----------------------------------------------------------------------
# oc get deployment ingress-operator -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    config.openshift.io/inject-proxy: ingress-operator
    deployment.kubernetes.io/revision: "21"
  creationTimestamp: "2019-06-11T15:51:09Z"
  generation: 21
  name: ingress-operator
  namespace: openshift-ingress-operator
  resourceVersion: "157543261"
  selfLink: /apis/extensions/v1beta1/namespaces/openshift-ingress-operator/deployments/ingress-operator
  uid: bb60dac5-8c60-11e9-9dbf-02ad44d21d9e
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: ingress-operator
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: ingress-operator
    spec:
      containers:
      - command:
        - ingress-operator
        - start
        env:
        - name: RELEASE_VERSION
          value: 4.3.0
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:87c6ad4e9b3111fcd96eaba4857ab3bbed53d8cfda37c86be76ad61e2db4d5f2
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9107a722c76c8c8932c2a28bd4d900ece7a817be26ebd940ae9e8d7607201f01
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        ports:
        - containerPort: 60000
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/pki/ca-trust/extracted/pem
          name: trusted-ca
          readOnly: true
      - args:
        - --logtostderr
        - --secure-listen-address=:9393
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
        - --upstream=http://127.0.0.1:60000/
        - --tls-cert-file=/etc/tls/private/tls.crt
        - --tls-private-key-file=/etc/tls/private/tls.key
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7b1c8f2c94af9c767ab0f60f23f8fb5b7545fdfd7c10b1ef979b7701e495f36
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 9393
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 40Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/tls/private
          name: metrics-tls
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        beta.kubernetes.io/os: linux
        kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-operator
      serviceAccountName: ingress-operator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      volumes:
      - name: metrics-tls
        secret:
          defaultMode: 420
          secretName: metrics-tls
      - configMap:
          defaultMode: 420
          items:
          - key: ca-bundle.crt
            path: tls-ca-bundle.pem
          name: trusted-ca
        name: trusted-ca
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2019-06-11T15:51:10Z"
    lastUpdateTime: "2020-02-10T19:21:37Z"
    message: ReplicaSet "ingress-operator-87d579545" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-02-10T19:49:17Z"
    lastUpdateTime: "2020-02-10T19:49:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 21
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
-----------------------------------------------------------------------

Here is the clusterversion:
-----------------------------------------------------------------------
# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-06-11T15:50:09Z"
    generation: 25
    name: version
    namespace: ""
    resourceVersion: "158703608"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 976625de-8c60-11e9-9dbf-02ad44d21d9e
  spec:
    channel: fast-4.3
    clusterID: de278845-2ed3-4a60-b4b5-a7ba6c47c615
    desiredUpdate:
      force: false
      image: ""
      version: 4.3.0
    overrides:
    - group: operators.coreos.com
      kind: OperatorSource
      name: redhat-operators
      namespace: openshift-marketplace
      unmanaged: true
    - group: operators.coreos.com
      kind: OperatorSource
      name: certified-operators
      namespace: openshift-marketplace
      unmanaged: true
    - group: operators.coreos.com
      kind: OperatorSource
      name: community-operators
      namespace: openshift-marketplace
      unmanaged: true
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-06-11T16:08:44Z"
      message: Done applying 4.3.0
      status: "True"
      type: Available
    - lastTransitionTime: "2020-02-10T19:57:30Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-02-10T19:57:45Z"
      message: Cluster version is 4.3.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-02-12T15:58:31Z"
      status: "True"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-02-10T19:14:12Z"
      message: Disabling ownership via cluster version overrides prevents upgrades.
        Please remove overrides before continuing.
      reason: ClusterVersionOverridesSet
      status: "False"
      type: Upgradeable
    desired:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d
      version: 4.3.0
    history:
    - completionTime: "2020-02-10T19:57:45Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d
      startedTime: "2020-02-10T19:13:58Z"
      state: Completed
      verified: true
      version: 4.3.0
    - completionTime: "2020-02-03T16:24:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:e5a6e348721c38a78d9299284fbb5c60fb340135a86b674b038500bf190ad514
      startedTime: "2020-02-03T15:28:23Z"
      state: Completed
      verified: true
      version: 4.2.16
    - completionTime: "2020-01-20T20:56:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:782b41750f3284f3c8ee2c1f8cb896896da074e362cf8a472846356d1617752d
      startedTime: "2020-01-20T20:12:56Z"
      state: Completed
      verified: true
      version: 4.2.13
    - completionTime: "2020-01-07T19:37:56Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:77ade34c373062c6a6c869e0e56ef93b2faaa373adadaac1430b29484a24d843
      startedTime: "2020-01-07T18:52:51Z"
      state: Completed
      verified: true
      version: 4.2.12
    - completionTime: "2019-12-09T03:41:08Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:f28cbabd1227352fe704a00df796a4511880174042dece96233036a10ac61639
      startedTime: "2019-12-09T02:56:00Z"
      state: Completed
      verified: true
      version: 4.2.9
    - completionTime: "2019-11-25T16:59:30Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:bac62983757570b9b8f8bc84c740782984a255c16372b3e30cfc8b52c0a187b9
      startedTime: "2019-11-25T16:18:37Z"
      state: Completed
      verified: true
      version: 4.2.7
    - completionTime: "2019-11-18T17:11:07Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:cebce35c054f1fb066a4dc0a518064945087ac1f3637fe23d2ee2b0c433d6ba8
      startedTime: "2019-11-18T16:25:59Z"
      state: Completed
      verified: true
      version: 4.2.4
    - completionTime: "2019-11-06T19:01:29Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:dc782b44cac3d59101904cc5da2b9d8bdb90e55a07814df50ea7a13071b0f5f0
      startedTime: "2019-11-06T18:13:56Z"
      state: Completed
      verified: true
      version: 4.2.2
    - completionTime: "2019-11-04T16:14:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:a68066e534c41010b3750f18d620abede491965d5b0e860f5717b626cde08e5b
      startedTime: "2019-11-04T15:22:41Z"
      state: Completed
      verified: true
      version: 4.1.21
    - completionTime: "2019-11-04T15:22:41Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:a7e97365d16d8d920fedd3684b018b780337e069deb1dd8500e866c0d6110334
      startedTime: "2019-10-21T19:17:27Z"
      state: Completed
      verified: true
      version: 4.1.20
    - completionTime: "2019-10-21T19:17:27Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:420633acf3fc7572372fe2df758152f6ab1f53a21c79a6c4b741fa0394c7df3a
      startedTime: "2019-09-30T16:51:27Z"
      state: Completed
      verified: true
      version: 4.1.18
    - completionTime: "2019-09-30T16:51:27Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-24T22:06:56Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-24T22:06:56Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-16T18:18:15Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-16T18:18:15Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-11T18:35:09Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-11T18:35:09Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-09-03T18:40:32Z"
      state: Completed
      verified: true
      version: 4.1.13
    - completionTime: "2019-09-03T18:40:32Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:27fd24c705d1107cc73cb7dda8257fe97900e130b68afc314d0ef0e31bcf9b8e
      startedTime: "2019-08-08T14:36:37Z"
      state: Completed
      verified: true
      version: 4.1.9
    - completionTime: "2019-08-08T14:36:37Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3ea2648231035c1a65e8d91fa818bb225a2815bc0d6abfc35063a11eaba8659f
      startedTime: "2019-08-06T17:55:35Z"
      state: Completed
      verified: true
      version: 4.1.8
    - completionTime: "2019-08-06T17:55:35Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:c9ce7c3b1e77d6cc5ee366364e4e0c6c901556aa3f61f7bd394b5e3010a1f551
      startedTime: "2019-07-25T18:54:42Z"
      state: Completed
      verified: true
      version: 4.1.7
    - completionTime: "2019-07-25T18:54:42Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:f852f9d8c2e81a633e874e57a7d9bdd52588002a9b32fc037dba12b67cf1f8b0
      startedTime: "2019-06-27T16:19:53Z"
      state: Completed
      verified: true
      version: 4.1.3
    - completionTime: "2019-06-27T16:19:53Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b
      startedTime: "2019-06-20T17:03:29Z"
      state: Completed
      verified: true
      version: 4.1.2
    - completionTime: "2019-06-20T17:03:29Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6
      startedTime: "2019-06-11T15:50:56Z"
      state: Completed
      verified: false
      version: 4.1.0
    observedGeneration: 25
    versionHash: 6EOuKCSgs6s=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

-----------------------------------------------------------------------

Comment 1 Matt Woodson 2020-02-12 17:44:28 UTC
Adding this as it _may_ be relevant.  When we did the upgrade, we also got a similar alert from the autoscaler operator which is being tracked here:

https://bugzilla.redhat.com/show_bug.cgi?id=1802214

It could be completely unrelated, but putting it here just in case.

Comment 2 Matt Woodson 2020-02-12 17:55:21 UTC
Here is a workaround.

Miciah suggested deleting the ingress operator deployment and letting the CVO recreate it.

--------------------------------------------------------------------
oc delete deployment ingress-operator -n openshift-ingress-operator
--------------------------------------------------------------------


Wait for a bit (seconds to minutes).  The CVO creates a fresh copy of the deployment, which then redeploys the ingress operator pod.  This clears the alert.

Comment 3 Miciah Dashiel Butler Masters 2020-02-12 17:59:10 UTC
The problem is that the deployment in comment 0 defines two "metrics" ports: port 60000 on the "ingress-operator" container and port 9393 on the "kube-rbac-proxy" container.  The metrics endpoints resource picks the wrong port (60000), which fails.  I suspect a defect in cluster-version-operator because the release image manifest only defines the one "metrics" port:

    % oc image extract quay.io/openshift-release-dev/ocp-release@sha256:3a516480dfd68e0f87f702b4d7bdd6f6a0acfdac5cd2e9767b838ceede34d70d --only-files
    % grep -e 'name: metrics' release-manifests/0000_50_cluster-ingress-operator_02-deployment.yaml
                name: metrics
                name: metrics-tls
          - name: metrics-tls
    % 

We did define a "metrics" port on the "ingress-operator" container in 4.1[1], but we deleted it in 4.2[2], and we added the "metrics" port on the "kube-rbac-proxy" container in 4.3[3].  It looks like CVO is blending manifests from different releases, so I am re-assigning this report to CVO.

1. https://github.com/openshift/cluster-ingress-operator/blob/release-4.1/manifests/02-deployment.yaml#L40-L42
2. https://github.com/openshift/cluster-ingress-operator/pull/266/commits/27c6abb616c1681dcda8a536e356f7bd2b35830b
3. https://github.com/openshift/cluster-ingress-operator/blob/release-4.3/manifests/02-deployment.yaml#L70-L72
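
To make the collision described above easier to spot, here is a minimal Python sketch that scans a pod template for port names used by more than one container. It is only an illustration: the function name duplicate_port_names is hypothetical, and the input is a hand-trimmed dict copy of the deployment from comment 0, not something read from a live cluster.

```python
from collections import defaultdict


def duplicate_port_names(pod_spec):
    """Return {port name: [(container name, containerPort), ...]} for names used more than once."""
    seen = defaultdict(list)
    for container in pod_spec.get("containers", []):
        for port in container.get("ports", []):
            name = port.get("name")
            if name:  # unnamed ports cannot collide by name
                seen[name].append((container["name"], port["containerPort"]))
    return {name: uses for name, uses in seen.items() if len(uses) > 1}


# Hand-trimmed copy of the pod template from comment 0 (only names and ports kept).
pod_spec = {
    "containers": [
        {"name": "ingress-operator",
         "ports": [{"containerPort": 60000, "name": "metrics", "protocol": "TCP"}]},
        {"name": "kube-rbac-proxy",
         "ports": [{"containerPort": 9393, "name": "metrics", "protocol": "TCP"}]},
    ]
}

print(duplicate_port_names(pod_spec))
```

For the deployment in comment 0 this reports "metrics" on both containers (60000 and 9393); for a deployment matching the 4.3 release manifest, which defines only the one "metrics" port, it would return an empty dict.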

Comment 4 Abhinav Dahiya 2020-02-12 18:04:00 UTC
looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1801300

Comment 5 Miciah Dashiel Butler Masters 2020-02-12 18:23:48 UTC
> looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1801300

Indeed, it is similar, although in the case of cluster-autoscaler-operator, the old "metrics" port was removed[1] and the new added[2] in the same release (4.3).

1. https://github.com/openshift/cluster-autoscaler-operator/pull/119/commits/cebc1b062c872a136109f1916e4b25c0f7ef5ebe#diff-b04b9fb0b15ef843b20a21cf9b14d3ddL53-L54
2. https://github.com/openshift/cluster-autoscaler-operator/pull/122/commits/c23349abaa66cd3247543d7f6264d079c034ac6d#

Comment 6 W. Trevor King 2020-02-12 21:35:32 UTC
Is this a dup of bug 1783221 (4.4, VERIFIED), bug 1798049 (4.3, POST), and bug 1800346 (4.2, POST) about the CVO exploding when you remove a port?

Comment 8 W. Trevor King 2020-02-13 22:13:57 UTC
(In reply to Miciah Dashiel Butler Masters from comment #3)
> We did define a "metrics" port on the "ingress-operator" container in
> 4.1[1], but we deleted it in 4.2[2]...

Ok, so this is the same bug with the CVO not removing container ports which is being fixed in 4.4 for bug 1801300.  Vadim cloned that back to 4.3.z with bug 1802710, which should address the 4.2 -> 4.3 autoscaler issue.  I'll point this one at 4.2.z to address the 4.1 -> 4.2 issue.
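
As a rough illustration of the behaviour discussed here (this is not the CVO's actual lib/resourcemerge code, which is written in Go), the Python sketch below contrasts a merge that only adds or updates container ports with one that also drops ports the manifest no longer lists. The function names and the choice to match ports on containerPort are assumptions made for the sketch.

```python
def merge_ports_additive(existing, required):
    """Illustrates the problem: required ports are added or updated, nothing is removed."""
    merged = [dict(p) for p in existing]
    for req in required:
        for cur in merged:
            # Assumption for this sketch: ports are matched on containerPort.
            if cur["containerPort"] == req["containerPort"]:
                cur.update(req)
                break
        else:
            merged.append(dict(req))
    return merged


def merge_ports_pruning(existing, required):
    """Illustrates the fix: ports absent from the manifest are dropped before merging."""
    wanted = {p["containerPort"] for p in required}
    kept = [dict(p) for p in existing if p["containerPort"] in wanted]
    return merge_ports_additive(kept, required)


# "ingress-operator" container: 4.1 defined the port, the 4.2 manifest lists none.
in_cluster = [{"containerPort": 60000, "name": "metrics", "protocol": "TCP"}]
manifest = []

print(merge_ports_additive(in_cluster, manifest))  # stale port survives the update
print(merge_ports_pruning(in_cluster, manifest))   # stale port is removed
```

With the additive merge, the 4.1 "metrics" port on 60000 survives every later update, which matches the leftover port seen in comment 0; treating the manifest as authoritative removes it.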

Comment 9 W. Trevor King 2020-02-14 20:23:25 UTC
*** Bug 1803258 has been marked as a duplicate of this bug. ***

Comment 12 Hongan Li 2020-02-21 07:59:57 UTC
Verified with an upgrade from 4.1.0-0.nightly-2020-02-13-142910 to 4.2.0-0.nightly-2020-02-19-222048; the issue has been fixed.
The metrics port has been removed after the upgrade to 4.2, and no alerts are firing for ingress-operator.

### before upgrade
    spec:
      containers:
      - command:
        - ingress-operator
        env:
        - name: RELEASE_VERSION
          value: 4.1.0-0.nightly-2020-02-13-142910
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f830dc2fce1306140952995158875b8acf46af9dd6fc7d39a127b8a8fb13021f
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:35f4468e17c9dfb08cc068699a7ccee5e7cc617ba6cd045bdad6f88c2737a04a
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        ports:
        - containerPort: 60000
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError

### after upgrade
    spec:
      containers:
      - command:
        - ingress-operator
        env:
        - name: RELEASE_VERSION
          value: 4.2.0-0.nightly-2020-02-19-222048
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4574bb2945b3701c912224af8eaa4108760a7580b7e4519899d3c2f0ff5ed06f
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0eb862d24add2fc1b255fa8e60a6556ea53ec7a6809d17f64b079b0d804a2e80
        imagePullPolicy: IfNotPresent
        name: ingress-operator
        resources:
          requests:
            cpu: 10m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError

Comment 14 errata-xmlrpc 2020-03-04 04:51:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0614

