Bug 1868304 - ValidatingWebhookConfiguration prometheusrules.openshift.io blocks monitoring downgrade from 4.6 to 4.5
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.5.z
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
Depends On: 1869301
Reported: 2020-08-12 09:23 UTC by Junqi Zhao
Modified: 2020-09-08 10:55 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1869301 (view as bug list)
Last Closed: 2020-09-08 10:54:46 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 909 None closed Bug 1868304: remove ValidatingWebhookConfiguration for downgrades 2020-08-26 18:07:28 UTC
Red Hat Product Errata RHBA-2020:3510 None None None 2020-09-08 10:55:29 UTC

Description Junqi Zhao 2020-08-12 09:23:04 UTC
Description of problem:
We added the ValidatingWebhookConfiguration prometheusrules.openshift.io in 4.6, but it blocks the monitoring downgrade from 4.6 to 4.5.
The workaround is to delete it:
# oc delete ValidatingWebhookConfiguration prometheusrules.openshift.io
After that, the downgrade continues.

Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221:
# oc get co/monitoring -oyaml
  - lastTransitionTime: "2020-08-12T06:40:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s: no service port 8080 found for service "prometheus-operator"'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
Note: in a healthy 4.6 cluster, the service exposes both ports:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP,8080/TCP   6h23m

After the downgrade to 4.5, the service exposes only port 8443, which is expected for 4.5:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP   5h5m
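
The port mismatch explains the error message: the webhook configuration left over from 4.6 still points at port 8080 of the prometheus-operator service, which the 4.5 service does not expose. Below is a minimal Python sketch of that lookup, for illustration only. The function name and the data layout are hypothetical; the real resolution happens inside the Kubernetes API server.

```python
def resolve_service_port(service_name, exposed_ports, wanted_port):
    """Return wanted_port if the service exposes it, else raise LookupError.

    Hypothetical helper that mimics how a webhook's service reference
    is matched against the ports of the target service.
    """
    if wanted_port not in exposed_ports:
        raise LookupError(
            f'no service port {wanted_port} found for service "{service_name}"'
        )
    return wanted_port


# Port sets as reported by `oc get svc prometheus-operator` in this bug.
PORTS_4_6 = {8443, 8080}
PORTS_4_5 = {8443}

# Healthy 4.6 cluster: the webhook port resolves.
print(resolve_service_port("prometheus-operator", PORTS_4_6, 8080))

# After the downgrade to 4.5: the lookup fails, as the webhook call did.
try:
    resolve_service_port("prometheus-operator", PORTS_4_5, 8080)
except LookupError as err:
    print(err)
```

Deleting the webhook configuration, as in the workaround, removes the dangling reference, so the lookup is never attempted.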

# oc -n openshift-monitoring get svc prometheus-operator -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: prometheus-operator-tls
  creationTimestamp: "2020-08-12T03:49:15Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:service.beta.openshift.io/serving-cert-secret-name: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
      f:spec:
        f:clusterIP: {}
        f:ports:
          .: {}
          k:{"port":8443,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: operator
    operation: Update
    time: "2020-08-12T06:36:52Z"
  name: prometheus-operator
  namespace: openshift-monitoring
  resourceVersion: "163229"
  selfLink: /api/v1/namespaces/openshift-monitoring/services/prometheus-operator
  uid: 56e0efe0-81e2-48e5-b75a-3df951163071
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Version-Release number of selected component (if applicable):
Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

How reproducible:

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

Actual results:
The monitoring downgrade is blocked.

Expected results:
The monitoring downgrade completes without being blocked.

Additional info:

Comment 1 Sergiusz Urbaniak 2020-08-13 05:23:45 UTC
I am not sure downgrades are supported; I need to clarify.

Comment 2 Sergiusz Urbaniak 2020-08-13 05:33:47 UTC
Pinging the pillar lead to clarify whether this is considered a release-blocking issue and whether downgrades are supported.

Comment 3 Ben Parees 2020-08-13 13:08:51 UTC
Downgrades are supported. We allow you to downgrade in order to fix an issue before moving forward again. We don't support running long term on a cluster that has been downgraded (so some odd/broken behavior is acceptable), but you have to be able to perform the downgrade temporarily.

Comment 4 Sergiusz Urbaniak 2020-08-17 11:44:34 UTC
To be more precise, after talking to Ben Parees out of band: we need to support downgrades for a short period of time only, so as not to break the stack. Keeping a downgraded cluster running for a long time is neither supported nor envisioned.

We have the following implementation strategies at hand:

a) We implement an explicit removal of the openshift-monitoring/prometheus-operator webhook [1] in CMO's code in the 4.5 release branch.
The advantage of this option is that it is cleaner: the 4.6 assets are removed on downgrade.
The downside is that it fixes the problem only for 4.5.z versions released after the patch lands.

b) Instead of adding another port `web` to the existing openshift-monitoring/prometheus-operator service [2], we could create another dedicated service. This way, when CMO is downgraded, the service and the webhook would stay.
The advantage of this option is that it is compatible with all 4.5.z releases.
The downside is that we leave 4.6 assets (the webhook itself and the webhook service) around in a 4.5 environment, which exposes untested functionality (webhook validation) in 4.5.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/061ba1cbed128a2b3858261b5b89f6aef268a08b/assets/prometheus-operator/prometheus-rule-validating-webhook.yaml
[2] https://github.com/openshift/cluster-monitoring-operator/blob/9d45decd69cbc40d88d869815bd3ad9fec77e5c9/assets/prometheus-operator/service.yaml#L18-L20
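
Option a) amounts to an idempotent delete that runs on every sync: remove the webhook if it is there, and treat "already absent" as success. Below is a minimal Python sketch of that behavior, for illustration only. CMO itself is written in Go; the dict standing in for the cluster API, the `servicePort` field, and the function name are all hypothetical.

```python
WEBHOOK_NAME = "prometheusrules.openshift.io"


def remove_webhook(cluster_objects, name=WEBHOOK_NAME):
    """Delete the named webhook configuration if it is present.

    cluster_objects is a hypothetical in-memory stand-in (a dict keyed by
    object name) for the ValidatingWebhookConfiguration API. Returns True
    when an object was deleted and False when it was already absent.
    """
    # dict.pop with a default never raises, so a missing object is not an error.
    return cluster_objects.pop(name, None) is not None


# A cluster that was just downgraded from 4.6 still carries the webhook.
downgraded = {WEBHOOK_NAME: {"servicePort": 8080}}
print(remove_webhook(downgraded))  # first sync removes the 4.6 asset
print(remove_webhook(downgraded))  # later syncs find nothing to delete
```

Tolerating the missing object matters because the webhook was only added in 4.6, so a cluster installed directly at 4.5 never had it.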

Comment 8 Junqi Zhao 2020-08-21 08:42:30 UTC
Downgraded from 4.6.0-0.nightly-2020-08-20-174655 to 4.5.0-0.nightly-2020-08-20-011847; monitoring is no longer blocked.
# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-20-011847 --allow-explicit-upgrade=true --force
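
One way to check the monitoring operator after such a downgrade is to look for a Degraded condition with status "True" in the ClusterOperator object. Below is a minimal Python sketch, assuming the object has been fetched as JSON (for example with `oc get co/monitoring -o json`). The helper name and the trimmed sample are hypothetical and only mirror the condition fields quoted in the description.

```python
import json


def degraded_message(cluster_operator):
    """Return the Degraded condition's message if its status is "True", else None."""
    for cond in cluster_operator.get("status", {}).get("conditions", []):
        if cond.get("type") == "Degraded" and cond.get("status") == "True":
            return cond.get("message", "")
    return None


# Trimmed sample shaped like the condition quoted in the description.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "Degraded", "status": "True",
   "reason": "UpdatingPrometheusK8SFailed",
   "message": "Failed to rollout the stack."}
]}}
""")
print(degraded_message(sample))
```

A return value of None means no Degraded condition with status "True" was found in the input.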

Comment 10 errata-xmlrpc 2020-09-08 10:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

