1868304 – ValidatingWebhookConfiguration prometheusrules.openshift.io blocks monitoring downgrade from 4.6 to 4.5

Bug 1868304 - ValidatingWebhookConfiguration prometheusrules.openshift.io blocks monitoring downgrade from 4.6 to 4.5

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Simon Pasquier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1869301
Blocks:

Reported:	2020-08-12 09:23 UTC by Junqi Zhao
Modified:	2020-09-08 10:55 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1869301
Environment:
Last Closed:	2020-09-08 10:54:46 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 909	0	None	closed	Bug 1868304: remove ValidatingWebhookConfiguration for downgrades	2021-02-14 14:19:07 UTC
Red Hat Product Errata	RHBA-2020:3510	0	None	None	None	2020-09-08 10:55:29 UTC

Description Junqi Zhao 2020-08-12 09:23:04 UTC

Description of problem:
The ValidatingWebhookConfiguration prometheusrules.openshift.io was added in 4.6, but it blocks the monitoring downgrade from 4.6 to 4.5.
The workaround is to delete it:
# oc delete ValidatingWebhookConfiguration prometheusrules.openshift.io
After that, the downgrade continues.

Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221:
# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-08-12T06:40:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s: no service port 8080 found for service "prometheus-operator"'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
...
Note: in a healthy 4.6 cluster, the prometheus-operator service exposes both ports 8443 and 8080:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP,8080/TCP   6h23m

When the cluster is downgraded to 4.5, the service only has port 8443, which is expected for 4.5:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP   5h5m

# oc -n openshift-monitoring get svc prometheus-operator -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: prometheus-operator-tls
  creationTimestamp: "2020-08-12T03:49:15Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:service.beta.openshift.io/serving-cert-secret-name: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
      f:spec:
        f:clusterIP: {}
        f:ports:
          .: {}
          k:{"port":8443,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: operator
    operation: Update
    time: "2020-08-12T06:36:52Z"
  name: prometheus-operator
  namespace: openshift-monitoring
  resourceVersion: "163229"
  selfLink: /api/v1/namespaces/openshift-monitoring/services/prometheus-operator
  uid: 56e0efe0-81e2-48e5-b75a-3df951163071
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
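
The failure comes down to a port mismatch: the 4.6 webhook still points at port 8080 of the prometheus-operator service, while the 4.5 service only exposes 8443. A minimal shell sketch of that check, assuming the comma-separated PORT(S) column format shown in the `oc get svc` listings above; the function name `service_has_port` is hypothetical and not part of `oc`:

```shell
# service_has_port PORTS PORT
# Succeeds if PORTS, a comma-separated PORT(S) column such as
# "8443/TCP,8080/TCP", contains PORT/TCP. Hypothetical helper for illustration.
service_has_port() (
  set -f      # disable glob expansion while splitting on commas
  IFS=','
  for entry in $1; do
    [ "$entry" = "$2/TCP" ] && exit 0
  done
  exit 1
)

# Values copied from the two service listings in the description.
service_has_port "8443/TCP,8080/TCP" 8080 && echo "4.6 service: webhook port 8080 present"
service_has_port "8443/TCP" 8080 || echo "4.5 service: webhook port 8080 missing"
```

The function body runs in a subshell, so changing IFS there does not leak into the calling shell.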

Version-Release number of selected component (if applicable):
Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

How reproducible:
always

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

Actual results:
monitoring downgrade is blocked

Expected results:
no issue

Additional info:

Comment 1 Sergiusz Urbaniak 2020-08-13 05:23:45 UTC

I am not sure whether downgrades are supported; I need to clarify.

Comment 2 Sergiusz Urbaniak 2020-08-13 05:33:47 UTC

Pinging the pillar lead to clarify whether this is considered a release-blocking issue and whether downgrades are supported.

Comment 3 Ben Parees 2020-08-13 13:08:51 UTC

Downgrades are supported. We allow you to downgrade in order to fix an issue before moving forward again. We don't support running long term on a cluster that has been downgraded (so some odd or broken behavior is acceptable), but you have to be able to perform the downgrade temporarily.

Comment 4 Sergiusz Urbaniak 2020-08-17 11:44:34 UTC

To be more precise, after talking to Ben Parees OOB: we need to support downgrades for a short period of time only, so that the stack does not break. Keeping a downgraded cluster running for a long time is not supported or envisioned.

We have the following implementation strategies at hand:

a) We implement an explicit removal of the openshift-monitoring/prometheus-operator webhook [1] in CMO's code in the 4.5 release branch.
This option has the advantage of being cleaner, since it removes the 4.6 assets.
The downside is that it fixes the problem only for 4.5.z versions released after the patch lands.

b) Instead of adding another port `web` to the existing openshift-monitoring/prometheus-operator service [2], we could create another dedicated service. This way, when CMO is downgraded, the service and the webhook would stay.
This option has the advantage of being compatible with all 4.5.z releases.
The downside is that we leave 4.6 assets (the webhook itself and the webhook service) around in a 4.5 environment, which exposes untested functionality (webhook validation) in 4.5.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/061ba1cbed128a2b3858261b5b89f6aef268a08b/assets/prometheus-operator/prometheus-rule-validating-webhook.yaml
[2] https://github.com/openshift/cluster-monitoring-operator/blob/9d45decd69cbc40d88d869815bd3ad9fec77e5c9/assets/prometheus-operator/service.yaml#L18-L20
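
Option (a) boils down to an idempotent delete that the 4.5 operator performs when it reconciles. The actual fix lives in CMO's Go code (see the pull request under Links); the following is only a shell sketch of the same idea, with a hypothetical function name, using the `--ignore-not-found` flag of `oc delete` so the step succeeds whether or not the 4.6 webhook is present:

```shell
# Name of the 4.6-only webhook configuration, taken from the bug description.
WEBHOOK_NAME="prometheusrules.openshift.io"

# remove_rule_webhook: delete the webhook configuration if it exists.
# --ignore-not-found turns the call into a no-op on clusters that never
# had the webhook. Hypothetical helper, not part of CMO or oc.
remove_rule_webhook() {
  oc delete validatingwebhookconfiguration "$WEBHOOK_NAME" --ignore-not-found
}
```

Calling `remove_rule_webhook` by hand is equivalent to the manual workaround from the description, except that it can be repeated safely.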

Comment 8 Junqi Zhao 2020-08-21 08:42:30 UTC

Downgraded from 4.6.0-0.nightly-2020-08-20-174655 to 4.5.0-0.nightly-2020-08-20-011847; monitoring is no longer blocked.
# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-20-011847 --allow-explicit-upgrade=true --force
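
To confirm that monitoring is not blocked after such a downgrade, the Degraded condition of the monitoring ClusterOperator can be read directly instead of scanning the full `oc get co/monitoring -oyaml` output. A minimal sketch, assuming `oc` is logged in to the cluster; the function name is hypothetical:

```shell
# monitoring_degraded: succeed if the monitoring ClusterOperator reports
# Degraded=True, fail otherwise. Hypothetical helper for illustration.
monitoring_degraded() {
  status=$(oc get clusteroperator monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')
  [ "$status" = "True" ]
}

# Example use:
#   if monitoring_degraded; then echo "monitoring is degraded"; fi
```

An unreachable cluster also produces a non-True result here, so this is a quick check of one condition rather than a full health probe.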

Comment 10 errata-xmlrpc 2020-09-08 10:54:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510
