Bug 1868304

Summary:	ValidatingWebhookConfiguration prometheusrules.openshift.io blocks monitoring downgrade from 4.6 to 4.5
Product:	OpenShift Container Platform	Reporter:	Junqi Zhao <juzhao>
Component:	Monitoring	Assignee:	Simon Pasquier <spasquie>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.6	CC:	alegrand, anpicker, bparees, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania
Target Milestone:	---	Keywords:	Reopened
Target Release:	4.5.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1869301 (view as bug list)		Environment:
Last Closed:	2020-09-08 10:54:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1869301
Bug Blocks:

Description Junqi Zhao 2020-08-12 09:23:04 UTC

Description of problem:
The ValidatingWebhookConfiguration prometheusrules.openshift.io was added in 4.6, but it blocks the monitoring downgrade from 4.6 to 4.5.
The workaround is to delete it:
# oc delete ValidatingWebhookConfiguration prometheusrules.openshift.io
After that, the downgrade continues.
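
A slightly more defensive form of the workaround, as a sketch only: it tolerates the object already being gone, so it can be rerun safely. The helper name remove_prometheusrules_webhook and the OC override variable are assumptions made for illustration and are not part of any product tooling.

```shell
# Sketch only: the helper name and the OC override are illustrative assumptions.
OC="${OC:-oc}"

remove_prometheusrules_webhook() {
  # --ignore-not-found makes the call succeed when the object is already deleted.
  "$OC" delete validatingwebhookconfiguration prometheusrules.openshift.io --ignore-not-found
}
```

Calling remove_prometheusrules_webhook once against the affected cluster should have the same effect as the manual command above.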

Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221:
# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-08-12T06:40:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s: no service port 8080 found for service "prometheus-operator"'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
...
Note: in a healthy 4.6 cluster, the prometheus-operator service exposes both ports 8443 and 8080:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP,8080/TCP   6h23m

After the downgrade to 4.5, the service exposes only port 8443, which is expected for 4.5:
# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP   5h5m

# oc -n openshift-monitoring get svc prometheus-operator -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: prometheus-operator-tls
  creationTimestamp: "2020-08-12T03:49:15Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:service.beta.openshift.io/serving-cert-secret-name: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
      f:spec:
        f:clusterIP: {}
        f:ports:
          .: {}
          k:{"port":8443,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: operator
    operation: Update
    time: "2020-08-12T06:36:52Z"
  name: prometheus-operator
  namespace: openshift-monitoring
  resourceVersion: "163229"
  selfLink: /api/v1/namespaces/openshift-monitoring/services/prometheus-operator
  uid: 56e0efe0-81e2-48e5-b75a-3df951163071
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
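
The mismatch behind the error (the webhook calls port 8080, while the 4.5 service only lists 8443) can be checked mechanically. The following is a minimal sketch; service_has_port is a hypothetical helper, not an oc subcommand, and it assumes plain port/PROTOCOL entries as shown in the PORT(S) column above.

```shell
# Sketch only: service_has_port is a hypothetical helper for illustration.
# $1: value of the PORT(S) column, e.g. "8443/TCP,8080/TCP"
# $2: port number the webhook expects, e.g. 8080
service_has_port() {
  # Wrap the list in commas so each entry can be matched as ",<port>/".
  case ",$1," in
    *",$2/"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, service_has_port "8443/TCP" 8080 returns non-zero, which corresponds to the "no service port 8080 found" error reported by the API server.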

Version-Release number of selected component (if applicable):
Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

How reproducible:
always

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

Actual results:
monitoring downgrade is blocked

Expected results:
no issue

Additional info:

Comment 1 Sergiusz Urbaniak 2020-08-13 05:23:45 UTC

I am not sure downgrades are supported; I need to clarify.

Comment 2 Sergiusz Urbaniak 2020-08-13 05:33:47 UTC

Pinging the pillar lead to clarify whether this is considered a release-blocking issue and whether downgrades are supported.

Comment 3 Ben Parees 2020-08-13 13:08:51 UTC

Downgrades are supported. We allow you to downgrade in order to fix an issue before moving forward again. We don't support running long term on a cluster that's been downgraded (so some odd/broken behavior is acceptable), but you have to be able to perform the downgrade temporarily.

Comment 4 Sergiusz Urbaniak 2020-08-17 11:44:34 UTC

To be more precise, after talking to Ben Parees out of band: we need to support downgrades for a short period of time only, without breaking the stack. Keeping a downgraded cluster running for a long time is not supported or envisioned.

We have the following implementation strategies at hand:

a) We implement an explicit removal of the openshift-monitoring/prometheus-operator webhook [1] in CMO's code in the 4.5 release branch.
This option has the advantage of being cleaner, since it removes the 4.6 assets explicitly.
The downside is that it only fixes 4.5.z versions released after the patch lands.

b) Instead of adding another port `web` to the existing openshift-monitoring/prometheus-operator service [2], we could create another dedicated service. This way, when CMO is downgraded, the service and the webhook would stay.
This option has the advantage of being compatible with all 4.5.z releases.
The downside is that we leave 4.6 assets (the webhook itself and the webhook service) around in a 4.5 environment, which exposes untested functionality (webhook validation) in 4.5.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/061ba1cbed128a2b3858261b5b89f6aef268a08b/assets/prometheus-operator/prometheus-rule-validating-webhook.yaml
[2] https://github.com/openshift/cluster-monitoring-operator/blob/9d45decd69cbc40d88d869815bd3ad9fec77e5c9/assets/prometheus-operator/service.yaml#L18-L20

Comment 8 Junqi Zhao 2020-08-21 08:42:30 UTC

Downgraded from 4.6.0-0.nightly-2020-08-20-174655 to 4.5.0-0.nightly-2020-08-20-011847; monitoring was not blocked.
# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-20-011847 --allow-explicit-upgrade=true --force
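
One way to confirm that monitoring is not blocked after such a downgrade is to read the Degraded condition of the monitoring cluster operator. This is a sketch under assumptions: the helper name monitoring_degraded and the OC override variable are illustrative only.

```shell
# Sketch only: monitoring_degraded and the OC override are illustrative assumptions.
OC="${OC:-oc}"

# Prints the status ("True" or "False") of the Degraded condition on co/monitoring.
monitoring_degraded() {
  "$OC" get clusteroperator monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
}
```

A value of "True" together with reason UpdatingPrometheusK8SFailed would match the failure shown in the description.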

Comment 10 errata-xmlrpc 2020-09-08 10:54:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510