Description of problem:
We have added the ValidatingWebhookConfiguration prometheusrules.openshift.io since 4.6, but it blocks a monitoring downgrade from 4.6 to 4.5. The workaround is:

# oc delete ValidatingWebhookConfiguration prometheusrules.openshift.io

after which the downgrade continues.

Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221:

# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-08-12T06:40:10Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus-k8s
      failed: reconciling Prometheus rules PrometheusRule failed: updating PrometheusRule
      object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io":
      Post https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s:
      no service port 8080 found for service "prometheus-operator"'
    reason: UpdatingPrometheusK8SFailed
    status: "True"
    type: Degraded
...

Note: in a healthy 4.6 cluster:

# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP,8080/TCP   6h23m

When it downgrades to 4.5, the service only has port 8443, which is expected for 4.5:

# oc -n openshift-monitoring get svc prometheus-operator
NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
prometheus-operator   ClusterIP   None         <none>        8443/TCP   5h5m

# oc -n openshift-monitoring get svc prometheus-operator -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: prometheus-operator-tls
  creationTimestamp: "2020-08-12T03:49:15Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:service.beta.openshift.io/serving-cert-secret-name: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
      f:spec:
        f:clusterIP: {}
        f:ports:
          .: {}
          k:{"port":8443,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/name: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: operator
    operation: Update
    time: "2020-08-12T06:36:52Z"
  name: prometheus-operator
  namespace: openshift-monitoring
  resourceVersion: "163229"
  selfLink: /api/v1/namespaces/openshift-monitoring/services/prometheus-operator
  uid: 56e0efe0-81e2-48e5-b75a-3df951163071
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Version-Release number of selected component (if applicable):
Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

How reproducible:
Always

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221
2.
3.

Actual results:
The monitoring downgrade is blocked.

Expected results:
No issue.

Additional info:
I am not sure downgrades are supported; I need to clarify.
Pinging the pillar lead to clarify whether this is considered a release-blocking issue and whether downgrades are supported.
Downgrades are supported. We allow you to downgrade in order to fix an issue before moving forward again. We don't support running long term on a cluster that's been downgraded (so some odd/broken behavior is acceptable), but you have to be able to perform the downgrade temporarily.
To be more precise, after talking to Ben Parees OOB: we need to support downgrades for a short period of time only, without breaking the stack. It is not supported/envisioned to keep a downgraded cluster running for a long time.

We have the following implementation strategies at hand:

a) We implement an explicit removal of the openshift-monitoring/prometheus-operator webhook [1] in CMO's code in the 4.5 release branch. This option has the advantage of being cleaner, removing the 4.6 assets cleanly. The downside is that this fixes it only for >=4.5.z versions once the patch lands.

b) Instead of adding another port `web` to the existing openshift-monitoring/prometheus-operator service [2], we could create another dedicated service. This way, when CMO is being downgraded, the service and the webhook would stay. This option has the advantage of being compatible with all 4.5.z releases. The downside is that we're leaving 4.6 assets (the webhook itself and the webhook service) around in a 4.5 environment, which exposes untested functionality (webhook validation) in 4.5.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/061ba1cbed128a2b3858261b5b89f6aef268a08b/assets/prometheus-operator/prometheus-rule-validating-webhook.yaml
[2] https://github.com/openshift/cluster-monitoring-operator/blob/9d45decd69cbc40d88d869815bd3ad9fec77e5c9/assets/prometheus-operator/service.yaml#L18-L20
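A rough sketch of what option b) could look like, assuming a separate headless service that fronts only the admission endpoint; the service name and port name here are illustrative assumptions, not the actual proposed asset (labels and selector are copied from the existing prometheus-operator service shown above):

```yaml
# Hypothetical dedicated webhook service (option b); name and port name
# are assumptions for illustration. The existing prometheus-operator
# service would then keep its 4.5 shape (8443 only), so a downgraded CMO
# never touches this object.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-operator-admission-webhook
  namespace: openshift-monitoring
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
spec:
  clusterIP: None
  ports:
  - name: web
    port: 8080
    protocol: TCP
    targetPort: web
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
  type: ClusterIP
```

The ValidatingWebhookConfiguration's clientConfig.service would then point at this dedicated service instead of prometheus-operator, which is what keeps the webhook functional after a downgrade reverts the main service's port list.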
Downgrade from 4.6.0-0.nightly-2020-08-20-174655 to 4.5.0-0.nightly-2020-08-20-011847: no block for monitoring.

# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-20-011847 --allow-explicit-upgrade=true --force
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3510