Description of problem:
Disconnected UPI on AWS & Private 4.7.46 cluster; we wanted to upgrade to 4.8.35, then to 4.9.26, and finally to 4.10.0-0.nightly-2022-03-25-033937. The cluster has 3 masters / 2 workers:

03-26 01:16:12.159  Running Command: oc get node
03-26 01:16:12.159  NAME                                        STATUS   ROLES    AGE    VERSION
03-26 01:16:12.159  ip-10-0-51-133.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-55-174.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-57-194.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-64-227.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-76-150.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930

However, while upgrading to 4.8.35, the monitoring ClusterOperator went Degraded:

status:
  conditions:
  - lastTransitionTime: "2022-03-25T19:16:50Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus Operator
      failed: reconciling Prometheus Operator Deployment failed: creating Deployment object
      failed after update failed: object is being deleted: deployments.apps "prometheus-operator"
      already exists'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded

Checked prometheus-operator-69f69949dd-9nnqd.yaml in the must-gather, see file:
must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/pods/prometheus-operator-69f69949dd-9nnqd/prometheus-operator-69f69949dd-9nnqd.yaml
***********************************
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    message: 'containers with unready status: [prometheus-operator kube-rbac-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
***********************************

Searching the events.yaml file (output below) shows that while the new pod prometheus-operator-69f69949dd-9nnqd, scheduled on node ip-10-0-76-150.us-east-2.compute.internal, was being rolled out, the operator tried to delete the old pod prometheus-operator-5ddbcb5887-lhn2q, which was also scheduled on ip-10-0-76-150.us-east-2.compute.internal, and that caused the error.
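For context, the "object is being deleted: ... already exists" message is what the API server returns when a Create is issued for a name whose previous object still carries a deletionTimestamp, i.e. its finalizers or the garbage collection of its dependents have not finished yet. On a live cluster this can be checked directly; the commands below are a minimal sketch using standard oc calls and are not taken from the must-gather above:

$ oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'
$ oc -n openshift-monitoring get deployment prometheus-operator -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'   # non-empty while the old object is still being finalized
$ oc -n openshift-monitoring get pods -o wide | grep prometheus-operator   # shows the old and new replicas and the nodes they landed on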
$ cat must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/core/events.yaml | grep "Successfully assigned openshift-monitoring/prometheus-operator" -A6
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-l95j8 to ip-10-0-51-133.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:41:33Z"
    name: prometheus-operator-5ddbcb5887-l95j8.16dfb3cfd2cd8860
    namespace: openshift-monitoring
    resourceVersion: "131915"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-lhn2q to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:43:46Z"
    name: prometheus-operator-5ddbcb5887-lhn2q.16dfb3eeb44859d5
    namespace: openshift-monitoring
    resourceVersion: "137286"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-tglqq to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T17:18:00Z"
    name: prometheus-operator-5ddbcb5887-tglqq.16dfaf40abe8fe1a
    namespace: openshift-monitoring
    resourceVersion: "68473"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-69f69949dd-9nnqd to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T19:16:42Z"
    name: prometheus-operator-69f69949dd-9nnqd.16dfb5bade0b366c
    namespace: openshift-monitoring
    resourceVersion: "174128"

Version-Release number of selected component (if applicable):
Disconnected UPI on AWS & Private 4.7.46 cluster, upgraded to 4.8.35

How reproducible:
Not sure; it may not happen every time.

Steps to Reproduce:
1. Install a disconnected UPI on AWS & Private 4.7.46 cluster and upgrade to 4.8.35
2.
3.

Actual results:
Reconciling the Prometheus Operator Deployment fails while upgrading from 4.7.46 to 4.8.35.

Expected results:
No issue.

Additional info:
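In general, a client that deletes an object and immediately recreates one with the same name has to wait for the old object to be fully finalized before the Create can succeed. A rough illustration of that ordering as a manual oc sketch, not the cluster-monitoring-operator's actual reconcile logic (the manifest filename is hypothetical):

$ oc -n openshift-monitoring delete deployment prometheus-operator
$ oc -n openshift-monitoring wait --for=delete deployment/prometheus-operator --timeout=120s   # block until the old object is completely gone
$ oc -n openshift-monitoring create -f prometheus-operator-deployment.yaml   # hypothetical manifest; the name only becomes free once deletion has finished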
Disconnected UPI on AWS & Private 4.10.15 cluster upgrades to 4.11.0-0.nightly-2022-05-18-010528 without error; prometheus-operator works well.

# oc get node
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-57-4.us-east-2.compute.internal     Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-61-83.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8
ip-10-0-63-169.us-east-2.compute.internal   Ready    master   4h58m   v1.23.3+69213f8
ip-10-0-65-35.us-east-2.compute.internal    Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-66-26.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8

# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.0-0.nightly-2022-05-18-010528   True        False         False      4h42m

# oc -n openshift-monitoring get pod | grep prometheus-operator
prometheus-operator-9b6df5f-tj7sh   2/2   Running   0   44m
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069