Bug 2069068
| Summary: | reconciling Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Joao Marcal <jmarcal> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | medium | | |
| Version: | 4.8 | CC: | amuller, anpicker, aos-bugs, bburt, hongyli, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Before this update, the Cluster Monitoring Operator (CMO) did not wait for a deleted deployment to be fully removed before recreating it, which caused reconciliation errors. With this update, CMO waits for the deletion of deployments to complete before recreating them, which resolves the issue. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:02:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
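The Doc Text above summarizes the fix: delete the deployment, wait for the deletion to actually complete, then recreate it. A minimal client-go sketch of that pattern, for illustration only (this is not the actual cluster-monitoring-operator code; the package name, function name, and timeout are assumptions):

```go
package reconcile

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// recreateDeployment (hypothetical name) deletes an existing Deployment and
// waits until the API server reports NotFound before creating the
// replacement. dep is assumed to be a freshly built object (no stale
// resourceVersion).
func recreateDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Deployment %s/%s: %w", dep.Namespace, dep.Name, err)
	}
	// Poll until the old object (including any finalizer cleanup) is gone.
	err = wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			_, getErr := c.AppsV1().Deployments(dep.Namespace).Get(ctx, dep.Name, metav1.GetOptions{})
			if apierrors.IsNotFound(getErr) {
				return true, nil // deletion completed
			}
			return false, getErr // getErr is nil while the object exists -> keep polling
		})
	if err != nil {
		return fmt.Errorf("waiting for Deployment %s/%s to be deleted: %w", dep.Namespace, dep.Name, err)
	}
	_, err = c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}
```

Polling until Get returns NotFound ensures the asynchronous deletion has finished; creating earlier can fail with exactly the `object is being deleted: ... already exists` error reported in this bug.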
Disconnected UPI on AWS & Private 4.10.15 cluster upgrades to 4.11.0-0.nightly-2022-05-18-010528 without error; prometheus-operator works well.

```
# oc get node
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-57-4.us-east-2.compute.internal     Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-61-83.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8
ip-10-0-63-169.us-east-2.compute.internal   Ready    master   4h58m   v1.23.3+69213f8
ip-10-0-65-35.us-east-2.compute.internal    Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-66-26.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8

# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.0-0.nightly-2022-05-18-010528   True        False         False      4h42m

# oc -n openshift-monitoring get pod | grep prometheus-operator
prometheus-operator-9b6df5f-tj7sh   2/2   Running   0   44m
...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
Description of problem:

Disconnected UPI on AWS & Private 4.7.46 cluster; we wanted to upgrade to 4.8.35, then 4.9.26, and finally to 4.10.0-0.nightly-2022-03-25-033937. The cluster has 3 masters and 2 workers:

```
03-26 01:16:12.159  Running Command: oc get node
03-26 01:16:12.159  NAME                                        STATUS   ROLES    AGE    VERSION
03-26 01:16:12.159  ip-10-0-51-133.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-55-174.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-57-194.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-64-227.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-76-150.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
```

While upgrading to 4.8.35, however, monitoring became degraded:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-03-25T19:16:50Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus
      Operator failed: reconciling Prometheus Operator Deployment failed: creating
      Deployment object failed after update failed: object is being deleted:
      deployments.apps "prometheus-operator" already exists'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
```

Checked the must-gather file prometheus-operator-69f69949dd-9nnqd.yaml, i.e. must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/pods/prometheus-operator-69f69949dd-9nnqd/prometheus-operator-69f69949dd-9nnqd.yaml:

```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    message: 'containers with unready status: [prometheus-operator kube-rbac-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
```

Searching the events.yaml file, it seems that while updating pod prometheus-operator-69f69949dd-9nnqd, which was scheduled on node ip-10-0-76-150.us-east-2.compute.internal, it tried to delete pod prometheus-operator-5ddbcb5887-lhn2q, which was also scheduled on ip-10-0-76-150.us-east-2.compute.internal, and that caused the error (a minimal illustration of this race is sketched under Additional info below).
```
$ cat must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/core/events.yaml | grep "Successfully assigned openshift-monitoring/prometheus-operator" -A6
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-l95j8 to ip-10-0-51-133.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:41:33Z"
    name: prometheus-operator-5ddbcb5887-l95j8.16dfb3cfd2cd8860
    namespace: openshift-monitoring
    resourceVersion: "131915"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-lhn2q to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:43:46Z"
    name: prometheus-operator-5ddbcb5887-lhn2q.16dfb3eeb44859d5
    namespace: openshift-monitoring
    resourceVersion: "137286"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-tglqq to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T17:18:00Z"
    name: prometheus-operator-5ddbcb5887-tglqq.16dfaf40abe8fe1a
    namespace: openshift-monitoring
    resourceVersion: "68473"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-69f69949dd-9nnqd to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T19:16:42Z"
    name: prometheus-operator-69f69949dd-9nnqd.16dfb5bade0b366c
    namespace: openshift-monitoring
    resourceVersion: "174128"
```

Version-Release number of selected component (if applicable):
Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

How reproducible:
Not sure; it may not happen every time.

Steps to Reproduce:
1. Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

Actual results:
Reconciling the Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35.

Expected results:
No issue.

Additional info:
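The degraded message above is the signature of a delete/create race: deleting a Deployment is asynchronous, so the old object can still exist (with metadata.deletionTimestamp set) when the operator immediately tries to recreate it. A minimal client-go sketch of the failure mode, for illustration only (not the actual cluster-monitoring-operator code; the package and function names are hypothetical):

```go
package reconcile

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// racyRecreate (hypothetical name) shows the failure mode: Delete returns
// once deletion is requested, not once it has completed, so an immediate
// Create can collide with the old, still-terminating object.
func racyRecreate(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	if err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// No wait here: the old Deployment may still exist while its pods and
	// finalizers are being cleaned up.
	_, err := c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// The API server rejects the Create while the old object is
		// terminating, surfacing as:
		//   object is being deleted: deployments.apps "prometheus-operator" already exists
		return fmt.Errorf("create raced with pending deletion: %w", err)
	}
	return err
}
```

The fix verified in this bug replaces this pattern with the delete-wait-create sequence sketched earlier.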