Bug 2069068 - reconciling Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35
Summary: reconciling Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-28 08:10 UTC by Junqi Zhao
Modified: 2022-08-10 11:02 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, the Cluster Monitoring Operator (CMO) did not wait for a deleted deployment to be fully removed before proceeding, which caused reconciliation errors. With this update, CMO waits for the deletion of a deployment to complete before trying to recreate it, which resolves the issue.
Clone Of:
Environment:
Last Closed: 2022-08-10 11:02:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1655 0 None open Bug 2069068: Refactors DeleteDeployment to wait for deletion of the deployment 2022-05-03 11:12:14 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:02:47 UTC

Description Junqi Zhao 2022-03-28 08:10:24 UTC
Description of problem:
Disconnected UPI on AWS & Private 4.7.46 cluster; we wanted to upgrade to 4.8.35, then to 4.9.26, and finally to 4.10.0-0.nightly-2022-03-25-033937.
The cluster has 3 masters and 2 workers.
03-26 01:16:12.159  Running Command: oc get node
03-26 01:16:12.159  NAME                                        STATUS   ROLES    AGE    VERSION
03-26 01:16:12.159  ip-10-0-51-133.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-55-174.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-57-194.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-64-227.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-76-150.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930

While upgrading to 4.8.35, the monitoring operator became degraded with:
status:
  conditions:
  - lastTransitionTime: "2022-03-25T19:16:50Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus
      Operator failed: reconciling Prometheus Operator Deployment failed: creating
      Deployment object failed after update failed: object is being deleted: deployments.apps
      "prometheus-operator" already exists'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
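
For context, deleting a Deployment is asynchronous: the API server returns as soon as the object is marked for deletion, and the object lingers with a deletionTimestamp until cleanup finishes, so re-creating it immediately can fail exactly as above. A minimal client-go sketch of the racy pattern (illustrative function name, not CMO's actual code):

package example

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateDeployment shows the racy pattern: Delete returns as soon as the
// object is marked for deletion, so an immediate Create can collide with the
// still-terminating Deployment and fail with AlreadyExists
// ("object is being deleted: ... already exists").
func recreateDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// The old object may still exist at this point, so this Create can fail.
	_, err = c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}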

Checked the pod status in the must-gather file prometheus-operator-69f69949dd-9nnqd.yaml,
see file: must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/pods/prometheus-operator-69f69949dd-9nnqd/prometheus-operator-69f69949dd-9nnqd.yaml
***********************************
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    message: 'containers with unready status: [prometheus-operator kube-rbac-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
***********************************

Searching the events.yaml file, it seems that while the updated pod prometheus-operator-69f69949dd-9nnqd was being created on node ip-10-0-76-150.us-east-2.compute.internal, the old pod prometheus-operator-5ddbcb5887-lhn2q, which was also scheduled on that node, was still being deleted; that caused the error.

$ cat must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/core/events.yaml | grep "Successfully assigned openshift-monitoring/prometheus-operator" -A6
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-l95j8
    to ip-10-0-51-133.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:41:33Z"
    name: prometheus-operator-5ddbcb5887-l95j8.16dfb3cfd2cd8860
    namespace: openshift-monitoring
    resourceVersion: "131915"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-lhn2q
    to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:43:46Z"
    name: prometheus-operator-5ddbcb5887-lhn2q.16dfb3eeb44859d5
    namespace: openshift-monitoring
    resourceVersion: "137286"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-tglqq
    to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T17:18:00Z"
    name: prometheus-operator-5ddbcb5887-tglqq.16dfaf40abe8fe1a
    namespace: openshift-monitoring
    resourceVersion: "68473"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-69f69949dd-9nnqd
    to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T19:16:42Z"
    name: prometheus-operator-69f69949dd-9nnqd.16dfb5bade0b366c
    namespace: openshift-monitoring
    resourceVersion: "174128"


Version-Release number of selected component (if applicable):
Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

How reproducible:
Not sure; it may not happen every time.

Steps to Reproduce:
1. Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

Actual results:
reconciling Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35

Expected results:
No issue; the upgrade should complete without the monitoring operator becoming degraded.

Additional info:

Comment 5 Junqi Zhao 2022-05-18 13:47:26 UTC
Disconnected UPI on AWS & Private 4.10.15 cluster upgraded to 4.11.0-0.nightly-2022-05-18-010528 without error; prometheus-operator works well.
# oc get node
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-57-4.us-east-2.compute.internal     Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-61-83.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8
ip-10-0-63-169.us-east-2.compute.internal   Ready    master   4h58m   v1.23.3+69213f8
ip-10-0-65-35.us-east-2.compute.internal    Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-66-26.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8

# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.0-0.nightly-2022-05-18-010528   True        False         False      4h42m   

# oc -n openshift-monitoring get pod | grep prometheus-operator
prometheus-operator-9b6df5f-tj7sh                       2/2     Running   0          44m
...

Comment 9 errata-xmlrpc 2022-08-10 11:02:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

