Bug 2069068
| Summary: | reconciling Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Joao Marcal <jmarcal> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | medium | | |
| Version: | 4.8 | CC: | amuller, anpicker, aos-bugs, bburt, hongyli, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Before this update, the Cluster Monitoring Operator (CMO) did not wait for a deleted deployment to be fully removed before recreating it, which caused reconciliation errors. With this update, CMO waits for the deletion of deployments to complete before recreating them, which resolves the issue. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:02:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
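The Doc Text above summarizes the fix: delete the deployment, wait for the deletion to actually complete, then recreate it. A minimal client-go sketch of that pattern, for illustration only (this is not the actual cluster-monitoring-operator code; the package name, function name, and timeout are assumptions):

```go
package reconcile

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// recreateDeployment (hypothetical name) deletes an existing Deployment and
// waits until the API server reports NotFound before creating the
// replacement. dep is assumed to be a freshly built object (no stale
// resourceVersion).
func recreateDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Deployment %s/%s: %w", dep.Namespace, dep.Name, err)
	}
	// Poll until the old object (including any finalizer cleanup) is gone.
	err = wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			_, getErr := c.AppsV1().Deployments(dep.Namespace).Get(ctx, dep.Name, metav1.GetOptions{})
			if apierrors.IsNotFound(getErr) {
				return true, nil // deletion completed
			}
			return false, getErr // getErr is nil while the object exists -> keep polling
		})
	if err != nil {
		return fmt.Errorf("waiting for Deployment %s/%s to be deleted: %w", dep.Namespace, dep.Name, err)
	}
	_, err = c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}
```

Polling until Get returns NotFound ensures the asynchronous deletion has finished; creating earlier can fail with exactly the `object is being deleted: ... already exists` error reported in this bug.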
Disconnected UPI on AWS & Private 4.10.15 cluster upgrades to 4.11.0-0.nightly-2022-05-18-010528 without error; prometheus-operator works well.

```
# oc get node
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-57-4.us-east-2.compute.internal     Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-61-83.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8
ip-10-0-63-169.us-east-2.compute.internal   Ready    master   4h58m   v1.23.3+69213f8
ip-10-0-65-35.us-east-2.compute.internal    Ready    worker   4h44m   v1.23.3+69213f8
ip-10-0-66-26.us-east-2.compute.internal    Ready    master   4h57m   v1.23.3+69213f8

# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.0-0.nightly-2022-05-18-010528   True        False         False      4h42m

# oc -n openshift-monitoring get pod | grep prometheus-operator
prometheus-operator-9b6df5f-tj7sh   2/2   Running   0   44m
...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
Description of problem:

Disconnected UPI on AWS & Private 4.7.46 cluster; we wanted to upgrade to 4.8.35, then 4.9.26, and finally to 4.10.0-0.nightly-2022-03-25-033937. The cluster has 3 masters and 2 workers:

```
03-26 01:16:12.159  Running Command: oc get node
03-26 01:16:12.159  NAME                                        STATUS   ROLES    AGE    VERSION
03-26 01:16:12.159  ip-10-0-51-133.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-55-174.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-57-194.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-64-227.us-east-2.compute.internal   Ready    worker   135m   v1.20.14+0d60930
03-26 01:16:12.159  ip-10-0-76-150.us-east-2.compute.internal   Ready    master   146m   v1.20.14+0d60930
```

While upgrading to 4.8.35, however, monitoring became degraded:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-03-25T19:16:50Z"
    message: 'Failed to rollout the stack. Error: running task Updating Prometheus
      Operator failed: reconciling Prometheus Operator Deployment failed: creating
      Deployment object failed after update failed: object is being deleted:
      deployments.apps "prometheus-operator" already exists'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
```

Checked the must-gather file prometheus-operator-69f69949dd-9nnqd.yaml, i.e. must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/pods/prometheus-operator-69f69949dd-9nnqd/prometheus-operator-69f69949dd-9nnqd.yaml:

```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-25T19:16:42Z"
    message: 'containers with unready status: [prometheus-operator kube-rbac-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
```

Searching the events.yaml file, it seems that while updating pod prometheus-operator-69f69949dd-9nnqd, which was scheduled on node ip-10-0-76-150.us-east-2.compute.internal, it tried to delete pod prometheus-operator-5ddbcb5887-lhn2q, which was also scheduled on ip-10-0-76-150.us-east-2.compute.internal, and that caused the error (a minimal illustration of this race is sketched under Additional info below).
```
$ cat must-gather-88536-313203383/must-gather.local.800587091794790583/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-104b33969aa6c60608e06b1325b987c68522e875f6ac8f3541cbe485f705a902/namespaces/openshift-monitoring/core/events.yaml | grep "Successfully assigned openshift-monitoring/prometheus-operator" -A6
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-l95j8 to ip-10-0-51-133.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:41:33Z"
    name: prometheus-operator-5ddbcb5887-l95j8.16dfb3cfd2cd8860
    namespace: openshift-monitoring
    resourceVersion: "131915"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-lhn2q to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T18:43:46Z"
    name: prometheus-operator-5ddbcb5887-lhn2q.16dfb3eeb44859d5
    namespace: openshift-monitoring
    resourceVersion: "137286"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-5ddbcb5887-tglqq to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T17:18:00Z"
    name: prometheus-operator-5ddbcb5887-tglqq.16dfaf40abe8fe1a
    namespace: openshift-monitoring
    resourceVersion: "68473"
--
  message: Successfully assigned openshift-monitoring/prometheus-operator-69f69949dd-9nnqd to ip-10-0-76-150.us-east-2.compute.internal
  metadata:
    creationTimestamp: "2022-03-25T19:16:42Z"
    name: prometheus-operator-69f69949dd-9nnqd.16dfb5bade0b366c
    namespace: openshift-monitoring
    resourceVersion: "174128"
```

Version-Release number of selected component (if applicable):
Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

How reproducible:
Not sure; it may not happen every time.

Steps to Reproduce:
1. Disconnected UPI on AWS & Private 4.7.46 cluster, upgrade to 4.8.35

Actual results:
Reconciling the Prometheus Operator Deployment failed while upgrading from 4.7.46 to 4.8.35.

Expected results:
No issue.

Additional info:
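The degraded message above is the signature of a delete/create race: deleting a Deployment is asynchronous, so the old object can still exist (with metadata.deletionTimestamp set) when the operator immediately tries to recreate it. A minimal client-go sketch of the failure mode, for illustration only (not the actual cluster-monitoring-operator code; the package and function names are hypothetical):

```go
package reconcile

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// racyRecreate (hypothetical name) shows the failure mode: Delete returns
// once deletion is requested, not once it has completed, so an immediate
// Create can collide with the old, still-terminating object.
func racyRecreate(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	if err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// No wait here: the old Deployment may still exist while its pods and
	// finalizers are being cleaned up.
	_, err := c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// The API server rejects the Create while the old object is
		// terminating, surfacing as:
		//   object is being deleted: deployments.apps "prometheus-operator" already exists
		return fmt.Errorf("create raced with pending deletion: %w", err)
	}
	return err
}
```

The fix verified in this bug replaces this pattern with the delete-wait-create sequence sketched earlier.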