Bug 1982369

Summary: CMO fails to delete/recreate the deployment resource after '422 Unprocessable Entity' update response
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Monitoring
Assignee: Jayapriya Pai <janantha>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: high
Version: 4.8
Target Release: 4.8.z
Keywords: Upgrades
Hardware: Unspecified
OS: Unspecified
CC: alegrand, amuller, anpicker, aos-bugs, dgrisonn, eparis, erooth, lcosic, spasquie, wking
Last Closed: 2021-09-17 06:21:57 UTC
Bug Depends On: 1949840, 1956308, 2005205, 2005206
Bug Blocks: 1996132
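
For context on the failure mode named in the Summary: when an Update to a Deployment is rejected with 422 Unprocessable Entity (the API server's response when an update touches an immutable field, surfaced by client-go as a StatusReasonInvalid error), an operator can fall back to deleting and recreating the object. A minimal client-go sketch of that fallback pattern (illustrative names, not CMO's actual reconcile code) shows where the race described in this bug comes from:

package reconcile

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// applyDeployment is a hypothetical helper, not CMO's real code.
func applyDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	_, err := c.AppsV1().Deployments(dep.Namespace).Update(ctx, dep, metav1.UpdateOptions{})
	if err == nil || !apierrors.IsInvalid(err) {
		return err // success, or an error other than 422 Invalid
	}
	// 422 Unprocessable Entity: the update hit an immutable field,
	// so fall back to delete + recreate.
	if err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	// RACE: Delete only marks the object with a deletionTimestamp.
	// Until garbage collection finishes, this Create is rejected with
	// "object is being deleted: deployments.apps ... already exists".
	_, err = c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}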

Comment 5 Junqi Zhao 2021-08-03 03:44:24 UTC
Searched with
https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=48h&context=1&type=bug%2Bjunit&name=periodic-ci-openshift-release-master-nightly-4.8-upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

and can still see the error:
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists

Example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/20753/rehearse-20753-periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1422170933774258176

Upgraded from 4.7.21 to 4.8.0-0.nightly-2021-07-31-065602. Error:
*************************************************************
Aug 02 13:37:41.192 E clusteroperator/monitoring condition/Available status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Aug 02 13:37:41.192 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Aug 02 13:37:43.169 E ns/openshift-service-ca-operator pod/service-ca-operator-699fdbb947-4cv54 node/ip-10-0-222-211.ec2.internal container/service-ca-operator reason/ContainerExit code/1 cause/Error
*************************************************************
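
The "object is being deleted ... already exists" message above is that race playing out: deletion in Kubernetes is asynchronous, so a Create for the same name issued before finalizers and garbage collection complete fails with AlreadyExists. One way to close the window, sketched here with client-go primitives and hypothetical names (the actual fix landed via bug 2005205, of which this report was later closed as a duplicate), is to poll until the API server reports NotFound before recreating:

package reconcile

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndRecreate waits out the asynchronous deletion before creating.
func deleteAndRecreate(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// Poll until the object is fully gone (finalizers cleared, GC done).
	if err := wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		_, getErr := c.AppsV1().Deployments(dep.Namespace).Get(ctx, dep.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return true, nil // deletion complete
		}
		return false, getErr // keep polling while it exists; abort on real errors
	}); err != nil {
		return err
	}
	_, err = c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}

A watch or foreground deletion propagation would close the same window with less polling; the point of the sketch is only that the Create must not race the asynchronous delete.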

Comment 8 Simon Pasquier 2021-08-19 16:02:29 UTC
I've searched for "creating Deployment object failed after update failed" in all jobs whose names contain "4.8" but not "4.7" (e.g. excluding 4.7 -> 4.8 upgrade jobs) [1] and found nothing except release-openshift-origin-installer-old-rhcos-e2e-aws-4.8. But that job is special: despite what its name claims, it spins up a 4.7 cluster [2].

[1] https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=336h&context=1&type=junit&name=.*4.8.*&excludeName=.*4.7.*&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1977095#c2

Comment 9 Scott Dodson 2021-08-20 16:00:45 UTC
I think the explanation in comment 7 makes sense, so I'm setting this back to ON_QA. Is this reasonable to backport to 4.7?

Comment 10 Scott Dodson 2021-08-20 16:02:47 UTC
Actually, based on the CI confirmation outlined in comment 7, let's go all the way to VERIFIED.

Comment 11 Scott Dodson 2021-08-20 17:12:12 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/1333#issuecomment-902802506 explains why this probably shouldn't be VERIFIED. I'll move it back to ASSIGNED now and stop meddling in your bugs.

Comment 16 Jayapriya Pai 2021-09-17 06:21:57 UTC

*** This bug has been marked as a duplicate of bug 2005205 ***