Bug 1982369 - CMO fails to delete/recreate the deployment resource after '422 Unprocessable Entity' update response
Summary: CMO fails to delete/recreate the deployment resource after '422 Unprocessable Entity' update response
Keywords:
Status: CLOSED DUPLICATE of bug 2005205
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Jayapriya Pai
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1949840 1956308 2005205 2005206
Blocks: 1996132
 
Reported: 2021-07-14 18:03 UTC by OpenShift BugZilla Robot
Modified: 2021-09-17 06:29 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-17 06:21:57 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/cluster-monitoring-operator pull 1285 (open): [release-4.8] Bug 1982369: Fix deployment update with retry option (last updated 2021-07-14 18:03:49 UTC)
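[Editorial note] The linked PR title refers to retrying the deployment update. As an illustration only (not the actual cluster-monitoring-operator code), the sketch below shows one way such a flow can look with client-go: try Update, and if the API server answers 422 Unprocessable Entity, delete the object and retry the Create while the old Deployment is still terminating, which is exactly the window where "object is being deleted: ... already exists" is returned. The package and function names here are hypothetical.

package example

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// updateOrRecreateDeployment is a hypothetical helper: try Update first and,
// if the API server rejects it with 422 (Invalid), fall back to delete plus
// create, retrying the Create while the old object is still being deleted.
func updateOrRecreateDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	_, err := c.AppsV1().Deployments(dep.Namespace).Update(ctx, dep, metav1.UpdateOptions{})
	if err == nil {
		return nil
	}
	// IsInvalid corresponds to a 422 Unprocessable Entity response.
	if !apierrors.IsInvalid(err) {
		return err
	}
	if err := c.AppsV1().Deployments(dep.Namespace).Delete(ctx, dep.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// Without a retry, an immediate Create can fail with
	// "object is being deleted: deployments.apps ... already exists"
	// because the old Deployment has not finished terminating yet.
	return wait.PollImmediate(time.Second, time.Minute, func() (bool, error) {
		_, err := c.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{})
		if apierrors.IsAlreadyExists(err) {
			return false, nil // old object still terminating; keep retrying
		}
		return err == nil, err
	})
}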

Comment 5 Junqi Zhao 2021-08-03 03:44:24 UTC
searched with
https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=48h&context=1&type=bug%2Bjunit&name=periodic-ci-openshift-release-master-nightly-4.8-upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

can still see the error:
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists

Example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/20753/rehearse-20753-periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1422170933774258176

Upgraded from 4.7.21 to 4.8.0-0.nightly-2021-07-31-065602. Error:
*************************************************************
Aug 02 13:37:41.192 E clusteroperator/monitoring condition/Available status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Aug 02 13:37:41.192 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Aug 02 13:37:41.192 - 71s   E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Aug 02 13:37:43.169 E ns/openshift-service-ca-operator pod/service-ca-operator-699fdbb947-4cv54 node/ip-10-0-222-211.ec2.internal container/service-ca-operator reason/ContainerExit code/1 cause/Error
*************************************************************

Comment 8 Simon Pasquier 2021-08-19 16:02:29 UTC
I've searched for "creating Deployment object failed after update failed" in all jobs whose names contain "4.8" but not "4.7" (e.g. excluding 4.7 > 4.8 upgrade jobs) [1] and I've found nothing except for release-openshift-origin-installer-old-rhcos-e2e-aws-4.8. But this one is special because despite what the job name claims, it spins up a 4.7 cluster [2].

[1] https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=336h&context=1&type=junit&name=.*4.8.*&excludeName=.*4.7.*&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1977095#c2

Comment 9 Scott Dodson 2021-08-20 16:00:45 UTC
I think the explanation in comment 7 makes sense, so I'm setting this back to ON_QA. Is this reasonable to backport to 4.7?

Comment 10 Scott Dodson 2021-08-20 16:02:47 UTC
Actually, based on the CI confirmation outlined in comment 7, let's go all the way to VERIFIED.

Comment 11 Scott Dodson 2021-08-20 17:12:12 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/1333#issuecomment-902802506 explains why this probably shouldn't be VERIFIED. I'll move it back to ASSIGNED now and stop meddling in your bugs.

Comment 16 Jayapriya Pai 2021-09-17 06:21:57 UTC

*** This bug has been marked as a duplicate of bug 2005205 ***

