Bug 1956308

Summary: CMO fails to delete/recreate the deployment resource after '422 Unprocessable Entity' update response
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Target Release: 4.10.0
Status: CLOSED DUPLICATE
Severity: medium
Priority: high
Keywords: Upgrades
Reporter: Simon Pasquier <spasquie>
Assignee: Jayapriya Pai <janantha>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, dgrisonn, eparis, erooth, jfajersk, kakkoyun, lcosic, rbost, wking
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-09-13 14:04:58 UTC
Bug Blocks: 1982369

Description Simon Pasquier 2021-05-03 12:53:42 UTC
Description of problem:
When CMO tries to update a deployment, it may go Degraded/Unavailable for some time.

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always but happens quite often in the CI.
https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1. Upgrade from 4.7 to 4.8.

Actual results:
The operator goes Degraded and Unavailable for a short period of time, the reason being '... Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "xxx" already exists'.

Expected results:
No error

Additional info:
https://github.com/openshift/cluster-monitoring-operator/blob/7f4925a7203622d70b3007fbddfb6bc5cce6c1d9/pkg/client/client.go#L716-L731

The issue is probably that the delete operation is asynchronous: CMO should wait for the deployment to be effectively removed before trying to recreate it.

Comment 1 W. Trevor King 2021-06-03 05:00:38 UTC
I'm going to raise the severity to high based on [1], where the operator seems to have hit this and then got stuck forever:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep clusteroperator/monitoring
Jun 02 05:00:02.363 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 05:00:02.363 - 40s   W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Available status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Jun 02 05:00:43.066 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.142 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 05:00:43.142 - 37s   W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
...
Jun 02 06:48:19.016 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 06:48:19.043 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 06:48:19.043 - 303s  W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 06:53:22.677 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 07:00:01.836 - 85s   W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 07:00:01.836 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
[bz-Monitoring] clusteroperator/monitoring should not change condition/Degraded

Searching for the stuck-at-the-end-of-test symptom, it seems like it's not absolutely melting us down, but it's not particularly rare either:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=Failed+to+upgrade+monitoring,+operator+was+not+available.*UpdatingPrometheusOperatorFailed' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 36 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 34 runs, 94% failed, 3% of failures match = 3% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
pull-ci-openshift-prometheus-operator-master-e2e-aws-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616

Comment 2 Simon Pasquier 2021-06-03 09:06:36 UTC
Looking at [1], the initial error reported by CMO is indeed the same: "creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists". But the subsequent reconciliations fail for different reasons [2].

The next failure is because the prometheus operator isn't ready yet and the admission webhook fails (see bug 1949840):

E0602 05:01:20.514522       1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"

Then CMO fails repeatedly because the prometheus-k8s statefulset never converges to the desired state: 

E0602 05:06:24.595880       1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas


Now looking at the pods [3], there's no node on which the prometheus-k8s-1 pod can be scheduled:

            "status": {
                "conditions": [
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2021-06-02T05:02:17Z",
                        "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
                        "reason": "Unschedulable",
                        "status": "False",
                        "type": "PodScheduled"
                    }
                ],


IIUC the mis-scheduling happens because some worker nodes have moved away from the 'us-east-1b' zone while the prometheus PVs are bound to this zone.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json | jq '.items| map(select( .metadata.labels["node-role.kubernetes.io/worker"] == "" )) | map( .metadata.name + ": " + .metadata.labels["topology.ebs.csi.aws.com/zone"] )' 
[
  "ip-10-0-144-164.ec2.internal: us-east-1a",
  "ip-10-0-176-27.ec2.internal: us-east-1a",
  "ip-10-0-207-235.ec2.internal: us-east-1b"
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/persistentvolumes.json  | jq '.items | map( .metadata.name + ": " + .spec.claimRef.name + ": " + .metadata.labels["failure-domain.beta.kubernetes.io/zone"])'
[
  "pvc-4b072738-5319-4337-8a3c-d819f28c4bf5: prometheus-data-prometheus-k8s-1: us-east-1b",
  "pvc-64a7fce5-45c3-4551-a955-5c7bb3cd5a89: prometheus-data-prometheus-k8s-0: us-east-1b"
]

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-7556c4b9c6-7vlhj_cluster-monitoring-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json

Comment 3 Simon Pasquier 2021-06-03 13:21:06 UTC
I've created bug 1967614 to investigate the volume node affinity conflict.

Comment 4 Simon Pasquier 2021-06-03 13:23:02 UTC
Moving back to low severity since the issue with the '422 Unprocessable Entity' update response resolves by itself.

Comment 8 Junqi Zhao 2021-07-14 05:17:04 UTC
checked from
https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=168h&context=1&type=bug%2Bjunit&name=master-ci-4.9-upgrade-from-stable-4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

#1415023322030149632
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632

checked from logs
https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632/build-log.txt
upgrade path: 4.7.19->4.8.0-fc.7->4.9.0-0.ci-2021-07-13-144833
error:
running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
The error shows when upgrading from 4.7.19->4.8.0-fc.7; this is expected since the fix is not in 4.8 yet.

no such error for
4.8.0-fc.7->4.9.0-0.ci-2021-07-13-144833

Comment 9 Vadim Rutkovsky 2021-07-30 08:51:09 UTC
*** Bug 1987914 has been marked as a duplicate of this bug. ***

Comment 10 Damien Grisonnet 2021-08-03 13:15:52 UTC
Moving back to ASSIGNED since the backport to 4.8 failed QA verification: https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c5

It is not known yet why this bug only manifests on 4.8; there might be an apiserver bug that slipped into Kubernetes 1.21 and was fixed in 1.22. It seems that even with propagationPolicy set to Background and blockOwnerDeletion set to true on the dependent objects, the Deployment/DaemonSet deletion doesn't wait for the objects to be deleted. We are still investigating this possibility.
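For reference, a delete-then-wait against the real apiserver could look like the client-go sketch below. This is an illustration under assumptions, not CMO's actual implementation: it uses foreground propagation (so the apiserver keeps the object until its dependents are gone) and then polls GET until a 404 confirms removal. The function name, namespace/name parameters, and timeout are placeholders. Not runnable standalone since it needs a cluster and the client-go dependency.

```go
package sketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWait deletes a Deployment with foreground propagation and polls
// until the apiserver reports it gone, so a subsequent Create cannot race
// with the pending deletion.
func deleteAndWait(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	policy := metav1.DeletePropagationForeground
	err := c.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{PropagationPolicy: &policy})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// Poll until GET returns NotFound, i.e. the object is fully removed.
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		_, err := c.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		return false, err
	})
}
```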

Comment 11 Simon Pasquier 2021-08-19 16:03:29 UTC
Setting priority back to medium because I highly suspect that the issue only happens on 4.7 clusters, as commented in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7

Comment 12 Scott Dodson 2021-08-20 16:02:22 UTC
Based on the explanation and CI verification in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7 moving this back to VERIFIED and pushing the fix downward to 4.7.

Comment 13 Scott Dodson 2021-08-20 17:13:10 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/1333#issuecomment-902802506 and comment 10 explain why there's more to do here, so moving back to ASSIGNED.

Comment 16 Jayapriya Pai 2021-09-13 14:04:58 UTC
Closing as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1949840. I will clone that one for backporting to 4.8 since the cherry-pick is not working due to a merge conflict.

*** This bug has been marked as a duplicate of bug 1949840 ***