Description of problem:
When CMO tries to update a deployment, it might go Degraded/Unavailable for some time.

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Not always, but it happens quite often in CI:
https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1. Upgrade from 4.7 to 4.8.

Actual results:
The operator goes Degraded and Unavailable for a short period of time, the reason being '... Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "xxx" already exists'.

Expected results:
No error.

Additional info:
https://github.com/openshift/cluster-monitoring-operator/blob/7f4925a7203622d70b3007fbddfb6bc5cce6c1d9/pkg/client/client.go#L716-L731

The issue is probably that the delete operation is asynchronous: CMO should wait for the deployment to be effectively removed before trying to recreate it.
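To illustrate the suggested fix, here is a minimal Go sketch assuming client-go; the function name, the timeout values, and the use of Foreground propagation are my assumptions for the example, not CMO's actual code in client.go:

// Minimal sketch (not CMO's actual implementation): delete the existing
// Deployment and, because the delete is asynchronous, poll until the API
// server reports it gone before creating the replacement.
package example

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func recreateDeployment(ctx context.Context, c kubernetes.Interface, dep *appsv1.Deployment) error {
	deployments := c.AppsV1().Deployments(dep.Namespace)

	policy := metav1.DeletePropagationForeground
	err := deployments.Delete(ctx, dep.Name, metav1.DeleteOptions{PropagationPolicy: &policy})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	// Wait until GET returns NotFound, i.e. the old object is fully removed.
	err = wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		_, getErr := deployments.Get(ctx, dep.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return true, nil
		}
		return false, nil
	})
	if err != nil {
		return err
	}

	_, err = deployments.Create(ctx, dep, metav1.CreateOptions{})
	return err
}

Waiting for NotFound before the Create avoids racing with the pending delete, which is what produces the "already exists" error above.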
I'm going to raise the severity to high based on [1], where the operator seems to have hit this and then got stuck forever:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep clusteroperator/monitoring
Jun 02 05:00:02.363 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 05:00:02.363 - 40s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Available status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
Jun 02 05:00:43.066 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 05:00:43.142 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 05:00:43.142 - 37s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
...
Jun 02 06:48:19.016 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 06:48:19.043 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Jun 02 06:48:19.043 - 303s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 06:53:22.677 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Jun 02 07:00:01.836 - 85s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
Jun 02 07:00:01.836 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
Failing tests:

[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
[bz-Monitoring] clusteroperator/monitoring should not change condition/Degraded

Searching for the stuck-at-the-end-of-test symptoms, it seems like this is not absolutely melting us down, but it's not particularly rare either:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=Failed+to+upgrade+monitoring,+operator+was+not+available.*UpdatingPrometheusOperatorFailed' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 36 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 34 runs, 94% failed, 3% of failures match = 3% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
pull-ci-openshift-prometheus-operator-master-e2e-aws-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
Looking at [1], the initial error reported by CMO is indeed the same: 'creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists'. But the subsequent reconciliations fail for different reasons [2].

The next failure happens because the prometheus operator isn't ready yet and the admission webhook fails (see bug 1949840):

E0602 05:01:20.514522 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"

Then CMO fails repeatedly because the prometheus-k8s statefulset never converges to the desired state:

E0602 05:06:24.595880 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas

Now looking at the pods [3], there's no node on which the prometheus-k8s-1 pod can be scheduled:

"status": {
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-06-02T05:02:17Z",
      "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],

IIUC the mis-scheduling happens because some worker nodes have moved away from the us-east-1b zone while the prometheus PVs are bound to this zone.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json | jq '.items| map(select( .metadata.labels["node-role.kubernetes.io/worker"] == "" )) | map( .metadata.name + ": " + .metadata.labels["topology.ebs.csi.aws.com/zone"] )'
[
  "ip-10-0-144-164.ec2.internal: us-east-1a",
  "ip-10-0-176-27.ec2.internal: us-east-1a",
  "ip-10-0-207-235.ec2.internal: us-east-1b"
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/persistentvolumes.json | jq '.items | map( .metadata.name + ": " + .spec.claimRef.name + ": " + .metadata.labels["failure-domain.beta.kubernetes.io/zone"])'
[
  "pvc-4b072738-5319-4337-8a3c-d819f28c4bf5: prometheus-data-prometheus-k8s-1: us-east-1b",
  "pvc-64a7fce5-45c3-4551-a955-5c7bb3cd5a89: prometheus-data-prometheus-k8s-0: us-east-1b"
]

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-7556c4b9c6-7vlhj_cluster-monitoring-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json
I've created bug 1967614 to investigate the volume node affinity conflict.
Moving back to low severity since the issue with the '422 Unprocessable Entity' update response resolves itself.
Checked from https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=168h&context=1&type=bug%2Bjunit&name=master-ci-4.9-upgrade-from-stable-4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

#1415023322030149632
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632

Checked the logs at https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632/build-log.txt

Upgrade path: 4.7.19 -> 4.8.0-fc.7 -> 4.9.0-0.ci-2021-07-13-144833

The error

error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists

shows up when upgrading from 4.7.19 to 4.8.0-fc.7, which is expected since the fix is not in 4.8 yet. There is no such error for 4.8.0-fc.7 -> 4.9.0-0.ci-2021-07-13-144833.
*** Bug 1987914 has been marked as a duplicate of this bug. ***
Moving back to ASSIGNED since the backport to 4.8 failed QA verification: https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c5

It is not known yet why this bug only manifests on 4.8; there might be an apiserver bug that slipped into Kubernetes 1.21 and was fixed in 1.22. It seems that even with propagationPolicy set to Background and blockOwnerDeletion set to true on the dependent objects, the Deployment/DaemonSet deletion doesn't wait for the object to be deleted. We are still investigating this possibility.
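For reference, a minimal sketch of where those two fields live in the Kubernetes API (assumed helper names, not CMO's code): propagationPolicy is passed on the owner's Delete call, while blockOwnerDeletion is declared in the dependent object's ownerReferences.

// Illustration only of the API fields discussed above.
package example

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func boolPtr(b bool) *bool { return &b }

// deleteDeployment deletes the owner Deployment with the given propagation policy,
// e.g. metav1.DeletePropagationBackground or metav1.DeletePropagationForeground.
// With Foreground, the owner keeps a deletionTimestamp and only disappears once
// dependents that set blockOwnerDeletion are gone.
func deleteDeployment(ctx context.Context, c kubernetes.Interface, ns, name string, policy metav1.DeletionPropagation) error {
	return c.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{PropagationPolicy: &policy})
}

// ownerRefFor shows the shape of the ownerReference a dependent (e.g. a ReplicaSet)
// carries back to the owning Deployment, including blockOwnerDeletion.
func ownerRefFor(owner *appsv1.Deployment) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion:         "apps/v1",
		Kind:               "Deployment",
		Name:               owner.Name,
		UID:                owner.UID,
		Controller:         boolPtr(true),
		BlockOwnerDeletion: boolPtr(true),
	}
}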
Setting priority back to medium because I highly suspect that the issue only happens on 4.7 clusters, as commented in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7
Based on the explanation and CI verification in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7, moving this back to VERIFIED and pushing the fix down to 4.7.
https://github.com/openshift/cluster-monitoring-operator/pull/1333#issuecomment-902802506 and comment 10 explain why there's more to do here, so moving back to ASSIGNED.
Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1949840. I will clone that one for backporting to 4.8 since the cherry-pick is not working due to a merge conflict.

*** This bug has been marked as a duplicate of bug 1949840 ***