Bug 1956308
| Summary: | CMO fails to delete/recreate the deployment resource after '422 Unprocessable Entity' update response | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Jayapriya Pai <janantha> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.8 | CC: | alegrand, anpicker, dgrisonn, eparis, erooth, jfajersk, kakkoyun, lcosic, rbost, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-13 14:04:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1982369 | | |
Description
Simon Pasquier 2021-05-03 12:53:42 UTC
I'm going to raise the severity to high based on [1], where the operator seems to have hit this and then got stuck forever:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep clusteroperator/monitoring
    Jun 02 05:00:02.363 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
    Jun 02 05:00:02.363 - 40s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
    Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Jun 02 05:00:43.066 - 7244s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
    Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Available status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Jun 02 05:00:43.066 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists
    Jun 02 05:00:43.066 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusOperatorFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Jun 02 05:00:43.142 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
    Jun 02 05:00:43.142 - 37s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
    ...
    Jun 02 06:48:19.016 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Jun 02 06:48:19.043 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
    Jun 02 06:48:19.043 - 303s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
    Jun 02 06:53:22.677 W clusteroperator/monitoring condition/Progressing status/False reason/UpdatingPrometheusK8SFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Jun 02 07:00:01.836 - 85s W clusteroperator/monitoring condition/Progressing status/True reason/Rolling out the stack.
    Jun 02 07:00:01.836 W clusteroperator/monitoring condition/Progressing status/True reason/RollOutInProgress changed: Rolling out the stack.
The failing test cases:

    [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
    [bz-Monitoring] clusteroperator/monitoring should not change condition/Degraded

Searching for the stuck-at-the-end-of-test symptoms, it seems like it's not absolutely melting us down, but it's not particularly rare either:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=Failed+to+upgrade+monitoring,+operator+was+not+available.*UpdatingPrometheusOperatorFailed' | grep 'failures match' | sort
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 36 runs, 100% failed, 8% of failures match = 8% impact
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 34 runs, 94% failed, 3% of failures match = 3% impact
    periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
    pull-ci-openshift-prometheus-operator-master-e2e-aws-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616

Looking at [1], the initial error reported by CMO is indeed the same: "creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists". But the subsequent reconciliations fail for different reasons [2].

The next failure happens because the prometheus-operator pod isn't ready yet and the admission webhook call fails (see bug 1949840):

    E0602 05:01:20.514522 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"

Then CMO fails repeatedly because the prometheus-k8s statefulset never converges to the desired state:

    E0602 05:06:24.595880 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas

Now looking at the pods [3], there's no node on which the prometheus-k8s-1 pod can be scheduled:

    "status": {
      "conditions": [
        {
          "lastProbeTime": null,
          "lastTransitionTime": "2021-06-02T05:02:17Z",
          "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
          "reason": "Unschedulable",
          "status": "False",
          "type": "PodScheduled"
        }
      ],

IIUC the misscheduling happens because some worker nodes have moved out of the 'us-east-1b' zone while the Prometheus PVs are still bound to that zone, as the node and PV zone checks below show.
    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json | jq '.items| map(select( .metadata.labels["node-role.kubernetes.io/worker"] == "" )) | map( .metadata.name + ": " + .metadata.labels["topology.ebs.csi.aws.com/zone"] )'
    [
      "ip-10-0-144-164.ec2.internal: us-east-1a",
      "ip-10-0-176-27.ec2.internal: us-east-1a",
      "ip-10-0-207-235.ec2.internal: us-east-1b"
    ]
    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/persistentvolumes.json | jq '.items | map( .metadata.name + ": " + .spec.claimRef.name + ": " + .metadata.labels["failure-domain.beta.kubernetes.io/zone"])'
    [
      "pvc-4b072738-5319-4337-8a3c-d819f28c4bf5: prometheus-data-prometheus-k8s-1: us-east-1b",
      "pvc-64a7fce5-45c3-4551-a955-5c7bb3cd5a89: prometheus-data-prometheus-k8s-0: us-east-1b"
    ]

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-7556c4b9c6-7vlhj_cluster-monitoring-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json

I've created bug 1967614 to investigate the volume node affinity conflict.

Moving back to low severity since the issue with the '422 Unprocessable Entity' update response resolves by itself.

Checked https://search.ci.openshift.org/?search=creating+Deployment+object+failed+after+update+failed&maxAge=168h&context=1&type=bug%2Bjunit&name=master-ci-4.9-upgrade-from-stable-4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job, which points to run #1415023322030149632 (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632). Checked its logs at https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade/1415023322030149632/build-log.txt; the upgrade path is 4.7.19 -> 4.8.0-fc.7 -> 4.9.0-0.ci-2021-07-13-144833. The error "running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists" shows up during the 4.7.19 -> 4.8.0-fc.7 upgrade, which is expected since the fix is not in 4.8 yet. There is no such error for 4.8.0-fc.7 -> 4.9.0-0.ci-2021-07-13-144833.

*** Bug 1987914 has been marked as a duplicate of this bug. ***

Moving back to ASSIGNED since the backport to 4.8 failed QA verification: https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c5. It is not yet known why this bug only manifests on 4.8; there might be an apiserver bug that slipped into Kubernetes 1.21 and was fixed in 1.22.
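To make the failure mode above easier to follow, here is a minimal, hypothetical client-go sketch of a "delete and recreate on 422" fallback. This is not CMO's actual code; the package name, function name, and error strings are invented for illustration. It shows why a create issued immediately after the delete can race with the still-terminating object and fail with the "object is being deleted ... already exists" error seen in the CI logs:

```go
// Package tasks is a hypothetical name used only for this sketch.
package tasks

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOrRecreateDeployment tries to update a Deployment and, if the
// apiserver rejects the update with 422 Unprocessable Entity (surfaced by
// client-go as an "Invalid" status error), falls back to delete + create.
func createOrRecreateDeployment(ctx context.Context, client kubernetes.Interface, dep *appsv1.Deployment) error {
	deployments := client.AppsV1().Deployments(dep.Namespace)

	if _, err := deployments.Update(ctx, dep, metav1.UpdateOptions{}); err == nil {
		return nil
	} else if !apierrors.IsInvalid(err) {
		return fmt.Errorf("updating Deployment object failed: %w", err)
	}

	// The delete call returns as soon as the deletion is accepted; the object
	// can still be present (terminating) in the apiserver afterwards.
	policy := metav1.DeletePropagationBackground
	if err := deployments.Delete(ctx, dep.Name, metav1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Deployment object failed: %w", err)
	}

	// An immediate create therefore races with the pending deletion and can
	// fail with AlreadyExists ("object is being deleted"), leaving the
	// operator Degraded until a later reconciliation retries.
	if _, err := deployments.Create(ctx, dep, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("creating Deployment object failed after update failed: %w", err)
	}
	return nil
}
```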
It seems that even with propagationPolicy set to Background and blockOwnerDeletion set to true on the dependent objects, the Deployment/DaemonSet deletion doesn't wait for the objects to be deleted. We are still investigating this possibility.

Setting the priority back to medium because I strongly suspect that the issue only happens on 4.7 clusters, as commented in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7.

Based on the explanation and CI verification in https://bugzilla.redhat.com/show_bug.cgi?id=1982369#c7, moving this back to VERIFIED and pushing the fix down to 4.7.

https://github.com/openshift/cluster-monitoring-operator/pull/1333#issuecomment-902802506 and comment 10 explain why there's more to do here, so moving back to ASSIGNED.

Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1949840. I will clone that one for the 4.8 backport since the cherry-pick is not working due to a merge conflict.

*** This bug has been marked as a duplicate of bug 1949840 ***
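For reference, a minimal sketch of one way to avoid the race discussed above: after deleting the old object, poll until the apiserver reports NotFound, and only then recreate it. This is illustrative only, with hypothetical names, intervals, and timeouts; it is not necessarily the approach taken in the linked PR or in the fix tracked by bug 1949840:

```go
// Package tasks is a hypothetical name used only for this sketch.
package tasks

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// recreateDeployment deletes a Deployment, waits until it is actually gone,
// and only then creates the new version, avoiding the AlreadyExists race.
func recreateDeployment(ctx context.Context, client kubernetes.Interface, dep *appsv1.Deployment) error {
	deployments := client.AppsV1().Deployments(dep.Namespace)

	// Foreground deletion keeps the owner around (with a finalizer) until its
	// dependents are gone, so the wait below also covers the old ReplicaSets.
	policy := metav1.DeletePropagationForeground
	if err := deployments.Delete(ctx, dep.Name, metav1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Deployment %s/%s failed: %w", dep.Namespace, dep.Name, err)
	}

	// Poll until the old object no longer exists; transient Get errors are retried.
	if err := wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := deployments.Get(ctx, dep.Name, metav1.GetOptions{})
		return apierrors.IsNotFound(err), nil
	}); err != nil {
		return fmt.Errorf("waiting for Deployment %s/%s to be deleted failed: %w", dep.Namespace, dep.Name, err)
	}

	if _, err := deployments.Create(ctx, dep, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("recreating Deployment %s/%s failed: %w", dep.Namespace, dep.Name, err)
	}
	return nil
}
```

The trade-off of this pattern is that the reconciliation blocks for up to the polling timeout instead of returning early and retrying on the next sync.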