Description of problem:

This job is failing frequently:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

One test that commonly fails is "Cluster should remain functional during upgrade", and the root of that failure is something like:

clusteroperator/monitoring is not Available for 23m55.166742592s because ""
clusteroperator/monitoring is Degraded for 23m55.166752686s because "Failed to rollout the stack. Error: running task Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas"

This job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088

Digging into the gather-extra logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/

I can see lots of errors in the logs around Thanos and Prometheus.
Some examples:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_thanos-sidecar.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-6f5cdf8fc9-7bl9h_thanos-query.log

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Looks like ~50% of the job runs are failing on this.