
Bug 1927448

Summary: upgrade rollback (4.6->4.7->4.6) failing with trouble in thanos/prometheus
Product: OpenShift Container Platform
Reporter: jamo luhrsen <jluhrsen>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: urgent
Priority: unspecified
Version: 4.7
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, surbania, wking
Last Closed: 2021-02-11 09:29:51 UTC
Type: Bug

Description jamo luhrsen 2021-02-10 18:16:13 UTC
Description of problem:

This job is frequently failing:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

One test that commonly fails is "Cluster should remain functional during upgrade", and the root of that
failure looks something like:

  clusteroperator/monitoring is not Available for 23m55.166742592s because ""
  clusteroperator/monitoring is Degraded for 23m55.166752686s because "Failed to rollout the stack. Error: running task Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas"
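
For triaging many runs, the two condition lines above can be split into operator, condition, duration and reason. The following is only a minimal sketch: the regular expression is inferred from the two lines quoted in this report (it is not an official format), and the names CONDITION_RE, OperatorCondition and parse_condition are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the two lines quoted in this report; this is an
# assumption, and the real test output may differ between releases.
CONDITION_RE = re.compile(
    r'clusteroperator/(?P<operator>\S+) is (?P<negated>not )?(?P<condition>\w+) '
    r'for (?P<duration>\S+) because "(?P<reason>.*)"\s*$'
)


class OperatorCondition(NamedTuple):
    operator: str    # e.g. "monitoring"
    condition: str   # e.g. "Available" or "Degraded"
    negated: bool    # True for "is not <condition>"
    duration: str    # kept as the raw string, e.g. "23m55.166742592s"
    reason: str      # text between the quotes; may be empty


def parse_condition(line: str) -> Optional[OperatorCondition]:
    """Return the parsed condition, or None if the line does not match."""
    m = CONDITION_RE.search(line.strip())
    if m is None:
        return None
    return OperatorCondition(
        operator=m["operator"],
        condition=m["condition"],
        negated=m["negated"] is not None,
        duration=m["duration"],
        reason=m["reason"],
    )
```

Applied to the lines above, this would report monitoring as "not Available" with an empty reason, and as "Degraded" with the Thanos Querier rollout message as the reason.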

An example failing run:
  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088


Digging into the gather-extra logs:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/

I can see lots of errors in the logs around Thanos and Prometheus. Some examples:

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-6f5cdf8fc9-7bl9h_thanos-query.log
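
The three log links above share one layout, so the same container logs from other runs of this job can be located by swapping the run ID. A small sketch of that pattern follows, assuming the layout holds for other runs (it is inferred only from the URLs in this report); pod_log_url and its parameter names are hypothetical.

```python
# Base taken verbatim from the links in this report.
GCSWEB_BASE = (
    "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com"
    "/gcs/origin-ci-test/logs"
)


def pod_log_url(job: str, run_id: str, namespace: str, pod: str,
                container: str, step: str = "e2e-aws-upgrade") -> str:
    """Build a gather-extra pod log URL.

    The path layout is an assumption inferred from the three example
    links; it is not taken from any documented API.
    """
    return (
        f"{GCSWEB_BASE}/{job}/{run_id}/artifacts/{step}"
        f"/gather-extra/artifacts/pods/{namespace}_{pod}_{container}.log"
    )


url = pod_log_url(
    job="release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7",
    run_id="1356013559707865088",
    namespace="openshift-monitoring",
    pod="prometheus-k8s-1",
    container="thanos-sidecar",
)
print(url)
```

With the values shown, this reproduces the first link in the list above.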


Version-Release number of selected component (if applicable):

4.7

How reproducible:

It looks like ~50% of the job runs are failing with this.