Bug 1927448 - upgrade rollback (4.6->4.7->4.6) failing with trouble in thanos/prometheus
Summary: upgrade rollback (4.6->4.7->4.6) failing with trouble in thanos/prometheus
Keywords:
Status: CLOSED DUPLICATE of bug 1906496
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-10 18:16 UTC by jamo luhrsen
Modified: 2021-02-11 11:04 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-11 09:29:51 UTC
Target Upstream Version:
Embargoed:



Description jamo luhrsen 2021-02-10 18:16:13 UTC
Description of problem:

this job is frequently failing:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

One common test that fails is "Cluster should remain functional during upgrade" and the root of that failure
is something like:

  clusteroperator/monitoring is not Available for 23m55.166742592s because ""
    	clusteroperator/monitoring is Degraded for 23m55.166752686s because "Failed to rollout the stack. Error: running task Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas"
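
For reference, a quick sketch of how that condition can be re-checked on a live cluster (this assumes the kubernetes Python client and an admin kubeconfig for the affected cluster; it is not taken from the CI job itself). It reads the monitoring ClusterOperator conditions and the rollout status of the openshift-monitoring/thanos-querier Deployment the operator says it is waiting on:

  # Diagnostic sketch only -- assumes `pip install kubernetes` and a kubeconfig
  # pointing at the affected cluster.
  from kubernetes import client, config

  config.load_kube_config()

  # ClusterOperator is a cluster-scoped custom resource (config.openshift.io/v1).
  co = client.CustomObjectsApi().get_cluster_custom_object(
      group="config.openshift.io", version="v1",
      plural="clusteroperators", name="monitoring")
  for cond in co["status"]["conditions"]:
      print(cond["type"], cond["status"], cond.get("message", ""))

  # The Deployment the operator reports it is waiting on.
  dep = client.AppsV1Api().read_namespaced_deployment(
      "thanos-querier", "openshift-monitoring")
  s = dep.status
  print("replicas:", s.replicas,
        "available:", s.available_replicas,
        "unavailable:", s.unavailable_replicas)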

One example is this job run:
  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088


digging into the gather-extra logs:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/

I can see lots of errors in the logs around Thanos and Prometheus. Some examples:

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-6f5cdf8fc9-7bl9h_thanos-query.log
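
On a live cluster (rather than from the gather-extra artifacts), roughly the same container logs can be pulled with the kubernetes Python client; note the thanos-query label selector below is my assumption and may need adjusting to match the pod labels in the affected release:

  # Sketch for pulling the same container logs from a live cluster; the
  # thanos-query label selector is a guess and may differ per release.
  from kubernetes import client, config

  config.load_kube_config()
  core = client.CoreV1Api()

  targets = [
      ("prometheus-k8s-0", "thanos-sidecar"),
      ("prometheus-k8s-1", "thanos-sidecar"),
  ]
  # thanos-querier pods carry a hashed suffix, so look them up by label.
  querier = core.list_namespaced_pod(
      "openshift-monitoring",
      label_selector="app.kubernetes.io/name=thanos-query")
  targets += [(p.metadata.name, "thanos-query") for p in querier.items]

  for pod, container in targets:
      log = core.read_namespaced_pod_log(
          pod, "openshift-monitoring", container=container, tail_lines=50)
      print("=== %s/%s ===" % (pod, container))
      print(log)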


Version-Release number of selected component (if applicable):

4.7

How reproducible:

Looks like ~50% of the runs of this job are failing this way.

