1927448 – upgrade rollback (4.6->4.7->4.6) failing with trouble in thanos/prometheus

Summary: upgrade rollback (4.6->4.7->4.6) failing with trouble in thanos/prometheus

Keywords:
Status:	CLOSED DUPLICATE of bug 1906496
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Sergiusz Urbaniak
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-02-10 18:16 UTC by jamo luhrsen
Modified:	2021-02-11 11:04 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-11 09:29:51 UTC
Target Upstream Version:
Embargoed:

Description jamo luhrsen 2021-02-10 18:16:13 UTC

Description of problem:

This job is frequently failing:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

One test that commonly fails is "Cluster should remain functional during upgrade". The underlying
failure looks something like this:

  clusteroperator/monitoring is not Available for 23m55.166742592s because ""
    	clusteroperator/monitoring is Degraded for 23m55.166752686s because "Failed to rollout the stack. Error: running task Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas"
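
When comparing many runs, it can help to pull the operator name, condition, duration and reason out of lines like the two above. The following is only a rough sketch: the line shape is inferred from this single example, the duration is assumed to be in Go's duration format, and `parse_condition_line` and `parse_duration` are hypothetical helper names, not part of any OpenShift tooling.

```python
import re

# Assumed line shape, inferred from the example in this report:
#   clusteroperator/<name> is [not] <Condition> for <duration> because "<reason>"
LINE_RE = re.compile(
    r'clusteroperator/(?P<name>\S+) is (?P<negated>not )?(?P<condition>\w+) '
    r'for (?P<duration>\S+) because "(?P<reason>.*)"'
)

# Go-style duration components such as "23m" or "55.166742592s".
# Two-letter units are listed first so "ms" is not read as "m".
DURATION_RE = re.compile(r'(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)')
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def parse_duration(text):
    """Convert a Go-style duration string to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in DURATION_RE.findall(text))


def parse_condition_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "condition": match.group("condition"),
        "negated": match.group("negated") is not None,
        "seconds": parse_duration(match.group("duration")),
        "reason": match.group("reason"),
    }


if __name__ == "__main__":
    sample = 'clusteroperator/monitoring is not Available for 23m55.166742592s because ""'
    print(parse_condition_line(sample))
```

Sorting the parsed results by `seconds` across runs would show how long the monitoring operator typically stays unavailable or degraded.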

An example failing run:
  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088


Digging into the gather-extra logs:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/

I can see lots of errors in the Thanos and Prometheus logs. Some examples:

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1356013559707865088/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-6f5cdf8fc9-7bl9h_thanos-query.log


Version-Release number of selected component (if applicable):

4.7

How reproducible:

Looks like ~50% of the job runs are failing with this symptom.