2109374 – ClusterVersion availableUpdates is stale: PromQL conditional risks vs. slow/stuck Thanos

Bug 2109374 - ClusterVersion availableUpdates is stale: PromQL conditional risks vs. slow/stuck Thanos

Summary: ClusterVersion availableUpdates is stale: PromQL conditional risks vs. slow/s...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.12.0
Assignee:	W. Trevor King
QA Contact:	Yang Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2115564
TreeView+	depends on / blocked

Reported:	2022-07-21 06:06 UTC by Wang Haoran
Modified:	2023-02-03 11:08 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-01-17 19:53:06 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 806	0	None	Merged	Bug 2109374: pkg/clusterconditions/promql: Cap PromQL queries at 5 minutes	2022-10-24 22:48:16 UTC
Red Hat Product Errata	RHSA-2022:7399	0	None	None	None	2023-01-17 19:53:30 UTC

Description Wang Haoran 2022-07-21 06:06:04 UTC

Description of problem:
We are seeing customer's upgrade cannot kickoff due to the availableUpdates is null in clusterversion CR
Version-Release number of the following components:


How reproducible:
sometime
Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:

Comment 1 Yang Yang 2022-07-21 06:13:30 UTC

Hi Haoran,

Can you please provide the output of oc get clusterversion -oyaml?

Comment 3 Yang Yang 2022-07-21 06:40:50 UTC

Ah, we can see 4.10.16, 4.10.17 and 4.10.18 in the conditional updates[1]. We do support upgrading along a conditional path[2] but prior to the upgrade please ensure you understand and evaluate the risks displayed along with the update:)

[1] https://docs.openshift.com/container-platform/4.10/updating/understanding-upgrade-channels-release.html#conditional-updates-overview_understanding-upgrade-channels-releases
[2] https://docs.openshift.com/container-platform/4.10/updating/updating-cluster-cli.html#update-conditional-upgrade-pathupdating-cluster-cli

Comment 4 Yang Yang 2022-07-21 07:32:11 UTC

Please ignore comment#3. Had a call with Haoran and figured out the issue he is facing is that the available updates for the 4.10.15 cluster are null as showed in comment#3, while we can see it has 3 available updates (4.10.20, 4.10.21, 4.10.22) from [1].

[1] https://access.redhat.com/labs/ocpupgradegraph/update_channel?channel=stable-4.10&arch=x86_64&is_show_hot_fix=false

Comment 6 W. Trevor King 2022-07-27 18:58:58 UTC

Getting a timestamp from the must-gather (sorry external-to-Red-Hat folks):

  $ sha256sum 02715548_must-gather-20220721_052723Z.tar.gz
  6664f8f57144a5b64e83485cd894e24b9dbfcd7a5881661da540882711b69be7  02715548_must-gather-20220721_052723Z.tar.gz
  $ tar xOz must-gather/timestamp <02715548_must-gather-20220721_052723Z.tar.gz
  2022-07-21 05:03:09.399613208 +0000 UTC m=+0.121030418
  ...
  2022-07-21 05:06:43.974196248 +0000 UTC m=+3.574959029

Pulling ClusterVersion out of the must-gather:

  $ tar xOz must-gather/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml <02715548_must-gather-20220721_052723Z.tar.gz | yaml2json >cluster-version.json

Confirming the default 'upstream', and figuring out the channel/version:

  $ jq '{channel: .spec.channel, upstream: .spec.upstream, current: .status.desired.version}' cluster-version.json 
  {
    "channel": "stable-4.10",
    "upstream": null,
    "current": "4.10.15"
  }

Looking at history:

  $ jq '.status.history | length' cluster-version.json 
  11
  $ jq -r '.status.history[0] | .startedTime + " " + .completionTime + " " + .state + " " + .version' cluster-version.json 
  2022-06-07T15:04:14Z 2022-06-07T16:18:08Z Completed 4.10.15
  $ jq -r '.status.history[-1] | .startedTime + " " + .completionTime + " " + .state + " " + .version' cluster-version.json 
  2022-01-04T15:56:00Z 2022-01-04T16:17:44Z Completed 4.9.11

11 < 50, so 4.9.11 is the install version, and we don't have to worry about bug 2108292 vs. history pruning.

Looking in the graph-data repo [1] (where I'm on a checkout of [2] to get efficient manifest-list caching), I'll look up the commit state on the 21st:

  $ git --no-pager log --first-parent --date=iso --format='%ad %h %s' | grep -1 2022-07-21
  2022-07-22 21:33:57 -0400 6e5dcdf Merge pull request #2245 from sdodson/cleanup-eus-48
  2022-07-21 17:21:23 +0200 1014ca3 Merge pull request #2243 from openshift-ota-bot/promote-4.6.60-to-fast-4.7
  2022-07-21 16:47:41 +0200 1eea0f7 Merge pull request #2242 from openshift-ota-bot/promote-4.6.60-to-fast-4.6
  2022-07-21 15:45:36 +0200 5cdf31f Merge pull request #2241 from openshift-ota-bot/promote-4.6.60-to-fast
  2022-07-21 10:16:38 +0200 efb78b5 Merge pull request #2240 from openshift-ota-bot/promote-4.7.54-to-stable-4.8
  2022-07-21 09:58:33 +0200 d32cd17 Merge pull request #2238 from openshift-ota-bot/promote-4.7.54-to-eus-4.8
  2022-07-21 09:40:45 +0200 b11a997 Merge pull request #2239 from openshift-ota-bot/promote-4.7.54-to-stable-4.7
  2022-07-21 08:35:47 +0200 0a4c588 Merge pull request #2237 from openshift-ota-bot/promote-4.7.54-to-stable
  2022-07-20 19:10:04 +0200 590c297 Merge pull request #2234 from openshift-ota-bot/promote-4.10.24-to-candidate-4.10

So nothing moving in stable-4.10 that day, and we can pick any commit hash from those results.  I'll pick 1014ca3.  Showing the state of 4.10.15 -> * in stable-4.10 at that point:

  $ hack/show-edges.py --revision 1014ca3 stable-4.10 | grep '^4[.]10[.]15 -'
  4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.16
  4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.17
  4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.18
  4.10.15 -> 4.10.20
  4.10.15 -> 4.10.21
  4.10.15 -> 4.10.22

As Oscar pointed out internally, there is [3] discussing those ReleaseDataWithHyphenPrefix.  This born-in-4.9.11 cluster matches that risk's PromQL, which was looking at born_by_4_9:

  $ jq -r '.status.conditionalUpdates[] | .release.version as $to | .conditions[] | select(.type == "Recommended") | .lastTransitionTime + " " + .type + "=" + .status + " " + $to' cluster-version.json 
  2022-07-04T07:30:44Z Recommended=False 4.10.18
  2022-07-04T07:30:44Z Recommended=False 4.10.17
  2022-07-04T07:30:44Z Recommended=False 4.10.16

But this is not an s390x cluster, so it wouldn't have passed through any of the hyphen-prefix releases listed in [3].  You could use:

  $ oc adm upgrade --allow-not-recommended --to 4.10.18

to say "yes, I see ReleaseDataWithHyphenPrefix matches, but I have evaluated that risk and decided it does not apply to my cluster.  I would like to update to 4.10.18 anyway.".

So far so good.  However, we are in Bugzilla because:

  $ jq .status.availableUpdates cluster-version.json 
  null

when, per the above show-edges output, we expect to see 4.10.20, 4.10.21, and 4.10.22.  I'll dig into those in a follow-up comment.

[1]: https://github.com/openshift/cincinnati-graph-data
[2]: https://github.com/openshift/cincinnati-graph-data/pull/2254
[3]: https://access.redhat.com/solutions/6965075

Comment 7 W. Trevor King 2022-07-27 20:51:20 UTC

Ok, what's going on with the missing availabeUpdates?  Let's look at alerts in the must-gather:

  $ tar tvz <02715548_must-gather-20220721_052723Z.tar.gz | grep /monitoring/
  drwxr-xr-x root/root         0 2022-07-20 22:06 must-gather/monitoring/
  drwxr-xr-x root/root         0 2022-07-20 22:06 must-gather/monitoring/prometheus/
  -rw-r--r-- root/root         0 2022-07-20 22:06 must-gather/monitoring/prometheus/alertmanagers.json
  -rw-r--r-- root/root      2079 2022-07-20 22:06 must-gather/monitoring/prometheus/alertmanagers.stderr
  -rw-r--r-- root/root         0 2022-07-20 22:06 must-gather/monitoring/prometheus/rules.json
  -rw-r--r-- root/root      2079 2022-07-20 22:06 must-gather/monitoring/prometheus/rules.stderr
  ...
  $ tar xOz must-gather/monitoring/prometheus/rules.stderr <02715548_must-gather-20220721_052723Z.tar.gz
  Error from server (Forbidden): <!DOCTYPE html>
  <html lang="en" charset="utf-8">
  <head>
    <title>Log In</title>
  ...

I dunno what's going on there.  But luckily we have Telemetry exported from this cluster.  Getting the cluster ID:

  $ jq -r .spec.clusterID cluster-version.json 
  44REDACTED

And going back into Telemetry to look at the day around the 2022-07-21 05:06 UTC must-gather (timestamp from comment 6):

  alerts{_id="44REDACTED"}

shows CannotRetrieveUpdates firing throughout, although there is a window from 00:10 to 05:03 where alerts was not flowing to Telemetry, although most other metrics were still being uploaded.  However, the issues with update freshness were not being reported in ClusterVersion conditions:

  $ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' cluster-version.json 
  2022-01-04T16:17:44Z Available=True : Done applying 4.10.15
  2022-06-07T15:41:08Z Failing=False : 
  2022-06-07T16:18:08Z Progressing=False : Cluster version is 4.10.15
  2022-06-07T15:52:44Z RetrievedUpdates=True : 
  2022-06-07T15:05:12Z Upgradeable=False IncompatibleOperatorsInstalled: Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ...
  2022-06-07T15:05:23Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.10.15" image="quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb"

I trimmed the IncompatibleOperatorsInstalled message, because as "should not be upgraded between minor versions", that's just about updates to 4.11, which aren't relevant for this 4.10 cluster yet.  But RetrievedUpdates=True here isn't great, since the fact that CannotRetrieveUpdates is firing and the lack of 4.10.20 and later in availableUpdates shows that we are struggling with this retrieval.

Digging back in Telemetry history, CannotRetrieveUpdates has been firing with some unrelated gaps since around 2022-07-04 8:32 UTC.  Back again to graph-data:

  $ git --no-pager log --first-parent --date=iso --format='%ad %h %s' -1 origin/master -- blocked-edges
  2022-06-29 17:06:47 +0000 4c92f3a Merge pull request #2118 from wking/conditional-risk-for-hyphen-prefixed-release-data

so the alert began firing a few days after we made those update recommendations conditional.  And because we were getting other alerts like Watchdog for hours+ before CannotRetrieveUpdates shows up in Telemetry, it's likely that update retrieval was working before the alert started firing.

Pulling CVO logs out of the must-gather:

  $ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-75bb5448f6-x29lz/cluster-version-operator/cluster-version-operator/logs/current.log <02715548_must-gather-20220721_052723Z.tar.gz >cvo.log

The logs cover several hours:

  $ head -n1 cvo.log
  2022-07-20T21:35:16.943832468Z I0720 21:35:16.942716       1 sync_worker.go:840] Done syncing for customresourcedefinition "machineconfigs.machineconfiguration.openshift.io" (642 of 771)
  $ tail -n1 cvo.log
  2022-07-21T05:03:08.910533346Z I0721 05:03:08.910493       1 cvo.go:517] Finished syncing cluster version "openshift-cluster-version/version" (12.170664ms)

but the commit has long since rotated out.  Grabbing it off the image:

  $ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-75bb5448f6-x29lz/cluster-version-operator-75bb5448f6-x29lz.yaml <02715548_must-gather-20220721_052723Z.tar.gz | yaml2json | jq -r '.spec.containers[].image'
  quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb
  $ oc image info -o json quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb | jq -r '.config.config.Env[]' | grep '^SOURCE_GIT_COMMIT='
  SOURCE_GIT_COMMIT=b3da2d3eba82adcd53198d662607f21641817c4a

Back to the CVO logs, nothing in there out of availableupdates.go:

  $ grep availableupdates cvo.log 
  ...no hits...

and I'd have expected one of [1].  Bubbling up a level [2], still no sign of life:

  $ grep 'syncing available updates' cvo.log 
  ...no hits...

So the goroutine launched by [3] seems wedged.

Possibly it wedged connecting to the sad Thanos service while trying to resolve a PromQL conditional risk, and we need to get a timeout in place to guard against that?  If we can find a cluster currently exhibiting these symptoms, I would be very interested in a CVO stack trace (see [4] for a tail-and-SIGQUIT process for capturing stack traces).

[1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/availableupdates.go#L41-L54
[2]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/cvo.go#L588-L590
[3]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/cvo.go#L365-L372
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5

Comment 10 W. Trevor King 2022-07-27 23:17:52 UTC

Using "falsely claims RetrievedUpdates=True despite CannotRetrieveUpdates firing" as our marker, the issue seems rare in the Telemetry-reporting fleet, so I'm setting blocker-.

Comment 12 W. Trevor King 2022-07-27 23:46:34 UTC

Matt got me a tail-and-SIGQUIT stack trace (thanks :), and looking inside:

  $ sha256sum bz2109374-cvo.log.gz 
  d729343215dc7062b21ee014a8622a0f8beb75239c20dd92e6c8e7f9cf04df3d  bz2109374-cvo.log.gz
  $ zcat bz2109374-cvo.log.gz | head -n1
  I0727 19:54:53.897855       1 cvo.go:515] Started syncing cluster version "openshift-cluster-version/version", spec changes, status, and payload (2022-07-27 19:54:53.897839616 +0000 UTC m=+4333456.961228352)
  $ zgrep '^[IE][0-9]' bz2109374-cvo.log.gz | tail -n1
  I0727 23:17:23.912294       1 cvo.go:517] Finished syncing cluster version "openshift-cluster-version/version" (14.289749ms)

So we captured 3+ hours of logs, which with a functioning RetrievedUpdates loop should get us many rounds of attempts.  However, none of the expected logs:

  $ grep 'syncing available updates' bz2109374-cvo.log.gz
  ...no hits...

Big stack dump at the end of the file:

  $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | wc -l
  1262

Looking for my cvo.go#L365-L372 goroutine launch point:

  $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | grep -n 'cvo[.]go:\(36[5-9]\|7[02]\)'
  784:    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:369 +0xa5
  798:    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:368 +0xb6
  800:    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:366 +0x553

Likely all from a single goroutine.

  $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | grep -nB100 'cvo[.]go:366' | grep 'goroutine '
  720-goroutine 173 [select, 2 minutes]:
  738-goroutine 174 [select, 34062 minutes]:

Yup, so goroutine 174 in stack lines 738 through 800+.

  $ zgrep -A62 'goroutine 174' bz2109374-cvo.log.gz | grep cluster-version-operator | grep -v vendor | head -n4
  github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0x2c1c240, {0x1db94b0, 0xc0011f21c0}, 0xc000011488)
          /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:83 +0x366
  github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0x2c1c240, {0x1db94b0, 0xc0011f21c0}, 0xc000011470)
          /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:133 +0xa44

Looks pretty solid for "hung in the PromQL request" to me.  And we get to pass a Context through [1], so I'll insert a "but let's not be crazy now" timeout in there...

[1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/clusterconditions/promql/promql.go#L83

Comment 13 W. Trevor King 2022-07-28 00:33:07 UTC

I've opened cvo#806 adding a timeout to the PromQL query.  I'm not sure what the best way to verify the fix is.  If someone has a blackhole servicer that just catches and idles incoming connections, maybe we could slot that in behind the Thanos service?  Or we can eyeball the timeout, do some basic regression testing, ship it, and see how it plays in the wild.

Comment 15 Yang Yang 2022-07-29 08:06:43 UTC

Trying to reproduce it

1. Install a 4.9.11 cluster
2. Upgrade to 4.10.15
# oc adm upgrade 
Cluster version is 4.10.15

Upstream is unset, so the cluster will use an appropriate default.
Channel: candidate-4.10 (available channels: candidate-4.10, candidate-4.11, eus-4.10, fast-4.10, stable-4.10)

Recommended updates:

  VERSION     IMAGE
  4.10.25     quay.io/openshift-release-dev/ocp-release@sha256:ed84fb3fbe026b3bbb4a2637ddd874452ac49c6ead1e15675f257e28664879cc
  4.10.24     quay.io/openshift-release-dev/ocp-release@sha256:aab51636460b5a9757b736a29bc92ada6e6e6282e46b06e6fd483063d590d62a
  4.10.23     quay.io/openshift-release-dev/ocp-release@sha256:e40e49d722cb36a95fa1c03002942b967ccbd7d68de10e003f0baa69abad457b
  4.10.22     quay.io/openshift-release-dev/ocp-release@sha256:62c995079672535662ee94ef2358ee6b0e700475c38f6502ca2d3d13d9d7de5b
  4.10.21     quay.io/openshift-release-dev/ocp-release@sha256:420ee7160d4970304ae97a1b0a77d9bd52af1fd97c597d7cb5d5a2c0d0b72dda
  4.10.20     quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600

Additional updates which are not recommended based on your cluster configuration are available, to view those re-run the command with --include-not-recommended.

So far so good.

Make thanos degraded.

# cat cm.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      nodeSelector:
        deploy: unknown

# oc create -f cm.yaml 
configmap/cluster-monitoring-config created

# oc -n openshift-monitoring get pod | grep thanos-querier
thanos-querier-5584f4d459-bzsm4               6/6     Running   0          3h24m
thanos-querier-647556f965-cfhck               0/6     Pending   0          7s
thanos-querier-647556f965-wx2tn               0/6     Pending   0          7s

# oc -n openshift-monitoring get rs | grep thanos-querier
thanos-querier-5584f4d459                1         1         1       3h48m
thanos-querier-647556f965                2         2         0       18s

# oc -n openshift-monitoring delete rs thanos-querier-5584f4d459
replicaset.apps "thanos-querier-5584f4d459" deleted

# oc adm upgrade channel --allow-explicit-channel
warning: Clearing channel "stable-4.10"; cluster will no longer request available update recommendations.

Restart CVO

# oc adm upgrade 
Cluster version is 4.10.15

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.10 (available channels: candidate-4.10, candidate-4.11, eus-4.10, fast-4.10, stable-4.10)

Recommended updates:

  VERSION     IMAGE
  4.10.23     quay.io/openshift-release-dev/ocp-release@sha256:e40e49d722cb36a95fa1c03002942b967ccbd7d68de10e003f0baa69abad457b
  4.10.22     quay.io/openshift-release-dev/ocp-release@sha256:62c995079672535662ee94ef2358ee6b0e700475c38f6502ca2d3d13d9d7de5b
  4.10.21     quay.io/openshift-release-dev/ocp-release@sha256:420ee7160d4970304ae97a1b0a77d9bd52af1fd97c597d7cb5d5a2c0d0b72dda
  4.10.20     quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600

Additional updates which are not recommended based on your cluster configuration are available, to view those re-run the command with --include-not-recommended.

Sigh, I can't reproduce it. Will do the regression testing when the nightly build is out.

Comment 16 Yang Yang 2022-08-01 07:59:11 UTC

Verifying it by performing regression testing on 4.12.0-0.nightly-2022-07-31-235028.

# oc adm upgrade --include-not-recommended
Cluster version is 4.12.0-0.nightly-2022-07-31-235028

Upstream: https://storage.googleapis.com/ocp-ota-1659338073/graph-1659338074
Channel: buggy

Recommended updates:

  VERSION     IMAGE
  4.99.999999 registry.ci.openshift.org/ocp/release@sha256:9999999999999999999999999999999999999999999999999999999999999999

Supported but not recommended updates:

  Version: 4.88.888888
  Image: registry.ci.openshift.org/ocp/release@sha256:8888888888888888888888888888888888888888888888888888888888888888
  Recommended: False
  Reason: ReleaseIsRejected
  Message: Too many CI failures on this release, so do not update to it https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-24-075634

  Version: 4.77.777777
  Image: registry.ci.openshift.org/ocp/release@sha256:7777777777777777777777777777777777777777777777777777777777777777
  Recommended: False
  Reason: SomeInvokerThing
  Message: On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a

All the updates are resolved successfully. CVO seems not to get stuck. Moving it to verified state.

Comment 18 Wu Siu Wa 2022-10-24 07:32:47 UTC

Workaround:
Restarting both the prometheus-adapters and the thanos-queriers (in openshift-monitoring) can mitigate the issue.

Comment 20 W. Trevor King 2023-01-09 17:25:19 UTC

The backport bug 2115564 shipped in 4.11.1, so I don't think it's worth calling out in 4.12 release notes, and I've set No Doc Update.  But if you wanted to doc it, the doc-text template would be something like:

Cause: The cluster-version operator's PromQL update risk evaluation engine lacked a timeout.

Consequence: Some flavors of sad Thanos could hold cluster-version operator PromQL calls indefinitely, blocking the cluster-version operator from assessing other risks or retrieving fresh update recommendations from the upstream update service.

Workaround (if any): Comment 18.

Fix: The cluster-version operator now caps Thanos queries at 5 minutes.

Result: When Thanos is slow, the cluster-version operator gives up, considers the exposure to that risk unknown, and moves on to process additional risks and retreive newer update recommendations from the upstream update service.

Comment 22 errata-xmlrpc 2023-01-17 19:53:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Note You need to log in before you can comment on or make changes to this bug.