Hide Forgot
Description of problem: We are seeing customer's upgrade cannot kickoff due to the availableUpdates is null in clusterversion CR Version-Release number of the following components: How reproducible: sometime Steps to Reproduce: 1. 2. 3. Actual results: Please include the entire output from the last TASK line through the end of output if an error is generated Expected results: Additional info:
Hi Haoran, Can you please provide the output of oc get clusterversion -oyaml?
Ah, we can see 4.10.16, 4.10.17 and 4.10.18 in the conditional updates[1]. We do support upgrading along a conditional path[2] but prior to the upgrade please ensure you understand and evaluate the risks displayed along with the update:) [1] https://docs.openshift.com/container-platform/4.10/updating/understanding-upgrade-channels-release.html#conditional-updates-overview_understanding-upgrade-channels-releases [2] https://docs.openshift.com/container-platform/4.10/updating/updating-cluster-cli.html#update-conditional-upgrade-pathupdating-cluster-cli
Please ignore comment#3. Had a call with Haoran and figured out the issue he is facing is that the available updates for the 4.10.15 cluster are null as showed in comment#3, while we can see it has 3 available updates (4.10.20, 4.10.21, 4.10.22) from [1]. [1] https://access.redhat.com/labs/ocpupgradegraph/update_channel?channel=stable-4.10&arch=x86_64&is_show_hot_fix=false
Getting a timestamp from the must-gather (sorry external-to-Red-Hat folks): $ sha256sum 02715548_must-gather-20220721_052723Z.tar.gz 6664f8f57144a5b64e83485cd894e24b9dbfcd7a5881661da540882711b69be7 02715548_must-gather-20220721_052723Z.tar.gz $ tar xOz must-gather/timestamp <02715548_must-gather-20220721_052723Z.tar.gz 2022-07-21 05:03:09.399613208 +0000 UTC m=+0.121030418 ... 2022-07-21 05:06:43.974196248 +0000 UTC m=+3.574959029 Pulling ClusterVersion out of the must-gather: $ tar xOz must-gather/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml <02715548_must-gather-20220721_052723Z.tar.gz | yaml2json >cluster-version.json Confirming the default 'upstream', and figuring out the channel/version: $ jq '{channel: .spec.channel, upstream: .spec.upstream, current: .status.desired.version}' cluster-version.json { "channel": "stable-4.10", "upstream": null, "current": "4.10.15" } Looking at history: $ jq '.status.history | length' cluster-version.json 11 $ jq -r '.status.history[0] | .startedTime + " " + .completionTime + " " + .state + " " + .version' cluster-version.json 2022-06-07T15:04:14Z 2022-06-07T16:18:08Z Completed 4.10.15 $ jq -r '.status.history[-1] | .startedTime + " " + .completionTime + " " + .state + " " + .version' cluster-version.json 2022-01-04T15:56:00Z 2022-01-04T16:17:44Z Completed 4.9.11 11 < 50, so 4.9.11 is the install version, and we don't have to worry about bug 2108292 vs. history pruning. Looking in the graph-data repo [1] (where I'm on a checkout of [2] to get efficient manifest-list caching), I'll look up the commit state on the 21st: $ git --no-pager log --first-parent --date=iso --format='%ad %h %s' | grep -1 2022-07-21 2022-07-22 21:33:57 -0400 6e5dcdf Merge pull request #2245 from sdodson/cleanup-eus-48 2022-07-21 17:21:23 +0200 1014ca3 Merge pull request #2243 from openshift-ota-bot/promote-4.6.60-to-fast-4.7 2022-07-21 16:47:41 +0200 1eea0f7 Merge pull request #2242 from openshift-ota-bot/promote-4.6.60-to-fast-4.6 2022-07-21 15:45:36 +0200 5cdf31f Merge pull request #2241 from openshift-ota-bot/promote-4.6.60-to-fast 2022-07-21 10:16:38 +0200 efb78b5 Merge pull request #2240 from openshift-ota-bot/promote-4.7.54-to-stable-4.8 2022-07-21 09:58:33 +0200 d32cd17 Merge pull request #2238 from openshift-ota-bot/promote-4.7.54-to-eus-4.8 2022-07-21 09:40:45 +0200 b11a997 Merge pull request #2239 from openshift-ota-bot/promote-4.7.54-to-stable-4.7 2022-07-21 08:35:47 +0200 0a4c588 Merge pull request #2237 from openshift-ota-bot/promote-4.7.54-to-stable 2022-07-20 19:10:04 +0200 590c297 Merge pull request #2234 from openshift-ota-bot/promote-4.10.24-to-candidate-4.10 So nothing moving in stable-4.10 that day, and we can pick any commit hash from those results. I'll pick 1014ca3. Showing the state of 4.10.15 -> * in stable-4.10 at that point: $ hack/show-edges.py --revision 1014ca3 stable-4.10 | grep '^4[.]10[.]15 -' 4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.16 4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.17 4.10.15 -(blocked: ReleaseDataWithHyphenPrefix)-> 4.10.18 4.10.15 -> 4.10.20 4.10.15 -> 4.10.21 4.10.15 -> 4.10.22 As Oscar pointed out internally, there is [3] discussing those ReleaseDataWithHyphenPrefix. This born-in-4.9.11 cluster matches that risk's PromQL, which was looking at born_by_4_9: $ jq -r '.status.conditionalUpdates[] | .release.version as $to | .conditions[] | select(.type == "Recommended") | .lastTransitionTime + " " + .type + "=" + .status + " " + $to' cluster-version.json 2022-07-04T07:30:44Z Recommended=False 4.10.18 2022-07-04T07:30:44Z Recommended=False 4.10.17 2022-07-04T07:30:44Z Recommended=False 4.10.16 But this is not an s390x cluster, so it wouldn't have passed through any of the hyphen-prefix releases listed in [3]. You could use: $ oc adm upgrade --allow-not-recommended --to 4.10.18 to say "yes, I see ReleaseDataWithHyphenPrefix matches, but I have evaluated that risk and decided it does not apply to my cluster. I would like to update to 4.10.18 anyway.". So far so good. However, we are in Bugzilla because: $ jq .status.availableUpdates cluster-version.json null when, per the above show-edges output, we expect to see 4.10.20, 4.10.21, and 4.10.22. I'll dig into those in a follow-up comment. [1]: https://github.com/openshift/cincinnati-graph-data [2]: https://github.com/openshift/cincinnati-graph-data/pull/2254 [3]: https://access.redhat.com/solutions/6965075
Ok, what's going on with the missing availabeUpdates? Let's look at alerts in the must-gather: $ tar tvz <02715548_must-gather-20220721_052723Z.tar.gz | grep /monitoring/ drwxr-xr-x root/root 0 2022-07-20 22:06 must-gather/monitoring/ drwxr-xr-x root/root 0 2022-07-20 22:06 must-gather/monitoring/prometheus/ -rw-r--r-- root/root 0 2022-07-20 22:06 must-gather/monitoring/prometheus/alertmanagers.json -rw-r--r-- root/root 2079 2022-07-20 22:06 must-gather/monitoring/prometheus/alertmanagers.stderr -rw-r--r-- root/root 0 2022-07-20 22:06 must-gather/monitoring/prometheus/rules.json -rw-r--r-- root/root 2079 2022-07-20 22:06 must-gather/monitoring/prometheus/rules.stderr ... $ tar xOz must-gather/monitoring/prometheus/rules.stderr <02715548_must-gather-20220721_052723Z.tar.gz Error from server (Forbidden): <!DOCTYPE html> <html lang="en" charset="utf-8"> <head> <title>Log In</title> ... I dunno what's going on there. But luckily we have Telemetry exported from this cluster. Getting the cluster ID: $ jq -r .spec.clusterID cluster-version.json 44REDACTED And going back into Telemetry to look at the day around the 2022-07-21 05:06 UTC must-gather (timestamp from comment 6): alerts{_id="44REDACTED"} shows CannotRetrieveUpdates firing throughout, although there is a window from 00:10 to 05:03 where alerts was not flowing to Telemetry, although most other metrics were still being uploaded. However, the issues with update freshness were not being reported in ClusterVersion conditions: $ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' cluster-version.json 2022-01-04T16:17:44Z Available=True : Done applying 4.10.15 2022-06-07T15:41:08Z Failing=False : 2022-06-07T16:18:08Z Progressing=False : Cluster version is 4.10.15 2022-06-07T15:52:44Z RetrievedUpdates=True : 2022-06-07T15:05:12Z Upgradeable=False IncompatibleOperatorsInstalled: Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ... 2022-06-07T15:05:23Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.10.15" image="quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb" I trimmed the IncompatibleOperatorsInstalled message, because as "should not be upgraded between minor versions", that's just about updates to 4.11, which aren't relevant for this 4.10 cluster yet. But RetrievedUpdates=True here isn't great, since the fact that CannotRetrieveUpdates is firing and the lack of 4.10.20 and later in availableUpdates shows that we are struggling with this retrieval. Digging back in Telemetry history, CannotRetrieveUpdates has been firing with some unrelated gaps since around 2022-07-04 8:32 UTC. Back again to graph-data: $ git --no-pager log --first-parent --date=iso --format='%ad %h %s' -1 origin/master -- blocked-edges 2022-06-29 17:06:47 +0000 4c92f3a Merge pull request #2118 from wking/conditional-risk-for-hyphen-prefixed-release-data so the alert began firing a few days after we made those update recommendations conditional. And because we were getting other alerts like Watchdog for hours+ before CannotRetrieveUpdates shows up in Telemetry, it's likely that update retrieval was working before the alert started firing. Pulling CVO logs out of the must-gather: $ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-75bb5448f6-x29lz/cluster-version-operator/cluster-version-operator/logs/current.log <02715548_must-gather-20220721_052723Z.tar.gz >cvo.log The logs cover several hours: $ head -n1 cvo.log 2022-07-20T21:35:16.943832468Z I0720 21:35:16.942716 1 sync_worker.go:840] Done syncing for customresourcedefinition "machineconfigs.machineconfiguration.openshift.io" (642 of 771) $ tail -n1 cvo.log 2022-07-21T05:03:08.910533346Z I0721 05:03:08.910493 1 cvo.go:517] Finished syncing cluster version "openshift-cluster-version/version" (12.170664ms) but the commit has long since rotated out. Grabbing it off the image: $ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-75bb5448f6-x29lz/cluster-version-operator-75bb5448f6-x29lz.yaml <02715548_must-gather-20220721_052723Z.tar.gz | yaml2json | jq -r '.spec.containers[].image' quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb $ oc image info -o json quay.io/openshift-release-dev/ocp-release@sha256:ddcb70ce04a01ce487c0f4ad769e9e36a10c8c832a34307c1b1eb8e03a5b7ddb | jq -r '.config.config.Env[]' | grep '^SOURCE_GIT_COMMIT=' SOURCE_GIT_COMMIT=b3da2d3eba82adcd53198d662607f21641817c4a Back to the CVO logs, nothing in there out of availableupdates.go: $ grep availableupdates cvo.log ...no hits... and I'd have expected one of [1]. Bubbling up a level [2], still no sign of life: $ grep 'syncing available updates' cvo.log ...no hits... So the goroutine launched by [3] seems wedged. Possibly it wedged connecting to the sad Thanos service while trying to resolve a PromQL conditional risk, and we need to get a timeout in place to guard against that? If we can find a cluster currently exhibiting these symptoms, I would be very interested in a CVO stack trace (see [4] for a tail-and-SIGQUIT process for capturing stack traces). [1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/availableupdates.go#L41-L54 [2]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/cvo.go#L588-L590 [3]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/cvo/cvo.go#L365-L372 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5
Using "falsely claims RetrievedUpdates=True despite CannotRetrieveUpdates firing" as our marker, the issue seems rare in the Telemetry-reporting fleet, so I'm setting blocker-.
Matt got me a tail-and-SIGQUIT stack trace (thanks :), and looking inside: $ sha256sum bz2109374-cvo.log.gz d729343215dc7062b21ee014a8622a0f8beb75239c20dd92e6c8e7f9cf04df3d bz2109374-cvo.log.gz $ zcat bz2109374-cvo.log.gz | head -n1 I0727 19:54:53.897855 1 cvo.go:515] Started syncing cluster version "openshift-cluster-version/version", spec changes, status, and payload (2022-07-27 19:54:53.897839616 +0000 UTC m=+4333456.961228352) $ zgrep '^[IE][0-9]' bz2109374-cvo.log.gz | tail -n1 I0727 23:17:23.912294 1 cvo.go:517] Finished syncing cluster version "openshift-cluster-version/version" (14.289749ms) So we captured 3+ hours of logs, which with a functioning RetrievedUpdates loop should get us many rounds of attempts. However, none of the expected logs: $ grep 'syncing available updates' bz2109374-cvo.log.gz ...no hits... Big stack dump at the end of the file: $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | wc -l 1262 Looking for my cvo.go#L365-L372 goroutine launch point: $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | grep -n 'cvo[.]go:\(36[5-9]\|7[02]\)' 784: /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:369 +0xa5 798: /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:368 +0xb6 800: /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:366 +0x553 Likely all from a single goroutine. $ zgrep -A10000 'SIGQUIT: quit' bz2109374-cvo.log.gz | grep -nB100 'cvo[.]go:366' | grep 'goroutine ' 720-goroutine 173 [select, 2 minutes]: 738-goroutine 174 [select, 34062 minutes]: Yup, so goroutine 174 in stack lines 738 through 800+. $ zgrep -A62 'goroutine 174' bz2109374-cvo.log.gz | grep cluster-version-operator | grep -v vendor | head -n4 github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0x2c1c240, {0x1db94b0, 0xc0011f21c0}, 0xc000011488) /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:83 +0x366 github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0x2c1c240, {0x1db94b0, 0xc0011f21c0}, 0xc000011470) /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:133 +0xa44 Looks pretty solid for "hung in the PromQL request" to me. And we get to pass a Context through [1], so I'll insert a "but let's not be crazy now" timeout in there... [1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/pkg/clusterconditions/promql/promql.go#L83
I've opened cvo#806 adding a timeout to the PromQL query. I'm not sure what the best way to verify the fix is. If someone has a blackhole servicer that just catches and idles incoming connections, maybe we could slot that in behind the Thanos service? Or we can eyeball the timeout, do some basic regression testing, ship it, and see how it plays in the wild.
Trying to reproduce it 1. Install a 4.9.11 cluster 2. Upgrade to 4.10.15 # oc adm upgrade Cluster version is 4.10.15 Upstream is unset, so the cluster will use an appropriate default. Channel: candidate-4.10 (available channels: candidate-4.10, candidate-4.11, eus-4.10, fast-4.10, stable-4.10) Recommended updates: VERSION IMAGE 4.10.25 quay.io/openshift-release-dev/ocp-release@sha256:ed84fb3fbe026b3bbb4a2637ddd874452ac49c6ead1e15675f257e28664879cc 4.10.24 quay.io/openshift-release-dev/ocp-release@sha256:aab51636460b5a9757b736a29bc92ada6e6e6282e46b06e6fd483063d590d62a 4.10.23 quay.io/openshift-release-dev/ocp-release@sha256:e40e49d722cb36a95fa1c03002942b967ccbd7d68de10e003f0baa69abad457b 4.10.22 quay.io/openshift-release-dev/ocp-release@sha256:62c995079672535662ee94ef2358ee6b0e700475c38f6502ca2d3d13d9d7de5b 4.10.21 quay.io/openshift-release-dev/ocp-release@sha256:420ee7160d4970304ae97a1b0a77d9bd52af1fd97c597d7cb5d5a2c0d0b72dda 4.10.20 quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600 Additional updates which are not recommended based on your cluster configuration are available, to view those re-run the command with --include-not-recommended. So far so good. Make thanos degraded. # cat cm.yaml apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | thanosQuerier: nodeSelector: deploy: unknown # oc create -f cm.yaml configmap/cluster-monitoring-config created # oc -n openshift-monitoring get pod | grep thanos-querier thanos-querier-5584f4d459-bzsm4 6/6 Running 0 3h24m thanos-querier-647556f965-cfhck 0/6 Pending 0 7s thanos-querier-647556f965-wx2tn 0/6 Pending 0 7s # oc -n openshift-monitoring get rs | grep thanos-querier thanos-querier-5584f4d459 1 1 1 3h48m thanos-querier-647556f965 2 2 0 18s # oc -n openshift-monitoring delete rs thanos-querier-5584f4d459 replicaset.apps "thanos-querier-5584f4d459" deleted # oc adm upgrade channel --allow-explicit-channel warning: Clearing channel "stable-4.10"; cluster will no longer request available update recommendations. Restart CVO # oc adm upgrade Cluster version is 4.10.15 Upstream is unset, so the cluster will use an appropriate default. Channel: stable-4.10 (available channels: candidate-4.10, candidate-4.11, eus-4.10, fast-4.10, stable-4.10) Recommended updates: VERSION IMAGE 4.10.23 quay.io/openshift-release-dev/ocp-release@sha256:e40e49d722cb36a95fa1c03002942b967ccbd7d68de10e003f0baa69abad457b 4.10.22 quay.io/openshift-release-dev/ocp-release@sha256:62c995079672535662ee94ef2358ee6b0e700475c38f6502ca2d3d13d9d7de5b 4.10.21 quay.io/openshift-release-dev/ocp-release@sha256:420ee7160d4970304ae97a1b0a77d9bd52af1fd97c597d7cb5d5a2c0d0b72dda 4.10.20 quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600 Additional updates which are not recommended based on your cluster configuration are available, to view those re-run the command with --include-not-recommended. Sigh, I can't reproduce it. Will do the regression testing when the nightly build is out.
Verifying it by performing regression testing on 4.12.0-0.nightly-2022-07-31-235028. # oc adm upgrade --include-not-recommended Cluster version is 4.12.0-0.nightly-2022-07-31-235028 Upstream: https://storage.googleapis.com/ocp-ota-1659338073/graph-1659338074 Channel: buggy Recommended updates: VERSION IMAGE 4.99.999999 registry.ci.openshift.org/ocp/release@sha256:9999999999999999999999999999999999999999999999999999999999999999 Supported but not recommended updates: Version: 4.88.888888 Image: registry.ci.openshift.org/ocp/release@sha256:8888888888888888888888888888888888888888888888888888888888888888 Recommended: False Reason: ReleaseIsRejected Message: Too many CI failures on this release, so do not update to it https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-24-075634 Version: 4.77.777777 Image: registry.ci.openshift.org/ocp/release@sha256:7777777777777777777777777777777777777777777777777777777777777777 Recommended: False Reason: SomeInvokerThing Message: On clusters on default invoker user, this imaginary bug can happen. https://bug.example.com/a All the updates are resolved successfully. CVO seems not to get stuck. Moving it to verified state.
Workaround: Restarting both the prometheus-adapters and the thanos-queriers (in openshift-monitoring) can mitigate the issue.
The backport bug 2115564 shipped in 4.11.1, so I don't think it's worth calling out in 4.12 release notes, and I've set No Doc Update. But if you wanted to doc it, the doc-text template would be something like: Cause: The cluster-version operator's PromQL update risk evaluation engine lacked a timeout. Consequence: Some flavors of sad Thanos could hold cluster-version operator PromQL calls indefinitely, blocking the cluster-version operator from assessing other risks or retrieving fresh update recommendations from the upstream update service. Workaround (if any): Comment 18. Fix: The cluster-version operator now caps Thanos queries at 5 minutes. Result: When Thanos is slow, the cluster-version operator gives up, considers the exposure to that risk unknown, and moves on to process additional risks and retreive newer update recommendations from the upstream update service.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399