Description of problem:

OCP 4.1 UPI install completes, but the update channel is set to fast.

Cluster information:

# oc adm release info
Name:      4.1.0-rc.0
Digest:    sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
Created:   2019-04-23T14:45:52Z
OS/Arch:   linux/amd64
Manifests: 273

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867

Release Metadata:
  Version:  4.1.0-rc.0
  Upgrades: <none>
  Metadata:
    description: Beta 4

Component Versions:
  Kubernetes 1.13.4

OVA Image: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/latest/rhcos-410.8.20190418.1-vmware.ova

How reproducible:

100%

Steps to Reproduce:
1. Deploy a vSphere cluster per the docs (https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html)
2. oc get clusterversion -o yaml

Actual results:

# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-05-10T14:04:37Z"
    generation: 1
    name: version
    resourceVersion: "20263"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 8ba9afe1-732c-11e9-bfb5-0050569b5e80
  spec:
    channel: fast
    clusterID: 3f92c3b3-9d45-4ba9-97eb-78305cdb0dae
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      message: Done applying 4.1.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      message: Cluster version is 4.1.0-rc.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2019-05-10T14:04:37Z"
      message: 'Unable to retrieve available updates: unknown version 4.1.0-rc.0'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
      version: 4.1.0-rc.0
    history:
    - completionTime: "2019-05-10T14:38:34Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
      startedTime: "2019-05-10T14:04:37Z"
      state: Completed
      version: 4.1.0-rc.0
    observedGeneration: 1
    versionHash: jHX1796OCic=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:

Release channel is stable, as with an IPI install to AWS.
Assuming the 'fast' channel is coming from the CVO's defaulting logic:

https://github.com/openshift/cluster-version-operator/blob/0386842157d4db5d27ab5935db3cb69c52687d9d/pkg/cvo/cvo.go#L463-L479
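One quick way to check whether a live object came from that defaulting rather than from the installer's manifest (this assumes the installer's manifest leaves spec.upstream unset, which I haven't double-checked):

$ oc get clusterversion version -o jsonpath='{.spec.channel} {.spec.upstream}{"\n"}'
fast https://api.openshift.com/api/upgrades_info/v1/graph

Seeing the 'fast' channel together with a populated api.openshift.com upstream would be consistent with the hard-coded defaults in the linked code.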
The ClusterVersion created by master's e2e-vsphere tests sets the channel to "stable-4.1". See https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-vsphere/26/artifacts/e2e-vsphere/clusterversion.json.
Backing code for the installer-generated ClusterVersion is in [1], in case that helps. bootkube.service logs from the bootstrap machine (you'll have to gather them before tearing down the bootstrap resources) and logs from the CVO container might help explain why you're getting the CVO's default instead of the content from the installer's manifest.

[1]: https://github.com/openshift/installer/pull/1599/files
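Gathering those by hand can be as simple as the following sketch. <bootstrap-ip> and <container-id> are placeholders, and depending on how the CVO container was launched on your bootstrap machine you may need 'sudo podman ...' instead of 'sudo crictl ...':

$ ssh core@<bootstrap-ip> journalctl -b -u bootkube.service > bootkube.log
$ ssh core@<bootstrap-ip> 'sudo crictl ps -a' | grep cluster-version
$ ssh core@<bootstrap-ip> 'sudo crictl logs <container-id>' > cvo.log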
Created attachment 1566793: installer-gather logs from bootstrap
(In reply to W. Trevor King from comment #3)
> Backing code for the installer-generated ClusterVersion is in [1], in case
> that helps. bootkube.service logs from the bootstrap machine (you'll have
> to gather them before tearing down the bootstrap resources) and logs from
> the CVO container might help explain why you're getting the CVO's default
> instead of the content from the installer's manifest.
>
> [1]: https://github.com/openshift/installer/pull/1599/files

Attached installer-gather logs.
$ tar xf log-bundle_upivm.tar.gz
$ jq -r '.items[].spec.channel' resources/clusterversion.json
pre-release-4.1

This does not match your initial 'fast' from comment 0. And the hyphenated form is wrong too [1]. I'm assuming you used the console to change it, and checking the logs to see if I can reconstruct the history for this value...

[1]: https://github.com/openshift/console/pull/1498
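For anyone else who lands here with a hand-mangled channel, you don't need the console to put it back; something like this should work:

$ oc patch clusterversion version --type merge -p '{"spec": {"channel": "stable-4.1"}}'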
$ grep -i clusterversion bootstrap/journals/bootkube.log
May 10 14:04:36 boots-int bootkube.sh[1451]: "0000_00_cluster-version-operator_01_clusterversion.crd.yaml": unable to get REST mapping: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
May 10 14:04:36 boots-int bootkube.sh[1451]: "cvo-overrides.yaml": unable to get REST mapping: no matches for kind "ClusterVersion" in version "config.openshift.io/v1"
May 10 14:04:42 boots-int bootkube.sh[1451]: "cvo-overrides.yaml": unable to get REST mapping: no matches for kind "ClusterVersion" in version "config.openshift.io/v1"
May 10 14:04:44 boots-int bootkube.sh[1451]: Skipped config.openshift.io/v1, Resource=clusterversions as it already exists
May 10 14:14:39 boots-int bootkube.sh[1451]: Skipped config.openshift.io/v1, Resource=clusterversions as it already exists

So possibly a race here, where the 14:04:42 push failed, the CVO filled in its default, and the 14:04:44 attempt was too late. I'll check the CVO logs to confirm, but we may need to make the CVO less enthusiastic about pushing its default and/or teach cluster-bootstrap to compare content instead of just existence.
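If you still have the asset directory around, you can also compare the installer's rendered manifest against the live object directly (a sketch; <asset-dir> is a placeholder):

$ grep 'channel:' <asset-dir>/manifests/cvo-overrides.yaml
$ oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'

If cluster-bootstrap lost the race, the first command shows the installer's channel while the second shows the CVO's 'fast' default.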
$ grep cluster-version bootstrap/pods/*inspect
bootstrap/pods/edb6158916ec.inspect:        "Path": "/usr/bin/cluster-version-operator",
bootstrap/pods/edb6158916ec.inspect:            "/usr/bin/cluster-version-operator",
bootstrap/pods/edb6158916ec.inspect:        "Entrypoint": "/usr/bin/cluster-version-operator",
$ ls -l bootstrap/pods/edb6158916ec.log
-rw-r--r--. 1 trking trking 0 May 10 12:01 bootstrap/pods/edb6158916ec.log

Well, that's unfortunate ;). The gathered CVO container log is empty, so maybe try again with a newer release than 4.1.0-rc.0, since there have been some installer-gather improvements in the meantime.
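For reference, with those newer releases the installer can do the collection itself (hypothetical IPs):

$ openshift-install gather bootstrap --bootstrap <bootstrap-ip> --master <master-ip>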
(In reply to W. Trevor King from comment #6)
> $ tar xf log-bundle_upivm.tar.gz
> $ jq -r '.items[].spec.channel' resources/clusterversion.json
> pre-release-4.1
>
> This does not match your initial 'fast' from comment 0. And the hyphenated
> form is wrong too [1]. I'm assuming you used the console to change it, and
> checking the logs to see if I can reconstruct the history for this value...
>
> [1]: https://github.com/openshift/console/pull/1498

Correct, it has since been changed manually through the web console.
Looking for this in CI, in case it is a race between the CVO pushing a default and cluster-bootstrap pushing the installer's manifest, I see:

$ find ~/.cache/openshift-deck-build-logs -name clusterversion.json -execdir grep -o '"channel": ".*"' {} \+ | sort | uniq -c
      1 "channel": "stable-4.0"
     41 "channel": "stable-4.1"
$ grep -r '"channel": "stable-4.0"' ~/.cache/openshift-deck-build-logs
/home/trking/.cache/openshift-deck-build-logs/pr-logs/pull/openshift_release/3748/rehearse-3748-pull-ci-openshift-machine-config-operator-release-4.0-e2e-aws-scaleup-rhel7/1/clusterversion.json:    "channel": "stable-4.0",

But no 'fast'. Still, it's been a quiet day, so while the odds are good for these mostly installer-provisioned-infrastructure AWS runs, I'm not yet comfortable ruling out a race.
So despite the lack of CI evidence (maybe I'm just holding that wrong), we're going to move ahead and treat this as a race. I'll drop the defaulting logic from the cluster-version operator, and have it sit quietly waiting for the ClusterVersion object to get pushed before it does anything. That will resolve the create-time race. That opens us up to admins accidentally deleting ClusterVersion later, but we will block that (eventually) via an admission controller. The current plan is to resolve this bug when we've completed the cluster-version operator part of this, and to leave the admission controller to a separate bug/ticket. Let me know if anyone wants to adjust this plan :).
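Once the CVO side lands, the bootstrap-time CVO log should show it waiting for the object instead of creating one. A hedged sketch of how to check on a running cluster (the exact log message is up to the implementation):

$ oc -n openshift-cluster-version logs deployment/cluster-version-operator | grep -i 'waiting'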
I'm a little allergic to the idea that I can delete the ClusterVersion object, can't recover the cluster afterwards, and get no status telling me why. But I think it's ok to turn off defaulting first.
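For concreteness, the eventual deletion guard could look roughly like the following ValidatingWebhookConfiguration sketch. This is not a shipped manifest, and every name in it is hypothetical:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: clusterversion-delete-guard   # hypothetical name
webhooks:
- name: deletes.clusterversion.example.com   # hypothetical
  failurePolicy: Fail   # reject deletes if the webhook is unreachable
  rules:
  - operations: ["DELETE"]
    apiGroups: ["config.openshift.io"]
    apiVersions: ["v1"]
    resources: ["clusterversions"]
  clientConfig:
    service:
      namespace: openshift-cluster-version
      name: clusterversion-guard   # hypothetical service backing the webhook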
From CI data, we never see more than 1-5 clusters using the "fast" channel at any given time (hard to say whether those are CI or local iteration): https://www.dropbox.com/s/bzlnk2lmzkyyr1x/Screenshot%202019-05-14%2017.34.56.png?dl=0
Maybe worth noting: this issue persists on the beta5 drop.

# oc adm release info
Name:      4.1.0-rc.3
Digest:    sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208
Created:   2019-05-10T18:39:16Z
OS/Arch:   linux/amd64
Manifests: 287

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208

Release Metadata:
  Version:  4.1.0-rc.3
  Upgrades: <none>
  Metadata:
    description: beta 5
  Metadata:
    url: https://errata.devel.redhat.com/advisory/38252

Component Versions:
  Kubernetes 1.13.4

# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-05-15T12:53:14Z"
    generation: 1
    name: version
    resourceVersion: "17387"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 67098406-7710-11e9-89d0-0050569b5e80
  spec:
    channel: fast
    clusterID: e30624c2-487e-4646-81e4-02b060dcc070
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-05-15T13:16:52Z"
      message: Done applying 4.1.0-rc.3
      status: "True"
      type: Available
    - lastTransitionTime: "2019-05-15T13:05:36Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2019-05-15T13:16:52Z"
      message: Cluster version is 4.1.0-rc.3
      status: "False"
      type: Progressing
    - lastTransitionTime: "2019-05-15T12:53:14Z"
      message: 'Unable to retrieve available updates: currently installed version 4.1.0-rc.3 not found in the "fast" channel'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208
      version: 4.1.0-rc.3
    history:
    - completionTime: "2019-05-15T13:16:52Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208
      startedTime: "2019-05-15T12:53:14Z"
      state: Completed
      verified: false
      version: 4.1.0-rc.3
    observedGeneration: 1
    versionHash: CsNEu_DKlWg=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

100% reproducible for me.
Created attachment 1568991: log bundle for beta5

Updated log bundle for the beta5 install.
This was seen recently by a user running 4.1.2. See https://github.com/openshift/installer/issues/1884#issuecomment-504921970.
We have not been able to reproduce this yet; feel free to re-open if you see it again.
Turned up again in 4.1.9. Definitely a CVO vs. cluster-bootstrap race. We need to remove the CVO's ClusterVersion defaulting logic.
I've filed bug 1741786 to address this issue in 4.2, and redirected this bug to target the 4.1.z backport.
Master fix landed via bug 1741786. Backport filed, but we can't land it until the master bug is VERIFIED [1].

[1]: https://github.com/openshift/cluster-version-operator/pull/242#issuecomment-523670684
Version: 4.1.0-0.nightly-2019-09-14-050039

Before creating the Ignition files, I edited manifests/cvo-overrides.yaml so that pushing the installer-provided ClusterVersion would fail:

$ openshift-install create manifests
$ sed -i 's/name: version/name: get-lost/' manifests/cvo-overrides.yaml

Bootstrap failed, as expected:

INFO Waiting up to 30m0s for the Kubernetes API at https://api.jliu-27870.qe.devcluster.openshift.com:6443...
INFO API v1.13.4+f61b934 up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help

And the CVO log shows the expected waiting:

...
I0916 08:02:54.491115       1 cvo.go:350] Started syncing cluster version "openshift-cluster-version/version" (2019-09-16 08:02:54.491111375 +0000 UTC m=+45.604482397)
I0916 08:02:54.491155       1 cvo.go:366] No ClusterVersion object and defaulting not enabled, waiting for one
...

A normal vSphere installation on the same version works well:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-09-14-050039   True        False         6s      Cluster version is 4.1.0-0.nightly-2019-09-14-050039
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2820