Bug 1708697 - UPI installs default to fast update channel
Summary: UPI installs default to fast update channel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1741786
Blocks:
Reported: 2019-05-10 15:33 UTC by Timothy Rees
Modified: 2019-09-25 07:28 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator's defaulting logic could outrace the cluster-bootstrap logic and inject unintended ClusterVersion content.
Consequence: When the cluster-version operator won the race, the configured update channel was "fast" (which does not exist) and the configured clusterID diverged from the one listed in the installer's output metadata.json.
Fix: The cluster-version operator no longer supplies a default ClusterVersion during the bootstrap phase.
Result: With the fix, there is no longer a race to recover from. Existing clusters affected by the race can be recovered manually by updating their in-cluster ClusterVersion to use the desired channel and clusterID.
Clone Of:
: 1741786
Environment:
Last Closed: 2019-09-25 07:27:53 UTC
Target Upstream Version:
Embargoed:
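
The manual recovery mentioned in the Doc Text amounts to a merge patch against the ClusterVersion object named "version". A minimal sketch in Python, assuming the installer's metadata.json stores the ID under a "clusterID" key and that "stable-4.1" is the channel you want. Both are assumptions, and build_recovery_patch is a hypothetical helper, not part of any OpenShift tool.

```python
import json


def build_recovery_patch(metadata_json: str, channel: str) -> str:
    """Return a JSON merge patch that restores the channel and clusterID."""
    metadata = json.loads(metadata_json)
    patch = {
        "spec": {
            "channel": channel,
            # Assumption: metadata.json stores the ID under "clusterID".
            "clusterID": metadata["clusterID"],
        }
    }
    return json.dumps(patch, sort_keys=True)


if __name__ == "__main__":
    # Placeholder content; read the real metadata.json from the
    # installer's asset directory instead.
    example = '{"clusterID": "00000000-0000-0000-0000-000000000000"}'
    print(build_recovery_patch(example, "stable-4.1"))
```

The printed document could then be applied with something like `oc patch clusterversion version --type merge -p '<patch>'`, after checking that the channel name is one the update service actually serves.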


Attachments
installer-gather logs from bootstrap (616.85 KB, application/gzip), 2019-05-10 19:25 UTC, Timothy Rees
log bundle for beta5 (935.55 KB, application/gzip), 2019-05-15 13:37 UTC, Timothy Rees


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 242 | 0 | None | closed | Bug 1708697: pkg/cvo: Drop ClusterVersion defaulting during bootstrap | 2021-01-13 20:13:41 UTC
Red Hat Product Errata | RHBA-2019:2820 | 0 | None | None | None | 2019-09-25 07:28:02 UTC

Description Timothy Rees 2019-05-10 15:33:08 UTC
Description of problem:

An OCP 4.1 UPI install completes, but the update channel is set to "fast".

Cluster information:

# oc adm release info
Name:      4.1.0-rc.0                                                                                                  
Digest:    sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867                                     
Created:   2019-04-23T14:45:52Z                                                                                        
OS/Arch:   linux/amd64                                                                                                 
Manifests: 273                                                                                                         
                                                                                                                       
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
                                                                                                                       
Release Metadata:                                                                                                      
  Version:  4.1.0-rc.0                                                                                                 
  Upgrades: <none>                                                                                                     
  Metadata:                                                                                                            
    description: Beta 4                                                                                                
                                                                                                                       
Component Versions:                                                                                                    
  Kubernetes 1.13.4                                


OVA Image:
https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/latest/rhcos-410.8.20190418.1-vmware.ova



How reproducible:

100%

Steps to Reproduce:
1. Deploy vmware cluster per docs ( https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html )
2. oc get clusterversion -o yaml

Actual results:

# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-05-10T14:04:37Z"
    generation: 1
    name: version
    resourceVersion: "20263"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 8ba9afe1-732c-11e9-bfb5-0050569b5e80
  spec:
    channel: fast
    clusterID: 3f92c3b3-9d45-4ba9-97eb-78305cdb0dae
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      message: Done applying 4.1.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2019-05-10T14:38:34Z"
      message: Cluster version is 4.1.0-rc.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2019-05-10T14:04:37Z"
      message: 'Unable to retrieve available updates: unknown version 4.1.0-rc.0'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
      version: 4.1.0-rc.0
    history:
    - completionTime: "2019-05-10T14:38:34Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867
      startedTime: "2019-05-10T14:04:37Z"
      state: Completed
      version: 4.1.0-rc.0
    observedGeneration: 1
    versionHash: jHX1796OCic=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Expected results:

The update channel is a stable channel, as it is after an IPI install to AWS.


Comment 1 Timothy Rees 2019-05-10 15:36:54 UTC
I assume the CVO is defaulting the channel to "fast":

https://github.com/openshift/cluster-version-operator/blob/0386842157d4db5d27ab5935db3cb69c52687d9d/pkg/cvo/cvo.go#L463-L479

Comment 2 Matthew Staebler 2019-05-10 17:08:42 UTC
The ClusterVersion created by the e2e-vsphere tests on master has its channel set to "stable-4.1". See https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-vsphere/26/artifacts/e2e-vsphere/clusterversion.json.

Comment 3 W. Trevor King 2019-05-10 18:30:57 UTC
Backing code for the installer-generated ClusterVersion is in [1], in case that helps.  bootkube.service logs from the bootstrap machine (you'll have to gather them before tearing down the bootstrap resources) and logs from the CVO container might help explain why you're getting the CVO's default instead of the content from the installer's manifest.

[1]: https://github.com/openshift/installer/pull/1599/files

Comment 4 Timothy Rees 2019-05-10 19:25:00 UTC
Created attachment 1566793
installer-gather logs from bootstrap

Comment 5 Timothy Rees 2019-05-10 19:25:44 UTC
(In reply to W. Trevor King from comment #3)
> Backing code for the installer-generated ClusterVersion is in [1], in case
> that helps.  bootkube.service logs from the bootstrap machine (you'll have
> to gather them before tearing down the bootstrap resources) and logs from
> the CVO container might help explain why you're getting the CVO's default
> instead of the content from the installer's manifest.
> 
> [1]: https://github.com/openshift/installer/pull/1599/files

Attached installer-gather logs.

Comment 6 W. Trevor King 2019-05-10 19:29:48 UTC
$ tar xf log-bundle_upivm.tar.gz 
$ jq -r '.items[].spec.channel' resources/clusterversion.json
pre-release-4.1

This does not match your initial 'fast' from comment 0.  And the hyphenated form is wrong too [1].  I'm assuming you used the console to change it, and checking the logs to see if I can reconstruct the history for this value...

[1]: https://github.com/openshift/console/pull/1498

Comment 7 W. Trevor King 2019-05-10 19:33:16 UTC
$ grep -i  clusterversion bootstrap/journals/bootkube.log 
May 10 14:04:36 boots-int bootkube.sh[1451]: "0000_00_cluster-version-operator_01_clusterversion.crd.yaml": unable to get REST mapping: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
May 10 14:04:36 boots-int bootkube.sh[1451]: "cvo-overrides.yaml": unable to get REST mapping: no matches for kind "ClusterVersion" in version "config.openshift.io/v1"
May 10 14:04:42 boots-int bootkube.sh[1451]: "cvo-overrides.yaml": unable to get REST mapping: no matches for kind "ClusterVersion" in version "config.openshift.io/v1"
May 10 14:04:44 boots-int bootkube.sh[1451]: Skipped config.openshift.io/v1, Resource=clusterversions as it already exists
May 10 14:14:39 boots-int bootkube.sh[1451]: Skipped config.openshift.io/v1, Resource=clusterversions as it already exists

So possibly a race here where the 14:04:42 push failed, the CVO filled in its default, and the 14:04:44 attempt was too late.  I'll check the CVO logs to confirm; but we may need to make the CVO less enthusiastic about pushing its default and/or teach cluster-bootstrap to compare content vs. just existence.
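
The suspected ordering can be illustrated with a toy model. This is a sketch, not the actual cluster-bootstrap or CVO code: FakeAPI, bootstrap_push, cvo_default, and run are hypothetical names, "stable-4.1" stands in for whatever the installer manifest carries, and the only behavior modeled is that both writers create the object only when it does not already exist.

```python
class FakeAPI:
    """In-memory stand-in for the API server (hypothetical)."""

    def __init__(self):
        self.crd_registered = False
        self.objects = {}

    def create(self, name, spec):
        # Creation fails until the CRD is registered, mirroring the
        # "unable to get REST mapping" errors in bootkube.log.
        if not self.crd_registered:
            raise LookupError("no matches for kind ClusterVersion")
        if name in self.objects:
            return "skipped"  # existence check only; content is not compared
        self.objects[name] = spec
        return "created"


def bootstrap_push(api):
    """Installer manifest, as pushed by cluster-bootstrap."""
    try:
        return api.create("version", {"channel": "stable-4.1"})
    except LookupError:
        return "failed"


def cvo_default(api):
    """CVO default, created only when no ClusterVersion exists."""
    try:
        return api.create("version", {"channel": "fast"})
    except LookupError:
        return "failed"


def run(order):
    api = FakeAPI()
    results = [bootstrap_push(api)]  # first push fails: CRD not registered yet
    api.crd_registered = True
    for writer in order:
        results.append(writer(api))
    return api.objects["version"]["channel"], results


if __name__ == "__main__":
    print(run([cvo_default, bootstrap_push]))  # CVO gets in first
    print(run([bootstrap_push, cvo_default]))  # installer manifest gets in first
```

In this model the final channel depends only on which writer reaches the API first after the CRD registers, and the loser's attempt is reported as skipped rather than as an error, which is consistent with the "Skipped ... as it already exists" lines above.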

Comment 8 W. Trevor King 2019-05-10 19:36:04 UTC
$ grep cluster-version bootstrap/pods/*inspect
bootstrap/pods/edb6158916ec.inspect:        "Path": "/usr/bin/cluster-version-operator",
bootstrap/pods/edb6158916ec.inspect:                "/usr/bin/cluster-version-operator",
bootstrap/pods/edb6158916ec.inspect:            "Entrypoint": "/usr/bin/cluster-version-operator",
$ ls -l bootstrap/pods/edb6158916ec.log 
-rw-r--r--. 1 trking trking 0 May 10 12:01 bootstrap/pods/edb6158916ec.log

Well that's unfortunate ;).  So maybe try again with a newer release than 4.1.0-rc.0, since there have been some installer-gather improvements in the meantime.

Comment 9 Timothy Rees 2019-05-10 19:55:53 UTC
(In reply to W. Trevor King from comment #6)
> $ tar xf log-bundle_upivm.tar.gz 
> $ jq -r '.items[].spec.channel' resources/clusterversion.json
> pre-release-4.1
> 
> This does not match your initial 'fast' from comment 0.  And the hyphenated
> form is wrong too [1].  I'm assuming you used the console to change it, and
> checking the logs to see if I can reconstruct the history for this value...
> 
> [1]: https://github.com/openshift/console/pull/1498

Correct, it has since been changed manually through the web console.

Comment 10 W. Trevor King 2019-05-10 23:43:54 UTC
Looking for this in CI, in case it is a race between the CVO pushing a default and cluster-bootstrap pushing the installer's manifest, I see:

$ find ~/.cache/openshift-deck-build-logs -name clusterversion.json -execdir grep -o '"channel": ".*"' {} \+ | sort | uniq -c
      1 "channel": "stable-4.0"
     41 "channel": "stable-4.1"
$ grep -r '"channel": "stable-4.0"' ~/.cache/openshift-deck-build-logs
/home/trking/.cache/openshift-deck-build-logs/pr-logs/pull/openshift_release/3748/rehearse-3748-pull-ci-openshift-machine-config-operator-release-4.0-e2e-aws-scaleup-rhel7/1/clusterversion.json:                "channel": "stable-4.0",

But no 'fast'.  Still, it's been a quiet day, so while the odds are good for these mostly installer-provided-infrastructure AWS runs, I'm not yet comfortable ruling out a race.
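
The same tally can be made by parsing the JSON rather than matching a string, which keeps working if the files are formatted differently. A sketch, assuming each clusterversion.json is a list document with items[].spec.channel as in the jq query from comment 6; count_channels is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_channels(root):
    """Count spec.channel values across clusterversion.json files under root."""
    counts = Counter()
    for path in Path(root).rglob("clusterversion.json"):
        try:
            doc = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or truncated artifacts
        if not isinstance(doc, dict):
            continue
        for item in doc.get("items", []):
            channel = item.get("spec", {}).get("channel")
            counts[channel if channel is not None else "<unset>"] += 1
    return counts
```

Calling `count_channels` on the cache directory and printing `most_common()` gives the equivalent of the `sort | uniq -c` output, with unset channels reported separately.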

Comment 11 W. Trevor King 2019-05-14 21:11:57 UTC
So despite the lack of CI evidence (maybe I'm just holding that wrong), we're going to move ahead and treat this as a race.  I'll drop the defaulting logic from the cluster-version operator, and have it sit quietly waiting for the ClusterVersion object to get pushed before it does anything.  That will resolve the create-time race.  That opens us up to admins accidentally deleting ClusterVersion later, but we will block that (eventually) via an admission controller.  The current plan is to resolve this bug when we've completed the cluster-version operator part of this, and to leave the admission controller to a separate bug/ticket.  Let me know if anyone wants to adjust this plan :).
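
In outline, the plan is a sync pass that does nothing until a ClusterVersion object exists. A hedged sketch of that decision in Python, not the operator's actual Go code: sync_once, get_cluster_version, create_default, and the enable_default flag are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional


def sync_once(
    get_cluster_version: Callable[[], Optional[dict]],
    create_default: Callable[[], object],
    enable_default: bool = False,
) -> str:
    """Decide what a single sync pass does."""
    current = get_cluster_version()
    if current is not None:
        return "reconcile"
    if enable_default:
        # Old behavior: create a default object, racing cluster-bootstrap.
        create_default()
        return "defaulted"
    # Planned behavior: write nothing and let a later pass retry.
    return "waiting"


if __name__ == "__main__":
    print(sync_once(lambda: None, lambda: None))               # waiting
    print(sync_once(lambda: {"spec": {}}, lambda: None))       # reconcile
```

With defaulting off, the operator never writes a ClusterVersion itself, so the only remaining writer during bootstrap is cluster-bootstrap. The trade-off noted above still applies: a deleted object is not recreated.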

Comment 12 Clayton Coleman 2019-05-14 21:33:46 UTC
I'm a little allergic to the idea that I can delete the cluster version object and I can't recover the cluster, nor is there any status telling me why.  But I think it's ok to turn off defaulting first.

Comment 13 Clayton Coleman 2019-05-14 21:35:14 UTC
From CI data we never have more than 1-5 clusters using the "fast" channel (hard to say whether it's CI or local iteration): https://www.dropbox.com/s/bzlnk2lmzkyyr1x/Screenshot%202019-05-14%2017.34.56.png?dl=0

Comment 14 Timothy Rees 2019-05-15 13:22:31 UTC
Maybe worth noting, this issue persists on the beta5 drop:

# oc adm release info
Name:      4.1.0-rc.3                                                                                                                                                                                                
Digest:    sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208                                                                                                                                   
Created:   2019-05-10T18:39:16Z                                                                                                                                                                                      
OS/Arch:   linux/amd64                                                                                                                                                                                               
Manifests: 287                                                                                                                                                                                                       
                                                                                                                                                                                                                     
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208                                                                                         
                                                                                                                                                                                                                     
Release Metadata:                                                                                                                                                                                                    
  Version:  4.1.0-rc.3                                                                                                                                                                                               
  Upgrades: <none>                                                                                                                                                                                                   
  Metadata:                                                                                                                                                                                                          
    description: beta 5                                                                                                                                                                                              
  Metadata:                                                                                                                                                                                                          
    url: https://errata.devel.redhat.com/advisory/38252                                                                                                                                                              
                                                                                                                                                                                                                     
Component Versions:                                                                                                                                                                                                  
  Kubernetes 1.13.4                            

# oc get clusterversion -o yaml
apiVersion: v1                                                                                                                                                                                                       
items:                                                                                                                                                                                                               
- apiVersion: config.openshift.io/v1                                                                                                                                                                                 
  kind: ClusterVersion                                                                                                                                                                                               
  metadata:                                                                                                                                                                                                          
    creationTimestamp: "2019-05-15T12:53:14Z"                                                                                                                                                                        
    generation: 1                                                                                                                                                                                                    
    name: version                                                                                                                                                                                                    
    resourceVersion: "17387"                                                                                                                                                                                         
    selfLink: /apis/config.openshift.io/v1/clusterversions/version                                                                                                                                                   
    uid: 67098406-7710-11e9-89d0-0050569b5e80                                                                                                                                                                        
  spec:                                                                                                                                                                                                              
    channel: fast                                                                                                                                                                                                    
    clusterID: e30624c2-487e-4646-81e4-02b060dcc070                                                                                                                                                                  
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph                                                                                                                                                   
  status:                                                                                                                                                                                                            
    availableUpdates: null                                                                                                                                                                                           
    conditions:                                                                                                                                                                                                      
    - lastTransitionTime: "2019-05-15T13:16:52Z"                                                                                                                                                                     
      message: Done applying 4.1.0-rc.3                                                                                                                                                                              
      status: "True"                                                                                                                                                                                                 
      type: Available                                                                                                                                                                                                
    - lastTransitionTime: "2019-05-15T13:05:36Z"                                                                                                                                                                     
      status: "False"                                                                                                                                                                                                
      type: Failing                                                                                                                                                                                                  
    - lastTransitionTime: "2019-05-15T13:16:52Z"                                                                                                                                                                     
      message: Cluster version is 4.1.0-rc.3                                                                                                                                                                         
      status: "False"                                                                                                                                                                                                
      type: Progressing                                                                                                                                                                                              
    - lastTransitionTime: "2019-05-15T12:53:14Z"                                                                                                                                                                     
      message: 'Unable to retrieve available updates: currently installed version                                                                                                                                    
        4.1.0-rc.3 not found in the "fast" channel'                                                                                                                                                                  
      reason: RemoteFailed                                                                                                                                                                                           
      status: "False"                                                                                                                                                                                                
      type: RetrievedUpdates                                                                                                                                                                                         
    desired:                                                                                                                                                                                                         
      force: false                                                                                                                                                                                                   
      image: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208                                                                                       
      version: 4.1.0-rc.3                                                                                                                                                                                            
    history:                                                                                                                                                                                                         
    - completionTime: "2019-05-15T13:16:52Z"                                                                                                                                                                         
      image: quay.io/openshift-release-dev/ocp-release@sha256:713aae8687cf8a3cb5c2c504f65532dfe11e1b3534448ea9eeef5b0931d3e208                                                                                       
      startedTime: "2019-05-15T12:53:14Z"                                                                                                                                                                            
      state: Completed                                                                                                                                                                                               
      verified: false                                                                                                                                                                                                
      version: 4.1.0-rc.3                                                                                                                                                                                            
    observedGeneration: 1                                                                                                                                                                                            
    versionHash: CsNEu_DKlWg=                                                                                                                                                                                        
kind: List                                                                                                                                                                                                           
metadata:                                                                                                                                                                                                            
  resourceVersion: ""                                                                                                                                                                                                
  selfLink: ""                              


100% reproducible for me.

Comment 15 Timothy Rees 2019-05-15 13:37:55 UTC
Created attachment 1568991
log bundle for beta5

Updated log bundle for the beta5 install

Comment 16 Matthew Staebler 2019-06-24 13:32:00 UTC
This was seen recently by a user using 4.1.2. See https://github.com/openshift/installer/issues/1884#issuecomment-504921970.

Comment 17 Abhinav Dahiya 2019-07-25 23:25:54 UTC
This is not reproducible yet. Feel free to reopen if you see this again.

Comment 18 W. Trevor King 2019-08-16 06:07:34 UTC
Turned up again in 4.1.9.  Definitely a CVO vs. cluster-bootstrap race.  We need to remove the CVO's ClusterVersion defaulting logic.

Comment 19 W. Trevor King 2019-08-16 06:30:48 UTC
I've filed bug 1741786 to address this issue in 4.2, and redirected this bug to target the 4.1.z backport.

Comment 21 W. Trevor King 2019-08-21 22:25:54 UTC
Master fix landed via bug 1741786.  Backport filed, but we can't land it until the master bug is VERIFIED [1].

[1]: https://github.com/openshift/cluster-version-operator/pull/242#issuecomment-523670684

Comment 23 liujia 2019-09-16 08:42:50 UTC
Version: 4.1.0-0.nightly-2019-09-14-050039

Before creating the ignition files, edit manifests/cvo-overrides.yaml to force the installer-provided ClusterVersion to fail, as follows:
$ openshift-install create manifests
$ sed -i 's/name: version/name: get-lost/' manifests/cvo-overrides.yaml

Bootstrap fails as expected.
INFO Waiting up to 30m0s for the Kubernetes API at https://api.jliu-27870.qe.devcluster.openshift.com:6443... 
INFO API v1.13.4+f61b934 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO Use the following commands to gather logs from the cluster 
INFO openshift-install gather bootstrap --help   

The CVO log shows the expected message.
...
I0916 08:02:54.491115       1 cvo.go:350] Started syncing cluster version "openshift-cluster-version/version" (2019-09-16 08:02:54.491111375 +0000 UTC m=+45.604482397)
I0916 08:02:54.491155       1 cvo.go:366] No ClusterVersion object and defaulting not enabled, waiting for one
...

A normal installation on vSphere also works with the above version.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-09-14-050039   True        False         6s      Cluster version is 4.1.0-0.nightly-2019-09-14-050039

Comment 25 errata-xmlrpc 2019-09-25 07:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2820

