Bug 1907329

Summary: CLUSTER_PROFILE env. variable is not used by the CVO
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.7
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Type: Bug
Reporter: Guillaume Rose <gurose>
Assignee: Guillaume Rose <gurose>
QA Contact: Johnny Liu <jialiu>
CC: aos-bugs, jokerman, jsafrane, lmohanty, wking
Last Closed: 2021-02-24 15:43:16 UTC
Bug Depends On: 1891068
Bug Blocks: 1925199

Description Guillaume Rose 2020-12-14 09:10:49 UTC
As per the enhancement merged for the 4.6 release [1], the CVO should use the CLUSTER_PROFILE env. variable to select the manifests to apply.
As of today, this doesn't work because the CVO has no code that reads the variable.

[1] https://github.com/openshift/enhancements/pull/200


The prerequisite, https://bugzilla.redhat.com/show_bug.cgi?id=1871890, is not done yet, and 2 PRs are still open:


After that, the env. variable can be used by the CVO with, for instance, this PR:

Comment 2 Johnny Liu 2021-01-13 06:53:26 UTC
I am following https://github.com/openshift/enhancements/blob/master/enhancements/update/cluster-profiles.md#with-the-installer to test this.

$ openshift-install version
openshift-install 4.7.0-0.nightly-2021-01-12-203716
built from commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
release image registry.ci.openshift.org/ocp/release@sha256:c97466158d19a6e6b5563da4365d42ebe5579421b1163f3a2d6778ceb5388aed
$ openshift-install create cluster

The installation completed successfully, but I did not see the CLUSTER_PROFILE env. variable injected into the CVO Deployment.

$ oc -n openshift-cluster-version get deployment.apps/cluster-version-operator -o yaml|grep -i PROFILE

Run the following command to find manifests in the current directory that are annotated for the `self-managed-high-availability` profile but not for `single-node-developer`:

$ for i in `ls`; do if grep -q "include.release.openshift.io/self-managed-high-availability" $i; then if ! grep -q "include.release.openshift.io/single-node-developer" $i; then echo $i; fi; fi; done

Take 0000_90_cluster-update-keys_configmap.yaml as the test target.
$ cat 0000_90_cluster-update-keys_configmap.yaml 
kind: ConfigMap
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/verification-config-map: ""
  creationTimestamp: null
  name: release-verification
  namespace: openshift-config-managed
$ oc get cm release-verification -n openshift-config-managed
NAME                   DATA   AGE
release-verification   3      147m

As I understand it, `release-verification` should NOT be created once the user sets CLUSTER_PROFILE=single-node-developer.

On the bootstrap node, check /usr/local/bin/bootkube.sh:
if [ ! -f cvo-bootstrap.done ]
then
        echo "Rendering Cluster Version Operator Manifests..."

        rm --recursive --force cvo-bootstrap

        bootkube_podman_run \
                --volume "$PWD:/assets:z" \
                --env CLUSTER_PROFILE="single-node-developer" \
                "${RELEASE_IMAGE_DIGEST}" \
                render \
                        --output-dir=/assets/cvo-bootstrap \

        cp cvo-bootstrap/bootstrap/* bootstrap-manifests/
        cp cvo-bootstrap/manifests/* manifests/
        ## FIXME: CVO should use `/etc/kubernetes/bootstrap-secrets/kubeconfig` instead
        cp auth/kubeconfig-loopback /etc/kubernetes/kubeconfig

        touch cvo-bootstrap.done
fi

The setting already takes effect in bootkube.sh.

But it seems this env. variable is never respected by the CVO. Am I missing anything?

Comment 3 Guillaume Rose 2021-01-13 08:41:05 UTC
The work we did for 4.7 is preparatory work for 4.8. It doesn't work with the installer.

The current implementation is really made for IBM Cloud: if you deploy the CVO as they do and pass the CLUSTER_PROFILE env var, then it will be used.

With the installer, it's different. We haven't yet added the env. variable to the manifests; it will be added in 4.8. If we did it here, it would break the upgrade from 4.6, as the old CVO doesn't know what to do with the {{ .ClusterProfile }} template variable in manifests.

Also, I don't think we have 2 complete profiles. I guess we only have the default one. single-node-developer is still in progress: https://bugzilla.redhat.com/show_bug.cgi?id=1915473

Comment 4 Johnny Liu 2021-01-13 10:11:47 UTC
Thanks for the details.

I am not familiar with IBM Cloud. Can you help verify this bug, or tell me a simple way to verify it on a common cloud, such as IPI on AWS?

Comment 5 W. Trevor King 2021-01-13 18:31:56 UTC
We need at least bug 1891068 to be addressed to add a missing annotation for our current two profiles.

Comment 6 W. Trevor King 2021-01-13 18:47:03 UTC
For verification, building on Guillaume's suggestion in comment 3, you could install a vanilla AWS cluster, scale the CVO Deployment down to zero, bump the CVO Deployment to set CLUSTER_PROFILE=single-node-production-edge, and scale the CVO Deployment back up to one.  Then check the resulting CVO logs to confirm that it is only pushing single-node-production-edge manifests.  Looking for a useful manifest in a recent 4.7 image:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64
  $ grep -rA 2 self-managed-high-availability manifests | grep -B6 single-node-production-edge | head -n7
  manifests/0000_50_console-operator_sample-application-quickstart.yaml:    include.release.openshift.io/self-managed-high-availability: "true"
  manifests/0000_50_console-operator_sample-application-quickstart.yaml-    include.release.openshift.io/single-node-developer: "true"
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml:    include.release.openshift.io/self-managed-high-availability: "true"
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml-    include.release.openshift.io/single-node-developer: "true"
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml-    include.release.openshift.io/single-node-production-edge: "true"
  $ head -n4 manifests/0000_50_console-operator_sample-application-quickstart.yaml
  apiVersion: console.openshift.io/v1
  kind: ConsoleQuickStart
  metadata:
    name: sample-application

So the CVO should not attempt to sync the sample-application ConsoleQuickStart object in the single-node-production-edge profile.  We don't currently support changing profiles on the fly like this, but if we get lucky and the CVO doesn't blow up on the profile change, seeing the CVO get past the spot where that manifest used to live without trying to push it will show that the 404 code is working.
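
A minimal sketch of the check behind the grep pipeline above, assuming the manifests have already been extracted to a local directory. `only_in_profile` is a hypothetical helper, not part of the CVO or `oc`, and it matches annotation text with grep rather than parsing YAML, so it only approximates how the CVO decides which manifests belong to a profile.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CVO or oc): print the manifest files
# under DIR that carry the include annotation for profile INCLUDE but not
# the one for profile EXCLUDE.
only_in_profile() {
    dir=$1
    include=$2
    exclude=$3
    prefix='include.release.openshift.io'
    # grep -l prints the names of files that contain the fixed string.
    grep -rlF -- "${prefix}/${include}: \"true\"" "$dir" |
    while IFS= read -r file; do
        # Keep the file only if the EXCLUDE annotation is absent.
        grep -qF -- "${prefix}/${exclude}: \"true\"" "$file" ||
            printf '%s\n' "$file"
    done | sort
}
```

For example, `only_in_profile manifests self-managed-high-availability single-node-production-edge` would be expected to list the sample-application quick start manifest, which is the kind of manifest the CVO should skip in the single-node-production-edge profile.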

Comment 7 Johnny Liu 2021-01-14 03:47:24 UTC
Thanks for your detailed steps.

Verified this bug with 4.7.0-0.nightly-2021-01-12-203716.

1. run IPI install on aws.
2. oc scale --replicas=0 deployment.apps/cluster-version-operator -n openshift-cluster-version
3. edit deployment.apps/cluster-version-operator to set CLUSTER_PROFILE=single-node-production-edge
      - args:
        - start
        - --release-image=registry.ci.openshift.org/ocp/release@sha256:c97466158d19a6e6b5563da4365d42ebe5579421b1163f3a2d6778ceb5388aed
        - --enable-auto-update=false
        - --enable-default-cluster-version=true
        - --serving-cert-file=/etc/tls/serving-cert/tls.crt
        - --serving-key-file=/etc/tls/serving-cert/tls.key
        - --v=5
        env:
        - name: CLUSTER_PROFILE
          value: single-node-production-edge
4. oc scale --replicas=1 deployment.apps/cluster-version-operator -n openshift-cluster-version
5. check 'sample-application' ConsoleQuickStart
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     21h
sample-application   21h
6. delete 'ocs-install-tour' and 'sample-application' ConsoleQuickStart together
$ oc delete ConsoleQuickStart ocs-install-tour sample-application
consolequickstart.console.openshift.io "ocs-install-tour" deleted
consolequickstart.console.openshift.io "sample-application" deleted
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
7. wait a few minutes
8. check again: "ocs-install-tour" has been recreated, but "sample-application" has not
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     3m25s
9. edit deployment.apps/cluster-version-operator to remove CLUSTER_PROFILE=single-node-production-edge setting
10. wait a few minutes for the CVO to be redeployed and for sample-application to be synced, then check again
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     12m
sample-application   17s

Now 'sample-application' is synced and recreated, which means CLUSTER_PROFILE takes effect.
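
The profile switch in steps 2 to 4 and step 9 could be scripted roughly as follows. This is a sketch under assumptions, not a supported procedure: `set_profile`, `clear_profile`, and the `OC`, `NS`, and `DEPLOY` variables are illustrative names, and the script uses `oc set env` in place of editing the Deployment by hand. As noted in comment 6, changing profiles on a running cluster is not supported.

```shell
#!/bin/sh
# Illustrative names only; OC can be overridden to dry-run the sequence.
OC=${OC:-oc}
NS=openshift-cluster-version
DEPLOY=deployment.apps/cluster-version-operator

# Steps 2-4: scale the CVO down, set CLUSTER_PROFILE, scale it back up.
set_profile() {
    $OC scale --replicas=0 "$DEPLOY" -n "$NS"
    $OC set env "$DEPLOY" "CLUSTER_PROFILE=$1" -n "$NS"
    $OC scale --replicas=1 "$DEPLOY" -n "$NS"
}

# Step 9: a trailing dash asks `oc set env` to remove the variable.
clear_profile() {
    $OC set env "$DEPLOY" CLUSTER_PROFILE- -n "$NS"
}
```

After `set_profile single-node-production-edge`, delete the two ConsoleQuickStart objects and list them again a few minutes later, as in steps 6 to 8; `clear_profile` corresponds to step 9.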

Comment 11 errata-xmlrpc 2021-02-24 15:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.