1907329 – CLUSTER_PROFILE env. variable is not used by the CVO

Summary: CLUSTER_PROFILE env. variable is not used by the CVO

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Guillaume Rose
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:	1891068
Blocks:	1925199

Reported:	2020-12-14 09:10 UTC by Guillaume Rose
Modified:	2021-02-24 15:43 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:43:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-storage-operator pull 117	None	closed	Bug 1907329: Add missing default cluster profile annotation	2021-02-15 14:49:34 UTC
Github	openshift cluster-version-operator pull 404	None	closed	Bug 1907329: Add cluster profile support	2021-02-15 14:49:34 UTC
Github	operator-framework operator-lifecycle-manager pull 1887	None	closed	Bug 1907329: Update /manifests with default cluster profile annotation	2021-02-15 14:49:34 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:43:41 UTC

Description Guillaume Rose 2020-12-14 09:10:49 UTC

Per the enhancement merged for the 4.6 release [1], the CVO should use the CLUSTER_PROFILE environment variable to select which manifests to apply.
Today it does not, because that code has not been written.

[1] https://github.com/openshift/enhancements/pull/200

---

The prerequisite, https://bugzilla.redhat.com/show_bug.cgi?id=1871890, is not done yet, and two PRs are still open:

https://github.com/openshift/cluster-storage-operator/pull/117
https://github.com/operator-framework/operator-lifecycle-manager/pull/1887

Once those are merged, the CVO can use the env. variable, for instance with this PR:
https://github.com/openshift/cluster-version-operator/pull/404

Comment 2 Johnny Liu 2021-01-13 06:53:26 UTC

I am following https://github.com/openshift/enhancements/blob/master/enhancements/update/cluster-profiles.md#with-the-installer to test this.


$ export OPENSHIFT_INSTALL_EXPERIMENTAL_CLUSTER_PROFILE=single-node-developer
$ openshift-install version
openshift-install 4.7.0-0.nightly-2021-01-12-203716
built from commit b3dae7f4736bcd1dbf5a1e0ddafa826ee1738d81
release image registry.ci.openshift.org/ocp/release@sha256:c97466158d19a6e6b5563da4365d42ebe5579421b1163f3a2d6778ceb5388aed
$ openshift-install create cluster


The installation completed successfully, but I did not see the CLUSTER_PROFILE env. variable injected into the CVO deployment.

$ oc -n openshift-cluster-version get deployment.apps/cluster-version-operator -o yaml|grep -i PROFILE
<empty>
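
A narrower check than grep is to read the variable straight from the Deployment spec. The following is only a sketch and was not run as part of this report: the function name `cvo_profile` is hypothetical, and it assumes the CVO is the first container in the pod template and that `oc` is already logged in to the cluster.

```shell
# Print the CLUSTER_PROFILE value set on the first CVO container.
# Prints nothing when the variable is not set (assumption: kubectl's default
# handling of missing JSONPath keys is left enabled).
cvo_profile() {
  oc -n openshift-cluster-version get deployment.apps/cluster-version-operator \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CLUSTER_PROFILE")].value}'
}
```

An empty result from `cvo_profile` corresponds to the `<empty>` grep output above.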

Run the following command to find manifests that are annotated for the `self-managed-high-availability` profile but not for `single-node-developer`:

$ for i in `ls`; do if grep -q "include.release.openshift.io/self-managed-high-availability" $i; then if ! grep -q "include.release.openshift.io/single-node-developer" $i; then echo $i; fi; fi; done
<--snip-->
0000_90_cluster-update-keys_configmap.yaml
<--snip-->

Take 0000_90_cluster-update-keys_configmap.yaml as the test target.
$ cat 0000_90_cluster-update-keys_configmap.yaml 
<--snip-->
kind: ConfigMap
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/verification-config-map: ""
  creationTimestamp: null
  name: release-verification
  namespace: openshift-config-managed
$ oc get cm release-verification -n openshift-config-managed
NAME                   DATA   AGE
release-verification   3      147m

As I understand it, `release-verification` should NOT be created when the user sets CLUSTER_PROFILE=single-node-developer.

On the bootstrap node, /usr/local/bin/bootkube.sh contains:
if [ ! -f cvo-bootstrap.done ]
then
        echo "Rendering Cluster Version Operator Manifests..."

        rm --recursive --force cvo-bootstrap

        bootkube_podman_run \
                --volume "$PWD:/assets:z" \
                --env CLUSTER_PROFILE="single-node-developer" \
                "${RELEASE_IMAGE_DIGEST}" \
                render \
                        --output-dir=/assets/cvo-bootstrap \
                        --release-image="${RELEASE_IMAGE_DIGEST}"

        cp cvo-bootstrap/bootstrap/* bootstrap-manifests/
        cp cvo-bootstrap/manifests/* manifests/
        ## FIXME: CVO should use `/etc/kubernetes/bootstrap-secrets/kubeconfig` instead
        cp auth/kubeconfig-loopback /etc/kubernetes/kubeconfig

        touch cvo-bootstrap.done
fi


The setting already takes effect in bootkube.sh.

But it seems this env. variable is never respected by the CVO. Am I missing anything?

Comment 3 Guillaume Rose 2021-01-13 08:41:05 UTC

The work we did for 4.7 is preparatory work for 4.8. It doesn't work with the installer.

The current implementation is really made for IBM Cloud: if you deploy the CVO as they do and pass the CLUSTER_PROFILE env var, then it will be used.

With the installer, it's different. We have not yet added the env. variable to the manifests; it will be added in 4.8. Adding it now would break upgrades from 4.6, because the old CVO doesn't know what to do with the {{ .ClusterProfile }} template variable in manifests.

Also, I don't think we have two complete profiles. I believe we only have the default one; single-node-developer is still in progress: https://bugzilla.redhat.com/show_bug.cgi?id=1915473

Comment 4 Johnny Liu 2021-01-13 10:11:47 UTC

Thanks for the details.

I do not work with IBM Cloud. Can you help verify this bug? Or can you tell me a simple way to verify it on a common cloud, such as IPI on AWS?

Comment 5 W. Trevor King 2021-01-13 18:31:56 UTC

We need at least bug 1891068 to be addressed to add a missing annotation for our current two profiles.

Comment 6 W. Trevor King 2021-01-13 18:47:03 UTC

For verification, building on Guillaume's suggestion in comment 3, you could install a vanilla AWS cluster, scale the CVO Deployment down to zero, bump the CVO Deployment to set CLUSTER_PROFILE=single-node-production-edge, and scale the CVO Deployment back up to one.  Then check the resulting CVO logs to confirm that it is only pushing single-node-production-edge manifests.  Looking for a useful manifest in a recent 4.7 image:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64
  $ grep -rA 2 self-managed-high-availability manifests | grep -B6 single-node-production-edge | head -n7
  manifests/0000_50_console-operator_sample-application-quickstart.yaml:    include.release.openshift.io/self-managed-high-availability: "true"
  manifests/0000_50_console-operator_sample-application-quickstart.yaml-    include.release.openshift.io/single-node-developer: "true"
  manifests/0000_50_console-operator_sample-application-quickstart.yaml-spec:
  --
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml:    include.release.openshift.io/self-managed-high-availability: "true"
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml-    include.release.openshift.io/single-node-developer: "true"
  manifests/0000_50_console-operator_ocs-install-tour-quickstart.yaml-    include.release.openshift.io/single-node-production-edge: "true"
  $ head -n4 manifests/0000_50_console-operator_sample-application-quickstart.yaml
  apiVersion: console.openshift.io/v1
  kind: ConsoleQuickStart
  metadata:
    name: sample-application

So the CVO should not be attempting to sync the sample-application ConsoleQuickStart object in the single-node-production-edge profile.  We aren't currently actually supporting changing profiles on the fly like this, but if we get lucky and the CVO doesn't blow up on the profile change, seeing the CVO get past the spot where that manifest used to live without trying to push that manifest will show that the code from PR 404 is working.
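
The grep pipeline above can be approximated by a small helper that lists manifest files annotated for one profile but not for another. This is a sketch under assumptions: `manifests_only_in` is a hypothetical name, and it checks annotations per file rather than per YAML document, so a file holding several manifests could be misclassified.

```shell
# Hypothetical helper: print the names of files in directory $1 that carry the
# include.release.openshift.io annotation for profile $2 but not for profile $3.
manifests_only_in() {
  local dir="$1" included="$2" excluded="$3" file
  local prefix='include.release.openshift.io'
  for file in "$dir"/*; do
    [ -f "$file" ] || continue
    # -F treats the annotation as a fixed string, so dots are not wildcards.
    if grep -qF -- "${prefix}/${included}: \"true\"" "$file" &&
       ! grep -qF -- "${prefix}/${excluded}: \"true\"" "$file"; then
      basename "$file"
    fi
  done
}
```

For the extracted release above, the call would be `manifests_only_in manifests self-managed-high-availability single-node-production-edge`; any file it prints is a candidate for the check described in this comment.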

Comment 7 Johnny Liu 2021-01-14 03:47:24 UTC

Thanks for your detailed steps.

Verified this bug with 4.7.0-0.nightly-2021-01-12-203716.

1. run IPI install on aws.
2. oc scale --replicas=0 deployment.apps/cluster-version-operator -n openshift-cluster-version
3. edit deployment.apps/cluster-version-operator to set CLUSTER_PROFILE=single-node-production-edge
<--snip-->
    spec:
      containers:
      - args:
        - start
        - --release-image=registry.ci.openshift.org/ocp/release@sha256:c97466158d19a6e6b5563da4365d42ebe5579421b1163f3a2d6778ceb5388aed
        - --enable-auto-update=false
        - --enable-default-cluster-version=true
        - --serving-cert-file=/etc/tls/serving-cert/tls.crt
        - --serving-key-file=/etc/tls/serving-cert/tls.key
        - --v=5
        env:
        - name: CLUSTER_PROFILE
          value: single-node-production-edge
<--snip-->
4. oc scale --replicas=1 deployment.apps/cluster-version-operator -n openshift-cluster-version
5. check 'sample-application' ConsoleQuickStart
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     21h
sample-application   21h
6. delete 'ocs-install-tour' and 'sample-application' ConsoleQuickStart together
$ oc delete ConsoleQuickStart ocs-install-tour sample-application
consolequickstart.console.openshift.io "ocs-install-tour" deleted
consolequickstart.console.openshift.io "sample-application" deleted
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
7. wait a few minutes
8. check again: "ocs-install-tour" has been recreated, but "sample-application" has not
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     3m25s
9. edit deployment.apps/cluster-version-operator to remove the CLUSTER_PROFILE=single-node-production-edge setting
10. wait a few minutes for the CVO to be redeployed and for sample-application to be synced, then check again
$ oc get ConsoleQuickStart
NAME                 AGE
add-healthchecks     21h
explore-pipelines    21h
explore-serverless   21h
monitor-sampleapp    21h
ocs-install-tour     12m
sample-application   17s

Now 'sample-application' is synced and recreated, which means CLUSTER_PROFILE takes effect.
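
The profile switch in steps 2 to 4, the removal in step 9, and the waiting in steps 7 and 10 could be scripted roughly as follows. This is a sketch, assuming a logged-in `oc` session with sufficient rights; the function names `set_cvo_profile` and `wait_for_quickstart` and the `POLL_SECONDS` variable are hypothetical. Unlike step 9, the sketch also scales the CVO down before removing the variable. As noted in comment 6, changing profiles on a running cluster is not a supported operation.

```shell
NS=openshift-cluster-version
CVO=deployment.apps/cluster-version-operator

# Restart the CVO with the given profile; pass "" to remove CLUSTER_PROFILE.
set_cvo_profile() {
  local profile="$1"
  oc scale --replicas=0 "$CVO" -n "$NS"
  if [ -n "$profile" ]; then
    oc set env "$CVO" CLUSTER_PROFILE="$profile" -n "$NS"
  else
    # A trailing dash tells `oc set env` to remove the variable.
    oc set env "$CVO" CLUSTER_PROFILE- -n "$NS"
  fi
  oc scale --replicas=1 "$CVO" -n "$NS"
}

# Poll until ConsoleQuickStart $1 exists; give up after $2 attempts (default 30).
wait_for_quickstart() {
  local name="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if oc get ConsoleQuickStart "$name" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  return 1
}
```

With these, `set_cvo_profile single-node-production-edge` followed by deleting the two quick starts and calling `wait_for_quickstart ocs-install-tour` mirrors steps 2 to 8, and `set_cvo_profile ""` followed by `wait_for_quickstart sample-application` mirrors steps 9 and 10.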

Comment 11 errata-xmlrpc 2021-02-24 15:43:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
