Bug 2022509 - getOverrideForManifest does not check manifest.GVK.Group
Summary: getOverrideForManifest does not check manifest.GVK.Group
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Barnes
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 2022570
 
Reported: 2021-11-11 20:17 UTC by Matthew Barnes
Modified: 2022-03-10 16:27 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator (CVO) previously ignored spec.overrides[].group when deciding whether to override a manifest. Consequence: An overrides entry might match multiple resources which only differed by group, and override more resources than the admin intended. An overrides entry with an invalid group was also still considered a match, so admins might be using invalid group values without noticing. Fix: The CVO now requires group matching when applying configured overrides. Result: The CVO will no longer match multiple manifests with a single override, and instead only matches the manifest with the correct group. Admins who had been using an invalid group will have to update to the correct group in order to have their override continue to match.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:26:48 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/cluster-version-operator issue 688 (open): getOverrideForManifest does not check manifest.GVK.Group (2021-11-11 20:17:23 UTC)
- GitHub openshift/cluster-version-operator pull 689 (open): Bug 2022509: cvo: Compare manifest group in getOverrideForManifest (2021-11-11 20:59:31 UTC)
- Red Hat Product Errata RHSA-2022:0056 (2022-03-10 16:27:10 UTC)

Description Matthew Barnes 2021-11-11 20:17:23 UTC
Note: Filed this on GitHub (see links) but opening here too for internal tracking, as it's blocking ARO from moving to 4.9.


We have the following override in our `ClusterVersion`:

    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ""
      unmanaged: true

This is causing cluster provisioning to fail, because when the operator encounters this manifest...

$ cat 0000_30_config-operator_01_operator.cr.yaml
apiVersion: operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
spec:
  managementState: Managed

... the getOverrideForManifest function [1] is improperly matching it to the above "imageregistry.operator.openshift.io" override because it disregards the Group in its comparison ("imageregistry.operator.openshift.io" != "operator.openshift.io").

As a result, the cluster-config-operator has no custom resource to act on and it blocks the cluster-version-operator from ever completing:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h18m   Working towards 4.9.7: 725 of 735 done (98% complete), waiting on config-operator


[1] https://github.com/openshift/cluster-version-operator/blob/4c3a08036da8a96175b7c0445de83b58d0ea5515/pkg/cvo/sync_worker.go#L1060-L1071
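The fix amounts to adding a Group comparison to the override match. A minimal sketch of the idea in Go (the type `ref` and function `overrideMatches` are illustrative names, not the CVO's actual identifiers):

```go
package main

import "fmt"

// ref captures the fields an override entry is matched on; illustrative only.
type ref struct {
	Group, Kind, Namespace, Name string
}

// overrideMatches reports whether an override entry applies to a manifest.
// Before the fix, the Group comparison was missing, so an override for
// imageregistry.operator.openshift.io/Config also matched
// operator.openshift.io/Config.
func overrideMatches(override, manifest ref) bool {
	return override.Group == manifest.Group && // the comparison the fix adds
		override.Kind == manifest.Kind &&
		override.Namespace == manifest.Namespace &&
		override.Name == manifest.Name
}

func main() {
	override := ref{Group: "imageregistry.operator.openshift.io", Kind: "Config", Name: "cluster"}
	manifest := ref{Group: "operator.openshift.io", Kind: "Config", Name: "cluster"}
	// The groups differ, so with the fix this override does not apply.
	fmt.Println(overrideMatches(override, manifest))
}
```

With the Group check removed from the comparison, the two refs above would be considered equal, which is exactly the mismatch described in this report.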

Comment 2 liujia 2021-11-12 08:33:06 UTC
> $ oc get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version             False       True          3h18m   Working towards 4.9.7: 725 of 735 done (98% complete), waiting on config-operator

This looks like a status reported while the cluster is initializing or updating, right?

> This is causing cluster provisioning to fail
I think "cluster provisioning" here means installing a cluster? I'm not clear on how to install a cluster with overrides already set in the ClusterVersion. I don't think it means updating a cluster, because setting overrides in the ClusterVersion blocks cluster updates.

Hi Matthew,
Currently I don't see how QE can reproduce this issue. Could you give more detailed steps on how to provision a cluster with spec.overrides set in the ClusterVersion?

Comment 3 Matthew Barnes 2021-11-12 16:01:35 UTC
> Currently I don't see how QE can reproduce this issue. Could you give more detailed steps on how to provision a cluster with spec.overrides set in the ClusterVersion?

The example is for an Azure Red Hat OpenShift (ARO) cluster, which embeds a forked openshift-installer in our custom Azure Resource Provider code.  But I think this should be reproducible with the vanilla installer.

The "CVOIgnore" asset in the installer I believe is the entry point.

The cluster-version-operator overrides are specified in a "manifests/cvo-overrides.yaml" file:
https://github.com/openshift/installer/blob/f3f56e279b729663e3184a06e38bf27d42d58279/pkg/asset/ignition/bootstrap/cvoignore.go#L21

First run "bin/openshift-install create manifests --dir assets", then add the override from comment #0 to "assets/manifests/cvo-overrides.yaml".

So the manifest spec would look something like:

  spec:
    channel: stable-4.9
    clusterID: $CLUSTERID
    overrides:
    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ""
      unmanaged: true

Then create the cluster as per usual.

During install, once the bootstrap phase is complete, obtain a .kubeconfig and verify this resource is missing:
$ oc get config.operator cluster

Also the openshift-config-operator logs will be filled with this message:
ConfigOperatorController reconciliation failed: configs.operator.openshift.io "cluster" not found

This will indefinitely block the cluster-version-operator from reaching 100%.

Comment 5 liujia 2021-11-15 07:57:02 UTC
The PR has already landed in the oldest available 4.10 nightly build, so I cannot reproduce it on 4.10 now. Instead, using the steps below, I reproduced the bug on 4.9.7.

1. Add overrides in manifests/cvo-overrides.yaml before triggering an installation.
  spec:
    channel: stable-4.9
    clusterID: 9a263f40-6865-475d-919c-705fc7f49f57
    overrides:
    - kind: Config
      group: imageregistry.operator.openshift.io
      name: cluster
      namespace: ""
      unmanaged: true
2. Triggered the installation with the above manifest and checked that the installation failed.
level=info msg=Waiting up to 40m0s for the cluster at https://api.jliu49.qe.devcluster.openshift.com:6443 to initialize...
...
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.9.7: 733 of 735 done (99% complete), waiting on config-operator

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          53m     Working towards 4.9.7: 733 of 735 done (99% complete), waiting on config-operator

# ./oc get config cluster
Error from server (NotFound): configs.operator.openshift.io "cluster" not found

Comment 6 liujia 2021-11-15 09:40:08 UTC
Verified on 4.10.0-0.nightly-2021-11-14-184249

1. Add overrides in manifests/cvo-overrides.yaml before triggering an installation.
  spec:
    channel: stable-4.10
    clusterID: 52b6a00c-aae7-422f-9673-5b5629fd23d6
    overrides:
    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ''
      unmanaged: true

2. Triggered the installation with the above manifest and checked that the installation succeeded.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         73m     Cluster version is 4.10.0-0.nightly-2021-11-14-184249

# ./oc get clusterversion -o json|jq .items[].spec
{
  "channel": "stable-4.10",
  "clusterID": "52b6a00c-aae7-422f-9673-5b5629fd23d6",
  "overrides": [
    {
      "group": "imageregistry.operator.openshift.io",
      "kind": "Config",
      "name": "cluster",
      "namespace": "",
      "unmanaged": true
    }
  ]
}

# ./oc get config cluster
NAME      AGE
cluster   96m

Comment 10 errata-xmlrpc 2022-03-10 16:26:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

