Bug 2022509

Summary: getOverrideForManifest does not check manifest.GVK.Group
Product: OpenShift Container Platform
Reporter: Matthew Barnes <mbarnes>
Component: Cluster Version Operator
Assignee: Matthew Barnes <mbarnes>
Status: CLOSED ERRATA
QA Contact: liujia <jiajliu>
Severity: high
Priority: high
Version: 4.10
CC: aos-bugs, dramseur, jiajliu, nmalik, wking
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator (CVO) previously ignored spec.overrides[].group when deciding whether to override a manifest.
Consequence: An overrides entry could match multiple resources that differed only by group, and so override more resources than the admin intended. An overrides entry with an invalid group was also still considered a match, so admins might have been using invalid group values without noticing.
Fix: The CVO now requires the group to match when applying configured overrides.
Result: The CVO no longer matches multiple manifests with a single override, and instead matches only the manifest with the correct group. Admins who had been using an invalid group will have to update it to the correct group for their override to continue to match.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:26:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2022570

Description Matthew Barnes 2021-11-11 20:17:23 UTC
Note: Filed this on GitHub (see links) but opening here too for internal tracking, as it's blocking ARO from moving to 4.9.


We have the following override in our `ClusterVersion`:

    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ""
      unmanaged: true

This is causing cluster provisioning to fail, because when the operator encounters this manifest...

$ cat 0000_30_config-operator_01_operator.cr.yaml
apiVersion: operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
spec:
  managementState: Managed

... the getOverrideForManifest function [1] improperly matches it to the "imageregistry.operator.openshift.io" override above, because it disregards the Group in its comparison ("imageregistry.operator.openshift.io" != "operator.openshift.io").
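The faulty comparison can be sketched as follows. This is a simplified illustration, not the actual CVO source (the real function is in pkg/cvo/sync_worker.go, linked at [1]); the struct and function names here are hypothetical:

```go
package main

import "fmt"

// override mirrors the fields of a ClusterVersion spec.overrides entry.
type override struct {
	Group, Kind, Namespace, Name string
}

// manifest mirrors the identifying fields of a release payload manifest.
type manifest struct {
	Group, Kind, Namespace, Name string
}

// buggyMatch sketches the pre-fix behavior: Group is never compared,
// so an override for one API group can match a manifest in another.
func buggyMatch(o override, m manifest) bool {
	return o.Kind == m.Kind && o.Namespace == m.Namespace && o.Name == m.Name
}

// fixedMatch sketches the post-fix behavior: the groups must also match.
func fixedMatch(o override, m manifest) bool {
	return o.Group == m.Group && buggyMatch(o, m)
}

func main() {
	// The ARO override from this bug report...
	aro := override{Group: "imageregistry.operator.openshift.io", Kind: "Config", Name: "cluster"}
	// ...and the unrelated config-operator manifest it wrongly matched.
	cfgOp := manifest{Group: "operator.openshift.io", Kind: "Config", Name: "cluster"}

	fmt.Println(buggyMatch(aro, cfgOp)) // true: wrong manifest treated as overridden
	fmt.Println(fixedMatch(aro, cfgOp)) // false: groups differ, no match
}
```

With the fix, only a manifest whose group is exactly "imageregistry.operator.openshift.io" is considered overridden, so the config-operator's Config resource is created as expected.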

As a result, the cluster-config-operator has no custom resource to act on and it blocks the cluster-version-operator from ever completing:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h18m   Working towards 4.9.7: 725 of 735 done (98% complete), waiting on config-operator


[1] https://github.com/openshift/cluster-version-operator/blob/4c3a08036da8a96175b7c0445de83b58d0ea5515/pkg/cvo/sync_worker.go#L1060-L1071

Comment 2 liujia 2021-11-12 08:33:06 UTC
> $ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h18m   Working towards 4.9.7: 725 of 735 done (98% complete), waiting on config-operator
This looks like a status message during initialization or an update, right?

> This is causing cluster provisioning to fail
I think "cluster provisioning" here means installing a cluster? I'm not quite clear on how to install a cluster with overrides set in the ClusterVersion. I don't think it means updating a cluster, because setting overrides in the ClusterVersion blocks cluster updates.

Hi Matthew Barnes,
Currently I have no idea how QE can reproduce this issue. Could you provide more detailed steps on how to provision such a cluster with spec.overrides set in the ClusterVersion?

Comment 3 Matthew Barnes 2021-11-12 16:01:35 UTC
> Currently i have no idea how QE reproduce the issue. Could you help give more steps on how to provision such a cluster with spec.overrides of cv?

The example is for an Azure Red Hat OpenShift (ARO) cluster, which embeds a forked openshift-installer in our custom Azure Resource Provider code.  But I think this should be reproducible with the vanilla installer.

I believe the "CVOIgnore" asset in the installer is the entry point.

The cluster-version-operator overrides are specified in a "manifests/cvo-overrides.yaml" file:
https://github.com/openshift/installer/blob/f3f56e279b729663e3184a06e38bf27d42d58279/pkg/asset/ignition/bootstrap/cvoignore.go#L21

First run "bin/openshift-install create manifests --dir assets", then add the override from comment #0 to "assets/manifests/cvo-overrides.yaml".

So the manifest spec would look something like:

  spec:
    channel: stable-4.9
    clusterID: $CLUSTERID
    overrides:
    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ""
      unmanaged: true

Then create the cluster as per usual.

During install, once the bootstrap phase is complete, obtain a .kubeconfig and verify this resource is missing:
$ oc get config.operator cluster

Also the openshift-config-operator logs will be filled with this message:
ConfigOperatorController reconciliation failed: configs.operator.openshift.io "cluster" not found

This will indefinitely block the cluster-version-operator from reaching 100%.

Comment 5 liujia 2021-11-15 07:57:02 UTC
The PR has landed in the oldest available 4.10 nightly build, so I cannot reproduce it on 4.10 now. Instead, using the steps above, I reproduced the bug on 4.9.7.

1. Add overrides in manifests/cvo-overrides.yaml before triggering an installation.
  spec:
    channel: stable-4.9
    clusterID: 9a263f40-6865-475d-919c-705fc7f49f57
    overrides:
    - kind: Config
      group: imageregistry.operator.openshift.io
      name: cluster
      namespace: ""
      unmanaged: true
2. Trigger the installation with the above manifest and check that the installation fails.
level=info msg=Waiting up to 40m0s for the cluster at https://api.jliu49.qe.devcluster.openshift.com:6443 to initialize...
...
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.9.7: 733 of 735 done (99% complete), waiting on config-operator

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          53m     Working towards 4.9.7: 733 of 735 done (99% complete), waiting on config-operator

# ./oc get config cluster
Error from server (NotFound): configs.operator.openshift.io "cluster" not found

Comment 6 liujia 2021-11-15 09:40:08 UTC
Verified on 4.10.0-0.nightly-2021-11-14-184249

1. Add overrides in manifests/cvo-overrides.yaml before triggering an installation.
  spec:
    channel: stable-4.10
    clusterID: 52b6a00c-aae7-422f-9673-5b5629fd23d6
    overrides:
    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ''
      unmanaged: true

2. Trigger the installation with the above manifest and check that the installation succeeds.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         73m     Cluster version is 4.10.0-0.nightly-2021-11-14-184249

# ./oc get clusterversion -o json|jq .items[].spec
{
  "channel": "stable-4.10",
  "clusterID": "52b6a00c-aae7-422f-9673-5b5629fd23d6",
  "overrides": [
    {
      "group": "imageregistry.operator.openshift.io",
      "kind": "Config",
      "name": "cluster",
      "namespace": "",
      "unmanaged": true
    }
  ]
}

# ./oc get config cluster
NAME      AGE
cluster   96m

Comment 10 errata-xmlrpc 2022-03-10 16:26:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056