Bug 2033489 - CCM operator failing on baremetal platform
Summary: CCM operator failing on baremetal platform
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Michael McCune
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 2033722 2036571
Depends On:
Blocks:
Reported: 2021-12-17 02:01 UTC by Zane Bitter
Modified: 2022-04-11 08:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:34:34 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-cloud-controller-manager-operator pull 156 | 0 | None | Merged | Bug 2033489: allow baremetal platform to skip syncing | 2021-12-17 17:53:07 UTC
Github | openshift cluster-cloud-controller-manager-operator pull 158 | 0 | None | Merged | Bug 2033489: Use a list of platforms where config sync is required | 2021-12-17 17:53:08 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:34:51 UTC

Description Zane Bitter 2021-12-17 02:01:14 UTC
Description of problem:
Since around the time the patch https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/152 merged, all baremetal jobs have been failing to finish cluster creation because CVO fails to complete with this error:

Cluster operator cloud-controller-manager Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.10.0-0.ci.test-2021-12-16-204517-ci-op-d7q95m29-latest because &{%!e(string=failed to apply resources because CloudConfigControllerDegraded condition is set to True)}

The cloud-controller-manager ClusterOperator resource contains an error message ("Cloud Config Controller failed to sync cloud config") that was first added in the above patch:

    Last Transition Time:  2021-12-16T21:38:34Z
    Message:               Cloud Config Controller failed to sync cloud config
    Reason:                SyncingFailed
    Status:                False
    Type:                  CloudConfigControllerAvailable
    Last Transition Time:  2021-12-16T21:38:34Z
    Message:               Cloud Config Controller failed to sync cloud config
    Reason:                SyncingFailed
    Status:                True
    Type:                  CloudConfigControllerDegraded
    Last Transition Time:  2021-12-16T21:38:34Z
    Message:               Failed when progressing towards operator: 4.10.0-0.ci.test-2021-12-16-210713-ci-ln-k4gx5wb-latest because &{%!e(string=failed to apply resources because CloudConfigControllerDegraded condition is set to True)}
    Reason:                SyncingFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-12-16T21:38:34Z
    Reason:                AsExpected
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-12-16T21:38:34Z
    Message:               Trusted CA Bundle Controller works as expected
    Reason:                AsExpected
    Status:                True
    Type:                  TrustedCABundleControllerControllerAvailable
    Last Transition Time:  2021-12-16T21:38:34Z
    Message:               Trusted CA Bundle Controller works as expected
    Reason:                AsExpected
    Status:                False
    Type:                  TrustedCABundleControllerControllerDegraded

Version-Release number of selected component (if applicable):


How reproducible:
~100%

https://search.ci.openshift.org/chart?maxAge=12h&type=build-log&search=failed%20to%20apply%20resources%20because%20CloudConfigControllerDegraded

Steps to Reproduce:
1. deploy a baremetal cluster
2. wait for CVO to finish

Actual results:
CVO never reaches intended version because the cloud-controller-manager is Degraded

Expected results:
No operators are degraded and CVO reaches the desired version

Additional info:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/208/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1471581173510574080
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/208/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi/1471581173460242432
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1241/pull-ci-openshift-cluster-network-operator-master-e2e-metal-ipi-ovn-ipv6/1471597466745835520

Comment 2 Zane Bitter 2021-12-17 04:30:09 UTC
I tested Mike's patch on metal and it worked.

We're starting to see more oVirt failures with the same issue, so it may need a similar fix. oVirt appears in neither the supported nor the unsupported platform list, which is suspicious given that the default is to enable: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/README.md#supported-platforms

Note that there also seems to be an issue with posting events to the correct namespace (either that or with RBAC), judging by these log messages:

2021-12-16T21:16:04.957877011Z E1216 21:16:04.957646       1 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cloud-controller-manager.16c158ca3c14e1a3", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ClusterOperator", Namespace:"", Name:"cloud-controller-manager", UID:"10c9879a-aee8-44ac-8cef-3b58cc92f170", APIVersion:"config.openshift.io/v1", ResourceVersion:"3068", FieldPath:""}, Reason:"Status degraded", Message:"failed to apply resources because CloudConfigControllerDegraded condition is set to True", Source:v1.EventSource{Component:"cloud-controller-manager-operator", Host:""}, FirstTimestamp:time.Date(2021, time.December, 16, 21, 16, 4, 954210723, time.Local), LastTimestamp:time.Date(2021, time.December, 16, 21, 16, 4, 954210723, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager-operator:cluster-cloud-controller-manager" cannot create resource "events" in API group "" in the namespace "default"' (will not retry!)
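
One possible reading of the rejected event, sketched below: the involved object is a ClusterOperator, which is cluster-scoped, so its ObjectReference carries an empty namespace, and the event was created in "default" (both are visible in the log line above). To my understanding, client-go's event recorder applies this fallback when the reference has no namespace, but treat that as an assumption. The `objectRef` type and `eventNamespace` function are hypothetical stand-ins, not the real client-go API.

```go
package main

import "fmt"

// namespaceDefault is the fallback namespace seen in the rejected event.
const namespaceDefault = "default"

// objectRef is a hypothetical, trimmed-down stand-in for v1.ObjectReference.
type objectRef struct {
	Kind      string
	Namespace string
	Name      string
}

// eventNamespace returns the namespace an event would be created in:
// the involved object's namespace, or "default" when it has none.
func eventNamespace(ref objectRef) string {
	if ref.Namespace == "" {
		return namespaceDefault
	}
	return ref.Namespace
}

func main() {
	// Cluster-scoped object, as in the log message: no namespace set.
	co := objectRef{Kind: "ClusterOperator", Name: "cloud-controller-manager"}
	fmt.Println(eventNamespace(co))
}
```

Under that reading, the operator's service account would need either permission to create events in "default" or an event reference pointing at a namespaced object.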

Comment 4 Dan Williams 2021-12-17 17:52:14 UTC
*** Bug 2033722 has been marked as a duplicate of this bug. ***

Comment 6 sunzhaohua 2021-12-21 05:26:29 UTC
Verified
clusterversion: 4.10.0-0.nightly-2021-12-20-231053
The BM cluster installed successfully. cloud-config sync is skipped on BM.
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-20-231053   True        False         False      84m
baremetal                                  4.10.0-0.nightly-2021-12-20-231053   True        False         False      101m
cloud-controller-manager                   4.10.0-0.nightly-2021-12-20-231053   True        False         False      103m

$ oc logs -f cluster-cloud-controller-manager-operator-6ffd6d8d9d-cr7sm -n openshift-cloud-controller-manager-operator -c config-sync-controllers
I1221 05:18:48.864743       1 cloud_config_sync_controller.go:59] cloud-config sync is not needed, returning early
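
The early return in the log line above matches the approach named in pull 158, a list of platforms where config sync is required. A minimal sketch of that idea follows. The platform names in the list and the function names are assumptions chosen for illustration and do not reflect the operator's actual list or code.

```go
package main

import "fmt"

// syncRequired is a hypothetical allowlist of platforms that need cloud-config
// sync. The entries are illustrative, not the operator's real list.
var syncRequired = map[string]bool{
	"Azure":     true,
	"OpenStack": true,
}

// needsConfigSync reports whether the platform is on the allowlist.
// Platforms not listed (known or unknown) are skipped by default.
func needsConfigSync(platform string) bool {
	return syncRequired[platform]
}

// reconcile returns the action a sync controller would take for a platform.
func reconcile(platform string) string {
	if !needsConfigSync(platform) {
		return "cloud-config sync is not needed, returning early"
	}
	return "syncing cloud config"
}

func main() {
	for _, p := range []string{"BareMetal", "oVirt", "Azure"} {
		fmt.Printf("%s: %s\n", p, reconcile(p))
	}
}
```

With an allowlist, a platform missing from every list (such as oVirt in comment 2) is skipped rather than synced by default.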

Comment 7 Prashanth Sundararaman 2022-01-03 15:32:19 UTC
*** Bug 2036571 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2022-03-10 16:34:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

