Bug 1886176 - CCO sets Upgradeable to False on bare-metal with reason CredentialsRootSecretMissing
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Devan Goodwin
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks: 1886475
 
Reported: 2020-10-07 19:59 UTC by Seth Jennings
Modified: 2020-12-08 02:26 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Clones: 1886475
Environment:
Last Closed: 2020-12-04 12:56:49 UTC
Target Upstream Version:
Embargoed:
lwan: needinfo-



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1884691 0 medium CLOSED Installer blocks cloud-credential-operator manual mode on GCP and Azure 2021-02-24 15:22:55 UTC

Description Seth Jennings 2020-10-07 19:59:23 UTC
Installed on bare-metal (VMs) with assisted installer

$ oc get clusteroperator -oyaml cloud-credential 
...
status:
  conditions:
...
  - lastTransitionTime: "2020-10-07T18:28:25Z"
    message: Parent credential secret kube-system/aws-creds must be restored prior to upgrade
    reason: CredentialsRootSecretMissing
    status: "False"
    type: Upgradeable

This blocks upgrades and raises a ClusterNotUpgradeable alert.


Version-Release number of selected component (if applicable):
4.6

How reproducible:
Every time

Steps to Reproduce:
1. Install a cluster with assisted installer on bare-metal

Actual results:
Cluster is not upgradeable 

Expected results:
Cluster is upgradeable, since cloud provider credentials do not exist on bare-metal

Additional info:

$ oc get infrastructures.config.openshift.io -oyaml cluster
...
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: BareMetal
status:
...
  platform: BareMetal
  platformStatus:
    baremetal:
      apiServerInternalIP: 10.42.11.10
      ingressIP: 10.42.11.11
    type: BareMetal

$ oc get cloudcredentials.operator.openshift.io -oyaml cluster 
...
spec:
  credentialsMode: ""

This check appears to have been added in response to:
https://bugzilla.redhat.com/show_bug.cgi?id=1879628
https://github.com/openshift/cloud-credential-operator/pull/248

Comment 1 Devan Goodwin 2020-10-08 11:22:17 UTC
If a workaround is needed, create an empty dummy secret; we don't check the contents.

This will only block an upgrade to 4.7, and now 4.6.1. 

I will try to determine how the AWS code path is being triggered on non-AWS infrastructure.

Comment 2 Scott Dodson 2020-10-08 14:17:07 UTC
Discussed with Devan: that was a typo (s/now/not/). The conclusion is that this is not a 4.6 GA blocker, but it will need to be resolved in a 4.6.z release before 4.7 upgrades are expected to be supported.

Moving to 4.7 and cloning for resolution in 4.6.z.

Comment 3 wang lin 2020-11-06 05:46:16 UTC
test payload: registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-10-27-051128

I see a similar issue when provisioning in Manual mode on GCP; provisioning on Azure is OK. (Found while testing bug https://bugzilla.redhat.com/show_bug.cgi?id=1884691.)

###
$ oc get co cloud-credential -o json | jq -r ".status.conditions"
[
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "message": "Credential minting is disabled by cluster admin",
    "reason": "OperatorDisabledByAdmin",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "message": "Parent credential secret must be restored prior to upgrade: kube-system/gcp-credentials",
    "reason": "MissingRootCredential",
    "status": "False",
    "type": "Upgradeable"
  }
]


$ oc get cloudcredential cluster -o json | jq -r ".spec"
{
  "credentialsMode": "Manual",
  "logLevel": "Normal"
}

$ oc get infrastructure cluster -o json | jq -r ".status"
{
  "apiServerInternalURI": "https://api-int.lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com:6443",
  "apiServerURL": "https://api.lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com:6443",
  "etcdDiscoveryDomain": "lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com",
  "infrastructureName": "lwan-jk-manual-gcp-wkwnt",
  "platform": "GCP",
  "platformStatus": {
    "gcp": {
      "projectID": "openshift-qe",
      "region": "us-central1"
    },
    "type": "GCP"
  }
}
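
For anyone scripting the same check, here is a minimal Go sketch that picks out the blocking condition from JSON shaped like the `.status.conditions` output earlier in this comment. The `condition` struct and the `upgradeBlockers` helper are hypothetical names made up for illustration, and the sample input is trimmed from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors only the fields visible in the pasted output.
// It is a hypothetical type for this sketch, not an OpenShift API type.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// upgradeBlockers returns every condition with type Upgradeable and status False.
func upgradeBlockers(raw []byte) ([]condition, error) {
	var conds []condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return nil, err
	}
	var blockers []condition
	for _, c := range conds {
		if c.Type == "Upgradeable" && c.Status == "False" {
			blockers = append(blockers, c)
		}
	}
	return blockers, nil
}

func main() {
	// Sample input trimmed from the conditions shown in this comment.
	raw := []byte(`[
	  {"type": "Available", "status": "True", "reason": "OperatorDisabledByAdmin"},
	  {"type": "Degraded", "status": "False"},
	  {"type": "Upgradeable", "status": "False", "reason": "MissingRootCredential",
	   "message": "Parent credential secret must be restored prior to upgrade: kube-system/gcp-credentials"}
	]`)
	blockers, err := upgradeBlockers(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, b := range blockers {
		fmt.Printf("%s: %s\n", b.Reason, b.Message)
	}
}
```

The input would normally come from `oc get co cloud-credential -o json | jq ".status.conditions"` as used above.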

Comment 5 Devan Goodwin 2020-11-30 18:33:48 UTC
I've looked through the code and can't immediately see how this is possible. We fork the actuator implementation based on the infrastructure status platform (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/operator/controller.go#L84). The bare metal platform type should fall through to the default case, which adds a dummy actuator that contains no logic and is explicitly coded to always set Upgradeable to true. I can't yet see how a bare metal cluster would be instantiating the AWS actuator, but that appears to be what is happening in this case.
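
A minimal sketch of the dispatch described here, to make the expected behavior concrete. All names (`Actuator`, `awsActuator`, `dummyActuator`, `selectActuator`) are hypothetical and heavily simplified; this is not the operator's actual code, only the shape of "switch on platform, default to a do-nothing actuator".

```go
package main

import "fmt"

// Actuator is a hypothetical, simplified stand-in for per-platform
// credential logic. It is not the real operator interface.
type Actuator interface {
	Name() string
	// Upgradeable reports whether upgrades may proceed and, if not, why.
	Upgradeable() (ok bool, reason string)
}

// awsActuator blocks upgrades when the parent credential secret is missing.
type awsActuator struct{ rootSecretPresent bool }

func (a awsActuator) Name() string { return "aws" }

func (a awsActuator) Upgradeable() (bool, string) {
	if !a.rootSecretPresent {
		return false, "CredentialsRootSecretMissing"
	}
	return true, ""
}

// dummyActuator contains no logic and always reports upgradeable.
type dummyActuator struct{}

func (dummyActuator) Name() string { return "dummy" }

func (dummyActuator) Upgradeable() (bool, string) { return true, "" }

// selectActuator forks on the platform string from the infrastructure status.
// Any platform without a dedicated case falls through to the dummy actuator.
func selectActuator(platform string, rootSecretPresent bool) Actuator {
	switch platform {
	case "AWS":
		return awsActuator{rootSecretPresent: rootSecretPresent}
	default:
		return dummyActuator{}
	}
}

func main() {
	for _, p := range []string{"AWS", "BareMetal"} {
		act := selectActuator(p, false)
		ok, reason := act.Upgradeable()
		fmt.Printf("platform=%s actuator=%s upgradeable=%t reason=%q\n", p, act.Name(), ok, reason)
	}
}
```

Under this model a BareMetal cluster can never report CredentialsRootSecretMissing, which is why the reported condition is surprising.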

It would help a lot to get onto a metal cluster to debug. I'd like to see the infrastructure CR in full (though we have most of that) and, most importantly, the cloud-credential-operator pod logs. Unfortunately, we do not have access to such a cluster or hardware. I will try to track down someone who can get us onto one.

Comment 6 Devan Goodwin 2020-11-30 18:34:51 UTC
oc adm must-gather would be ideal.

Comment 7 Seth Jennings 2020-12-04 03:22:47 UTC
I'm no longer seeing this issue in 4.6 for platform: BareMetal (4.6.6 and higher, to be specific). It may have been fixed before 4.6.6 as well.

Wang Lin, is this still an issue for you? If not, feel free to close.

Comment 8 Devan Goodwin 2020-12-04 12:24:27 UTC
FWIW, I think Lin's issue is probably a separate bug. In that case GCP appears to be correctly running the GCP actuator, but the logic may be incorrectly assuming it needs an admin credential.
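
To illustrate the suspected distinction, here is a hypothetical Go sketch of an upgrade check that only insists on the parent secret when the credentials mode actually relies on one. The helper names are made up, "Mint" is used as an example of a non-Manual mode that does not appear in this report, and this is an assumption about what correct behavior might look like, not the operator's code or the eventual fix.

```go
package main

import "fmt"

// needsRootSecret is a hypothetical helper. It reports whether an upgrade
// check should require the parent credential secret for a credentialsMode
// value. In Manual mode the admin supplies credentials directly, so this
// sketch assumes no parent secret is expected.
func needsRootSecret(credentialsMode string) bool {
	return credentialsMode != "Manual"
}

// upgradeable combines the mode with whether the parent secret exists.
func upgradeable(credentialsMode string, rootSecretPresent bool) bool {
	if !needsRootSecret(credentialsMode) {
		return true
	}
	return rootSecretPresent
}

func main() {
	// Manual mode without a parent secret: not blocked in this sketch.
	fmt.Println("Manual, no secret:", upgradeable("Manual", false))
	// A minting mode without a parent secret: blocked.
	fmt.Println("Mint, no secret:", upgradeable("Mint", false))
}
```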

