Bug 1886176

Summary: CCO sets Upgradeable to False on bare-metal with reason CredentialsRootSecretMissing
Product: OpenShift Container Platform
Reporter: Seth Jennings <sjenning>
Component: Cloud Credential Operator
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED WORKSFORME
QA Contact: wang lin <lwan>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: lwan
Target Milestone: ---
Flags: lwan: needinfo-
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1886475
Environment:
Last Closed: 2020-12-04 12:56:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1886475

Description Seth Jennings 2020-10-07 19:59:23 UTC
Installed on bare metal (VMs) with the Assisted Installer.

$ oc get clusteroperator -oyaml cloud-credential 
...
status:
  conditions:
...
  - lastTransitionTime: "2020-10-07T18:28:25Z"
    message: Parent credential secret kube-system/aws-creds must be restored prior to upgrade
    reason: CredentialsRootSecretMissing
    status: "False"
    type: Upgradeable

This condition blocks upgrades and results in a ClusterNotUpgradeable alert.


Version-Release number of selected component (if applicable):
4.6

How reproducible:
Every time

Steps to Reproduce:
1. Install a cluster with assisted installer on bare-metal

Actual results:
Cluster is not upgradeable 

Expected results:
Cluster is upgradeable, since cloud provider credentials do not apply on bare metal.

Additional info:

$ oc get infrastructures.config.openshift.io -oyaml cluster
...
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: BareMetal
status:
...
  platform: BareMetal
  platformStatus:
    baremetal:
      apiServerInternalIP: 10.42.11.10
      ingressIP: 10.42.11.11
    type: BareMetal

$ oc get cloudcredentials.operator.openshift.io -oyaml cluster 
...
spec:
  credentialsMode: ""

This check seems to have been added in response to:
https://bugzilla.redhat.com/show_bug.cgi?id=1879628
https://github.com/openshift/cloud-credential-operator/pull/248

Comment 1 Devan Goodwin 2020-10-08 11:22:17 UTC
If a workaround is needed, just create an empty dummy secret (kube-system/aws-creds, per the condition message); we don't check the contents.

This will only block an upgrade to 4.7, and now 4.6.1. 

I will try to determine how the AWS code is being triggered on a non-AWS infrastructure.

Comment 2 Scott Dodson 2020-10-08 14:17:07 UTC
Discussed with Devan: that was a typo (s/now/not/), and the conclusion is that this is not a 4.6 GA blocker, but it will need to be resolved in a 4.6.z release before 4.7 upgrades are expected to be supported.

Moving to 4.7 and cloning for resolution in 4.6.z.

Comment 3 wang lin 2020-11-06 05:46:16 UTC
test payload: registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-10-27-051128

I see a similar issue when provisioning with Manual mode on GCP. Provisioning on Azure is OK. (Found while testing bug https://bugzilla.redhat.com/show_bug.cgi?id=1884691.)

###
$ oc get co cloud-credential -o json | jq -r ".status.conditions"
[
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "message": "Credential minting is disabled by cluster admin",
    "reason": "OperatorDisabledByAdmin",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-11-06T02:54:08Z",
    "message": "Parent credential secret must be restored prior to upgrade: kube-system/gcp-credentials",
    "reason": "MissingRootCredential",
    "status": "False",
    "type": "Upgradeable"
  }
]


$ oc get cloudcredential cluster -o json | jq -r ".spec"
{
  "credentialsMode": "Manual",
  "logLevel": "Normal"
}

$ oc get infrastructure cluster -o json | jq -r ".status"
{
  "apiServerInternalURI": "https://api-int.lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com:6443",
  "apiServerURL": "https://api.lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com:6443",
  "etcdDiscoveryDomain": "lwan-jk-manual-gcp.qe.gcp.devcluster.openshift.com",
  "infrastructureName": "lwan-jk-manual-gcp-wkwnt",
  "platform": "GCP",
  "platformStatus": {
    "gcp": {
      "projectID": "openshift-qe",
      "region": "us-central1"
    },
    "type": "GCP"
  }
}
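
For anyone scripting this check, here is a minimal Go sketch that scans a `.status.conditions` array like the one pasted earlier in this comment and reports the condition that blocks upgrades. The `Condition` type, the `upgradeBlocker` function, and the trimmed `sampleConditions` constant are illustrative names only, not part of any OpenShift library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition mirrors only the fields visible in the pasted output.
// It is a hypothetical helper type, not the upstream API type.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// upgradeBlocker returns the Upgradeable condition when its status is
// "False", or nil when nothing blocks the upgrade.
func upgradeBlocker(raw []byte) (*Condition, error) {
	var conds []Condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return nil, err
	}
	for i := range conds {
		if conds[i].Type == "Upgradeable" && conds[i].Status == "False" {
			return &conds[i], nil
		}
	}
	return nil, nil
}

// sampleConditions is trimmed from the jq output in this comment.
const sampleConditions = `[
  {
    "message": "Credential minting is disabled by cluster admin",
    "reason": "OperatorDisabledByAdmin",
    "status": "True",
    "type": "Available"
  },
  {
    "message": "Parent credential secret must be restored prior to upgrade: kube-system/gcp-credentials",
    "reason": "MissingRootCredential",
    "status": "False",
    "type": "Upgradeable"
  }
]`

func main() {
	c, err := upgradeBlocker([]byte(sampleConditions))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if c == nil {
		fmt.Println("no Upgradeable=False condition found")
		return
	}
	fmt.Printf("%s: %s\n", c.Reason, c.Message)
}
```

In practice the input would be the output of the `oc get co cloud-credential -o json | jq -r ".status.conditions"` command shown above, rather than an embedded constant.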

Comment 5 Devan Goodwin 2020-11-30 18:33:48 UTC
I've looked through the code and can't immediately see how this is possible. We fork the actuator implementation based on the infrastructure status platform (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/operator/controller.go#L84). The bare metal platform type should fall through to the default case and add a dummy actuator, which contains no logic and is explicitly coded to always set Upgradeable to true. I can't yet see how a bare metal cluster would be instantiating the AWS actuator, but that appears to be what is happening in this case.
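
To make the expected behavior concrete, here is a simplified Go sketch of the dispatch described above. It is not the actual cloud-credential-operator source: the `Actuator` interface, `awsActuator`, `dummyActuator`, and `actuatorFor` are hypothetical names, and only the platform strings and the `CredentialsRootSecretMissing` reason are taken from this bug.

```go
package main

import "fmt"

// Actuator is a hypothetical, simplified stand-in for the operator's
// per-platform actuator.
type Actuator interface {
	Name() string
	// Upgradeable reports whether upgrades are allowed and, if not, why.
	Upgradeable(rootSecretPresent bool) (ok bool, reason string)
}

// awsActuator blocks upgrades when the root credential secret is missing.
type awsActuator struct{}

func (awsActuator) Name() string { return "aws" }

func (awsActuator) Upgradeable(rootSecretPresent bool) (bool, string) {
	if !rootSecretPresent {
		return false, "CredentialsRootSecretMissing"
	}
	return true, ""
}

// dummyActuator contains no logic and always reports upgradeable.
type dummyActuator struct{}

func (dummyActuator) Name() string { return "dummy" }

func (dummyActuator) Upgradeable(bool) (bool, string) { return true, "" }

// actuatorFor picks an actuator from the infrastructure platform type.
// Platforms without a dedicated case fall through to the dummy actuator.
func actuatorFor(platform string) Actuator {
	switch platform {
	case "AWS":
		return awsActuator{}
	default:
		return dummyActuator{}
	}
}

func main() {
	for _, p := range []string{"AWS", "BareMetal"} {
		a := actuatorFor(p)
		ok, reason := a.Upgradeable(false)
		fmt.Printf("%s -> %s actuator, upgradeable=%t reason=%q\n", p, a.Name(), ok, reason)
	}
}
```

Under this model a `BareMetal` cluster can never report `CredentialsRootSecretMissing`, which is why the condition in the description suggests the AWS path was somehow selected.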

It would help a lot to get onto a metal cluster to debug. I'd like to see the infrastructure CR in full (though we have most of that) and, most importantly, the cloud-credential-operator pod logs. Unfortunately, we do not have access to such a cluster or hardware. I will try to track down someone who can get us onto one.

Comment 6 Devan Goodwin 2020-11-30 18:34:51 UTC
oc adm must-gather would be ideal.

Comment 7 Seth Jennings 2020-12-04 03:22:47 UTC
I'm no longer seeing this issue in 4.6 for platform: BareMetal, specifically in 4.6.6 and higher. It could have been fixed before 4.6.6 as well.

Wang Lin, is this still an issue for you? If not, feel free to close.

Comment 8 Devan Goodwin 2020-12-04 12:24:27 UTC
FWIW, I think Lin's issue is probably a separate bug. In that case it looks like GCP is correctly running the GCP actuator, but the logic may be incorrectly assuming it needs an admin cred.
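
As a purely illustrative Go sketch of that hypothesis (not the operator's real logic), the check below only requires the admin credential when the mode is not `Manual`. The `upgradeable` function and its parameters are hypothetical; the `Manual` mode value and the `MissingRootCredential` reason come from Comment 3.

```go
package main

import "fmt"

// upgradeable is a hypothetical check: in Manual mode the admin manages
// credentials, so a missing root secret should not block upgrades.
func upgradeable(credentialsMode string, rootSecretPresent bool) (bool, string) {
	if credentialsMode == "Manual" {
		return true, ""
	}
	if !rootSecretPresent {
		return false, "MissingRootCredential"
	}
	return true, ""
}

func main() {
	// The empty string stands for an unset credentialsMode.
	for _, mode := range []string{"", "Manual"} {
		ok, reason := upgradeable(mode, false)
		fmt.Printf("mode=%q upgradeable=%t reason=%q\n", mode, ok, reason)
	}
}
```

If the real check skipped the mode test in the first branch, a Manual-mode GCP cluster would show the `Upgradeable=False` condition reported in Comment 3.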