Bug 1823631 - Cluster-network-operator got degraded due to `Secret "installer-cloud-credentials" not found`
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Seth Jennings
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks: 1822861
 
Reported: 2020-04-14 06:08 UTC by weiwei jiang
Modified: 2020-07-13 17:27 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:27:23 UTC
Target Upstream Version:


Links
System: Github | ID: openshift cloud-credential-operator pull 183 | Status: closed | Summary: Bug 1823631: fix bootstrap static pod | Last Updated: 2021-02-16 12:02:35 UTC
System: Red Hat Product Errata | ID: RHBA-2020:2409 | Last Updated: 2020-07-13 17:27:45 UTC

Description weiwei jiang 2020-04-14 06:08:56 UTC
Description of problem:
When trying to install IPI on OSP with the kuryr networkType, the cluster-network-operator became degraded:
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-04-14T05:09:28Z"
  generation: 1
  name: network
  resourceVersion: "2129"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/network
  uid: 5b08bea5-7edb-45e1-8c54-1db830e0d39b
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-04-14T05:09:28Z"
    message: 'Internal error while reconciling platform networking resources: failed
      to authenticate to OpenStack: Failed to get installer-cloud-credentials Secret
      with OpenStack credentials: Secret "installer-cloud-credentials" not found'
    reason: BootstrapError
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-04-14T05:09:28Z"
    status: "True"
    type: Upgradeable
  extension: null


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-14-015024

How reproducible:
Always

Steps to Reproduce:
1. Run an IPI installation on OSP with the kuryr networkType
2. Check the installation progress

Actual results:
got:
level=info msg="API v1.18.0-rc.1 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
E0414 05:25:42.104346     631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=4361&timeoutSeconds=404&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
E0414 05:46:00.090627     631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5880&timeoutSeconds=513&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
E0414 05:46:01.377358     631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5880&timeoutSeconds=479&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
level=error msg="Cluster operator network Degraded is True with BootstrapError: Internal error while reconciling platform networking resources: failed to authenticate to OpenStack: Failed to get installer-cloud-credentials Secret with OpenStack credentials: Secret \"installer-cloud-credentials\" not found"


Expected results:
The installation should complete without the network operator becoming degraded.

Additional info:

Comment 2 Michał Dulko 2020-04-14 12:44:48 UTC
Some more info from the CCO ReplicaSet:

Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  71m (x19 over 82m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  52m (x19 over 62m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  31m (x19 over 42m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  11m (x19 over 22m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  22s (x15 over 111s)  replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []

Comment 3 Michał Dulko 2020-04-14 12:52:17 UTC
Some additional info: with kuryr-kubernetes, CCO is required to run alongside CNO (so before anything else is started), as CNO needs cloud credentials to create the cloud resources required to run Kuryr. For some reason it stopped being created at that point. I suspect the cause is https://github.com/openshift/cloud-credential-operator/commit/38321955558090602b9d4a06142f7da8b45979d6, but I'm not sure about it.

Comment 4 Jon Uriarte 2020-04-14 13:31:13 UTC
Same issue with 4.5.0-0.nightly-2020-04-14-075212

Comment 5 Seth Jennings 2020-04-14 15:33:40 UTC
The issue is that CCO is blocked from starting by SCC. Thus it cannot process this CredentialsRequest CR: https://github.com/openshift/cluster-network-operator/blob/master/manifests/0000_70_cluster-network-operator_01_credentialsrequest.yaml

Checking with apiserver team to see if anything changed recently wrt SCCs.

Comment 6 Seth Jennings 2020-04-14 16:22:04 UTC
Weiwei, what was the last 4.5 build that passed this test?

At first thought, I don't think https://github.com/openshift/cloud-credential-operator/commit/38321955558090602b9d4a06142f7da8b45979d6 is the cause.

There was a fairly recent change to how the SCCs are managed, and it is possible that SCC creation (or at least their successful operation) is blocked behind CNO being available, which is blocked by CCO running and processing the CR, which is in turn blocked behind SCCs (a deadlock).
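
The suspected chain can be written down as a small "blocked by" graph and checked for a cycle. This is only an illustrative sketch of the hypothesis in this comment, not a description of how the installer actually orders components; the node names and the `find_cycle` helper are made up for the example.

```python
from typing import Dict, List, Optional

# Hypothetical model of the suspected chain: each key is blocked by its values.
SUSPECTED: Dict[str, List[str]] = {
    "SCC": ["CNO"],  # SCCs suspected to wait for CNO being available
    "CNO": ["CCO"],  # CNO waits for CCO to process the CredentialsRequest
    "CCO": ["SCC"],  # CCO pod cannot be admitted without SCCs
}


def find_cycle(blocked_by: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    visiting: List[str] = []  # nodes on the current depth-first path
    done = set()              # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the path: slice out the loop.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in blocked_by.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for node in blocked_by:
        cycle = visit(node)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(SUSPECTED))
```

If any one of the three edges does not hold in practice, the cycle disappears, which is why confirming each "blocked by" relationship matters before treating this as the root cause.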

Comment 7 Seth Jennings 2020-04-14 19:15:04 UTC
Sending back to Devan.

I'm not sure what happened, but I think something has changed with respect to when SCCs are enforced versus when they are present versus when the admission plugin is working, and that has done away with the pre-CNO window that worked before. I just can't find out what changed. I asked the master team about it but had no luck.

I double-checked whether I changed anything in /manifests or the bindata assets in a way that would cause this, but didn't see anything. The pod isn't even created, so the actual code changes are not relevant.

Some other references:

https://bugzilla.redhat.com/show_bug.cgi?id=1820687
https://github.com/openshift/origin/pull/24828

https://bugzilla.redhat.com/show_bug.cgi?id=1817099

https://search.svc.ci.openshift.org/?search=unable+to+validate+against+any+security+context+constraint&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

Comment 10 Michał Dulko 2020-04-15 12:48:47 UTC
So I ran a bisect, building CCO and trying it out at each commit. It turns out:

5b648eb0 - ok
01b26765 - ok
40ac6c16 - broken

This means that the PR I mentioned above is the culprit. What I found is that the broken part is the CCO that runs as a static pod on the bootstrap node. This is its definition:

apiVersion: v1
kind: Pod
metadata:
  name: cloud-credential-operator
  namespace: openshift-cloud-credential-operator
spec:
  containers:
  - command:
    - /usr/bin/cloud-credential-operator
    args:
        - operator
    - --log-level=debug
    - --kubeconfig=/etc/kubernetes/secrets/kubeconfig
    image: docker.io/dulek/cloud-credential-operator:latest
    imagePullPolicy: IfNotPresent
    name: manager
    volumeMounts:
    - mountPath: /etc/kubernetes/secrets
      name: secrets
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/bootstrap-secrets
    name: secrets

For some reason the pod never gets created. I wasn't able to figure out why, but I guess it'll be very easy to reproduce now.
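
For anyone repeating this, the bisect above amounts to a binary search over the commit history. A minimal sketch follows, assuming the three hashes listed in this comment are ordered oldest to newest, that the first is known good and the last known bad, and that once the breakage appears it stays. `first_bad` and the `is_broken` predicate are hypothetical stand-ins for "build CCO at this commit and check whether the bootstrap pod comes up".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_broken: Callable[[str], bool]) -> str:
    """Return the first broken commit.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is broken, and breakage persists once introduced.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one broken commit")
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Results reported in this comment, encoded as a lookup (ordering is assumed).
RESULTS = {"5b648eb0": False, "01b26765": False, "40ac6c16": True}

if __name__ == "__main__":
    print(first_bad(list(RESULTS), lambda commit: RESULTS[commit]))
```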

Comment 11 Michał Dulko 2020-04-15 12:50:59 UTC
Oh wait, the failure is actually there:

/etc/kubernetes/manifests/cloud-credential-operator-pod.yaml: couldn't parse as pod(yaml: line 11: found character that cannot start any token)

Comment 12 Seth Jennings 2020-04-15 13:58:15 UTC
Gah, looks like my editor put a tab instead of 4 spaces :-/  That's what I get for editing inline YAML in a .go file.
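
YAML does not allow tab characters for indentation, which is what produces the "found character that cannot start any token" error. A check along the following lines could flag such a manifest before it is shipped. This is a sketch using only the Python standard library, not what the CCO repository actually does; `find_tab_indents` and `SAMPLE` are illustrative names, and the exact position of the tab in `SAMPLE` is an assumption based on the "line 11" in the error message.

```python
from typing import List, Tuple


def find_tab_indents(manifest: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for each line whose leading whitespace has a tab."""
    offenders = []
    for number, line in enumerate(manifest.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove from the front.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            offenders.append((number, line))
    return offenders


# Fragment modeled on the static pod definition quoted earlier in this bug;
# the tab on line 11 is written explicitly as "\t" (assumed placement).
SAMPLE = "\n".join([
    "apiVersion: v1",
    "kind: Pod",
    "metadata:",
    "  name: cloud-credential-operator",
    "  namespace: openshift-cloud-credential-operator",
    "spec:",
    "  containers:",
    "  - command:",
    "    - /usr/bin/cloud-credential-operator",
    "    args:",
    "\t- operator",
    "    - --log-level=debug",
])

if __name__ == "__main__":
    for number, line in find_tab_indents(SAMPLE):
        print(f"line {number}: {line!r}")
```

Running a check like this against the rendered manifest in a unit test would catch the problem regardless of whether the YAML lives in a .go file or in a separate asset.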

Comment 13 Seth Jennings 2020-04-15 14:11:34 UTC
Thanks for running that down Michal!

That inline static pod definition in a .go file is not the best. I'd think to check /bindata, not that render.go file.

I'll move that pod definition into bindata at some point.

Comment 14 Michał Dulko 2020-04-15 14:49:00 UTC
(In reply to Seth Jennings from comment #13)
> Thanks for running that down Michal!
> 
> That inline static pod definition in a .go file is not the best. I'd think to
> check /bindata, not that render.go file.
> 
> I'll move that pod definition into bindata at some point.

This is a great idea, as I spent 20 minutes looking for that file without success. :)

Comment 15 Seth Jennings 2020-04-15 23:56:12 UTC
Upgrade presubmit tests are having a hard time passing across the board, which is blocking the merge of the PR:
https://deck-ci.svc.ci.openshift.org/?type=presubmit&job=*master-e2e-aws-upgrade

Comment 18 weiwei jiang 2020-04-23 10:19:45 UTC
Checked with 4.5.0-0.nightly-2020-04-21-103613 and the issue is fixed. Moving to VERIFIED.

Comment 19 errata-xmlrpc 2020-07-13 17:27:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

