Description of problem:
When trying to install IPI on OSP with the kuryr networkType, the cluster-network-operator becomes degraded.

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-04-14T05:09:28Z"
  generation: 1
  name: network
  resourceVersion: "2129"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/network
  uid: 5b08bea5-7edb-45e1-8c54-1db830e0d39b
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-04-14T05:09:28Z"
    message: 'Internal error while reconciling platform networking resources: failed
      to authenticate to OpenStack: Failed to get installer-cloud-credentials Secret
      with OpenStack credentials: Secret "installer-cloud-credentials" not found'
    reason: BootstrapError
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-04-14T05:09:28Z"
    status: "True"
    type: Upgradeable
  extension: null

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-14-015024

How reproducible:
Always

Steps to Reproduce:
1. Try IPI on OSP with the kuryr networkType
2. Check installation progress
3.

Actual results:
got:

level=info msg="API v1.18.0-rc.1 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
E0414 05:25:42.104346 631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=4361&timeoutSeconds=404&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
E0414 05:46:00.090627 631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5880&timeoutSeconds=513&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
E0414 05:46:01.377358 631 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.wj45krr414b.0414-p0m.qe.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=5880&timeoutSeconds=479&watch=true: dial tcp 10.46.22.47:6443: connect: connection refused
level=error msg="Cluster operator network Degraded is True with BootstrapError: Internal error while reconciling platform networking resources: failed to authenticate to OpenStack: Failed to get installer-cloud-credentials Secret with OpenStack credentials: Secret \"installer-cloud-credentials\" not found"

Expected results:
Should work well.

Additional info:
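For reference, a quick way to confirm this symptom (a sketch with my own oc commands, not taken from the report above):

  # Show why the network clusteroperator is degraded:
  oc get clusteroperator network -o yaml | grep -B6 'type: Degraded'

  # The error names the Secret but not its namespace, so search cluster-wide:
  oc get secrets -A --field-selector metadata.name=installer-cloud-credentials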
Some more info from the CCO ReplicaSet:

Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  71m (x19 over 82m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  52m (x19 over 62m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  31m (x19 over 42m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  11m (x19 over 22m)   replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
  Warning  FailedCreate  22s (x15 over 111s)  replicaset-controller  Error creating: pods "cloud-credential-operator-7cccb96db-" is forbidden: unable to validate against any security context constraint: []
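A couple of hedged diagnostics for the above (my own sketch, assuming a standard 4.5 cluster): the empty list [] at the end of the SCC error suggests no SCCs existed at admission time, which can be checked directly:

  # If this returns nothing, no SCCs have been created yet:
  oc get scc

  # The ReplicaSet's events show the same FailedCreate loop:
  oc describe replicaset -n openshift-cloud-credential-operator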
Ah, just some additional info: with kuryr-kubernetes, CCO is required to run alongside CNO (i.e. before anything else is started), because CNO needs cloud credentials to create the cloud resources required to run Kuryr. For some reason it stopped getting created at that point. I'd suspect the cause is https://github.com/openshift/cloud-credential-operator/commit/38321955558090602b9d4a06142f7da8b45979d6, but obviously I'm not sure about it.
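For anyone trying to observe this: during bootstrap CCO is expected to run as a static pod on the bootstrap node, so it can be checked there directly (a sketch, assuming the usual OpenShift bootstrap layout and SSH access as core):

  ssh core@<bootstrap-ip>

  # CCO should be among the bootstrap static pod manifests:
  ls /etc/kubernetes/manifests/
  sudo crictl ps -a | grep cloud-credential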
Same issue with 4.5.0-0.nightly-2020-04-14-075212
The issue is that CCO is blocked from starting by SCC. Thus it cannot process this CredentialsRequest CR: https://github.com/openshift/cluster-network-operator/blob/master/manifests/0000_70_cluster-network-operator_01_credentialsrequest.yaml. Checking with the apiserver team to see if anything changed recently wrt SCCs.
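While CCO is blocked, that CredentialsRequest stays unprovisioned, which can be seen from its status (a sketch; the column names are my own):

  oc get credentialsrequests -n openshift-cloud-credential-operator \
    -o custom-columns=NAME:.metadata.name,PROVISIONED:.status.provisioned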
Weiwei, what was the last 4.5 build that passed this test? At first thought, I don't think https://github.com/openshift/cloud-credential-operator/commit/38321955558090602b9d4a06142f7da8b45979d6 is the cause. There was a fairly recent change to how the SCCs are managed, and it is possible that SCC creation (or at least their successful operation) is blocked behind CNO being available, which is blocked behind CCO running and processing the CR, which in turn is blocked behind the SCCs (a deadlock).
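To check for the suspected deadlock, both operators and the pending CCO pods can be inspected side by side (hedged sketch, my own commands):

  # network waits on cloud-credential, whose pods never get admitted:
  oc get clusteroperators network cloud-credential
  oc get pods -n openshift-cloud-credential-operator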
Sending back to Devan. I'm not sure what happened, but I think something has changed in when SCC enforcement starts relative to the SCCs being present and the admission plugin running, which has done away with the pre-CNO window that worked before. I just can't find out what changed. Asked the master team about it but no luck. I double-checked whether I changed anything in /manifests or the bindata assets in a way that would cause this, but didn't see anything. The pod isn't even created, so the actual code changes are not relevant.

Some other references:
https://bugzilla.redhat.com/show_bug.cgi?id=1820687
https://github.com/openshift/origin/pull/24828
https://bugzilla.redhat.com/show_bug.cgi?id=1817099
https://search.svc.ci.openshift.org/?search=unable+to+validate+against+any+security+context+constraint&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520
So I ran a bisect, building CCO and trying it out. Turns out:

5b648eb0 - ok
01b26765 - ok
40ac6c16 - broken

This means the PR I mentioned above is the culprit. What I found is that what's broken is the CCO that runs as a static pod on the bootstrap node. This is its definition:

apiVersion: v1
kind: Pod
metadata:
  name: cloud-credential-operator
  namespace: openshift-cloud-credential-operator
spec:
  containers:
  - command:
    - /usr/bin/cloud-credential-operator
    args:
    - operator
    - --log-level=debug
    - --kubeconfig=/etc/kubernetes/secrets/kubeconfig
    image: docker.io/dulek/cloud-credential-operator:latest
    imagePullPolicy: IfNotPresent
    name: manager
    volumeMounts:
    - mountPath: /etc/kubernetes/secrets
      name: secrets
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/bootstrap-secrets
    name: secrets

For some reason the pod never gets created. I wasn't able to figure out why, but I guess it'll be very easy to reproduce now.
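If the pod never shows up, the kubelet on the bootstrap node is the place to look, since it logs why a static pod manifest was rejected (a sketch; exact wording varies by version):

  # On the bootstrap node:
  journalctl -u kubelet --no-pager | grep -i "couldn't parse"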
Oh wait, the failure is actually there:

/etc/kubernetes/manifests/cloud-credential-operator-pod.yaml: couldn't parse as pod (yaml: line 11: found character that cannot start any token)
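To make the offending character visible (my own sketch): the scanner points at line 11, and cat -A prints tabs as ^I:

  sed -n '11p' /etc/kubernetes/manifests/cloud-credential-operator-pod.yaml | cat -A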
Gah, looks like my editor put a tab instead of 4 spaces :-/ That's what comes of editing inline YAML in a .go file.
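The failure mode is easy to reproduce in isolation, since YAML forbids tabs as indentation (hypothetical file, my own sketch):

  printf 'spec:\n\tcontainers: []\n' > /tmp/tab.yaml
  oc create -f /tmp/tab.yaml --dry-run=client
  # fails with: yaml: found character that cannot start any token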
Thanks for running that down, Michal!

That inline static pod definition in a .go file is not the best; one would think to check /bindata, not that render.go file.

I'll move that pod definition into bindata at some point.
(In reply to Seth Jennings from comment #13)
> Thanks for running that down, Michal!
>
> That inline static pod definition in a .go file is not the best; one would
> think to check /bindata, not that render.go file.
>
> I'll move that pod definition into bindata at some point.

This is a great idea, as I spent 20 minutes looking for that file without success. :)
The upgrade presubmit tests are having a hard time passing across the board, which is blocking the merge of the PR: https://deck-ci.svc.ci.openshift.org/?type=presubmit&job=*master-e2e-aws-upgrade
Checked with 4.5.0-0.nightly-2020-04-21-103613 and the issue is fixed; moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409