Bug 1846200

Summary: Install fails waiting for bootstrap complete: AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set
Product: OpenShift Container Platform
Component: Cloud Credential Operator
Reporter: W. Trevor King <wking>
Assignee: Seth Jennings <sjenning>
QA Contact: wang lin <lwan>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: 4.6
Target Release: 4.6.0
CC: gshereme, jdiaz, lwan, scuppett, sjenning
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 16:06:32 UTC

Description W. Trevor King 2020-06-11 04:51:58 UTC
A FIPS CI job on an e2e-suite PR failed in setup with [1]:

E0611 00:52:21.614878      45 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ci-op-8qi7qqf9-b5b45.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=3194&timeoutSeconds=577&watch=true: dial tcp 52.52.41.37:6443: connect: connection refused
level=info msg="Pulling debug logs from the bootstrap machine"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20200611010248.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition" 

From the log bundle [2]:

$ grep '"attempt": [^0]' bootstrap/containers/*.inspect 
bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect:      "attempt": 11,
bootstrap/containers/cluster-version-operator-1f6115911bbfa402033b04dcbee0fee34760f69d80d236666a3d7c712efb9cd1.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-de38bbc76573b2c14da39eb8d966dc8698dd0286d75acc0827901dce64405bf6.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-insecure-readyz-7526e4441fd4413f7fc6fbc29785f486342621c74dc7c7639725328ed9834008.inspect:      "attempt": 2,
bootstrap/containers/kube-apiserver-insecure-readyz-bc4f1fa5dda3d31498b9cc06bbe86318697aa3e36567948b1d592180979ebb46.inspect:      "attempt": 1,
bootstrap/containers/kube-controller-manager-c1d4b4e90e251f9e20e1afa9ab9a7adc648230fb16ce667325e847daa6a52126.inspect:      "attempt": 1,
bootstrap/containers/kube-scheduler-ba28db102059ca25af13633430885cdfa3f9c4efafdf3dc6f02bf4fd15294a57.inspect:      "attempt": 1,
bootstrap/containers/setup-4affb608eb4017587629de44dcf89e31b8a8e40fc52884027c8e061e8b573b73.inspect:      "attempt": 1,
$ tail -n2 bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.log 
time="2020-06-11T01:02:38Z" level=info msg="setting up AWS pod identity controller"
time="2020-06-11T01:02:38Z" level=fatal msg="unable to register controllers to the manager" error="AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set"

Looks like AWS_POD_IDENTITY_WEBHOOK_IMAGE was added about a week ago [3].  From the deployment manifest [4]:

        - name: RELEASE_VERSION
          value: 0.0.1-2020-06-11-001518
        - name: AWS_POD_IDENTITY_WEBHOOK_IMAGE
          value: registry.svc.ci.openshift.org/ci-op-8qi7qqf9/stable@sha256:7181998a260035fdcc06a65fb261289503f8f6b07d36f2db6a14b7161fc15d0c

So that looks like it's set to me.  But the bootstrap pod has:

$ jq .info.runtimeSpec.process.env bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect
[
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "TERM=xterm",
  "HOSTNAME=ip-10-0-21-86",
  "foo=bar",
  "OPENSHIFT_BUILD_NAME=cloud-credential-operator",
  "OPENSHIFT_BUILD_NAMESPACE=ci-op-gkt0g4hd",
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "container=oci"
]
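
The variable is absent from that list. As a quick sanity check against the inspect file (same jq path as above; filename from the log bundle), the pipeline below reports "absent" for this container:

```shell
# Check whether the webhook image variable made it into the container's
# runtime environment; with the env list above, this prints "absent".
jq -r '.info.runtimeSpec.process.env[]' \
  bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect \
  | grep -q '^AWS_POD_IDENTITY_WEBHOOK_IMAGE=' && echo present || echo absent
```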

And the rendered pod spec was:

$ cat rendered-assets/openshift/cco-bootstrap/bootstrap-manifests/cloud-credential-operator-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cloud-credential-operator
  namespace: openshift-cloud-credential-operator
spec:
  containers:
  - command:
    - /usr/bin/cloud-credential-operator
    args:
    - operator
    - --log-level=debug
    - --kubeconfig=/etc/kubernetes/secrets/kubeconfig
    image: registry.svc.ci.openshift.org/ci-op-8qi7qqf9/stable@sha256:db9944c2ca1c542822860e01b6a17a22cd36dbb615978dc41141f1fe97ba92d1
    imagePullPolicy: IfNotPresent
    name: cloud-credential-operator
    volumeMounts:
    - mountPath: /etc/kubernetes/secrets
      name: secrets
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/bootstrap-secrets
    name: secrets

So maybe cloud-credential-operator#195 missed some changes that need to happen to the generated bootstrap pod YAML?  Or we need to soften the requirement, like the in-flight suggestion in [5].

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270/artifacts/e2e-aws-fips/installer/
[3]: https://github.com/openshift/cloud-credential-operator/pull/195/files#diff-77b40adb1e22a95f65dd1acda430bc80R110
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270/artifacts/release-latest/release-payload-latest/0000_50_cloud-credential-operator_03-deployment.yaml
[5]: https://github.com/openshift/cloud-credential-operator/pull/206#discussion_r438448965

Comment 2 Seth Jennings 2020-06-15 12:41:44 UTC
As far as I was aware, the CCO was only run with the `render` command during install, not in actual operator mode.  Is this no longer true?  If so, why did it change?

Comment 3 Joel Diaz 2020-06-15 14:12:31 UTC
The CCO binary is run with the `render` command to generate a limited Pod definition (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/cmd/render/render.go#L38-L62) that runs as a static Pod on the bootstrap node.

Comment 4 Greg Sheremeta 2020-06-18 18:10:26 UTC
PR in code review.

Comment 7 wang lin 2020-06-30 09:32:23 UTC
The bug has been fixed; the operator now logs a warning instead of exiting:

INFO[0003] registering components                       
INFO[0003] setting up scheme                            
INFO[0003] setting up controller                        
INFO[0005] Setting up secret annotator. Platform Type is AWS 
INFO[0006] setting up AWS pod identity controller       
WARN[0006] AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set, AWS pod identity webhook will not be deployed  controller=awspodidentity
INFO[0008] setting up AWS OIDC Discovery Endpoint Controller 
INFO[0012] initializing AWS actuator                    
INFO[0012] starting the cmd

Comment 9 errata-xmlrpc 2020-10-27 16:06:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196