Bug 1846200 - Install fails waiting for bootstrap complete: AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set
Summary: Install fails waiting for bootstrap complete: AWS_POD_IDENTITY_WEBHOOK_IMAGE ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Seth Jennings
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-11 04:51 UTC by W. Trevor King
Modified: 2020-10-27 16:06 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:06:32 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 209 0 None closed Bug 1846200: make AWS_POD_IDENTITY_WEBHOOK_IMAGE not set non-fatal 2020-09-03 23:31:09 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:06:55 UTC

Description W. Trevor King 2020-06-11 04:51:58 UTC
A FIPS CI job on an e2e-suite PR failed in setup with [1]:

E0611 00:52:21.614878      45 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ci-op-8qi7qqf9-b5b45.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=3194&timeoutSeconds=577&watch=true: dial tcp 52.52.41.37:6443: connect: connection refused
level=info msg="Pulling debug logs from the bootstrap machine"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20200611010248.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition" 

From the log bundle [2]:

$ grep '"attempt": [^0]' bootstrap/containers/*.inspect 
bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect:      "attempt": 11,
bootstrap/containers/cluster-version-operator-1f6115911bbfa402033b04dcbee0fee34760f69d80d236666a3d7c712efb9cd1.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-de38bbc76573b2c14da39eb8d966dc8698dd0286d75acc0827901dce64405bf6.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-insecure-readyz-7526e4441fd4413f7fc6fbc29785f486342621c74dc7c7639725328ed9834008.inspect:      "attempt": 2,
bootstrap/containers/kube-apiserver-insecure-readyz-bc4f1fa5dda3d31498b9cc06bbe86318697aa3e36567948b1d592180979ebb46.inspect:      "attempt": 1,
bootstrap/containers/kube-controller-manager-c1d4b4e90e251f9e20e1afa9ab9a7adc648230fb16ce667325e847daa6a52126.inspect:      "attempt": 1,
bootstrap/containers/kube-scheduler-ba28db102059ca25af13633430885cdfa3f9c4efafdf3dc6f02bf4fd15294a57.inspect:      "attempt": 1,
bootstrap/containers/setup-4affb608eb4017587629de44dcf89e31b8a8e40fc52884027c8e061e8b573b73.inspect:      "attempt": 1,
$ tail -n2 bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.log 
time="2020-06-11T01:02:38Z" level=info msg="setting up AWS pod identity controller"
time="2020-06-11T01:02:38Z" level=fatal msg="unable to register controllers to the manager" error="AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set"
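
The grep above can also be done structurally, which avoids depending on how the JSON happens to be formatted. What follows is a minimal sketch, not part of the original report: the helper names `find_key` and `restarted_containers` are hypothetical, and because the exact location of the "attempt" field inside an .inspect file is not shown here, the code searches the whole document for the first "attempt" key rather than assuming a path.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Return the first value stored under `key` anywhere in a nested JSON value, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def restarted_containers(inspect_dir):
    """Map each *.inspect file name to its attempt count, keeping only counts above zero."""
    restarts = {}
    for path in sorted(Path(inspect_dir).glob("*.inspect")):
        # Assumption: each .inspect file is a single JSON document.
        attempt = find_key(json.loads(path.read_text()), "attempt")
        if isinstance(attempt, int) and attempt > 0:
            restarts[path.name] = attempt
    return restarts
```

Calling `restarted_containers("bootstrap/containers")` on an unpacked log bundle would return one entry per container that was restarted at least once, so a crash-looping container stands out by its high count.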

Looks like AWS_POD_IDENTITY_WEBHOOK_IMAGE was introduced about a week ago [3].  From the manifest [4]:

        - name: RELEASE_VERSION
          value: 0.0.1-2020-06-11-001518
        - name: AWS_POD_IDENTITY_WEBHOOK_IMAGE
          value: registry.svc.ci.openshift.org/ci-op-8qi7qqf9/stable@sha256:7181998a260035fdcc06a65fb261289503f8f6b07d36f2db6a14b7161fc15d0c

So that looks like it's set to me.  But the bootstrap pod has:

$ jq .info.runtimeSpec.process.env bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect
[
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "TERM=xterm",
  "HOSTNAME=ip-10-0-21-86",
  "foo=bar",
  "OPENSHIFT_BUILD_NAME=cloud-credential-operator",
  "OPENSHIFT_BUILD_NAMESPACE=ci-op-gkt0g4hd",
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  "container=oci"
]
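
The comparison between the deployment manifest and the bootstrap container's environment can be automated. The sketch below is illustrative only: it reads the same `.info.runtimeSpec.process.env` path that the jq command uses, and the function names `runtime_env` and `missing_vars` are hypothetical.

```python
import json


def runtime_env(inspect_doc):
    """Parse the 'NAME=value' list at .info.runtimeSpec.process.env into a dict."""
    entries = inspect_doc["info"]["runtimeSpec"]["process"]["env"]
    env = {}
    for entry in entries:
        # Split on the first '=' only, since values may themselves contain '='.
        name, _, value = entry.partition("=")
        env[name] = value  # a repeated name (such as PATH above) keeps its last value
    return env


def missing_vars(inspect_doc, required):
    """Return the names in `required` that are absent from the container environment."""
    env = runtime_env(inspect_doc)
    return [name for name in required if name not in env]


def load_inspect(path):
    """Load one .inspect file, assuming it holds a single JSON document."""
    with open(path) as handle:
        return json.load(handle)
```

Run against the environment shown above with `["RELEASE_VERSION", "AWS_POD_IDENTITY_WEBHOOK_IMAGE"]` as the required names, `missing_vars` would report both, since neither variable from the manifest appears in the bootstrap container.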

And the rendered pod spec was:

$ cat rendered-assets/openshift/cco-bootstrap/bootstrap-manifests/cloud-credential-operator-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cloud-credential-operator
  namespace: openshift-cloud-credential-operator
spec:
  containers:
  - command:
    - /usr/bin/cloud-credential-operator
    args:
    - operator
    - --log-level=debug
    - --kubeconfig=/etc/kubernetes/secrets/kubeconfig
    image: registry.svc.ci.openshift.org/ci-op-8qi7qqf9/stable@sha256:db9944c2ca1c542822860e01b6a17a22cd36dbb615978dc41141f1fe97ba92d1
    imagePullPolicy: IfNotPresent
    name: cloud-credential-operator
    volumeMounts:
    - mountPath: /etc/kubernetes/secrets
      name: secrets
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/bootstrap-secrets
    name: secrets

So maybe cred#195 missed some changes that need to happen to the generated bootstrap pod YAML?  Or we need a softening on the requirement like the in-flight [5].

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270/artifacts/e2e-aws-fips/installer/
[3]: https://github.com/openshift/cloud-credential-operator/pull/195/files#diff-77b40adb1e22a95f65dd1acda430bc80R110
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25094/pull-ci-openshift-origin-master-e2e-aws-fips/3270/artifacts/release-latest/release-payload-latest/0000_50_cloud-credential-operator_03-deployment.yaml
[5]: https://github.com/openshift/cloud-credential-operator/pull/206#discussion_r438448965

Comment 2 Seth Jennings 2020-06-15 12:41:44 UTC
As far as I was aware, the CCO was only run with the `render` command during install, not in actual operator mode.  Is this not true any more? If so, why did this change?

Comment 3 Joel Diaz 2020-06-15 14:12:31 UTC
The CCO binary is run to render a limited Pod definition (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/cmd/render/render.go#L38-L62), and that Pod does run as a static Pod on the bootstrap node.

Comment 4 Greg Sheremeta 2020-06-18 18:10:26 UTC
PR in code review.

Comment 7 wang lin 2020-06-30 09:32:23 UTC
The bug has been fixed.

INFO[0003] registering components                       
INFO[0003] setting up scheme                            
INFO[0003] setting up controller                        
INFO[0005] Setting up secret annotator. Platform Type is AWS 
INFO[0006] setting up AWS pod identity controller       
WARN[0006] AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set, AWS pod identity webhook will not be deployed  controller=awspodidentity
INFO[0008] setting up AWS OIDC Discovery Endpoint Controller 
INFO[0012] initializing AWS actuator                    
INFO[0012] starting the cmd
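
The WARN line in the log above reflects the behavioral change: an unset AWS_POD_IDENTITY_WEBHOOK_IMAGE now skips the webhook instead of aborting controller registration. The operator itself is written in Go, so the following Python is only a rough illustration of that pattern and not the actual patch; the function name `pod_identity_image` and the logger name are hypothetical.

```python
import logging
import os

log = logging.getLogger("awspodidentity")

ENV_NAME = "AWS_POD_IDENTITY_WEBHOOK_IMAGE"


def pod_identity_image(environ=None):
    """Return the webhook image to deploy, or None when the variable is unset or empty."""
    environ = os.environ if environ is None else environ
    image = environ.get(ENV_NAME)
    if not image:
        # Previously this case was fatal; now it is logged and the webhook is skipped.
        log.warning("%s is not set, AWS pod identity webhook will not be deployed", ENV_NAME)
        return None
    return image
```

A caller would treat a `None` result as "do not deploy the webhook" and carry on setting up the remaining controllers, which matches the log sequence shown in this comment.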

Comment 9 errata-xmlrpc 2020-10-27 16:06:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

