Bug 1911257 - [aws-c2s] failed to create cluster, kube-cloud-config was not created
Summary: [aws-c2s] failed to create cluster, kube-cloud-config was not created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Matthew Staebler
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-28 11:58 UTC by Yunfei Jiang
Modified: 2021-02-24 15:49 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
This is a bug in new functionality added to 4.7.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:49:10 UTC
Target Upstream Version:
Embargoed:


Attachments
log-bundle-20201224121756.tar.gz (1.56 MB, application/gzip)
2020-12-28 11:58 UTC, Yunfei Jiang
m11_kubelet.service.log (1.42 MB, text/plain)
2020-12-29 09:46 UTC, Yunfei Jiang


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:49:26 UTC

Description Yunfei Jiang 2020-12-28 11:58:13 UTC
Created attachment 1742642 [details]
log-bundle-20201224121756.tar.gz

Creating a cluster on C2S failed during the bootstrap process (see the attached log-bundle-20201224121756.tar.gz).

Per story https://issues.redhat.com/browse/CORS-1584, the trusted CA chain should be stored in the openshift-config-managed/kube-cloud-config ConfigMap, but that ConfigMap does not exist:
```
./oc get configmap -n openshift-config-managed kube-cloud-config
Error from server (NotFound): configmaps "kube-cloud-config" not found
```

Instead, the trusted CA was stored in kube-root-ca.crt:
```
./oc get configmap -n openshift-config-managed kube-root-ca.crt -oyaml
apiVersion: v1
data:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
<--snip-->
```


All pods are in Pending status:
```
NAMESPACE                                          NAME                                                      READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-fcd64cdd5-zd89z              0/1     Pending   0          14h
openshift-authentication-operator                  authentication-operator-547b4c57dd-zh727                  0/1     Pending   0          14h
openshift-cloud-credential-operator                cloud-credential-operator-55ff8976c9-db5wj                0/2     Pending   0          14h
openshift-cluster-machine-approver                 machine-approver-845d5b794b-lv87q                         0/2     Pending   0          14h
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-bdb7f64c9-l84nl              0/1     Pending   0          14h
openshift-cluster-storage-operator                 cluster-storage-operator-74d7f47bf6-wwkvw                 0/1     Pending   0          14h
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-9fdd68d74-jr7w8          0/1     Pending   0          14h
openshift-cluster-version                          cluster-version-operator-59b5c54d75-nbx47                 0/1     Pending   0          14h
openshift-config-operator                          openshift-config-operator-9c555879f-jt2vd                 0/1     Pending   0          14h
openshift-controller-manager-operator              openshift-controller-manager-operator-6cbffdc798-twngw    0/1     Pending   0          14h
openshift-dns-operator                             dns-operator-5fd98db9cd-wh68h                             0/2     Pending   0          14h
openshift-etcd-operator                            etcd-operator-756b494c88-kls64                            0/1     Pending   0          14h
openshift-image-registry                           cluster-image-registry-operator-5dbd8889fd-4d8cc          0/1     Pending   0          14h
openshift-ingress-operator                         ingress-operator-7869bc8864-6vxnj                         0/2     Pending   0          14h
openshift-insights                                 insights-operator-56797c4f9b-kff7n                        0/1     Pending   0          14h
openshift-kube-apiserver-operator                  kube-apiserver-operator-5c77c9c859-q6njj                  0/1     Pending   0          14h
openshift-kube-controller-manager-operator         kube-controller-manager-operator-669859c4fd-5kqjh         0/1     Pending   0          14h
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-6fd6b9997-xz9s9         0/1     Pending   0          14h
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-7b778676c7-dq9cv   0/1     Pending   0          14h
openshift-machine-api                              cluster-autoscaler-operator-5d48f5ff8f-xglb6              0/2     Pending   0          14h
openshift-machine-api                              cluster-baremetal-operator-584b9999fc-6trt7               0/1     Pending   0          14h
openshift-machine-api                              machine-api-operator-796744bff-6tcmp                      0/2     Pending   0          14h
openshift-machine-config-operator                  machine-config-operator-5c8b66cc4f-wf5f4                  0/1     Pending   0          14h
openshift-marketplace                              marketplace-operator-684f669d55-c7w72                     0/1     Pending   0          14h
openshift-monitoring                               cluster-monitoring-operator-5555868585-2gtdv              0/2     Pending   0          14h
openshift-network-operator                         network-operator-8558975454-mgglx                         0/1     Pending   0          14h
openshift-operator-lifecycle-manager               catalog-operator-7c9b666d58-7b6cz                         0/1     Pending   0          14h
openshift-operator-lifecycle-manager               olm-operator-77748d5dcb-kj9md                             0/1     Pending   0          14h
openshift-service-ca-operator                      service-ca-operator-6fb466b7cf-rjvzm                      0/1     Pending   0          14h
```

>> install-config.yaml:
```
apiVersion: v1
baseDomain: govcloudemu.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: yunjiang-m6
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.119.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-iso-east-1
    amiID: ami-0996bc3576195a2f0
    subnets:
    - subnet-04018b54f216345c5
publish: Internal
credentialsMode: Manual
additionalTrustBundle: <SHIFT-CERT-CA and IMAGE-REGISTRY-CERT-CA>
imageContentSources:
- mirrors:
  - ip-10-119-0-121.ec2.internal:5000/ocp/release
  source: registry.svc.ci.openshift.org/ocp/release
- mirrors:
  - ip-10-119-0-121.ec2.internal:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
pullSecret: <HIDDEN>
sshKey: <HIDDEN>
```


Version-Release number of the following components: 
4.7.0-0.nightly-2020-12-20-031835

How reproducible: 
always 

Steps to Reproduce: 
1. Mirror the OCP images to a local image registry server.
2. Create and edit install-config.yaml.
3. Configure CCO in manual mode [1].
4. (SHIFT console) Turn off `Allow Internet`.
5. Create the cluster.

Actual results: 
Bootstrap process failed.

Expected results: 
Bootstrap process completed successfully.

Additional info:

[1] https://docs.openshift.com/container-platform/4.6/installing/installing_aws/manually-creating-iam.html#manually-create-iam_manually-creating-iam-aws

Comment 1 Yunfei Jiang 2020-12-29 09:45:32 UTC
Found a kubelet service error on one master machine:

Dec 29 08:08:49 ip-10-119-1-234 hyperkube[1461]: W1229 08:08:49.648314    1461 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 29 08:08:49 ip-10-119-1-234 hyperkube[1461]: I1229 08:08:49.649512    1461 aws.go:1251] Building AWS cloudprovider
Dec 29 08:08:49 ip-10-119-1-234 hyperkube[1461]: I1229 08:08:49.649588    1461 aws.go:1211] Zone not specified in configuration file; querying AWS metadata service
Dec 29 08:10:19 ip-10-119-1-234 systemd[1]: kubelet.service: start operation timed out. Terminating.
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]: F1229 08:10:50.027080    1461 server.go:269] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0ac535ea30c00904c: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp: i/o timeout\""
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]: goroutine 1 [running]:
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000012001, 0xc00059c300, 0x130, 0x2fe)
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]:         /builddir/build/BUILD/openshift-git-97012.0616638/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
<--SNIP-->
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]: created by internal/singleflight.(*Group).DoChan
Dec 29 08:10:50 ip-10-119-1-234 hyperkube[1461]:         /usr/lib/golang/src/internal/singleflight/singleflight.go:88 +0x2cc
Dec 29 08:10:50 ip-10-119-1-234 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Dec 29 08:10:50 ip-10-119-1-234 systemd[1]: kubelet.service: Failed with result 'timeout'.
Dec 29 08:10:50 ip-10-119-1-234 systemd[1]: Failed to start Kubernetes Kubelet.

The kubelet log is attached as m11_kubelet.service.log.

Comment 2 Yunfei Jiang 2020-12-29 09:46:16 UTC
Created attachment 1742888 [details]
m11_kubelet.service.log

Comment 5 Matthew Staebler 2021-01-12 00:34:08 UTC
I have traced this down to a deficiency in the C2S simulator. Kubelet relies on the AWS instance metadata to determine the region that the instance is in. However, the C2S simulator does not adjust the instance metadata. Consequently, kubelet thinks that the instance is in us-east-1 instead of us-iso-east-1. In a real C2S environment, the instance metadata would indicate that the instance is in us-iso-east-1, and kubelet would send its subsequent AWS calls to the correct us-iso-east-1 endpoints.

I am working on seeing what can be done to unblock testing.
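
The failure mode described in comment 5 can be sketched in a few lines of Python. This is an illustration only, not the kubelet's code: the function names, the suffix table, and the zone name `us-iso-east-1a` are assumptions. The two endpoint hostnames match the log in comment 1 and the workaround URL in comment 6.

```python
# Illustrative sketch only: names and the suffix table are assumptions,
# not the kubelet's real implementation.

# DNS suffixes seen in this bug: the commercial one appears in the
# comment 1 log, the C2S one in the comment 6 workaround.
DNS_SUFFIX_BY_PREFIX = {
    "us-iso-": "c2s.ic.gov",
}
DEFAULT_DNS_SUFFIX = "amazonaws.com"


def region_from_zone(zone: str) -> str:
    """Drop the trailing zone letter, e.g. 'us-east-1a' -> 'us-east-1'."""
    if len(zone) < 2 or not zone[-1].isalpha():
        raise ValueError(f"unexpected availability zone: {zone!r}")
    return zone[:-1]


def default_ec2_endpoint(region: str) -> str:
    """Build the default EC2 endpoint URL for a region."""
    suffix = DEFAULT_DNS_SUFFIX
    for prefix, candidate in DNS_SUFFIX_BY_PREFIX.items():
        if region.startswith(prefix):
            suffix = candidate
    return f"https://ec2.{region}.{suffix}"


# The simulator's metadata reports a commercial zone, so the derived
# endpoint is the commercial one, which is unreachable without Internet access.
simulated = default_ec2_endpoint(region_from_zone("us-east-1a"))
# A real C2S instance would report a us-iso-east-1 zone (zone name assumed).
real_c2s = default_ec2_endpoint(region_from_zone("us-iso-east-1a"))
print(simulated)
print(real_c2s)
```

The first URL is the one that times out in the kubelet log; the second is the endpoint a real C2S instance would be expected to use.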

Comment 6 Matthew Staebler 2021-01-12 01:38:48 UTC
The workaround for the incorrect instance metadata is to add the following to the data of the manifests/cloud-provider-config.yaml manifest before creating the cluster.

  config: |
    [ServiceOverride "0"]
      Service = ec2
      Region = us-east-1
      URL = https://ec2.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1

This will trick kubelet into using the us-iso-east-1 endpoint even though the instance metadata told it to use the us-east-1 endpoint. Again, this is only a workaround for the C2S simulated environment. In a real environment this is not necessary.
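
The effect of the override can be modelled as a lookup keyed by service and region. The Python below is a simplified model under stated assumptions, not the cloud provider's code: `ServiceOverride`, `resolve_endpoint`, and the fallback URL pattern are hypothetical names and behaviour chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ServiceOverride:
    service: str
    region: str          # region the client believes it is in
    url: str             # endpoint to use instead of the default
    signing_region: str  # region used when signing requests


def resolve_endpoint(
    service: str,
    region: str,
    overrides: Dict[Tuple[str, str], ServiceOverride],
) -> Tuple[str, str]:
    """Return (url, signing_region), preferring a matching override."""
    override = overrides.get((service, region))
    if override is not None:
        return override.url, override.signing_region
    # Assumed fallback: the commercial endpoint pattern seen in comment 1.
    return f"https://{service}.{region}.amazonaws.com", region


overrides = {
    ("ec2", "us-east-1"): ServiceOverride(
        service="ec2",
        region="us-east-1",
        url="https://ec2.us-iso-east-1.c2s.ic.gov",
        signing_region="us-iso-east-1",
    ),
}

print(resolve_endpoint("ec2", "us-east-1", overrides))
print(resolve_endpoint("elasticloadbalancing", "us-east-1", overrides))
```

In this model only the ec2 lookup is redirected; any service without an override still resolves to the commercial endpoint for the region reported by the metadata.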

Comment 7 Matthew Staebler 2021-01-12 01:45:10 UTC
After getting past the kubelet issue, I found another issue with the machine-api-operator. I would like to track that in a separate bug (https://bugzilla.redhat.com/show_bug.cgi?id=1915114).

Comment 8 Yunfei Jiang 2021-01-12 01:50:59 UTC
Thanks Matthew, I'll change the setting in my environment.

Comment 12 Matthew Staebler 2021-01-19 01:52:23 UTC
@yunjiang Are you satisfied with the service endpoints being a resolution for this bug? Can we close this bug?

Also, besides the override for the ec2 endpoint, the service endpoints need an override for the elasticloadbalancing (ELB) endpoint so that load balancing works.

    [ServiceOverride "0"]
      Service = ec2
      Region = us-east-1
      URL = https://ec2.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
    [ServiceOverride "1"]
      Service = elasticloadbalancing
      Region = us-east-1
      URL = https://elasticloadbalancing.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
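
Since every override follows the same pattern, the block can be generated rather than typed by hand. The following Python is a minimal sketch; `render_service_overrides` and `render_manifest` are hypothetical helpers, and the ConfigMap name and namespace follow the manifest shown in comment 13.

```python
from typing import Iterable


def render_service_overrides(
    services: Iterable[str],
    metadata_region: str,
    actual_region: str,
    dns_suffix: str,
) -> str:
    """Render one [ServiceOverride "N"] section per service."""
    lines = []
    for index, service in enumerate(services):
        lines += [
            f'[ServiceOverride "{index}"]',
            f"  Service = {service}",
            f"  Region = {metadata_region}",
            f"  URL = https://{service}.{actual_region}.{dns_suffix}",
            f"  SigningRegion = {actual_region}",
        ]
    return "\n".join(lines) + "\n"


def render_manifest(config: str) -> str:
    """Embed the config in a cloud-provider-config ConfigMap manifest."""
    indented = "".join(f"    {line}\n" for line in config.splitlines())
    return (
        "apiVersion: v1\n"
        "kind: ConfigMap\n"
        "metadata:\n"
        "  name: cloud-provider-config\n"
        "  namespace: openshift-config\n"
        "data:\n"
        "  config: |\n"
        f"{indented}"
    )


config = render_service_overrides(
    ["ec2", "elasticloadbalancing"],
    metadata_region="us-east-1",
    actual_region="us-iso-east-1",
    dns_suffix="c2s.ic.gov",
)
print(render_manifest(config))
```

The printed text could be saved as manifests/cloud-provider-config.yaml before creating the cluster. As with the overrides themselves, this only applies to the simulated C2S environment.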

Comment 13 Yunfei Jiang 2021-01-19 10:16:07 UTC
PASS.
verified on: 4.7.0-0.nightly-2021-01-19-051335

>> NOTE
The trust CA could not be found in kube-cloud-config; this issue will be tracked in Bug 1915500.

>> Steps:
1. Create manifests.
2. Inject cloud-provider-config per comment 6 and comment 12:

cat << EOF > manifests/cloud-provider-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cloud-provider-config
  namespace: openshift-config
data:
  config: |
    [ServiceOverride "0"]
      Service = ec2
      Region = us-east-1
      URL = https://ec2.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
    [ServiceOverride "1"]
      Service = elasticloadbalancing
      Region = us-east-1
      URL = https://elasticloadbalancing.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
EOF

3. The kube-cloud-config ConfigMap is created:

./oc get configmap -n openshift-config-managed kube-cloud-config -o yaml
apiVersion: v1
data:
  cloud.conf: |
    [ServiceOverride "0"]
      Service = ec2
      Region = us-east-1
      URL = https://ec2.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
    [ServiceOverride "1"]
      Service = elasticloadbalancing
      Region = us-east-1
      URL = https://elasticloadbalancing.us-iso-east-1.c2s.ic.gov
      SigningRegion = us-iso-east-1
kind: ConfigMap
metadata:
  creationTimestamp: "2021-01-19T09:01:43Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:cloud.conf: {}
    manager: cluster-config-operator
    operation: Update
    time: "2021-01-19T09:01:43Z"
  name: kube-cloud-config
  namespace: openshift-config-managed
  resourceVersion: "4666"
  selfLink: /api/v1/namespaces/openshift-config-managed/configmaps/kube-cloud-config
  uid: 1c2eeed9-7c17-46fb-af27-81e6a8275fc1
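
The check in step 3 can also be automated by parsing the `cloud.conf` value and confirming that each required service has an override. This is a minimal Python sketch with a hand-written parser; `parse_service_overrides` and `missing_services` are hypothetical helpers, and the sample text is copied from the ConfigMap output above.

```python
import re
from typing import Dict, List

SECTION_RE = re.compile(r'^\[ServiceOverride\s+"([^"]+)"\]$')


def parse_service_overrides(cloud_conf: str) -> List[Dict[str, str]]:
    """Collect the key/value pairs of each [ServiceOverride "N"] section."""
    overrides: List[Dict[str, str]] = []
    current = None
    for raw in cloud_conf.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # Any section header ends the previous section; only
            # ServiceOverride sections are collected.
            current = {} if SECTION_RE.match(line) else None
            if current is not None:
                overrides.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return overrides


def missing_services(cloud_conf: str, required: List[str]) -> List[str]:
    """Return the required services that have no override."""
    present = {o.get("Service") for o in parse_service_overrides(cloud_conf)}
    return [s for s in required if s not in present]


cloud_conf = """\
[ServiceOverride "0"]
  Service = ec2
  Region = us-east-1
  URL = https://ec2.us-iso-east-1.c2s.ic.gov
  SigningRegion = us-iso-east-1
[ServiceOverride "1"]
  Service = elasticloadbalancing
  Region = us-east-1
  URL = https://elasticloadbalancing.us-iso-east-1.c2s.ic.gov
  SigningRegion = us-iso-east-1
"""

print(missing_services(cloud_conf, ["ec2", "elasticloadbalancing"]))
```

An empty list means both overrides from comment 6 and comment 12 made it into kube-cloud-config.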

Comment 16 errata-xmlrpc 2021-02-24 15:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

