Bug 1820484 - Failed to create a cluster when applying a custom KMS key on worker volumes.
Summary: Failed to create a cluster when applying a custom KMS key on worker volumes.
Keywords:
Status: CLOSED DUPLICATE of bug 1815219
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-03 08:05 UTC by Yunfei Jiang
Modified: 2020-04-20 11:13 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-20 11:13:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
full openshift install log (138.13 KB, text/plain) - 2020-04-03 08:05 UTC, Yunfei Jiang
must gather log (5.49 MB, application/gzip) - 2020-04-07 04:34 UTC, Yunfei Jiang

Description Yunfei Jiang 2020-04-03 08:05:20 UTC
Created attachment 1675932 [details]
full openshift install log

Description of problem:

In a default installation, all master and worker volumes are encrypted with the default AWS KMS key. If I specify a valid custom KMS key for the worker volumes, they should be encrypted with that key instead. However, this does not work: the cluster cannot be created successfully when a custom key is specified for the worker volumes.

configuration:
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      rootVolume:
        kmsKeyARN: arn:aws:kms:us-east-2:301721915996:key/4f5265b4-16f7-4d85-9a09-7209ab0c8456
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: yunjiang-usv
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-2
publish: External
pullSecret: xxx
sshKey: XXX

How reproducible:
Always.

Steps to Reproduce:
1. Create a KMS key with the "Key users" policy.
2. Create the install config file and specify the above KMS key for the worker nodes.
3. Create the cluster.
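
For reference, these steps can be scripted roughly as follows (the key description and install directory name are illustrative; only standard aws and openshift-install invocations are assumed):

# Create a customer-managed KMS key; its key policy must grant the
# installer's IAM identity usage rights (the "Key users" permission set).
aws kms create-key --description "OpenShift worker root volume key" --region us-east-2

# Put the returned key ARN under compute[0].platform.aws.rootVolume.kmsKeyARN
# in install-config.yaml, then create the cluster:
openshift-install create cluster --dir ikms_us_verify --log-level debug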

Actual results:

Failed to create cluster.

(full log is attached)
time="2020-04-03T00:47:00-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-03-29-195504: 99% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
time="2020-04-03T00:50:11-04:00" level=debug msg="Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health): Get https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health: EOF"
time="2020-04-03T01:14:45-04:00" level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.yunjiang-usv.qe.devcluster.openshift.com: []"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator authentication Available is Unknown with NoData: "
time="2020-04-03T01:14:45-04:00" level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health): Get https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health: EOF"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator console Progressing is True with OAuthClientSync_FailedHost::RouteSync_FailedHost: RouteSyncProgressing: route is not available at canonical host []\nOAuthClientSyncProgressing: waiting on route host"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator console Available is False with Route_FailedAdmittedIngress: RouteAvailable: console route is not admitted"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator image-registry Available is False with NoReplicasAvailable: The deployment does not have available replicas"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator image-registry Progressing is True with DeploymentNotCompleted: The deployment has not completed"
time="2020-04-03T01:14:45-04:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.\nMoving to release version \"4.5.0-0.nightly-2020-03-29-195504\".\nMoving to ingress-controller image version \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:559e38c1171467ee375f2ea873495624920accf3ae0ff4b99cae98964e708897\"."
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator insights Disabled is False with : "
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator monitoring Available is False with : "
time="2020-04-03T01:14:45-04:00" level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
time="2020-04-03T01:14:45-04:00" level=error msg="Cluster operator monitoring Degraded is True with UpdatingUserWorkloadThanosRulerFailed: Failed to rollout the stack. Error: running task Updating User Workload Thanos Ruler failed: failed to retrieve Grafana datasources config: secrets \"grafana-datasources\" not found"
time="2020-04-03T01:14:45-04:00" level=fatal msg="failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health): Get https://console-openshift-console.apps.yunjiang-usv.qe.devcluster.openshift.com/health: EOF"
time="2020-04-03T01:17:45-04:00" level=debug msg="OpenShift Installer 4.5.0-0.nightly-2020-03-29-195504"
time="2020-04-03T01:17:45-04:00" level=debug msg="Built from commit 6aea75e9e5760924991c2a38c021d7f835aef296"
time="2020-04-03T01:17:45-04:00" level=debug msg="Fetching Install Config..."
time="2020-04-03T01:17:45-04:00" level=debug msg="Loading Install Config..."
time="2020-04-03T01:17:45-04:00" level=debug msg="  Loading SSH Key..."
time="2020-04-03T01:17:45-04:00" level=debug msg="  Using SSH Key loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="  Loading Base Domain..."
time="2020-04-03T01:17:45-04:00" level=debug msg="    Loading Platform..."
time="2020-04-03T01:17:45-04:00" level=debug msg="    Using Platform loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="  Using Base Domain loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="  Loading Cluster Name..."
time="2020-04-03T01:17:45-04:00" level=debug msg="    Loading Base Domain..."
time="2020-04-03T01:17:45-04:00" level=debug msg="    Loading Platform..." 
time="2020-04-03T01:17:45-04:00" level=debug msg="  Using Cluster Name loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="  Loading Pull Secret..."
time="2020-04-03T01:17:45-04:00" level=debug msg="  Using Pull Secret loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="  Loading Platform..."
time="2020-04-03T01:17:45-04:00" level=debug msg="Using Install Config loaded from state file"
time="2020-04-03T01:17:45-04:00" level=debug msg="Reusing previously-fetched Install Config"
time="2020-04-03T01:17:45-04:00" level=fatal msg="failed to get bootstrap and control plane host addresses from \"ikms_us_verify/terraform.tfstate\": failed to lookup bootstrap: resource not found"


Expected results:
* Cluster is created successfully.
* Worker volumes are encrypted with the custom KMS key above.
* Master volumes are encrypted with the default AWS KMS key.
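
One way to verify which key each volume was encrypted with (a sketch; the tag value is illustrative, the real tag uses the generated infra ID):

aws ec2 describe-volumes --region us-east-2 \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/yunjiang-usv-xxxxx" \
  --query "Volumes[].{Id:VolumeId,Encrypted:Encrypted,Key:KmsKeyId}"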

Additional info:

Attempts:
1. Create a cluster with the default configuration.
  * Cluster created successfully.
  * All master and worker volumes are encrypted with the default AWS KMS key.
2. Apply the custom KMS key ONLY to master volumes.
  * Cluster created successfully.
  * Master volumes are encrypted with the custom AWS KMS key.
  * Worker volumes are encrypted with the default AWS KMS key.

Comment 1 Abhinav Dahiya 2020-04-03 14:42:22 UTC
Can you attach the oc adm must-gather output?

Comment 2 Abhinav Dahiya 2020-04-03 14:44:42 UTC
1. Create cluster with the default configuration
  * Cluster created successfully
  * All master and worker volumes are encrypted with the default AWS KMS key.
2. Apply custom KMS key ONLY on master volumes
  * Cluster created successfully
  * Master volumes are encrypted with the custom AWS KMS key.
  * Worker volumes are encrypted with the default AWS KMS key.
^^ these succeed..

3. Apply custom KMS key ONLY on worker volumes
  * Cluster created successfully
  * Master volumes are encrypted with the default AWS KMS key.
  * Worker volumes should be encrypted with the custom AWS KMS key, failing..
^^ (3) failing and (2) working makes it sound like the machine-api is failing..
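
For context, the installer hands the worker rootVolume settings to the machine-api through the worker MachineSet's provider spec, roughly like this (a sketch based on the cluster-api-provider-aws field names; volume size/type values are illustrative):

providerSpec:
  value:
    blockDevices:
    - ebs:
        encrypted: true
        kmsKey:
          arn: arn:aws:kms:us-east-2:301721915996:key/4f5265b4-16f7-4d85-9a09-7209ab0c8456
        volumeSize: 120   # illustrative default
        volumeType: gp2   # illustrative default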

Comment 3 Yunfei Jiang 2020-04-07 04:34:21 UTC
Created attachment 1676807 [details]
must gather log

Comment 4 Yunfei Jiang 2020-04-07 04:38:18 UTC
(In reply to Abhinav Dahiya from comment #1)
> Can you attach the oc adm must-gather output?

must-gather log is attached.

Comment 5 Abhinav Dahiya 2020-04-07 04:43:04 UTC
hmm the must-gather is missing the machine-api namespace..

we should open a separate bug for cloud team for that..

Can you grab the logs from all the containers running in the openshift-machine-api namespace..
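
For example, something like the following should capture them (the deployment and container names are assumptions and may vary by release):

oc get pods -n openshift-machine-api
oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller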

Comment 6 Yunfei Jiang 2020-04-07 11:06:37 UTC
(In reply to Abhinav Dahiya from comment #5)
> hmm the must-gather is missing the machine-api namespace..
> 
> we should open a separate bug for cloud team for that..
> 
> Can you grab the logs from all the containers running in the
> openshift-machine-api namespace..

Collected all logs using must-gather, including the machine-api namespace.
(Shared via cloud storage due to the attachment size limit.)
https://drive.google.com/open?id=1I1dR65FFsvcMmgL-pBhtzu8kydd_Zgv5

Comment 7 Joel Speed 2020-04-20 11:13:10 UTC

*** This bug has been marked as a duplicate of bug 1815219 ***

