Created attachment 1334894 [details]
system log at loglevel=5

Description of problem:
In 3.7.0-0.142.0 and 0.143.0 the install fails because the atomic-openshift-node service fails to start with network errors:

Oct 05 14:52:12 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:12.452824 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:12 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:12.523931 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:13 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:13.526593 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:13 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:13.956564 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:14 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:14.529242 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:15 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:15.531737 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:16.208670 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.513173 2624 kubelet.go:1909] SyncLoop (housekeeping, skipped): sources aren't ready yet.
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.513211 2624 kubelet.go:1835] SyncLoop (ADD, "api"): ""
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.534306 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.613406 2624 reconciler.go:159] Desired state of world has been populated with pods, starting reconstruct state function
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:16.711568 2624 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.711671 2624 kubelet.go:2090] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

I've tried all three network plugins and get the same results.

Version-Release number of selected component (if applicable):
3.7.0-0.143.0 with the same level of openshift-ansible

How reproducible:
Usually - occasionally I see other install failures instead of this one (see https://bugzilla.redhat.com/show_bug.cgi?id=1498592). If that one does not occur, I hit this one.

Steps to Reproduce:
1. Install 3.7.0-0.143.0 with the inventory below

Actual results:
During the install there will be a failure with nodes not starting. Go to the nodes and check the node logs and you'll see the network failures (commands sketched after the inventory below).
Full system log with loglevel set to 5 attached (jump to the bottom for the most recent failure with loglevel 5).

Expected results:
successful install

Additional info:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
#The following parameters is used by post-actions
iaas_name=AWS
use_rpm_playbook=true
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.7/latest/x86_64/os', 'enabled': 1, 'gpgcheck': 0}]
update_is_images_url=registry.ops.openshift.com

#The following parameters is used by openshift-ansible
ansible_ssh_user=root
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<redacted>
openshift_cloudprovider_aws_secret_key=<redacted>
openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=apps.1004-g89.qe.rhcloud.com
openshift_auth_type=allowall
openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]
openshift_master_cluster_public_hostname=ec2-54-202-120-33.us-west-2.compute.amazonaws.com
openshift_master_cluster_hostname=ip-172-31-27-246
deployment_type=openshift-enterprise
openshift_cockpit_deployer_prefix=registry.ops.openshift.com/openshift3/
osm_cockpit_plugins=['cockpit-kubernetes']
osm_use_cockpit=false
oreg_url=registry.ops.openshift.com/openshift3/ose-${component}:${version}
openshift_docker_additional_registries=registry.ops.openshift.com
openshift_docker_insecure_registries=registry.ops.openshift.com
use_cluster_metrics=true
openshift_master_cluster_method=native
openshift_master_dynamic_provisioning_enabled=true
osm_default_node_selector=region=primary
openshift_disable_check=disk_availability,memory_availability
openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=9
openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"],"minimum-container-ttl-duration": ["10s"], "maximum-dead-containers-per-container": ["1"], "maximum-dead-containers": ["20"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}
openshift_registry_selector="region=infra,zone=default"
openshift_hosted_router_selector="region=infra,zone=default"
openshift_hosted_router_registryurl=registry.ops.openshift.com/openshift3/ose-${component}:${version}
debug_level=2
openshift_set_hostname=true
openshift_override_hostname_check=true
#os_sdn_network_plugin_name=redhat/ovs-multitenant
openshift_hosted_router_replicas=1
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=<redacted>
openshift_hosted_registry_storage_s3_secretkey=<redacted>
openshift_hosted_registry_storage_s3_bucket=aoe-svt-test
openshift_hosted_registry_storage_s3_region=us-west-2
openshift_hosted_registry_replicas=1
openshift_metrics_install_metrics=false
openshift_metrics_image_prefix=registry.ops.openshift.com/openshift3/
openshift_metrics_image_version=v3.7.0
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=25Gi
openshift_logging_install_logging=false
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.7.0
openshift_logging_storage_volume_size=25Gi
openshift_logging_storage_kind=dynamic
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_pvc_dynamic=true
openshift_use_system_containers=false
system_images_registry=registry.ops.openshift.com
openshift_image_tag=v3.7.0

[lb]

[etcd]
ip-172-31-27-246

[masters]
ip-172-31-27-246

[nodes]
ip-172-31-27-246 openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_scheduleable=false
ip-172-31-32-45 openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
ip-172-31-15-107 openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
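For reference, the node-side symptoms above can be gathered roughly like this on an RPM-based install (a sketch; the unit name matches the log excerpt, but the sysconfig path and log-level option are assumptions about the usual default layout):

  # Service state and recent node logs
  systemctl status atomic-openshift-node
  journalctl -u atomic-openshift-node --no-pager | tail -n 200

  # Raise the node log level to 5 (set OPTIONS=--loglevel=5 in the sysconfig file), then restart
  vi /etc/sysconfig/atomic-openshift-node
  systemctl restart atomic-openshift-node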
The "Could not find an allocated subnet for node" message indicates that the master is either not started, has not completed its initialization, or hasn't been able to create a HostSubnet for the node. Could we: 1) get master logs 2) get 'oc get hostsubnets' output
Comment 1 is correct. The atomic-openshift-master-controllers service is not starting because of:

F1005 18:51:43.450870 64966 controllermanager.go:179] error building controller context: no ClusterID Found. A ClusterID is required for the cloud provider to function properly

https://github.com/openshift/origin/pull/16331 now enforces a ClusterID, and from what I gather from https://github.com/openshift/openshift-ansible/pull/4726 it is read from a label on the instances. From @rsquared: the label is either "KubernetesCluster" or "kubernetes.io/cluster/<unique cluster string>". The former is the old way, the latter the new way.

Is this something openshift-ansible can help with? I tried openshift_cloudprovider_aws_cluster_id from https://github.com/openshift/openshift-ansible/pull/4726, but that did not work.
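For the record, the attempt mentioned above would look roughly like this in the [OSEv3:vars] section of the inventory (a sketch; the value is a placeholder, and only the variable name comes from the PR):

  openshift_cloudprovider_kind=aws
  openshift_cloudprovider_aws_cluster_id=<unique cluster string>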
Unless you're using the new provisioning work, we don't manage your instances, and there's no reasonable way to retroactively label them. You must label your instances ahead of running the installer.
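For anyone hitting this, a minimal sketch of labelling the instances ahead of the install with the AWS CLI (the instance ID and cluster name are placeholders, and the Value used for the new-style tag is an assumption, not something stated in this bug):

  # Old-style tag
  aws ec2 create-tags --resources i-0123456789abcdef0 \
      --tags 'Key=KubernetesCluster,Value=<cluster-name>'

  # New-style tag
  aws ec2 create-tags --resources i-0123456789abcdef0 \
      --tags 'Key=kubernetes.io/cluster/<cluster-name>,Value=owned'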
Labelling the instances with KubernetesCluster took care of it. Will open a separate bz for openshift-ansible to enforce it.
*** This bug has been marked as a duplicate of bug 1491399 ***