Created attachment 1334894 [details]
system log at loglevel=5

Description of problem:
In 3.7.0-0.142.0 and 0.143.0 the install fails because the atomic-openshift-node service fails to start with network errors:

Oct 05 14:52:12 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:12.452824 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:12 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:12.523931 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:13 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:13.526593 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:13 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:13.956564 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:14 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:14.529242 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:15 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:15.531737 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:16.208670 2624 sdn_controller.go:48] Could not find an allocated subnet for node: ip-172-31-15-107.us-west-2.compute.internal, Waiting...
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.513173 2624 kubelet.go:1909] SyncLoop (housekeeping, skipped): sources aren't ready yet.
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.513211 2624 kubelet.go:1835] SyncLoop (ADD, "api"): ""
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.534306 2624 generic.go:183] GenericPLEG: Relisting
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.613406 2624 reconciler.go:159] Desired state of world has been populated with pods, starting reconstruct state function
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: W1005 14:52:16.711568 2624 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 05 14:52:16 ip-172-31-15-107.us-west-2.compute.internal atomic-openshift-node[2624]: I1005 14:52:16.711671 2624 kubelet.go:2090] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

I've tried all three network plugins and get the same results.

Version-Release number of selected component (if applicable):
3.7.0-0.143.0 with the same level of openshift-ansible

How reproducible:
Usually - occasionally I see other install failures instead of this one (see https://bugzilla.redhat.com/show_bug.cgi?id=1498592). If that one does not occur, I hit this one.

Steps to Reproduce:
1. Install 3.7.0-0.143.0 with the inventory below

Actual results:
During the install there will be a failure with nodes not starting. Go to the nodes and check the node logs and you'll see the network failures (commands sketched after the inventory below).
Full system log with loglevel set to 5 attached (jump to the bottom for the most recent failure with loglevel 5).

Expected results:
successful install

Additional info:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
#The following parameters is used by post-actions
iaas_name=AWS
use_rpm_playbook=true
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.7/latest/x86_64/os', 'enabled': 1, 'gpgcheck': 0}]
update_is_images_url=registry.ops.openshift.com

#The following parameters is used by openshift-ansible
ansible_ssh_user=root
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<redacted>
openshift_cloudprovider_aws_secret_key=<redacted>
openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=apps.1004-g89.qe.rhcloud.com
openshift_auth_type=allowall
openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]
openshift_master_cluster_public_hostname=ec2-54-202-120-33.us-west-2.compute.amazonaws.com
openshift_master_cluster_hostname=ip-172-31-27-246
deployment_type=openshift-enterprise
openshift_cockpit_deployer_prefix=registry.ops.openshift.com/openshift3/
osm_cockpit_plugins=['cockpit-kubernetes']
osm_use_cockpit=false
oreg_url=registry.ops.openshift.com/openshift3/ose-${component}:${version}
openshift_docker_additional_registries=registry.ops.openshift.com
openshift_docker_insecure_registries=registry.ops.openshift.com
use_cluster_metrics=true
openshift_master_cluster_method=native
openshift_master_dynamic_provisioning_enabled=true
osm_default_node_selector=region=primary
openshift_disable_check=disk_availability,memory_availability
openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=9
openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"],"minimum-container-ttl-duration": ["10s"], "maximum-dead-containers-per-container": ["1"], "maximum-dead-containers": ["20"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}
openshift_registry_selector="region=infra,zone=default"
openshift_hosted_router_selector="region=infra,zone=default"
openshift_hosted_router_registryurl=registry.ops.openshift.com/openshift3/ose-${component}:${version}
debug_level=2
openshift_set_hostname=true
openshift_override_hostname_check=true
#os_sdn_network_plugin_name=redhat/ovs-multitenant
openshift_hosted_router_replicas=1
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=<redacted>
openshift_hosted_registry_storage_s3_secretkey=<redacted>
openshift_hosted_registry_storage_s3_bucket=aoe-svt-test
openshift_hosted_registry_storage_s3_region=us-west-2
openshift_hosted_registry_replicas=1
openshift_metrics_install_metrics=false
openshift_metrics_image_prefix=registry.ops.openshift.com/openshift3/
openshift_metrics_image_version=v3.7.0
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=25Gi
openshift_logging_install_logging=false
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=v3.7.0
openshift_logging_storage_volume_size=25Gi
openshift_logging_storage_kind=dynamic
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_pvc_dynamic=true
openshift_use_system_containers=false
system_images_registry=registry.ops.openshift.com
openshift_image_tag=v3.7.0

[lb]

[etcd]
ip-172-31-27-246

[masters]
ip-172-31-27-246

[nodes]
ip-172-31-27-246 openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_scheduleable=false
ip-172-31-32-45 openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
ip-172-31-15-107 openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
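For reference, the node-side symptoms above can be gathered roughly like this on an RPM-based install (a sketch; the unit name matches the log excerpt, but the sysconfig path and log-level option are assumptions about the usual default layout):

  # Service state and recent node logs
  systemctl status atomic-openshift-node
  journalctl -u atomic-openshift-node --no-pager | tail -n 200

  # Raise the node log level to 5 (set OPTIONS=--loglevel=5 in the sysconfig file), then restart
  vi /etc/sysconfig/atomic-openshift-node
  systemctl restart atomic-openshift-node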
The "Could not find an allocated subnet for node" message indicates that the master is either not started, has not completed its initialization, or hasn't been able to create a HostSubnet for the node. Could we: 1) get master logs 2) get 'oc get hostsubnets' output
Comment 1 is correct. The atomic-openshift-master-controllers service is not starting because of:

F1005 18:51:43.450870 64966 controllermanager.go:179] error building controller context: no ClusterID Found. A ClusterID is required for the cloud provider to function properly

https://github.com/openshift/origin/pull/16331 now enforces a ClusterID, and from what I gather from https://github.com/openshift/openshift-ansible/pull/4726 it is read from a label on the instances. From @rsquared: the label is either "KubernetesCluster" or "kubernetes.io/cluster/<unique cluster string>". The former is the old way, the latter the new way.

Is this something openshift-ansible can help with? I tried openshift_cloudprovider_aws_cluster_id from https://github.com/openshift/openshift-ansible/pull/4726, but that did not work.
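For the record, the attempt mentioned above would look roughly like this in the [OSEv3:vars] section of the inventory (a sketch; the value is a placeholder, and only the variable name comes from the PR):

  openshift_cloudprovider_kind=aws
  openshift_cloudprovider_aws_cluster_id=<unique cluster string>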
Unless you're using the new provisioning work, we don't manage your instances, and there's no reasonable way to retroactively label them. You must label your instances ahead of running the installer.
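For anyone hitting this, a minimal sketch of labelling the instances ahead of the install with the AWS CLI (the instance ID and cluster name are placeholders, and the Value used for the new-style tag is an assumption, not something stated in this bug):

  # Old-style tag
  aws ec2 create-tags --resources i-0123456789abcdef0 \
      --tags 'Key=KubernetesCluster,Value=<cluster-name>'

  # New-style tag
  aws ec2 create-tags --resources i-0123456789abcdef0 \
      --tags 'Key=kubernetes.io/cluster/<cluster-name>,Value=owned'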
Labelling the instances with KubernetesCluster took care of it. Will open a separate bz for openshift-ansible to enforce it.
*** This bug has been marked as a duplicate of bug 1491399 ***