Summary: IDM registration is required for security reasons, and deploying OpenShift on a certified cloud provider secured with IDM is a common Red Hat solution. When the Azure cloud provider is enabled, the node's hostname must match the Azure VM name [1], and the "openshift_hostname" option that previously allowed overriding this was removed in 3.10. IDM, however, requires the node hostname to be the FQDN, and an Azure VM name cannot be an FQDN because Azure does not allow the '.' (dot) character in VM names. Since OpenShift nodes must use IDM and the Azure cloud provider together to maintain a consistent security posture, a solution is required.

[1] https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/#azure
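To make the collision concrete, here is a minimal sketch (the resource group name is hypothetical; the FQDNs match those used later in this bug). Azure rejects a dot in the VM name, while IDM enrollment expects the host to be named by its FQDN:

  # Azure rejects '.' in VM names, so the VM cannot be named by FQDN:
  az vm create --resource-group mygroup --name master-311.osadev.cloud ...   # fails validation
  az vm create --resource-group mygroup --name master-311 ...                # accepted

  # IDM enrollment, by contrast, expects the node hostname to be the FQDN:
  hostnamectl set-hostname master-311.osadev.cloud
  ipa-client-install --hostname=master-311.osadev.cloud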
Sending this to the Pod team. When a cloud provider is configured, its implementation supplies the node name to the kubelet.
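To illustrate, a sketch based on the node logs later in this bug (exact flags and paths may differ by release): with the Azure cloud provider enabled, the node service starts the kubelet roughly as follows, and the resulting node name must exist as a VM in the resource group:

  hyperkube kubelet \
    --cloud-provider=azure \
    --cloud-config=/etc/origin/cloudprovider/azure.conf \
    --hostname-override=master-311.osadev.cloud \
    ...
  # The override is fed through the cloud provider when it determines the
  # current node name; an FQDN does not match any Azure VM name, hence the
  # 404 "ResourceNotFound" seen in the node logs below.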
https://github.com/openshift/openshift-ansible/pull/11521
Build: openshift-ansible-3.11.110-1
Reproduced the issue on 3.11.117 with 1 master, 1 compute.

Verified we could install 3.11.117 OK using the out-of-the-box Azure hostname config, which uses VM names as the hostnames and node names (a normal Azure install). This works fine. Then uninstalled, set the FQDN as the hostname, and retried the install. The FQDN was configured in the Azure UI for the public IP of the instances:
- Used Azure to assign public DNS names to the public IPs of the 2 instances
- Updated the hostname of the instances to be the public DNS names (hostnamectl set-hostname <public DNS name>)
- Created /etc/sysconfig/KUBELET_HOSTNAME_OVERRIDE with the public DNS names on both nodes (see the sketch after this comment)
- Verified the hostname command returns the FQDN and that the FQDN resolves to an IP
- Rebooted for good luck

Configured the openshift-ansible inventory (full inventory in the attached tar):

[nodes]
137.135.78.189 openshift_node_group_name=node-config-compute openshift_kubelet_name_override=compute-311.osadev.cloud
40.114.9.9 openshift_node_group_name=node-config-master openshift_kubelet_name_override=master-311.osadev.cloud

Ran the installer. The install fails with "Node start failed" on the openshift-ansible side (ansible log in the attached tar). We see the same messages in the node log as reported in comment 2 (journal in the attached tar):

Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.213533 34819 kubelet.go:389] cloud provider determined current node name to be master-311.osadev.cloud
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246563 34819 azure_wrap.go:199] Virtual machine "master-311.osadev.cloud" not found with message: "compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Compute/virtualMachines/master-311.osadev.cloud' under resource group 'mgahagan-310test' was not found.\""
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246617 34819 azure_standard.go:569] GetPrimaryInterface(master-311.osadev.cloud, ) abort backoff
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: E0621 15:30:32.246636 34819 azure_backoff.go:102] GetIPForMachineWithRetry(master-311.osadev.cloud): backoff failure, will retry,err=instance not found
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246654 34819 azure_instances.go:36] NodeAddresses(master-311.osadev.cloud) abort backoff: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: F0621 15:30:32.246674 34819 server.go:262] failed to run Kubelet: failed to create kubelet: failed to get the addresses of the current instance from the cloud provider: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: Failed to start OpenShift Node.
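For reference, the override file and sanity checks from the steps above might look like this (the exact KUBELET_HOSTNAME_OVERRIDE format is our best understanding of what the node service consumes, not verified here):

  # /etc/sysconfig/KUBELET_HOSTNAME_OVERRIDE
  KUBELET_HOSTNAME_OVERRIDE=--hostname-override=master-311.osadev.cloud

  # Sanity checks before re-running the installer:
  hostname -f                            # should print the FQDN
  getent hosts master-311.osadev.cloud   # FQDN should resolve to an IP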
Created attachment 1583375: inventory, ansible log, node log reproducing the issue on 3.11.117
Moving back to ON_QA to test per https://bugzilla.redhat.com/show_bug.cgi?id=1701247#c9
Tested on an OCP cluster set up on Azure; cluster nodes were given public DNS names, aliases, and public IPs. openshift-ansible was run on a bastion host with private-IP access to the nodes. Successfully deployed the cluster with openshift_kubelet_name_override set to the private IP hostname of each cluster node; openshift_master_cluster_public_hostname and openshift_public_hostname were set to the public DNS names.

Versions tested:
atomic-openshift-3.11.117-1.git.0.14e54a3.el7
openshift-ansible-3.11.117-1.git.0.add13ff
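For anyone repeating this verification, the relevant inventory pieces looked roughly like the sketch below (addresses and private hostnames are hypothetical placeholders; the public FQDNs match those used earlier in this bug):

  [OSEv3:vars]
  openshift_master_cluster_public_hostname=master-311.osadev.cloud

  [nodes]
  # openshift_kubelet_name_override carries the private-IP hostname;
  # openshift_public_hostname carries the public DNS name
  10.0.0.4 openshift_node_group_name=node-config-master openshift_kubelet_name_override=<private hostname> openshift_public_hostname=master-311.osadev.cloud
  10.0.0.5 openshift_node_group_name=node-config-compute openshift_kubelet_name_override=<private hostname> openshift_public_hostname=compute-311.osadev.cloud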
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753