Bug 1656983

Summary: OCP Azure Cloud Provider Failure with IDM due to IDM requiring FQDN hostnames
Product: OpenShift Container Platform Reporter: Sam Yangsao <syangsao>
Component: Installer Assignee: Joseph Callen <jcallen>
Installer sub component: openshift-ansible QA Contact: Mike Gahagan <mgahagan>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: aos-bugs, brad.ison, decarr, dspangen, gpei, jchaloup, jokerman, jolee, jpreston, mgugino, mifiedle, mmccomas, mrhodes, pvoborni, vlaad
Version: 3.10.0   
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Missing conditionals for cloud provider azure
Consequence: Incorrect kubelet node name for special cases
Fix: Add proper conditionals to set nodeName in node-config
Result: Proper kubelet node name can be set as required
Story Points: ---
Clone Of:
: 1701247 Environment:
Last Closed: 2019-07-23 19:56:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1701247    
Attachments:
Description Flags
inventory, ansible log, node log reproducing issue on 3.11.117 none

Comment 7 Josh Preston 2018-12-10 16:26:12 UTC
Summary:

IDM registration is required for security reasons, and deploying OpenShift on a certified Cloud Provider secured with IDM appears to be a commonly used Red Hat solution.

For OpenShift, the node’s hostname needs to match the Azure VM name, and the "openshift_hostname" configuration was removed in 3.10.

For Azure, VM names cannot utilize a ‘.’ (dot) character.

The issue is that IDM requires the node hostname to be set to the FQDN, and the Azure VM name can’t be set to the FQDN because Azure does not support the "." (dot) character in VM names.
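
The conflict can be illustrated with a minimal Python sketch. It encodes only the two constraints stated in this comment (no dot in an Azure VM name, an FQDN for IDM). The function names are hypothetical, the example name is borrowed from the reproduction later in this bug, and the real Azure naming rules and IDM checks are broader than what is shown here.

```python
def allowed_as_azure_vm_name(name: str) -> bool:
    # Only the rule cited in this comment: the VM name must not contain '.'.
    return bool(name) and "." not in name


def looks_like_fqdn(name: str) -> bool:
    # Simplified assumption: an FQDN has at least two non-empty,
    # dot-separated labels (a trailing dot is tolerated).
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def satisfies_both(name: str) -> bool:
    # A single name would have to pass both checks at once.
    return allowed_as_azure_vm_name(name) and looks_like_fqdn(name)
```

Under these two rules no name passes `satisfies_both`: an FQDN needs a dot and a VM name forbids one, which is why the node name has to be decoupled from the hostname.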

To ensure a consistent security posture, OpenShift nodes must utilize IDM and the Azure Cloud Provider together, so a solution is required.

[1] https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/#azure

Comment 20 Michael Gugino 2019-04-08 22:05:43 UTC
Sending this to the pod team. The cloud provider implementation provides the node name to the kubelet when a cloud provider is in use.

Comment 24 Joseph Callen 2019-04-26 13:19:41 UTC
Build: openshift-ansible-3.11.110-1

Comment 28 Mike Fiedler 2019-06-21 18:29:35 UTC
Reproduced the issue on 3.11.117:

- 1 master, 1 compute. Verified that 3.11.117 installs successfully with the out-of-the-box Azure hostname configuration, which uses the VM names as both hostnames and node names (a normal Azure install). This works fine.

Next, uninstall, set the FQDN as the hostname, and retry the install. The FQDN was configured in the Azure UI for the public IP of each instance:

- Used Azure to assign public DNS names to the public IPs of the 2 instances
- Updated the hostname of each instance to its public DNS name (hostnamectl set-hostname <public DNS name>)
- Created /etc/sysconfig/KUBELET_HOSTNAME_OVERRIDE with the public DNS name on both nodes
- Verified that the hostname command returns the FQDN and that the FQDN resolves to an IP. Rebooted for good luck.
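
The last verification step can be approximated with a short Python sketch. It is not part of the original reproduction; `check_node_hostname` is a hypothetical helper, and the FQDN test is a simplification (at least two non-empty labels).

```python
import socket


def looks_like_fqdn(name):
    # Simplified: at least two non-empty dot-separated labels.
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def check_node_hostname(hostname=None, resolve=socket.gethostbyname):
    """Return (is_fqdn, ip_or_None) for the given or the current hostname.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    name = hostname if hostname is not None else socket.gethostname()
    try:
        ip = resolve(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        ip = None
    return looks_like_fqdn(name), ip
```

Calling `check_node_hostname()` with no arguments on a node would report whether the current hostname looks like an FQDN and which address, if any, it resolves to.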

Configure openshift-ansible inventory (full inventory in attached tar):

[nodes]
137.135.78.189 openshift_node_group_name=node-config-compute openshift_kubelet_name_override=compute-311.osadev.cloud
40.114.9.9 openshift_node_group_name=node-config-master openshift_kubelet_name_override=master-311.osadev.cloud
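
To make the per-host overrides in this inventory easier to inspect, here is a minimal Python sketch that splits each [nodes] line into a host and its variables. It is a simplification for illustration only and is not how Ansible parses inventories; `parse_inventory_host_line` is a hypothetical helper.

```python
import shlex


def parse_inventory_host_line(line):
    """Split 'host key=value ...' into (host, {key: value})."""
    host, *pairs = shlex.split(line)
    host_vars = dict(pair.split("=", 1) for pair in pairs)
    return host, host_vars


# The two [nodes] entries from the inventory above.
lines = [
    "137.135.78.189 openshift_node_group_name=node-config-compute "
    "openshift_kubelet_name_override=compute-311.osadev.cloud",
    "40.114.9.9 openshift_node_group_name=node-config-master "
    "openshift_kubelet_name_override=master-311.osadev.cloud",
]

# Map each inventory host to the node name the kubelet is asked to use.
overrides = {
    host: host_vars["openshift_kubelet_name_override"]
    for host, host_vars in map(parse_inventory_host_line, lines)
}
```

Both override values contain dots, so neither can match an Azure VM name.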

- Ran the installer. The install fails with "Node start failed" on the openshift-ansible side (ansible log in attached tar)

We see the same messages in the node log as reported in comment 2 (journal in attached tar):

Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.213533   34819 kubelet.go:389] cloud provider determined current node name to be master-311.osadev.cloud
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246563   34819 azure_wrap.go:199] Virtual machine "master-311.osadev.cloud" not found with message: "compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Compute/virtualMachines/master-311.osadev.cloud' under resource group 'mgahagan-310test' was not found.\""
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246617   34819 azure_standard.go:569] GetPrimaryInterface(master-311.osadev.cloud, ) abort backoff
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: E0621 15:30:32.246636   34819 azure_backoff.go:102] GetIPForMachineWithRetry(master-311.osadev.cloud): backoff failure, will retry,err=instance not found
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246654   34819 azure_instances.go:36] NodeAddresses(master-311.osadev.cloud) abort backoff: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: F0621 15:30:32.246674   34819 server.go:262] failed to run Kubelet: failed to create kubelet: failed to get the addresses of the current instance from the cloud provider: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: Failed to start OpenShift Node.
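
When triaging a journal like the one above, the VM name the Azure cloud provider looked up can be pulled out with a short Python sketch. The pattern is an assumption based only on the azure_wrap.go message format shown in this log, and `missing_vm_names` is a hypothetical helper.

```python
import re

# Matches the 'Virtual machine "<name>" not found' message seen in the log.
NOT_FOUND = re.compile(r'Virtual machine "(?P<vm>[^"]+)" not found')


def missing_vm_names(journal_lines):
    """Return the distinct VM names reported as not found, in first-seen order."""
    names = []
    for line in journal_lines:
        match = NOT_FOUND.search(line)
        if match and match.group("vm") not in names:
            names.append(match.group("vm"))
    return names
```

A returned name that contains a dot means the kubelet's node name was an FQDN, which cannot match any Azure VM name.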

Comment 29 Mike Fiedler 2019-06-21 18:30:20 UTC
Created attachment 1583375
inventory, ansible log, node log reproducing issue on 3.11.117

Comment 30 Mike Fiedler 2019-06-27 11:59:28 UTC
Moving back to ON_QA to test per https://bugzilla.redhat.com/show_bug.cgi?id=1701247#c9

Comment 31 Mike Gahagan 2019-06-28 18:14:33 UTC
Tested on an OCP cluster set up on Azure; the cluster nodes were given public DNS names, aliases, and public IPs. openshift-ansible was run on a bastion host with private IP access to the nodes. The cluster deployed successfully with openshift_kubelet_name_override set to the private IP hostname of each cluster node. openshift_master_cluster_public_hostname and openshift_public_hostname were set to the public DNS names.

Versions tested:
atomic-openshift-3.11.117-1.git.0.14e54a3.el7
openshift-ansible-3.11.117-1.git.0.add13ff

Comment 33 errata-xmlrpc 2019-07-23 19:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753