Bug 1656983

Summary: OCP Azure Cloud Provider Failure with IDM due to IDM requiring FQDN hostnames

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Installer |
| Installer sub component | openshift-ansible |
| Version | 3.10.0 |
| Target Release | 3.11.z |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | unspecified |
| Type | Bug |
| Reporter | Sam Yangsao <syangsao> |
| Assignee | Joseph Callen <jcallen> |
| QA Contact | Mike Gahagan <mgahagan> |
| CC | aos-bugs, brad.ison, decarr, dspangen, gpei, jchaloup, jokerman, jolee, jpreston, mgugino, mifiedle, mmccomas, mrhodes, pvoborni, vlaad |
| Doc Type | Bug Fix |
| Clones | 1701247 (view as bug list) |
| Bug Blocks | 1701247 |
| Last Closed | 2019-07-23 19:56:23 UTC |

Doc Text:
- Cause: Missing conditionals for the Azure cloud provider.
- Consequence: Incorrect kubelet node name in special cases.
- Fix: Add the proper conditionals to set nodeName in node-config.
- Result: The correct kubelet node name can be set as required.
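The "Fix" in the Doc Text refers to conditionals in openshift-ansible's node-config templating. As a minimal sketch of how one might confirm the outcome on an Azure node, assuming the OCP 3.11 node config path /etc/origin/node/node-config.yaml and its `nodeName` field (both assumed here, not stated in this report):

```sh
# Assumed OCP 3.11 node config path; after the fix, nodeName should be a
# value the Azure cloud provider can resolve to a VM resource (e.g. the
# VM name) rather than an unresolvable FQDN.
grep '^nodeName:' /etc/origin/node/node-config.yaml
```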
Comment 7
Josh Preston
2018-12-10 16:26:12 UTC
Sending this to the pod team. The cloud provider implementation provides the node name to the kubelet when used.

Build: openshift-ansible-3.11.110-1

Reproduced the issue on 3.11.117 (1 master, 1 compute):

- Verified we could install 3.11.117 OK using the out-of-the-box Azure hostname configuration, which uses VM names as the hostnames and node names (a normal Azure install). This works fine.
- Uninstalled, set FQDNs as the hostnames, and retried the install. FQDNs were configured in the Azure UI for the public IPs of the instances:
  - Used Azure to assign public DNS names to the public IPs of the two instances.
  - Updated the hostname of each instance to its public DNS name (`hostnamectl set-hostname <public DNS name>`).
  - Created /etc/sysconfig/KUBELET_HOSTNAME_OVERRIDE with the public DNS names on both nodes.
  - Verified the `hostname` command returns the FQDN and that the FQDN resolves to an IP. Rebooted for good luck.
- Configured the openshift-ansible inventory (full inventory in the attached tar):

```
[nodes]
137.135.78.189 openshift_node_group_name=node-config-compute openshift_kubelet_name_override=compute-311.osadev.cloud
40.114.9.9 openshift_node_group_name=node-config-master openshift_kubelet_name_override=master-311.osadev.cloud
```

- Ran the installer. The install fails with "Node start failed" on the openshift-ansible side (ansible log in the attached tar).

We see the same messages in the node log as reported in comment 2 (journal in the attached tar):

```
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.213533 34819 kubelet.go:389] cloud provider determined current node name to be master-311.osadev.cloud
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246563 34819 azure_wrap.go:199] Virtual machine "master-311.osadev.cloud" not found with message: "compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Compute/virtualMachines/master-311.osadev.cloud' under resource group 'mgahagan-310test' was not found.\""
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246617 34819 azure_standard.go:569] GetPrimaryInterface(master-311.osadev.cloud, ) abort backoff
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: E0621 15:30:32.246636 34819 azure_backoff.go:102] GetIPForMachineWithRetry(master-311.osadev.cloud): backoff failure, will retry, err=instance not found
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: I0621 15:30:32.246654 34819 azure_instances.go:36] NodeAddresses(master-311.osadev.cloud) abort backoff: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud atomic-openshift-node[34819]: F0621 15:30:32.246674 34819 server.go:262] failed to run Kubelet: failed to create kubelet: failed to get the addresses of the current instance from the cloud provider: timed out waiting for the condition
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jun 21 15:30:32 master-311.osadev.cloud systemd[1]: Failed to start OpenShift Node.
```

Created attachment 1583375 [details]
inventory, ansible log, node log reproducing issue on 3.11.117
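The 404 in the journal above is the Azure cloud provider trying to look up a VM by the kubelet's node name; with the FQDN as the node name, no VM resource matches. A hedged illustration of the equivalent lookup using the az CLI (the resource group is taken from the log; the CLI usage and VM name are assumptions, not part of the report):

```sh
# The FQDN names no VM in the resource group, mirroring the
# ResourceNotFound error in the journal above:
az vm show -g mgahagan-310test -n master-311.osadev.cloud   # -> ResourceNotFound
# A plain VM resource name (hypothetical here) would resolve:
az vm show -g mgahagan-310test -n master-311
```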
Moving back to ON_QA to test per https://bugzilla.redhat.com/show_bug.cgi?id=1701247#c9

Tested on an OCP cluster set up on Azure; the cluster nodes were given public DNS names, aliases, and public IPs. openshift-ansible was run on a bastion host with private IP access to the nodes. Successfully deployed the cluster with openshift_kubelet_name_override set to the private IP hostname of each cluster node; openshift_master_cluster_public_hostname and openshift_public_hostname were set to the public DNS names.

Versions tested:
atomic-openshift-3.11.117-1.git.0.14e54a3.el7
openshift-ansible-3.11.117-1.git.0.add13ff

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753
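For reference, a sketch of the inventory pattern that verified successfully, with illustrative hostnames and IPs (none of these placeholder names are from the report): node names come from openshift_kubelet_name_override pointing at the private hostnames, while the public DNS names are carried only in the public hostname variables.

```sh
# Illustrative inventory fragment; hostnames and IPs are placeholders.
cat >> inventory <<'EOF'
[OSEv3:vars]
openshift_master_cluster_public_hostname=console.example.eastus.cloudapp.azure.com

[nodes]
10.0.0.4 openshift_node_group_name=node-config-master openshift_kubelet_name_override=master0.internal.cloudapp.net openshift_public_hostname=master0.eastus.cloudapp.azure.com
10.0.0.5 openshift_node_group_name=node-config-compute openshift_kubelet_name_override=compute0.internal.cloudapp.net openshift_public_hostname=compute0.eastus.cloudapp.azure.com
EOF
```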