Description of problem:
When installing OpenShift on Azure VMs that use Azure default name resolution, the installation fails because the node service fails to start.

Version-Release number of the following components:
~~~
# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.83-1.git.0.84c5eff.el7.noarch

# rpm -q ansible
ansible-2.4.1.0-1.el7.noarch

# ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
~~~

How reproducible:
Always.

Steps to Reproduce:
1. Provision a RHEL server on Azure using Azure default name resolution [1]. If no DNS server configuration is specified on Azure, Azure uses its default name resolution.
2. Create an Ansible inventory file in which the node name is equal to the VM name. This is a requirement for using an Azure Disk as a PV via the Kubernetes Azure cloud provider; see https://docs.openshift.com/container-platform/3.6/install_config/configuring_azure.html. A minimal inventory sketch follows below.
3. Run the advanced installation:
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml

Actual results:
Failed at the task [openshift_node : restart node]:
Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

Expected results:
Installation completes without errors.

Additional info:
The customer reported this issue on OCP 3.6. I assume it still happens on the latest OCP; I will update later.
Detailed information, including sensitive data, has been added in a private comment.

[1] https://docs.microsoft.com/en-us/azure/virtual-machines/linux/azure-dns
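For reference, step 2 above corresponds roughly to an inventory like the following. This is only a minimal sketch under assumptions: a single combined master/node whose Azure VM name is "node1"; the group layout and variables are illustrative and not taken from the actual environment.

~~~
# Hypothetical minimal inventory for a single master/node install.
# The host entry (and openshift_hostname) must equal the Azure VM name
# so the Kubernetes Azure cloud provider can locate the VM.
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_user=root
openshift_deployment_type=openshift-enterprise

[masters]
node1 openshift_hostname=node1

[etcd]
node1

[nodes]
node1 openshift_hostname=node1 openshift_schedulable=true
~~~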
I could reproduce the issue with a single master/node. The condition can be summarized as follows: two factors, [a] and [b], together cause the node service restart to fail.

[a] The search domain in /etc/resolv.conf is removed by something during installation. The search domain comes back after restarting the network service.
[b] Name resolution from the hostname to the private IP is required when the Azure cloud provider is specified in the Ansible inventory file.

I think [a] is a bug on its own. However, it causes no harm as long as the Azure cloud provider is not specified in the Ansible inventory file [1]. When this configuration [1] is specified, the OpenShift node resolves its IP address from the node name [b]. That resolution fails because DNS needs the search domain to resolve the name, so restarting the node service fails.

Workaround:
Restart the network service after the playbook has failed, then re-run the playbook (see the sketch below).

Note: Enabling the Azure cloud provider during the Ansible install still does not work even once this issue is resolved; another bug, BZ 1535391, is blocking it.

[1]
~~~
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~
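A minimal sketch of the workaround described above. The grep check is only an assumption about how to confirm the search domain is back; the playbook path follows the advanced-install documentation.

~~~
# On the failed node: restarting the network service re-populates the
# "search" line in /etc/resolv.conf via the NetworkManager dispatcher script.
systemctl restart network

# Confirm the search domain is present again before re-running the installer.
grep '^search' /etc/resolv.conf

# Then re-run the advanced installation from the Ansible control host.
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
~~~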
Changed to 3.7.1 because I confirmed this issue happens on both 3.7.1 and 3.6.1.
Glenn, do you have any input on how to go about fixing this issue?
@Takayoshi, can you try [1] on your hosts?

[1]. ping `hostname`
Hi, here is the result.

~~~
# ping `hostname`
ping: tatanaka-37c: Name or service not known

# hostnamectl status
   Static hostname: tatanaka-37c
   Pretty hostname: [localhost.localdomain]
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
           Boot ID: 7c131a057da44f0c9c5333f9cb0fefdd
    Virtualization: microsoft
  Operating System: Employee SKU
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
~~~
Adding the needinfo flag again since I removed it by mistake.
(In reply to Takayoshi Tanaka from comment #7)
> Hi, here is the result.
>
> ~~~
> # ping `hostname`
> ping: tatanaka-37c: Name or service not known
>
> # hostnamectl status
>    Static hostname: tatanaka-37c
>    Pretty hostname: [localhost.localdomain]
>          Icon name: computer-vm
>            Chassis: vm
>         Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
>            Boot ID: 7c131a057da44f0c9c5333f9cb0fefdd
>     Virtualization: microsoft
>   Operating System: Employee SKU
>        CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
>             Kernel: Linux 3.10.0-693.11.6.el7.x86_64
>       Architecture: x86-64
> ~~~

Hi Takayoshi,

This is the root cause, I believe: you must have an internal hostname that can be resolved to the internal IP. Or you might have hit BZ #1505266.

When creating the internal NIC, you must add "--internal-dns-name" to assign an internal hostname to your host, as in example [1].

[1].
~~~
# az network public-ip create --resource-group openshift-qe-xxx -n qe-public-ip --dns-name qe-public-hostname
# az network nic create --resource-group openshift-qe-xxx --name qe-internal-ip --vnet-name openshift-qe-vnet --subnet default-subnet --internal-dns-name qe-internal-hostname --public-ip-address qe-public-ip
~~~
I have tested with "--internal-dns-name" specified and got the same error.

Even though "--internal-dns-name" is required for the setup [1], it is strange that restarting the network service fixes this issue. As I commented before, this issue is caused by something modifying /etc/resolv.conf.

While the hostname cannot be resolved to the IP:
~~~
# ping `hostname`
ping: tatanaka-37c: Name or service not known
~~~

no search domain is shown in /etc/resolv.conf:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver 10.0.0.4
~~~

Then /etc/resolv.conf is updated by restarting the network service:
~~~
# systemctl restart network
# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cele2tf43r2elamcykz40glung.xx.internal.cloudapp.net cluster.local
nameserver 10.0.0.4
~~~

Does this mean something updated /etc/resolv.conf incompletely?

[1] I am not convinced the current OCP requires "--internal-dns-name", because the hostname can be resolved to the IP both before and after installation.
I suspect the issue is caused by the search directive being removed from /etc/resolv.conf. Because "search <private_hostname>.xx.internal.cloudapp.net" is defined before installation, the playbook should keep this search directive. Restarting the network service does work around the issue, but the installer itself should either restart the network service or keep the "search <private_hostname>.xx.internal.cloudapp.net" line in /etc/resolv.conf.

In other words, when the installer starts the node service, /etc/resolv.conf should be as follows:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search <private_hostname.xx.internal.cloudapp.net> cluster.local
nameserver <private_ip>
~~~

The actual result is:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver <private_ip>
~~~

Please note that this issue only happened once, at the first attempt of the advanced installer. Once the network service has been restarted, the issue does not happen again.
Proposed a potential fix upstream for this one. It will allow optionally configuring NetworkManager with "dns=none".

https://github.com/openshift/openshift-ansible/pull/7765
To test, set openshift_node_dnsmasq_disable_network_manager_dns=true.
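For example, the variable would be set in the inventory's [OSEv3:vars] section roughly like this. Only the variable name comes from the comment above; its placement in [OSEv3:vars] and the surrounding layout are assumptions.

~~~
[OSEv3:vars]
# Ask openshift-ansible to configure NetworkManager with dns=none so that
# /etc/resolv.conf (including the Azure-provided search domain) is not rewritten.
openshift_node_dnsmasq_disable_network_manager_dns=true
~~~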
Failed to verify with version openshift-ansible-3.9.40-1.git.0.188c954.el7; the related code [1] is not merged into this build.

[1]. https://github.com/openshift/openshift-ansible/pull/7765
Scott, Kenny, is this a blocker for 3.9? The PR is included only in the latest 3.11 build [1]. The cherry-picks are merged but not yet included in the latest 3.9/3.10 builds:

3.10: https://github.com/openshift/openshift-ansible/pull/9374
3.9: https://github.com/openshift/openshift-ansible/pull/9375

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=735456
I've moved the target release for this bug to 3.11; the fix should be in openshift-ansible-3.11.0-0.11.0.

I think we should clone this bug for 3.10 and 3.9 and mark those clones modified similarly.
Verified with version openshift-ansible-3.11.0-0.11.0.git.0.3c66516None; this issue does not appear in QE's cluster. The PR with the openshift_node_dnsmasq_disable_network_manager_dns parameter does take effect:

~~~
# cat /etc/NetworkManager/conf.d/99-origin.conf
[main]
dns=none
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652