Bug 1535340
Summary: Advanced Installer always failed once when installing OpenShift on Azure with Azure default name resolution

Product: OpenShift Container Platform
Reporter: Takayoshi Tanaka <tatanaka>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: Wenkai Shi <weshi>
Severity: urgent
Priority: unspecified
Version: 3.7.1
CC: aos-bugs, arun.neelicattu, clasohm, gwest, jokerman, mmccomas, sdodson, stwalter, weshi
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
You may now configure NetworkManager with dns=none during installation. This is commonly used when deploying on Azure, but may be useful in other scenarios too. To configure this, set openshift_node_dnsmasq_disable_network_manager_dns=true.
Story Points: ---
Clones: 1614089, 1614092 (view as bug list)
Last Closed: 2018-10-11 07:19:06 UTC
Type: Bug
Regression: ---
Bug Blocks: 1614089, 1614092
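The Doc Text above names the new inventory switch; as a hedged sketch, it might appear in an openshift-ansible inventory like this (the [OSEv3:vars] group name follows openshift-ansible convention; everything besides the variable name itself is illustrative):

```ini
# Hedged example inventory fragment. Only the variable name comes from this
# bug's Doc Text; the group name follows openshift-ansible convention.
[OSEv3:vars]
# Configure NetworkManager with dns=none during installation
openshift_node_dnsmasq_disable_network_manager_dns=true
```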
Description (Takayoshi Tanaka, 2018-01-17 07:48:35 UTC)
I could reproduce the issue with a single master/node. The condition can be summarized as follows. Two reasons, [a] and [b], cause the node service to fail to restart:

[a] The search domain in /etc/resolv.conf is removed by something during installation. The search domain is recovered by restarting the network service.
[b] Name resolution from hostname to private IP is required when specifying the Azure Cloud Provider in the ansible inventory file.

I think [a] is a bug. However, it causes no harm as long as the Azure Cloud Provider is not specified in the ansible inventory file [1]. When this configuration [1] is specified, the OpenShift node gets its IP address from the node name [b]. That fails because the search domain is required for DNS resolution, and then restarting the node service fails.

Workaround: restart the network service once the playbook has failed, then re-execute the playbook.

Note: enabling the Azure Cloud Provider during the ansible install still can't work even once this issue is resolved; another bug, BZ 1535391, is blocking it.

[1]
~~~
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~

Changed the version to 3.7.1 because I confirmed this issue happens on both 3.7.1 and 3.6.1.

Glenn, do you have any input on how to go about fixing this issue?

@Takayoshi, can you try [1] on your hosts?

[1]. ping `hostname`

Hi, here is a result.
~~~
# ping `hostname`
ping: tatanaka-37c: Name or service not known

# hostnamectl status
   Static hostname: tatanaka-37c
   Pretty hostname: [localhost.localdomain]
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
           Boot ID: 7c131a057da44f0c9c5333f9cb0fefdd
    Virtualization: microsoft
  Operating System: Employee SKU
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
~~~

Added the needinfo flag again since I removed it mistakenly.

(In reply to Takayoshi Tanaka from comment #7)
> # ping `hostname`
> ping: tatanaka-37c: Name or service not known

Hi Takayoshi,

I believe this is the root cause: you must have an internal hostname that can be resolved to the internal IP, or you might have hit BZ #1505266. When creating the internal NIC, you must add "--internal-dns-name" to assign an internal hostname to your host, as in example [1].

[1]
~~~
# az network public-ip create --resource-group openshift-qe-xxx -n qe-public-ip --dns-name qe-public-hostname
# az network nic create --resource-group openshift-qe-xxx --name qe-internal-ip --vnet-name openshift-qe-vnet --subnet default-subnet --internal-dns-name qe-internal-hostname --public-ip-address qe-public-ip
~~~

I have tested with specifying "--internal-dns-name" and got the same error.
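As a hedged diagnostic sketch (not part of the installer or this bug's fix), the `ping `hostname`` check above can also be done through the resolver libraries instead of ICMP; `resolves_to_ip` is a hypothetical helper name:

```shell
# Hypothetical helper: print the first IP that a name resolves to, or nothing.
resolves_to_ip() {
    getent hosts "$1" | awk '{print $1; exit}'
}

ip=$(resolves_to_ip "$(hostname)")
if [ -n "$ip" ]; then
    echo "hostname resolves to $ip"
else
    echo "hostname does not resolve (the failure mode in this bug)"
fi
```

Unlike ping, getent also consults /etc/hosts via NSS, so it matches what the node service actually sees.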
Even though "--internal-dns-name" is required for setup [1], it's strange that this issue is fixed by restarting the network service.
As I commented before, this issue is caused by whatever modified /etc/resolv.conf.
While the hostname can't be resolved to the IP:
~~~
# ping `hostname`
ping: tatanaka-37c: Name or service not known
~~~
no search domain is shown in /etc/resolv.conf:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver 10.0.0.4
~~~
Then /etc/resolv.conf is updated by restarting the network service:
~~~
# systemctl restart network
# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cele2tf43r2elamcykz40glung.xx.internal.cloudapp.net cluster.local
nameserver 10.0.0.4
~~~
Does it mean something updated /etc/resolv.conf incompletely?
[1] I don't believe the current OCP requires "--internal-dns-name", because the hostname can be resolved to the IP both before and after installation.
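The symptom above can be spotted directly in the file; as a sketch (a hypothetical helper, not installer code), a resolv.conf that has lost its search directive can be detected like this:

```shell
# Hypothetical helper: succeed if the given resolv.conf-style file has a
# search directive, fail otherwise.
has_search_domain() {
    grep -q '^search ' "$1"
}

if has_search_domain /etc/resolv.conf; then
    echo "search domain present"
else
    echo "search domain missing (restarting the network service restored it above)"
fi
```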
I suspect the issue is caused by removing the search directive from /etc/resolv.conf. Because "search <private_hostname>.xx.internal.cloudapp.net" is defined before installation, the playbook should keep this search directive. Restarting the network service works around the issue, but the installer itself should either restart the network service or keep the "search <private_hostname>.xx.internal.cloudapp.net" line in /etc/resolv.conf.

I mean, when the installer starts the node service, /etc/resolv.conf should be as follows:

~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search <private_hostname>.xx.internal.cloudapp.net cluster.local
nameserver <private_ip>
~~~

The actual result is:

~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver <private_ip>
~~~

Please note, this issue only happened once, at the first attempt of the advanced installer. Once the network service is restarted, this issue won't happen again.

Proposed a potential fix upstream for this one. This will allow for optionally configuring NetworkManager with "dns=none".

https://github.com/openshift/openshift-ansible/pull/7765

To test, set openshift_node_dnsmasq_disable_network_manager_dns=true

Failed to verify with version openshift-ansible-3.9.40-1.git.0.188c954.el7; the related code [1] isn't merged into that build.

[1] https://github.com/openshift/openshift-ansible/pull/7765

Scott, Kenny, is this a blocker for 3.9?

The PR is included only in the latest 3.11 build [1].
The cherry-picks are merged but not included in the latest 3.9/3.10 builds:

3.10: https://github.com/openshift/openshift-ansible/pull/9374
3.9: https://github.com/openshift/openshift-ansible/pull/9375

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=735456

I've moved the target release for this bug to 3.11; the fix should be in openshift-ansible-3.11.0-0.11.0. I think we should clone this bug for 3.10 and 3.9 and mark them MODIFIED similarly.

Verified with version openshift-ansible-3.11.0-0.11.0.git.0.3c66516; this issue doesn't appear in QE's cluster. The PR with the openshift_node_dnsmasq_disable_network_manager_dns parameter does have effect:

~~~
# cat /etc/NetworkManager/conf.d/99-origin.conf
[main]
dns=none
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652