Bug 1535340

Summary: Advanced installer always fails once when installing OpenShift on Azure with Azure default name resolution
Product: OpenShift Container Platform
Reporter: Takayoshi Tanaka <tatanaka>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: Wenkai Shi <weshi>
Severity: urgent
Priority: unspecified
Version: 3.7.1
CC: aos-bugs, arun.neelicattu, clasohm, gwest, jokerman, mmccomas, sdodson, stwalter, weshi
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
You may now configure NetworkManager for dns=none during installation. This is commonly used when deploying on Azure but may be useful in other scenarios too. To configure this, set openshift_node_dnsmasq_disable_network_manager_dns=true.
Story Points: ---
Clones: 1614089, 1614092 (view as bug list)
Last Closed: 2018-10-11 07:19:06 UTC
Type: Bug
Bug Blocks: 1614089, 1614092

Description Takayoshi Tanaka 2018-01-17 07:48:35 UTC
Description of problem:
When installing OpenShift on Azure VMs using Azure default name resolution, the installation fails because the node service fails to start.

Version-Release number of the following components:
~~~
# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.83-1.git.0.84c5eff.el7.noarch

# rpm -q ansible
ansible-2.4.1.0-1.el7.noarch

# ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
~~~

How reproducible:
Always.

Steps to Reproduce:
1. Provision a RHEL server on Azure using Azure default name resolution. If no DNS server is specified in the Azure configuration, Azure uses its default name resolution.
 
2. Create an Ansible inventory file in which the node name equals the VM name (a minimal sketch follows these steps). This is required when using Azure Disk as a PV via the Kubernetes Azure cloud provider.
https://docs.openshift.com/container-platform/3.6/install_config/configuring_azure.html

3. Run advanced installation.
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
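
For reference, a minimal inventory sketch for step 2 (the host name tatanaka-37c is taken from the comments below; all other values are placeholder assumptions):

~~~
[OSEv3:children]
masters
nodes

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=openshift-enterprise

[masters]
tatanaka-37c

[nodes]
tatanaka-37c
~~~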

Actual results:
Failed at the task [openshift_node : restart node] with:
Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
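
The failing unit can be inspected with the commands the error message suggests (a sketch; the unit name is taken from the error above):

~~~
systemctl status atomic-openshift-node.service
journalctl -xe -u atomic-openshift-node.service
~~~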



Expected results:
Installation is completed without errors.

Additional info:
The customer reported this issue on OCP 3.6. However, I'm assuming it still happens on the latest OCP. Will update later.

Added the detailed info, including sensitive data, in a private comment.

[1]  https://docs.microsoft.com/en-us/azure/virtual-machines/linux/azure-dns

Comment 3 Takayoshi Tanaka 2018-01-18 06:27:26 UTC
I could reproduce the issue with a single master/node. The conditions can be summarized as follows.

Two reasons, [a] and [b], prevent the node service from restarting.

[a] The search domain in /etc/resolv.conf is removed by something during installation. The search domain is restored by restarting the network service.
[b] Name resolution from the hostname to the private IP is required when the Azure cloud provider is specified in the Ansible inventory file.

I think [a] is a bug. However, it causes no harm as long as the Azure cloud provider is not specified in the Ansible inventory file [1].

When configuration [1] is specified, the OpenShift node gets its IP address from the node name [b]. This fails because the search domain is required for DNS resolution, so restarting the node service fails.

Workaround:
Restart the network service once the playbook has failed, then re-execute the playbook, as sketched below.
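
A minimal sketch of the workaround (the playbook path matches the reproduction steps; run the restart on each node where the service failed):

~~~
# On the failed node: restarting the network service restores the
# search directive in /etc/resolv.conf
systemctl restart network
cat /etc/resolv.conf    # the search domain should be present again

# On the installer host: re-run the playbook
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
~~~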

Note:
Enabling the Azure cloud provider during the Ansible install doesn't work even once this issue is resolved; another bug, BZ 1535391, is blocking it.

[1] 
~~~
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~
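
For context, a sketch of the /etc/azure/azure.conf referenced above, with placeholder values (the field names follow the Azure configuration document linked in the description; treat them as assumptions here):

~~~
tenantId: <azure_ad_tenant_id>
subscriptionId: <azure_subscription_id>
aadClientId: <service_principal_client_id>
aadClientSecret: <service_principal_secret>
resourceGroup: <resource_group_of_the_vms>
location: <azure_region>
~~~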

Comment 4 Takayoshi Tanaka 2018-01-18 06:51:43 UTC
Changed the version to 3.7.1 because I confirmed this issue happens on both 3.7.1 and 3.6.1.

Comment 5 Scott Dodson 2018-01-19 20:37:04 UTC
Glenn,

Do you have any input on how to go about fixing this issue?

Comment 6 Wenkai Shi 2018-01-23 07:35:13 UTC
@Takayoshi, can you try [1] on your hosts?

[1]. ping `hostname`

Comment 7 Takayoshi Tanaka 2018-01-24 06:37:17 UTC
Hi, here is a result.

~~~
# ping `hostname`
ping: tatanaka-37c: Name or service not known

# hostnamectl status
   Static hostname: tatanaka-37c
   Pretty hostname: [localhost.localdomain]
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
           Boot ID: 7c131a057da44f0c9c5333f9cb0fefdd
    Virtualization: microsoft
  Operating System: Employee SKU
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
~~~

Comment 8 Takayoshi Tanaka 2018-01-24 06:38:16 UTC
Adding the needinfo flag again since I removed it mistakenly.

Comment 9 Wenkai Shi 2018-01-24 08:02:24 UTC
(In reply to Takayoshi Tanaka from comment #7)
> Hi, here is a result.
> 
> ~~~
> # ping `hostname`
> ping: tatanaka-37c: Name or service not known
> 
> # hostnamectl status
>    Static hostname: tatanaka-37c
>    Pretty hostname: [localhost.localdomain]
>          Icon name: computer-vm
>            Chassis: vm
>         Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
>            Boot ID: 7c131a057da44f0c9c5333f9cb0fefdd
>     Virtualization: microsoft
>   Operating System: Employee SKU
>        CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
>             Kernel: Linux 3.10.0-693.11.6.el7.x86_64
>       Architecture: x86-64
> ~~~

Hi Takayoshi,

I believe this is the root cause: you must have an internal hostname that can be resolved to the internal IP. Otherwise, you might have hit BZ #1505266.
When creating the internal NIC, you must add "--internal-dns-name" to assign an internal hostname to your host, as in example [1].

[1]. 
# az network public-ip create --resource-group openshift-qe-xxx -n qe-public-ip --dns-name qe-public-hostname
# az network nic create --resource-group openshift-qe-xxx --name qe-internal-ip --vnet-name openshift-qe-vnet --subnet default-subnet --internal-dns-name qe-internal-hostname --public-ip-address qe-public-ip
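
To check whether the internal DNS name actually resolves, one can run something like this from a VM inside the same virtual network (the hostname is taken from the example above):

~~~
nslookup qe-internal-hostname
ping -c 1 qe-internal-hostname
~~~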

Comment 10 Takayoshi Tanaka 2018-01-24 08:11:27 UTC
I tested with "--internal-dns-name" specified and got the same error.

Even if "--internal-dns-name" is required for setup [1], it's strange that restarting the network service is what fixes this issue.

As I commented before, this issue is caused by something that modified /etc/resolv.conf.

While the hostname can't be resolved to the IP:

~~~
> # ping `hostname`
> ping: tatanaka-37c: Name or service not known
~~~

No search domain is shown in /etc/resolv.conf:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver 10.0.0.4
~~~

Then /etc/resolv.conf is updated by restarting the network service:

~~~
# systemctl restart network

# cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cele2tf43r2elamcykz40glung.xx.internal.cloudapp.net cluster.local
nameserver 10.0.0.4
~~~

Does this mean something updated /etc/resolv.conf incompletely?

[1] I don't believe the current OCP requires "--internal-dns-name", because the hostname can be resolved to the IP both before and after installation.

Comment 15 Takayoshi Tanaka 2018-01-25 07:42:05 UTC
I suspect the issue is caused by the removal of the search directive from /etc/resolv.conf.

Because "search <private_hostname>.xx.internal.cloudapp.net" is defined before installation, the playbook should keep this search directive.
Restarting the network service works around the issue, but the installer itself should either restart the network service or keep the "search <private_hostname>.xx.internal.cloudapp.net" entry in /etc/resolv.conf.

I mean, when the installer starts the node service, /etc/resolv.conf should look as follows:

~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search <private_hostname.xx.internal.cloudapp.net> cluster.local
nameserver <private_ip>
~~~

The actual result is:
~~~
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
nameserver <private_ip>
~~~

Please note, this issue only happens once, at the first attempt of the advanced installer. Once the network service is restarted, the issue doesn't happen again.

Comment 18 Arun Babu Neelicattu 2018-04-04 00:59:35 UTC
Proposed a potential fix upstream for this one. It allows optionally configuring NetworkManager with "dns=none".

https://github.com/openshift/openshift-ansible/pull/7765

Comment 19 Scott Dodson 2018-07-31 13:56:23 UTC
To test, set openshift_node_dnsmasq_disable_network_manager_dns=true.
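
For example, in the inventory (a minimal sketch; only the last line is the variable under test):

~~~
[OSEv3:vars]
openshift_node_dnsmasq_disable_network_manager_dns=true
~~~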

Comment 21 Wenkai Shi 2018-08-06 03:51:53 UTC
Failed to verify with version openshift-ansible-3.9.40-1.git.0.188c954.el7; the related code [1] isn't merged into this build.

[1]. https://github.com/openshift/openshift-ansible/pull/7765

Comment 22 Jan Chaloupka 2018-08-06 08:13:20 UTC
Scott, Kenny, is this a blocker for 3.9?

The PR is included only in the latest 3.11 build [1]. The cherry-picks are merged, though not yet included in the latest 3.9/3.10 builds:

3.10: https://github.com/openshift/openshift-ansible/pull/9374
3.9:  https://github.com/openshift/openshift-ansible/pull/9375

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=735456

Comment 23 Scott Dodson 2018-08-06 13:26:51 UTC
I've moved the target release for this bug to 3.11; the fix should be in openshift-ansible-3.11.0-0.11.0.

I think we should clone this bug for 3.10 and 3.9 and mark the clones MODIFIED similarly.

Comment 24 Wenkai Shi 2018-08-07 04:17:24 UTC
Verified with version openshift-ansible-3.11.0-0.11.0.git.0.3c66516; this issue doesn't appear in QE's cluster. The openshift_node_dnsmasq_disable_network_manager_dns parameter does have an effect.

~~~
# cat /etc/NetworkManager/conf.d/99-origin.conf

[main]
dns=none
~~~
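
A quick check that the setting took effect on a node (a sketch; with dns=none, NetworkManager stops rewriting /etc/resolv.conf, so the search directive should persist):

~~~
grep -r 'dns=none' /etc/NetworkManager/conf.d/
cat /etc/resolv.conf    # the search domain should survive NetworkManager events
~~~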

Comment 27 errata-xmlrpc 2018-10-11 07:19:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652