Bug 1535391 - Can't enable Azure Cloud Provider
Summary: Can't enable Azure Cloud Provider
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2018-01-17 09:55 UTC by Takayoshi Tanaka
Modified: 2022-03-13 14:38 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-03 18:58:21 UTC
Target Upstream Version:
Embargoed:
gwest: needinfo-
gwest: needinfo-


Attachments

Description Takayoshi Tanaka 2018-01-17 09:55:08 UTC
Description of problem:
When installing OpenShift on Azure with the Azure Cloud Provider enabled in the inventory file, the installer fails to register the node.

Version-Release number of the following components:
~~~
# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.83-1.git.0.84c5eff.el7.noarch

# rpm -q ansible
ansible-2.4.1.0-1.el7.noarch

# ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
~~~

How reproducible:

Steps to Reproduce:
1. Provision a RHEL server on Azure using Azure's default name resolution. (If no DNS server configuration is specified on Azure, Azure uses its default name resolution.)

2. Create an azure.conf file and place it on every node (a sketch of the file follows these steps).
https://access.redhat.com/solutions/3321281

3. Create an Ansible inventory file as follows:
~~~
[OSEv3:vars]
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~
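
For reference, a minimal sketch of the azure.conf mentioned in step 2; the field names mirror the ones used later in this bug (comment 24), all values are placeholders, and the linked KCS article remains the authoritative format.
~~~
# Sketch only: create /etc/azure/azure.conf on every master and node (values are placeholders).
mkdir -p /etc/azure
cat > /etc/azure/azure.conf <<'EOF'
{
  "aadClientID": "<service-principal-client-id>",
  "aadClientSecret": "<service-principal-secret>",
  "subscriptionID": "<azure-subscription-id>",
  "tenantID": "<azure-tenant-id>",
  "resourceGroup": "<resource-group-name>"
}
EOF
chmod 600 /etc/azure/azure.conf
~~~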

Actual results:
<tatanaka02011214-n> (0, '\r\n{"changed": true, "end": "2018-01-17 04:49:45.956673", "stdout": "", "cmd": ["/sbin/atomic-openshift-excluder", "exclude"], "rc": 0, "start": "2018-01-17 04:49:45.884909", "stderr": "", "delta": "0:00:00.071764", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "/sbin/atomic-openshift-excluder exclude", "removes": null, "creates": null, "chdir": null, "stdin": null}}}\r\n', 'Shared connection to tatanaka02011214-n closed.\r\n')
changed: [tatanaka02011214-n] => {
    "changed": true, 
    "cmd": [
        "/sbin/atomic-openshift-excluder", 
        "exclude"
    ], 
    "delta": "0:00:00.071764", 
    "end": "2018-01-17 04:49:45.956673", 
    "failed": false, 
    "invocation": {
        "module_args": {
            "_raw_params": "/sbin/atomic-openshift-excluder exclude", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "start": "2018-01-17 04:49:45.884909"
}
META: ran handlers
META: ran handlers
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
tatanaka02011214-m         : ok=513  changed=35   unreachable=0    failed=1   
tatanaka02011214-n         : ok=231  changed=15   unreachable=0    failed=0   



Failure summary:


  1. Hosts:    tatanaka02011214-m
     Play:     Configure nodes
     Task:     restart node
     Message:  Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
               
(Ansible logs are attached in a private comment.)


Expected results:
Installation is completed without errors.

Additional info:
When reproducing this bug, you will first hit the following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1535340

After restarting the network service and running the Ansible playbook again, this bug occurs.

This bug can be worked around by commenting out the cloud-provider-related values in master-config.yaml and node-config.yaml:
https://docs.openshift.com/container-platform/3.6/install_config/configuring_azure.html

Comment 4 Takayoshi Tanaka 2018-01-18 06:49:00 UTC
I could reproduce the issue with a single master/node install. The issue occurs whenever the Azure Cloud Provider is specified in the Ansible inventory file [1].

[1] 
~~~
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~

Note that the other bug, 1535340, occurs before this BZ does. BZ 1535340 can be worked around by restarting the network service and running the install playbook again. However, this BZ 1535391 cannot be worked around by restarting the network service (or by rebooting the host).

The workaround is to stop specifying the Azure Cloud Provider in the Ansible inventory file, which means the Azure Cloud Provider has to be enabled after the OpenShift installation has completed.
Enabling it also requires removing and re-registering the node.
https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html
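
For illustration, a rough sketch of that post-install approach (paths and service name assume a standard RPM install; see the linked docs for the full procedure):
~~~
# Sketch: enable the Azure Cloud Provider after installation.
# 1. Add the azure cloud-provider / cloud-config arguments to
#    /etc/origin/master/master-config.yaml and /etc/origin/node/node-config.yaml.
# 2. The node's external ID changes, so the old node object must be removed:
oc delete node <node_name>
systemctl restart atomic-openshift-node
~~~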

Comment 5 Takayoshi Tanaka 2018-01-18 06:52:11 UTC
Changed the version to 3.7.1 because I confirmed this issue happens on both 3.7.1 and 3.6.1.

Comment 7 Wenqi He 2018-01-19 07:32:07 UTC
This may be caused by the internal hostname. I opened a bug for 3.9: https://bugzilla.redhat.com/show_bug.cgi?id=1534934
 
When creating a VM on Azure now, you need to add an internal hostname.

Comment 8 Glenn West 2018-01-19 07:36:44 UTC
The hostname must match the VM name on Azure.
From /etc/ansible/hosts:


[masters]
gsw1v37x openshift_hostname=gsw1v37x openshift_ip=10.0.0.4 openshift_public_ip=52.163.244.106

Comment 9 Takayoshi Tanaka 2018-01-19 07:45:43 UTC
Here is the output. Which hostname is used?

# hostnamectl 
   Static hostname: tatanaka-37-with-cp
   Pretty hostname: [localhost.localdomain]
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
           Boot ID: 96f0f05dd5f44c3ebf6b2684cc3713e0
    Virtualization: microsoft
  Operating System: Employee SKU
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64

//inventoryfile
[masters]
tatanaka-37-with-cp

[nodes]
tatanaka-37-with-cp openshift_node_labels="{'region': 'infra', 'zone': 'default'}"  openshift_schedulable=true openshift_hostname=tatanaka-37-with-cp openshift_public_hostname=tatanaka-37-with-cp.westus2.cloudapp.azure.com

Comment 10 Takayoshi Tanaka 2018-01-19 07:55:20 UTC
The Azure VM name is the same as the node name.

# curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/?api-version=2017-04-02"
{"location":"westus2","name":"tatanaka-37-with-cp","offer":"","osType":"Linux","platformFaultDomain":"0","platformUpdateDomain":"0","publisher":"","sku":"","version":"","vmId":"3542b9d1-d47b-4fe3-a6fb-0c413f029567","vmSize":"Standard_D2S_V3"}

Comment 11 Glenn West 2018-01-19 08:52:16 UTC
Test Run for 3.6: Same error.

Jan 19 08:20:09 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:09.570138   74735 rest.go:324] Starting watch for /apis/rbac.authorization.k8s.io/v1beta1/roles, rv=489 labels= fields= timeout=9m21s
Jan 19 08:20:09 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:09.579531   74735 rest.go:324] Starting watch for /apis/apps.openshift.io/v1/deploymentconfigs, rv=4 labels= fields= timeout=6m12s
Jan 19 08:20:11 gsw1v36 atomic-openshift-node[94335]: W0119 08:20:11.219185   94335 sdn_controller.go:38] Could not find an allocated subnet for node: gsw1v36, Waiting...
Jan 19 08:20:11 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:11.331317   74735 rest.go:324] Starting watch for /api/v1/services, rv=11 labels= fields= timeout=8m38s
Jan 19 08:20:15 gsw1v36 sshd[94326]: Invalid user guest from 172.104.230.143 port 45412
Jan 19 08:20:15 gsw1v36 sshd[94326]: input_userauth_request: invalid user guest [preauth]
Jan 19 08:20:15 gsw1v36 sshd[94326]: Received disconnect from 172.104.230.143 port 45412:11: Normal Shutdown, Thank you for playing [preauth]
Jan 19 08:20:15 gsw1v36 sshd[94326]: Disconnected from 172.104.230.143 port 45412 [preauth]
Jan 19 08:20:15 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:15.550244   74735 rest.go:324] Starting watch for /apis/template.openshift.io/v1/templates, rv=1079 labels= fields= timeout=7m14s
Jan 19 08:20:15 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:15.587366   74735 rest.go:324] Starting watch for /apis/build.openshift.io/v1/builds, rv=4 labels= fields= timeout=9m40s
Jan 19 08:20:17 gsw1v36 atomic-openshift-node[94335]: W0119 08:20:17.622331   94335 sdn_controller.go:38] Could not find an allocated subnet for node: gsw1v36, Waiting...
Jan 19 08:20:17 gsw1v36 atomic-openshift-node[94335]: F0119 08:20:17.622369   94335 node.go:309] error: SDN node startup failed: failed to get subnet for this host: gsw1v36, error: timed out waiting for the condition
Jan 19 08:20:17 gsw1v36 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a

Comment 15 Takayoshi Tanaka 2018-01-22 07:02:06 UTC
When the node service fails to start due to this BZ, it can be fixed by removing the Azure Cloud Provider settings from node-config.yaml, as follows.

~~~
kubeletArguments: 
  #cloud-config:
  #- /etc/azure/azure.conf
  #cloud-provider:
  #- azure
  #enable-controller-attach-detach:
  #- 'true'
  node-labels:
  - region=infra
  - zone=default
~~~

After removing the config and restarting the node service, the subnet was created and the node service was able to start. I found the following messages in the master log.

~~~
Jan 22 01:37:52 tatanaka-37-with-cp atomic-openshift-master-controllers[122534]: I0122 01:37:52.263891  122534 subnets.go:245] Watch Added event for HostSubnet "tatanaka-37-with-cp"
Jan 22 01:37:52 tatanaka-37-with-cp atomic-openshift-master-controllers[122534]: I0122 01:37:52.264706  122534 subnets.go:105] Created HostSubnet tatanaka-37-with-cp (host: "tatanaka-37-with-cp", ip: "10.0.0.7", subnet: "10.128.0.0/23")
~~~
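
For completeness, the restart-and-verify sequence looks roughly like this (sketch; commands and service name as used elsewhere in this bug):
~~~
# Sketch: after commenting out the cloud-provider entries in node-config.yaml
systemctl restart atomic-openshift-node
oc get hostsubnet        # the HostSubnet for this node should now be listed
~~~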

Based on my investigation of the source code and logs, it looks like the handleAddOrUpdateNode method in subnets.go is not triggered when the Azure Cloud Provider is specified.


Also, after the node service was able to start, deleting the node and un-commenting the Azure Cloud Provider settings causes the issue again.

//on master
~~~
# oc get node
NAME                  STATUS    AGE       VERSION
tatanaka-37-with-cp   Ready     21m       v1.7.6+a08f5eeb62
# oc delete node tatanaka-37-with-cp
node "tatanaka-37-with-cp" deleted
# oc get hostsubnet 
No resources found.
~~~

Adding the Azure Cloud Provider breaks something in the add-node process.

Comment 16 Takayoshi Tanaka 2018-01-22 08:22:36 UTC
Strangely, the workaround suddenly stopped working. On both 3.7.14 and 3.6.173.0.83, the node service failed to start after removing the node (oc delete node <node_name>) and restarting the node service.

~~~
# oc version 
oc v3.7.14
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ose3-single-vm.westus2.cloudapp.azure.com:8443
openshift v3.7.14
kubernetes v1.7.6+a08f5eeb62
~~~

~~~
# oc version 
oc v3.6.173.0.83
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://tatanaka-36-with-cp.westus2.cloudapp.azure.com:8443
openshift v3.6.173.0.83
kubernetes v1.6.1+5115d708d7

~~~

I'll investigate in more detail and post an update.

Comment 19 Ryan Cook 2018-01-22 18:14:14 UTC
Tested against RHEL 7.2, RHEL 7.3, and RHEL 7.4; the issue is present on all versions.

Comment 20 Takayoshi Tanaka 2018-01-23 05:32:53 UTC
According to the customer's update, it looks like this BZ was introduced by the latest minor update for 3.7 and 3.6. It could be a critical issue if it affects all of the latest 3.7 and 3.6 releases on Azure, because a customer then can't install a brand-new OpenShift cluster or add a node.

Also, OCP 3.9 has a similar BZ, 1534934.
I tried adding the internal host name option after creating the VM. However, it has no effect on my OCP 3.7.14.

I'm still investigating the exact trigger of this BZ and a workaround.

Comment 22 Glenn West 2018-01-24 00:20:31 UTC
It affects OCP 3.5, 3.6, and 3.7, single-node and multi-node.
I believe it's not really an "installer" issue, from reading through the kubelet code.
It appears to be a stuck state when the Azure provider is pre-defined.
I have logs for "working" and "not working" on the same machine, which makes it easy to compare.

Slowly working through kubelet.go to find where it's hanging.

If the cloud provider is injected at install time, the problem occurs across all versions. If the cloud provider is not provided, the install proceeds normally, and if you then add it to the node/master config, all appears good on the next restart of the node service.

Note that if you "add" a new node, it will have the problem again.

Note that when the problem is manifesting, none of the network devices are created.

I've also tried various versions of RHEL 7.4 as well, without updates, and it makes no difference.

Comment 23 Takayoshi Tanaka 2018-01-24 08:12:50 UTC
Updated the summary because this issue occurs not only at installation time but also when enabling the provider after a clean installation.

I confirmed this issue is not fixed in v3.7.23.

Azure Cloud Provider specified in /etc/ansible/hosts => fails [1]
Azure Cloud Provider NOT specified in /etc/ansible/hosts => works [2] (the install succeeded without error except for the known BZ 1535340), but it means the Azure Cloud Provider is disabled.
Azure Cloud Provider enabled after a clean installation => fails [3]

Both [1] and [3] show the same error, which means the HostSubnet is not created on the master.

~~~
Jan 24 02:53:31 tatanaka-37c atomic-openshift-node[69828]: W0124 02:53:31.122640   69828 sdn_controller.go:48] Could not find an allocated subnet for node: tatanaka-37c, Waiting...
~~~

Comparing the logs between [2] and [3], I found that some node events are not published in case [3].
Note that the following logs were taken on different clusters.

Both [2] and [3] report this event.
~~~
atomic-openshift-node[25272]: I0123 19:26:26.024489   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Starting' Starting kubelet.
~~~

However, the following events are reported only in [2].
~~~
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039882   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ose3-single-vm status is now: NodeHasNoDiskPressure
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039900   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ose3-single-vm status is now: NodeHasSufficientMemory
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039942   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node ose3-single-vm status is now: NodeHasSufficientDisk

Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.107138   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeAllocatableEnforced' Updated Node Allocatable limit across pods

Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229208   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node ose3-single-vm status is now: NodeHasSufficientDisk

Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229227   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ose3-single-vm status is now: NodeHasNoDiskPressure
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229236   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ose3-single-vm status is now: NodeHasSufficientMemory

//HostSubnet is created at this point.
Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.270873    1896 subnets.go:245] Watch Added event for HostSubnet "ose3-single-vm"
Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.270925    1896 subnets.go:105] Created HostSubnet ose3-single-vm (host: "ose3-single-vm", ip: "10.0.0.4", subnet: "10.128.0.0/23")
//

Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.694299    1896 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"36b23fa3-009d-11e8-beca-000d3af9651a", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node ose3-single-vm event: Registered Node ose3-single-vm in NodeController
~~~

Looking at the source code, this Node event triggers the creation of the HostSubnet.

Comment 24 Glenn West 2018-01-26 02:29:18 UTC
I've updated the all-in-one node scripts for the 3.7 reference architecture; the node is reliably added and also survives a reboot. I would call this a workaround, but it has proven to work. I will backport it to all versions of the reference architecture in the next few days.

https://github.com/openshift/openshift-ansible-contrib/blob/master/reference-architecture/azure-ansible/3.7/allinone.sh

The specific code:
cat > /home/${AUSERNAME}/azure-config.yml <<EOF
#!/usr/bin/ansible-playbook
- hosts: masters
  gather_facts: no
  vars_files:
  - vars.yml
  become: yes
  vars:
    azure_conf_dir: /etc/azure
    azure_conf: "{{ azure_conf_dir }}/azure.conf"
    master_conf: /etc/origin/master/master-config.yaml
  handlers:
  - name: restart atomic-openshift-master-controllers
    systemd:
      state: restarted
      name: atomic-openshift-master-controllers
  - name: restart atomic-openshift-master-api
    systemd:
      state: restarted
      name: atomic-openshift-master-api
  - name: restart atomic-openshift-node
    systemd:
      state: restarted
      name: atomic-openshift-node
  post_tasks:
  - name: make sure /etc/azure exists
    file:
      state: directory
      path: "{{ azure_conf_dir }}"
  - name: populate /etc/azure/azure.conf
    copy:
      dest: "{{ azure_conf }}"
      content: |
        {
          "aadClientID" : "{{ g_aadClientId }}",
          "aadClientSecret" : "{{ g_aadClientSecret }}",
          "subscriptionID" : "{{ g_subscriptionId }}",
          "tenantID" : "{{ g_tenantId }}",
          "resourceGroup": "{{ g_resourceGroup }}",
        }
    notify:
    - restart atomic-openshift-master-api
    - restart atomic-openshift-master-controllers
    - restart atomic-openshift-node
  - name: insert the azure disk config into the master
    modify_yaml:
      dest: "{{ master_conf }}"
      yaml_key: "{{ item.key }}"
      yaml_value: "{{ item.value }}"
    with_items:
    - key: kubernetesMasterConfig.apiServerArguments.cloud-config
      value:
      - "{{ azure_conf }}"
    - key: kubernetesMasterConfig.apiServerArguments.cloud-provider
      value:
      - azure
    - key: kubernetesMasterConfig.controllerArguments.cloud-config
      value:
      - "{{ azure_conf }}"
    - key: kubernetesMasterConfig.controllerArguments.cloud-provider
      value:
      - azure
    notify:
    - restart atomic-openshift-master-api
    - restart atomic-openshift-master-controllers
#

This technique will work on all versions.
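
For anyone trying it outside the ref-arch scripts, a hedged sketch of running the generated playbook (the inventory path is an assumption, and the vars.yml next to it is expected to define the g_* credential variables referenced above):
~~~
# Sketch: run the generated playbook against the cluster inventory.
ansible-playbook -i /etc/ansible/hosts /home/${AUSERNAME}/azure-config.yml
~~~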

Comment 25 Takayoshi Tanaka 2018-01-26 03:10:14 UTC
Hi Glenn,

I executed steps similar to your playbook manually. I'm afraid the Azure Cloud Provider is not actually enabled by this playbook, even though the node service starts without error.

"oc delete node <node>" is a necessary step because the external ID has been changed after applying an azure.conf to node-config.yaml.
https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html

Without deleting the existing node, I'm afraid the Azure Cloud Provider is not enabled.

Could you confirm that you can use Azure Disk or Azure File after the node has been configured with your playbook?

Comment 26 Glenn West 2018-01-26 03:14:21 UTC
Actually, with the timing and everything else, it worked perfectly in the 3.7 ref arch scripts.

On the single node, the delete is not needed; in the multi-node case, I believe it is. (I've had this code in the 3.4 scripts.) I will be updating the multi-node 3.7 later today, and I'm willing to wager I will need the timed deletes I did before.

How many nodes are you running?

Comment 27 Takayoshi Tanaka 2018-01-26 03:16:01 UTC
At present, a single all-in-one node. Did you confirm with multiple nodes?

Comment 28 Glenn West 2018-01-26 03:19:34 UTC
multi-node, will be doing that today. +/- on the PR.

Comment 29 Takayoshi Tanaka 2018-01-26 03:26:57 UTC
Thanks. I'll start testing with multi-node. If multi-node works, it could be a workaround.

Comment 30 Glenn West 2018-01-26 07:30:26 UTC
I've just finished the 3.7 multi-node ref-arch deployment with the workaround, and the node delete is not needed. I've made one final cosmetic change; after the test with that passes, I will submit a PR to openshift-ansible-contrib for the 3.7 multi-node ref arch.

Comment 31 Takayoshi Tanaka 2018-01-26 08:12:32 UTC
Hi Glenn,

Thanks a lot.
I confirmed that multiple nodes can work around the issue as you suggest on OCP 3.7.23. In the latest version, we don't have to delete the node manually because OpenShift deletes the outdated node automatically and registers a new one. In this context, the outdated node and the new node are the same node; the only difference is with/without the Azure Cloud Provider.

Summarizing the situation:
- multiple nodes
 - Can work around the issue with the following steps:
  1. Install the OpenShift cluster without the Azure Cloud Provider
  2. Enable the Azure Cloud Provider after that.
 - Enabling the Azure Cloud Provider at install time still does not succeed
- single node (all-in-one)
 - Still no workaround. (However, all-in-one is not officially supported in the first place; it is only for testing/demo purposes.)

Comment 33 Glenn West 2018-01-26 09:46:45 UTC
The single-VM 3.7 script is in openshift-ansible-contrib and is publicly available.
The multi-VM 3.7 script is on the edge of a pull request for openshift-ansible-contrib, and publicly available as well.

You're welcome to try the upstream of the multi-VM script if you're time-critical.

https://github.com/glennswest/openshift-ansible-contrib

It will most likely be Monday/Tuesday for 3.6.

Comment 34 Glenn West 2018-01-26 09:55:07 UTC
The all-in-one ref arch 3.7 has worked every time with the work-around script in my tests on the upstream over the last 48 hours.

"Injecting" the Azure cloud provider at install time is currently not the documented method for adding the cloud provider on Azure. It happened to work, but it has never been the documented method. The documented method has always been "edit the config files after the install".

Which is effectively what the work-around script is doing.

Comment 35 Takayoshi Tanaka 2018-01-29 01:29:09 UTC
To work around the issue on a single node, I confirmed that creating the HostSubnet manually works fine.

Before enabling the Azure Cloud Provider on each node, save the HostSubnet.

//usually the hostsubnet name is the same as the node name.
$ oc export hostsubnet <node_name> > hostsubnet-<node>.txt

$ oc delete node <node_name>

$ oc create -f hostsubnet-<node>.txt

$ systemctl restart atomic-openshift-node

Actually, sometimes I don't have to delete a node to enable the Cloud Provider, and in that case creating the HostSubnet is not necessary. However, I can't figure out what makes the difference. So, just in case, exporting and re-creating the HostSubnet can be used as a workaround.

Comment 41 Ryan Cook 2018-03-07 19:33:33 UTC
We honestly need to bump the priority on this. For example, if an instance is rebooted and the master removes the node automatically, then when the node comes back online it cannot add itself back into the cluster, because it cannot set up host networking on its own. This is a pretty large risk.

Comment 43 Ryan Cook 2018-03-21 20:30:23 UTC
OK, to fix this, each NIC on the Azure side must have --internal-dns-name set to a value matching the OpenShift node name. For some reason Azure does not configure internal DNS names on NICs by default, so when the node comes back online it cannot register itself.

Upon creation, the following must be set on the NIC:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/static-dns-name-resolution-for-linux-on-azure

If an environment already exists, then the following can be used to set the internal DNS name:

 az network nic update -g RESOURCEGROUP -n $NODE_NAMEVMNic --internal-dns-name $NODE_NAME
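
A hedged sketch of applying that to every node; the resource group name and the "<vm-name>VMNic" NIC naming convention are assumptions about how the VMs were created:
~~~
# Sketch: set the internal DNS name on each node's NIC to match the OpenShift node name.
RESOURCEGROUP="<resource-group>"
for NODE_NAME in master-0 node-0 node-1; do
  az network nic update -g "$RESOURCEGROUP" -n "${NODE_NAME}VMNic" \
    --internal-dns-name "$NODE_NAME"
done
~~~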

Comment 44 Takayoshi Tanaka 2018-03-21 23:15:26 UTC
Hi Ryan,

I have already tried "internal-dns-name" and it does not fix this issue on OCP 3.7: https://bugzilla.redhat.com/show_bug.cgi?id=1535391#c20

As Wenqi noted in the bug below, that is the issue introduced in OCP 3.9:
https://bugzilla.redhat.com/show_bug.cgi?id=1534934

Comment 45 Arun Babu Neelicattu 2018-04-03 21:14:40 UTC
I have submitted a fix upstream for the installer [1]. The specific commit pertaining to the azure.conf issue is [2].

[1] https://github.com/openshift/openshift-ansible/pull/7745
[2] https://github.com/openshift/openshift-ansible/pull/7745/commits/212f310c65748ca26b3d768456d909baaf769a00

Comment 48 DeShuai Ma 2018-04-27 16:39:42 UTC
Workaround from Takayoshi Tanaka <tatanaka>; pasting it here in case someone wants it. (A verification sketch follows the steps.)

1. Install OpenShift without specifying the Azure configuration file.
2. Place the Azure configuration file on all masters and nodes.
3. Edit master-config and restart the master services.
https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html
4. Export the hostsubnets:
$ oc export hostsubnet <node_name> > hostsubnet-<node>.txt
5. Edit node-config, restart the node service, and wait several minutes.
// I found that usually the node will be removed and re-registered with the Azure Cloud Provider automatically. You can watch it with "oc get node -w".
// If the node is not re-registered, proceed to the next step.
6. Delete the node, create the hostsubnet, and restart the node service again.
$ oc delete node <node_name>
$ oc create -f hostsubnet-<node>.txt
$ systemctl restart atomic-openshift-node
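
And a short verification sketch after step 5 or 6 (commands as used earlier in this bug):
~~~
# Sketch: confirm the node re-registers and its HostSubnet exists.
oc get node -w                      # wait until the node shows Ready again
oc get hostsubnet                   # an entry for <node_name> should be present
systemctl status atomic-openshift-node
~~~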

Comment 49 Ryan Cook 2018-04-27 16:53:48 UTC
This should be tested to see how the node reacts to a reboot. If I recall correctly, the azure.conf has to be removed and the process done over again.


