Description of problem: When installing OpenShift on Azure with the Azure Cloud Provider enabled in the inventory file, the installer fails to register the node.

Version-Release number of the following components:
~~~
# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.83-1.git.0.84c5eff.el7.noarch
# rpm -q ansible
ansible-2.4.1.0-1.el7.noarch
# ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
~~~

How reproducible:

Steps to Reproduce:
1. Provision a RHEL server on Azure using Azure's default name resolution. If no DNS server configuration is specified on Azure, Azure uses its default name resolution.
2. Create an azure.conf file and place it on every node (a sketch of this file is included at the end of this report).
   https://access.redhat.com/solutions/3321281
3. Create an ansible inventory file as follows:
~~~
[OSEv3:vars]
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~

Actual results:
~~~
<tatanaka02011214-n> (0, '\r\n{"changed": true, "end": "2018-01-17 04:49:45.956673", "stdout": "", "cmd": ["/sbin/atomic-openshift-excluder", "exclude"], "rc": 0, "start": "2018-01-17 04:49:45.884909", "stderr": "", "delta": "0:00:00.071764", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "/sbin/atomic-openshift-excluder exclude", "removes": null, "creates": null, "chdir": null, "stdin": null}}}\r\n', 'Shared connection to tatanaka02011214-n closed.\r\n')
changed: [tatanaka02011214-n] => {
    "changed": true,
    "cmd": [
        "/sbin/atomic-openshift-excluder",
        "exclude"
    ],
    "delta": "0:00:00.071764",
    "end": "2018-01-17 04:49:45.956673",
    "failed": false,
    "invocation": {
        "module_args": {
            "_raw_params": "/sbin/atomic-openshift-excluder exclude",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 0,
    "start": "2018-01-17 04:49:45.884909"
}
META: ran handlers
META: ran handlers
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0
tatanaka02011214-m         : ok=513  changed=35   unreachable=0    failed=1
tatanaka02011214-n         : ok=231  changed=15   unreachable=0    failed=0

Failure summary:

  1. Hosts:    tatanaka02011214-m
     Play:     Configure nodes
     Task:     restart node
     Message:  Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
~~~
(ansible logs are attached in private)

Expected results:
Installation completes without errors.

Additional info:
When reproducing this bug, you will see the following bug first.
https://bugzilla.redhat.com/show_bug.cgi?id=1535340
After restarting the network service and executing the ansible playbook again, this bug happens. This bug can be worked around by commenting out the cloud-provider related values in master-config.yaml and node-config.yaml.
https://docs.openshift.com/container-platform/3.6/install_config/configuring_azure.html
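For reference, here is a minimal sketch of the /etc/azure/azure.conf mentioned in step 2. The field names match the playbook posted later in this bug; all values are placeholders and must be replaced with your own service principal and subscription details.
~~~
# Sketch only: create /etc/azure/azure.conf on every master and node.
# The IDs and secret below are placeholders, not real values.
mkdir -p /etc/azure
cat > /etc/azure/azure.conf <<'EOF'
{
  "aadClientID": "<service-principal-client-id>",
  "aadClientSecret": "<service-principal-secret>",
  "subscriptionID": "<azure-subscription-id>",
  "tenantID": "<azure-tenant-id>",
  "resourceGroup": "<resource-group-name>"
}
EOF
chmod 600 /etc/azure/azure.conf
~~~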
I could reproduce the issue with a single master/node installation. If the Azure Cloud Provider is specified in the ansible inventory file, this issue happens. [1]
~~~
osm_controller_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
osm_api_server_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf']}
openshift_node_kubelet_args={'cloud-provider': ['azure'], 'cloud-config': ['/etc/azure/azure.conf'], 'enable-controller-attach-detach': ['true']}
openshift_cloudprovider_kind=azure
~~~
Note that the other bug, 1535340, happens before this BZ does. The former BZ 1535340 can be worked around by restarting the network service and running the install playbook again. However, this BZ 1535391 can't be worked around by restarting the network service (or even by rebooting the host). To work around it, stop specifying the Azure Cloud Provider in the ansible inventory file. That means we have to enable the Azure Cloud Provider after the OpenShift installation has completed, and we also need to remove the node to enable the Azure Cloud Provider.
https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html
Changed the version to 3.7.1 because I confirmed this issue happens on both 3.7.1 and 3.6.1.
This may be caused by the internal hostname. I opened a bug for 3.9: https://bugzilla.redhat.com/show_bug.cgi?id=1534934 When creating a VM on Azure now, you need to add an internal hostname.
The hostname must match the VM name on Azure. From /etc/ansible/hosts:
~~~
[masters]
gsw1v37x openshift_hostname=gsw1v37x openshift_ip=10.0.0.4 openshift_public_ip=52.163.244.106
~~~
Here is the output. Which hostname is used?
~~~
# hostnamectl
   Static hostname: tatanaka-37-with-cp
   Pretty hostname: [localhost.localdomain]
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 02f1ddb1415c4feba9880b2b8c4c5925
           Boot ID: 96f0f05dd5f44c3ebf6b2684cc3713e0
    Virtualization: microsoft
  Operating System: Employee SKU
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
~~~

// inventory file
~~~
[masters]
tatanaka-37-with-cp

[nodes]
tatanaka-37-with-cp openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_schedulable=true openshift_hostname=tatanaka-37-with-cp openshift_public_hostname=tatanaka-37-with-cp.westus2.cloudapp.azure.com
~~~
The Azure VM name is the same as the node name.
~~~
# curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/?api-version=2017-04-02"
{"location":"westus2","name":"tatanaka-37-with-cp","offer":"","osType":"Linux","platformFaultDomain":"0","platformUpdateDomain":"0","publisher":"","sku":"","version":"","vmId":"3542b9d1-d47b-4fe3-a6fb-0c413f029567","vmSize":"Standard_D2S_V3"}
~~~
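A quick sketch to confirm the static hostname matches the Azure VM name reported by the instance metadata service (the format=text query on the name leaf is an assumption about this metadata API version):
~~~
# Sketch: compare the static hostname with the Azure VM name from instance metadata.
VM_NAME=$(curl -s -H Metadata:true \
  "http://169.254.169.254/metadata/instance/compute/name?api-version=2017-04-02&format=text")
HOST_NAME=$(hostnamectl --static)
if [ "$VM_NAME" = "$HOST_NAME" ]; then
  echo "OK: hostname '$HOST_NAME' matches the Azure VM name"
else
  echo "MISMATCH: hostname '$HOST_NAME' vs Azure VM name '$VM_NAME'"
fi
~~~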
Test run for 3.6: same error.
~~~
Jan 19 08:20:09 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:09.570138   74735 rest.go:324] Starting watch for /apis/rbac.authorization.k8s.io/v1beta1/roles, rv=489 labels= fields= timeout=9m21s
Jan 19 08:20:09 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:09.579531   74735 rest.go:324] Starting watch for /apis/apps.openshift.io/v1/deploymentconfigs, rv=4 labels= fields= timeout=6m12s
Jan 19 08:20:11 gsw1v36 atomic-openshift-node[94335]: W0119 08:20:11.219185   94335 sdn_controller.go:38] Could not find an allocated subnet for node: gsw1v36, Waiting...
Jan 19 08:20:11 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:11.331317   74735 rest.go:324] Starting watch for /api/v1/services, rv=11 labels= fields= timeout=8m38s
Jan 19 08:20:15 gsw1v36 sshd[94326]: Invalid user guest from 172.104.230.143 port 45412
Jan 19 08:20:15 gsw1v36 sshd[94326]: input_userauth_request: invalid user guest [preauth]
Jan 19 08:20:15 gsw1v36 sshd[94326]: Received disconnect from 172.104.230.143 port 45412:11: Normal Shutdown, Thank you for playing [preauth]
Jan 19 08:20:15 gsw1v36 sshd[94326]: Disconnected from 172.104.230.143 port 45412 [preauth]
Jan 19 08:20:15 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:15.550244   74735 rest.go:324] Starting watch for /apis/template.openshift.io/v1/templates, rv=1079 labels= fields= timeout=7m14s
Jan 19 08:20:15 gsw1v36 atomic-openshift-master[74735]: I0119 08:20:15.587366   74735 rest.go:324] Starting watch for /apis/build.openshift.io/v1/builds, rv=4 labels= fields= timeout=9m40s
Jan 19 08:20:17 gsw1v36 atomic-openshift-node[94335]: W0119 08:20:17.622331   94335 sdn_controller.go:38] Could not find an allocated subnet for node: gsw1v36, Waiting...
Jan 19 08:20:17 gsw1v36 atomic-openshift-node[94335]: F0119 08:20:17.622369   94335 node.go:309] error: SDN node startup failed: failed to get subnet for this host: gsw1v36, error: timed out waiting for the condition
Jan 19 08:20:17 gsw1v36 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
~~~
When the node service fails to start due to this BZ, we can fix it by removing the Azure Cloud Provider from node-config.yaml as follows:
~~~
kubeletArguments:
  #cloud-config:
  #- /etc/azure/azure.conf
  #cloud-provider:
  #- azure
  #enable-controller-attach-detach:
  #- 'true'
  node-labels:
  - region=infra
  - zone=default
~~~
After removing the config, the subnet was created by restarting the node service, and the node service was able to start. I found the following messages in the master log:
~~~
Jan 22 01:37:52 tatanaka-37-with-cp atomic-openshift-master-controllers[122534]: I0122 01:37:52.263891  122534 subnets.go:245] Watch Added event for HostSubnet "tatanaka-37-with-cp"
Jan 22 01:37:52 tatanaka-37-with-cp atomic-openshift-master-controllers[122534]: I0122 01:37:52.264706  122534 subnets.go:105] Created HostSubnet tatanaka-37-with-cp (host: "tatanaka-37-with-cp", ip: "10.0.0.7", subnet: "10.128.0.0/23")
~~~
According to my investigation of the source code and logs, it looks like the handleAddOrUpdateNode method in subnets.go is not triggered when the Azure Cloud Provider is specified. Also, after the node service was able to start, deleting the node and un-commenting the Azure Cloud Provider causes the issue again.

// on master
~~~
# oc get node
NAME                  STATUS    AGE       VERSION
tatanaka-37-with-cp   Ready     21m       v1.7.6+a08f5eeb62
# oc delete node tatanaka-37-with-cp
node "tatanaka-37-with-cp" deleted
# oc get hostsubnet
No resources found.
~~~
Adding the Azure Cloud Provider breaks something in the Add Node process.
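A minimal sketch of that workaround, assuming the stock /etc/origin/node/node-config.yaml layout shown above (each kubeletArguments key followed by a single value line); adjust the patterns if your file differs:
~~~
# Sketch: comment out the Azure cloud-provider entries in node-config.yaml,
# then restart the node service so a HostSubnet can be allocated.
cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.bak
sed -i -e '/cloud-config:/,+1 s/^/#/' \
       -e '/cloud-provider:/,+1 s/^/#/' \
       -e '/enable-controller-attach-detach:/,+1 s/^/#/' \
       /etc/origin/node/node-config.yaml
systemctl restart atomic-openshift-node
~~~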
Curiously, the workaround suddenly stopped working. On both 3.7.14 and 3.6.173.0.83, the node service failed to start after removing the node (oc delete node <node_name>) and restarting the node service.
~~~
# oc version
oc v3.7.14
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ose3-single-vm.westus2.cloudapp.azure.com:8443
openshift v3.7.14
kubernetes v1.7.6+a08f5eeb62
~~~
~~~
# oc version
oc v3.6.173.0.83
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://tatanaka-36-with-cp.westus2.cloudapp.azure.com:8443
openshift v3.6.173.0.83
kubernetes v1.6.1+5115d708d7
~~~
I'll investigate in more detail and update.
Tested against RHEL 7.2, RHEL 7.3, and RHEL 7.4; the issue is present in all versions.
According to the customer update, it looks like this BZ was introduced in the latest minor update for 3.7 and 3.6. It may be a critical issue if this BZ affects all of the latest 3.7 and 3.6 releases on Azure, because a customer can't install a brand new OpenShift cluster or add a node. Also, OCP 3.9 has a similar BZ, 1534934. I tried adding the internal-host-name option after creating a VM, but it had no effect on my OCP 3.7.14. I'm still investigating the detailed trigger of this BZ and a workaround.
It affects OCP 3.5, 3.6, and 3.7, single and multi-node. It's not really an "installer" issue, I believe, reading through the kubelet code. It appears to be a stuck state when the Azure provider is pre-defined. I have logs for "working" and "not working" on the same machine, which makes it easy to compare. Slowly working through kubelet.go to find where it's hanging. If the cloud provider is injected at install time, the problem occurs across all versions. If the cloud provider is not provided, the install proceeds normally, and if you then add it to the node-master.conf, on the next restart of the node service all appears good. Note that if you "add" a new node, it will have the problem again. Note that if the problem is manifesting, none of the network devices are created. I've also tried various versions of RHEL 7.4, without updates, and it has no effect.
Updated the summary because this issue occurs not only at installation time but also after a clean installation. I confirmed this issue is not fixed in v3.7.23.

Azure Cloud Provider specified in /etc/ansible/hosts => no good [1]
Azure Cloud Provider NOT specified in /etc/ansible/hosts => good [2] (install succeeded without error except for the known BZ 1535340), but it means the Azure Cloud Provider is disabled.
Azure Cloud Provider enabled after a clean installation => no good [3]

Both [1] and [3] show the same error, which means the HostSubnet is not created on the master.
~~~
Jan 24 02:53:31 tatanaka-37c atomic-openshift-node[69828]: W0124 02:53:31.122640   69828 sdn_controller.go:48] Could not find an allocated subnet for node: tatanaka-37c, Waiting...
~~~
Comparing the logs between [2] and [3], I found that some node events are not published in case [3]. Note that the following logs were captured on a different cluster.

Both [2] and [3] report this event:
~~~
atomic-openshift-node[25272]: I0123 19:26:26.024489   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Starting' Starting kubelet.
~~~
However, the following events are only reported in [2]:
~~~
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039882   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ose3-single-vm status is now: NodeHasNoDiskPressure
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039900   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ose3-single-vm status is now: NodeHasSufficientMemory
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.039942   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node ose3-single-vm status is now: NodeHasSufficientDisk
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.107138   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeAllocatableEnforced' Updated Node Allocatable limit across pods
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229208   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node ose3-single-vm status is now: NodeHasSufficientDisk
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229227   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node ose3-single-vm status is now: NodeHasNoDiskPressure
Jan 23 19:26:26 ose3-single-vm atomic-openshift-node[25272]: I0123 19:26:26.229236   25272 server.go:351] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"ose3-single-vm", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node ose3-single-vm status is now: NodeHasSufficientMemory

// The HostSubnet is created at this point.
Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.270873    1896 subnets.go:245] Watch Added event for HostSubnet "ose3-single-vm"
Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.270925    1896 subnets.go:105] Created HostSubnet ose3-single-vm (host: "ose3-single-vm", ip: "10.0.0.4", subnet: "10.128.0.0/23")
//
Jan 23 19:26:26 ose3-single-vm atomic-openshift-master-controllers[1896]: I0123 19:26:26.694299    1896 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ose3-single-vm", UID:"36b23fa3-009d-11e8-beca-000d3af9651a", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node ose3-single-vm event: Registered Node ose3-single-vm in NodeController
~~~
Looking at the source code, this Node event triggers the creation of the HostSubnet.
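To check for this condition on a running cluster, a small sketch that looks for the HostSubnet and for the RegisteredNode / Created HostSubnet messages discussed above (the time window is arbitrary):
~~~
# Sketch: verify whether the master has allocated a HostSubnet for the node.
oc get hostsubnet
oc get node
# Look for the events that should accompany HostSubnet creation.
journalctl -u atomic-openshift-master-controllers --since "10 minutes ago" \
  | grep -E 'RegisteredNode|Created HostSubnet'
~~~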
I've updated the all-in-one node scripts for the 3.7 ref arch, and the node is reliably added and also survives reboot. I would say this is a workaround, but it has proven to work. I will backport this to all versions of the ref arch in the next few days.
https://github.com/openshift/openshift-ansible-contrib/blob/master/reference-architecture/azure-ansible/3.7/allinone.sh

The specific code:
~~~
cat > /home/${AUSERNAME}/azure-config.yml <<EOF
#!/usr/bin/ansible-playbook
- hosts: masters
  gather_facts: no
  vars_files:
  - vars.yml
  become: yes
  vars:
    azure_conf_dir: /etc/azure
    azure_conf: "{{ azure_conf_dir }}/azure.conf"
    master_conf: /etc/origin/master/master-config.yaml
  handlers:
  - name: restart atomic-openshift-master-controllers
    systemd:
      state: restarted
      name: atomic-openshift-master-controllers
  - name: restart atomic-openshift-master-api
    systemd:
      state: restarted
      name: atomic-openshift-master-api
  - name: restart atomic-openshift-node
    systemd:
      state: restarted
      name: atomic-openshift-node
  post_tasks:
  - name: make sure /etc/azure exists
    file:
      state: directory
      path: "{{ azure_conf_dir }}"
  - name: populate /etc/azure/azure.conf
    copy:
      dest: "{{ azure_conf }}"
      content: |
        {
          "aadClientID" : "{{ g_aadClientId }}",
          "aadClientSecret" : "{{ g_aadClientSecret }}",
          "subscriptionID" : "{{ g_subscriptionId }}",
          "tenantID" : "{{ g_tenantId }}",
          "resourceGroup": "{{ g_resourceGroup }}",
        }
    notify:
    - restart atomic-openshift-master-api
    - restart atomic-openshift-master-controllers
    - restart atomic-openshift-node
  - name: insert the azure disk config into the master
    modify_yaml:
      dest: "{{ master_conf }}"
      yaml_key: "{{ item.key }}"
      yaml_value: "{{ item.value }}"
    with_items:
    - key: kubernetesMasterConfig.apiServerArguments.cloud-config
      value:
      - "{{ azure_conf }}"
    - key: kubernetesMasterConfig.apiServerArguments.cloud-provider
      value:
      - azure
    - key: kubernetesMasterConfig.controllerArguments.cloud-config
      value:
      - "{{ azure_conf }}"
    - key: kubernetesMasterConfig.controllerArguments.cloud-provider
      value:
      - azure
    notify:
    - restart atomic-openshift-master-api
    - restart atomic-openshift-master-controllers
EOF
# This technique will work on all versions.
~~~
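For anyone reusing this outside the ref-arch scripts, a hedged usage sketch; the inventory path and the g_* variables in vars.yml are assumptions about your environment:
~~~
# Sketch: run the generated playbook against the cluster inventory.
# vars.yml must define g_aadClientId, g_aadClientSecret, g_subscriptionId,
# g_tenantId and g_resourceGroup, which the playbook above references.
ansible-playbook -i /etc/ansible/hosts /home/${AUSERNAME}/azure-config.yml
~~~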
Hi Glenn, I executed similar steps to your playbook manually. I'm afraid the Azure Cloud Provider is not enabled with this playbook, even though the node service started without error. "oc delete node <node>" is a necessary step because the external ID changes after applying an azure.conf to node-config.yaml. https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html Without deleting the existing node, I'm afraid the Azure Cloud Provider is not enabled. Could you confirm you can use Azure Disk or Azure File after the node has been configured with your playbook?
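A quick sketch to check whether the node actually re-registered with the cloud provider; the externalID change is the symptom described above (field name per the Kubernetes 1.6/1.7 Node API):
~~~
# Sketch: inspect the node's externalID before and after enabling the cloud provider.
# With the Azure cloud provider active, the kubelet should register a different
# externalID than the plain hostname it used before.
oc get node <node_name> -o jsonpath='{.spec.externalID}{"\n"}'
~~~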
Actually, with the timing and everything else, it worked perfectly in the 3.7 ref arch scripts. On the single node the delete is not needed; on the multi-node, I believe it is. (I've had this code in the 3.4 code.) I will be updating the multi-node 3.7 later today, and am willing to wager I will need the timed deletes I did before. How many nodes are you running?
At present, a single all-in-one node. Did you confirm with multiple nodes?
multi-node, will be doing that today. +/- on the PR.
Thanks. I'll start testing with multi-node. If multi-node works, it could be a workaround.
I've just finished the 3.7 multi-node ref-arch deployment with the workaround, and the node delete is not needed. I've made one final cosmetic change; after the test with that passes, I will submit a PR to openshift-ansible-contrib for the 3.7 ref arch for multi-node.
Hi Glenn, thanks a lot. I confirmed that multi-node can work around the issue as you suggest, on OCP 3.7.23. In the latest version, we don't have to delete the node manually because OpenShift deletes the outdated node automatically and registers a new one. In this context, the outdated node and the new node are the same node; the difference is with/without the Azure Cloud Provider.

Summarising the situation:
- Multiple nodes
  - Can work around the issue with the following steps:
    1. Install the OpenShift cluster without the Azure Cloud Provider.
    2. Enable the Azure Cloud Provider after that.
  - Enabling the Azure Cloud Provider at install time still fails.
- Single node (all-in-one)
  - Still no workaround. (However, in the first place, all-in-one is not supported officially; all-in-one is just for testing/demo purposes.)
The single-vm 3.7 is in openshift-ansible-contrib and is publicly available. The multi-vm 3.7 is on the edge of a pull request for openshift-ansible-contrib, and publicly available as well. You're welcome to try the upstream of the multi-vm if you're time-critical. https://github.com/glennswest/openshift-ansible-contrib It will most likely be Monday/Tuesday for 3.6.
The all-in-one ref arch 3.7 works every time with the workaround script in my tests over the last 48 hours on the upstream. The "injection" of the Azure cloud provider is currently not the documented method for adding the cloud provider on Azure. It happened to work, but it has never been the documented method. It's always been "edit the config files after the install", per the documentation, which is effectively what the workaround script is doing.
To work around the single-node case, I confirmed that creating the hostsubnet manually works fine. Before enabling the Azure Cloud Provider on each node, save the hostsubnet.
~~~
// Usually the hostsubnet name is the same as the node name.
$ oc export hostsubnet <node_name> > hostsubnet-<node>.txt
$ oc delete node <node_name>
$ oc create -f hostsubnet-<node>.txt
$ systemctl restart atomic-openshift-node
~~~
Actually, sometimes I don't have to delete the node to enable the Cloud Provider, in which case creating the hostsubnet is not necessary. However, I can't find out what the difference is. So, just in case, exporting and re-creating the hostsubnet could be a workaround.
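A consolidated sketch of those steps for an all-in-one node, assuming the node name equals the hostname; run it on the node with cluster-admin credentials:
~~~
# Sketch of the single-node workaround above.
NODE=$(hostname)                                        # assumes node name == hostname
oc export hostsubnet "$NODE" > "hostsubnet-$NODE.txt"   # save the existing HostSubnet
oc delete node "$NODE"
oc create -f "hostsubnet-$NODE.txt"                     # re-create the HostSubnet
systemctl restart atomic-openshift-node
~~~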
We honestly need to bump the priority on this. For example, if an instance is rebooted and the master removes the node automatically, then when the node comes back online it cannot add itself back into the cluster because it cannot set up host networking on its own. This is a pretty large risk.
OK, to fix this, each NIC on the Azure side must have an --internal-dns-name value matching the OpenShift node name. For some reason Azure does not configure internal DNS names on NICs by default, so when the node comes back online it cannot register itself. Upon creation, the following must be set on the NIC:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/static-dns-name-resolution-for-linux-on-azure
If an environment already exists, then the following can be used to set the internal DNS name:
~~~
az network nic update -g RESOURCEGROUP -n ${NODE_NAME}VMNic --internal-dns-name $NODE_NAME
~~~
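A sketch that applies this to every node in an existing cluster; the "<vm-name>VMNic" naming is an assumption about how the NICs were created, so adjust it to your environment:
~~~
# Sketch: set the internal DNS name on every node's NIC to match the node name.
RESOURCEGROUP=<resource-group>          # placeholder
for NODE_NAME in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  az network nic update -g "$RESOURCEGROUP" -n "${NODE_NAME}VMNic" \
    --internal-dns-name "$NODE_NAME"
done
~~~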
Hi Ryan, I have already tried "internal-dns-name" and it doesn't fix this issue on OCP 3.7. https://bugzilla.redhat.com/show_bug.cgi?id=1535391#c20 As Wenqi opened, that is the issue introduced in OCP 3.9. https://bugzilla.redhat.com/show_bug.cgi?id=1534934
I have submitted a fix upstream for the installer [1]. The specific commit pertaining to the azure.conf issue is [2].

[1] https://github.com/openshift/openshift-ansible/pull/7745
[2] https://github.com/openshift/openshift-ansible/pull/7745/commits/212f310c65748ca26b3d768456d909baaf769a00
Workaround from Takayoshi Tanaka <tatanaka>; pasted here just in case someone wants it. A consolidated sketch follows the steps.

1. Install OpenShift without specifying the Azure configuration file.
2. Place the Azure configuration file on all masters and nodes.
3. Edit master-config and restart the master services.
   https://docs.openshift.com/container-platform/3.7/install_config/configuring_azure.html
4. Export the hostsubnets: $ oc export hostsubnet <node_name> > hostsubnet-<node>.txt
5. Edit node-config, restart the node service, and wait for several minutes.
   // I found the node will usually be removed and re-registered with the Azure Cloud Provider automatically. You can see it with "oc get node -w".
   // If the node won't be re-registered, proceed to the next step.
6. Delete the node, create the hostsubnet, and restart the node service again.
   $ oc delete node <node_name>
   $ oc create -f hostsubnet-<node>.txt
   $ systemctl restart atomic-openshift-node
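A sketch of steps 4-6 for one node, with the wait-and-see behaviour from step 5 and the manual fallback from step 6; the node name placeholder and the 5-minute timeout are assumptions:
~~~
# Sketch: after editing node-config and restarting the node service (step 5),
# wait for the node to re-register on its own; only if it does not, fall back
# to the manual delete/create from step 6.
NODE=<node_name>
oc export hostsubnet "$NODE" > "hostsubnet-$NODE.txt"   # step 4
systemctl restart atomic-openshift-node                 # run on the node itself
for i in $(seq 1 30); do
  oc get node "$NODE" 2>/dev/null | grep -qw Ready && { echo "node re-registered"; exit 0; }
  sleep 10
done
# Fallback (step 6):
oc delete node "$NODE"
oc create -f "hostsubnet-$NODE.txt"
systemctl restart atomic-openshift-node
~~~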
This should be tested to see how the node reacts at reboot. If I recall correctly, the azure.conf has to be removed and the process done over again.