Bug 1737505

Summary: Cannot create a machineset with a node that has a public IP on Azure
Product: OpenShift Container Platform
Reporter: Alex Krzos <akrzos>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: sunzhaohua <zhsun>
Severity: unspecified
Priority: unspecified
Version: 4.2.0
CC: agarcial, jmencak, xtian
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-10-16 06:34:48 UTC
Type: Bug

Description Alex Krzos 2019-08-05 14:22:11 UTC
Description of problem:
Creating an additional machineset on an Azure cluster, for a node to host workload testing (infra nodes would be another example), fails: the machine is not provisioned because its public IP cannot be created.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-31-162901


How reproducible:
For this build it is reproducible

Steps to Reproduce:
1. Deploy an Azure cluster via the IPI installer
2. Create a machineset with https://gist.github.com/akrzos/24880453b050047e11723c28ad778154
3. View logs from machine-controller pod

Actual results:
Error in logs:

Machine error: failed to reconcile machine "akrzos-test-w5t9j-workload-centralus1-6kznm"s: failed to create nic akrzos-test-w5t9j-workload-centralus1-6kznm-nic for machine akrzos-test-w5t9j-workload-centralus1-6kznm: unable to create Public IP: cannot create public ip: network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label akrzos-test-w5t9j-akrzos-test-w5t9j-workload-centralus1-6kznm-publicip is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]


Expected results:
The machineset is created, and the machine is provisioned and added to the OCP cluster.

Additional info:

Afterwards, attempting to delete the machineset results in a panic in the pod: https://gist.github.com/akrzos/e3617c5bb8be4bb15c2c9521542613d9

Comment 1 Jan Chaloupka 2019-08-05 15:25:14 UTC
The problem is the public IP resource name: akrzos-test-w5t9j-akrzos-test-w5t9j-workload-centralus1-6kznm-publicip

It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$.

So it can be at most 63 characters long. In your case it's 70 characters long. Is there any way to make the name shorter? I.e. making the machineset name shorter? The public IP name is constructed as CLUSTERID+MACHINENAME+"publicip".
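
The 63-character limit follows from the regular expression: one leading letter, 1 to 61 middle characters, and one trailing alphanumeric. A minimal Python sketch of the check, assuming the three parts are joined with hyphens (as the name in the error message suggests); `public_ip_name` is a hypothetical helper for illustration, not the provider's code:

```python
import re

# Pattern quoted in the Azure error message.
DOMAIN_LABEL_RE = re.compile(r"^[a-z][a-z0-9-]{1,61}[a-z0-9]$")


def public_ip_name(cluster_id: str, machine_name: str) -> str:
    """Hypothetical helper: join the parts with hyphens, as in the reported name."""
    return f"{cluster_id}-{machine_name}-publicip"


name = public_ip_name(
    "akrzos-test-w5t9j",
    "akrzos-test-w5t9j-workload-centralus1-6kznm",
)
print(len(name))                          # 70
print(bool(DOMAIN_LABEL_RE.match(name)))  # False, 7 characters over the limit
```

Since the cluster ID appears twice (once as the prefix and once inside the machine name), it accounts for 34 of the 70 characters here.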

Comment 2 Alberto 2019-08-05 15:50:50 UTC
we should also probably drop "publicip" from the name

Comment 3 Alex Krzos 2019-08-05 15:56:59 UTC
(In reply to Jan Chaloupka from comment #1)
> The problem is the public IP resource name:
> akrzos-test-w5t9j-akrzos-test-w5t9j-workload-centralus1-6kznm-publicip
> 
> It must conform to the following regular expression:
> ^[a-z][a-z0-9-]{1,61}[a-z0-9]$.
> 
> So it can be at most 63 characters long. In your case it's 70 characters
> long. Is there any way to make the name shorter? I.e. making the machineset
> name shorter? The public IP name is constructed as
> CLUSTERID+MACHINENAME+"publicip".

I will use a shorter machineset name; however, I was just "copying" the naming pattern of the worker node machinesets.

Comment 4 Jan Chaloupka 2019-08-05 21:00:02 UTC
> we should also probably drop "publicip" from the name

Even that will not help. I don't see any other way but to generate the name randomly. We can still keep the CLUSTERID prefix at least, and maybe the first xxx characters of the machine name, so that the fixed part is 50 characters long, and then randomize the last 13 characters.
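
A rough Python sketch of this proposal, for illustration only (Comment 8 describes the different approach that was eventually merged). The helper name, the alphabet, and the choice to truncate the hyphen-joined CLUSTERID and machine name at 50 characters are all assumptions; the proposal leaves the exact split open:

```python
import secrets
import string

FIXED_LEN = 50   # fixed, human-readable part, per the proposal
RANDOM_LEN = 13  # randomized tail, per the proposal
ALPHABET = string.ascii_lowercase + string.digits


def randomized_public_ip_name(cluster_id: str, machine_name: str) -> str:
    """Hypothetical helper: keep a truncated readable prefix, randomize the rest."""
    prefix = f"{cluster_id}-{machine_name}"[:FIXED_LEN]
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(RANDOM_LEN))
    return prefix + suffix


name = randomized_public_ip_name(
    "akrzos-test-w5t9j",
    "akrzos-test-w5t9j-workload-centralus1-6kznm",
)
print(len(name))  # 63
```

One consequence of this design is that the name can no longer be recomputed from the cluster ID and machine name alone, so it would presumably have to be recorded somewhere to find the resource again at deletion time.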

Comment 5 Alberto 2019-08-06 07:15:25 UTC
How would reducing the number of characters not help? Also, what is the value of having "publicip" in the name?

Comment 6 Jan Chaloupka 2019-08-06 08:39:06 UTC
> How would reducing the number of characters not help? Also, what is the value of having "publicip" in the name?

"Even that will not help" = "Even that will not be sufficient"

Comment 8 Jan Chaloupka 2019-08-22 09:58:34 UTC
Another PR related to the issue was merged: https://github.com/openshift/cluster-api-provider-azure/pull/72

Instead of generating random names, we return an error when the name is too long. In that case, either the machine name generated by the machineset needs to be made shorter or the publicIP field needs to be set to false.
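
The merged behaviour can be pictured with a small Python sketch. This is not the code from the PR: the helper name and the hyphen-joined name format are assumptions, and the error text is copied from the machine-controller log quoted elsewhere in this report:

```python
MAX_PUBLIC_IP_NAME_LEN = 63  # upper bound implied by the Azure regular expression


def checked_public_ip_name(cluster_id: str, machine_name: str) -> str:
    """Hypothetical helper: build the name, refusing to continue if it is too long."""
    name = f"{cluster_id}-{machine_name}-publicip"
    if len(name) > MAX_PUBLIC_IP_NAME_LEN:
        # Fail early instead of sending a request that Azure will reject.
        raise ValueError("machine public IP name is longer than 63 characters")
    return name


try:
    checked_public_ip_name(
        "akrzos-test-w5t9j",
        "akrzos-test-w5t9j-workload-centralus1-6kznm",
    )
except ValueError as err:
    print(f"unable to create Public IP: {err}")
```

Failing before the Azure call keeps the error message short and points directly at the two remedies named above.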

Comment 9 sunzhaohua 2019-08-23 03:39:57 UTC
@Jan,
I created a machineset with "publicIP: true" and "name: zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5". The logs show "unable to create Public IP: machine public IP name is longer than 63 characters".
I then deleted the machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5, but the machine could not be deleted.

I0823 02:17:42.160824       1 controller.go:141] Reconciling Machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5"
I0823 02:17:42.160861       1 controller.go:310] Machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0823 02:17:42.160876       1 actuator.go:200] Checking if machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5 exists
I0823 02:17:42.354384       1 controller.go:259] Reconciling machine object zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5 triggers idempotent create.
I0823 02:17:42.354410       1 actuator.go:93] Creating machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5
E0823 02:17:42.366884       1 actuator.go:87] Machine error: failed to reconcile machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5"s: failed to create nic zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5-nic for machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5: unable to create Public IP: machine public IP name is longer than 63 characters
W0823 02:17:42.366903       1 controller.go:261] Failed to create machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5": requeue in: 1m0s

$ oc delete machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5
machine.machine.openshift.io "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5" deleted
^C

I0823 03:34:52.378589       1 controller.go:205] Reconciling machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5" triggers delete
I0823 03:34:52.378596       1 actuator.go:128] Deleting machine zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5
I0823 03:34:52.379291       1 virtualmachines.go:225] deleting vm zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5 
I0823 03:34:52.614725       1 virtualmachines.go:242] successfully deleted vm zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5 
I0823 03:34:52.614751       1 disks.go:49] deleting disk zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0823 03:34:52.650286       1 disks.go:65] successfully deleted disk zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0823 03:34:52.650319       1 networkinterfaces.go:178] deleting nic zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5-nic
I0823 03:34:52.712865       1 networkinterfaces.go:197] successfully deleted nic zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5-nic
E0823 03:34:52.727330       1 actuator.go:87] Machine error: failed to delete machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5": unable to create Public IP: machine public IP name is longer than 63 characters
E0823 03:34:52.727348       1 controller.go:220] Failed to delete machine "zhsun4-5b994-worker-centralus1-test1-test2-test3-test4-test5": requeue in: 1m0s
I0823 03:34:52.727360       1 controller.go:364] Actuator returned requeue-after error: requeue in: 1m0s

Comment 10 Jan Chaloupka 2019-08-28 08:42:08 UTC
Fix for the deletion case: https://github.com/openshift/cluster-api-provider-azure/pull/75

The generated name can be longer than Azure allows under any of the following conditions:
- the machine name is changed (can't happen without creating a new CR)
- the cluster name is changed (could happen, but then we will get a different name anyway)
- the machine CR was created with a public IP name that is too long (in which case no instance was created)
- the machine config was edited and the publicIP field was set to true (no public IP resource is created after an instance has been created)

In all of these cases there is nothing to delete, so the deletion can be skipped.
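
The reasoning above amounts to a guard in the delete path. A minimal Python sketch under stated assumptions: `delete_public_ip` and the `delete_resource` callback are hypothetical stand-ins for the provider's Azure calls, the cluster ID "zhsun5-swwlm" is inferred from the machine name, and the log text is taken from the verification log in this report:

```python
MAX_PUBLIC_IP_NAME_LEN = 63


def delete_public_ip(cluster_id, machine_name, delete_resource):
    """Hypothetical helper: skip deletion when the name could never have been created."""
    name = f"{cluster_id}-{machine_name}-publicip"
    if len(name) > MAX_PUBLIC_IP_NAME_LEN:
        print("Generated public IP name was too long, skipping deletion of the resource")
        return False
    delete_resource(name)
    return True


deleted = []  # stands in for the Azure delete call
delete_public_ip(
    "zhsun5-swwlm",
    "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5",
    deleted.append,
)
delete_public_ip("c1", "c1-worker-1", deleted.append)
print(deleted)  # ['c1-c1-worker-1-publicip']
```

Skipping rather than failing lets the rest of the machine deletion finish, which is the behaviour the earlier log was missing.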

Comment 11 Jan Chaloupka 2019-08-28 11:15:12 UTC
@sunzhaohua,

https://github.com/openshift/cluster-api-provider-azure/pull/75 just merged.

Comment 12 sunzhaohua 2019-08-30 02:57:07 UTC
Verified.

$ oc delete machine zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5
machine.machine.openshift.io "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" deleted

$ oc logs -f machine-api-controllers-7b97cbd9f4-h8mgj -c machine-controller
I0830 02:54:55.238249       1 controller.go:310] Machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0830 02:54:55.238267       1 controller.go:205] Reconciling machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" triggers delete
I0830 02:54:55.238278       1 actuator.go:128] Deleting machine zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5
I0830 02:54:55.239184       1 virtualmachines.go:225] deleting vm zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5 
I0830 02:54:55.525176       1 virtualmachines.go:242] successfully deleted vm zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5 
I0830 02:54:55.525204       1 disks.go:49] deleting disk zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0830 02:54:55.564315       1 disks.go:65] successfully deleted disk zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0830 02:54:55.564392       1 networkinterfaces.go:178] deleting nic zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5-nic
I0830 02:54:55.696393       1 networkinterfaces.go:197] successfully deleted nic zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5-nic
I0830 02:54:55.696423       1 reconciler.go:466] Generated public IP name was too long, skipping deletion of the resource
E0830 02:54:55.736213       1 controller.go:235] Failed to remove finalizer from machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5": the object has been modified; please apply your changes to the latest version and try again
I0830 02:54:56.736498       1 controller.go:141] Reconciling Machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5"
I0830 02:54:56.736531       1 controller.go:310] Machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0830 02:54:56.736549       1 controller.go:205] Reconciling machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" triggers delete
I0830 02:54:56.736556       1 actuator.go:128] Deleting machine zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5
I0830 02:54:56.737503       1 virtualmachines.go:225] deleting vm zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5 
I0830 02:54:56.975689       1 virtualmachines.go:242] successfully deleted vm zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5 
I0830 02:54:56.975715       1 disks.go:49] deleting disk zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0830 02:54:57.022344       1 disks.go:65] successfully deleted disk zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5_OSDisk
I0830 02:54:57.022380       1 networkinterfaces.go:178] deleting nic zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5-nic
I0830 02:54:57.147936       1 networkinterfaces.go:197] successfully deleted nic zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5-nic
I0830 02:54:57.147973       1 reconciler.go:466] Generated public IP name was too long, skipping deletion of the resource
I0830 02:54:57.184181       1 controller.go:239] Machine "zhsun5-swwlm-worker-centralus1-test1-test2-test3-test4-test5" deletion successful

Comment 13 errata-xmlrpc 2019-10-16 06:34:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922