Description of problem:
Installation of private clusters (publish: Internal) fails on Azure Stack Hub because the wrong load balancer is referenced when attempting to create worker nodes.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-05-083948 (and previous nightlies)

How reproducible:
Always

Steps to Reproduce:
1. Attempt to install a private cluster on ASH.
2. Wait until the installer times out after removal of the bootstrap.
3. Log in to the cluster and observe that only master nodes exist.

Actual results:
Incomplete cluster with no worker nodes

Expected results:
Cluster installs successfully

Additional info:
core@mghaganproxy:~$ oc get machines -A
NAMESPACE               NAME                                       PHASE     TYPE              REGION   ZONE   AGE
openshift-machine-api   mgahagan220706-ff6dd-master-0              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-master-1              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-master-2              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-29hpn   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-c6mr4   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-nl6h5   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-rtgk8   Failed                                      5h14m

Inspecting one of the failed workers with oc describe, we see:

Error Message: failed to reconcile machine "mgahagan220706-ff6dd-worker-mtcazs-c6mr4": network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"

Given that this is a private cluster created with publish: Internal, the proper load balancer to bind the worker node's NIC to should be mgahagan220706-ff6dd-internal.
Full Status message from oc describe:

Status:
  Conditions:
    Last Transition Time:  2022-07-06T13:26:47Z
    Status:                True
    Type:                  Drainable
    Last Transition Time:  2022-07-06T13:26:47Z
    Message:               Instance has not been created
    Reason:                InstanceNotCreated
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
    Last Transition Time:  2022-07-06T13:26:47Z
    Status:                True
    Type:                  Terminable
  Error Message:           failed to reconcile machine "mgahagan220706-ff6dd-worker-mtcazs-c6mr4": network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
  Error Reason:            InvalidConfiguration
  Last Updated:            2022-07-06T13:27:42Z
  Phase:                   Failed
  Provider Status:
    Conditions:
      Last Transition Time:  2022-07-06T13:27:21Z
      Message:               failed to create nic mgahagan220706-ff6dd-worker-mtcazs-c6mr4-nic for machine mgahagan220706-ff6dd-worker-mtcazs-c6mr4: unable to create VM network interface: load balancer mgahagan220706-ff6dd not found: network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
      Reason:                MachineCreationFailed
      Status:                True
      Type:                  MachineCreated
    Metadata:
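To make the expected behaviour concrete, here is a minimal sketch of the name selection the report implies. It is not the Machine API provider's actual code: the function name `load_balancer_name` is hypothetical, and the "-internal" suffix rule is an assumption taken only from the names observed in this report (mgahagan220706-ff6dd versus mgahagan220706-ff6dd-internal).

```python
def load_balancer_name(infra_id: str, publish: str) -> str:
    """Return the load balancer a worker NIC should be bound to.

    Hypothetical helper for illustration only. Assumption: a cluster
    installed with publish: Internal has a load balancer named
    "<infra_id>-internal", while an External cluster uses "<infra_id>".
    """
    if publish == "Internal":
        return f"{infra_id}-internal"
    if publish == "External":
        return infra_id
    # Any other value is treated as a configuration error in this sketch.
    raise ValueError(f"unknown publish strategy: {publish!r}")


if __name__ == "__main__":
    # Values taken from the failing cluster in this report.
    print(load_balancer_name("mgahagan220706-ff6dd", "Internal"))
```

The 404 in the status output is consistent with the External branch being taken for a cluster that was installed with publish: Internal, though that remains to be confirmed.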
After talking with the cloud team, we need to do more investigation around the "Internal" publishing option. It's possible we missed a case when adding the implementation for the create machine logic. We are trying to figure out how many users this might affect. @mgahagan, do you have any information about the prevalence of private clusters, or perhaps a little more information about this deployment method? We aren't sure that we are testing the private cluster option thoroughly and want to understand more about this use case.
I know that private clusters are quite a common request on Azure public cloud as well as on other cloud providers, but I have not heard of any requests for private clusters on ASH specifically. In the case of ASH, it appears that the entire environment is private, so I'm not sure what the use case for an internal-only API is, since the whole cloud is essentially "private". When creating private clusters on Azure public, I still see that an additional load balancer is created, but there are no rules assigned to it. I'm not sure if that's helpful in debugging the issue we are seeing here.
Thanks Mike, it's helpful for us in building a little more context around the issue.
We discussed this issue again during our team standup, and we would like to reach out to our ASH contacts to understand a little more about the differences between public Azure and ASH in these scenarios. We had seen an issue related to availability sets in the past when we encountered this. We are going to try to replicate this as well.
Is there a must-gather associated with this issue? Or, if someone can reproduce it, could they provide a must-gather (perhaps via Google Drive)? Having reviewed the attached bug, I suspect this is an issue with the input the installer is providing to Machine API. Based on the knowledge we have so far, if this has never worked, it wouldn't break any existing user and, therefore, should not be considered a blocker for this release. We will try to investigate during the next sprint, though a must-gather would help us reach a conclusion sooner, as our access to ASH environments is limited.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9367