Bug 2104657 - Openshift private cluster fails to install due to missing worker nodes on ASH
Summary: Openshift private cluster fails to install due to missing worker nodes on ASH
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.z
Assignee: OCP Installer
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks: 2060508
 
Reported: 2022-07-06 19:39 UTC by Mike Gahagan
Modified: 2023-03-09 01:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:23:30 UTC
Target Upstream Version:
Embargoed:



Description Mike Gahagan 2022-07-06 19:39:05 UTC
Description of problem:

Installations of private clusters (publish: Internal) fail on Azure Stack Hub because the wrong load balancer is referenced when attempting to create worker nodes.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-05-083948 (and previous nightlies)


How reproducible:

always

Steps to Reproduce:
1. Attempt to install a private cluster on ASH 
2. wait until installer times out after removal of bootstrap
3. log in to cluster and observe that only master nodes exist

Actual results:

Incomplete cluster with no worker nodes

Expected results:

cluster installs successfully

Additional info:

core@mghaganproxy:~$ oc get machines -A
NAMESPACE               NAME                                       PHASE     TYPE              REGION   ZONE   AGE
openshift-machine-api   mgahagan220706-ff6dd-master-0              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-master-1              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-master-2              Running   Standard_DS4_v2   mtcazs          6h16m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-29hpn   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-c6mr4   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-nl6h5   Failed                                      6h8m
openshift-machine-api   mgahagan220706-ff6dd-worker-mtcazs-rtgk8   Failed                                      5h14m

Inspecting one of the failed workers with oc describe, we see:

  Error Message:           failed to reconcile machine "mgahagan220706-ff6dd-worker-mtcazs-c6mr4": network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"

Given that this is a private cluster created with publish: Internal, the proper load balancer to bind the worker node's NIC to should be mgahagan220706-ff6dd-internal.
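
The expected selection can be sketched roughly as follows. This is only an illustration, not the installer's or Machine API's actual code: the function name `expected_load_balancer` is hypothetical, and the naming rule (`<infra ID>-internal` for private clusters, `<infra ID>` otherwise) is an assumption inferred from the resource names seen in this report.

```python
def expected_load_balancer(infra_id: str, publish: str) -> str:
    """Return the load balancer a worker NIC would be expected to join.

    Assumption (inferred from this report, not from the installer source):
    a cluster installed with publish: Internal only has the
    '<infra_id>-internal' load balancer, while an External cluster
    also has one named '<infra_id>'.
    """
    if publish == "Internal":
        return f"{infra_id}-internal"
    if publish == "External":
        return infra_id
    raise ValueError(f"unknown publish strategy: {publish!r}")


# The failing machine asked for 'mgahagan220706-ff6dd'; for a private
# cluster the sketch above yields the '-internal' name instead.
print(expected_load_balancer("mgahagan220706-ff6dd", "Internal"))
```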

Comment 2 Mike Gahagan 2022-07-06 19:47:22 UTC
Full Status message from oc describe:

Status:
  Conditions:
    Last Transition Time:  2022-07-06T13:26:47Z
    Status:                True
    Type:                  Drainable
    Last Transition Time:  2022-07-06T13:26:47Z
    Message:               Instance has not been created
    Reason:                InstanceNotCreated
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
    Last Transition Time:  2022-07-06T13:26:47Z
    Status:                True
    Type:                  Terminable
  Error Message:           failed to reconcile machine "mgahagan220706-ff6dd-worker-mtcazs-c6mr4": network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
  Error Reason:            InvalidConfiguration
  Last Updated:            2022-07-06T13:27:42Z
  Phase:                   Failed
  Provider Status:
    Conditions:
      Last Transition Time:  2022-07-06T13:27:21Z
      Message:               failed to create nic mgahagan220706-ff6dd-worker-mtcazs-c6mr4-nic for machine mgahagan220706-ff6dd-worker-mtcazs-c6mr4: unable to create VM network interface: load balancer mgahagan220706-ff6dd not found: network.LoadBalancersClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' under resource group 'mgahagan220706-ff6dd-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
      Reason:                MachineCreationFailed
      Status:                True
      Type:                  MachineCreated
    Metadata:
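
For anyone triaging several failed machines, the missing resource can be pulled out of the error text with a small script. This is a rough sketch, assuming the "The Resource '...' under resource group '...' was not found" wording shown above stays stable; the helper name `missing_resource` is hypothetical.

```python
import re

# Matches the ARM "ResourceNotFound" wording seen in the status above.
ARM_NOT_FOUND = re.compile(
    r"The Resource '(?P<resource>[^']+)' under resource group "
    r"'(?P<group>[^']+)' was not found"
)


def missing_resource(message: str):
    """Return (resource, resource_group) from an ARM 404 message, or None."""
    match = ARM_NOT_FOUND.search(message)
    if match is None:
        return None
    return match.group("resource"), match.group("group")


message = (
    "The Resource 'Microsoft.Network/loadBalancers/mgahagan220706-ff6dd' "
    "under resource group 'mgahagan220706-ff6dd-rg' was not found."
)
print(missing_resource(message))
```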

Comment 3 Michael McCune 2022-07-12 15:33:45 UTC
talking with the cloud team, we need to do more investigation around the "Internal" publishing option. it's possible we missed a case when adding the implementation for the create machine logic.

we are trying to figure out how many users this might affect, @mgahagan do you have any information about the prevalence of private clusters, or perhaps a little more information about this deployment method?

we aren't sure that we are testing the private cluster option thoroughly and want to understand more about this use case.

Comment 5 Mike Gahagan 2022-07-12 20:15:58 UTC
I know that private clusters are quite a common request on Azure public cloud as well as other cloud providers, but I have not heard of any requests for private clusters on ASH specifically. In the case of ASH it appears that the entire environment is private, so I'm not sure what the use case for an internal-only API is, since the whole cloud is essentially "private".

When creating private clusters on Azure public cloud, I still see that an additional load balancer is created, but there are no rules assigned to it. I'm not sure if that's helpful in debugging the issue we are seeing here.

Comment 6 Michael McCune 2022-07-12 20:25:46 UTC
thanks Mike, it's helpful for us in building a little more context around the issue.

Comment 7 Michael McCune 2022-07-13 15:18:52 UTC
we discussed this issue again during our team standup and we would like to reach out to our ASH contacts to understand a little more about the differences between public Azure and ASH in these scenarios. we had seen an issue related to availability sets in the past when we encountered this.

we are going to try and replicate this as well.

Comment 8 Joel Speed 2022-07-14 13:38:04 UTC
Is there a must-gather associated with this issue? Or, if someone can reproduce it, could they provide a must-gather (perhaps via Google Drive)?

Having reviewed the attached bug, I suspect this is an issue with the input the installer is providing to Machine API.

Based on the knowledge we have so far, if this has never worked, this wouldn't break any existing user and, therefore, should not be considered a blocker for this release.

We will try to investigate during the next sprint, though a must-gather would help us reach a conclusion quicker, as our access to ASH environments is limited.

Comment 15 Shiftzilla 2023-03-09 01:23:30 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9367

