Bug 2108647 - [azure] Standard_D2s_v3 as worker failed by “accelerated networking not supported on instance type”
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.12.0
Assignee: Radek Maňák
QA Contact: sunzhaohua
Docs Contact: Jeana Routh
Duplicates: 2115851 2115852
Depends On:
Blocks: 2115852
Reported: 2022-07-19 15:39 UTC by MayXu
Modified: 2023-01-17 19:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, when Azure added new instance types and enabled accelerated networking support on instance types that previously did not have it, the list of Azure instances in the machine controller became outdated. As a result, the machine controller could not create machines with instance types that did not previously support accelerated networking, even if they support this feature on Azure. With this release, the required instance type information is retrieved from Azure API before the machine is created to keep it up to date and the machine controller is able to create machines with new and updated instance types. This fix also applies to any instance types that are added in the future. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2108647[*BZ#2108647*])
Clone Of:
Last Closed: 2023-01-17 19:53:06 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Github openshift machine-api-provider-azure pull 32 0 None Merged Bug 2108647: Implement fetching SKUs information from Azure 2022-10-19 00:17:16 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:53:30 UTC

Internal Links: 2120170

Description MayXu 2022-07-19 15:39:07 UTC
Description of problem:
Standard_D2s_v3 as worker failed by “accelerated networking not supported on instance type”

In https://github.com/openshift/machine-api-provider-azure/blob/main/pkg/cloud/azure/actuators/machineset/azure_instance_types.go, some VM types are set to AcceleratedNetworking: false, although Azure reports that they support accelerated networking. Examples: Standard_D2s_v3, Standard_D2_v3, and Standard_D2a_v4.
$ az vm list-skus \
 --location southcentralus \
 --all true \
 --resource-type virtualMachines \
 --query "[?capabilities[?name=='AcceleratedNetworkingEnabled'].value!=['False']].{size:size, name:name, vCPUsAvailable:capabilities[?name=='vCPUsAvailable'].value|[0], acceleratedNetworkingEnabled: capabilities[?name=='AcceleratedNetworkingEnabled'].value | [0]}" \
 --output table
The output of this command includes the VM types listed above.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Specify Standard_D2s_v3 as the worker VM type in install-config.yaml:
      type: Standard_D2s_v3

2. Install the cluster. The installation fails.

Actual results:
$ oc get machines -A
NAMESPACE               NAME                                         PHASE     TYPE              REGION           ZONE   AGE
openshift-machine-api   maxu-mi-8prjq-master-0                       Running   Standard_D4s_v3   southcentralus   1      78m
openshift-machine-api   maxu-mi-8prjq-master-1                       Running   Standard_D4s_v3   southcentralus   2      78m
openshift-machine-api   maxu-mi-8prjq-master-2                       Running   Standard_D4s_v3   southcentralus   3      78m
openshift-machine-api   maxu-mi-8prjq-worker-southcentralus1-4fzkx   Failed                                              69m
openshift-machine-api   maxu-mi-8prjq-worker-southcentralus2-dqmlf   Failed                                              69m
openshift-machine-api   maxu-mi-8prjq-worker-southcentralus3-jwr5s   Failed                                              69m

In .openshift_install.log: 
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.maxu-mi.qe.azure.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: \nOAuthServerRouteEndpointAccessibleControllerDegraded: route \"openshift-authentication/oauth-openshift\": status does not have a valid host address\nOAuthServerServiceEndpointAccessibleControllerDegraded: Get \"\": dial tcp connect: connection refused\nOAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nWellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap \"oauth-openshift\" not found (check authentication operator, it is supposed to create this)"

$ oc describe machine maxu-mi-8prjq-worker-southcentralus1-4fzkx -n openshift-machine-api
  Error Message:           failed to reconcile machine "maxu-mi-8prjq-worker-southcentralus1-4fzkx": failed to create nic maxu-mi-8prjq-worker-southcentralus1-4fzkx-nic for machine maxu-mi-8prjq-worker-southcentralus1-4fzkx: accelerated networking not supported on instance type: Standard_D2s_v3
  Error Reason:            InvalidConfiguration

Expected results:
The installation succeeds, or, if Standard_D2s_v3 is genuinely unsupported, the error message explains that more clearly.

Additional info:
1. https://docs.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview#supported-vm-instances
On instances that support hyperthreading, Accelerated Networking is supported on VM instances with 4 or more vCPUs

2. Adding compute.hyperthreading: Disabled to install-config.yaml produces a similar error:
$ oc describe machine maxu-mi6-n6s7d-worker-southcentralus1-hs5bk -n openshift-machine-api
 Error Message:           failed to reconcile machine "maxu-mi6-n6s7d-worker-southcentralus1-hs5bk": failed to create nic maxu-mi6-n6s7d-worker-southcentralus1-hs5bk-nic for machine maxu-mi6-n6s7d-worker-southcentralus1-hs5bk: accelerated networking not supported on instance type: Standard_D2s_v3

3. A VM of size Standard_D2s_v3 can be created on an existing NIC that has accelerated networking enabled:

$ az network nic list -g $RG -o json --query "[].[name, enableAcceleratedNetworking]" -o tsv
maxu-mi7-m595d-worker-southcentralus3-5g6gh-nic	True

$ az vm create --resource-group $RG --name maxutest1 --ssh-key-values '~/openshift-qe.pub' --admin-username cloud-user --image '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Compute/galleries/openshift_qe_image/images/qe-rhel77-proxy-registry/versions/2022.4.24' --os-disk-size-gb 99 --nsg '' --size 'Standard_D2s_v3' --nics maxu-mi7-m595d-worker-southcentralus3-5g6gh-nic --debug

4. With Standard_D2s_v4, the cluster is created successfully and the NICs use accelerated networking.

Comment 1 Joel Speed 2022-07-25 15:35:57 UTC
I'm going to ask a member of the team to look further into this. We understand the issue and know what we need to do: replace the static list with a dynamic check. We will post updates once we've started working on it.
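
To illustrate the difference between the two approaches, here is a minimal Go sketch. It is not the code from the provider or from the eventual fix: `skuLookup`, `checkStatic`, `checkDynamic`, and `fakeAzure` are hypothetical names, the table contents are illustrative, and `fakeAzure` stands in for a live query to Azure. Only the error text is taken from the message reported in this bug.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// skuLookup is a hypothetical abstraction over "ask Azure about this VM size".
type skuLookup func(vmSize string) (acceleratedNetworking bool, err error)

// staticTable imitates a hard-coded list that can go stale when Azure
// enables accelerated networking on an existing size. Contents are illustrative.
var staticTable = map[string]bool{
	"standard_d2s_v3": false, // stale entry, mirroring the behaviour reported here
	"standard_d4s_v3": true,
}

// checkStatic consults only the compiled-in table.
func checkStatic(vmSize string) error {
	if !staticTable[strings.ToLower(vmSize)] {
		return fmt.Errorf("accelerated networking not supported on instance type: %s", vmSize)
	}
	return nil
}

// checkDynamic asks the lookup at machine-creation time, so it reflects
// whatever the cloud currently reports.
func checkDynamic(lookup skuLookup, vmSize string) error {
	ok, err := lookup(vmSize)
	if err != nil {
		return fmt.Errorf("looking up instance type %s: %w", vmSize, err)
	}
	if !ok {
		return fmt.Errorf("accelerated networking not supported on instance type: %s", vmSize)
	}
	return nil
}

// fakeAzure stands in for a real API call; its answers are assumptions.
func fakeAzure(vmSize string) (bool, error) {
	switch strings.ToLower(vmSize) {
	case "standard_d2s_v3", "standard_d4s_v3":
		return true, nil
	}
	return false, errors.New("unknown instance type")
}

func main() {
	fmt.Println("static: ", checkStatic("Standard_D2s_v3"))
	fmt.Println("dynamic:", checkDynamic(fakeAzure, "Standard_D2s_v3"))
}
```

Passing the lookup as a function keeps the check testable without cloud credentials, and an unknown size surfaces as a lookup error rather than being silently treated as unsupported.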

Comment 2 Radek Maňák 2022-08-04 15:20:22 UTC
I am working on a fix for this.

Comment 4 daliu 2022-08-09 01:01:34 UTC
*** Bug 2115852 has been marked as a duplicate of this bug. ***

Comment 5 daliu 2022-08-09 01:02:44 UTC
*** Bug 2115851 has been marked as a duplicate of this bug. ***

Comment 6 daliu 2022-08-22 06:15:41 UTC
Any update for this issue?

Comment 7 Radek Maňák 2022-08-22 08:17:16 UTC
Still working on this. I have it in working state, just need to clean it up a bit and make sure there is no regression.

Comment 8 dhuynh 2022-08-22 14:45:01 UTC
Is the workaround for this issue to simply use a different machine type for the worker nodes?

Comment 9 daliu 2022-08-23 01:00:08 UTC
The workaround can be found here: https://coreos.slack.com/archives/C68TNFWA2/p1659964442215109?thread_ts=1659925635.953309&cid=C68TNFWA2
E.g., work around the issue by forcing Accelerated=False in the install-config for the compute nodes, or by changing the instance type of the compute nodes.

Comment 10 Radek Maňák 2022-08-23 15:12:17 UTC
/bugzilla refresh

Comment 11 Radek Maňák 2022-08-23 15:14:14 UTC
(In reply to Radek Maňák from comment #10)
> /bugzilla refresh

Sorry, wrong browser tab.

Comment 13 MayXu 2022-09-14 03:39:07 UTC
Verified on registry.ci.openshift.org/ocp/release:4.12.0-0.ci-2022-09-13-225342 and registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-13-202959:
$  az network nic list -g $RG -o json --query "[].[name, enableAcceleratedNetworking]" -o tsv
maxu-ac1-bwz6k-master-0-nic	True
maxu-ac1-bwz6k-master-1-nic	True
maxu-ac1-bwz6k-master-2-nic	True
maxu-ac1-bwz6k-worker-eastus1-468jw-nic	True
maxu-ac1-bwz6k-worker-eastus2-p2pzw-nic	True
maxu-ac1-bwz6k-worker-eastus3-r4rhf-nic	True

$ oc get machine -A
NAMESPACE               NAME                                  PHASE     TYPE              REGION   ZONE   AGE
openshift-machine-api   maxu-ac1-bwz6k-master-0               Running   Standard_D4s_v3   eastus   2      39m
openshift-machine-api   maxu-ac1-bwz6k-master-1               Running   Standard_D4s_v3   eastus   3      38m
openshift-machine-api   maxu-ac1-bwz6k-master-2               Running   Standard_D4s_v3   eastus   1      38m
openshift-machine-api   maxu-ac1-bwz6k-worker-eastus1-468jw   Running   Standard_D2s_v3   eastus   1      30m
openshift-machine-api   maxu-ac1-bwz6k-worker-eastus2-p2pzw   Running   Standard_D2s_v3   eastus   2      30m
openshift-machine-api   maxu-ac1-bwz6k-worker-eastus3-r4rhf   Running   Standard_D2s_v3   eastus   3      30m

Comment 16 errata-xmlrpc 2023-01-17 19:53:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

