Bug 1931115 - Azure cluster install fails with worker type Standard_D4_v2
Summary: Azure cluster install fails with worker type Standard_D4_v2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aditya Narayanaswamy
QA Contact: To Hung Sze
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-20 18:14 UTC by To Hung Sze
Modified: 2021-12-01 13:16 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Azure clusters created with the Premium_LRS disk type and an instance type that does not support the PremiumIO capability fail to install, because the instance type lacks the required premium storage functionality. A check was added that verifies the selected instance type has the PremiumIO capability whenever the disk type is Premium_LRS, which is the default disk type. The installer queries the Azure subscription and region for the required SKU information and returns an error if the instance type does not support PremiumIO.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:45:37 UTC
Target Upstream Version: 4.8.0
Embargoed:


Attachments (Terms of Use)
install log (163.53 KB, text/plain)
2021-02-20 18:14 UTC, To Hung Sze
must-gather (13.25 MB, application/zip)
2021-02-20 19:01 UTC, To Hung Sze


Links
Github openshift installer pull 4726 (open): Bug 1931115: Azure: Check Azure disk Instance Type for PremiumIO Capabilities (last updated 2021-03-15 16:52:18 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:47:51 UTC)

Description To Hung Sze 2021-02-20 18:14:33 UTC
Created attachment 1758447 [details]
install log

Description of problem:
Cluster install fails when worker has type Standard_D4_v2 in Azure

Version-Release number of selected component (if applicable):
4.7 rc.3

How reproducible:
Install a cluster with these settings in install-config (worker machine pool type Standard_D4_v2, region centralus):
  name: worker
  platform:
    azure:
      type: Standard_D4_v2

    region: centralus


Actual results:
time="2021-02-20T12:16:52-05:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-6c9ffb5cd4-m8jjx\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod \"router-default-6c9ffb5cd4-lnzmp\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)"
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator monitoring Available is False with : "
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
time="2021-02-20T12:16:52-05:00" level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator network ManagementStateDegraded is False with : "
time="2021-02-20T12:16:52-05:00" level=info msg="Cluster operator network Progressing is True with Deploying: Deployment \"openshift-network-diagnostics/network-check-source\" is not available (awaiting 1 nodes)"
time="2021-02-20T12:16:52-05:00" level=error msg="Cluster initialization failed because one or more operators are not functioning properly.\nThe cluster should be accessible for troubleshooting as detailed in the documentation linked below,\nhttps://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html\nThe 'wait-for install-complete' subcommand can then be used to continue the installation"
time="2021-02-20T12:16:52-05:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring"



Additional info:
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                          False       False         True       104m
console                                    4.7.0-rc.3   Unknown     True          False      96m
image-registry                                          False       True          True       97m
ingress                                                 False       True          True       101m
insights                                   4.7.0-rc.3   True        False         False      97m
monitoring                                              False       True          True       103m
network                                    4.7.0-rc.3   True        True          False      104m
node-tuning                                4.7.0-rc.3   True        False         False      104m

Comment 1 To Hung Sze 2021-02-20 19:01:08 UTC
Created attachment 1758449 [details]
must-gather

Comment 2 To Hung Sze 2021-02-20 19:18:20 UTC
Standard_D4_v2 is not compatible with master nodes either.

Comment 3 Yu Qi Zhang 2021-02-23 17:25:37 UTC
This does not seem to have anything to do with MCO. The MCO does not manage disk types you provide to the cluster.

Is Standard_D4_v2 supported? Please make sure it is a supported instance type. Passing this to the installer team to check.

Comment 4 Matthew Staebler 2021-02-23 18:27:33 UTC
From the yaml of a worker Machine,

  errorMessage: 'failed to reconcile machine "tszeaz022021-cj884-worker-centralus1-k2xtt": failed to create vm tszeaz022021-cj884-worker-centralus1-k2xtt: failure sending request for machine tszeaz022021-cj884-worker-centralus1-k2xtt: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Requested operation cannot be performed because the VM size Standard_D4_v2 does not support the storage account type Premium_LRS of disk ''tszeaz022021-cj884-worker-centralus1-k2xtt_OSDisk''. Consider updating the VM to a size that supports Premium storage." Target="osDisk.managedDisk.storageAccountType"'

Comment 5 Matthew Staebler 2021-02-23 18:51:50 UTC
I wonder if the PremiumIO capability determines whether the VM supports Premium_LRS disks. If so, we could add that to the validation done in the installer.

$ az vm list-skus --size D4 -l centralus | jq '.[] | {"name":.name,"PremiumIO":.capabilities[]|select(.name=="PremiumIO")|.value}'
{
  "name": "Standard_D4_v2",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4_v3",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D48_v3",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4s_v3",
  "PremiumIO": "True"
}
{
  "name": "Standard_D48s_v3",
  "PremiumIO": "True"
}
{
  "name": "Standard_D4d_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D48d_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D48_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4ds_v4",
  "PremiumIO": "True"
}
{
  "name": "Standard_D48ds_v4",
  "PremiumIO": "True"
}
{
  "name": "Standard_D4s_v4",
  "PremiumIO": "True"
}
{
  "name": "Standard_D48s_v4",
  "PremiumIO": "True"
}
{
  "name": "Standard_D4a_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D48a_v4",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4as_v4",
  "PremiumIO": "True"
}
{
  "name": "Standard_D48as_v4",
  "PremiumIO": "True"
}
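
For reference, the same check can be done programmatically against the JSON that az vm list-skus emits. The following is a small illustrative Go program, not part of the installer; the type and function names are invented for this sketch. It mirrors the jq filter above by reading the SKU list on stdin and printing each SKU's PremiumIO capability:

// premiumio.go: read the JSON printed by `az vm list-skus` on stdin and
// report each SKU's PremiumIO capability, mirroring the jq filter above.
// Illustrative sketch only; names are invented and not installer code.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type capability struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type resourceSKU struct {
	Name         string       `json:"name"`
	Capabilities []capability `json:"capabilities"`
}

// premiumIO returns the value of the PremiumIO capability, or "unknown"
// if the SKU does not report one.
func premiumIO(s resourceSKU) string {
	for _, c := range s.Capabilities {
		if c.Name == "PremiumIO" {
			return c.Value
		}
	}
	return "unknown"
}

func main() {
	var skus []resourceSKU
	if err := json.NewDecoder(os.Stdin).Decode(&skus); err != nil {
		fmt.Fprintln(os.Stderr, "failed to decode SKU list:", err)
		os.Exit(1)
	}
	for _, s := range skus {
		fmt.Printf("%-20s PremiumIO=%s\n", s.Name, premiumIO(s))
	}
}

Usage, assuming the file is saved as premiumio.go:
az vm list-skus --size D4 -l centralus | go run premiumio.go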

Comment 6 To Hung Sze 2021-02-23 19:23:35 UTC
@mstaeble Thanks for finding the error.

Comment 8 To Hung Sze 2021-03-25 13:19:56 UTC
This is the error:
FATAL failed to fetch Metadata: failed to load asset "Install Config": [controlPlane.platform.azure.osDisk.diskType: Invalid value: "Premium_LRS": PremiumIO not supported for instance type Standard_D4_v2, compute[0].platform.azure.osDisk.diskType: Invalid value: "Premium_LRS": PremiumIO not supported for instance type Standard_D4_v2] 

Install-config used:
  platform:    
    azure:
      type: Standard_D4_v2
  replicas: 3

Thank you for fixing this.
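
For context, the pre-flight check that produces the error above boils down to the rule described in the Doc Text: when the OS disk type is the default Premium_LRS, the selected instance type must report the PremiumIO capability. Below is a minimal, hypothetical Go sketch of that rule. It is not the installer's actual code; the function and parameter names are invented for illustration, and the real check added in the linked pull request looks up the capability from the Azure resource SKUs for the subscription and region.

package main

import "fmt"

// validateOSDiskType is a hypothetical stand-in for the installer's
// pre-flight check: a Premium_LRS OS disk (the default) requires the
// instance type to report PremiumIO=True.
func validateOSDiskType(fieldPath, diskType, instanceType string, capabilities map[string]string) error {
	if diskType != "Premium_LRS" {
		// Non-premium disk types do not need the PremiumIO capability.
		return nil
	}
	if capabilities["PremiumIO"] != "True" {
		return fmt.Errorf("%s: Invalid value: %q: PremiumIO not supported for instance type %s",
			fieldPath, diskType, instanceType)
	}
	return nil
}

func main() {
	// Capabilities as reported for Standard_D4_v2 in Comment 5.
	caps := map[string]string{"PremiumIO": "False"}
	fmt.Println(validateOSDiskType(
		"compute[0].platform.azure.osDisk.diskType", "Premium_LRS", "Standard_D4_v2", caps))
}

Running this prints an error of the same shape as the FATAL message shown in this comment.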

Comment 9 To Hung Sze 2021-03-25 13:24:19 UTC
Note: error is detected when generating manifests

Comment 12 errata-xmlrpc 2021-07-27 22:45:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 13 Srivatsavks 2021-12-01 05:34:08 UTC
Hi Team,

While installing OCP 4.9, I encountered the same error; it seems the issue still exists in the latest OCP version.
Please find details below

[test]$ openshift-install version 
openshift-install 4.9.8
built from commit 1c538b8949f3a0e5b993e1ae33b9cd799806fa93
release image quay.io/openshift-release-dev/ocp-release@sha256:c91c0faf7ae3c480724a935b3dab7e5f49aae19d195b12f3a4ae38f8440ea96b
release architecture amd64

  name: master
  platform:
    azure:
      osDisk:
        diskSizeGB: 100
      type: Standard_D4_v3
  replicas: 1
  
[test]$ openshift-install create manifests --dir .
INFO Credentials loaded from file "/users/s025054/.azure/osServicePrincipal.json" 
FATAL failed to fetch Master Machines: failed to load asset "Install Config": [controlPlane.platform.azure.osDisk.diskType: Invalid value: "Premium_LRS": PremiumIO not supported for instance type Standard_D4_v3, compute[0].platform.azure.osDisk.diskType: Invalid value: "Premium_LRS": PremiumIO not supported for instance type Standard_D2_v3]

Comment 14 Aditya Narayanaswamy 2021-12-01 13:16:35 UTC
Master nodes require PremiumIO capabilities, which Standard_D4_v3 does not support. You should use another instance type that does support PremiumIO. This bug was filed because it previously took the installer about an hour to report this problem; a fix was put in place to check whether the instance type has PremiumIO capabilities before cluster creation starts, drastically reducing the time it takes to report the wrong disk configuration.

$ az vm list-skus --size D4 -l centralus | jq '.[] | {"name":.name,"PremiumIO":.capabilities[]|select(.name=="PremiumIO")|.value}'
{
  "name": "Standard_D4_v2",
  "PremiumIO": "False"
}
{
  "name": "Standard_D4_v3",
  "PremiumIO": "False"
}

