Bug 2085336 - [IPI-Azure] Fail to create the worker node which HyperVGenerations is V2 or V1 and vmNetworkingType is Accelerated
Summary: [IPI-Azure] Fail to create the worker node which HyperVGenerations is V2 or V...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joel Speed
QA Contact: MayXu
URL:
Whiteboard:
Duplicates: 2085443
Depends On:
Blocks:
Reported: 2022-05-13 04:13 UTC by MayXu
Modified: 2022-09-12 14:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:11:36 UTC
Target Upstream Version:
Embargoed:
maxu: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 5913 0 None open Bug 2085336: based on 4.11 CORS-1916 add the vm family 2022-06-08 03:59:26 UTC
Github openshift machine-api-provider-azure pull 20 0 None open Bug 2085336: Ignore unknown Instance types in accelerated networking check 2022-05-17 16:03:47 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:11:48 UTC

Description MayXu 2022-05-13 04:13:01 UTC
When the worker VM type is specified as one whose HyperVGenerations is V2, the worker nodes fail to be created.

Version: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-11-054135 

How reproducible:
always

Steps to Reproduce:
1. Specify the compute VM type as ‘Standard_DC4s_v3’ (HyperVGenerations is ‘V2’) in install-config.yaml.
2. Create the cluster.

Actual results:
The worker nodes fail to be created; the Machines stay in the Provisioning phase:
maxu-hy4-ndjmn-worker-eastus21-k955j  Provisioning                   3h43m
maxu-hy4-ndjmn-worker-eastus23-w8m9v  Provisioning                   3h43m

The machine-controller logs show the following panic:
oc logs -n openshift-machine-api machine-api-controllers-6f85d75-ld8sc -c machine-controller
I0512 09:54:10.521080       1 actuator.go:85] Creating machine maxu-hy5-9hmpj-worker-eastus21-4t2mf
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x18b7d4e]

goroutine 378 [running]:
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).createNetworkInterface(0xc000647580, {0x1fbb2e8, 0xc000042390}, {0xc0008411a0, 0x28})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:509 +0x1ee
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).CreateMachine(0xc000647580, {0x1fbb2e8, 0xc000042390})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:120 +0x105
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).Create(0xc000647580, {0x1fbb2e8, 0xc000042390})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:98 +0x45
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Actuator).Create(0xc0006c03c0, {0x1, 0x1}, 0xc000b95d40)
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/actuator.go:96 +0x2c5
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc000522ff0, {0x1fbb358, 0xc00057e930}, {{{0xc000682eb8, 0x1c31b00}, {0xc000840630, 0x30}}})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:387 +0xab4
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0001a2160, {0x1fbb358, 0xc00057e810}, {{{0xc000682eb8, 0x1c31b00}, {0xc000840630, 0x413894}}})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001a2160, {0x1fbb2b0, 0xc00013a740}, {0x1b29c80, 0xc000316cc0})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001a2160, {0x1fbb2b0, 0xc00013a740})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x357

Expected results:
The installation succeeds and the worker nodes are created.

Additional info:
The default vmNetworkingType of the workers is now "Accelerated". When it is changed to "Basic", the worker nodes can be created.
Standard_DC8s_v3 works as the master VM type, but fails as the worker VM type.
Ref : https://issues.redhat.com/browse/CORS-1916
https://issues.redhat.com/browse/SPLAT-205

Comment 2 MayXu 2022-05-17 04:11:37 UTC
When the compute and controlPlane azure.type is set to ‘Standard_NP10s’ (HyperVGenerations is ‘V1’) in install-config.yaml (region: southcentralus), the same error occurs.

test version:  release:4.11.0-0.nightly-2022-05-11-054135

Comment 3 Joel Speed 2022-05-17 07:54:44 UTC
@maxu Do you happen to have a must-gather available from one of the times you've produced this issue? It would be helpful to see the full system logs from the cluster and in particular the Machines that were generated by the installer and installed within the cluster

Comment 5 Joel Speed 2022-05-17 15:54:02 UTC
The issue here is that we assume we have a complete enumeration of the instance types in our cached list (which is not true) and are taking a value from something that is potentially nil.

We can make a quick fix to pass the Machine creation to Azure and see how their error handling handles it, but the better thing would be to have a dynamic check for whether accelerated networking is supported or not.

The offending line is https://github.com/openshift/machine-api-provider-azure/blob/08dab41984186873b843f2edd43931b2f378e38b/pkg/cloud/azure/actuators/machine/reconciler.go#L509
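
As a minimal sketch of the failure mode described above, and of the "ignore unknown instance types" approach named in the linked machine-api-provider-azure pull request title, the following Go program looks up a VM size in an incomplete cache without dereferencing a missing entry. It is not the actual provider code: the type instanceTypeInfo, the map cachedInstanceTypes, its single entry, and the function supportsAcceleratedNetworking are all hypothetical names invented for illustration.

```go
package main

import "fmt"

// instanceTypeInfo is a hypothetical record describing one VM size.
type instanceTypeInfo struct {
	AcceleratedNetworking bool
}

// cachedInstanceTypes stands in for a static, incomplete cache of VM sizes.
// The entry below is illustrative only and is not real capability data.
var cachedInstanceTypes = map[string]*instanceTypeInfo{
	"Example_Known_Size": {AcceleratedNetworking: true},
}

// supportsAcceleratedNetworking reports whether the cache says the size
// supports accelerated networking, and whether the size is in the cache at
// all. Indexing the map with an unknown key yields a nil pointer, so the
// entry is checked before any field is read.
func supportsAcceleratedNetworking(vmSize string) (supported, known bool) {
	info, ok := cachedInstanceTypes[vmSize]
	if !ok || info == nil {
		return false, false
	}
	return info.AcceleratedNetworking, true
}

func main() {
	for _, size := range []string{"Example_Known_Size", "Standard_DC4s_v3"} {
		supported, known := supportsAcceleratedNetworking(size)
		if !known {
			// Unknown size: skip the local check and let Azure validate the request.
			fmt.Printf("%s: not in cache, deferring to Azure\n", size)
			continue
		}
		fmt.Printf("%s: accelerated networking supported = %v\n", size, supported)
	}
}
```

Returning a separate "known" flag keeps "not supported" distinct from "not in the cache", so a caller can pass the request through to Azure for sizes the cache does not list.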

Comment 6 Patrick Dillon 2022-05-18 16:10:28 UTC
*** Bug 2085443 has been marked as a duplicate of this bug. ***

Comment 8 MayXu 2022-05-25 17:34:29 UTC
Verified with registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-25-080235.
Worker VM types Standard_E4ads_v5 ('V1,V2'), Standard_NP10s ('V1'), and Standard_DC4s_v3 ('V2') all PASS.

Comment 10 errata-xmlrpc 2022-08-10 11:11:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

