Bug 2085336

Summary: [IPI-Azure] Fails to create worker nodes whose HyperVGenerations is V2 or V1 and whose vmNetworkingType is Accelerated
Product: OpenShift Container Platform Reporter: MayXu <maxu>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: MayXu <maxu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: fgrosjea, jialiu, m.andre, mfedosin, pprinett
Version: 4.11 Keywords: TestBlocker
Target Milestone: --- Flags: maxu: needinfo-
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:11:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description MayXu 2022-05-13 04:13:01 UTC
When the worker VM type specified is one whose HyperVGenerations is V2, the worker node fails to be created.

Version: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-11-054135 

How reproducible:
always

Steps to Reproduce:
Specify the compute VM type as ‘Standard_DC4s_v3’ (HyperVGenerations is ‘V2’) in install-config.yaml.
Create the cluster.

Actual results:
The worker nodes fail to be created; the machines stay in Provisioning:
maxu-hy4-ndjmn-worker-eastus21-k955j  Provisioning                   3h43m
maxu-hy4-ndjmn-worker-eastus23-w8m9v  Provisioning                   3h43m

Check the logs as follows:
oc logs -n openshift-machine-api machine-api-controllers-6f85d75-ld8sc -c machine-controller
I0512 09:54:10.521080       1 actuator.go:85] Creating machine maxu-hy5-9hmpj-worker-eastus21-4t2mf
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x18b7d4e]

goroutine 378 [running]:
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).createNetworkInterface(0xc000647580, {0x1fbb2e8, 0xc000042390}, {0xc0008411a0, 0x28})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:509 +0x1ee
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).CreateMachine(0xc000647580, {0x1fbb2e8, 0xc000042390})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:120 +0x105
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Reconciler).Create(0xc000647580, {0x1fbb2e8, 0xc000042390})
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go:98 +0x45
github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine.(*Actuator).Create(0xc0006c03c0, {0x1, 0x1}, 0xc000b95d40)
	/go/src/github.com/openshift/machine-api-provider-azure/pkg/cloud/azure/actuators/machine/actuator.go:96 +0x2c5
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc000522ff0, {0x1fbb358, 0xc00057e930}, {{{0xc000682eb8, 0x1c31b00}, {0xc000840630, 0x30}}})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:387 +0xab4
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0001a2160, {0x1fbb358, 0xc00057e810}, {{{0xc000682eb8, 0x1c31b00}, {0xc000840630, 0x413894}}})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001a2160, {0x1fbb2b0, 0xc00013a740}, {0x1b29c80, 0xc000316cc0})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001a2160, {0x1fbb2b0, 0xc00013a740})
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/github.com/openshift/machine-api-provider-azure/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x357

Expected results:
The installation succeeds and the worker nodes are created.

Additional info:
The default vmNetworkingType of the worker is now "Accelerated". After changing it to "Basic", the worker nodes can be created.
Standard_DC8s_v3 works as the master VM type but fails as the worker VM type.
Ref : https://issues.redhat.com/browse/CORS-1916
https://issues.redhat.com/browse/SPLAT-205

Comment 2 MayXu 2022-05-17 04:11:37 UTC
When the compute and controlPlane azure.type are set to ‘Standard_NP10s’ (HyperVGenerations is ‘V1’) in install-config.yaml (region: southcentralus), I get the same error.

Test version: release:4.11.0-0.nightly-2022-05-11-054135

Comment 3 Joel Speed 2022-05-17 07:54:44 UTC
@maxu Do you happen to have a must-gather available from one of the times you've reproduced this issue? It would be helpful to see the full system logs from the cluster, and in particular the Machines that were generated by the installer and installed within the cluster.

Comment 5 Joel Speed 2022-05-17 15:54:02 UTC
The issue here is that we assume we have a complete enumeration of the instance types in our cached list (which is not true) and are taking a value from something that is potentially nil.

We can make a quick fix to pass the Machine creation to Azure and see how their error handling handles it, but the better thing would be to have a dynamic check for whether accelerated networking is supported or not.

The offending line is https://github.com/openshift/machine-api-provider-azure/blob/08dab41984186873b843f2edd43931b2f378e38b/pkg/cloud/azure/actuators/machine/reconciler.go#L509

Comment 6 Patrick Dillon 2022-05-18 16:10:28 UTC
*** Bug 2085443 has been marked as a duplicate of this bug. ***

Comment 8 MayXu 2022-05-25 17:34:29 UTC
Checked with registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-25-080235.
Worker VM types Standard_E4ads_v5 ('V1,V2'), Standard_NP10s ('V1'), and Standard_DC4s_v3 ('V2') all PASS.

Comment 10 errata-xmlrpc 2022-08-10 11:11:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069