Bug 1837341
Summary: Nodes in Azure are getting into longUnregistered state

Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Version: 4.4
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Marcel Härri <mharri>
Assignee: Joel Speed <jspeed>
QA Contact: sunzhaohua <zhsun>
Doc Type: Bug Fix
Doc Text:
Cause: The autoscaler expects provider IDs across Node and Machine objects to be an exact match. This may not be the case if the Machine configuration specifies a resource group name with mixed casing, since the Node's provider ID is stored lowercased.
Consequence: The autoscaler believes the Machine never received a Node, and terminates the Machine after a 15-minute period.
Fix: Sanitise the user input and convert the resource group name to lowercase.
Result: Provider IDs always match, regardless of the casing of the user input.
Clones: 1841478 (view as bug list)
Bug Blocks: 1841478
Last Closed: 2020-07-13 17:40:02 UTC
Type: Bug
Description (Marcel Härri, 2020-05-19 10:13:58 UTC)
Sounds like there might be a discrepancy between the providerID for nodes and machines. Does this cluster have this patch: https://github.com/openshift/kubernetes-autoscaler/pull/142?

Looks to me like the ProviderID that the autoscaler thinks is "LongUnregistered" matches the ProviderID on the Node, so it should realise the Node is registered. Looks to me like a genuine bug rather than a configuration issue from what we have so far. We will need to try to reproduce this to explore it further.

@Marcel Could you please confirm which specific version of 4.4 you were running when you saw this problem? (e.g. was it a GA release 4.4.z or a nightly build, and if so, which nightly build?)

@Alberto I don't think that patch is relevant here; it was my understanding that we introduced the normalized IDs in 4.5, so this patch shouldn't need backporting to 4.4.

Both clusters where I am seeing this are on 4.4.3 (one upgraded from 4.3, the other a fresh 4.4 install):

```
$ oc get clusterversion/version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-05-18T15:50:14Z"
  generation: 2
  name: version
  resourceVersion: "699186"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 7fcebc9e-ff98-4eaa-a4ef-d3cccef5b8e6
spec:
  channel: fast-4.4
  clusterID: b2c145ed-36e9-4317-abec-afbecdfef682
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2020-05-18T16:09:51Z"
    message: Done applying 4.4.3
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-19T08:54:57Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-05-18T16:09:51Z"
    message: Cluster version is 4.4.3
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-05-18T15:50:25Z"
    message: 'Unable to retrieve available updates: unexpected HTTP status: 503 Service Unavailable'
    reason: ResponseFailed
    status: "False"
    type: RetrievedUpdates
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
    version: 4.4.3
  history:
  - completionTime: "2020-05-18T16:09:51Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
    startedTime: "2020-05-18T15:50:25Z"
    state: Completed
    verified: false
    version: 4.4.3
  observedGeneration: 2
  versionHash: h110xMINmng=
```

Hi Marcel, I've been unable to reproduce this in my testing. Everything is working as expected on the 4.4.3 cluster that I built. Are there any more details you can provide? For example, the MachineAutoscaler and ClusterAutoscaler definitions? If you are able to, could you try reproducing again, but this time setting the cluster-autoscaler log level to 4 instead of the default 1? If we have the more detailed logs, we may be able to see what is happening a little better.
I am trying to increase the cluster-autoscaler log level; meanwhile, here are the definitions:

```
$ oc get clusterautoscaler/default -o yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: "2020-05-18T21:18:11Z"
  generation: 1
  name: default
  resourceVersion: "138473"
  selfLink: /apis/autoscaling.openshift.io/v1/clusterautoscalers/default
  uid: 2e3aaee9-1a9b-4bc1-8b9f-704d2f943683
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    enabled: true
    unneededTime: 60s
```

```
$ oc get machineautoscaler foo-worker-switzerlandnorth -o yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  creationTimestamp: "2020-05-18T21:18:11Z"
  finalizers:
  - machinetarget.autoscaling.openshift.io
  generation: 2
  name: foo-worker-switzerlandnorth
  namespace: openshift-machine-api
  resourceVersion: "565487"
  selfLink: /apis/autoscaling.openshift.io/v1beta1/namespaces/openshift-machine-api/machineautoscalers/foo-worker-switzerlandnorth
  uid: 7b7d9945-94bf-4975-ad4c-5a64b87a7f1a
spec:
  maxReplicas: 14
  minReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: foo-worker-switzerlandnorth
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: foog-worker-switzerlandnorth
```

```
$ oc get machineset foo-worker-switzerlandnorth -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/foo-worker-switzerlandnorth
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "14"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "3"
  creationTimestamp: "2020-05-18T15:51:06Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: foo
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: foo-worker-switzerlandnorth
  namespace: openshift-machine-api
  resourceVersion: "565488"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/foo-worker-switzerlandnorth
  uid: dfef2d98-72a8-4e5a-bfdc-5baa19380cef
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: foo
      machine.openshift.io/cluster-api-machineset: foo-worker-switzerlandnorth
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: foo
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: foo-worker-switzerlandnorth
    spec:
      metadata:
        creationTimestamp: null
        labels:
          role.bar.corp: worker
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/xxxx/providers/Microsoft.Compute/images/foo
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: switzerlandnorth
          managedIdentity: foo-identity
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: bbbb
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: xxxx
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: cccc
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_DS14-8_v2
          vnet: dddd
          zone: ""
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
```

Created attachment 1692255 [details]
cluster-autoscaler log with log level 4
So I overrode the cluster-autoscaler to be unmanaged, scaled it down to 0, and then edited the autoscaler deployment to change the log level from 1 to 4.
I then left it running until I got the longUnregistered warning again, though the nodes appear to have been "unregistered" since the beginning.
I didn't manage to get a cluster up today (they kept failing to launch), so I didn't get to retry reproducing, but I have done some code analysis and some analysis on the logs, which I'll post here; I'll come back to this tomorrow.

```
I0526 12:15:08.373545 1 static_autoscaler.go:194] Starting main loop
// GetNodeInfosForGroups
// Call NodeGroupForNode on each readyNode
I0526 12:15:08.373949 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.374539 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.374787 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
// Call NodeGroups
I0526 12:15:08.374861 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// UpdateClusterState -> clusterStateRegistry.UpdateNodes
// getTargetSizes -> NodeGroups
I0526 12:15:08.374887 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// getCloudProviderNodeInstances -> NodeGroups
I0526 12:15:08.374920 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// getCloudProviderNodeInstances -> cloudProviderNodeInstancesCache.GetCloudProviderNodeInstances
// GetCloudProviderNodeInstances -> NodeGroups
I0526 12:15:08.374956 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// GetCloudProviderNodeInstances -> GetCloudProviderNodeInstancesForNodeGroup (optional?) -> fetchCloudProviderNodeInstancesForNodeGroup
// fetchCloudProviderNodeInstancesForNodeGroup -> Nodes -> machineSetProviderIDs (returns a list of provider IDs from the Machines in the MachineSet)
// (Missing log for nodegroup has nodes)
// getNotRegisteredNodes (takes all nodes and matches them to the instance IDs returned by getCloudProviderNodeInstances, adds fake nodes if instance ID not found on a node)
// updateUnregisteredNodes
// ???
W0526 12:15:08.374979 1 clusterstate.go:437] Failed to find acceptable ranges for openshift-machine-api/foo-worker-switzerlandnorth
I0526 12:15:08.374994 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// updateReadinessStats -> nodeGroupForNode on nodes
I0526 12:15:08.375083 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375175 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375286 1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
// updateReadinessStats -> nodeGroupForNode on unregistered nodes
I0526 12:15:08.375347 1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375420 1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375476 1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
// updateAcceptableRanges -> NodeGroups
I0526 12:15:08.375541 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// updateScaleRequests -> areThereUpcomingNodesInNodeGroup -> getProvisionedAndTargetSizesForNodeGroup
// (Missing log for Failed to find acceptable ranges for)
// handleInstanceCreationErrors -> NodeGroups
I0526 12:15:08.375578 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// updateAcceptableRanges -> NodeGroups
I0526 12:15:08.375608 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// updateIncorrectNodeGroupSizes -> NodeGroups
I0526 12:15:08.375636 1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// GetUnregisteredNodes
// If unregisteredNodes > 0:
I0526 12:15:08.375718 1 static_autoscaler.go:269] 3 unregistered nodes present
```

I've noted the function calls that are creating the log lines we are seeing so that I can follow this. What it looks like is happening is that the logic of getNotRegisteredNodes is, for some reason, marking the instance IDs that come from the cloud provider as not present, even though we are seeing them on the Node specs. This means we end up with the three real nodes plus three "fake" nodes, because the instance IDs did not match. If I fail to reproduce it myself, I may do a build that includes some extra logging around the parts of the code that I think are buggy and send that across to you.
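For context, here is a minimal sketch of the suspect matching logic, under stated assumptions: the names `Instance` and `notRegistered` are illustrative, not the upstream clusterstate code. It shows how a byte-exact comparison between cloud-provider instance IDs and Node provider IDs produces phantom "unregistered" nodes; the mismatch that later turns out to matter in this bug is casing.

```go
package main

import "fmt"

// Instance stands in for a cloud-provider instance as the autoscaler
// sees it; only the provider ID matters for matching.
type Instance struct {
	ID string
}

// notRegistered returns every instance whose provider ID has no exact,
// case-sensitive match among the Nodes' provider IDs. Each of these is
// then treated as an unregistered ("fake") node.
func notRegistered(instances []Instance, nodeProviderIDs map[string]bool) []Instance {
	var unregistered []Instance
	for _, inst := range instances {
		if !nodeProviderIDs[inst.ID] { // exact string match, no normalisation
			unregistered = append(unregistered, inst)
		}
	}
	return unregistered
}

func main() {
	// Provider ID as recorded on the Node (resource group lowercased).
	nodes := map[string]bool{
		"azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m": true,
	}
	// Provider ID as derived from the Machine (mixed-case resource group).
	instances := []Instance{
		{ID: "azure:///subscriptions/xxxx/resourceGroups/aaAA/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m"},
	}
	// Prints the one instance: it counts as unregistered even though the
	// same VM is already a ready Node, because the IDs differ by a byte.
	fmt.Println(notRegistered(instances, nodes))
}
```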
I spent some more time looking into this today, unfortunately with no luck. I've created my own environment and matched all the settings (as far as I can tell) to what you have provided. The only difference is that I'm running mine in centralus instead of switzerlandnorth.

I've created a special build from the 4.4 release branch with some extra logging. Would you be comfortable running this build on the cluster and gathering some logs from it at log level 1? The image I've built is quay.io/jspeed/cluster-autoscaler:bz1837341-debug and the source is https://github.com/JoelSpeed/autoscaler/tree/bz1837341-debug (the last two commits are the additions I've made). I can't tell from what we have so far why this is not working, but it seems the issue is persistent for you in that region.

Created attachment 1692966 [details]
cluster-autoscaler-debug-image
The pod YAML after I edited the deployment, mainly as a reference.
Created attachment 1692969 [details]
Pod with debug image log
Thank you for the image. It helped: I think I've spotted the issue. It probably wasn't possible to spot before, since I anonymized customer-related data and may have hidden the difference in doing so; the issue was only really revealed by your additional debug statements.
The log I am attaching is still anonymized, but in a way that keeps the relevant differences.
Essentially, the autoscaler reports the following ready nodes:

```
Ready Nodes (9): map[foo-infra-switzerlandnorth-6hr4v:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-6hr4v foo-infra-switzerlandnorth-7czk8:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-7czk8 foo-infra-switzerlandnorth-7r8m2:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-7r8m2 foo-master-0:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-0 foo-master-1:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-1 foo-master-2:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-2 foo-worker-switzerlandnorth-bql4m:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m foo-worker-switzerlandnorth-kfrgp:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp foo-worker-switzerlandnorth-vmkqs:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs]
```

While the cloud provider reports the following NodeInstances:

```
I0528 08:11:21.240035 1 clusterstate.go:311] CloudProvider NodeInstances: map[openshift-machine-api/foo-worker-switzerlandnorth:[{Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m Status:<nil>} {Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs Status:<nil>} {Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp Status:<nil>}]]
```
Ready nodes:

```
azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m
```

Node instances:

```
azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m
```

The difference is in the resource group: rgr-beepapp001-chn-dev vs. rgr-beepApp001-chn-dev.
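To make the failure mode concrete, a minimal Go illustration: a byte-exact string comparison (effectively what the matching above does) treats these two names as different, even though Azure treats resource group names case-insensitively. strings.EqualFold is used here only to show that the names differ solely by case; the autoscaler itself does not do a case-insensitive match.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	fromNode := "rgr-beepapp001-chn-dev"    // resource group as lowercased in the Node's providerID
	fromMachine := "rgr-beepApp001-chn-dev" // resource group as configured in the MachineSet

	fmt.Println(fromNode == fromMachine)                  // false: byte-exact comparison never matches
	fmt.Println(strings.EqualFold(fromNode, fromMachine)) // true: the names differ only by case
}
```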
Now, in the MachineSet we have configured the following:

```
$ oc get machineset -o yaml vx01-worker-switzerlandnorth | grep beep
    resourceID: /resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/images/foo
  networkResourceGroup: rgr-beepInfra001-chn-dev
  resourceGroup: rgr-beepApp001-chn-dev
  subnet: snet-vnet-beepInfra001-chn-dev-private-001-default-2
  vnet: vnet-beepInfra001-chn-dev-private-001
```
And on the Azure side it really is spelled with a capital A as well:

```
$ az group show --subscription=xxxx --name=rgr-beepApp001-chn-dev -o json
{
  "id": "/subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev",
  "location": "switzerlandnorth",
  "managedBy": null,
  "name": "rgr-beepApp001-chn-dev",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": {
    "Description": "Beep"
  },
  "type": "Microsoft.Resources/resourceGroups"
}
```
While you can query with a downcased A (e.g. az group show accepts it), the name is still returned with the capital A.
I verified that all cloud credentials and MachineSets use the capital A. It looks like one component downcases the resource group when pulling it in while another does not, and then the machine IDs no longer match.
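A minimal sketch of the kind of normalisation implied here, assuming a hypothetical buildProviderID helper (illustrative only; the actual change belongs in the Azure Cluster API provider where the provider ID is set on the Machine): lowercasing the resource group mirrors what the Kubernetes Azure cloud provider does before writing spec.providerID on the Node, so the two sides agree regardless of user casing.

```go
package main

import (
	"fmt"
	"strings"
)

// buildProviderID assembles an azure:/// provider ID in the format seen
// in the logs above. Only the resource group segment is lowercased,
// matching the behaviour of the legacy Azure cloud provider.
// Hypothetical helper for illustration, not the real provider code.
func buildProviderID(subscription, resourceGroup, vmName string) string {
	return fmt.Sprintf(
		"azure:///subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s",
		subscription, strings.ToLower(resourceGroup), vmName,
	)
}

func main() {
	// Mixed-case user input still yields an ID that matches the Node's.
	fmt.Println(buildProviderID("xxxx", "rgr-beepApp001-chn-dev", "foo-worker-switzerlandnorth-bql4m"))
	// azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m
}
```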
Based on this latest comment, I have been able to reproduce the scenario by creating an IPI Azure cluster (the name has to be lower case, and the resource group is created from it), then modifying the MachineSet to set the resourceGroup to a capitalised version (even though Azure thinks it's lowercase, it doesn't care), then scaling up some Machines in that MachineSet.

Since the Kubernetes cloud provider takes the resource group and lowercases it before setting it on the node (https://github.com/kubernetes/legacy-cloud-providers/blob/2c79b47a93724ada71ae2d311d3cd9dcdab3c415/azure/azure_instances.go#L375-L377), I would suggest we should be doing the same in the Azure Cluster API provider where we set the provider ID on the Machine. That said, the alternative is to add validation that requires the resource group to be lower-cased, but I think that has the potential to break a lot of current UPI installations.

Indeed, I would expect some of the IPI installations with a pre-defined vnet and so on to also have issues if we suddenly enforced everything to be lower-cased.

I've raised a fix which just lowercases the strings on our end; the value is purely informational at that point anyway and is only really used to match Nodes to Machines, so it is under our control. Gonna try and get this into 4.5 and backported to 4.4, as the results could be pretty bad if your node group wasn't at minimum capacity, i.e. it would scale down any nodes over the minimum every 15 minutes. As a workaround, if you haven't already, you can change the resourceGroup on the providerSpec and new Machines will have their provider IDs set correctly.

Verified. clusterversion: 4.5.0-0.nightly-2020-06-08-204500

Created an IPI Azure cluster, modified the MachineSet to set the resourceGroup to a capitalised version, then scaled up some Machines in that MachineSet; the machines could join the cluster.

```
$ oc edit machineset zhsun69azure-nknvm-worker-westus-1
      publicIP: false
      resourceGroup: zhsun69azure-nknvm-RG

$ oc get node
NAME                                       STATUS   ROLES    AGE     VERSION
zhsun69azure-nknvm-master-0                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-master-1                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-master-2                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-1-hckk9   Ready    worker   13m     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-1-j2cb4   Ready    worker   3m3s    v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-2ch92     Ready    worker   17h     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-94lch     Ready    worker   5m41s   v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-bb5xc     Ready    worker   17h     v1.18.3+a637491
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409