Bug 1837341 - Nodes in Azure are getting into longUnregistered state
Summary: Nodes in Azure are getting into longUnregistered state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.5.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1841478
TreeView+ depends on / blocked
 
Reported: 2020-05-19 10:13 UTC by Marcel Härri
Modified: 2020-07-13 17:40 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Autoscaler expects provider IDs across Node and Machine objects to be an exact match. This may not be the case if the Machine configuration specifies a resource group name with mixed casing. Consequence: The autoscaler believes the Machine never gets a node and terminates the machine after a 15 minute period Fix: Sanitise the user input and convert the resource group to lowercase Result: Provider IDs will always match regardless of casing on user input
Clone Of:
: 1841478 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:40:02 UTC
Target Upstream Version:


Attachments (Terms of Use)
cluster-autoscaler log with log level 4 (943.12 KB, text/plain)
2020-05-26 12:41 UTC, Marcel Härri
no flags Details
cluster-autoscaler-debug-image (3.97 KB, text/plain)
2020-05-28 08:18 UTC, Marcel Härri
no flags Details
Pod with debug image log (1.71 MB, text/plain)
2020-05-28 08:34 UTC, Marcel Härri
no flags Details


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-api-provider-azure pull 134 None closed Bug 1837341: Use ToLower for Subscription ID and Resource Group Name 2020-09-13 13:37:39 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:40:18 UTC

Description Marcel Härri 2020-05-19 10:13:58 UTC
Worker nodes in Azure are getting into longUnregistered state and ClusterAutoscalerNodesNotReady warning is fired.

Deployed 2 4.4 clusters into Azure, added cluster autoscaler as a post-install task. Soon afterwards ClusterAutoscalerNodesNotReady is fired as the nodes are getting into longUnregistered state:

$ oc describe cm cluster-autoscaler-status -n openshift-machine-api
[...]
NodeGroups:
  Name:        openshift-machine-api/foo-worker-switzerlandnorth
  Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=2 cloudProviderTarget=2 (minSize=2, maxSize=6))
               LastProbeTime:      2020-05-13 12:54:34.202041814 +0000 UTC m=+2443.180316364
               LastTransitionTime: 2020-05-13 12:14:19.733945805 +0000 UTC m=+28.716322737

Logs of the autoscaler also shows these nodes to be not considered, as they are not registered:


I0513 12:59:04.639391       1 static_autoscaler.go:269] 2 unregistered nodes present
I0513 12:59:04.639458       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers
/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-dhpr6
W0513 12:59:04.639485       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Micr
osoft.Compute/virtualMachines/foo-worker-switzerlandnorth-dhpr6: node group min size reached, skipping unregistered node removal
I0513 12:59:04.639494       1 static_autoscaler.go:528] Removing unregistered node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers
/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-djh62
W0513 12:59:04.639514       1 static_autoscaler.go:544] Failed to remove node azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Micr
osoft.Compute/virtualMachines/foo-worker-switzerlandnorth-djh62: node group min size reached, skipping unregistered node removal
I0513 12:59:04.640325       1 static_autoscaler.go:343] No unschedulable pods
I0513 12:59:04.640486       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-djh62 - node group min size reached
I0513 12:59:04.640722       1 pre_filtering_processor.go:66] Skipping foo-worker-switzerlandnorth-dhpr6 - node group min size reached
I0513 12:59:04.641502       1 scale_down.go:776] No candidates for scale down

Looking at these nodes I don't see anything particular standing out. Also I already had them deleted and have the machine controller creating new ones. Then for a while the nodes are fine, but after a while I have the warning back.

]$ oc describe node foo-worker-switzerlandnorth-djh62                                                                    [73/1815]
Name:               foo-worker-switzerlandnorth-djh62
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_DS14-8_v2
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=SwitzerlandNorth
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=foo-worker-switzerlandnorth-djh62
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=Standard_DS14-8_v2
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=SwitzerlandNorth
                    topology.kubernetes.io/zone=0
Annotations:        machine.openshift.io/machine: openshift-machine-api/foo-worker-switzerlandnorth-djh62
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-0dc3240777a9cf2acdf56bb5aaec3a53
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-0dc3240777a9cf2acdf56bb5aaec3a53
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 May 2020 13:54:27 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  foo-worker-switzerlandnorth-djh62
  AcquireTime:     <unset>
  RenewTime:       Tue, 19 May 2020 10:05:05 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 19 May 2020 10:01:45 +0000   Tue, 12 May 2020 13:54:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory ava
ilable
  DiskPressure     False   Tue, 19 May 2020 10:01:45 +0000   Tue, 12 May 2020 13:54:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 19 May 2020 10:01:45 +0000   Tue, 12 May 2020 13:54:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID availa
ble
  Ready            True    Tue, 19 May 2020 10:01:45 +0000   Tue, 12 May 2020 13:55:07 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:    foo-worker-switzerlandnorth-djh62
  InternalIP:  172.21.177.12
Capacity:
  attachable-volumes-azure-disk:  64
  cpu:                            8
  ephemeral-storage:              133665772Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         115488388Ki
  pods:                           250
Allocatable:
  attachable-volumes-azure-disk:  64
  cpu:                            7500m
  ephemeral-storage:              122112633448
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         114337412Ki
  pods:                           250
System Info:
  Machine ID:                              85a580f52d7d474cbff137025399d33c
  System UUID:                             41067a87-10e4-3b46-bb9d-c0d753cfa777
  Boot ID:                                 7b5d7f42-5eae-4a66-a480-0221cdf908a1
  Kernel Version:                          4.18.0-147.8.1.el8_1.x86_64
  OS Image:                                Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)
  Operating System:                        linux
  Architecture:                            amd64
  Container Runtime Version:               cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
  Kubelet Version:                         v1.17.1
  Kube-Proxy Version:                      v1.17.1
ProviderID:                                azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-djh62
Non-terminated Pods:                       (17 in total)
[...]
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            847m (11%)   0 (0%)
  memory                         2794Mi (2%)  1248Mi (1%)
  ephemeral-storage              0 (0%)       0 (0%)
  attachable-volumes-azure-disk  0            0
Events:                          <none>

The machine also looks fine:

$ oc describe machine foo-worker-switzerlandnorth-djh62 -n openshift-machine-api                                        [56/1915]
Name:         foo-worker-switzerlandnorth-djh62
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=foo
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=foo-worker-switzerlandnorth
              machine.openshift.io/instance-type=Standard_DS14-8_v2
              machine.openshift.io/region=switzerlandnorth
Annotations:  machine.openshift.io/instance-state: Running
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-05-12T13:50:45Z
  Finalizers:
    machine.machine.openshift.io
  Generate Name:  foo-worker-switzerlandnorth-
  Generation:     2
  Owner References:
    API Version:           machine.openshift.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  foo-worker-switzerlandnorth
    UID:                   e44f9f67-1163-4c5c-82b0-45a6756e8852
  Resource Version:        11588571
  Self Link:               /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/foo-worker-switzerlandnorth-djh62
  UID:                     f7328d3e-576e-48c0-8be2-48f11321bd09
Spec:
  Metadata:
    Creation Timestamp:  <nil>
    Labels:
      role.sdx.corp:  worker
  Provider ID:        azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-djh62
  Provider Spec:
    Value:
      API Version:  azureproviderconfig.openshift.io/v1beta1
      Credentials Secret:
        Name:       azure-cloud-credentials
        Namespace:  openshift-machine-api
      Image:
        Offer:                 
        Publisher:             
        Resource ID:           /resourceGroups/aaaa/providers/Microsoft.Compute/images/foo
        Sku:                                                                                                                                        
        Version:               
      Internal Load Balancer:  
      Kind:                    AzureMachineProviderSpec
      Location:                switzerlandnorth
      Managed Identity:        foo-identity
      Metadata:
        Creation Timestamp:    <nil>
      Nat Rule:                <nil>
      Network Resource Group:  bbbb
      Os Disk:
        Disk Size GB:  128
        Managed Disk:
          Storage Account Type:  Premium_LRS
        Os Type:                 Linux
      Public IP:                 false
      Public Load Balancer:      
      Resource Group:            aaaa
      Ssh Private Key:           
      Ssh Public Key:            
      Subnet:                    cccc
      User Data Secret:
        Name:   worker-user-data
      Vm Size:  Standard_DS14-8_v2
      Vnet:     dddd
      Zone:     
Status:
  Addresses:
    Address:     foo-worker-switzerlandnorth-djh62
    Type:        Hostname
    Address:     foo-worker-switzerlandnorth-djh62
    Type:        InternalDNS
    Address:     foo-worker-switzerlandnorth-djh62.zzzz.zrhx.internal.cloudapp.net
    Type:        InternalDNS
    Address:     172.21.177.12
    Type:        InternalIP
  Last Updated:  2020-05-14T14:47:28Z
  Node Ref:
    Kind:  Node
    Name:  foo-worker-switzerlandnorth-djh62
    UID:   14d870ef-9a5c-4f4b-8b1b-28099224e25a
  Phase:   Running
  Provider Status:
    Conditions:
      Last Probe Time:       2020-05-12T13:52:47Z
      Last Transition Time:  2020-05-12T13:50:56Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Metadata:
    Vm Id:     /subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-djh62
    Vm State:  Running
Events:        <none>

Comment 1 Alberto 2020-05-19 10:20:34 UTC
sounds like there might be a discrepancy between the providerID for nodes and machines.
Does this cluster have this patch https://github.com/openshift/kubernetes-autoscaler/pull/142?

Comment 2 Joel Speed 2020-05-19 10:36:25 UTC
Looks to me like the ProviderID that the autoscaler thinks is "LongUnregistered" is matching the ProviderID on the Node, so it should realise the Node is registered. Looks to me like a genuine bug rather than a configuration issue from what we have so far.

We will need to try and reproduce this to explore it further.

@Marcel Could you please confirm which specific version of 4.4 you were running when you saw this problem? (eg was it a GA release 4.4.z or a nightly build, if so, which nightly build)

@Alberto I don't think that patch is relevant here, it was my understanding that we introduced the normalized IDs in 4.5, so this patch shouldn't need backporting to 4.4

Comment 3 Marcel Härri 2020-05-19 11:05:19 UTC
Both clusters where I am seeing this are on 4.4.3 (one upgraded from 4.3, the other a fresh 4.4 install):

$ oc get clusterversion/version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-05-18T15:50:14Z"
  generation: 2
  name: version
  resourceVersion: "699186"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 7fcebc9e-ff98-4eaa-a4ef-d3cccef5b8e6
spec:
  channel: fast-4.4
  clusterID: b2c145ed-36e9-4317-abec-afbecdfef682
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2020-05-18T16:09:51Z"
    message: Done applying 4.4.3
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-19T08:54:57Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-05-18T16:09:51Z"
    message: Cluster version is 4.4.3
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-05-18T15:50:25Z"
    message: 'Unable to retrieve available updates: unexpected HTTP status: 503 Service
      Unavailable'
    reason: ResponseFailed
    status: "False"
    type: RetrievedUpdates
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
    version: 4.4.3
  history:
  - completionTime: "2020-05-18T16:09:51Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc
    startedTime: "2020-05-18T15:50:25Z"
    state: Completed
    verified: false
    version: 4.4.3
  observedGeneration: 2
  versionHash: h110xMINmng=

Comment 4 Joel Speed 2020-05-21 14:12:29 UTC
Hi Marcel,

I've been unable to reproduce this in my testing. Everything is working as expected on the 4.4.3 cluster that I built.

Are there any more details you can provide? For example the MachineAutoscaler and ClusterAutoscaler definitions?

If you are able to, could you try reproducing again, but this time setting the cluster-autoscaler log level to 4 instead of the default 1.
If we have the more detailed logs we may be able to see what is happening a little better.

Comment 5 Marcel Härri 2020-05-26 08:59:36 UTC
I am trying to increase the cluster-autoscaler loglevel, meanwhile here are the definitions:

$ oc get clusterautoscaler/default -o yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: "2020-05-18T21:18:11Z"
  generation: 1
  name: default
  resourceVersion: "138473"
  selfLink: /apis/autoscaling.openshift.io/v1/clusterautoscalers/default
  uid: 2e3aaee9-1a9b-4bc1-8b9f-704d2f943683
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    enabled: true
    unneededTime: 60s


$ oc get machineautoscaler foo-worker-switzerlandnorth -o yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  creationTimestamp: "2020-05-18T21:18:11Z"
  finalizers:
  - machinetarget.autoscaling.openshift.io
  generation: 2
  name: foo-worker-switzerlandnorth
  namespace: openshift-machine-api
  resourceVersion: "565487"
  selfLink: /apis/autoscaling.openshift.io/v1beta1/namespaces/openshift-machine-api/machineautoscalers/foo-worker-switzerlandnorth
  uid: 7b7d9945-94bf-4975-ad4c-5a64b87a7f1a
spec:
  maxReplicas: 14
  minReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: foo-worker-switzerlandnorth
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: foog-worker-switzerlandnorth


$ oc get machineset foo-worker-switzerlandnorth -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/foo-worker-switzerlandnorth
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "14"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "3"
  creationTimestamp: "2020-05-18T15:51:06Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: foo
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: foo-worker-switzerlandnorth
  namespace: openshift-machine-api
  resourceVersion: "565488"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/foo-worker-switzerlandnorth
  uid: dfef2d98-72a8-4e5a-bfdc-5baa19380cef
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: foo
      machine.openshift.io/cluster-api-machineset: foo-worker-switzerlandnorth
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: foo
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: foo-worker-switzerlandnorth
    spec:
      metadata:
        creationTimestamp: null
        labels:
          role.bar.corp: worker
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/xxxx/providers/Microsoft.Compute/images/foo
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: switzerlandnorth
          managedIdentity: foo-identity
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: bbbb
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: xxxx
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: cccc
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_DS14-8_v2
          vnet: dddd
          zone: ""
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3

Comment 6 Marcel Härri 2020-05-26 12:41:57 UTC
Created attachment 1692255 [details]
cluster-autoscaler log with log level 4

So I overrode the cluster-autoscaler to be unamanged, scaled it down to 0 and then edited the deployment of the autoscaler and changed logging from 1 to 4.

I then let it running until I got again longUnregistered warning. Though the nodes seem to be unregistered since the beginning.

Comment 7 Joel Speed 2020-05-26 16:57:42 UTC
I didn't manage to get a cluster up today (they kept failing to launch) so didn't get to retry reproducing, but I have done some code analysis and some analysis on the logs which I'll post here, and then I'll come back to this tomorrow

```
I0526 12:15:08.373545       1 static_autoscaler.go:194] Starting main loop

// GetNodeInfosForGroups
  // Call NodeGroupForNode on each readyNode
I0526 12:15:08.373949       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.374539       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.374787       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
  // Call NodeGroups
I0526 12:15:08.374861       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)

// UpdateClusterState -> clusterStateRegistry.UpdateNodes
  // getTargetSizes -> NodeGroups
I0526 12:15:08.374887       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // getCloudProviderNodeInstances -> NodeGroups
I0526 12:15:08.374920       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // getCloudProviderNodeInstances -> cloudProviderNodeInstancesCache.GetCloudProviderNodeInstances
    // GetCloudProviderNodeInstances -> NodeGroups
I0526 12:15:08.374956       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
    // GetCloudProviderNodeInstances -> GetCloudProviderNodeInstancesForNodeGroup (optional?) -> fetchCloudProviderNodeInstancesForNodeGroup
      // fetchCloudProviderNodeInstancesForNodeGroup -> Nodes -> machineSetProviderIDs (returns a list of provider IDs from the Machines in the MachineSet) (Missing log for nodegroup has nodes)
  // getNotRegisteredNodes (takes all nodes and matches them to the instance IDs returned by getCloudProviderNodeInstances, adds fake nodes if instance ID not found on a node)
  // updateUnregisteredNodes
  // ???
W0526 12:15:08.374979       1 clusterstate.go:437] Failed to find acceptable ranges for openshift-machine-api/foo-worker-switzerlandnorth
I0526 12:15:08.374994       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // updateReadinessStats -> nodeGroupForNode on nodes
I0526 12:15:08.375083       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375175       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375286       1 machineapi_controller.go:468] node "foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
  // updateReadinessStats -> nodeGroupForNode on unregistered nodes
I0526 12:15:08.375347       1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375420       1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp" is in nodegroup "foo-worker-switzerlandnorth"
I0526 12:15:08.375476       1 machineapi_controller.go:468] node "azure:///subscriptions/xxxx/resourceGroups/aaaa/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m" is in nodegroup "foo-worker-switzerlandnorth"
  // updateAcceptableRanges -> NodeGroups
I0526 12:15:08.375541       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // updateScaleRequests -> areThereUpcomingNodesInNodeGroup -> getProvisionedAndTargetSizesForNodeGroup (Missing log for Failed to find acceptable ranges for)
  // handleInstanceCreationErrors -> NodeGroups
I0526 12:15:08.375578       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // updateAcceptableRanges -> NodeGroups
I0526 12:15:08.375608       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
  // updateIncorrectNodeGroupSizes -> NodeGroups
I0526 12:15:08.375636       1 machineapi_provider.go:62] discovered node group: openshift-machine-api/foo-worker-switzerlandnorth (min: 3, max: 14, replicas: 3)
// GetUnregisteredNodes
// If unregisterednodes > 0:
I0526 12:15:08.375718       1 static_autoscaler.go:269] 3 unregistered nodes present
```

I've noted the function calls that are creating the log lines we are seeing so that I can follow this. What it looks to me like is happening, is that for some reason the logic of getNotRegisteredNodes is, for some reason, marking the instance IDs that come from the cloud provider as not there, even though we are seeing them on the Node specs. This means we end up with the three real nodes and three "fake" nodes because the instance IDs did not match. If I fail to reproduce myself, I may do a build that includes some extra logging around the parts of the code that I think are buggy and send that across to you.

Comment 8 Joel Speed 2020-05-27 16:17:52 UTC
I spent some more time looking into this today, unfortunately with no luck. I've created my own environment and matched all the settings (as far as I can tell) to what you have provided. The only difference I have is that I'm running mine in centralus instead of switzerlandnorth

I've create a special build from the 4.4 release branch, with some extra logging. Would you be comfortable running this build on the cluster and gather some logs from it at log level 1?

The image I've built is quay.io/jspeed/cluster-autoscaler:bz1837341-debug and the source is https://github.com/JoelSpeed/autoscaler/tree/bz1837341-debug (The last two commits are the additions I've made)

I can't tell from what we have so far why this is not working, but it seems the issue is persistent for you in that region

Comment 9 Marcel Härri 2020-05-28 08:18:07 UTC
Created attachment 1692966 [details]
cluster-autoscaler-debug-image

The pod yaml after I edited the deployment. Mainly as a reference.

Comment 10 Marcel Härri 2020-05-28 08:34:54 UTC
Created attachment 1692969 [details]
Pod with debug image log

Thank you for the image. And it helped, as I think I spotted the issue. It probably wasn't really possible to spot it before, as I anonymized customer related data and might have hidden it with that. Though I think the issue was only really revealed with your additional debug statements.

The log I am adding is still anonymized, but in a way it keeps the relevant differences.

Essentially, the autoscaler reports the following ready nodes:

Ready Nodes (9): map[foo-infra-switzerlandnorth-6hr4v:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-6hr4v foo-infra-switzerlandnorth-7czk8:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-7czk8 foo-infra-switzerlandnorth-7r8m2:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-infra-switzerlandnorth-7r8m2 foo-master-0:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-0 foo-master-1:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-1 foo-master-2:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-master-2 foo-worker-switzerlandnorth-bql4m:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m foo-worker-switzerlandnorth-kfrgp:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp foo-worker-switzerlandnorth-vmkqs:azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs]

While the cloud provider repors the following NodeInstances:

I0528 08:11:21.240035       1 clusterstate.go:311] CloudProvider NodeInstances: map[openshift-machine-api/foo-worker-switzerlandnorth:[{Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m Status:<nil>} {Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-vmkqs Status:<nil>} {Id:azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-kfrgp Status:<nil>}]]


Ready nodes:

azure:///subscriptions/xxxx/resourceGroups/rgr-beepapp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m 

Node Instances:

azure:///subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/virtualMachines/foo-worker-switzerlandnorth-bql4m

The difference is in the resource group:

rgr-beepapp001-chn-dev

vs.


rgr-beepApp001-chn-dev

No in the machineset we have configured the following:

$ oc get machineset -o yaml vx01-worker-switzerlandnorth | grep beep
            resourceID: /resourceGroups/rgr-beepApp001-chn-dev/providers/Microsoft.Compute/images/foo
          networkResourceGroup: rgr-beepInfra001-chn-dev
          resourceGroup: rgr-beepApp001-chn-dev
          subnet: snet-vnet-beepInfra001-chn-dev-private-001-default-2
          vnet: vnet-beepInfra001-chn-dev-private-001


And on Azure side it is really also with a capital A:

$ az group show --subscription=xxxx --name=rgr-beepApp001-chn-dev -o json
{
  "id": "/subscriptions/xxxx/resourceGroups/rgr-beepApp001-chn-dev",
  "location": "switzerlandnorth",
  "managedBy": null,
  "name": "rgr-beepApp001-chn-dev",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": {
    "Description": "Beep",
  },
  "type": "Microsoft.Resources/resourceGroups"
}

While you can query things with a downcased A (e.g. az group show) it will still be returned with a capital A.

I verified that in all cloud credentials and machinesets we have the capital A. It looks like some component is downcasing the resourcegroup when pulling it in, but another not and then the machine IDs are not matching.

Comment 11 Joel Speed 2020-05-28 13:21:22 UTC
Based on this latest comment, I have been able to reproduce the scenario by creating an IPI Azure cluster (name has to be lower case, resource group created from that), then modifying the MachineSet to set the resourceGroup to a capitalised version (even though Azure thinks it's lowercase, it doesn't care), then scaling up some Machines in that MachineSet.

Since the Kubernetes cloud provider takes the resource group and lowercases it before setting it on the node (https://github.com/kubernetes/legacy-cloud-providers/blob/2c79b47a93724ada71ae2d311d3cd9dcdab3c415/azure/azure_instances.go#L375-L377),
I would suggest we should be doing the same in the Azure Cluster API provider where we set the Provider ID on the Machine.

That said, the alternative is to add validation that requires the resource group to be lower cased, but I think this has the possibility to break a lot of current UPI installations.

Comment 12 Marcel Härri 2020-05-28 13:39:20 UTC
Indeed, I would expect also some of the IPI installations with pre-defined vnet and so on to have issues with that if we would suddenly enforce things to be all lower cased.

Comment 13 Joel Speed 2020-05-28 13:43:02 UTC
I've raised a fix which just lowercases the strings on our end, it's purely informational at that point anyway and is only really used to match Nodes to Machines, so is under our control. Gonna try and get this into 4.5 and backported to 4.4 as the results could be pretty bad if your nodegroup wasn't at minimum capacity. Ie It would scale down any nodes over the minimum every 15 minutes.

As a workaround, if you haven't already, you can change the resourceGroup on the providerSpec and new machines will have their providerIDs set correctly

Comment 16 sunzhaohua 2020-06-10 01:29:55 UTC
Verified
clusterversion: 4.5.0-0.nightly-2020-06-08-204500

creating an IPI Azure cluster, then modifying the MachineSet to set the resourceGroup to a capitalised version, then scaling up some Machines in that MachineSet, machines could join the cluster.

$ oc edit machineset zhsun69azure-nknvm-worker-westus-1
          publicIP: false
          resourceGroup: zhsun69azure-nknvm-RG

$ oc get node
NAME                                       STATUS   ROLES    AGE     VERSION
zhsun69azure-nknvm-master-0                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-master-1                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-master-2                Ready    master   17h     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-1-hckk9   Ready    worker   13m     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-1-j2cb4   Ready    worker   3m3s    v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-2ch92     Ready    worker   17h     v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-94lch     Ready    worker   5m41s   v1.18.3+a637491
zhsun69azure-nknvm-worker-westus-bb5xc     Ready    worker   17h     v1.18.3+a637491

Comment 17 errata-xmlrpc 2020-07-13 17:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.