Bug 1817853
| Summary: | [RHV] New machine stuck at 'Provisioned' phase |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Cloud Compute |
| Cloud Compute sub component: | oVirt Provider |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | high |
| Version: | 4.4 |
| Target Release: | 4.6.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Keywords: | Reopened |
| Whiteboard: | ovirt |
| URL: | https://issues.redhat.com/browse/OCPRHV-240 |
| Reporter: | Jianwei Hou <jhou> |
| Assignee: | Evgeny Slutsky <eslutsky> |
| QA Contact: | Lucie Leistnerova <lleistne> |
| CC: | dvercill, eslutsky, fhirtz, gdeolive, gzaidman, hpopal, jzmeskal, lleistne, lpioro, michal.skrivanek, mleonard, mnunes, openshift-bugs-escalate, pelauter, rcunha, rgolan, wkulhane, zhsun |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, an attempt to perform manual scale-up (from GUI or with CLI) failed with an error. With this update, the scale-up works. |
| Clone Of: | 1815435 |
| Bug Depends On: | 1815435 |
| Bug Blocks: | 1854787 |
| Last Closed: | 2020-11-02 13:01:05 UTC |
Description
Jianwei Hou
2020-03-27 06:23:04 UTC
Test on 4.4.0-0.nightly-2020-03-26-041820, this problem doesn't exist.

1. On an IPI on RHV deployment, create a machineset with replicas=1
2. Machine phase becomes 'Running', node becomes 'Ready'.
3. oc get csr, csrs were automatically approved

$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
leopard-hw72r-worker-0 3 3 3 3 4d13h
leopard-hw72r-worker-00 1 1 1 1 64m

$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
leopard-hw72r-master-0 Running 4d13h
leopard-hw72r-master-1 Running 4d13h
leopard-hw72r-master-2 Running 4d13h
leopard-hw72r-worker-0-9snp8 Running 4d13h
leopard-hw72r-worker-0-dhflh Running 4d13h
leopard-hw72r-worker-0-wr85v Running 4d13h
leopard-hw72r-worker-00-nqqzv Running 64m

$ oc get node
NAME STATUS ROLES AGE VERSION
leopard-hw72r-master-0 Ready master 4d13h v1.17.1
leopard-hw72r-master-1 Ready master 4d13h v1.17.1
leopard-hw72r-master-2 Ready master 4d13h v1.17.1
leopard-hw72r-worker-0-9snp8 Ready worker 4d13h v1.17.1
leopard-hw72r-worker-0-dhflh Ready worker 4d13h v1.17.1
leopard-hw72r-worker-0-wr85v Ready worker 4d13h v1.17.1
leopard-hw72r-worker-00-nqqzv Ready worker 60m v1.17.1

$ oc get csr
NAME AGE REQUESTOR CONDITION
csr-2b2b7 60m system:node:leopard-hw72r-worker-00-nqqzv Approved,Issued
csr-ghrgb 60m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued

Reproduced this bug
clusterversion: 4.4.0-0.nightly-2020-04-04-025830
Created a new machineset with replicas=3; 1 machine joined the cluster and its node status became Ready. 2 machines are stuck at Provisioned status. "oc describe machine" shows "instanceState: down" for those machines.
# oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
secondary-xgbq9-worker-0 3 3 3 3 18h
secondary-xgbq9-worker-00 3 3 1 1 26m
# oc get machine
NAME PHASE TYPE REGION ZONE AGE
secondary-xgbq9-master-0 Running 18h
secondary-xgbq9-master-1 Running 18h
secondary-xgbq9-master-2 Running 18h
secondary-xgbq9-worker-0-hw22t Running 18h
secondary-xgbq9-worker-0-lcrzj Running 18h
secondary-xgbq9-worker-0-spwkj Running 18h
secondary-xgbq9-worker-00-8x7hq Running 26m
secondary-xgbq9-worker-00-g89d6 Provisioned 26m
secondary-xgbq9-worker-00-q8nqs Provisioned 26m
# oc get node
NAME STATUS ROLES AGE VERSION
secondary-xgbq9-master-0 Ready master 18h v1.17.1
secondary-xgbq9-master-1 Ready master 18h v1.17.1
secondary-xgbq9-master-2 Ready master 18h v1.17.1
secondary-xgbq9-worker-0-hw22t Ready worker 18h v1.17.1
secondary-xgbq9-worker-0-lcrzj Ready worker 18h v1.17.1
secondary-xgbq9-worker-0-spwkj Ready worker 18h v1.17.1
secondary-xgbq9-worker-00-8x7hq Ready worker 24m v1.17.1
# oc get machine secondary-xgbq9-worker-00-g89d6 -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
annotations:
instance-status: |
{"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"secondary-xgbq9-worker-00-g89d6","generateName":"secondary-xgbq9-worker-00-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/secondary-xgbq9-worker-00-g89d6","uid":"fc95124b-6cc1-4716-a234-f025b7daafe7","resourceVersion":"1560225","generation":1,"creationTimestamp":"2020-04-10T09:10:54Z","labels":{"machine.openshift.io/cluster-api-cluster":"secondary-xgbq9","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"secondary-xgbq9-worker-00"},"annotations":{"VmId":"dd35df17-8766-47c3-b4c3-94579ea8d4c4","instance-status":""},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"secondary-xgbq9-worker-00","uid":"ad0d6173-9b12-4c4f-8df7-c0501d365396","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"ovirtproviderconfig.openshift.io/v1beta1","cluster_id":"81602ae2-79a1-11ea-a904-00163e563201","credentialsSecret":{"name":"ovirt-credentials"},"id":"","kind":"OvirtMachineProviderSpec","metadata":{"creationTimestamp":null},"name":"","template_name":"rhcos_custom","userDataSecret":{"name":"worker-user-data"}}},"providerID":"dd35df17-8766-47c3-b4c3-94579ea8d4c4"},"status":{"lastUpdated":"2020-04-10T09:36:33Z","providerStatus":{"metadata":{"creationTimestamp":null},"instanceId":"secondary-xgbq9-worker-00-g89d6","instanceState":"down","conditions":[{"type":"MachineCreated","status":"True","lastProbeTime":"2020-04-10T09:36:33Z","lastTransitionTime":"2020-04-10T09:36:33Z","reason":"machineCreationSucceedReason","message":"machineCreationSucceedMessage"}]},"addresses":[{"type":"InternalDNS","address":"secondary-xgbq9-worker-00-g89d6"}],"phase":"Provisioned"}}
creationTimestamp: "2020-04-10T09:10:54Z"
finalizers:
- machine.machine.openshift.io
generateName: secondary-xgbq9-worker-00-
generation: 1
labels:
machine.openshift.io/cluster-api-cluster: secondary-xgbq9
machine.openshift.io/cluster-api-machine-role: worker
machine.openshift.io/cluster-api-machine-type: worker
machine.openshift.io/cluster-api-machineset: secondary-xgbq9-worker-00
name: secondary-xgbq9-worker-00-g89d6
namespace: openshift-machine-api
ownerReferences:
- apiVersion: machine.openshift.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: MachineSet
name: secondary-xgbq9-worker-00
uid: ad0d6173-9b12-4c4f-8df7-c0501d365396
resourceVersion: "1560252"
selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/secondary-xgbq9-worker-00-g89d6
uid: fc95124b-6cc1-4716-a234-f025b7daafe7
spec:
metadata:
creationTimestamp: null
providerSpec:
value:
apiVersion: ovirtproviderconfig.openshift.io/v1beta1
cluster_id: 81602ae2-79a1-11ea-a904-00163e563201
credentialsSecret:
name: ovirt-credentials
id: ""
kind: OvirtMachineProviderSpec
metadata:
creationTimestamp: null
name: ""
template_name: rhcos_custom
userDataSecret:
name: worker-user-data
status:
addresses:
- address: secondary-xgbq9-worker-00-g89d6
type: InternalDNS
lastUpdated: "2020-04-10T09:36:33Z"
phase: Provisioned
providerStatus:
conditions:
- lastProbeTime: "2020-04-10T09:36:33Z"
lastTransitionTime: "2020-04-10T09:36:33Z"
message: machineCreationSucceedMessage
reason: machineCreationSucceedReason
status: "True"
type: MachineCreated
instanceId: secondary-xgbq9-worker-00-g89d6
instanceState: down
metadata:
creationTimestamp: null
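For anyone triaging a similar case, the per-machine provider state can be pulled out without dumping the full YAML. This is only a sketch; the field paths are taken from the output above and may differ on other provider/API versions:

```
# List every Machine with its phase and the oVirt instance state reported by the provider
oc -n openshift-machine-api get machines \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,INSTANCE_STATE:.status.providerStatus.instanceState
```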
Created attachment 1677766 [details]
machine-controller.log
This is not a blocker for OCP 4.4.0 GA, target the earliest z-stream.

(In reply to sunzhaohua from comment #3)
> Reproduced this bug
> clusterversion: 4.4.0-0.nightly-2020-04-04-025830
>
> Created a new machineset with replicas=3; 1 machine joined the cluster and node
> status became Ready. 2 machines are stuck at Provisioned status. "oc describe
> machine" shows "instanceState: down" for those machines.

This means the VM is not running - can you make sure?

Hi Roy,
We didn't have access to the RHV management console, so all the data we had was from the OpenShift side. I have downloaded the log Zhaohua attached and I see messages like this:
```
I0410 09:12:38.044200 1 machineservice.go:144] creating VM: secondary-xgbq9-worker-00-g89d6
E0410 09:12:58.857808 1 actuator.go:257] Machine error secondary-xgbq9-worker-00-g89d6: Error running oVirt VM: Fault reason is "Operation Failed". Fault detail is "[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]". HTTP response code is "409". HTTP response message is "409 Conflict".
W0410 09:12:58.857839 1 controller.go:321] Failed to create machine "secondary-xgbq9-worker-00-g89d6": Error running oVirt VM: Fault reason is "Operation Failed". Fault detail is "[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]". HTTP response code is "409". HTTP response message is "409 Conflict".
{"level":"error","ts":1586509978.8578815,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xgbq9-worker-00-g89d6","error":"Error running oVirt VM: Fault reason is \"Operation Failed\". Fault detail is \"[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]\". HTTP response code is \"409\". HTTP response message is \"409 Conflict\".","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
```
It seems the machine hit an error during creation and then got stuck at 'Provisioning'.
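For reference, a quick way to surface this class of failure is to read the machine controller logs directly. A minimal sketch, assuming the controller still runs as the machine-controller container of the machine-api-controllers deployment (as in the pod names seen later in this bug):

```
# Grep the oVirt machine controller for VM start failures
oc -n openshift-machine-api logs deployment/machine-api-controllers -c machine-controller \
  | grep -i "Error running oVirt VM"
```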
Talk to your RHV admin, it seems the VM can't start because there is simply not enough memory. Look at the log: "did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM."

I've attached a case that shows this issue is still an issue:

[root@bastion ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
openshift-stage-wz4zh-master-0 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-master-1 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-master-2 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-8w2xb Ready worker 18h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-ckzxm Ready worker 21h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-n8qtb Ready worker 21h v1.17.1+912792b

[root@bastion ~]# oc -n openshift-machine-api get machine
NAME PHASE TYPE REGION ZONE AGE
openshift-stage-wz4zh-master-0 Running 22h
openshift-stage-wz4zh-master-1 Running 22h
openshift-stage-wz4zh-master-2 Running 22h
openshift-stage-wz4zh-worker-0-8w2xb Provisioned 17h
openshift-stage-wz4zh-worker-0-ckzxm Provisioned 21h
openshift-stage-wz4zh-worker-0-n8qtb Provisioned 21h

A must-gather may not be possible, but the customer was able to run *oc adm inspect* on namespaces, so if the must-gather issue cannot be resolved, what other information would you like? I've also requested log-collector data on the RHVM and a worker node.

Thank you,
m

Created attachment 1699204 [details]
must-gather from cluster
Created attachment 1699206 [details]
namespaces included are machine-api and machine-config-operator
Hi,
I have a similar issue on my OCP env - OCP 4.4.11 on RHV 4.3.9.4-11.el7

[root@bastion ~]# oc -n openshift-machine-api get machine
NAME PHASE TYPE REGION ZONE AGE
ocprhv-g86sn-master-0 Running 6d23h
ocprhv-g86sn-master-1 Running 6d23h
ocprhv-g86sn-master-2 Running 6d23h
ocprhv-g86sn-worker-0-8z6c4 Running 6d23h
ocprhv-g86sn-worker-0-fqdf5 Provisioned 6d23h
ocprhv-g86sn-worker-0-s68hx Provisioned 5d21h
ocprhv-g86sn-worker-0-xdjlz Provisioned 6d23h

The provider state is set to "Image locked" for the Machines in Provisioned state. I scaled up the MachineSet with one more worker; that went OK, the Node is Ready, but the status on the machine side is Provisioned. Tell me if you need data from OCP/RHV.

Luc Pioro
Account Solutions Architect

*** Bug 1840553 has been marked as a duplicate of this bug. ***

Verification steps:
1. Scale up the existing worker MachineSet -> new Machine ends up in Running state, Node is in Ready status, no pending CSRs
2. Scale down the existing worker MachineSet -> previously created Machine no longer listed among Machines and no longer listed among Nodes

So all of this is fine. However, then I hit the issue again in quite a strange situation. I'm going to describe what exactly I did.
1. oc get machineset primary-dlzl6-worker-0 -o yaml --export > ms.yaml (exported the original MachineSet)
2. vim ms.yaml (edited the file so that the replica count is 1, I changed nothing else)
3. oc apply -f ms.yaml
As nothing except the replica count was changed, the change was applied to the original MachineSet. It scaled down to 1 Machine without any issue.
4. In the web console, I scaled the MachineSet back to 3 replicas in one step

And here it comes. One of the newly created Machines got stuck in Provisioned state, was not listed among Nodes, and there are two pending CSRs. According to oc get machine -o yaml, the instanceState is reboot_in_progress. However, when I log into the RHV manager I can see the machine is not rebooting.

CLI output: http://pastebin.test.redhat.com/884795

Verification conducted with:
OCP 4.6.0-0.nightly-2020-07-15-004428
RHV 4.3.11.2-0.1.el7
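For what it's worth, the same scale-down/scale-up cycle can also be driven with oc scale instead of exporting and re-applying the manifest. A minimal sketch, using the MachineSet name from the steps above:

```
# Scale the worker MachineSet down to 1 replica, then back up to 3
# (MachineSets expose the scale subresource, so oc scale works on them)
oc -n openshift-machine-api scale machineset primary-dlzl6-worker-0 --replicas=1
# wait until the extra Machines and Nodes are gone, then scale back up
oc -n openshift-machine-api scale machineset primary-dlzl6-worker-0 --replicas=3
```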
Failing QA based on comment 23

After a discussion with Roy we found out that the worker machines got into Running state eventually, it just took longer than expected. Therefore moving this BZ to VERIFIED and opening a follow-up BZ with lower priority: https://bugzilla.redhat.com/show_bug.cgi?id=1858854

Failing in: openshift-4.6.0-0.nightly-2020-07-22-074636

I was playing with the scaling, and not only did I hit this issue scaling the cluster up, one machine also got stuck in a new state (Provisioning).

The steps:
1. On project '' I scaled up the machineset '' from 3 to 4
# oc edit machineset secondary-xpvlm-worker-0
(Changed replicas from 3 to 4)
2. Once finished, I scaled down to 1
3. Then I scaled up to 3

The results:

# oc get machines
NAME PHASE TYPE REGION ZONE AGE
secondary-xpvlm-master-0 Running 3d
secondary-xpvlm-master-1 Running 3d
secondary-xpvlm-master-2 Running 3d
secondary-xpvlm-worker-0-2bkts Provisioning 18h
secondary-xpvlm-worker-0-55clj Provisioned 18h
secondary-xpvlm-worker-0-zwbb6 Running 3d

# oc get machinesets
NAME DESIRED CURRENT READY AVAILABLE AGE
secondary-xpvlm-worker-0 3 3 1 1 3d

# oc logs machine-approver-6c6964ffb-kdr4w machine-approver-controller
E0729 19:44:18.096102 1 reflector.go:380] github.com/openshift/cluster-machine-approver/main.go:239: Failed to watch *v1beta1.CertificateSigningRequest: Get "https://127.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?allowWatchBookmarks=true&resourceVersion=2575390&timeoutSeconds=579&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
E0729 19:44:18.096868 1 reflector.go:380] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/clusteroperators?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmachine-approver&resourceVersion=2818758&timeoutSeconds=426&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
E0729 19:44:23.432116 1 reflector.go:380] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: the server is currently unable to handle the request (get clusteroperators.config.openshift.io)
I0729 19:47:29.895233 1 main.go:147] CSR csr-hjndh added
I0729 19:47:29.962273 1 main.go:182] CSR csr-hjndh not authorized: failed to find machine for node secondary-xpvlm-worker-0-55clj
I0729 19:47:29.962316 1 main.go:218] Error syncing csr csr-hjndh: failed to find machine for node secondary-xpvlm-worker-0-55clj
I0729 19:47:29.967613 1 main.go:147] CSR csr-hjndh added
I0729 19:47:29.993158 1 main.go:182] CSR csr-hjndh not authorized: failed to find machine for node secondary-xpvlm-worker-0-55clj
(...)

It basically continues repeating itself until the CSRs fall out of the expected time range and it starts dumping the following:

(...)
I0730 03:00:37.931211 1 main.go:147] CSR csr-6rcxd added
I0730 03:00:37.950011 1 main.go:182] CSR csr-6rcxd not authorized: CSR csr-6rcxd creation time 2020-07-30 03:00:37 +0000 UTC not in range (2020-07-29 19:42:14 +0000 UTC, 2020-07-29 21:42:24 +0000 UTC)
I0730 03:00:37.950152 1 main.go:218] Error syncing csr csr-6rcxd: CSR csr-6rcxd creation time 2020-07-30 03:00:37 +0000 UTC not in range (2020-07-29 19:42:14 +0000 UTC, 2020-07-29 21:42:24 +0000 UTC)
(...)

# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-28n6h 16h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-29lw7 16h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-2nq66 8h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-422tb 6h24m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-4298h 60m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-459kq 11h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-49jvz 17h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
(...)
There are 73 of these
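As a side note for anyone triaging a cluster in this state: pending kubelet CSRs like the ones listed above can be reviewed and, once confirmed legitimate, approved by hand. A minimal sketch with standard oc tooling (the csr name is taken from the listing above; approve only requests you recognize):

```
# Review a pending CSR, then approve it
oc describe csr csr-28n6h
oc adm certificate approve csr-28n6h

# Or approve all currently listed CSRs in one go (use with care)
oc get csr -o name | xargs oc adm certificate approve
```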
# oc logs machine-api-controllers-675ccc84b7-4d5lw machine-controller
I0729 19:40:45.376076 1 reflector.go:158] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126
I0729 19:40:45.387496 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-2"
I0729 19:40:45.387553 1 controller.go:376] Machine "secondary-xpvlm-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.417705 1 machineservice.go:262] Fetching VM by ID: cbbd193d-72e2-4d02-bc70-d2961b8cc953
I0729 19:40:45.536577 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-2
I0729 19:40:45.536630 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-2" triggers idempotent update
I0729 19:40:45.554546 1 machineservice.go:262] Fetching VM by ID: cbbd193d-72e2-4d02-bc70-d2961b8cc953
I0729 19:40:45.596294 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-2
I0729 19:40:45.602572 1 actuator.go:262] Updating machine resource
I0729 19:40:45.673792 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:45.691730 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-zwbb6"
I0729 19:40:45.691761 1 controller.go:376] Machine "secondary-xpvlm-worker-0-zwbb6" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.702461 1 machineservice.go:262] Fetching VM by ID: eb9c15d7-7f60-4576-820f-3becbcd1b2ba
I0729 19:40:45.736967 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-zwbb6
I0729 19:40:45.737045 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-zwbb6" triggers idempotent update
I0729 19:40:45.755392 1 machineservice.go:262] Fetching VM by ID: eb9c15d7-7f60-4576-820f-3becbcd1b2ba
I0729 19:40:45.789264 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-zwbb6
I0729 19:40:45.808221 1 actuator.go:262] Updating machine resource
I0729 19:40:45.833615 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:45.846750 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-0"
I0729 19:40:45.846786 1 controller.go:376] Machine "secondary-xpvlm-master-0" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.888556 1 machineservice.go:262] Fetching VM by ID: 88215f56-fefc-408d-ae80-9c6e3e780292
I0729 19:40:45.921894 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-0
I0729 19:40:45.921940 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-0" triggers idempotent update
I0729 19:40:45.956470 1 machineservice.go:262] Fetching VM by ID: 88215f56-fefc-408d-ae80-9c6e3e780292
I0729 19:40:46.003732 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-0
I0729 19:40:46.007976 1 actuator.go:262] Updating machine resource
I0729 19:40:46.019861 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:46.029360 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-1"
I0729 19:40:46.029392 1 controller.go:376] Machine "secondary-xpvlm-master-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:46.039276 1 machineservice.go:262] Fetching VM by ID: a01683dc-4071-48cd-a40f-56987a663ca3
I0729 19:40:46.138830 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-1
I0729 19:40:46.138885 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-1" triggers idempotent update
I0729 19:40:46.147479 1 machineservice.go:262] Fetching VM by ID: a01683dc-4071-48cd-a40f-56987a663ca3
I0729 19:40:46.182617 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-1
I0729 19:40:46.188361 1 actuator.go:262] Updating machine resource
I0729 19:40:46.248731 1 actuator.go:271] Updating machine status sub-resource
I0729 19:42:25.007316 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-55clj"
I0729 19:42:25.007371 1 controller.go:376] Machine "secondary-xpvlm-worker-0-55clj" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:42:25.026622 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:42:25.026653 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:42:46.082403 1 machineservice.go:196] Extending the OS disk from 17179869184 to 128849018880
I0729 19:42:46.420546 1 machineservice.go:209] Waiting while extending the OS disk
I0729 19:44:26.993616 1 actuator.go:262] Updating machine resource
I0729 19:44:27.014082 1 actuator.go:271] Updating machine status sub-resource
I0729 19:44:27.025247 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:44:27.025311 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:44:27.087045 1 controller.go:428] Machine "secondary-xpvlm-worker-0-2bkts" going into phase "Provisioning"
I0729 19:44:27.098363 1 controller.go:319] Reconciling machine object secondary-xpvlm-worker-0-2bkts triggers idempotent create.
I0729 19:44:27.142084 1 machineservice.go:126] creating VM: secondary-xpvlm-worker-0-2bkts
I0729 19:44:47.993861 1 machineservice.go:196] Extending the OS disk from 17179869184 to 128849018880
I0729 19:44:48.259039 1 machineservice.go:209] Waiting while extending the OS disk
I0729 19:46:09.001274 1 actuator.go:262] Updating machine resource
I0729 19:46:09.142437 1 actuator.go:271] Updating machine status sub-resource
I0729 19:46:09.175710 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-55clj"
I0729 19:46:09.175784 1 controller.go:376] Machine "secondary-xpvlm-worker-0-55clj" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:46:09.191239 1 machineservice.go:262] Fetching VM by ID: a7a7e81e-098c-458f-a8fa-d1060b782b9f
I0729 19:46:09.299097 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-55clj
I0729 19:46:09.299135 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-55clj" triggers idempotent update
I0729 19:46:09.310142 1 machineservice.go:262] Fetching VM by ID: a7a7e81e-098c-458f-a8fa-d1060b782b9f
I0729 19:46:09.342089 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-55clj
E0729 19:46:09.522784 1 actuator.go:295] failed to lookup the VM IP lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 19:46:09.522841 1 controller.go:286] Error updating machine "openshift-machine-api/secondary-xpvlm-worker-0-55clj": lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host
{"level":"error","ts":1596051969.5229006,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xpvlm-worker-0-55clj","error":"lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
I0729 19:46:10.523314 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:46:10.523382 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:46:10.534959 1 machineservice.go:262] Fetching VM by ID: 8edcee4b-aa4e-4499-b8bb-044b07f8260e
I0729 19:46:10.565728 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-2bkts
I0729 19:46:10.565778 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-2bkts" triggers idempotent update
I0729 19:46:10.576903 1 machineservice.go:262] Fetching VM by ID: 8edcee4b-aa4e-4499-b8bb-044b07f8260e
I0729 19:46:10.609187 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-2bkts
E0729 19:46:10.773276 1 actuator.go:295] failed to lookup the VM IP lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 19:46:10.773321 1 controller.go:286] Error updating machine "openshift-machine-api/secondary-xpvlm-worker-0-2bkts": lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host
{"level":"error","ts":1596051970.7733457,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xpvlm-worker-0-2bkts","error":"lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
(...)
It basically continues repeating itself

(In reply to Guilherme Santos from comment #29)
> Failing in: openshift-4.6.0-0.nightly-2020-07-22-074636
>
> I was playing with the scaling, and not only did I hit this issue scaling the
> cluster up, one machine also got stuck in a new state (Provisioning).

Hi, please follow up with Evgeni or open a new bug on that, and keep this one VERIFIED. This is not related to the originally reported issue on IPI installation.

The engine log is useless as it doesn't cover the period. Can you attach the right one?

All the new logging added in that PR is not displayed by default, can you increase the log level please?

The VM may be running, but the guest not necessarily. RHV host logs and verifying that the guest actually runs would be helpful.

Evgeny found out that the worker VM hasn't booted - most probably because we've exhausted our IP address pool. Let me run one more iteration of the test and if that goes smoothly, I'll move to VERIFIED.

After some investigation, we found out that the issue described in comment #33 was caused by our DHCP infra setup, and is therefore unrelated to the OCP on RHV functionality. I performed the OCP deployment 4 times, each time doing a series of scale-downs and scale-ups, and did not encounter the original issue. I believe this is a big enough sample to verify the fix.
Before this fix was delivered, I used to encounter the original issue in about 75 % of scale-up attempts, so I believe that if it were still present, it would have popped up during my verification.

As the scale-up issue is no longer encountered, this seems to indicate that the root cause of machines being stuck in Provisioned state during cluster deployment has also been removed. This is far more difficult to confirm 100 %, though, as that issue reproduced only rarely during cluster deployment. If the issue during cluster deployment should re-appear, I think it's safe to assume that its root cause is different from this one.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

As mentioned in comment #39, this bug was CLOSED ERRATA. If the solution does not work for you, open a new bug report. Please don't reopen this bug, as the problem you're seeing may have a similar result but a different root cause, and deserves to be analyzed accordingly.