Bug 1854787 - [RHV] New machine stuck at 'Provisioned' phase
Summary: [RHV] New machine stuck at 'Provisioned' phase
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Roy Golan
QA Contact: Jan Zmeskal
URL:
Whiteboard:
Duplicates: 1840018 1849387
Depends On: 1817853
Blocks:
 
Reported: 2020-07-08 09:07 UTC by OpenShift BugZilla Robot
Modified: 2023-12-15 18:24 UTC
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-08 10:54:03 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-ovirt pull 54 0 None closed [release-4.5] Bug 1854787: Reconcile network addresses according to VM status 2021-01-26 16:58:41 UTC
Github openshift cluster-api-provider-ovirt pull 65 0 None closed BUG 1854787: [release-4.5] abort network reconciling while vm is in transient state 2021-01-26 16:58:42 UTC
Red Hat Knowledge Base (Solution) 5297401 0 None None None 2020-08-05 10:31:07 UTC
Red Hat Product Errata RHBA-2020:3510 0 None None None 2020-09-08 10:54:25 UTC

Comment 2 Roy Golan 2020-07-09 12:41:45 UTC
*** Bug 1840018 has been marked as a duplicate of this bug. ***

Comment 6 Roy Golan 2020-07-14 11:16:31 UTC
*** Bug 1849387 has been marked as a duplicate of this bug. ***

Comment 13 Jan Zmeskal 2020-07-29 13:37:31 UTC
Verification attempted with: 
openshift-install-linux-4.5.0-0.nightly-2020-07-29-051236 (The fix landed in 4.5.0-0.nightly-2020-07-28-182449)
RHV 4.3.11.2-0.1.el7

I scaled up the existing worker MachineSet and waited for the new worker machine to reach the Running state. I waited for almost an hour and a half, but it stayed stuck in the Provisioned state. See here:

# oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
primary-spfb8-master-0         Running                              178m
primary-spfb8-master-1         Running                              178m
primary-spfb8-master-2         Running                              178m
primary-spfb8-worker-0-9qgp8   Running                              169m
primary-spfb8-worker-0-b8s89   Provisioned                          88m
primary-spfb8-worker-0-ktqv2   Running                              169m
primary-spfb8-worker-0-nxr82   Running                              169m

The new machine was not present among the nodes either:
# oc get nodes
NAME                           STATUS   ROLES    AGE    VERSION
primary-spfb8-master-0         Ready    master   177m   v1.18.3+012b3ec
primary-spfb8-master-1         Ready    master   177m   v1.18.3+012b3ec
primary-spfb8-master-2         Ready    master   177m   v1.18.3+012b3ec
primary-spfb8-worker-0-9qgp8   Ready    worker   162m   v1.18.3+012b3ec
primary-spfb8-worker-0-ktqv2   Ready    worker   151m   v1.18.3+012b3ec
primary-spfb8-worker-0-nxr82   Ready    worker   161m   v1.18.3+012b3ec

This error can be seen in the machine-controller container:
# oc logs machine-api-controllers-5d75cbdb7d-4pm8j -c machine-controller
...
E0729 12:06:46.244009       1 actuator.go:295] failed to lookup the VM IP lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 12:06:46.244052       1 controller.go:286] Error updating machine "openshift-machine-api/primary-spfb8-worker-0-b8s89": lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host
{"level":"error","ts":1596024406.2441392,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/primary-spfb8-worker-0-b8s89","error":"lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
...

A couple of CSRs are pending (I checked that there were no Pending CSRs right after cluster deployment):
# oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8x46q   25m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9h7ft   56m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bb98m   41m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-d6plc   87m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-v2q29   10m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-vjjgw   72m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
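
For reference, pending kubelet bootstrap CSRs can be approved by hand; this is a generic oc command, not specific to this bug, and it only unblocks node registration without addressing the underlying address-reconciliation problem:
# oc get csr -o name | xargs oc adm certificate approve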

I'll attach other log files as well.

Comment 15 Jan Zmeskal 2020-07-29 13:46:45 UTC
One additional piece of information: the Provider State of the new worker Machine in the web console is reboot_in_progress, but in reality the VM is not rebooting.
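
The same state should be readable from the CLI via the machine's providerStatus; a sketch based on the field layout visible in the oc describe output later in this bug:
# oc -n openshift-machine-api get machine primary-spfb8-worker-0-b8s89 -o jsonpath='{.status.providerStatus.instanceState}'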

Comment 17 daniel 2020-08-05 10:16:50 UTC
Reading previous comments, it seems the workaround is in c#7 and c#8; giving this a try...

setting up a new OCP 4.4.10 cluster:

~~~
# oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.10    True        False         36m     Cluster version is 4.4.10
#
# oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
cluster-46vks-master-0         Ready    master   25m     v1.17.1+9d33dd3
cluster-46vks-master-1         Ready    master   24m     v1.17.1+9d33dd3
cluster-46vks-master-2         Ready    master   24m     v1.17.1+9d33dd3
cluster-46vks-worker-0-2bzzm   Ready    worker   7m41s   v1.17.1+9d33dd3
cluster-46vks-worker-0-gbkdn   Ready    worker   9m31s   v1.17.1+9d33dd3
cluster-46vks-worker-0-ks6dk   Ready    worker   12m     v1.17.1+9d33dd3
cluster-46vks-worker-0-kt6jp   Ready    worker   11m     v1.17.1+9d33dd3
# 
# oc get machineset -n openshift-machine-api 
NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
cluster-46vks-worker-0   4         4                             26m
# 
# oc get machine -n openshift-machine-api 
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-46vks-master-0         Running                              26m
cluster-46vks-master-1         Running                              26m
cluster-46vks-master-2         Running                              26m
cluster-46vks-worker-0-2bzzm   Provisioned                          17m
cluster-46vks-worker-0-gbkdn   Provisioned                          17m
cluster-46vks-worker-0-ks6dk   Provisioned                          17m
cluster-46vks-worker-0-kt6jp   Provisioned                          17m
# 
~~~

-> so, getting the VM ID from the RHV UI (Compute -> Virtual Machines -> click the VM in question -> on the right, see "VM ID:")
--> this yields the following list:

node/machine name            |              VM ID
---------------------------------------------------------------------
cluster-46vks-worker-0-2bzzm | ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
cluster-46vks-worker-0-gbkdn | ad826aae-2d39-401a-b95a-02dc14b902ea
cluster-46vks-worker-0-ks6dk | 9dafee85-5cca-406b-8f73-fcd438fc67b1
cluster-46vks-worker-0-kt6jp | aa1b34ce-1c4c-44d6-8566-ae4521067fe1
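
--> if UI access is inconvenient, the same name-to-ID mapping can be pulled from the oVirt REST API; a rough sketch (ENGINE_FQDN and the credentials are placeholders, and it assumes the first id attribute in each response belongs to the matched VM):
~~~
for vm in cluster-46vks-worker-0-2bzzm cluster-46vks-worker-0-gbkdn \
          cluster-46vks-worker-0-ks6dk cluster-46vks-worker-0-kt6jp; do
  # query the engine for the VM by name and extract the first id attribute
  id=$(curl -sk -u 'admin@internal:PASSWORD' \
    "https://ENGINE_FQDN/ovirt-engine/api/vms?search=name%3D${vm}" \
    | grep -o 'id="[^"]*"' | head -1 | cut -d'"' -f2)
  echo "${vm} | ${id}"
done
~~~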

--> those IDs need to be set on both the node and the machine objects, using either of the following approaches:
a) direct edit: 
~~~
# oc edit node cluster-46vks-worker-0-2bzzm

...
spec: 
  providerID: ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
status:
  addresses:
...
node/cluster-46vks-worker-0-2bzzm edited
# oc edit machine cluster-46vks-worker-0-2bzzm  -n openshift-machine-api 
...
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: ovirtproviderconfig.machine.openshift.io/v1beta1
      cluster_id: 587fa27d-0229-00d8-0323-000000000290
      cpu:
        cores: 4
        sockets: 1
        threads: 1
      credentialsSecret:
        name: ovirt-credentials
      id: ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
      kind: OvirtMachineProviderSpec
      memory_mb: 16348

...
machine.machine.openshift.io/cluster-46vks-worker-0-2bzzm edited
#
~~~

b) using oc patch:
~~~
# oc patch node cluster-46vks-worker-0-gbkdn --type merge --patch '{"spec":{"providerID":"ad826aae-2d39-401a-b95a-02dc14b902ea"}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-gbkdn --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"ad826aae-2d39-401a-b95a-02dc14b902ea"}}}}'
~~~
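
c) the per-machine steps above can also be scripted; a rough sketch, assuming a hypothetical mapping file vm-ids.txt with one "<machine-name> <vm-id>" pair per line:
~~~
while read -r name id; do
  # set the providerID on the node object
  oc patch node "${name}" --type merge \
    --patch "{\"spec\":{\"providerID\":\"${id}\"}}"
  # set the matching VM id in the machine's providerSpec
  oc -n openshift-machine-api patch machine "${name}" --type merge \
    --patch "{\"spec\":{\"providerSpec\":{\"value\":{\"id\":\"${id}\"}}}}"
done < vm-ids.txt
~~~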

and when checking, the machines and CSRs are getting sorted out:

~~~
# oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-46vks-master-0         Running                              41m
cluster-46vks-master-1         Running                              41m
cluster-46vks-master-2         Running                              41m
cluster-46vks-worker-0-2bzzm   Running                              32m
cluster-46vks-worker-0-gbkdn   Running                              32m
cluster-46vks-worker-0-ks6dk   Provisioned                          32m
cluster-46vks-worker-0-kt6jp   Provisioned                          32m
# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-5hmph   29m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bcxhr   12m     system:node:cluster-46vks-worker-0-ks6dk                                    Pending
csr-cvj6x   26m     system:node:cluster-46vks-worker-0-kt6jp                                    Pending
csr-drsgg   22m     system:node:cluster-46vks-worker-0-2bzzm                                    Approved,Issued
csr-dtfl8   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ffzwq   40m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gzg75   11m     system:node:cluster-46vks-worker-0-kt6jp                                    Pending
csr-ljkbq   40m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mmnwq   40m     system:node:cluster-46vks-master-0                                          Approved,Issued
csr-npkp6   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nxh8k   39m     system:node:cluster-46vks-master-1                                          Approved,Issued
csr-qrhmh   24m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qswsj   9m25s   system:node:cluster-46vks-worker-0-gbkdn                                    Pending
csr-qxmks   27m     system:node:cluster-46vks-worker-0-ks6dk                                    Pending
csr-r6wtx   24m     system:node:cluster-46vks-worker-0-gbkdn                                    Pending
csr-wtt9z   39m     system:node:cluster-46vks-master-2                                          Approved,Issued
csr-xrb69   40m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
# 
~~~

doing the rest:
~~~
# oc patch node cluster-46vks-worker-0-ks6dk --type merge --patch '{"spec":{"providerID":"9dafee85-5cca-406b-8f73-fcd438fc67b1"}}'
# oc patch node cluster-46vks-worker-0-kt6jp --type merge --patch '{"spec":{"providerID":"aa1b34ce-1c4c-44d6-8566-ae4521067fe1"}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-ks6dk --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"9dafee85-5cca-406b-8f73-fcd438fc67b1"}}}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-kt6jp --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"aa1b34ce-1c4c-44d6-8566-ae4521067fe1"}}}}'
# oc get machine -n openshift-machine-api
NAME                           PHASE     TYPE   REGION   ZONE   AGE
cluster-46vks-master-0         Running                          46m
cluster-46vks-master-1         Running                          46m
cluster-46vks-master-2         Running                          46m
cluster-46vks-worker-0-2bzzm   Running                          36m
cluster-46vks-worker-0-gbkdn   Running                          36m
cluster-46vks-worker-0-ks6dk   Running                          36m
cluster-46vks-worker-0-kt6jp   Running                          36m
# oc get machineset -n openshift-machine-api
NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
cluster-46vks-worker-0   4         4         4       4           46m
# oc get csr 
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-5hmph   33m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5p28b   2m3s   system:node:cluster-46vks-worker-0-ks6dk                                    Approved,Issued
csr-bcxhr   17m    system:node:cluster-46vks-worker-0-ks6dk                                    Pending
csr-cvj6x   31m    system:node:cluster-46vks-worker-0-kt6jp                                    Pending
csr-drsgg   27m    system:node:cluster-46vks-worker-0-2bzzm                                    Approved,Issued
csr-dtfl8   32m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ffzwq   45m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gzg75   16m    system:node:cluster-46vks-worker-0-kt6jp                                    Pending
csr-ljkbq   45m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mmnwq   44m    system:node:cluster-46vks-master-0                                          Approved,Issued
csr-npkp6   27m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nxh8k   44m    system:node:cluster-46vks-master-1                                          Approved,Issued
csr-qrhmh   29m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qswsj   14m    system:node:cluster-46vks-worker-0-gbkdn                                    Approved,Issued
csr-qxmks   32m    system:node:cluster-46vks-worker-0-ks6dk                                    Pending
csr-r6wtx   29m    system:node:cluster-46vks-worker-0-gbkdn                                    Pending
csr-wtt9z   44m    system:node:cluster-46vks-master-2                                          Approved,Issued
csr-wzncq   87s    system:node:cluster-46vks-worker-0-kt6jp                                    Approved,Issued
csr-xrb69   45m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
# 

~~~

waiting another ~10 minutes:
~~~
# oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-5hmph   54m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5p28b   22m   system:node:cluster-46vks-worker-0-ks6dk                                    Approved,Issued
csr-bcxhr   38m   system:node:cluster-46vks-worker-0-ks6dk                                    Approved,Issued
csr-cvj6x   52m   system:node:cluster-46vks-worker-0-kt6jp                                    Approved,Issued
csr-drsgg   48m   system:node:cluster-46vks-worker-0-2bzzm                                    Approved,Issued
csr-dtfl8   53m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ffzwq   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gzg75   37m   system:node:cluster-46vks-worker-0-kt6jp                                    Approved,Issued
csr-ljkbq   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mmnwq   65m   system:node:cluster-46vks-master-0                                          Approved,Issued
csr-npkp6   48m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nxh8k   65m   system:node:cluster-46vks-master-1                                          Approved,Issued
csr-qrhmh   50m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qswsj   35m   system:node:cluster-46vks-worker-0-gbkdn                                    Approved,Issued
csr-qxmks   53m   system:node:cluster-46vks-worker-0-ks6dk                                    Approved,Issued
csr-r6wtx   50m   system:node:cluster-46vks-worker-0-gbkdn                                    Approved,Issued
csr-wtt9z   65m   system:node:cluster-46vks-master-2                                          Approved,Issued
csr-wzncq   22m   system:node:cluster-46vks-worker-0-kt6jp                                    Approved,Issued
csr-xrb69   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
#
~~~

all seems to be sorted, so this could be used as a workaround (WA) until the bug is fixed

Comment 24 Evgeny Slutsky 2020-08-18 09:59:30 UTC
I couldn't reproduce this issue with 4.6.0-0.nightly-2020-08-18-055142.

Tried:
1. Running an IPI installation of a cluster with 3 masters and 2 workers.
2. Manually scaling the MachineSet to 3 using the oc command:

oc scale --replicas=3 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api
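
To watch the phase transitions during such a scale test, the standard watch flag can be used:

oc get machine -n openshift-machine-api -w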


[root@eslutsky-proxy-vm ~]# ./oc get machineset -n openshift-machine-api
NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
ovirt10-26k9v-worker-0   3         3         3       3           67m



[root@eslutsky-proxy-vm ~]# ./oc get machine -n openshift-machine-api
NAME                           PHASE     TYPE   REGION   ZONE   AGE
ovirt10-26k9v-master-0         Running                          67m
ovirt10-26k9v-master-1         Running                          67m
ovirt10-26k9v-master-2         Running                          67m
ovirt10-26k9v-worker-0-bmndg   Running                          57m
ovirt10-26k9v-worker-0-dbptj   Running                          13m
ovirt10-26k9v-worker-0-ghppj   Running                          57m


[root@eslutsky-proxy-vm ~]# ./oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ovirt10-26k9v-master-0         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-master-1         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-master-2         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-bmndg   Ready    worker   50m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-dbptj   Ready    worker   5m46s   v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-ghppj   Ready    worker   39m     v1.19.0-rc.2+99cb93a-dirty

Comment 25 Jan Zmeskal 2020-08-18 10:03:25 UTC
Hi Evgeny, one more thing comes to mind: try scaling the existing worker MachineSet to 0 and then back to 3.
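
For the record, with the MachineSet from comment 24 that sequence would be:

oc scale --replicas=0 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api
oc scale --replicas=3 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api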

Comment 26 Evgeny Slutsky 2020-08-18 10:10:15 UTC
The issue reproduced when I tried scaling up again to 4 workers, but this time RHV was unable to spawn the extra worker (out of resources).

In the RHV events:
Failed to run VM ovirt10-26k9v-worker-0-6mzq8 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host es-host-01 did not satisfy internal filter Memory because its available memory is too low (10415 MB) to run the VM., The host es-host-01 did not satisfy internal filter Memory because its available memory is too low (10415 MB) to run the VM.] (User: admin@internal-authz).

[root@eslutsky-proxy-vm ~]# ./oc get machines -A
NAMESPACE               NAME                           PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ovirt10-26k9v-master-0         Running                              83m
openshift-machine-api   ovirt10-26k9v-master-1         Running                              83m
openshift-machine-api   ovirt10-26k9v-master-2         Running                              83m
openshift-machine-api   ovirt10-26k9v-worker-0-6mzq8   Provisioned                          6m57s
openshift-machine-api   ovirt10-26k9v-worker-0-bmndg   Running                              73m
openshift-machine-api   ovirt10-26k9v-worker-0-dbptj   Running                              29m
openshift-machine-api   ovirt10-26k9v-worker-0-ghppj   Running                              73m


./oc describe machine/ovirt10-26k9v-worker-0-6mzq8 -n openshift-machine-api

Spec:
  Metadata:
  Provider ID:  ac81b8df-ce0b-4e74-a393-4bb4d83607b6
  Provider Spec:
    Value:
      API Version:  ovirtproviderconfig.machine.openshift.io/v1beta1
      cluster_id:   2b76bbe8-38c3-49a5-b5f3-23b4dd1f4326
      Cpu:
        Cores:    4
        Sockets:  1
        Threads:  1
      Credentials Secret:
        Name:     ovirt-credentials
      Id:         
      Kind:       OvirtMachineProviderSpec
      memory_mb:  16348
      Metadata:
        Creation Timestamp:  <nil>
      Name:                  
      os_disk:
        size_gb:      120
      template_name:  ovirt10-26k9v-rhcos
      Type:           server
      User Data Secret:
        Name:  worker-user-data-managed
Status:
  Last Updated:  2020-08-18T09:59:12Z
  Phase:         Provisioned
  Provider Status:
    Conditions:
      Last Probe Time:       2020-08-18T09:59:12Z
      Last Transition Time:  2020-08-18T09:59:12Z
      Message:               Machine successfully created
      Reason:                MachineCreateSucceeded
      Status:                True
      Type:                  MachineCreated
    Instance Id:             ac81b8df-ce0b-4e74-a393-4bb4d83607b6
    Instance State:          down
    Metadata:
      Creation Timestamp:  <nil>


Can you please confirm whether the workers are starting up in the RHV engine?

Comment 27 Jan Zmeskal 2020-08-18 10:19:21 UTC
I think that doesn't count as a reproduction, since it's expected to run into problems when there aren't enough compute resources. I can confirm that when I hit this issue, the VMs were successfully started.

Comment 28 Evgeny Slutsky 2020-08-18 10:51:49 UTC
(In reply to Jan Zmeskal from comment #25)
> Hi Evgeny, one more thing comes to mind: Try scaling the existing worker
> MachineSet to 0 and then back to 3

When scaling to 0, the last worker was not deleted and got stuck in the 'Deleting' phase.
./oc scale --replicas=0 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api


[root@eslutsky-proxy-vm ~]# ./oc get machines -A
NAMESPACE               NAME                           PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ovirt10-26k9v-master-0         Running                           128m
openshift-machine-api   ovirt10-26k9v-master-1         Running                           128m
openshift-machine-api   ovirt10-26k9v-master-2         Running                           128m
openshift-machine-api   ovirt10-26k9v-worker-0-dbptj   Deleting                          74m

Comment 29 Evgeny Slutsky 2020-08-18 14:40:37 UTC
After reproducing this issue in the QE RHV 4.3 environment with the latest OCP (4.6.0-0.nightly-2020-08-18-055142), it appears to be caused by a failed DNS lookup attempt:

# oc logs machine-api-controllers-5bc7996949-wszvn -c machine-controller
I0818 13:49:18.142340       1 machineservice.go:270] Got VM by ID: primary-5tk29-worker-0-ml2v5
E0818 13:49:18.151995       1 actuator.go:295] failed to lookup the VM IP lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0818 13:49:18.152023       1 controller.go:286] Error updating machine "openshift-machine-api/primary-5tk29-worker-0-ml2v5": lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host
{"level":"error","ts":1597758558.15205,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/primary-5tk29-worker-0-ml2v5","error":"lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}


So the IP never gets reported back to the machine object:
oc get machine -o=custom-columns="name:.metadata.name,Status:status.phase,Address:.status.addresses[1].address"
name                           Status        Address
primary-5tk29-master-0         Running       10.35.71.95
primary-5tk29-master-1         Running       10.35.71.94
primary-5tk29-master-2         Running       10.35.71.232
primary-5tk29-worker-0-86mmj   Running       10.35.71.98
primary-5tk29-worker-0-ml2v5   Provisioned   <none>
primary-5tk29-worker-0-nm9lt   Running       10.35.71.99
primary-5tk29-worker-0-pv48v   Running       10.35.71.97

The issue was resolved after deleting the failed pod:
oc delete pod/machine-api-controllers-5bc7996949-wszvn
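
Since the machine-api-controllers pod is managed by a Deployment, a replacement pod is created automatically; assuming the usual Deployment name, the same restart can also be triggered without looking up the pod:

oc -n openshift-machine-api rollout restart deployment/machine-api-controllers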

Comment 34 Michal Skrivanek 2020-08-19 08:41:13 UTC
(In reply to Jan Zmeskal from comment #13)
> Verification attempted with: 
> openshift-install-linux-4.5.0-0.nightly-2020-07-29-051236 (The fix landed in
> 4.5.0-0.nightly-2020-07-28-182449)
> RHV 4.3.11.2-0.1.el7
> 
> I scaled up the existing worker MachineSet and waited for the new worker
> machine to get into Running state. I waited for almost hour and half but it
> got stuck in Provisioned state. See here:

Can you please update and confirm whether you have verified the IPI part of the problem and only see an issue for scaling up? If so, please verify this bug and work with Evgeny to open a new one tracking scale-up issues in 4.5+.

Comment 43 Evgeny Slutsky 2020-08-25 12:39:08 UTC
This was merged for 4.6 only; we need to cherry-pick it to 4.5.

Comment 44 Michal Skrivanek 2020-08-25 12:41:56 UTC
This bug has TR 4.5.z; it's either the wrong bug, or it should not be in MODIFIED then.

Comment 45 Evgeny Slutsky 2020-08-25 13:11:33 UTC
Yes, sorry, I didn't notice.

Comment 50 Jan Zmeskal 2020-08-31 15:30:00 UTC
Verified using the same method as described here: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32
Using: openshift-install 4.5.0-0.nightly-2020-08-29-080432

Comment 52 errata-xmlrpc 2020-09-08 10:54:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510

