Bug 1854787
| Summary: | [RHV] New machine stuck at 'Provisioned' phase | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
| Component: | Cloud Compute | Assignee: | Roy Golan <rgolan> |
| Cloud Compute sub component: | oVirt Provider | QA Contact: | Jan Zmeskal <jzmeskal> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | aprajapa, asonmez, bjarolim, dmoessne, dvercill, eslutsky, fhirtz, gpulido, hpopal, jmalde, jortialc, jzmeskal, lleistne, lmartinh, michal.skrivanek, mleonard, mnunes, openshift-bugs-escalate, pelauter, ramon.gordillo, rgolan, rkshirsa, zhsun |
| Version: | 4.4 | Keywords: | Reopened |
| Target Milestone: | --- | ||
| Target Release: | 4.5.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-08 10:54:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1817853 | ||
| Bug Blocks: | |||
|
Comment 2
Roy Golan
2020-07-09 12:41:45 UTC
*** Bug 1849387 has been marked as a duplicate of this bug. ***

Comment 13
Jan Zmeskal

Verification attempted with:
openshift-install-linux-4.5.0-0.nightly-2020-07-29-051236 (the fix landed in 4.5.0-0.nightly-2020-07-28-182449)
RHV 4.3.11.2-0.1.el7
I scaled up the existing worker MachineSet and waited for the new worker machine to reach the Running state. I waited for almost an hour and a half, but it stayed stuck in the Provisioned state. See here:
# oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
primary-spfb8-master-0 Running 178m
primary-spfb8-master-1 Running 178m
primary-spfb8-master-2 Running 178m
primary-spfb8-worker-0-9qgp8 Running 169m
primary-spfb8-worker-0-b8s89 Provisioned 88m
primary-spfb8-worker-0-ktqv2 Running 169m
primary-spfb8-worker-0-nxr82 Running 169m
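For reference, the scale-up itself is a plain MachineSet scale; a minimal sketch, assuming the worker MachineSet is named primary-spfb8-worker-0 (inferred from the machine names above) and one extra replica is wanted:
~~~
# scale the worker MachineSet up by one and watch the new machine's phase
oc -n openshift-machine-api scale machineset primary-spfb8-worker-0 --replicas=4
oc -n openshift-machine-api get machine -w
~~~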
The new machine was not present among the nodes either:
# oc get nodes
NAME STATUS ROLES AGE VERSION
primary-spfb8-master-0 Ready master 177m v1.18.3+012b3ec
primary-spfb8-master-1 Ready master 177m v1.18.3+012b3ec
primary-spfb8-master-2 Ready master 177m v1.18.3+012b3ec
primary-spfb8-worker-0-9qgp8 Ready worker 162m v1.18.3+012b3ec
primary-spfb8-worker-0-ktqv2 Ready worker 151m v1.18.3+012b3ec
primary-spfb8-worker-0-nxr82 Ready worker 161m v1.18.3+012b3ec
This error can be seen in the machine-controller container:
# oc logs machine-api-controllers-5d75cbdb7d-4pm8j -c machine-controller
...
E0729 12:06:46.244009 1 actuator.go:295] failed to lookup the VM IP lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 12:06:46.244052 1 controller.go:286] Error updating machine "openshift-machine-api/primary-spfb8-worker-0-b8s89": lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host
{"level":"error","ts":1596024406.2441392,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/primary-spfb8-worker-0-b8s89","error":"lookup primary-spfb8-worker-0-b8s89 on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
...
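The failing lookup can be reproduced outside the controller to confirm it is a DNS problem rather than an actuator bug; a minimal sketch, assuming the ubi8 image is pullable in this cluster (the pod's resolv.conf points at the cluster DNS service 172.30.0.10 that appears in the error):
~~~
# try to resolve the stuck machine's name from inside the cluster
oc run dns-check --rm -i --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi -- \
  getent hosts primary-spfb8-worker-0-b8s89
# compare with a worker that did reach Running
oc run dns-check --rm -i --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi -- \
  getent hosts primary-spfb8-worker-0-9qgp8
~~~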
A couple of CSRs are Pending (I checked that none were Pending right after cluster deployment):
# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-8x46q 25m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-9h7ft 56m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-bb98m 41m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-d6plc 87m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-v2q29 10m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-vjjgw 72m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
I'll attach other log files as well.
One additional piece of information: the Provider State of the new worker Machine in the web console is reboot_in_progress, but in reality the VM is not rebooting.

Reading the previous comments, it seems the workaround is in c#7 and c#8; giving this a try.
Setting up a new OCP 4.4.10 cluster:
~~~
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.10 True False 36m Cluster version is 4.4.10
#
# oc get nodes
NAME STATUS ROLES AGE VERSION
cluster-46vks-master-0 Ready master 25m v1.17.1+9d33dd3
cluster-46vks-master-1 Ready master 24m v1.17.1+9d33dd3
cluster-46vks-master-2 Ready master 24m v1.17.1+9d33dd3
cluster-46vks-worker-0-2bzzm Ready worker 7m41s v1.17.1+9d33dd3
cluster-46vks-worker-0-gbkdn Ready worker 9m31s v1.17.1+9d33dd3
cluster-46vks-worker-0-ks6dk Ready worker 12m v1.17.1+9d33dd3
cluster-46vks-worker-0-kt6jp Ready worker 11m v1.17.1+9d33dd3
#
# oc get machineset -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
cluster-46vks-worker-0 4 4 26m
#
# oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
cluster-46vks-master-0 Running 26m
cluster-46vks-master-1 Running 26m
cluster-46vks-master-2 Running 26m
cluster-46vks-worker-0-2bzzm Provisioned 17m
cluster-46vks-worker-0-gbkdn Provisioned 17m
cluster-46vks-worker-0-ks6dk Provisioned 17m
cluster-46vks-worker-0-kt6jp Provisioned 17m
#
~~~
-> So, getting the VM ID from the RHV UI (Compute -> Virtual Machines -> click the VM in question -> on the right, see VM ID:)
--> this gives the following list:
node/machine name | VM ID
---------------------------------------------------------------------
cluster-46vks-worker-0-2bzzm | ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
cluster-46vks-worker-0-gbkdn | ad826aae-2d39-401a-b95a-02dc14b902ea
cluster-46vks-worker-0-ks6dk | 9dafee85-5cca-406b-8f73-fcd438fc67b1
cluster-46vks-worker-0-kt6jp | aa1b34ce-1c4c-44d6-8566-ae4521067fe1
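As an alternative to clicking through the UI, the same name/ID pairs can be pulled from the oVirt REST API; a minimal sketch, with the engine FQDN and credentials as placeholders and jq assumed to be installed:
~~~
# list VM name -> VM ID for the cluster's workers via the oVirt REST API
curl -sk -u 'admin@internal:PASSWORD' -H 'Accept: application/json' \
  'https://ENGINE_FQDN/ovirt-engine/api/vms?search=name%3Dcluster-46vks-worker-0-*' \
  | jq -r '.vm[] | "\(.name)  \(.id)"'
~~~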
--> those IDs need to be set on both the Node and the Machine objects, using either of the following approaches:
a) direct edit:
~~~
# oc edit node cluster-46vks-worker-0-2bzzm
...
spec:
providerID: ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
status:
addresses:
...
node/cluster-46vks-worker-0-2bzzm edited
# oc edit machine cluster-46vks-worker-0-2bzzm -n openshift-machine-api
...
spec:
metadata:
creationTimestamp: null
providerSpec:
value:
apiVersion: ovirtproviderconfig.machine.openshift.io/v1beta1
cluster_id: 587fa27d-0229-00d8-0323-000000000290
cpu:
cores: 4
sockets: 1
threads: 1
credentialsSecret:
name: ovirt-credentials
id: ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
kind: OvirtMachineProviderSpec
memory_mb: 16348
...
machine.machine.openshift.io/cluster-46vks-worker-0-2bzzm edited
#
~~~
b) using oc patch:
~~~
# oc patch node cluster-46vks-worker-0-gbkdn --type merge --patch '{"spec":{"providerID":"ad826aae-2d39-401a-b95a-02dc14b902ea"}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-gbkdn --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"ad826aae-2d39-401a-b95a-02dc14b902ea"}}}}'
~~~
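Either way, the same two patches can be applied in one pass for all affected workers; a minimal sketch that reuses the name/VM ID pairs from the list above:
~~~
# patch the Node providerID and the Machine providerSpec id for each worker/VM pair
while read -r name vmid; do
  oc patch node "$name" --type merge \
    --patch "{\"spec\":{\"providerID\":\"$vmid\"}}"
  oc -n openshift-machine-api patch machine "$name" --type merge \
    --patch "{\"spec\":{\"providerSpec\":{\"value\":{\"id\":\"$vmid\"}}}}"
done <<'EOF'
cluster-46vks-worker-0-2bzzm ee7488fb-ac4f-4bed-85c8-1d75b2cd3798
cluster-46vks-worker-0-gbkdn ad826aae-2d39-401a-b95a-02dc14b902ea
cluster-46vks-worker-0-ks6dk 9dafee85-5cca-406b-8f73-fcd438fc67b1
cluster-46vks-worker-0-kt6jp aa1b34ce-1c4c-44d6-8566-ae4521067fe1
EOF
~~~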
When checking again, the patched machines transition to Running and their CSRs start getting processed:
~~~
# oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
cluster-46vks-master-0 Running 41m
cluster-46vks-master-1 Running 41m
cluster-46vks-master-2 Running 41m
cluster-46vks-worker-0-2bzzm Running 32m
cluster-46vks-worker-0-gbkdn Running 32m
cluster-46vks-worker-0-ks6dk Provisioned 32m
cluster-46vks-worker-0-kt6jp Provisioned 32m
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5hmph 29m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-bcxhr 12m system:node:cluster-46vks-worker-0-ks6dk Pending
csr-cvj6x 26m system:node:cluster-46vks-worker-0-kt6jp Pending
csr-drsgg 22m system:node:cluster-46vks-worker-0-2bzzm Approved,Issued
csr-dtfl8 27m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-ffzwq 40m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-gzg75 11m system:node:cluster-46vks-worker-0-kt6jp Pending
csr-ljkbq 40m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-mmnwq 40m system:node:cluster-46vks-master-0 Approved,Issued
csr-npkp6 22m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-nxh8k 39m system:node:cluster-46vks-master-1 Approved,Issued
csr-qrhmh 24m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-qswsj 9m25s system:node:cluster-46vks-worker-0-gbkdn Pending
csr-qxmks 27m system:node:cluster-46vks-worker-0-ks6dk Pending
csr-r6wtx 24m system:node:cluster-46vks-worker-0-gbkdn Pending
csr-wtt9z 39m system:node:cluster-46vks-master-2 Approved,Issued
csr-xrb69 40m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
#
~~~
Doing the rest (patching the two remaining machines):
~~~
# oc patch node cluster-46vks-worker-0-ks6dk --type merge --patch '{"spec":{"providerID":"9dafee85-5cca-406b-8f73-fcd438fc67b1"}}'
# oc patch node cluster-46vks-worker-0-kt6jp --type merge --patch '{"spec":{"providerID":"aa1b34ce-1c4c-44d6-8566-ae4521067fe1"}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-ks6dk --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"9dafee85-5cca-406b-8f73-fcd438fc67b1"}}}}'
# oc -n openshift-machine-api patch machine cluster-46vks-worker-0-kt6jp --type merge --patch '{"spec":{"providerSpec":{"value":{"id":"aa1b34ce-1c4c-44d6-8566-ae4521067fe1"}}}}'
# oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
cluster-46vks-master-0 Running 46m
cluster-46vks-master-1 Running 46m
cluster-46vks-master-2 Running 46m
cluster-46vks-worker-0-2bzzm Running 36m
cluster-46vks-worker-0-gbkdn Running 36m
cluster-46vks-worker-0-ks6dk Running 36m
cluster-46vks-worker-0-kt6jp Running 36m
# oc get machineset -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
cluster-46vks-worker-0 4 4 4 4 46m
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5hmph 33m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-5p28b 2m3s system:node:cluster-46vks-worker-0-ks6dk Approved,Issued
csr-bcxhr 17m system:node:cluster-46vks-worker-0-ks6dk Pending
csr-cvj6x 31m system:node:cluster-46vks-worker-0-kt6jp Pending
csr-drsgg 27m system:node:cluster-46vks-worker-0-2bzzm Approved,Issued
csr-dtfl8 32m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-ffzwq 45m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-gzg75 16m system:node:cluster-46vks-worker-0-kt6jp Pending
csr-ljkbq 45m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-mmnwq 44m system:node:cluster-46vks-master-0 Approved,Issued
csr-npkp6 27m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-nxh8k 44m system:node:cluster-46vks-master-1 Approved,Issued
csr-qrhmh 29m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-qswsj 14m system:node:cluster-46vks-worker-0-gbkdn Approved,Issued
csr-qxmks 32m system:node:cluster-46vks-worker-0-ks6dk Pending
csr-r6wtx 29m system:node:cluster-46vks-worker-0-gbkdn Pending
csr-wtt9z 44m system:node:cluster-46vks-master-2 Approved,Issued
csr-wzncq 87s system:node:cluster-46vks-worker-0-kt6jp Approved,Issued
csr-xrb69 45m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
#
~~~
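In this run the Pending CSRs were eventually approved automatically (see below), but they can also be approved by hand to speed things up; a minimal sketch that approves every CSR currently in Pending, so review the list first:
~~~
# approve all currently Pending CSRs
oc get csr | awk '/Pending/ {print $1}' | xargs -r oc adm certificate approve
~~~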
waiting another ~10 minutes:
~~~
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5hmph 54m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-5p28b 22m system:node:cluster-46vks-worker-0-ks6dk Approved,Issued
csr-bcxhr 38m system:node:cluster-46vks-worker-0-ks6dk Approved,Issued
csr-cvj6x 52m system:node:cluster-46vks-worker-0-kt6jp Approved,Issued
csr-drsgg 48m system:node:cluster-46vks-worker-0-2bzzm Approved,Issued
csr-dtfl8 53m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-ffzwq 65m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-gzg75 37m system:node:cluster-46vks-worker-0-kt6jp Approved,Issued
csr-ljkbq 65m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-mmnwq 65m system:node:cluster-46vks-master-0 Approved,Issued
csr-npkp6 48m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-nxh8k 65m system:node:cluster-46vks-master-1 Approved,Issued
csr-qrhmh 50m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-qswsj 35m system:node:cluster-46vks-worker-0-gbkdn Approved,Issued
csr-qxmks 53m system:node:cluster-46vks-worker-0-ks6dk Approved,Issued
csr-r6wtx 50m system:node:cluster-46vks-worker-0-gbkdn Approved,Issued
csr-wtt9z 65m system:node:cluster-46vks-master-2 Approved,Issued
csr-wzncq 22m system:node:cluster-46vks-worker-0-kt6jp Approved,Issued
csr-xrb69 65m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
#
~~~
All seems to be sorted, so this could serve as a workaround until the issue is properly fixed.
I couldn't reproduce this issue with 4.6.0-0.nightly-2020-08-18-055142. Tried:
1. running an IPI installation with a cluster of 3 masters and 2 workers
2. manually scaling the MachineSet to 3 using oc:
oc scale --replicas=3 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api

[root@eslutsky-proxy-vm ~]# ./oc get machineset -n openshift-machine-api
NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
ovirt10-26k9v-worker-0   3         3         3       3           67m
[root@eslutsky-proxy-vm ~]# ./oc get machine -n openshift-machine-api
NAME                           PHASE     TYPE   REGION   ZONE   AGE
ovirt10-26k9v-master-0         Running                          67m
ovirt10-26k9v-master-1         Running                          67m
ovirt10-26k9v-master-2         Running                          67m
ovirt10-26k9v-worker-0-bmndg   Running                          57m
ovirt10-26k9v-worker-0-dbptj   Running                          13m
ovirt10-26k9v-worker-0-ghppj   Running                          57m
[root@eslutsky-proxy-vm ~]# ./oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ovirt10-26k9v-master-0         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-master-1         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-master-2         Ready    master   65m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-bmndg   Ready    worker   50m     v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-dbptj   Ready    worker   5m46s   v1.19.0-rc.2+99cb93a-dirty
ovirt10-26k9v-worker-0-ghppj   Ready    worker   39m     v1.19.0-rc.2+99cb93a-dirty

Hi Evgeny, one more thing comes to mind: try scaling the existing worker MachineSet to 0 and then back to 3.

The issue reproduced when I tried scaling up again to 4 workers, but this time RHV was unable to spawn the extra worker (out of resources).
In the RHV events:
Failed to run VM ovirt10-26k9v-worker-0-6mzq8 due to a failed validation: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host es-host-01 did not satisfy internal filter Memory because its available memory is too low (10415 MB) to run the VM., The host es-host-01 did not satisfy internal filter Memory because its available memory is too low (10415 MB) to run the VM.] (User: admin@internal-authz).
[root@eslutsky-proxy-vm ~]# ./oc get machines -A
NAMESPACE NAME PHASE TYPE REGION ZONE AGE
openshift-machine-api ovirt10-26k9v-master-0 Running 83m
openshift-machine-api ovirt10-26k9v-master-1 Running 83m
openshift-machine-api ovirt10-26k9v-master-2 Running 83m
openshift-machine-api ovirt10-26k9v-worker-0-6mzq8 Provisioned 6m57s
openshift-machine-api ovirt10-26k9v-worker-0-bmndg Running 73m
openshift-machine-api ovirt10-26k9v-worker-0-dbptj Running 29m
openshift-machine-api ovirt10-26k9v-worker-0-ghppj Running 73m
./oc describe machine/ovirt10-26k9v-worker-0-6mzq8 -n openshift-machine-api
Spec:
Metadata:
Provider ID: ac81b8df-ce0b-4e74-a393-4bb4d83607b6
Provider Spec:
Value:
API Version: ovirtproviderconfig.machine.openshift.io/v1beta1
cluster_id: 2b76bbe8-38c3-49a5-b5f3-23b4dd1f4326
Cpu:
Cores: 4
Sockets: 1
Threads: 1
Credentials Secret:
Name: ovirt-credentials
Id:
Kind: OvirtMachineProviderSpec
memory_mb: 16348
Metadata:
Creation Timestamp: <nil>
Name:
os_disk:
size_gb: 120
template_name: ovirt10-26k9v-rhcos
Type: server
User Data Secret:
Name: worker-user-data-managed
Status:
Last Updated: 2020-08-18T09:59:12Z
Phase: Provisioned
Provider Status:
Conditions:
Last Probe Time: 2020-08-18T09:59:12Z
Last Transition Time: 2020-08-18T09:59:12Z
Message: Machine successfully created
Reason: MachineCreateSucceeded
Status: True
Type: MachineCreated
Instance Id: ac81b8df-ce0b-4e74-a393-4bb4d83607b6
Instance State: down
Metadata:
Creation Timestamp: <nil>
Can you please confirm whether the workers are starting up in the RHV engine?
I think that doesn't count as a reproduction, since it's expected to run into problems when there aren't enough compute resources. I can confirm that when I hit this issue, the VMs were successfully started.

(In reply to Jan Zmeskal from comment #25)
> Hi Evgeny, one more thing comes to mind: Try scaling the existing worker
> MachineSet to 0 and then back to 3

When scaling to 0, the last worker wasn't deleted and got stuck in the 'Deleting' phase.
./oc scale --replicas=0 machineset ovirt10-26k9v-worker-0 -n openshift-machine-api
[root@eslutsky-proxy-vm ~]# ./oc get machines -A
NAMESPACE               NAME                           PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ovirt10-26k9v-master-0         Running                          128m
openshift-machine-api   ovirt10-26k9v-master-1         Running                          128m
openshift-machine-api   ovirt10-26k9v-master-2         Running                          128m
openshift-machine-api   ovirt10-26k9v-worker-0-dbptj   Deleting                         74m

After reproducing this issue in the QE rhv-4.3 env, with the latest OCP (4.6.0-0.nightly-2020-08-18-055142),
it appears to be caused by a failed DNS lookup attempt:
# oc logs machine-api-controllers-5bc7996949-wszvn -c machine-controller
I0818 13:49:18.142340 1 machineservice.go:270] Got VM by ID: primary-5tk29-worker-0-ml2v5
E0818 13:49:18.151995 1 actuator.go:295] failed to lookup the VM IP lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0818 13:49:18.152023 1 controller.go:286] Error updating machine "openshift-machine-api/primary-5tk29-worker-0-ml2v5": lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host
{"level":"error","ts":1597758558.15205,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/primary-5tk29-worker-0-ml2v5","error":"lookup primary-5tk29-worker-0-ml2v5 on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
So the IP never gets reported back to the Machine object:
oc get machine -o=custom-columns="name:.metadata.name,Status:status.phase,Address:.status.addresses[1].address"
name Status Address
primary-5tk29-master-0 Running 10.35.71.95
primary-5tk29-master-1 Running 10.35.71.94
primary-5tk29-master-2 Running 10.35.71.232
primary-5tk29-worker-0-86mmj Running 10.35.71.98
primary-5tk29-worker-0-ml2v5 Provisioned <none>
primary-5tk29-worker-0-nm9lt Running 10.35.71.99
primary-5tk29-worker-0-pv48v Running 10.35.71.97
The issue was resolved after removing the failed pod:
oc delete pod/machine-api-controllers-5bc7996949-wszvn
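Deleting the pod by name works because the Deployment recreates it; the same effect can be had without looking up the pod name, assuming the deployment is called machine-api-controllers (as the pod name above suggests):
~~~
# restart the machine-api controllers and wait for the new pod
oc -n openshift-machine-api rollout restart deployment/machine-api-controllers
oc -n openshift-machine-api rollout status deployment/machine-api-controllers
~~~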
(In reply to Jan Zmeskal from comment #13)
> Verification attempted with:
> openshift-install-linux-4.5.0-0.nightly-2020-07-29-051236 (The fix landed in
> 4.5.0-0.nightly-2020-07-28-182449)
> RHV 4.3.11.2-0.1.el7
>
> I scaled up the existing worker MachineSet and waited for the new worker
> machine to get into Running state. I waited for almost hour and half but it
> got stuck in Provisioned state. See here:

Can you please update and confirm whether you have verified the IPI part of the problem and only see an issue with scaling up? If so, please verify this bug and work with Evgeny to open a new one tracking scale-up issues in 4.5+.

This merged for 4.6 only; we need to cherry-pick it to 4.5.

This bug has Target Release 4.5.z, so it's either the wrong bug or it should not be in MODIFIED then.

Yes, sorry, didn't notice.

Verified using the same method as described here: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32
Using: openshift-install 4.5.0-0.nightly-2020-08-29-080432

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510