Bug 1817853
| Summary: | [RHV] New machine stuck at 'Provisioned' phase |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Cloud Compute |
| Cloud Compute sub component: | oVirt Provider |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | high |
| Version: | 4.4 |
| Target Release: | 4.6.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Keywords: | Reopened |
| Whiteboard: | ovirt |
| URL: | https://issues.redhat.com/browse/OCPRHV-240 |
| Reporter: | Jianwei Hou <jhou> |
| Assignee: | Evgeny Slutsky <eslutsky> |
| QA Contact: | Lucie Leistnerova <lleistne> |
| CC: | dvercill, eslutsky, fhirtz, gdeolive, gzaidman, hpopal, jzmeskal, lleistne, lpioro, michal.skrivanek, mleonard, mnunes, openshift-bugs-escalate, pelauter, rcunha, rgolan, wkulhane, zhsun |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, an attempt to perform manual scale-up (from GUI or with CLI) failed with an error. With this update, the scale-up works. |
| Clone Of: | 1815435 |
| Bug Depends On: | 1815435 |
| Bug Blocks: | 1854787 |
| Last Closed: | 2020-11-02 13:01:05 UTC |
Description
Jianwei Hou
2020-03-27 06:23:04 UTC
Test on 4.4.0-0.nightly-2020-03-26-041820, this problem doesn't exist.

1. On an IPI on RHV deployment, create a machineset with replicas=1
2. Machine phase becomes 'Running', node becomes 'Ready'.
3. oc get csr, csrs were automatically approved

$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
leopard-hw72r-worker-0 3 3 3 3 4d13h
leopard-hw72r-worker-00 1 1 1 1 64m

$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
leopard-hw72r-master-0 Running 4d13h
leopard-hw72r-master-1 Running 4d13h
leopard-hw72r-master-2 Running 4d13h
leopard-hw72r-worker-0-9snp8 Running 4d13h
leopard-hw72r-worker-0-dhflh Running 4d13h
leopard-hw72r-worker-0-wr85v Running 4d13h
leopard-hw72r-worker-00-nqqzv Running 64m

$ oc get node
NAME STATUS ROLES AGE VERSION
leopard-hw72r-master-0 Ready master 4d13h v1.17.1
leopard-hw72r-master-1 Ready master 4d13h v1.17.1
leopard-hw72r-master-2 Ready master 4d13h v1.17.1
leopard-hw72r-worker-0-9snp8 Ready worker 4d13h v1.17.1
leopard-hw72r-worker-0-dhflh Ready worker 4d13h v1.17.1
leopard-hw72r-worker-0-wr85v Ready worker 4d13h v1.17.1
leopard-hw72r-worker-00-nqqzv Ready worker 60m v1.17.1

$ oc get csr
NAME AGE REQUESTOR CONDITION
csr-2b2b7 60m system:node:leopard-hw72r-worker-00-nqqzv Approved,Issued
csr-ghrgb 60m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued

Reproduced this bug
clusterversion: 4.4.0-0.nightly-2020-04-04-025830
Created a new machineset with replicas=3; 1 machine joined the cluster and its node status became Ready. 2 machines are stuck at Provisioned status. "oc describe machine" shows "instanceState: down" for those machines.
# oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
secondary-xgbq9-worker-0 3 3 3 3 18h
secondary-xgbq9-worker-00 3 3 1 1 26m
# oc get machine
NAME PHASE TYPE REGION ZONE AGE
secondary-xgbq9-master-0 Running 18h
secondary-xgbq9-master-1 Running 18h
secondary-xgbq9-master-2 Running 18h
secondary-xgbq9-worker-0-hw22t Running 18h
secondary-xgbq9-worker-0-lcrzj Running 18h
secondary-xgbq9-worker-0-spwkj Running 18h
secondary-xgbq9-worker-00-8x7hq Running 26m
secondary-xgbq9-worker-00-g89d6 Provisioned 26m
secondary-xgbq9-worker-00-q8nqs Provisioned 26m
# oc get node
NAME STATUS ROLES AGE VERSION
secondary-xgbq9-master-0 Ready master 18h v1.17.1
secondary-xgbq9-master-1 Ready master 18h v1.17.1
secondary-xgbq9-master-2 Ready master 18h v1.17.1
secondary-xgbq9-worker-0-hw22t Ready worker 18h v1.17.1
secondary-xgbq9-worker-0-lcrzj Ready worker 18h v1.17.1
secondary-xgbq9-worker-0-spwkj Ready worker 18h v1.17.1
secondary-xgbq9-worker-00-8x7hq Ready worker 24m v1.17.1
# oc get machine secondary-xgbq9-worker-00-g89d6 -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
annotations:
instance-status: |
{"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"secondary-xgbq9-worker-00-g89d6","generateName":"secondary-xgbq9-worker-00-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/secondary-xgbq9-worker-00-g89d6","uid":"fc95124b-6cc1-4716-a234-f025b7daafe7","resourceVersion":"1560225","generation":1,"creationTimestamp":"2020-04-10T09:10:54Z","labels":{"machine.openshift.io/cluster-api-cluster":"secondary-xgbq9","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"secondary-xgbq9-worker-00"},"annotations":{"VmId":"dd35df17-8766-47c3-b4c3-94579ea8d4c4","instance-status":""},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"secondary-xgbq9-worker-00","uid":"ad0d6173-9b12-4c4f-8df7-c0501d365396","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"ovirtproviderconfig.openshift.io/v1beta1","cluster_id":"81602ae2-79a1-11ea-a904-00163e563201","credentialsSecret":{"name":"ovirt-credentials"},"id":"","kind":"OvirtMachineProviderSpec","metadata":{"creationTimestamp":null},"name":"","template_name":"rhcos_custom","userDataSecret":{"name":"worker-user-data"}}},"providerID":"dd35df17-8766-47c3-b4c3-94579ea8d4c4"},"status":{"lastUpdated":"2020-04-10T09:36:33Z","providerStatus":{"metadata":{"creationTimestamp":null},"instanceId":"secondary-xgbq9-worker-00-g89d6","instanceState":"down","conditions":[{"type":"MachineCreated","status":"True","lastProbeTime":"2020-04-10T09:36:33Z","lastTransitionTime":"2020-04-10T09:36:33Z","reason":"machineCreationSucceedReason","message":"machineCreationSucceedMessage"}]},"addresses":[{"type":"InternalDNS","address":"secondary-xgbq9-worker-00-g89d6"}],"phase":"Provisioned"}}
creationTimestamp: "2020-04-10T09:10:54Z"
finalizers:
- machine.machine.openshift.io
generateName: secondary-xgbq9-worker-00-
generation: 1
labels:
machine.openshift.io/cluster-api-cluster: secondary-xgbq9
machine.openshift.io/cluster-api-machine-role: worker
machine.openshift.io/cluster-api-machine-type: worker
machine.openshift.io/cluster-api-machineset: secondary-xgbq9-worker-00
name: secondary-xgbq9-worker-00-g89d6
namespace: openshift-machine-api
ownerReferences:
- apiVersion: machine.openshift.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: MachineSet
name: secondary-xgbq9-worker-00
uid: ad0d6173-9b12-4c4f-8df7-c0501d365396
resourceVersion: "1560252"
selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/secondary-xgbq9-worker-00-g89d6
uid: fc95124b-6cc1-4716-a234-f025b7daafe7
spec:
metadata:
creationTimestamp: null
providerSpec:
value:
apiVersion: ovirtproviderconfig.openshift.io/v1beta1
cluster_id: 81602ae2-79a1-11ea-a904-00163e563201
credentialsSecret:
name: ovirt-credentials
id: ""
kind: OvirtMachineProviderSpec
metadata:
creationTimestamp: null
name: ""
template_name: rhcos_custom
userDataSecret:
name: worker-user-data
status:
addresses:
- address: secondary-xgbq9-worker-00-g89d6
type: InternalDNS
lastUpdated: "2020-04-10T09:36:33Z"
phase: Provisioned
providerStatus:
conditions:
- lastProbeTime: "2020-04-10T09:36:33Z"
lastTransitionTime: "2020-04-10T09:36:33Z"
message: machineCreationSucceedMessage
reason: machineCreationSucceedReason
status: "True"
type: MachineCreated
instanceId: secondary-xgbq9-worker-00-g89d6
instanceState: down
metadata:
creationTimestamp: null
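For anyone triaging a similar case, the per-machine provider state can be pulled out without dumping the full YAML. This is only a sketch; the field paths are taken from the output above and may differ on other provider/API versions:

```
# List every Machine with its phase and the oVirt instance state reported by the provider
oc -n openshift-machine-api get machines \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,INSTANCE_STATE:.status.providerStatus.instanceState
```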
Created attachment 1677766 [details]
machine-controller.log
This is not a blocker for OCP 4.4.0 GA, target the earliest z-stream.

(In reply to sunzhaohua from comment #3)
> Reproduced this bug
> clusterversion: 4.4.0-0.nightly-2020-04-04-025830
>
> Created a new machineset with replicas=3; 1 machine joined the cluster and node
> status became Ready. 2 machines are stuck at Provisioned status. "oc describe
> machine" shows "instanceState: down" for those machines.

This means the VM is not running - can you make sure?

Hi Roy,
We didn't have access to the RHV management console, so all the data we had was from the OpenShift side. I have downloaded the log Zhaohua attached and I see messages like this:
```
I0410 09:12:38.044200 1 machineservice.go:144] creating VM: secondary-xgbq9-worker-00-g89d6
E0410 09:12:58.857808 1 actuator.go:257] Machine error secondary-xgbq9-worker-00-g89d6: Error running oVirt VM: Fault reason is "Operation Failed". Fault detail is "[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]". HTTP response code is "409". HTTP response message is "409 Conflict".
W0410 09:12:58.857839 1 controller.go:321] Failed to create machine "secondary-xgbq9-worker-00-g89d6": Error running oVirt VM: Fault reason is "Operation Failed". Fault detail is "[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]". HTTP response code is "409". HTTP response message is "409 Conflict".
{"level":"error","ts":1586509978.8578815,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xgbq9-worker-00-g89d6","error":"Error running oVirt VM: Fault reason is \"Operation Failed\". Fault detail is \"[Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host dell-r640-008.dsal.lab.eng.tlv2.redhat.com did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM.]\". HTTP response code is \"409\". HTTP response message is \"409 Conflict\".","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
```
It seems the machine hit an error during creation and then got stuck at 'Provisioning'.
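For reference, a quick way to surface this class of failure is to read the machine controller logs directly. A minimal sketch, assuming the controller still runs as the machine-controller container of the machine-api-controllers deployment (as in the pod names seen later in this bug):

```
# Grep the oVirt machine controller for VM start failures
oc -n openshift-machine-api logs deployment/machine-api-controllers -c machine-controller \
  | grep -i "Error running oVirt VM"
```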
Talk to your RHV admin, it seems the VM can't start because there is simply not enough memory. Look at the log: "did not satisfy internal filter Memory because its available memory is too low (9038 MB) to run the VM."

I've attached a case that shows this issue is still an issue:

[root@bastion ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
openshift-stage-wz4zh-master-0 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-master-1 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-master-2 Ready master 22h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-8w2xb Ready worker 18h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-ckzxm Ready worker 21h v1.17.1+912792b
openshift-stage-wz4zh-worker-0-n8qtb Ready worker 21h v1.17.1+912792b

[root@bastion ~]# oc -n openshift-machine-api get machine
NAME PHASE TYPE REGION ZONE AGE
openshift-stage-wz4zh-master-0 Running 22h
openshift-stage-wz4zh-master-1 Running 22h
openshift-stage-wz4zh-master-2 Running 22h
openshift-stage-wz4zh-worker-0-8w2xb Provisioned 17h
openshift-stage-wz4zh-worker-0-ckzxm Provisioned 21h
openshift-stage-wz4zh-worker-0-n8qtb Provisioned 21h

A must-gather may not be possible, but the customer was able to run *oc adm inspect* on namespaces, so if the must-gather issue cannot be resolved, what other information would you like? I've also requested log-collector data on the RHVM and a worker node.

Thank you,
m

Created attachment 1699204 [details]
must-gather from cluster
Created attachment 1699206 [details]
namespaces included are machine-api and machine-config-operator
Hi,
I have a similar issue on my OCP env - OCP 4.4.11 on RHV 4.3.9.4-11.el7

[root@bastion ~]# oc -n openshift-machine-api get machine
NAME PHASE TYPE REGION ZONE AGE
ocprhv-g86sn-master-0 Running 6d23h
ocprhv-g86sn-master-1 Running 6d23h
ocprhv-g86sn-master-2 Running 6d23h
ocprhv-g86sn-worker-0-8z6c4 Running 6d23h
ocprhv-g86sn-worker-0-fqdf5 Provisioned 6d23h
ocprhv-g86sn-worker-0-s68hx Provisioned 5d21h
ocprhv-g86sn-worker-0-xdjlz Provisioned 6d23h

The provider state is set to "Image locked" for the Machines in Provisioned state. I scaled up the MachineSet with one more worker; that went OK, the Node is Ready, but the status on the machine side is Provisioned. Tell me if you need data from OCP/RHV.

Luc Pioro
Account Solutions Architect

*** Bug 1840553 has been marked as a duplicate of this bug. ***

Verification steps:
1. Scale up the existing worker MachineSet -> new Machine ends up in Running state, Node is in Ready status, no pending CSRs
2. Scale down the existing worker MachineSet -> previously created Machine no longer listed among Machines and no longer listed among Nodes

So all of this is fine. However, then I hit the issue again in quite a strange situation. I'm going to describe what exactly I did.
1. oc get machineset primary-dlzl6-worker-0 -o yaml --export > ms.yaml (exported the original MachineSet)
2. vim ms.yaml (edited the file so that the replica count is 1, I changed nothing else)
3. oc apply -f ms.yaml
As nothing except the replica count was changed, the change was applied to the original MachineSet. It scaled down to 1 Machine without any issue.
4. In the web console, I scaled the MachineSet back to 3 replicas in one step

And here it comes. One of the newly created Machines got stuck in Provisioned state, was not listed among Nodes, and there are two pending CSRs. According to oc get machine -o yaml, the instanceState is reboot_in_progress. However, when I log into the RHV manager I can see the machine is not rebooting.

CLI output: http://pastebin.test.redhat.com/884795

Verification conducted with:
OCP 4.6.0-0.nightly-2020-07-15-004428
RHV 4.3.11.2-0.1.el7
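For what it's worth, the same scale-down/scale-up cycle can also be driven with oc scale instead of exporting and re-applying the manifest. A minimal sketch, using the MachineSet name from the steps above:

```
# Scale the worker MachineSet down to 1 replica, then back up to 3
# (MachineSets expose the scale subresource, so oc scale works on them)
oc -n openshift-machine-api scale machineset primary-dlzl6-worker-0 --replicas=1
# wait until the extra Machines and Nodes are gone, then scale back up
oc -n openshift-machine-api scale machineset primary-dlzl6-worker-0 --replicas=3
```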
Failing QA based on comment 23

After a discussion with Roy we found out that the worker machines got into Running state eventually, it just took longer than expected. Therefore moving this BZ to VERIFIED and opening a follow-up BZ with lower priority: https://bugzilla.redhat.com/show_bug.cgi?id=1858854

Failing in: openshift-4.6.0-0.nightly-2020-07-22-074636

I was playing with the scaling, and not only did I hit this issue scaling the cluster up, one machine also got stuck in a new state (Provisioning).

The steps:
1. On project '' I scaled up the machineset '' from 3 to 4
# oc edit machineset secondary-xpvlm-worker-0
(Changed replicas from 3 to 4)
2. Once finished, I scaled down to 1
3. Then I scaled up to 3

The results:

# oc get machines
NAME PHASE TYPE REGION ZONE AGE
secondary-xpvlm-master-0 Running 3d
secondary-xpvlm-master-1 Running 3d
secondary-xpvlm-master-2 Running 3d
secondary-xpvlm-worker-0-2bkts Provisioning 18h
secondary-xpvlm-worker-0-55clj Provisioned 18h
secondary-xpvlm-worker-0-zwbb6 Running 3d

# oc get machinesets
NAME DESIRED CURRENT READY AVAILABLE AGE
secondary-xpvlm-worker-0 3 3 1 1 3d

# oc logs machine-approver-6c6964ffb-kdr4w machine-approver-controller
E0729 19:44:18.096102 1 reflector.go:380] github.com/openshift/cluster-machine-approver/main.go:239: Failed to watch *v1beta1.CertificateSigningRequest: Get "https://127.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?allowWatchBookmarks=true&resourceVersion=2575390&timeoutSeconds=579&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
E0729 19:44:18.096868 1 reflector.go:380] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/clusteroperators?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmachine-approver&resourceVersion=2818758&timeoutSeconds=426&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
E0729 19:44:23.432116 1 reflector.go:380] github.com/openshift/cluster-machine-approver/status.go:99: Failed to watch *v1.ClusterOperator: the server is currently unable to handle the request (get clusteroperators.config.openshift.io)
I0729 19:47:29.895233 1 main.go:147] CSR csr-hjndh added
I0729 19:47:29.962273 1 main.go:182] CSR csr-hjndh not authorized: failed to find machine for node secondary-xpvlm-worker-0-55clj
I0729 19:47:29.962316 1 main.go:218] Error syncing csr csr-hjndh: failed to find machine for node secondary-xpvlm-worker-0-55clj
I0729 19:47:29.967613 1 main.go:147] CSR csr-hjndh added
I0729 19:47:29.993158 1 main.go:182] CSR csr-hjndh not authorized: failed to find machine for node secondary-xpvlm-worker-0-55clj
(...)

It basically continues repeating itself until the CSRs fall out of the expected time range and it starts dumping the following:

(...)
I0730 03:00:37.931211 1 main.go:147] CSR csr-6rcxd added
I0730 03:00:37.950011 1 main.go:182] CSR csr-6rcxd not authorized: CSR csr-6rcxd creation time 2020-07-30 03:00:37 +0000 UTC not in range (2020-07-29 19:42:14 +0000 UTC, 2020-07-29 21:42:24 +0000 UTC)
I0730 03:00:37.950152 1 main.go:218] Error syncing csr csr-6rcxd: CSR csr-6rcxd creation time 2020-07-30 03:00:37 +0000 UTC not in range (2020-07-29 19:42:14 +0000 UTC, 2020-07-29 21:42:24 +0000 UTC)
(...)

# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-28n6h 16h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-29lw7 16h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-2nq66 8h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-422tb 6h24m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-4298h 60m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-459kq 11h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-49jvz 17h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
(...)
There are 73 of these
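As a side note for anyone triaging a cluster in this state: pending kubelet CSRs like the ones listed above can be reviewed and, once confirmed legitimate, approved by hand. A minimal sketch with standard oc tooling (the csr name is taken from the listing above; approve only requests you recognize):

```
# Review a pending CSR, then approve it
oc describe csr csr-28n6h
oc adm certificate approve csr-28n6h

# Or approve all currently listed CSRs in one go (use with care)
oc get csr -o name | xargs oc adm certificate approve
```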
# oc logs machine-api-controllers-675ccc84b7-4d5lw machine-controller
I0729 19:40:45.376076 1 reflector.go:158] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126
I0729 19:40:45.387496 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-2"
I0729 19:40:45.387553 1 controller.go:376] Machine "secondary-xpvlm-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.417705 1 machineservice.go:262] Fetching VM by ID: cbbd193d-72e2-4d02-bc70-d2961b8cc953
I0729 19:40:45.536577 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-2
I0729 19:40:45.536630 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-2" triggers idempotent update
I0729 19:40:45.554546 1 machineservice.go:262] Fetching VM by ID: cbbd193d-72e2-4d02-bc70-d2961b8cc953
I0729 19:40:45.596294 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-2
I0729 19:40:45.602572 1 actuator.go:262] Updating machine resource
I0729 19:40:45.673792 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:45.691730 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-zwbb6"
I0729 19:40:45.691761 1 controller.go:376] Machine "secondary-xpvlm-worker-0-zwbb6" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.702461 1 machineservice.go:262] Fetching VM by ID: eb9c15d7-7f60-4576-820f-3becbcd1b2ba
I0729 19:40:45.736967 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-zwbb6
I0729 19:40:45.737045 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-zwbb6" triggers idempotent update
I0729 19:40:45.755392 1 machineservice.go:262] Fetching VM by ID: eb9c15d7-7f60-4576-820f-3becbcd1b2ba
I0729 19:40:45.789264 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-zwbb6
I0729 19:40:45.808221 1 actuator.go:262] Updating machine resource
I0729 19:40:45.833615 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:45.846750 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-0"
I0729 19:40:45.846786 1 controller.go:376] Machine "secondary-xpvlm-master-0" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:45.888556 1 machineservice.go:262] Fetching VM by ID: 88215f56-fefc-408d-ae80-9c6e3e780292
I0729 19:40:45.921894 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-0
I0729 19:40:45.921940 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-0" triggers idempotent update
I0729 19:40:45.956470 1 machineservice.go:262] Fetching VM by ID: 88215f56-fefc-408d-ae80-9c6e3e780292
I0729 19:40:46.003732 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-0
I0729 19:40:46.007976 1 actuator.go:262] Updating machine resource
I0729 19:40:46.019861 1 actuator.go:271] Updating machine status sub-resource
I0729 19:40:46.029360 1 controller.go:164] Reconciling Machine "secondary-xpvlm-master-1"
I0729 19:40:46.029392 1 controller.go:376] Machine "secondary-xpvlm-master-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:40:46.039276 1 machineservice.go:262] Fetching VM by ID: a01683dc-4071-48cd-a40f-56987a663ca3
I0729 19:40:46.138830 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-1
I0729 19:40:46.138885 1 controller.go:284] Reconciling machine "secondary-xpvlm-master-1" triggers idempotent update
I0729 19:40:46.147479 1 machineservice.go:262] Fetching VM by ID: a01683dc-4071-48cd-a40f-56987a663ca3
I0729 19:40:46.182617 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-master-1
I0729 19:40:46.188361 1 actuator.go:262] Updating machine resource
I0729 19:40:46.248731 1 actuator.go:271] Updating machine status sub-resource
I0729 19:42:25.007316 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-55clj"
I0729 19:42:25.007371 1 controller.go:376] Machine "secondary-xpvlm-worker-0-55clj" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:42:25.026622 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:42:25.026653 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:42:46.082403 1 machineservice.go:196] Extending the OS disk from 17179869184 to 128849018880
I0729 19:42:46.420546 1 machineservice.go:209] Waiting while extending the OS disk
I0729 19:44:26.993616 1 actuator.go:262] Updating machine resource
I0729 19:44:27.014082 1 actuator.go:271] Updating machine status sub-resource
I0729 19:44:27.025247 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:44:27.025311 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:44:27.087045 1 controller.go:428] Machine "secondary-xpvlm-worker-0-2bkts" going into phase "Provisioning"
I0729 19:44:27.098363 1 controller.go:319] Reconciling machine object secondary-xpvlm-worker-0-2bkts triggers idempotent create.
I0729 19:44:27.142084 1 machineservice.go:126] creating VM: secondary-xpvlm-worker-0-2bkts
I0729 19:44:47.993861 1 machineservice.go:196] Extending the OS disk from 17179869184 to 128849018880
I0729 19:44:48.259039 1 machineservice.go:209] Waiting while extending the OS disk
I0729 19:46:09.001274 1 actuator.go:262] Updating machine resource
I0729 19:46:09.142437 1 actuator.go:271] Updating machine status sub-resource
I0729 19:46:09.175710 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-55clj"
I0729 19:46:09.175784 1 controller.go:376] Machine "secondary-xpvlm-worker-0-55clj" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:46:09.191239 1 machineservice.go:262] Fetching VM by ID: a7a7e81e-098c-458f-a8fa-d1060b782b9f
I0729 19:46:09.299097 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-55clj
I0729 19:46:09.299135 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-55clj" triggers idempotent update
I0729 19:46:09.310142 1 machineservice.go:262] Fetching VM by ID: a7a7e81e-098c-458f-a8fa-d1060b782b9f
I0729 19:46:09.342089 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-55clj
E0729 19:46:09.522784 1 actuator.go:295] failed to lookup the VM IP lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 19:46:09.522841 1 controller.go:286] Error updating machine "openshift-machine-api/secondary-xpvlm-worker-0-55clj": lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host
{"level":"error","ts":1596051969.5229006,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xpvlm-worker-0-55clj","error":"lookup secondary-xpvlm-worker-0-55clj on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
I0729 19:46:10.523314 1 controller.go:164] Reconciling Machine "secondary-xpvlm-worker-0-2bkts"
I0729 19:46:10.523382 1 controller.go:376] Machine "secondary-xpvlm-worker-0-2bkts" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0729 19:46:10.534959 1 machineservice.go:262] Fetching VM by ID: 8edcee4b-aa4e-4499-b8bb-044b07f8260e
I0729 19:46:10.565728 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-2bkts
I0729 19:46:10.565778 1 controller.go:284] Reconciling machine "secondary-xpvlm-worker-0-2bkts" triggers idempotent update
I0729 19:46:10.576903 1 machineservice.go:262] Fetching VM by ID: 8edcee4b-aa4e-4499-b8bb-044b07f8260e
I0729 19:46:10.609187 1 machineservice.go:270] Got VM by ID: secondary-xpvlm-worker-0-2bkts
E0729 19:46:10.773276 1 actuator.go:295] failed to lookup the VM IP lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host - skip setting addresses for this machine
E0729 19:46:10.773321 1 controller.go:286] Error updating machine "openshift-machine-api/secondary-xpvlm-worker-0-2bkts": lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host
{"level":"error","ts":1596051970.7733457,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/secondary-xpvlm-worker-0-2bkts","error":"lookup secondary-xpvlm-worker-0-2bkts on 172.30.0.10:53: no such host","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
(...)
It basically continues repeating itself

(In reply to Guilherme Santos from comment #29)
> Failing in: openshift-4.6.0-0.nightly-2020-07-22-074636
>
> I was playing with the scaling, and not only did I hit this issue scaling the
> cluster up, one machine also got stuck in a new state (Provisioning).

Hi, please follow up with Evgeni or open a new bug on that, and keep this one VERIFIED. This is not related to the originally reported issue on IPI installation.

The engine log is useless as it doesn't cover the period. Can you attach the right one?

All the new logging added in that PR is not displayed by default, can you increase the log level please?

The VM may be running, but the guest not necessarily. RHV host logs and verifying that the guest actually runs would be helpful.

Evgeny found out that the worker VM hasn't booted - most probably because we've exhausted our IP address pool. Let me run one more iteration of the test and if that goes smoothly, I'll move to VERIFIED.

After some investigation, we found out that the issue described in comment #33 was caused by our DHCP infra setup, and is therefore unrelated to the OCP on RHV functionality. I performed the OCP deployment 4 times, each time doing a series of scale-downs and scale-ups, and did not encounter the original issue. I believe this is a big enough sample to verify the fix.
Before this fix was delivered, I used to encounter the original issue in about 75 % of scale-up attempts, so I believe that if it were still present, it would have popped up during my verification.

As the scale-up issue is no longer encountered, this seems to indicate that the root cause of machines being stuck in Provisioned state during cluster deployment has also been removed. This is far more difficult to confirm 100 %, though, as that issue reproduced only rarely during cluster deployment. If the issue during cluster deployment should re-appear, I think it's safe to assume that its root cause is different from this one.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

As mentioned in comment #39, this bug was CLOSED ERRATA. If the solution does not work for you, open a new bug report. Please don't reopen this bug, as the problem you're seeing may have a similar result but a different root cause, and deserves to be analyzed accordingly.