Description of problem:

When a manual node scale-up is performed, a new machine is created as expected. However, the machine is not powered on once it is created, so the scale-up never completes.

# oc scale machineset ocp4-d9g6x-worker --replicas=1 -n openshift-machine-api
# oc get machines
NAME                      PHASE         TYPE   REGION   ZONE   AGE
ocp4-d9g6x-worker-ch485   Provisioned                          69s

The VM will remain in "Provisioning" status. Digging into this, I saw the following events:

# oc get events --sort-by='{.lastTimestamp}' -n openshift-machine-api
LAST SEEN   TYPE      REASON         OBJECT                            MESSAGE
[...]
18s         Warning   FailedCreate   machine/ocp4-d9g6x-worker-ch485   ocp4-d9g6x-worker-ch485: reconciler failed to Create machine: task task-4264 has not finished
18s         Normal    Create         machine/ocp4-d9g6x-worker-ch485   Created Machine ocp4-d9g6x-worker-ch485
18s         Warning   FailedUpdate   machine/ocp4-d9g6x-worker-ch485   ocp4-d9g6x-worker-ch485: reconciler failed to Update machine: task task-4264 has not finished

In vCenter the VM appears as created but stopped, and there is a hint in the event viewer:

# govc events /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
[...]
[Wed Jan 20 15:44:27 2021] [info] Clone of rhcos-vmware completed
[Wed Jan 20 15:44:27 2021] [info] ocp4-d9g6x-worker-ch485 on host 192.168.1.160 in Datacenter is starting
[Wed Jan 20 15:44:27 2021] [info] Virtual machine ocp4-d9g6x-worker-ch485 failed to power on after cloning on host 192.168.1.160 in datacenter Datacenter

If I try to power on the VM manually, I get the reason behind this behaviour:

# govc tasks /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
Task                            Target                   Initiator      Queued    Started   Completed  Result
Datacenter.ExecuteVmPowerOnLRO  ocp4-d9g6x-worker-ch485  Administrator  14:55:29  14:55:29  14:55:30   error [Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]

Indeed, reviewing the template's settings, I see that sched.mem.min is not set. I would expect the machineset to assign this setting, but that is not the case. Let's see what changes are made when the template is cloned:

# govc events /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
[Wed Jan 20 15:43:33 2021] [info] Assigned new BIOS UUID (4214a327-0a27-c895-910c-25ad12b3e155) to ocp4-d9g6x-worker-ch485 on 192.168.1.160 in Datacenter
[Wed Jan 20 15:43:33 2021] [info] Assign a new instance UUID (5014aa79-3622-769b-0872-8b098be20762) to ocp4-d9g6x-worker-ch485
[Wed Jan 20 15:44:27 2021] [info] The instance UUID of ocp4-d9g6x-worker-ch485 has been changed from (5014aa79-3622-769b-0872-8b098be20762) to (30666ca2-2ad0-4d6c-8743-666d8c653278)
[Wed Jan 20 15:44:27 2021] [info] Reconfigured ocp4-d9g6x-worker-ch485 on 192.168.1.160 in Datacenter.
Modified:
config.instanceUuid: "5014aa79-3622-769b-0872-8b098be20762" -> "30666ca2-2ad0-4d6c-8743-666d8c653278";
config.annotation: "" -> "ocp4-d9g6x-worker-ch485";
config.hardware.numCPU: 2 -> 4;
config.hardware.device(2000).deviceInfo.summary: "16,777,216 KB" -> "125,829,120 KB";
config.hardware.device(2000).backing.uuid: "6000C29c-c243-3881-ac35-9a0144837b8b" -> "6000C294-1c9a-9e30-0740-421846617fc0";
config.hardware.device(2000).backing.contentId: "642d1cbb574b28f15be82d28104b3d61" -> "4917d41dd2757d31a9466fa2e719ffa0";
config.hardware.device(2000).capacityInKB: 16777216 -> 125829120;
config.hardware.device(2000).capacityInBytes: 17179869184 -> 128849018880;
config.hardware.device(100).device: (500, 12000, 1000) -> (500, 12000, 1000, 4000);

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         4h11m   Cluster version is 4.7.0-fc.3

How reproducible:

Steps to Reproduce:
1. create a machineset
2. scale up a node

Actual results:
The new node is created on the hypervisor but never starts.

Expected results:
The node should start and be part of the cluster.

Additional info:
Tweaking the template before cloning, by setting sched.mem.min equal to the memory size and ticking "Reserve all guest memory (All locked)", allows the scale-up to continue and finish as expected. I will attach a must-gather later today; if further information is required, just let me know.
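The power-on error in the description carries the memory size that the reservation has to match. As a rough sketch of how that value could be pulled out of `govc tasks` output, the helper below reads text on stdin and prints the number inside `memsize(...)`. The function name `required_mem_mb` is hypothetical, and the pattern assumes the error wording stays as quoted in this report.

```shell
#!/bin/sh
# Sketch only: required_mem_mb is a hypothetical helper, not part of govc.
# It reads task or event text on stdin and prints the value N from the first
# line that mentions sched.mem.min together with memsize(N).
# It prints nothing when no such line is present.
required_mem_mb() {
    sed -n 's/.*sched\.mem\.min.*memsize(\([0-9][0-9]*\)).*/\1/p' | head -n 1
}
```

It could be fed from the same command used above, for instance `govc tasks /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485 | required_mem_mb`, which for the error in this report would yield the value to use as the memory reservation in MB.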
Must-Gather: https://drive.google.com/file/d/1EIrnQKct85MfuXVfXBHehYIhCHRoQySd/view?usp=sharing
The machine-api doesn't create or manage the template. I think the basic settings come from the OVA, so perhaps that needs modification. Moving to RHCOS team for investigation.
@Roberto which version of vSphere are you using? @Luca could you investigate this? I don't believe we have changed anything in our OVA template between 4.6 -> 4.7, so this is surprising behavior. @Joe tagging you here for visibility, as you have been helpful with vSphere problems in the past.
Is the virtual machine deployed into a resource pool?
Did you set the latency sensitivity to high?

I tend to think this error is a vSphere environmental / configuration problem.
I have tested machine{sets} with UPI and never had this problem in VMC (VMware on AWS) or my nested environment, though that could be because I didn't increase the memory.

Also see:
https://kb.vmware.com/s/article/2002779
this sounds very similar to an issue that was raised through the okd community. there wasn't a good solution there, but i believe the user had identified local configuration issues. see https://github.com/openshift/okd/issues/419 for more info.
(In reply to Micah Abbott from comment #3)
> @Roberto which version of vSphere are you using?
>
> @Luca could you investigate this? I don't believe we have changed anything
> in our OVA template between 4.6 -> 4.7, so this is surprising behavior.
>
> @Joe tagging you here for visibility, as you have been helpful with vSphere
> problems in the past.

6.7.0 Update 3 (Build 16773714)

# govc about
Name:         VMware vCenter Server
Vendor:       VMware, Inc.
Version:      6.7.0
Build:        14368073
OS type:      linux-x64
API type:     VirtualCenter
API version:  6.7.3
Product ID:   vpx
UUID:         086cf68d-a6da-40e4-a9a9-d5db976eb1f3
(In reply to Joseph Callen from comment #4)
> Is the virtual machine deployed into a resource pool?
> Did you set the latency sensitivity to high?
>
> I tend to think this error is a vSphere environmental / configuration
> problem.
> I have tested machine{sets} with UPI never had this problem in VMC (VMware
> on AWS) or my nested environment.
> Though it could be because I didn't increase the memory.
>
>
> Also see:
> https://kb.vmware.com/s/article/2002779

- No, the VM is not in a resource pool.
- Yes, I did.
- Regarding the VMware KB article, that is exactly what I changed in the template config to make it work, but my understanding is that no manual actions beyond what the docs state [1] should be required.

"In the Virtual Hardware panel of the Customize hardware tab, modify the specified values as required. Ensure that the amount of RAM, CPU, and disk storage meets the minimum requirements for the machine type."[1]

These changes are made before install, but what about after? Who or what changes the template?

[1] https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere
(In reply to Michael McCune from comment #5)
> this sounds very similar to an issue that was raised through the okd
> community. there wasn't a good solution there, but i believe the user had
> identified local configuration issues. see
> https://github.com/openshift/okd/issues/419 for more info.

I'll double-check everything in case I missed something, and I'll keep you posted.
(In reply to Roberto from comment #7)
> > Also see:
> > https://kb.vmware.com/s/article/2002779
>
> - With regards to VMware kb, it's exactly what I changed in the template
> config to make it worked, but I understand that no manual actions, beyond
> what docs states[1], should be required.
>
> "In the Virtual Hardware panel of the Customize hardware tab, modify the
> specified values as required. Ensure that the amount of RAM, CPU, and disk
> storage meets the minimum requirements for the machine type."[1]
>
> These changes are made before install, but what about after? who or what
> does change the template?
>
> [1]
> https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere

We discussed this in a bug scrub today and we don't believe that any changes to the RHCOS OVA would resolve this issue.

Since there is a VMware KBase article about this exact error, and following its steps to change the configuration of the virtual machine causes the VM to power on successfully, I think the best course of action is to update our docs with some additional information about this situation.

I'd suggest a note/warning in the scale up instructions like so:

```
If a node appears to be stuck in the 'Provisioning' state after scaling up a MachineSet, users should investigate the status of the virtual machine in the vSphere instance itself. Users should use the VMware commands `govc tasks` and `govc events` to determine the status of the virtual machine.

If a similar error message to "[Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users should follow the steps in the following VMware KBase article in an attempt to resolve the issue.

https://kb.vmware.com/s/article/2002779
```
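The first step of the proposed note, finding the machine that has not progressed, could be sketched as below. The helper name `machines_in_phase` is hypothetical; it assumes the default column layout of `oc get machines` shown in the description, with NAME first and PHASE second, and takes the phase to look for as its only argument.

```shell
#!/bin/sh
# Sketch only: machines_in_phase is a hypothetical helper, not part of oc.
# It reads `oc get machines` output on stdin, skips the header row, and
# prints the NAME of every machine whose PHASE column equals $1.
machines_in_phase() {
    awk -v phase="$1" 'NR > 1 && $2 == phase { print $1 }'
}
```

For example, `oc get machines -n openshift-machine-api | machines_in_phase Provisioned` would list the machines to inspect with `govc tasks` and `govc events`; in this report the machine stayed in the Provisioned phase and its VM lived under /Datacenter/vm/ocp4-d9g6x/.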
FWIW, here's the vSphere OVA template we use for RHCOS - https://github.com/coreos/coreos-assembler/blob/master/src/vmware-template.xml
(In reply to Micah Abbott from comment #9)
> (In reply to Roberto from comment #7)
> > > Also see:
> > > https://kb.vmware.com/s/article/2002779
> >
> > - With regards to VMware kb, it's exactly what I changed in the template
> > config to make it worked, but I understand that no manual actions, beyond
> > what docs states[1], should be required.
> >
> > "In the Virtual Hardware panel of the Customize hardware tab, modify the
> > specified values as required. Ensure that the amount of RAM, CPU, and disk
> > storage meets the minimum requirements for the machine type."[1]
> >
> > These changes are made before install, but what about after? who or what
> > does change the template?
> >
> > [1]
> > https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere
>
> We discussed this in a bug scrub today and we don't believe that any changes
> to the RHCOS OVA would resolve this issue.
>
> Since there is a VMware KBase article about this exact error and following
> the steps to change the configuration of the virtual machine cause the VM to
> power on successfully, I think the best course of action is to update our
> docs with some additional information about this situation.
>
> I'd suggest a note/warning in the scale up instructions like so:
>
> ```
> If a node appears to be stuck in the 'Provisioning' state after scaling up a
> MachineSet, users should investigate the status of the virtual machine in
> the vSphere instance itself. Users should use the VMware commands `govc
> tasks` and `govc events` to determine the status of the virtual machine.
>
> If a similar error message to "[Invalid memory setting: memory reservation
> (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users
> should follow the steps in the following VMware KBase article in an attempt
> to resolve the issue.
>
> https://kb.vmware.com/s/article/2002779
> ```

In the meantime, I will publish a KCS article with this content.
*** Bug 1919239 has been marked as a duplicate of this bug. ***
In the vSphere doc https://docs.openshift.com/container-platform/4.7/installing/installing_vmc/installing-vmc-user-infra.html#installation-vsphere-machines_installing-vmc-user-infra, setting the Latency Sensitivity parameter to "High" is optional (the default value is "normal"). This may affect not only scaling up worker nodes but also fresh installations where Latency Sensitivity is set to High.

I installed a upi-on-vsphere cluster and found that if only "Latency Sensitivity" is set to High, master/worker nodes fail to be cloned or powered on, with the following errors in the vCenter GUI:

On the QE VMC (vSphere 7.0) env, the clone fails:
Error: error reconfiguring virtual machine: error reconfiguring virtual machine: A specified parameter was not correct: spec.memoryAllocation

On the Dev embedded vSphere 6.7 env on VMC, the clone succeeds but the power-on fails:
Error: Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 2500 MHz.

After setting the CPU reservation and retrying the clone from the rhcos template, an error is reported again, this time related to memory:
Error: invalid memory setting: memory reservation(sched.mem.min) should be equal to memsize(8192).

Would it be reasonable to update the doc to state that once "Latency Sensitivity" is set to "High", the VM's CPU reservation and memory reservation also need to be set to the current CPU/memory values (the default is empty)?
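The two power-on errors above imply a pair of conditions that could be checked before cloning. The sketch below encodes only what those messages state: with latency sensitivity set to high, the memory reservation must equal the memory size and the CPU reservation must reach a minimum. The helper name `check_latency_reservations` is hypothetical, and the CPU minimum is passed in as an argument because the 2500 MHz figure comes from one host's error message and is assumed to differ elsewhere.

```shell
#!/bin/sh
# Sketch only: check_latency_reservations is a hypothetical helper, not part
# of oc or govc. Arguments:
#   $1 latency sensitivity (for example normal or high)
#   $2 configured memory in MB
#   $3 memory reservation in MB
#   $4 CPU reservation in MHz
#   $5 minimum CPU reservation in MHz (2500 in the error quoted above)
# Prints one line per problem found and returns 1 if there is any.
check_latency_reservations() {
    latency=$1 mem_mb=$2 mem_res_mb=$3 cpu_res_mhz=$4 cpu_min_mhz=$5
    case "$latency" in
        high|High) ;;       # only high latency sensitivity adds constraints
        *) return 0 ;;
    esac
    status=0
    if [ "$mem_res_mb" -ne "$mem_mb" ]; then
        echo "memory reservation ${mem_res_mb} MB must equal memsize ${mem_mb} MB"
        status=1
    fi
    if [ "$cpu_res_mhz" -lt "$cpu_min_mhz" ]; then
        echo "CPU reservation ${cpu_res_mhz} MHz is below ${cpu_min_mhz} MHz"
        status=1
    fi
    return "$status"
}
```

Calling `check_latency_reservations high 8192 0 0 2500` reports both problems, which mirrors the sequence seen on the vSphere 6.7 environment where the CPU error appeared first and the memory error followed once the CPU reservation was set.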
Adding a known issue to the 4.5+ release notes in the following PRs:

4.5 - https://github.com/openshift/openshift-docs/pull/32240
4.6 - https://github.com/openshift/openshift-docs/pull/32241
4.7 - https://github.com/openshift/openshift-docs/pull/32243
4.8 - https://github.com/openshift/openshift-docs/pull/32245
Verified fix is published and live on docs.openshift.com:

https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-known-issues
https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-known-issues
https://docs.openshift.com/container-platform/4.7/release_notes/ocp-4-7-release-notes.html#ocp-4-7-known-issues

Verified fix will be available upon release of 4.8:

https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-known-issues