Bug 1918383
Summary: | [UPI vSphere] node scale up doesn't work as expected | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Roberto <rdiazgav> |
Component: | Documentation | Assignee: | Lindsey Barbee-Vargas <lbarbeev> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Xiaoli Tian <xtian> |
Severity: | high | Docs Contact: | Vikram Goyal <vigoyal> |
Priority: | low | ||
Version: | 4.7 | CC: | aos-bugs, bbreard, imcleod, jcallen, jima, jligon, jokerman, mgugino, miabbott, mimccune, nstielau, rdiazgav, rsandu, trees |
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: |
vSphere provider
esx version:
6.7.0 Update 3 (Build 16773714)
vcenter:
# govc about
Name: VMware vCenter Server
Vendor: VMware, Inc.
Version: 6.7.0
Build: 14368073
OS type: linux-x64
API type: VirtualCenter
API version: 6.7.3
Product ID: vpx
UUID: 086cf68d-a6da-40e4-a9a9-d5db976eb1f3
|
|
Last Closed: | 2021-06-02 19:34:28 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Roberto
2021-01-20 15:29:44 UTC
The machine-api doesn't create or manage the template. I think the basic settings come from the OVA, so perhaps that needs modification. Moving to RHCOS team for investigation.

@Roberto which version of vSphere are you using?

@Luca could you investigate this? I don't believe we have changed anything in our OVA template between 4.6 -> 4.7, so this is surprising behavior.

@Joe tagging you here for visibility, as you have been helpful with vSphere problems in the past.

Is the virtual machine deployed into a resource pool?
Did you set the latency sensitivity to high?

I tend to think this error is a vSphere environmental / configuration problem.
I have tested machine{sets} with UPI and never had this problem in VMC (VMware on AWS) or my nested environment.
Though it could be because I didn't increase the memory.

Also see: https://kb.vmware.com/s/article/2002779

this sounds very similar to an issue that was raised through the okd community. there wasn't a good solution there, but i believe the user had identified local configuration issues. see https://github.com/openshift/okd/issues/419 for more info.

(In reply to Micah Abbott from comment #3)
> @Roberto which version of vSphere are you using?
>
> @Luca could you investigate this? I don't believe we have changed anything
> in our OVA template between 4.6 -> 4.7, so this is surprising behavior.
>
> @Joe tagging you here for visibility, as you have been helpful with vSphere
> problems in the past.

6.7.0 Update 3 (Build 16773714)

# govc about
Name: VMware vCenter Server
Vendor: VMware, Inc.
Version: 6.7.0
Build: 14368073
OS type: linux-x64
API type: VirtualCenter
API version: 6.7.3
Product ID: vpx
UUID: 086cf68d-a6da-40e4-a9a9-d5db976eb1f3

(In reply to Joseph Callen from comment #4)
> Is the virtual machine deployed into a resource pool?
> Did you set the latency sensitivity to high?
>
> I tend to think this error is a vSphere environmental / configuration
> problem.
> I have tested machine{sets} with UPI and never had this problem in VMC (VMware
> on AWS) or my nested environment.
> Though it could be because I didn't increase the memory.
>
> Also see:
> https://kb.vmware.com/s/article/2002779

- No, the VM is not in a resource pool.
- Yes, I did.
- With regard to the VMware KB, it's exactly what I changed in the template config to make it work, but I understand that no manual actions, beyond what the docs state[1], should be required.

"In the Virtual Hardware panel of the Customize hardware tab, modify the specified values as required. Ensure that the amount of RAM, CPU, and disk storage meets the minimum requirements for the machine type."[1]

These changes are made before install, but what about after? Who or what changes the template?

[1] https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere

(In reply to Michael McCune from comment #5)
> this sounds very similar to an issue that was raised through the okd
> community. there wasn't a good solution there, but i believe the user had
> identified local configuration issues. see
> https://github.com/openshift/okd/issues/419 for more info.

I'll double-check everything, just in case I missed something, and I'll keep you posted.

(In reply to Roberto from comment #7)
> > Also see:
> > https://kb.vmware.com/s/article/2002779
>
> - With regard to the VMware KB, it's exactly what I changed in the template
> config to make it work, but I understand that no manual actions, beyond
> what the docs state[1], should be required.
>
> "In the Virtual Hardware panel of the Customize hardware tab, modify the
> specified values as required. Ensure that the amount of RAM, CPU, and disk
> storage meets the minimum requirements for the machine type."[1]
>
> These changes are made before install, but what about after? Who or what
> changes the template?
>
> [1]
> https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere

We discussed this in a bug scrub today and we don't believe that any changes to the RHCOS OVA would resolve this issue.

Since there is a VMware KBase article about this exact error, and following the steps to change the configuration of the virtual machine causes the VM to power on successfully, I think the best course of action is to update our docs with some additional information about this situation.

I'd suggest a note/warning in the scale up instructions like so:

```
If a node appears to be stuck in the 'Provisioning' state after scaling up a MachineSet, users should investigate the status of the virtual machine in the vSphere instance itself. Users should use the VMware commands `govc tasks` and `govc events` to determine the status of the virtual machine.

If a similar error message to "[Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users should follow the steps in the following VMware KBase article in an attempt to resolve the issue.

https://kb.vmware.com/s/article/2002779
```

FWIW, here's the vSphere OVA template we use for RHCOS - https://github.com/coreos/coreos-assembler/blob/master/src/vmware-template.xml

(In reply to Micah Abbott from comment #9)
> (In reply to Roberto from comment #7)
> > > Also see:
> > > https://kb.vmware.com/s/article/2002779
> >
> > - With regard to the VMware KB, it's exactly what I changed in the template
> > config to make it work, but I understand that no manual actions, beyond
> > what the docs state[1], should be required.
> >
> > "In the Virtual Hardware panel of the Customize hardware tab, modify the
> > specified values as required.
> > Ensure that the amount of RAM, CPU, and disk
> > storage meets the minimum requirements for the machine type."[1]
> >
> > These changes are made before install, but what about after? Who or what
> > changes the template?
> >
> > [1]
> > https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere
>
> We discussed this in a bug scrub today and we don't believe that any changes
> to the RHCOS OVA would resolve this issue.
>
> Since there is a VMware KBase article about this exact error, and following
> the steps to change the configuration of the virtual machine causes the VM to
> power on successfully, I think the best course of action is to update our
> docs with some additional information about this situation.
>
> I'd suggest a note/warning in the scale up instructions like so:
>
> ```
> If a node appears to be stuck in the 'Provisioning' state after scaling up a
> MachineSet, users should investigate the status of the virtual machine in
> the vSphere instance itself. Users should use the VMware commands `govc
> tasks` and `govc events` to determine the status of the virtual machine.
>
> If a similar error message to "[Invalid memory setting: memory reservation
> (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users
> should follow the steps in the following VMware KBase article in an attempt
> to resolve the issue.
>
> https://kb.vmware.com/s/article/2002779
> ```

In the meantime, I will publish a KCS with such content.

*** Bug 1919239 has been marked as a duplicate of this bug. ***

In the vSphere doc https://docs.openshift.com/container-platform/4.7/installing/installing_vmc/installing-vmc-user-infra.html#installation-vsphere-machines_installing-vmc-user-infra, setting the Latency Sensitivity parameter to "High" is optional (the default value is "normal").
This may impact not only scaling up worker nodes but also a fresh installation once Latency Sensitivity is set to High.

I installed a UPI cluster on a vSphere env and found that if only "Latency Sensitivity" is set to High, master/worker nodes fail to be cloned or powered on, with the following errors in the vCenter GUI:

On the QE VMC (vSphere 7.0) env, the clone failed:
Error: error reconfiguring virtual machine: error reconfiguring virtual machine: A specified parameter was not correct: spec.memoryAllocation

On the Dev embedded vSphere 6.7 env on VMC, the clone step succeeded, but power-on failed:
Error: Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 2500 MHz.

After setting the CPU reservation and retrying the clone from the RHCOS template, an error was reported again, this time related to memory:
Error: invalid memory setting: memory reservation(sched.mem.min) should be equal to memsize(8192).

Do you think it is reasonable to update the doc to state that once the "Latency Sensitivity" parameter is set to "High", the VM's CPU reservation and memory reservation also need to be set to the current CPU/memory values (the default is empty)?

Adding a known issue to the 4.5+ release notes in the following PRs:
4.5 - https://github.com/openshift/openshift-docs/pull/32240
4.6 - https://github.com/openshift/openshift-docs/pull/32241
4.7 - https://github.com/openshift/openshift-docs/pull/32243
4.8 - https://github.com/openshift/openshift-docs/pull/32245

Verified fix is published and live on docs.openshift.com:
https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-known-issues
https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-known-issues
https://docs.openshift.com/container-platform/4.7/release_notes/ocp-4-7-release-notes.html#ocp-4-7-known-issues

Verified fix will be available upon release of 4.8:
https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-known-issues
|
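
For anyone triaging a node stuck in 'Provisioning', the reservation errors quoted in this bug can be picked out of saved vSphere task or event text with a small script. The following is only a sketch, not part of any fix: the function name `required_reservations` is hypothetical, and the patterns are modelled solely on the error strings reported above, so other vSphere versions may word them differently.

```python
import re

# Patterns modelled on the error strings quoted in this bug (assumption:
# the wording may differ on other vSphere versions).
MEM_RE = re.compile(
    r"memory reservation\s*\(sched\.mem\.min\)\s*should be equal to memsize\((\d+)\)",
    re.IGNORECASE,
)
CPU_RE = re.compile(
    r"\(sched\.cpu\.min\)\s*should be at least\s*(\d+)\s*MHz",
    re.IGNORECASE,
)


def required_reservations(message: str) -> dict:
    """Return the reservation values that an error message asks for.

    Keys are the vSphere setting names that appear in the messages
    ('sched.mem.min', 'sched.cpu.min'); values are the numbers exactly as
    reported. The dict is empty when the message matches neither pattern.
    """
    found = {}
    mem = MEM_RE.search(message)
    if mem:
        found["sched.mem.min"] = int(mem.group(1))
    cpu = CPU_RE.search(message)
    if cpu:
        found["sched.cpu.min"] = int(cpu.group(1))
    return found


if __name__ == "__main__":
    sample = (
        "Error: Invalid CPU reservation for the latency-sensitive VM, "
        "(sched.cpu.min) should be at least 2500 MHz."
    )
    print(required_reservations(sample))
```

The input text could be whatever `govc tasks` or `govc events` printed for the affected virtual machine. A non-empty result suggests the reservation problem described in https://kb.vmware.com/s/article/2002779 rather than an issue with the RHCOS OVA.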