Bug 1918383

Summary:	[UPI vSphere] node scale up doesn't work as expected
Product:	OpenShift Container Platform	Reporter:	Roberto <rdiazgav>
Component:	Documentation	Assignee:	Lindsey Barbee-Vargas <lbarbeev>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Xiaoli Tian <xtian>
Severity:	high	Docs Contact:	Vikram Goyal <vigoyal>
Priority:	low
Version:	4.7	CC:	aos-bugs, bbreard, imcleod, jcallen, jima, jligon, jokerman, mgugino, miabbott, mimccune, nstielau, rdiazgav, rsandu, trees
Target Milestone:	---
Target Release:	4.7.0
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	vSphere provider esx version: 6.7.0 Update 3 (Build 16773714) vcenter: # govc about Name: VMware vCenter Server Vendor: VMware, Inc. Version: 6.7.0 Build: 14368073 OS type: linux-x64 API type: VirtualCenter API version: 6.7.3 Product ID: vpx UUID: 086cf68d-a6da-40e4-a9a9-d5db976eb1f3
Last Closed:	2021-06-02 19:34:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Roberto 2021-01-20 15:29:44 UTC

Description of problem:

When a manual node scale-up is performed, a new machine is created, so far so good. However, the machine is not powered on once it's created, so the scale-up never completes.

# oc scale machineset ocp4-d9g6x-worker --replicas=1 -n openshift-machine-api
# oc get machines
NAME                      PHASE         TYPE   REGION   ZONE   AGE
ocp4-d9g6x-worker-ch485   Provisioned                          69s                       

The VM will remain in "Provisioning" status.

Digging around this fact I saw the following event:
# oc get events --sort-by='{.lastTimestamp}' -n openshift-machine-api
LAST SEEN   TYPE      REASON            OBJECT                            MESSAGE
[...]
18s         Warning   FailedCreate      machine/ocp4-d9g6x-worker-ch485                 ocp4-d9g6x-worker-ch485: reconciler failed to Create machine: task task-4264 has not finished
18s         Normal    Create            machine/ocp4-d9g6x-worker-ch485                 Created Machine ocp4-d9g6x-worker-ch485
18s         Warning   FailedUpdate      machine/ocp4-d9g6x-worker-ch485                 ocp4-d9g6x-worker-ch485: reconciler failed to Update machine: task task-4264 has not finished


From vCenter, the VM appears created but stopped; however, there is a hint in the event viewer:


# govc events /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
[...]
[Wed Jan 20 15:44:27 2021] [info] Clone of rhcos-vmware completed
[Wed Jan 20 15:44:27 2021] [info] ocp4-d9g6x-worker-ch485 on host 192.168.1.160 in Datacenter is starting
[Wed Jan 20 15:44:27 2021] [info] Virtual machine ocp4-d9g6x-worker-ch485 failed to power on after cloning on host 192.168.1.160 in datacenter Datacenter
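
A minimal sketch, assuming the `govc events` output has been captured as text, of how the power-on failure above could be picked out automatically. The function name and the regular expression are illustrative only; the pattern is derived from the event wording quoted above, which may differ on other vSphere versions.

```python
import re

# Pattern based on the event text quoted in this report (an assumption:
# other vSphere versions may word the event differently).
POWER_ON_FAILURE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[(?P<level>\w+)\] "
    r"Virtual machine (?P<vm>\S+) failed to power on after cloning"
)


def find_power_on_failures(events_text):
    """Return (timestamp, vm_name) for each power-on failure event line."""
    failures = []
    for line in events_text.splitlines():
        match = POWER_ON_FAILURE.match(line.strip())
        if match:
            failures.append((match.group("when"), match.group("vm")))
    return failures


sample = """\
[Wed Jan 20 15:44:27 2021] [info] Clone of rhcos-vmware completed
[Wed Jan 20 15:44:27 2021] [info] ocp4-d9g6x-worker-ch485 on host 192.168.1.160 in Datacenter is starting
[Wed Jan 20 15:44:27 2021] [info] Virtual machine ocp4-d9g6x-worker-ch485 failed to power on after cloning on host 192.168.1.160 in datacenter Datacenter
"""

print(find_power_on_failures(sample))
```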

If I try to power on the VM manually I get the reason behind this behaviour:

# govc tasks /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
Task                                     Target                         Initiator                         Queued   Started Completed Result
Datacenter.ExecuteVmPowerOnLRO           ocp4-d9g6x-worker-ch485        Administrator                   14:55:29  14:55:29  14:55:30 error   [Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]

Indeed, reviewing the template's settings, I see that this setting (sched.mem.min) is not set. I would expect the machineset to assign it, but that is not the case.
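
A minimal sketch of extracting the required reservation from the task error, assuming the message keeps the wording quoted in this report. The helper name is hypothetical, and the pattern also covers the `sched.cpu.min` variant reported later in this bug.

```python
import re

# Matches the two message shapes quoted in this report:
#   "... (sched.mem.min) should be equal to memsize(8192)."
#   "... (sched.cpu.min) should be at least 2500 MHz."
RESERVATION_ERROR = re.compile(
    r"\((?P<setting>sched\.(?:mem|cpu)\.min)\)\s+should be\s+"
    r"(?P<relation>equal to|at least)\s+(?:memsize\()?(?P<value>\d+)"
)


def parse_reservation_error(message):
    """Return (setting, relation, value) or None if the message does not match."""
    match = RESERVATION_ERROR.search(message)
    if match is None:
        return None
    return match.group("setting"), match.group("relation"), int(match.group("value"))


error = (
    "[Invalid memory setting: memory reservation (sched.mem.min) "
    "should be equal to memsize(8192). ]"
)
print(parse_reservation_error(error))
```

The returned value is in the unit the message uses: MB for the memory reservation and MHz for the CPU reservation.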

Let's see what changes are made when the template is cloned:

# govc events /Datacenter/vm/ocp4-d9g6x/ocp4-d9g6x-worker-ch485
[Wed Jan 20 15:43:33 2021] [info] Assigned new BIOS UUID (4214a327-0a27-c895-910c-25ad12b3e155) to ocp4-d9g6x-worker-ch485 on 192.168.1.160 in Datacenter
[Wed Jan 20 15:43:33 2021] [info] Assign a new instance UUID (5014aa79-3622-769b-0872-8b098be20762) to ocp4-d9g6x-worker-ch485
[Wed Jan 20 15:44:27 2021] [info] The instance UUID of ocp4-d9g6x-worker-ch485 has been changed from (5014aa79-3622-769b-0872-8b098be20762) to (30666ca2-2ad0-4d6c-8743-666d8c653278)
[Wed Jan 20 15:44:27 2021] [info] Reconfigured ocp4-d9g6x-worker-ch485 on 192.168.1.160 in Datacenter.  
 
Modified:  
 
config.instanceUuid: "5014aa79-3622-769b-0872-8b098be20762" -> "30666ca2-2ad0-4d6c-8743-666d8c653278"; 
config.annotation: "" -> "ocp4-d9g6x-worker-ch485"; 
config.hardware.numCPU: 2 -> 4; 
config.hardware.device(2000).deviceInfo.summary: "16,777,216 KB" -> "125,829,120 KB"; 
config.hardware.device(2000).backing.uuid: "6000C29c-c243-3881-ac35-9a0144837b8b" -> "6000C294-1c9a-9e30-0740-421846617fc0"; 
config.hardware.device(2000).backing.contentId: "642d1cbb574b28f15be82d28104b3d61" -> "4917d41dd2757d31a9466fa2e719ffa0"; 
config.hardware.device(2000).capacityInKB: 16777216 -> 125829120; 
config.hardware.device(2000).capacityInBytes: 17179869184 -> 128849018880; 
config.hardware.device(100).device: (500, 12000, 1000) -> (500, 12000, 1000, 4000); 
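
To make the point of the diff above explicit, here is a small sketch that parses the `key: old -> new;` lines from the reconfigure event and checks whether any memory-related property was touched. The function name is illustrative, and the parsing assumes the exact line format shown above; only three of the lines are reproduced in the sample.

```python
def parse_reconfigure_diff(text):
    """Map each changed property to its (old, new) values as raw strings."""
    changes = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";").strip()
        if ": " not in line or " -> " not in line:
            continue  # skip headers such as "Modified:" and blank lines
        key, _, rest = line.partition(": ")
        old, _, new = rest.partition(" -> ")
        changes[key] = (old.strip().strip('"'), new.strip().strip('"'))
    return changes


sample = '''
Modified:

config.annotation: "" -> "ocp4-d9g6x-worker-ch485";
config.hardware.numCPU: 2 -> 4;
config.hardware.device(2000).capacityInKB: 16777216 -> 125829120;
'''

changes = parse_reconfigure_diff(sample)
print(changes["config.hardware.numCPU"])
print(any("memory" in key.lower() for key in changes))
```

Run against the full event text above, the same check finds no memory-related property: the clone changes CPU count, disk size, and devices, but no memory reservation appears in the diff.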


Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.3   True        False         4h11m   Cluster version is 4.7.0-fc.3


How reproducible:


Steps to Reproduce:
1. create a machineset
2. scale up a node


Actual results:

The new node is created on the hypervisor but never starts

Expected results:

The node should start and be part of the cluster

Additional info:

Tweaking the template before cloning, by setting sched.mem.min equal to the memory size and ticking "Reserve all guest memory (All locked)", allows the scale-up process to continue and finish as expected.

I will attach a must-gather (MG) later today, but if further information is required, just let me know.
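
A rough sketch of how the memory-reservation part of this workaround might be scripted. It only builds the command line and does not run it. Assumptions: `govc vm.change` accepts `-vm` and `-mem.reservation` (in MB) on the installed govc version, which should be confirmed with `govc vm.change -h`; the inventory path is hypothetical; and whether the template must first be converted to a VM is not covered here. The "Reserve all guest memory (All locked)" checkbox is not handled by this sketch.

```python
def build_reservation_command(vm_path, memory_mb):
    """Return an argv list for a govc call that reserves all configured memory.

    Assumption: `govc vm.change` supports `-vm` and `-mem.reservation` (MB).
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    return [
        "govc", "vm.change",
        "-vm", vm_path,
        "-mem.reservation", str(memory_mb),
    ]


# Hypothetical inventory path; 8192 is the memsize from the error in this report.
command = build_reservation_command("/Datacenter/vm/rhcos-vmware", 8192)
print(" ".join(command))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.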

Comment 1 Roberto 2021-01-20 15:44:51 UTC

Must-Gather: https://drive.google.com/file/d/1EIrnQKct85MfuXVfXBHehYIhCHRoQySd/view?usp=sharing

Comment 2 Michael Gugino 2021-01-20 15:51:39 UTC

The machine-api doesn't create or manage the template.  I think the basic settings come from the OVA, so perhaps that needs modification.  Moving to RHCOS team for investigation.

Comment 3 Micah Abbott 2021-01-20 16:32:09 UTC

@Roberto which version of vSphere are you using?

@Luca could you investigate this?  I don't believe we have changed anything in our OVA template between 4.6 -> 4.7, so this is surprising behavior.

@Joe tagging you here for visibility, as you have been helpful with vSphere problems in the past.

Comment 4 Joseph Callen 2021-01-20 16:47:10 UTC

Is the virtual machine deployed into a resource pool?
Did you set the latency sensitivity to high?

I tend to think this error is a vSphere environmental / configuration problem. 
I have tested machine{sets} with UPI and never had this problem in VMC (VMware on AWS) or my nested environment.
Though it could be because I didn't increase the memory.


Also see:
https://kb.vmware.com/s/article/2002779

Comment 5 Michael McCune 2021-01-20 20:34:54 UTC

this sounds very similar to an issue that was raised through the okd community. there wasn't a good solution there, but i believe the user had identified local configuration issues. see https://github.com/openshift/okd/issues/419 for more info.

Comment 6 Roberto 2021-01-21 07:28:24 UTC

(In reply to Micah Abbott from comment #3)
> @Roberto which version of vSphere are you using?
> 
> @Luca could you investigate this?  I don't believe we have changed anything
> in our OVA template between 4.6 -> 4.7, so this is surprising behavior.
> 
> @Joe tagging you here for visibility, as you have been helpful with vSphere
> problems in the past.


6.7.0 Update 3 (Build 16773714)

# govc about
Name:         VMware vCenter Server
Vendor:       VMware, Inc.
Version:      6.7.0
Build:        14368073
OS type:      linux-x64
API type:     VirtualCenter
API version:  6.7.3
Product ID:   vpx
UUID:         086cf68d-a6da-40e4-a9a9-d5db976eb1f3

Comment 7 Roberto 2021-01-21 07:44:40 UTC

(In reply to Joseph Callen from comment #4)
> Is the virtual machine deployed into a resource pool?
> Did you set the latency sensitivity to high?
> 
> I tend to think this error is a vSphere environmental / configuration
> problem. 
> I have tested machine{sets} with UPI never had this problem in VMC (VMware
> on AWS) or my nested environment.
> Though it could be because I didn't increase the memory.
> 
> 
> Also see:
> https://kb.vmware.com/s/article/2002779

- No, the VM is not in a resource pool

- Yes, I did

- With regards to the VMware KB, it's exactly what I changed in the template config to make it work, but I understand that no manual actions, beyond what the docs state[1], should be required.

"In the Virtual Hardware panel of the Customize hardware tab, modify the specified values as required. Ensure that the amount of RAM, CPU, and disk storage meets the minimum requirements for the machine type."[1]

These changes are made before install, but what about after? Who or what changes the template?

[1] https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-machines_installing-vsphere

Comment 8 Roberto 2021-01-21 07:47:31 UTC

(In reply to Michael McCune from comment #5)
> this sounds very similar to an issue that was raised through the okd
> community. there wasn't a good solution there, but i believe the user had
> identified local configuration issues. see
> https://github.com/openshift/okd/issues/419 for more info.

I'll double-check everything, in case I missed something, and keep you posted.

Comment 9 Micah Abbott 2021-01-21 17:10:15 UTC

(In reply to Roberto from comment #7)

> > Also see:
> > https://kb.vmware.com/s/article/2002779

> - With regards to VMware kb, it's exactly what I changed in the template
> config to make it worked, but I understand that no manual actions, beyond
> what docs states[1], should be required.
> 
> "In the Virtual Hardware panel of the Customize hardware tab, modify the
> specified values as required. Ensure that the amount of RAM, CPU, and disk
> storage meets the minimum requirements for the machine type."[1]
> 
> These changes are made before install, but what about after? who or what
> does change the template?
> 
> [1]
> https://docs.openshift.com/container-platform/4.7/installing/
> installing_vsphere/installing-vsphere.html#installation-vsphere-
> machines_installing-vsphere

We discussed this in a bug scrub today and we don't believe that any changes to the RHCOS OVA would resolve this issue.

Since there is a VMware KBase article about this exact error, and following the steps to change the configuration of the virtual machine causes the VM to power on successfully, I think the best course of action is to update our docs with some additional information about this situation.

I'd suggest a note/warning in the scale up instructions like so:

```
If a node appears to be stuck in the 'Provisioning' state after scaling up a MachineSet, users should investigate the status of the virtual machine in the vSphere instance itself.  Users should use the VMware commands `govc tasks` and `govc events` to determine the status of the virtual machine.

If a similar error message to "[Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users should follow the steps in the following VMware KBase article in an attempt to resolve the issue.

https://kb.vmware.com/s/article/2002779
```

Comment 10 Micah Abbott 2021-01-21 17:23:05 UTC

FWIW, here's the vSphere OVA template we use for RHCOS - https://github.com/coreos/coreos-assembler/blob/master/src/vmware-template.xml

Comment 12 Roberto 2021-01-21 18:12:40 UTC

(In reply to Micah Abbott from comment #9)
> (In reply to Roberto from comment #7)
> 
> > > Also see:
> > > https://kb.vmware.com/s/article/2002779
> 
> > - With regards to VMware kb, it's exactly what I changed in the template
> > config to make it worked, but I understand that no manual actions, beyond
> > what docs states[1], should be required.
> > 
> > "In the Virtual Hardware panel of the Customize hardware tab, modify the
> > specified values as required. Ensure that the amount of RAM, CPU, and disk
> > storage meets the minimum requirements for the machine type."[1]
> > 
> > These changes are made before install, but what about after? who or what
> > does change the template?
> > 
> > [1]
> > https://docs.openshift.com/container-platform/4.7/installing/
> > installing_vsphere/installing-vsphere.html#installation-vsphere-
> > machines_installing-vsphere
> 
> We discussed this in a bug scrub today and we don't believe that any changes
> to the RHCOS OVA would resolve this issue.
> 
> Since there is a VMware KBase article about this exact error and following
> the steps to change the configuration of the virtual machine cause the VM to
> power on successfully, I think the best course of action is to update our
> docs with some additional information about this situation.
> 
> I'd suggest a note/warning in the scale up instructions like so:
> 
> ```
> If a node appears to be stuck in the 'Provisioning' state after scaling up a
> MachineSet, users should investigate the status of the virtual machine in
> the vSphere instance itself.  Users should use the VMware commands `govc
> tasks` and `govc events` to determine the status of the virtual machine.
> 
> If a similar error message to "[Invalid memory setting: memory reservation
> (sched.mem.min) should be equal to memsize(8192). ]" is discovered, users
> should follow the steps in the following VMware KBase article in an attempt
> to resolve the issue.
> 
> https://kb.vmware.com/s/article/2002779
> ```

in the meantime, I will publish a KCS with such content

Comment 15 Vikram Goyal 2021-02-01 06:47:48 UTC

*** Bug 1919239 has been marked as a duplicate of this bug. ***

Comment 18 jima 2021-05-20 08:14:00 UTC

The vSphere doc at https://docs.openshift.com/container-platform/4.7/installing/installing_vmc/installing-vmc-user-infra.html#installation-vsphere-machines_installing-vmc-user-infra says that setting the Latency Sensitivity parameter to "High" is optional (the default value is "Normal").

This may affect not only scaling up worker nodes but also fresh installations once Latency Sensitivity is set to High.
I installed a UPI cluster on a vSphere env and found that if only "Latency Sensitivity" is set to High, master/worker nodes fail to be cloned or powered on, with the following errors in the vCenter GUI.
On the QE VMC (vSphere 7.0) env, the clone failed:

Error: error reconfiguring virtual machine: error reconfiguring virtual machine: A specified parameter was not correct: spec.memoryAllocation

On the Dev embedded vSphere 6.7 env on VMC, the clone step succeeded, but power-on failed:
Error: Invalid CPU reservation for the latency-sensitive VM, (sched.cpu.min) should be at least 2500 MHz.

After setting the CPU reservation and retrying the clone from the RHCOS template, an error was reported again, this time related to memory:
Error: invalid memory setting: memory reservation(sched.mem.min) should be equal to memsize(8192).

Do you think it is reasonable to update the doc to say that once "Latency Sensitivity" is set to "High", the VM's CPU reservation and memory reservation also need to be set to the current CPU/memory values (the default is empty)?
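
The rule proposed here can be sketched as a pre-flight check on a template's settings before cloning. This is an illustration only: the dictionary keys are hypothetical names, not vSphere API fields, and the required CPU value has to be supplied by the caller (for example, the 2500 MHz taken from the error above).

```python
def missing_reservations(vm):
    """List reservation problems for a VM described as a plain dict.

    Hypothetical keys: latency_sensitivity, memory_mb,
    memory_reservation_mb, cpu_reservation_mhz, required_cpu_mhz.
    """
    problems = []
    # Reservations only matter here when latency sensitivity is "High".
    if vm.get("latency_sensitivity", "normal").lower() != "high":
        return problems
    if vm.get("memory_reservation_mb", 0) != vm["memory_mb"]:
        problems.append(
            f"sched.mem.min should be equal to memsize({vm['memory_mb']})"
        )
    if vm.get("cpu_reservation_mhz", 0) < vm.get("required_cpu_mhz", 0):
        problems.append(
            f"sched.cpu.min should be at least {vm['required_cpu_mhz']} MHz"
        )
    return problems


template = {
    "latency_sensitivity": "High",
    "memory_mb": 8192,
    "required_cpu_mhz": 2500,
}
for problem in missing_reservations(template):
    print(problem)
```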

Comment 19 Lindsey Barbee-Vargas 2021-06-02 18:23:54 UTC

Adding a known issue to the 4.5+ release notes in the following PRs:

4.5 - https://github.com/openshift/openshift-docs/pull/32240
4.6 - https://github.com/openshift/openshift-docs/pull/32241
4.7 - https://github.com/openshift/openshift-docs/pull/32243
4.8 - https://github.com/openshift/openshift-docs/pull/32245

Comment 20 Lindsey Barbee-Vargas 2021-06-02 19:34:28 UTC

Verified fix is published and live on docs.openshift.com:
https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-known-issues
https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-known-issues
https://docs.openshift.com/container-platform/4.7/release_notes/ocp-4-7-release-notes.html#ocp-4-7-known-issues

Verified fix will be available upon release of 4.8:
https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-known-issues