Created attachment 1736067 [details]
logs and yamls

Description of problem:
While running the console of a Windows 2012 VM imported from either VMware or RHV, a worker node becomes NotReady, the virt-launcher pod is terminated, CDI pods are terminated and recreated, and at some point the cluster becomes unusable.

Version-Release number of selected component (if applicable):
OCP 4.6.6/CNV 2.5.0

How reproducible:
100%

Additional info:
* The issue reproduces for both the VMware -> CNV and RHV -> CNV imports.
* The original source VM is from VMware and was migrated to RHV, so the same VM was tested in both the RHV -> CNV and VMware -> CNV import paths.
* The issue did not occur with RHEL 7/Windows 2016 VMs.
* The worker node returned to the 'Ready' state after a reboot.

Attachments: logs and yamls files
This doesn't seem like a UI bug.
This bug was also reproduced on a libvirt-based OCP 4.6/CNV 2.5.2 environment: a Windows 2016 VM was imported from RHV to CNV, then the VM was started and the console was opened. After a few minutes one of the cluster's workers became "Not Ready", and a day later it was no longer possible to connect to this cluster. Since this has now happened on 2 PSI-based environments and on a newly installed libvirt-based environment, it indicates that starting the imported Windows VM and opening its console is what renders the environment unusable. This bug blocks VM import for Windows VMs in CNV 2.5.2.
Does it also happen with a VM that was created directly in CNV rather than imported? Importing a VM only generates a VM spec, so unless the VM spec structure changed in 2.5.2 and VMIO needs to be updated accordingly, it is unlikely to be a VMIO issue.
The node has roughly 15000Mi allocatable and the VM alone takes more than 8000Mi. How much more is already running there (oc describe node <name>)? What are the kubelet system-reserved settings? Can we get that info too?
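For reference, commands along these lines should surface that information (the node name is a placeholder; the kubelet config path is the usual RHCOS location, so treat this as a sketch):

$ oc describe node <worker-node> | grep -A 8 -E 'Capacity|Allocatable'   # compare Capacity vs Allocatable to see what the kubelet reserves
$ oc get kubeletconfig -o yaml                                           # any custom systemReserved/kubeReserved overrides
$ oc debug node/<worker-node> -- chroot /host cat /etc/kubernetes/kubelet.conf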
Hi, at the time the node becomes NotReady, do you have any console access to the node in order to observe the logs? In addition, can you please post the output of <oc describe nodes> and the virt-launcher pod logs? TIA, Igor
Hi All,

In short: reducing the VM memory size solves the problem. It also looks like both the virt-launcher log and the worker node console support Roman's note that it might be a resources issue.

In more detail:

1. The VM created by the VMIO spec CR has the following memory configuration:

$ oc describe vm/..
...
resources:
  limits:
    memory: 32Gi
  requests:
    memory: 8Gi

2. Using an identical VM spec, an identical VM was created, with the exception of the memory:

resources:
  requests:
    memory: 4Gi*

***The identical 4Gi RAM VM did NOT have any issues; the console was responsive and nothing crashed :)***

3. When I tried to change the VM memory configuration after it was created, the limit was reduced (from 32Gi to 16Gi) no matter what I changed it to.

How to reproduce without VMIO:

ConfigMap: http://pastebin.test.redhat.com/925520*
Secret: http://pastebin.test.redhat.com/925521
DataVolume: http://pastebin.test.redhat.com/925523
VirtualMachine: http://pastebin.test.redhat.com/925524

*The ca_cert is included in a private comment.

virt-launcher log:
------------------
http://pastebin.test.redhat.com/925525

Worker node console attached.

4. A different VM with the same memory configuration behaved slightly differently: virt-launcher kept crashing, but I think the nodes were OK. Not sure whether it can be reproduced:

VM
http://pastebin.test.redhat.com/925527

DV
http://pastebin.test.redhat.com/pastebin.php
DV http://pastebin.test.redhat.com/925528
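For anyone re-running this reproduction, a quick way to confirm which QoS class the imported VM lands in (pod name is a placeholder; "compute" is the virt-launcher container that carries the VM's resources):

$ oc get pod virt-launcher-<vm-name>-<hash> -o jsonpath='{.status.qosClass}{"\n"}'
$ oc get pod virt-launcher-<vm-name>-<hash> -o jsonpath='{.spec.containers[?(@.name=="compute")].resources}{"\n"}'

With requests (8Gi) != limits (32Gi), the first command should print "Burstable".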
Created attachment 1738829 [details] WorkerNodeConsole
(In reply to Amos Mastbaum from comment #8)
> 1. The VM created by the VMIO spec CR has the following memory configuration:
>
> resources:
>   limits:
>     memory: 32Gi
>   requests:
>     memory: 8Gi

Until we support something like ballooning, setting a limit at all only makes sense when the VM should run in the QoS class "Guaranteed", and there it has to be exactly the same as the request. At this stage, setting a limit different from the request makes no sense for almost 100% of CNV users.

So first, the import step should not set a limit. Second, it seems like you also experience issues without the limit set, when the request is 8Gi, due to memory pressure. Here it would be interesting to closely investigate the node when the memory pressure happens. The kubelet reserves some memory for itself and some memory for the system daemons. I wonder if these settings are too low, or if some services which are delivered via pods are killed before the VM gets killed.
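To illustrate, a minimal sketch of the two resource shapes that make sense today (values are examples only, not recommendations):

# Burstable: request only, no limit -- what the import should produce
resources:
  requests:
    memory: 8Gi

# Guaranteed: the only sensible shape for a limit today, request == limit
resources:
  requests:
    memory: 8Gi
  limits:
    memory: 8Gi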
Moving bug from V2V to SSP, as it is not related to VM import.
(In reply to Ilanit Stein from comment #13)
> Moving bug from V2V to SSP, as it is not related to VM import.

(In reply to Roman Mohr from comment #12)
> Until we support something like ballooning, setting a limit at all only
> makes sense when the VM should run in the QoS class "Guaranteed", and there
> it has to be exactly the same as the request.
>
> So first, the import step should not set a limit.

@istein what has set requests < limits if not the VM import?
@Dan,

There are 2 issues mentioned in Roman's comment #12:

1. VM import adds a memory limit, even though it shouldn't. For this Amos opened:
Bug 1908337 - [v2v] VM import RHV to Vmware remove memory limits from VM Spec

2. Even without an added memory limit, with a memory request of 8GB the node becomes "Not Ready". This check was not mentioned above; it appears in the logs. It indicates that this problem (turning the node into Not Ready) is not related to VM import.
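As an interim workaround for issue 1, a JSON patch along these lines should drop the limit from an already-imported VM (untested sketch; the path assumes the standard VirtualMachine spec layout, and the VM needs a restart to pick it up):

$ oc patch vm <vm-name> --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/domain/resources/limits"}]'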
How big is your node? Can you run other 8Gi VMs on it? Any idea why this particular VM makes it crash?

spec:
  domain:
    clock:
      timer:
        hpet:
          present: false
        hyperv:
          present: true
        pit:
          present: true
          tickPolicy: delay
        rtc:
          present: true
          tickPolicy: catchup
      utc: {}
    cpu:
      cores: 1
      sockets: 1
      threads: 1
    devices:
      disks:
      - bootOrder: 1
        disk:
          bus: virtio
        name: harddisk1
      inputs:
      - bus: usb
        name: tablet
        type: tablet
      interfaces:
      - bridge: {}
        macAddress: 00:50:56:ad:64:bc
        model: virtio
        name: networkadapter1
    features:
      acpi:
        enabled: true
      apic:
        enabled: true
      hyperv:
        relaxed:
          enabled: true
        spinlocks:
          enabled: true
          spinlocks: 8191
        vapic:
          enabled: true
    firmware:
      uuid: eb62112f-24be-5877-afe1-3f5df7bf3583
    machine:
      type: pc-q35-rhel8.2.0
    resources:
      requests:
        cpu: 100m
        memory: 8Gi
  evictionStrategy: LiveMigrate
  hostname: v2v-win2012
  networks:
  - name: networkadapter1
    pod: {}
  terminationGracePeriodSeconds: 3600
  volumes:
  - name: harddisk1
    persistentVolumeClaim:
      claimName: harddisk1-crvrs
@guchen Can we close this now that the OCP bug has been submitted?
Yes. In my opinion, once they reserve enough resources for the node, it will block the pod from taking all of the node's resources, VM or not.
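For completeness, the OCP-side fix would amount to something like this KubeletConfig (a sketch with illustrative values; the custom-kubelet label is an assumption and must first be added to the worker MachineConfigPool):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: worker-reserved    # label applied to the worker MCP
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1Gi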
Guy, Can we close this bug, as the OCP bug 1857446 is verified? Thanks.
Yes, this is an OCP bug. *** This bug has been marked as a duplicate of bug 1857446 ***