Bug 1904051 - [CNV VM import] Running imported Windows VM Console makes Openshift cluster unusable
Keywords:
Status: CLOSED DUPLICATE of bug 1857446
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: SSP
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: future
Assignee: Omer Yahud
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On: 1857446 1910086
Blocks:
 
Reported: 2020-12-03 12:30 UTC by Maayan Hadasi
Modified: 2021-02-01 11:23 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-01 11:23:01 UTC
Target Upstream Version:
Embargoed:
amastbau: needinfo+


Attachments
logs and yamls (478.90 KB, application/gzip), 2020-12-03 12:30 UTC, Maayan Hadasi
WorkerNodeConsole (37.71 KB, image/png), 2020-12-13 22:51 UTC, Amos Mastbaum

Description Maayan Hadasi 2020-12-03 12:30:33 UTC
Created attachment 1736067 [details]
logs and yamls

Description of problem:
While running the console of a Windows 2012 VM that was imported from either VMware or RHV, a worker node becomes NotReady, the virt-launcher pod is terminated, CDI pods are terminated and recreated, and at some point the cluster becomes unusable.


Version-Release number of selected component (if applicable):
OCP 4.6.6/CNV 2.5.0


How reproducible:
100%


Additional info:
* The issue reproduces for both VMware and RHV imports to CNV
* The original source VM is from VMware and was migrated to RHV, so the same VM was tested in both RHV -> CNV and VMware -> CNV imports
* The issue did not occur with RHEL7/Windows 2016
* The worker node returned to the 'Ready' state after a reboot

Attachments:
logs and yamls files

Comment 1 Filip Krepinsky 2020-12-03 15:35:27 UTC
This doesn't seem like a UI bug.

Comment 3 Ilanit Stein 2020-12-10 07:11:00 UTC
This bug was also reproduced on a libvirt based OCP-4.6/CNV-2.5.2

A Windows 2016 VM was imported from RHV to CNV, then the VM was started and its console was opened. After a few minutes, one of the cluster's workers became "Not Ready", and a day later it was no longer possible to connect to the cluster.

As this has already happened on two PSI-based environments and on a newly installed libvirt-based environment, it indicates that starting the imported Windows VM and opening its console is what causes the environment to become unusable.

This bug blocks VM import for Windows VMs in CNV-2.5.2

Comment 5 Fabien Dupont 2020-12-10 09:27:02 UTC
Does it also happen with a VM that was created directly in CNV, not imported?

Importing a VM generates a VM spec, so either the VM spec structure has changed in 2.5.2 and we need to update VMIO accordingly, or this is unlikely to be a VMIO issue.

Comment 6 Roman Mohr 2020-12-10 13:46:45 UTC
The node has roughly 15000Mi allocatable and the VM alone takes more than 8000Mi. How much more is already running there (oc describe node <name>)? What are the kubelet system-reserved settings? Can we get that info too?
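
On OpenShift, the reservations Roman mentions are typically tuned through a KubeletConfig CR. A minimal sketch of what such a configuration looks like; the name, label selector, and values below are illustrative, not taken from this cluster:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-memory-reservation    # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      memory: 1Gi                    # reserved for system daemons (example value)
    kubeReserved:
      memory: 1Gi                    # reserved for the kubelet itself (example value)
    evictionHard:
      memory.available: "500Mi"      # example hard-eviction threshold
```

If these reservations are too low, a pod (VM or not) can consume memory that the node's system services depend on, pushing the node into NotReady before eviction kicks in.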

Comment 7 Igor Bezukh 2020-12-10 16:54:00 UTC
Hi,

At the time the node becomes NotReady, do you have any console access to the node in order to observe the logs?
In addition, can you please post the output of <oc describe nodes> and the virt-launcher pod logs?

TIA
Igor

Comment 8 Amos Mastbaum 2020-12-13 22:22:45 UTC
Hi All,

In Short:
Reducing the VM memory size solves the problem.
Also, both the virt-launcher log and the worker node console support Roman's note that it might be a resource issue.

In More Details:

1. The VM created by the VMIO spec CR adds the following memory configuration:

$ oc describe vm/..
...
...
resources:
  limits:
    memory: 32Gi
  requests:
    memory: 8Gi

2. Using an identical VM spec, an identical VM was created, with the exception of the memory:

resources:
  requests:
    memory: 4Gi*


***The identical 4Gi RAM VM did NOT have any issues; the console was responsive and nothing crashed :)***


4. When I tried to change the VM memory configuration after the VM was created, it reduced the limit (from 32Gi to 16Gi) no matter what value I changed it to.


How to Reproduce without VMIO:

ConfigMap:
http://pastebin.test.redhat.com/925520*

Secret:
http://pastebin.test.redhat.com/925521

DataVolume:
http://pastebin.test.redhat.com/925523

VirtualMachine:
http://pastebin.test.redhat.com/925524

*The ca_cert is included in a private comment.


virt-launcher log:
------------------
http://pastebin.test.redhat.com/925525

worker node console attached.


5. A different VM with the same memory configuration had slightly different behavior: the virt-manager kept crashing, but I think the nodes were OK. Not sure; it may be reproducible:

VM
http://pastebin.test.redhat.com/925527

DV
http://pastebin.test.redhat.com/pastebin.php

Comment 10 Amos Mastbaum 2020-12-13 22:49:21 UTC
DV
http://pastebin.test.redhat.com/925528

Comment 11 Amos Mastbaum 2020-12-13 22:51:23 UTC
Created attachment 1738829 [details]
WorkerNodeConsole

Comment 12 Roman Mohr 2020-12-14 09:05:14 UTC
(In reply to Amos Mastbaum from comment #8)
> Hi All,
> 
> In Short:
> Reducing VM Memory Size Solves the problem.
> Also, it looks like Both virt-launcer log and the worker node console
> support Roman's note that it might be resources issue.
> 
> In More Details:
> 
> 1. The VM created by VMIO Spec CR add the following memory configuration:
> 
> $ oc describe vm/..
> ...
> ...
> resources:
>   limits:
>     memory: 32Gi
>   requests:
>     memory: 8Gi
> 

Until we support something like ballooning, setting a limit at all only makes sense when the VM should run in the QoS class "Guaranteed". In that case it has to be exactly the same as the request. At this stage, setting a limit different from the request makes no sense for almost 100% of CNV users.

So first, the import step should not set a limit.

Second, it seems like you also experience issues without the limit set, if you set the request to 8Gi, due to memory pressure. Here it would be interesting to closely investigate the node when the memory pressure happens. The kubelet reserves some memory for itself and some memory for the system daemons. I wonder if these settings are too low, or if some services which are delivered via pods are killed before the VM gets killed.
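
To illustrate the point about the "Guaranteed" QoS class: Kubernetes only assigns it when the limit exactly equals the request for every resource on the pod. A sketch of what the VM's resources stanza would look like in that case (values illustrative):

```yaml
resources:
  requests:
    memory: 8Gi
  limits:
    memory: 8Gi   # limit == request for every resource -> QoS class "Guaranteed"
```

With requests < limits, as VMIO generates here, the pod is merely "Burstable": it can grow up to the limit, but under node memory pressure it is a candidate for eviction and can starve system daemons along the way.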

Comment 13 Ilanit Stein 2020-12-16 10:30:18 UTC
Moving bug from V2V to SSP, as it is not related to VM import.

Comment 14 Dan Kenigsberg 2020-12-16 15:12:10 UTC
(In reply to Ilanit Stein from comment #13)
> Moving bug from V2V to SSP, as it is not related to VM import.

(In reply to Roman Mohr from comment #12)
> Until we support something like ballooning, setting a limit at all only make
> sense when the VM should run in the QoS class "Guaranteed". There it will
> have to be exactly the same like the "request". At this stage setting a
> limit different than the request makes for almost 100% of all users no sense
> in CNV.
> 
> So first, the import step should not set a limit.

@istein what has set requests < limits if not the VM import?

Comment 15 Ilanit Stein 2020-12-17 11:31:30 UTC
@Dan,

There are 2 issues mentioned in Roman's comment #12.

1. VM import adds a memory limit even though it shouldn't. For this Amos opened:
Bug 1908337 - [v2v] VM import RHV to Vmware remove memory limits from VM Spec

2. Even without adding a memory limit, with a memory request of 8Gi the node becomes "Not Ready".
This check was not mentioned above; it appears in the logs.
It indicates that this problem (turning the node Not Ready) is not related to VM import.

Comment 16 Dan Kenigsberg 2020-12-17 13:18:12 UTC
How big is your node? Can you run other 8Gi VMs on it? Any idea why this particular VM makes it crash?

spec:
  domain:
    clock:
      timer:
        hpet:
          present: false
        hyperv:
          present: true
        pit:
          present: true
          tickPolicy: delay
        rtc:
          present: true
          tickPolicy: catchup
      utc: {}
    cpu:
      cores: 1
      sockets: 1
      threads: 1
    devices:
      disks:
      - bootOrder: 1
        disk:
          bus: virtio
        name: harddisk1
      inputs:
      - bus: usb
        name: tablet
        type: tablet
      interfaces:
      - bridge: {}
        macAddress: 00:50:56:ad:64:bc
        model: virtio
        name: networkadapter1
    features:
      acpi:
        enabled: true
      apic:
        enabled: true
      hyperv:
        relaxed:
          enabled: true
        spinlocks:
          enabled: true
          spinlocks: 8191
        vapic:
          enabled: true
    firmware:
      uuid: eb62112f-24be-5877-afe1-3f5df7bf3583
    machine:
      type: pc-q35-rhel8.2.0
    resources:
      requests:
        cpu: 100m
        memory: 8Gi
  evictionStrategy: LiveMigrate
  hostname: v2v-win2012
  networks:
  - name: networkadapter1
    pod: {}
  terminationGracePeriodSeconds: 3600
  volumes:
  - name: harddisk1
    persistentVolumeClaim:
      claimName: harddisk1-crvrs

Comment 18 Omer Yahud 2020-12-23 13:01:33 UTC
@guchen Can we close this now that the OCP bug has been submitted?

Comment 19 guy chen 2020-12-27 11:36:25 UTC
Yes, in my opinion. Once they reserve enough resources for the node, it will block the pod from consuming all the resources, VM or not.

Comment 20 Ilanit Stein 2021-01-31 11:32:02 UTC
Guy,

Can we close this bug,
as the OCP bug 1857446 is verified?

Thanks.

Comment 21 guy chen 2021-02-01 11:23:01 UTC
Yes, this is an OCP bug.

*** This bug has been marked as a duplicate of bug 1857446 ***

