Bug 1827370
| Summary: | VMI fails during CNV upgrade | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kevin Alon Goldblatt <kgoldbla> |
| Component: | Virtualization | Assignee: | David Vossel <dvossel> |
| Status: | CLOSED NOTABUG | QA Contact: | Israel Pinto <ipinto> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.3.0 | CC: | cnv-qe-bugs, danken, dvossel, fdeutsch, myakove, ncredi, sgott |
| Target Milestone: | --- | | |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | Virtualization | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-04 08:54:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Kevin, do you have access to the virt-launcher logs associated with vmi-test1? Could you attach them?

David, could you take a look into this?

Host-path based VMs and VMIs cannot survive an OCP upgrade. During the OCP upgrade, each node gets updated and all workloads are evicted from the node. Thus: a) the VMI will be killed; b) the VMI cannot be restarted, and even a VM could not be started on any other node, because the DV is tied to the specific node currently in the upgrade process. Please clarify:
1. When updating from OCP 4.3 to OCP 4.4 -- are you sure that all nodes got updated?
2. Please provide the error of the failed VMI, all events, all errors, and the final "describe VMI" output.

As a rule of thumb, supply the output of must-gather unless you know where the bug is, in which case you can supply the limited log requested by Fabian.

I think the pod interface should be masquerade and not bridge:
interfaces:
- bridge: {}
  name: default
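
For comparison, here is a minimal sketch of the masquerade binding for that same pod network (an assumption on my part: only the interfaces fragment is shown, and the rest of the VMI spec is left unchanged):

interfaces:
- masquerade: {}   # NAT the guest behind the pod IP instead of bridging to the pod network
  name: default

With the masquerade binding, the "cannot migrate VMI with a bridge interface connected to a pod network" condition reported in the status below should also no longer apply.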
All nodes get updated.

VMIs are mortal; once killed (for whatever reason) they will not be brought up again. The VMI must have been killed during the upgrade of the relevant node, so it is expected that it is no longer running. VMs are the entities which ensure that the workload keeps running even if the underlying infrastructure has an issue (an error or, for example, an upgrade). The VM in the example above is still running; I actually wonder if it got restarted or migrated. The remaining question is whether the state (Failed) is correct. David, do you know how VMIs should behave (i.e. what status they should have) after they get killed upon node shutdown?

In the end I do not see much of a bug here. Moving this out to 2.4 because it does not look like a bug.

Version was changed, while I think Fabian meant to change the target version.

@Kevin, can you please try the same flow with masquerade to see if the pod gets into a failed state?

Moving target version to 2.4.0 based on Comment #9 and Comment #10. Clearing the blocker flag based on recent comments as well.

> The remaining question is if the state (Failed) is correct.
> David, do you know how VMIs should behave (aka what status they should have) after they get killed upon node shutdown?
In the case of node shutdown, I'd expect to see the VMI as "Failed". The "Succeeded" phase is used when we can detect that the guest within the VMI shut down of its own accord (it didn't crash and wasn't forced off).
To further clarify expectations here.
Anytime an OCP update occurs that involves power cycling nodes, VMIs running on those nodes will go down (Failed). The VMIs will only be brought back up again if they are backed by a VM object with "running: true" or "runStrategy: Always" set.
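
For illustration, a minimal sketch of how the reporter's vmi-test1 spec could be wrapped in such a VM object; the vm-test1 name is a made-up example, and only one of running/runStrategy may be set:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-test1            # hypothetical name for illustration
spec:
  runStrategy: Always       # alternatively: running: true (the two are mutually exclusive)
  template:
    metadata:
      labels:
        special: vmi-test1
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: dv-test1
        resources:
          requests:
            memory: 200M
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: dv-test1
        name: dv-test1

With this wrapper, the virt-controller would create a replacement VMI after the original one fails, subject to the DV's node-locality constraint discussed above.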
Anytime CNV is updated and no node disruption is occurring at the same time, we expect _ALL_ VMIs to remain online and healthy throughout the update. If any VMI is disrupted as a result of a CNV update, that is a bug.
In this case, as Fabian has pointed out, node disruption during the OCP update will cause VMIs to go down.
As David says, the only exception is where "VMI.spec.evictionStrategy: LiveMigrate" is set; in that case, on controlled node disruptions (such as during an upgrade) the VMI should get live migrated to a different node. Not a bug.
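
For completeness, a sketch (not taken from the report) of the spec fields involved in that exception:

spec:
  evictionStrategy: LiveMigrate    # migrate instead of shutting down on controlled node drains
  domain:
    devices:
      interfaces:
      - masquerade: {}             # a bridge interface on the pod network blocks live migration
        name: default

Note that vmi-test1 as posted could not have been live migrated anyway: the DisksNotLiveMigratable and InterfaceNotLiveMigratable conditions in the status below point at the node-local hostpath-provisioner PVC and the bridge interface on the pod network.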
Description of problem:
A direct VMI created on a datavolume fails during upgrade of CNV. Regular VMs created with datavolumes remain running throughout the upgrade.

Version-Release number of selected component (if applicable):
oc version:
----------------------------------------
Client Version: 4.3.10
Server Version: 4.4.0-rc.8
Kubernetes Version: v1.17.1

oc get csv:
-----------------------------------------
NAME                                      DISPLAY                           VERSION   REPLACES   PHASE
kubevirt-hyperconverged-operator.v2.3.0   Container-native virtualization   2.3.0                Succeeded

How reproducible:
100%

Steps to Reproduce:
1. Create datavolume on env with ocp4.3 cnv2.2
2. Create VMI - VMI running
3. Upgrade to ocp4.4 - VMI still running
4. Upgrade to cnv2.3 - VMI failed

Actual results:
VMI failed during upgrade to cnv2.3 from cnv2.2

Expected results:
VMI should remain running

Additional info:
oc get vmi
NAME                   AGE     PHASE     IP            NODENAME
vm-cirros-datavolume   5h40m   Running   10.128.0.30   host-172-16-0-19
vmi-test1              27h     Failed    10.129.0.27   host-172-16-0-40

cat dv-test1.yaml:
------------------------------------------------
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-test1
  namespace: default
spec:
  source:
    http:
      url: "http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.4.0-x86_64-disk.qcow2"
  pvc:
    storageClassName: hostpath-provisioner
    volumeMode:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi

cat vm-test1.yaml:
----------------------------------------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  labels:
    special: vmi-test1
  name: vmi-test1
  namespace:
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: dv-test1
    resources:
      requests:
        memory: 200M
  terminationGracePeriodSeconds: 0
  volumes:
  - name: dv-test1
    dataVolume:
      name: dv-test1

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1alpha3
    kubevirt.io/storage-observed-api-version: v1alpha3
  creationTimestamp: "2020-04-22T15:09:22Z"
  generation: 10
  labels:
    kubevirt.io/nodeName: host-172-16-0-40
    special: vmi-test1
  name: vmi-test1
  namespace: default
  resourceVersion: "661178"
  selfLink: /apis/kubevirt.io/v1alpha3/namespaces/default/virtualmachineinstances/vmi-test1
  uid: 8249e9c0-c91d-4bdf-9330-4379e8ec3931
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: dv-test1
      interfaces:
      - bridge: {}
        name: default
    features:
      acpi:
        enabled: true
    firmware:
      uuid: 0188d572-fe4d-41d4-8c36-4313336231bc
    machine:
      type: q35
    resources:
      requests:
        cpu: 100m
        memory: 200M
  networks:
  - name: default
    pod: {}
  terminationGracePeriodSeconds: 0
  volumes:
  - dataVolume:
      name: dv-test1
    name: dv-test1
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: cannot migrate VMI with non-shared PVCs
    reason: DisksNotLiveMigratable
    status: "False"
    type: LiveMigratable
  - lastProbeTime: null
    lastTransitionTime: null
    message: cannot migrate VMI with a bridge interface connected to a pod network
    reason: InterfaceNotLiveMigratable
    status: "False"
    type: LiveMigratable
  - lastProbeTime: null
    lastTransitionTime: "2020-04-22T15:09:29Z"
    status: "True"
    type: Ready
  guestOSInfo: {}
  interfaces:
  - ipAddress: 10.129.0.27
    mac: 0a:58:0a:81:00:1b
    name: default
  migrationMethod: BlockMigration
  nodeName: host-172-16-0-40
  phase: Failed
  qosClass: Burstable