Bug 2135381

Summary: Live migration of OpenShift Virtualization VMs with ODF (ceph storage) based disks is failing consistently
Product: Container Native Virtualization (CNV)
Reporter: pbunev <pbunev>
Component: Virtualization
Assignee: Jed Lejosne <jlejosne>
Status: VERIFIED ---
QA Contact: zhe peng <zpeng>
Severity: high
Docs Contact:
Priority: high
Version: 4.10.5
CC: acardace, bdumont, dzilberm, fdeutsch, ipinto, ktenzer, pbunev, pelauter, prince.tcet, sgott, vromanso, yadu, ycui, zpeng
Target Milestone: ---
Target Release: 4.14.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: v4.14.0.rhel9-1569
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2016584
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2092271, 2016584, 2174226
Bug Blocks:

Description pbunev@redhat.com 2022-10-17 13:03:29 UTC
Description of problem:

Live migration of OpenShift Virtualization VMs with ODF-based shared disks fails.

Version-Release number of selected component (if applicable):

OCP: 4.10.26
ODF: 4.10.6
OCP-virt: 4.10.5

How reproducible: 100%


Steps to Reproduce:
1. Create an OpenShift Virtualization VM (Fedora OS) with ODF file storage (storage class ocs-storagecluster-cephfs), backed by a PVC with volume mode 'Filesystem'.
2. Start a live migration of the VM.
3. Observe that a new virt-launcher pod is created, but the old one never finishes and the VM is paused indefinitely.
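
Step 2 can also be driven without the web console. The following is a minimal sketch, not the reporter's exact procedure: it builds a VirtualMachineInstanceMigration manifest that could be piped to `oc create -f -`. The function name is hypothetical, and the VMI name and namespace passed in at the bottom are placeholders to be replaced with the real ones.

```python
import json


def migration_manifest(vmi_name: str, namespace: str) -> dict:
    """Build a VirtualMachineInstanceMigration object targeting one VMI."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachineInstanceMigration",
        "metadata": {
            # generateName lets the API server pick a unique suffix,
            # so this manifest must be submitted with "create", not "apply".
            "generateName": f"{vmi_name}-migration-",
            "namespace": namespace,
        },
        "spec": {"vmiName": vmi_name},
    }


if __name__ == "__main__":
    # Placeholder names; substitute the VMI and namespace under test.
    print(json.dumps(migration_manifest("my-vmi", "my-namespace"), indent=2))
```

The printed JSON is accepted by `oc create -f -` as-is. Running `virtctl migrate <vm-name>` should have the same effect where virtctl is installed.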

Actual results:

VM is paused indefinitely.

Expected results:

VM gets migrated to another OCP node, preserving its state.

Additional info:

Log files and screenshots will be attached to the Bugzilla.

Comment 9 Ying Cui 2022-11-03 06:57:24 UTC
This bug is reported against 4.10; it is not clear why the target version is set to 4.8.4, so let's re-target it in the bug scrub meeting.

From the old virt-launcher log:
{"component":"virt-launcher","kind":"","level":"error","msg":"Recevied a live migration error. Will check the latest migration status.","name":"fedora-cephfs","namespace":"vm-testproj","pos":"live-migration-source.go:805","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'cont': Failed to get \"write\" lock')","timestamp":"2022-10-17T09:55:42.993889Z","uid":"2a34c2ae-71af-4d4e-a116-5ce9621ce88a"}
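
To spot this failure in a longer virt-launcher log, the JSON lines can be filtered on the "reason" field. This is a minimal sketch: the helper name is hypothetical, and the matched substring is taken from the single log line above, so it may need adjusting for other versions.

```python
import json

# The log line quoted above, reduced to the fields inspected here.
LOG_LINE = r'''{"component":"virt-launcher","level":"error","msg":"Recevied a live migration error. Will check the latest migration status.","name":"fedora-cephfs","namespace":"vm-testproj","pos":"live-migration-source.go:805","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'cont': Failed to get \"write\" lock')"}'''


def is_write_lock_failure(line: str) -> bool:
    """Return True if a JSON log line reports the QEMU write-lock error."""
    entry = json.loads(line)
    return (
        entry.get("level") == "error"
        and 'Failed to get "write" lock' in entry.get("reason", "")
    )


if __name__ == "__main__":
    print(is_write_lock_failure(LOG_LINE))
```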

Comment 13 Kedar Bidarkar 2023-01-04 13:35:34 UTC
*** Bug 2152909 has been marked as a duplicate of this bug. ***

Comment 14 Antonio Cardace 2023-03-03 15:26:32 UTC
Deferring to 4.13.1 due to capacity.

Comment 15 Antonio Cardace 2023-03-03 16:45:01 UTC
Deferring to 4.14 due to priority.

Comment 16 zhe peng 2023-08-16 07:31:26 UTC
Verified with build: CNV-v4.14.0.rhel9-1632

Steps:
1. Create a VM with the ocs-storagecluster-cephfs storage class
...
        storage:
          resources:
            requests:
              storage: 30Gi
          storageClassName: ocs-storagecluster-cephfs
...
Check the PVC:
...
    resources:
      requests:
        storage: "34087042032"
    storageClassName: ocs-storagecluster-cephfs
    volumeMode: Filesystem
    volumeName: pvc-19fd569a-a750-4298-99fc-81e0767ea167
...

2. Start the VM
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-fedora-9pgkv   1/1     Running   0          2m41s

3. Perform a live migration

$ oc get pods
NAME                            READY   STATUS      RESTARTS   AGE
virt-launcher-vm-fedora-6tgzl   1/1     Running     0          16s
virt-launcher-vm-fedora-9pgkv   0/1     Completed   0          3m22s

$ oc get vm
NAME        AGE     STATUS    READY
vm-fedora   4m31s   Running   True

$ oc get virtualmachineinstancemigrations.kubevirt.io 
NAME                        PHASE       VMI
vm-fedora-migration-o514o   Succeeded   vm-fedora

Moving to VERIFIED.
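
A note on the PVC size in step 1: the PVC requests 34087042032 bytes although the VM spec asks for 30Gi. This report does not state the cause. The number is consistent with a filesystem overhead being added for Filesystem-mode volumes. The sketch below assumes an overhead fraction of 0.055 (an assumption, believed to be the CDI default) and rounding up to whole bytes. The function name is hypothetical.

```python
import math
from fractions import Fraction


def pvc_request_bytes(requested_bytes: int, overhead: str) -> int:
    """Inflate a requested size so that (1 - overhead) of it is still usable."""
    # Fractions keep the division exact; a float would also work here.
    return math.ceil(Fraction(requested_bytes) / (1 - Fraction(overhead)))


if __name__ == "__main__":
    requested = 30 * 2**30  # 30Gi in bytes
    print(pvc_request_bytes(requested, "0.055"))  # 34087042032
```

Under those assumptions the result matches the `storage` value shown on the PVC.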