Bug 2016290

Summary: [Warm] Warm migration fails and reports an ambiguous status
Product: Container Native Virtualization (CNV)
Reporter: Maayan Hadasi <mguetta>
Component: Storage
Assignee: Matthew Arnold <marnold>
Status: CLOSED ERRATA
QA Contact: Maayan Hadasi <mguetta>
Severity: high
Priority: high
Docs Contact:
Version: 4.9.0
CC: alitke, amastbau, awels, cnv-qe-bugs, fdupont, istein, jortel, marnold, mrashish, yadu
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: 4.10.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: CNV v4.10.1-42
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-06-14 17:42:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Maayan Hadasi 2021-10-21 08:19:16 UTC
Description of problem:
Precopy never completes for the following VM migrations:
- RHEL 8 VM with 2 disks
- Windows VM (the issue was found with a VMware Windows 2019 VM)

After pressing 'Cutover', the 'Transfer disks' phase runs forever.


Version-Release number of selected component (if applicable):
MTV 2.2.0-61 / iib:127106
CNV 4.9.0-249 / iib:122549


How reproducible:
100%


Attachments:
Screenshot and Plan YAML files


Additional info:
RHEL 8 VM with one disk: migration works OK in MTV 2.2.0-61
RHEL 8 VM with 2 disks: we do not have a record of when this last worked
Windows 2019 VM: migration was OK in MTV 2.2.0-39


Comment 4 Maayan Hadasi 2021-10-21 08:57:20 UTC
Update: AFAIK, Windows 2019 warm migration was OK in MTV 2.2.0-39 with respect to this issue.

Comment 5 Fabien Dupont 2021-10-21 15:37:42 UTC
Maayan, can you tell us what version of CNV you're testing with?

Comment 6 Maayan Hadasi 2021-10-21 15:56:07 UTC
(In reply to Fabien Dupont from comment #5)
> Maayan, can you tell us what version of CNV you're testing with?

CNV 4.9.0-249 / iib: 122549

Comment 8 Amos Mastbaum 2021-10-28 05:57:25 UTC
Ran warm migration on a few CNV 4.9.0-249 + MTV 2.2.0-63 environments (BM/PSI) with only 1 disk, and the results were a little different: the DV shows 80-90% and Paused, while the Plan is Successful.

Sam: 

"Very strange that the DV is showing paused with completion of only 80%, but the importer pod is done. The CDI controller is also full of strange network errors."

"In the initial report, the DVs were getting stuck at ImportInProgress even when the importer was done, so MTV never considered them as complete. The DV yaml in this case is a little different but also suffering from the status being wrong. It's at the Paused phase, but 80% complete and still has the Running condition. (edited) 
Something seems to be broken with CDI"



"It seems like there's a problem with CDI managing the status of the data volume, though the plan showing as complete is very strange unless the DV passed through a phase that would make MTV think it was completed." 

http://pastebin.test.redhat.com/1004347 (plan)
http://pastebin.test.redhat.com/1004338 (dv)
http://pastebin.test.redhat.com/1004343 (cdi)

Comment 9 Alexander Wels 2021-10-28 14:39:20 UTC
Most of the errors in the CDI logs can be categorized into 2 groups:
1. Standard k8s concurrent access errors, which are normal and not concerning at all.
2. Network errors. The DV controller attempts to connect to the import pod to read the progress so it can update the DV status. It is failing to connect, so either there is a network problem or the import pod endpoint is down (i.e. the pod is not running). Assuming the network itself is fine, the import pod must not be running; a quick way to check is shown below.
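
(For reference, a minimal check of the second case; the pod name follows the `importer-<dv>` scheme discussed later in this bug for a DV named `example`, and the namespace placeholder is hypothetical:)

$ oc get pod importer-example -n <target-namespace>
$ oc logs importer-example -n <target-namespace> --previous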

I also see this in the log:
I1027 15:24:49.999979       1 util.go:604] Saving VDDK annotations from pod status message: messageUnable to process data: pread: nbd_pread: poll: Interrupted system call Unable to transfer source data to target file kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause     /remote-source/app/pkg/importer/data-processor.go:206 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData         /remote-source/app/pkg/importer/data-processor.go:166 main.main         /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main        /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit  /usr/lib/golang/src/runtime/asm_amd64.s:1374; VDDK: {"Version":"7.0.2","Host":""}

This is from the pod exit log, so the pod must have died. The original error seems to originate from the nbdkit process, but without the pod logs I cannot be sure. The import pod log might shed some light on what is actually failing.

Comment 10 Amos Mastbaum 2021-10-28 15:12:27 UTC
importer log http://pastebin.test.redhat.com/1004642

Comment 11 Amos Mastbaum 2021-10-29 05:58:45 UTC
@awels ^^

Comment 12 Alexander Wels 2021-10-29 11:57:37 UTC
The importer logs indicate a successful import (I see no failures, and a success message), but the DV controller logs do not reflect this. So something else is happening.

Comment 13 Fabien Dupont 2021-11-03 13:42:54 UTC
According to the current state of the investigation, the issue seems to be in CNV Storage.
Changing the product and component to CNV and Storage, respectively.

Comment 14 Alexander Wels 2021-11-11 15:06:27 UTC
After investigating, the problem came down to a new debugging feature that was added to CDI and subsequently used in MTV. This feature allows one to put an annotation on the DataVolume so that CDI retains the pod that populates it. In the case of warm migration, MTV enabled this annotation, which presents a problem for the following reason.

Warm migration takes one or more snapshots of the running VM in the source system, and then copies the snapshots over in sequence until we can switch off the source and copy the final small delta. On the CDI side this means creating a new importer pod for each snapshot being copied. However, due to the way the pod name is generated, it doesn't change between snapshots: if the DV is named `example`, the importer pod will be named `importer-example`, and when the next snapshot is copied the name will be the same. But if the first pod was retained due to the annotation, the second cannot be created (you cannot have 2 pods with the same name). The import reports 100% because it reports the progress of a single pod, which completed successfully; it is not 100% complete for the entire process, just for one snapshot.

This can be fixed by disabling the annotation for warm migrations; if we want to keep it enabled, we should figure out a different mechanism for generating pod names that is stable and predictable but different for each snapshot. There is a checkpoint name we could potentially use for this, as sketched below.
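
For illustration only (this is a sketch, not the actual CDI implementation; the function name and hash scheme are assumptions), a checkpoint-derived pod name could look like this:

package main

import (
	"crypto/sha256"
	"fmt"
)

// importerPodName returns a stable but per-snapshot-unique pod name.
// With no checkpoint (a cold import) it keeps the familiar
// "importer-<dv>" form; with a checkpoint it appends a short hash of
// the checkpoint ID, so a retained pod from one snapshot cannot
// collide with the pod for the next snapshot.
func importerPodName(dvName, checkpointID string) string {
	if checkpointID == "" {
		return "importer-" + dvName
	}
	sum := sha256.Sum256([]byte(checkpointID))
	return fmt.Sprintf("importer-%s-%x", dvName, sum[:4])
}

func main() {
	fmt.Println(importerPodName("example", ""))             // importer-example
	fmt.Println(importerPodName("example", "checkpoint-1")) // importer-example-<hash1>
	fmt.Println(importerPodName("example", "checkpoint-2")) // importer-example-<hash2>, no collision
}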

Comment 15 Ilanit Stein 2021-11-22 16:58:16 UTC
In MTV 2.2 this workaround was added: Bug 2020297 - "Disable retaining "Importer" pods for warm migration",
to allow warm migration from VMware (see the example below).
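
For context, the retention behavior is controlled by a DataVolume annotation. Assuming the annotation name is cdi.kubevirt.io/storage.pod.retainAfterCompletion (CDI's pod-retention debug annotation), the workaround is equivalent to making sure it is absent on warm-migration DataVolumes, e.g.:

$ oc annotate dv example cdi.kubevirt.io/storage.pod.retainAfterCompletion- -n <target-namespace>

(The trailing "-" removes the annotation; `example` and the namespace placeholder are hypothetical.)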

Comment 16 Fabien Dupont 2022-01-03 08:50:00 UTC
There will not be an MTV 2.2.z release to leverage a fix in CNV 4.9.z. Moving to CNV 4.10.0.

Comment 17 Yan Du 2022-02-09 09:24:19 UTC
Hi Matthew, do you have a plan to make it into 4.10.0?

Comment 18 Matthew Arnold 2022-02-09 13:11:28 UTC
I have a fix that is almost ready (see linked pull request), but I think it is unlikely to be accepted into 4.10. This can be pushed out again.

Comment 19 Matthew Arnold 2022-02-14 16:26:13 UTC
I checked, and CNV 4.10 is blockers-only at this point. I'm not sure if there's a later release that would benefit MTV near-term.

Comment 20 Maya Rashish 2022-03-22 12:37:56 UTC
It is now possible to backport for 4.10.1, if you'd like this sooner than 4.11 (as suggested by the target release).

Comment 21 Matthew Arnold 2022-03-23 15:26:48 UTC
Yes, it would be good to have this backported to 4.10.1. Is there a process to follow for that, or can I just open a cherry-pick in CDI?

Comment 22 Maya Rashish 2022-03-23 15:41:06 UTC
A cherry-pick in CDI will do it.
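
(For anyone following along, a generic sketch of what such a backport looks like; the branch name and commit SHA below are hypothetical, not taken from this bug:)

$ git fetch upstream
$ git checkout -b backport-bug2016290 upstream/release-v1.43
$ git cherry-pick <fix-commit-sha>
$ git push origin backport-bug2016290   # then open a PR against the release branch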

Comment 23 Maya Rashish 2022-04-04 07:49:16 UTC
Oops, I chose the version of the importer rather than the CNV bundle.

Comment 26 Maayan Hadasi 2022-05-15 13:10:24 UTC
The issue was reproduced in an MTV warm migration of a RHEL 8 VM with 2 disks from RHV, using Ceph-RBD as the target storage class.
The 2nd precopy crashed, and once 'Cutover' was executed, the 'Transfer disks' step never ended (it ran forever).

Attachments in the bug-2016290_may-15.zip file:
plan.yaml
Log of the crashed importer pod
Screenshots


Versions:
CNV 4.10.1-99 / iib: 224744
MTV STAGE 2.3.1-6 + patched ForkliftController CR


$ oc get forkliftcontroller forklift-controller -n openshift-mtv -oyaml
apiVersion: forklift.konveyor.io/v1beta1
kind: ForkliftController
metadata:
  creationTimestamp: "2022-05-15T08:27:50Z"
  generation: 3
  name: forklift-controller
  namespace: openshift-mtv
  resourceVersion: "16935517"
  uid: 73311c57-5b4e-4870-95a8-7426ed019855
spec:
  controller_image_fqin: quay.io/mrnold/forklift-controller:latest
  controller_precopy_interval: 3
  feature_must_gather_api: "true"
  feature_ui: "true"
  feature_validation: "true"
status:
  conditions:
  - ansibleResult:
      changed: 0
      completion: 2022-05-15T13:00:44.965727
      failures: 0
      ok: 27
      skipped: 9
    lastTransitionTime: "2022-05-15T08:27:50Z"
    message: Awaiting next reconciliation
    reason: Successful
    status: "True"
    type: Running

Comment 28 Yan Du 2022-05-16 03:11:02 UTC
Hi Matthew,
Could we re-target it to 4.10.2?

Comment 29 Matthew Arnold 2022-05-24 12:17:50 UTC
It looks like the fix is working, since the latest screenshot shows a bunch of correctly retained importer pods. The logs show the "no space left" bug from https://bugzilla.redhat.com/show_bug.cgi?id=2087916, so it's going to be stuck trying to transfer a disk that won't fit in the scratch space. I think that bug is specific to imports from RHV, so I might consider this fixed if it works with a warm import from VMware.

Comment 30 Maayan Hadasi 2022-05-26 07:23:12 UTC
Moving to VERIFIED, based on comment #29.
A RHEL 8 VM with 2 disks was successfully migrated from VMware, tested with both NFS and Ceph-RBD.

Versions:
CNV 4.10.1-99 / iib: 224744
OCP 4.10.12
MTV STAGE 2.3.1-7 + patched ForkliftController CR


$ oc get forkliftcontroller forklift-controller -n openshift-mtv -oyaml
apiVersion: forklift.konveyor.io/v1beta1
kind: ForkliftController
metadata:
  creationTimestamp: "2022-05-25T19:00:28Z"
  generation: 3
  name: forklift-controller
  namespace: openshift-mtv
  resourceVersion: "33615748"
  uid: ca1497f5-7a2b-48c7-bf7f-e26828a7024e
spec:
  controller_image_fqin: quay.io/mrnold/forklift-controller:latest
  controller_precopy_interval: 3
  feature_must_gather_api: "true"
  feature_ui: "true"
  feature_validation: "true"
status:
  conditions:
  - ansibleResult:
      changed: 0
      completion: 2022-05-26T07:09:13.571672
      failures: 0
      ok: 27
      skipped: 9
    lastTransitionTime: "2022-05-25T19:00:28Z"
    message: Awaiting next reconciliation
    reason: Successful
    status: "True"
    type: Running

Comment 31 Maayan Hadasi 2022-05-26 08:44:46 UTC
(In reply to Maayan Hadasi from comment #30)
> Moving to VERIFIED, based on comment #29.
> RHEL8 2-disks VM was successfully migrated from VMware, tested with NFS &
> Ceph-RBD
> 
> Versions:
> CNV 4.10.1-99 / iib: 224744
> OCP 4.10.12
> MTV STAGE 2.3.1-7 + patched ForkliftController CR

Correcting: MTV 2.3.1 GA was actually used when testing the above.

Comment 37 errata-xmlrpc 2022-06-14 17:42:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.2 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5026