Bug 2016290 - [Warm] Warm Migration Fails and Reports Ambiguous Status.
Summary: [Warm] Warm Migration Fails and Reports Ambiguous Status.
Keywords:
Status: ASSIGNED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Arnold
QA Contact: Ilanit Stein
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-21 08:19 UTC by Maayan Hadasi
Modified: 2022-01-03 08:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Maayan Hadasi 2021-10-21 08:19:16 UTC
Description of problem:
Precopy does not complete for the following VM migrations:
- RHEL 8 VM with 2 disks
- Windows VM (issue was found with VMware Windows2019 VM)

After pressing 'Cutover', the Transfer disks phase runs forever.


Version-Release number of selected component (if applicable):
MTV 2.2.0-61 / iib:127106
CNV 4.9.0-249 / iib:122549


How reproducible:
100%


Attachments:
Screenshot and Plan yaml files 


Additional info:
RHEL 8 VM with one disk - migration works OK in MTV 2.2.0-61
RHEL 8 VM with 2 disks - we do not have a record of when it recently worked
Windows2019 VM - migration was OK in MTV 2.2.0-39


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

Comment 4 Maayan Hadasi 2021-10-21 08:57:20 UTC
Updating that, AFAIK, Windows2019 warm migration was OK in MTV 2.2.0-39 with regard to this issue.

Comment 5 Fabien Dupont 2021-10-21 15:37:42 UTC
Maayan, can you tell us what version of CNV you're testing with?

Comment 6 Maayan Hadasi 2021-10-21 15:56:07 UTC
(In reply to Fabien Dupont from comment #5)
> Maayan, can you tell us what version of CNV you're testing with?

CNV 4.9.0-249 / iib: 122549

Comment 8 Amos Mastbaum 2021-10-28 05:57:25 UTC
Ran warm migration on a few CNV 4.9.0-249 + MTV 2.2.0-63 environments (BM/PSI) with only 1 disk, and the results were a little different (the DV is showing 80/90% Paused while the plan is successful).

Sam: 

"Very strange that the DV is showing paused with completion of only 80%, but the importer pod is done. The CDI controller is also full of strange network errors."

"In the initial report, the DVs were getting stuck at ImportInProgress even when the importer was done, so MTV never considered them as complete. The DV yaml in this case is a little different but also suffering from the status being wrong. It's at the Paused phase, but 80% complete and still has the Running condition. (edited) 
Something seems to be broken with CDI"



"It seems like there's a problem with CDI managing the status of the data volume, though the plan showing as complete is very strange unless the DV passed through a phase that would make MTV think it was completed." 

http://pastebin.test.redhat.com/1004347 (plan)
http://pastebin.test.redhat.com/1004338 (dv)
http://pastebin.test.redhat.com/1004343 (cdi)

Comment 9 Alexander Wels 2021-10-28 14:39:20 UTC
Most of the errors in the CDI logs can be categorized into 2 groups:
1. Standard k8s concurrent access, which are normal and not concerning at all
2. Network errors. The DV controller attempts to connect to the import pod to read the progress so it can update the DV status. It is failing to connect, so either there is a network problem, or the import pod endpoint is down (aka the pod is not running). Assuming the network itself is fine, the import pod must not be running.
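
A minimal, hedged sketch (not the actual CDI controller code) of the polling described in point 2; the endpoint URL, port, and error handling are illustrative assumptions only:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// pollImportProgress contacts the importer pod to read progress so the DV
// status can be updated, as described in point 2 above.
func pollImportProgress(endpoint string) error {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(endpoint)
    if err != nil {
        // This is the class of "network error" seen in the controller log:
        // either the network is broken or the importer pod is no longer running.
        return fmt.Errorf("unable to reach importer pod: %w", err)
    }
    defer resp.Body.Close()
    fmt.Println("importer responded:", resp.Status)
    return nil
}

func main() {
    // Hypothetical pod address for illustration.
    if err := pollImportProgress("https://10.0.0.12:8443/metrics"); err != nil {
        fmt.Println(err)
    }
}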

I also see this in the log:
I1027 15:24:49.999979       1 util.go:604] Saving VDDK annotations from pod status message: messageUnable to process data: pread: nbd_pread: poll: Interrupted system call Unable to transfer source data to target file kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause     /remote-source/app/pkg/importer/data-processor.go:206 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData         /remote-source/app/pkg/importer/data-processor.go:166 main.main         /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main        /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit  /usr/lib/golang/src/runtime/asm_amd64.s:1374; VDDK: {"Version":"7.0.2","Host":""}

This is from the pod exit log, thus the pod must have died. The original error seems to originate from the nbdkit process, but without the pod logs I cannot be sure. The import pod log might shed some light on what is actually failing.

Comment 10 Amos Mastbaum 2021-10-28 15:12:27 UTC
importer log http://pastebin.test.redhat.com/1004642

Comment 11 Amos Mastbaum 2021-10-29 05:58:45 UTC
@awels@redhat.com ^^

Comment 12 Alexander Wels 2021-10-29 11:57:37 UTC
So the importer logs indicate a successful import (I see no failures, and a success message), but the DV controller logs do not reflect this. So something else is happening.

Comment 13 Fabien Dupont 2021-11-03 13:42:54 UTC
According to the current state of the investigation, the issue seems to be in CNV Storage.
Changing the product and component to CNV and Storage, respectively.

Comment 14 Alexander Wels 2021-11-11 15:06:27 UTC
So after investigating, the problem came down to a new debugging feature added to CDI and subsequently used in MTV. This feature allows one to put an annotation on the DataVolume, and CDI will retain the pod that populates the DataVolume. In the case of warm migration this was enabled for MTV, which presents a problem, as explained below.
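
For reference only, a hedged sketch of what enabling this debug feature on a DataVolume roughly looks like; the annotation key is an assumption based on CDI's naming convention and is not confirmed by this report:

package main

import "fmt"

// Assumed CDI-style annotation key for retaining the importer pod after it
// completes (hypothetical here, used only to illustrate the feature above).
const annRetainAfterCompletion = "cdi.kubevirt.io/storage.pod.retainAfterCompletion"

// enablePodRetention adds the retain annotation to a DataVolume's annotation map.
func enablePodRetention(dvAnnotations map[string]string) map[string]string {
    if dvAnnotations == nil {
        dvAnnotations = map[string]string{}
    }
    dvAnnotations[annRetainAfterCompletion] = "true"
    return dvAnnotations
}

func main() {
    fmt.Println(enablePodRetention(nil))
}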

Warm migration takes one or more snapshots of the running VM in the source system, and then copies the snapshots over in sequence until we can switch off the source and copy the final small delta. On the CDI side this means creating a new importer pod for each snapshot being copied. However, due to the way the name of the pod is generated, it doesn't change between snapshots. So if the DV is named `example`, then the importer pod will be named `importer-example`, and if a new snapshot is being copied, the name will be the same. But if the first pod was retained due to the annotation, the second cannot be created (you cannot have 2 pods with the same name). The import is reporting 100% because it is reporting the progress of a single pod, which completed successfully; it's not 100% complete for the entire process, just one snapshot.
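
Illustrative sketch (not CDI's actual code) of the naming behaviour described above, showing why every snapshot copy maps onto the same pod name:

package main

import "fmt"

// importerPodName derives the pod name only from the DataVolume name, so every
// snapshot copy of a warm migration resolves to the same pod name.
func importerPodName(dvName string) string {
    return "importer-" + dvName
}

func main() {
    // Two consecutive snapshot copies for DV "example" produce the same pod name.
    // If the first pod was retained, creating the second fails ("already exists").
    fmt.Println(importerPodName("example")) // importer-example
    fmt.Println(importerPodName("example")) // importer-example
}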

This can be fixed by disabling the annotation on warm migrations. If we want to keep it enabled, we should figure out a different mechanism for generating the pod names that is 'stable' and predictable, but different for each snapshot. There is a checkpoint name we could potentially use for this, as sketched below.
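
A hedged sketch of that second option, mixing the checkpoint name into the pod name so it stays predictable per snapshot yet unique across snapshots; the hashing and truncation details are assumptions, not an actual CDI change:

package main

import (
    "crypto/sha256"
    "fmt"
)

// importerPodNameForCheckpoint derives a stable, predictable pod name that
// still differs between snapshots by folding the checkpoint name into it.
func importerPodNameForCheckpoint(dvName, checkpoint string) string {
    sum := sha256.Sum256([]byte(checkpoint))
    return fmt.Sprintf("importer-%s-%x", dvName, sum[:4])
}

func main() {
    fmt.Println(importerPodNameForCheckpoint("example", "snapshot-1"))
    fmt.Println(importerPodNameForCheckpoint("example", "snapshot-2"))
}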

Comment 15 Ilanit Stein 2021-11-22 16:58:16 UTC
In MTV 2.2, this workaround was added to allow warm migration from VMware: Bug 2020297 - "Disable retaining "Importer" pods for warm migration".

Comment 16 Fabien Dupont 2022-01-03 08:50:00 UTC
There will not be an MTV 2.2.z release to leverage a fix in CNV 4.9.z. Moving to CNV 4.10.0.

