2021504 – [MTV Scale] importer pods are restarting during migration

This bug has been migrated to another issue tracking site. It has been closed here and may no longer be being monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at Red Hat Issue Tracker .

Bug 2021504 - [MTV Scale] importer pods are restarting during migration

Summary: [MTV Scale] importer pods are restarting during migration

Keywords:
Status:	CLOSED MIGRATED
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.9.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.15.0
Assignee:	Arik
QA Contact:	David Vaanunu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-11-09 13:11 UTC by David Vaanunu
Modified:	2023-12-14 16:04 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-12-14 16:04:02 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CNV-14998	0	None	None	None	2023-12-14 16:04:02 UTC

Description David Vaanunu 2021-11-09 13:11:01 UTC

Description of problem:
while migrating VMs from both vmware65&vmware67 to BM env. (cloud38)
importer pods are restarting.


Version-Release number of selected component (if applicable):
MTV_2.2.0-80
CNV_4.9.1
VDDK 7.0.2

How reproducible:


Steps to Reproduce:
1. Migrate VMs from vmware65/67
2. monitor the importer pod log files
3.

Actual results:

importer PODs are restarting during migration
All VMs were migrated



Expected results:

Migrated All VMs with PODs restart


Additional info:

Attached: POD restart screenshot, PODs log file, Test report

POD error:

E1109 12:36:21.016236       1 vddk-datasource_amd64.go:587] Error reading from source at offset 24538775552: pread: nbd_pread: poll: Interrupted system call
E1109 12:36:21.016274       1 vddk-datasource_amd64.go:996] Unable to copy block at offset 23622320128: pread: nbd_pread: poll: Interrupted system call
E1109 12:36:21.300977       1 data-processor.go:229] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
        /remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
        /remote-source/app/pkg/importer/data-processor.go:166
main.main
        /remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
        /usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
        /usr/lib/golang/src/runtime/asm_amd64.s:1374
E1109 12:36:21.301070       1 importer.go:191] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
        /remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
        /remote-source/app/pkg/importer/data-processor.go:166
main.main
        /remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
        /usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
        /usr/lib/golang/src/runtime/asm_amd64.s:1374
I1109 12:36:21.301457       1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop.
I1109 12:36:22.414440       1 vddk-datasource_amd64.go:184] Stopped watching nbdkit log. Last lines follow:
I1109 12:36:22.414461       1 vddk-datasource_amd64.go:213] Log line from nbdkit: nbdkit: vddk[1]: debug: VDDK call: VixDiskLib_Read (handle, 47861760 sectors, 4096 sectors, buffer)

Comment 4 Ilanit Stein 2021-11-14 11:24:07 UTC

Adding that the importer pod restarts has also performance influence since it increases the migration time.

Comment 6 Fabien Dupont 2021-11-16 13:31:25 UTC

Does it also happen when doing migration via VMIO API?

Comment 7 David Vaanunu 2021-11-18 10:30:28 UTC

NOTE: The bug was open while using 'vsphere endpoint'.
      While changed to 'esx endpoint' - no pods restart 



Running 1 VMIO - Passed

migrate 6 VMs using VMIO API:
All PODs were restart 


same error:
E1118 10:06:12.378671       1 vddk-datasource_amd64.go:587] Error reading from source at offset 25702694912: pread: nbd_pread: poll: Interrupted system call
E1118 10:06:12.378711       1 vddk-datasource_amd64.go:996] Unable to copy block at offset 23622320128: pread: nbd_pread: poll: Interrupted system call
E1118 10:06:12.522774       1 data-processor.go:229] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:166
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374
E1118 10:06:12.522870       1 importer.go:191] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:166
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374
I1118 10:06:12.523216       1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop.
I1118 10:06:13.560335       1 vddk-datasource_amd64.go:184] Stopped watching nbdkit log. Last lines follow:

Comment 9 David Vaanunu 2021-11-18 11:24:35 UTC

(In reply to David Vaanunu from comment #7)


Correction!!!!

> Running 1 VMIO - Passed

Running 1 VMIO failed too, but have no error of: "pread: nbd_pread: poll: Interrupted system call"

it failed with error:

I1118 11:01:59.217802       1 vddk-datasource_amd64.go:954] Transferred 53687091200/53687091200 bytes (100%) at 125144734 bytes/second overall, last 1% was 536870912 bytes at 89478485 bytes/second
I1118 11:01:59.236227       1 data-processor.go:232] New phase: Resize
W1118 11:01:59.243078       1 data-processor.go:311] Available space less than requested size, resizing image to available space 52612300800.
I1118 11:01:59.243093       1 data-processor.go:319] Calculated new size is < than current size, not resizing: requested size 52612300800, virtual size: 53687091200.
I1118 11:01:59.243099       1 data-processor.go:238] Validating image
E1118 11:01:59.246590       1 data-processor.go:229] Virtual image size 53687091200 is larger than available size 52613311242 (PVC size 53687091200, reserved overhead 0.055000%). A larger PVC is required.
Unable to resize disk image to requested size
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:223
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:166
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374
E1118 11:01:59.246640       1 importer.go:191] Virtual image size 53687091200 is larger than available size 52613311242 (PVC size 53687091200, reserved overhead 0.055000%). A larger PVC is required.
Unable to resize disk image to requested size
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:223
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:166
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374
I1118 11:01:59.246923       1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop.





All 6 VMs were failed to migrate. POD stopped after 5 restart.
it failed with a size error. (see above)
migrate to OCS storage

Comment 10 Fabien Dupont 2021-11-19 14:47:34 UTC

I discussed this one with Matthew and the problem is in the importer pod itself, not how MTV builds the DataVolume CRs.
Moving the bug to CNV / Storage to fix it in CNV 4.9.z.

Comment 11 Adam Litke 2021-12-15 14:51:40 UTC

Are you using the "storage" API in the created DataVolumes?  Please attach the yaml associated with the affected DataVolumes.

Comment 13 David Vaanunu 2021-12-20 13:41:13 UTC

I migrated 6 VMs from VMware65


openshift-mtv   bug2021504-vm-7039-vt4wh   ImportInProgress   41.00%     2          11m
openshift-mtv   bug2021504-vm-7040-4pt9v   ImportInProgress   83.00%     2          11m
openshift-mtv   bug2021504-vm-7041-w8xbt   Succeeded          100.0%     1          10m
openshift-mtv   bug2021504-vm-7042-pg5w4   ImportInProgress   33.00%     2          10m
openshift-mtv   bug2021504-vm-7043-vzwwn   Succeeded          100.0%                9m29s
openshift-mtv   bug2021504-vm-7044-vgmd6   ImportInProgress   95.00%                8m46s


importer-bug2021504-vm-7039-vt4wh          1/1     Running     2 (3m5s ago)    11m
importer-bug2021504-vm-7040-4pt9v          1/1     Running     2 (7m32s ago)   11m
importer-bug2021504-vm-7041-w8xbt          0/1     Completed   1               10m
importer-bug2021504-vm-7042-pg5w4          1/1     Running     2 (2m12s ago)   10m
importer-bug2021504-vm-7043-vzwwn          0/1     Completed   0               9m19s
importer-bug2021504-vm-7044-vgmd6          1/1     Running     0               8m49s

Comment 14 Adam Litke 2022-01-12 18:27:22 UTC

Alice, can you help to understand why the nbd_pread system call is interrupted?  I wonder if this operation is simply more resource intensive than our importer pod resource requests allow and the kubelet is therefore killing the pod?  Alexander, any thoughts?

Comment 15 Alexander Wels 2022-01-12 20:19:07 UTC

Looking at the DVs my first thought this is another one of the warm migration that adds 'cdi.kubevirt.io/storage.pod.retainAfterCompletion: "true"' annotation to the DV, and thus the pod. Since the warm migration expects multiple pods to be started with the same name, if the first one is not deleted then the second one cannot be started (the cdi-deployment log from openshift-cnv namespace should tell us) and weird things happen, like the DV reporting complete (first phase is complete) but then showing < 100% and things of that nature. However AFAICR we fixed that issue, just can't find the fix on the CDI side, might have been on the MTV side. I just went back into my logs and I believe the fix was to turn off the annotation in MTV, but I guess it is turned back on?

Comment 16 Alice Frosi 2022-01-13 08:14:13 UTC

Was the importer pod that has 'pread: nbd_pread: poll: Interrupted system call'restarted or terminated? Any OOM event? It seems to me also from comment #9 that it was a temporary failure?

Comment 17 David Vaanunu 2022-01-13 09:29:35 UTC

During the tests, a few importer pods were restarted but no OOM events.
In the end, the VMs were migrated successfully - using MTV migration plan
(Some of the VMs had 3-4 restarts)


While trying with VMIO, all VMs were failed to migrate. (Comment #9)

Comment 18 Yan Du 2022-01-26 13:31:19 UTC

Matthew, is there any updates for the bug?

Comment 19 Matthew Arnold 2022-01-27 14:43:53 UTC

Nothing yet, I haven't had a chance to take a closer look. Probably the target version will need to be bumped up again.

Comment 20 Adam Litke 2022-02-28 20:00:01 UTC

Moving out of z-streams since this doesn't appear to have priority.  Please retarget to an appropriate release if you intend to merge it in either 4.9.z or 4.10.z.

Comment 21 Tzahi Ashkenazi 2022-03-03 16:11:41 UTC

reproduce on cloud10 with the following versions  : 
* MTV 2.3.0-32
* CNV 4.10.0-648
* OCP 4.10
* VMware 6.7 

running  a single plan migration with single  esxi host vmware6.7 as provider with 20 VMS  8 importers pods were restarted : 

[root@f01-h14-000-r640 scale-tests]# oc get pods  |grep importer |grep Running
NAME                                                              READY   STATUS      RESTARTS        AGE

importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1070-28lmt   1/1     Running     1 (13m ago)     16m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1071-k4k8p   1/1     Running     1 (14m ago)     15m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1074-jd7sr   1/1     Running     1 (7m39s ago)   12m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1075-f87qs   1/1     Running     0               17m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1077-hvxv8   1/1     Running     1 (9m6s ago)    18m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1079-pw7nb   1/1     Running     0               13m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1080-gvtkx   1/1     Running     1 (11m ago)     19m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1081-t6bls   1/1     Running     3 (8m3s ago)    27m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1082-q66nj   1/1     Running     1 (10m ago)     21m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1084-4zcp2   1/1     Running     1 (11m ago)     21m
importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1086-cr9zt   1/1     Running     0               14m

Comment 22 Matthew Arnold 2022-03-14 20:47:33 UTC

I think this libnbd fix will take care of this bug: https://github.com/libguestfs/libnbd/commit/b3440853cdeca0e44ad9c526e71faaa6cf344bfc

I had good results testing with this fix, but I'm not sure there's a quick way to get the new version (libnbd-1.11.2) picked up by CDI. I will see what the options are.

Comment 23 Matthew Arnold 2022-06-03 19:12:42 UTC

I believe this is actually fixed in 4.11, since this commit updated libnbd to 1.12.1: https://github.com/kubevirt/containerized-data-importer/commit/02e70a5a5486e03e0a9a04980127ba2bdc2d44fb

It would be worth a retest at some point.

Comment 24 Prince Sarvaiya 2022-10-27 09:56:39 UTC

Hi,

I can't this bug fix in 4.11.x release notes. Can you confirm in which release this bug is fixed ?
I am facing this issue in 4.10.

>Interrupted system call Unable to transfer source data to target file 
> Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero..

Comment 25 Prince Sarvaiya 2022-10-27 09:58:31 UTC

Hi,

I can't find this bug fix in OCP/MTV/CNV 4.11.x release notes. Can you confirm in which release this bug is fixed ?
I am facing this issue in 4.10.

>Interrupted system call Unable to transfer source data to target file 
> Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero..

Comment 26 Matthew Arnold 2022-10-27 18:27:33 UTC

The closest I can find is the CDI release notes: https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.48.0

The relevant entry is "Enhancement: Use centos:stream9 as a base image for our containers.", which is from pull request https://github.com/kubevirt/containerized-data-importer/pull/1983. There is a commit in that pull request that updates libnbd to a newer version that contains the fix. So I think it got fixed kind of implicitly, in a way that won't really show up in a release note. But I am pretty sure it was fixed in 4.11, and not backported to any 4.10 release.

Comment 27 David Vaanunu 2022-11-02 11:45:53 UTC

(In reply to Prince Sarvaiya from comment #25)
> Hi,
> 
> I can't find this bug fix in OCP/MTV/CNV 4.11.x release notes. Can you
> confirm in which release this bug is fixed ?
> I am facing this issue in 4.10.
> 
> >Interrupted system call Unable to transfer source data to target file 
> > Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero..


I don't think it a question for me

Note You need to log in before you can comment on or make changes to this bug.