Bug 2021504
| Summary: | [MTV Scale] importer pods are restarting during migration | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | David Vaanunu <dvaanunu> |
| Component: | Storage | Assignee: | Arik <ahadas> |
| Status: | CLOSED MIGRATED | QA Contact: | David Vaanunu <dvaanunu> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.9.1 | CC: | afrosi, alitke, apinnick, awels, fdupont, istein, marnold, mguetta, mlehrer, prince.tcet, tashkena, yadu |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 4.15.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-12-14 16:04:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Adding that the importer pod restarts has also performance influence since it increases the migration time. Does it also happen when doing migration via VMIO API? NOTE: The bug was open while using 'vsphere endpoint'.
While changed to 'esx endpoint' - no pods restart
Running 1 VMIO - Passed
migrate 6 VMs using VMIO API:
All PODs were restart
same error:
E1118 10:06:12.378671 1 vddk-datasource_amd64.go:587] Error reading from source at offset 25702694912: pread: nbd_pread: poll: Interrupted system call
E1118 10:06:12.378711 1 vddk-datasource_amd64.go:996] Unable to copy block at offset 23622320128: pread: nbd_pread: poll: Interrupted system call
E1118 10:06:12.522774 1 data-processor.go:229] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
/remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
/remote-source/app/pkg/importer/data-processor.go:166
main.main
/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1374
E1118 10:06:12.522870 1 importer.go:191] pread: nbd_pread: poll: Interrupted system call
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
/remote-source/app/pkg/importer/data-processor.go:206
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
/remote-source/app/pkg/importer/data-processor.go:166
main.main
/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1374
I1118 10:06:12.523216 1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop.
I1118 10:06:13.560335 1 vddk-datasource_amd64.go:184] Stopped watching nbdkit log. Last lines follow:
(In reply to David Vaanunu from comment #7) Correction!!!! > Running 1 VMIO - Passed Running 1 VMIO failed too, but have no error of: "pread: nbd_pread: poll: Interrupted system call" it failed with error: I1118 11:01:59.217802 1 vddk-datasource_amd64.go:954] Transferred 53687091200/53687091200 bytes (100%) at 125144734 bytes/second overall, last 1% was 536870912 bytes at 89478485 bytes/second I1118 11:01:59.236227 1 data-processor.go:232] New phase: Resize W1118 11:01:59.243078 1 data-processor.go:311] Available space less than requested size, resizing image to available space 52612300800. I1118 11:01:59.243093 1 data-processor.go:319] Calculated new size is < than current size, not resizing: requested size 52612300800, virtual size: 53687091200. I1118 11:01:59.243099 1 data-processor.go:238] Validating image E1118 11:01:59.246590 1 data-processor.go:229] Virtual image size 53687091200 is larger than available size 52613311242 (PVC size 53687091200, reserved overhead 0.055000%). A larger PVC is required. Unable to resize disk image to requested size kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause /remote-source/app/pkg/importer/data-processor.go:223 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData /remote-source/app/pkg/importer/data-processor.go:166 main.main /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit /usr/lib/golang/src/runtime/asm_amd64.s:1374 E1118 11:01:59.246640 1 importer.go:191] Virtual image size 53687091200 is larger than available size 52613311242 (PVC size 53687091200, reserved overhead 0.055000%). A larger PVC is required. Unable to resize disk image to requested size kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause /remote-source/app/pkg/importer/data-processor.go:223 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData /remote-source/app/pkg/importer/data-processor.go:166 main.main /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit /usr/lib/golang/src/runtime/asm_amd64.s:1374 I1118 11:01:59.246923 1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop. All 6 VMs were failed to migrate. POD stopped after 5 restart. it failed with a size error. (see above) migrate to OCS storage I discussed this one with Matthew and the problem is in the importer pod itself, not how MTV builds the DataVolume CRs. Moving the bug to CNV / Storage to fix it in CNV 4.9.z. Are you using the "storage" API in the created DataVolumes? Please attach the yaml associated with the affected DataVolumes. I migrated 6 VMs from VMware65 openshift-mtv bug2021504-vm-7039-vt4wh ImportInProgress 41.00% 2 11m openshift-mtv bug2021504-vm-7040-4pt9v ImportInProgress 83.00% 2 11m openshift-mtv bug2021504-vm-7041-w8xbt Succeeded 100.0% 1 10m openshift-mtv bug2021504-vm-7042-pg5w4 ImportInProgress 33.00% 2 10m openshift-mtv bug2021504-vm-7043-vzwwn Succeeded 100.0% 9m29s openshift-mtv bug2021504-vm-7044-vgmd6 ImportInProgress 95.00% 8m46s importer-bug2021504-vm-7039-vt4wh 1/1 Running 2 (3m5s ago) 11m importer-bug2021504-vm-7040-4pt9v 1/1 Running 2 (7m32s ago) 11m importer-bug2021504-vm-7041-w8xbt 0/1 Completed 1 10m importer-bug2021504-vm-7042-pg5w4 1/1 Running 2 (2m12s ago) 10m importer-bug2021504-vm-7043-vzwwn 0/1 Completed 0 9m19s importer-bug2021504-vm-7044-vgmd6 1/1 Running 0 8m49s Alice, can you help to understand why the nbd_pread system call is interrupted? I wonder if this operation is simply more resource intensive than our importer pod resource requests allow and the kubelet is therefore killing the pod? Alexander, any thoughts? Looking at the DVs my first thought this is another one of the warm migration that adds 'cdi.kubevirt.io/storage.pod.retainAfterCompletion: "true"' annotation to the DV, and thus the pod. Since the warm migration expects multiple pods to be started with the same name, if the first one is not deleted then the second one cannot be started (the cdi-deployment log from openshift-cnv namespace should tell us) and weird things happen, like the DV reporting complete (first phase is complete) but then showing < 100% and things of that nature. However AFAICR we fixed that issue, just can't find the fix on the CDI side, might have been on the MTV side. I just went back into my logs and I believe the fix was to turn off the annotation in MTV, but I guess it is turned back on? Was the importer pod that has 'pread: nbd_pread: poll: Interrupted system call'restarted or terminated? Any OOM event? It seems to me also from comment #9 that it was a temporary failure? During the tests, a few importer pods were restarted but no OOM events. In the end, the VMs were migrated successfully - using MTV migration plan (Some of the VMs had 3-4 restarts) While trying with VMIO, all VMs were failed to migrate. (Comment #9) Matthew, is there any updates for the bug? Nothing yet, I haven't had a chance to take a closer look. Probably the target version will need to be bumped up again. Moving out of z-streams since this doesn't appear to have priority. Please retarget to an appropriate release if you intend to merge it in either 4.9.z or 4.10.z. reproduce on cloud10 with the following versions : * MTV 2.3.0-32 * CNV 4.10.0-648 * OCP 4.10 * VMware 6.7 running a single plan migration with single esxi host vmware6.7 as provider with 20 VMS 8 importers pods were restarted : [root@f01-h14-000-r640 scale-tests]# oc get pods |grep importer |grep Running NAME READY STATUS RESTARTS AGE importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1070-28lmt 1/1 Running 1 (13m ago) 16m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1071-k4k8p 1/1 Running 1 (14m ago) 15m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1074-jd7sr 1/1 Running 1 (7m39s ago) 12m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1075-f87qs 1/1 Running 0 17m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1077-hvxv8 1/1 Running 1 (9m6s ago) 18m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1079-pw7nb 1/1 Running 0 13m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1080-gvtkx 1/1 Running 1 (11m ago) 19m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1081-t6bls 1/1 Running 3 (8m3s ago) 27m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1082-q66nj 1/1 Running 1 (10m ago) 21m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1084-4zcp2 1/1 Running 1 (11m ago) 21m importer-mtv-api-tests-22-03-03-13-53-10-b03-plan-vm-1086-cr9zt 1/1 Running 0 14m I think this libnbd fix will take care of this bug: https://github.com/libguestfs/libnbd/commit/b3440853cdeca0e44ad9c526e71faaa6cf344bfc I had good results testing with this fix, but I'm not sure there's a quick way to get the new version (libnbd-1.11.2) picked up by CDI. I will see what the options are. I believe this is actually fixed in 4.11, since this commit updated libnbd to 1.12.1: https://github.com/kubevirt/containerized-data-importer/commit/02e70a5a5486e03e0a9a04980127ba2bdc2d44fb It would be worth a retest at some point. Hi,
I can't this bug fix in 4.11.x release notes. Can you confirm in which release this bug is fixed ?
I am facing this issue in 4.10.
>Interrupted system call Unable to transfer source data to target file
> Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero..
Hi,
I can't find this bug fix in OCP/MTV/CNV 4.11.x release notes. Can you confirm in which release this bug is fixed ?
I am facing this issue in 4.10.
>Interrupted system call Unable to transfer source data to target file
> Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero..
The closest I can find is the CDI release notes: https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.48.0 The relevant entry is "Enhancement: Use centos:stream9 as a base image for our containers.", which is from pull request https://github.com/kubevirt/containerized-data-importer/pull/1983. There is a commit in that pull request that updates libnbd to a newer version that contains the fix. So I think it got fixed kind of implicitly, in a way that won't really show up in a release note. But I am pretty sure it was fixed in 4.11, and not backported to any 4.10 release. (In reply to Prince Sarvaiya from comment #25) > Hi, > > I can't find this bug fix in OCP/MTV/CNV 4.11.x release notes. Can you > confirm in which release this bug is fixed ? > I am facing this issue in 4.10. > > >Interrupted system call Unable to transfer source data to target file > > Here, CDI pod crashes, and restarts again. On restart, it starts copying data/VM again from Zero.. I don't think it a question for me |
Description of problem: while migrating VMs from both vmware65&vmware67 to BM env. (cloud38) importer pods are restarting. Version-Release number of selected component (if applicable): MTV_2.2.0-80 CNV_4.9.1 VDDK 7.0.2 How reproducible: Steps to Reproduce: 1. Migrate VMs from vmware65/67 2. monitor the importer pod log files 3. Actual results: importer PODs are restarting during migration All VMs were migrated Expected results: Migrated All VMs with PODs restart Additional info: Attached: POD restart screenshot, PODs log file, Test report POD error: E1109 12:36:21.016236 1 vddk-datasource_amd64.go:587] Error reading from source at offset 24538775552: pread: nbd_pread: poll: Interrupted system call E1109 12:36:21.016274 1 vddk-datasource_amd64.go:996] Unable to copy block at offset 23622320128: pread: nbd_pread: poll: Interrupted system call E1109 12:36:21.300977 1 data-processor.go:229] pread: nbd_pread: poll: Interrupted system call Unable to transfer source data to target file kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause /remote-source/app/pkg/importer/data-processor.go:206 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData /remote-source/app/pkg/importer/data-processor.go:166 main.main /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit /usr/lib/golang/src/runtime/asm_amd64.s:1374 E1109 12:36:21.301070 1 importer.go:191] pread: nbd_pread: poll: Interrupted system call Unable to transfer source data to target file kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause /remote-source/app/pkg/importer/data-processor.go:206 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData /remote-source/app/pkg/importer/data-processor.go:166 main.main /remote-source/app/cmd/cdi-importer/importer.go:189 runtime.main /usr/lib/golang/src/runtime/proc.go:204 runtime.goexit /usr/lib/golang/src/runtime/asm_amd64.s:1374 I1109 12:36:21.301457 1 vddk-datasource_amd64.go:144] Waiting for VDDK nbdkit log watcher to stop. I1109 12:36:22.414440 1 vddk-datasource_amd64.go:184] Stopped watching nbdkit log. Last lines follow: I1109 12:36:22.414461 1 vddk-datasource_amd64.go:213] Log line from nbdkit: nbdkit: vddk[1]: debug: VDDK call: VixDiskLib_Read (handle, 47861760 sectors, 4096 sectors, buffer)