Bug 2024138
| Summary: | Warm migration fails at "Transfer disks" stage with "Importer pod not found" error | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Migration Toolkit for Virtualization | Reporter: | Tzahi Ashkenazi <tashkena> | ||||
| Component: | Controller | Assignee: | Sam Lucidi <slucidi> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Ilanit Stein <istein> | ||||
| Severity: | urgent | Docs Contact: | Avital Pinnick <apinnick> | ||||
| Priority: | urgent | ||||||
| Version: | 2.2.0 | CC: | dagur, dvaanunu, fbladilo, fdupont | ||||
| Target Milestone: | --- | Keywords: | Regression | ||||
| Target Release: | 2.2.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-12-09 19:21:12 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Tzahi Ashkenazi
2021-11-17 12:04:50 UTC
Thanks for reporting this issue. To gather more statistics, would it be possible to perform the following actions?
1. Run the same test multiple times in the same cluster to understand whether it was just bad luck.
2. Run the same test in another cluster to understand whether the problem is linked to the cluster.
3. Run the same test with only 5 VMs to understand whether it is related to the number of concurrent migrations.

Reproduced again on cloud10: 1 VM out of 10 VMs failed warm migration with the same error:
pods "importer-bz202413810vms-2-hosts-warm-mtv-87-max-inflight-vm-2145-8wrpf" not found

MTV version 87, results summary for 5 cycles (cloud10):
Cycle 1 - 2 VMs failed
Cycle 2 - 1 VM failed
Cycle 3 - Pass (5 snapshots per VM + cutover)
Cycle 4 - Pass (5 snapshots per VM + cutover)
Cycle 5 - 1 VM failed/stuck on the second snapshot
pods "importer-bz202413810vms-2-hosts-warm-mtv-87-max-inflight-vm-2147-bpds2" not found

Tested on Cloud38:
ocp4.9.7
cnv4.9.1-14
mtv2.2.0-87
Warm setup: 10 VMs, precopy - 10 min
Ran 3 cycles:
Cycle 1 - passed
Cycle 2 - 1 VM failed, snapshot #6 (name: - vm7)
Cycle 3 - 1 VM failed, snapshot #4 (name: - vm9)
Each cycle had 1 pod that restarted on "nbd_pread: poll: Interrupted system call" - https://bugzilla.redhat.com/show_bug.cgi?id=2021504

Reproduction for BZ2024138 - warm migration results summary for 5 cycles:
1. Cloud10
2. CNV 4.9.1-23 (latest)
3. MTV 2.2.0-87
Cycle 1 - 1 VM failed (second snapshot)
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-mtfkq" not found
Cycle 2 - 1 VM failed (first snapshot)
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-42qn5" not found
Cycle 3 - 2 VMs failed (third snapshot)
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2148-pgmkg" not found
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2141-dvx8p" not found
Cycle 4 - 2 VMs failed (second snapshot)
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2145-9qmf5" not found
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2146-9zn9r" not found
Cycle 5 - 1 VM failed (first snapshot)
- "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-vw8mj" not found
Reproduced: 7/50 = 14%
It seems that this bug is not related to the CNV version (it reproduces on both CNV 4.9.1-14 and 4.9.1-23).

When verifying this bug, please also check that the Forklift controller is not reset with a panic message in its log.
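
The 7/50 = 14% reproduction rate quoted for the Cloud10 / CNV 4.9.1-23 run can be recomputed from the per-cycle failure counts. This is only a minimal sketch of that arithmetic: the variable names are made up for illustration, and it assumes each of the 5 cycles migrated 10 VMs, as the setup description indicates.

```python
# Failed VMs per cycle, copied from the Cloud10 / CNV 4.9.1-23 summary.
failed_per_cycle = [1, 1, 2, 2, 1]

# Assumption: every cycle migrated the same 10 VMs.
vms_per_cycle = 10

total_failed = sum(failed_per_cycle)
total_vms = vms_per_cycle * len(failed_per_cycle)
rate = total_failed / total_vms

print(f"{total_failed}/{total_vms} = {rate:.0%}")
```

Keeping the counts per cycle makes it easy to append further cycles and compare rates between builds.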

Further to comment #10, this is the panic error that should not be in the Forklift controller main log:
{"level":"info","ts":1637752173.5547638,"logger":"plan|75gng","msg":"Reconcile ended.","plan":"openshift-mtv/mig-plan-warm-mig","reQ":0}
E1124 11:09:33.554941 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 611 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2828220, 0x45abd80)
    /remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:48 +0x86
panic(0x2828220, 0x45abd80)
    /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
(taken from bug 2024554 attached log)

Please verify with mtv-operator-bundle-container-2.2.0-103 / iib:140554, or later.
Additional info: that should also fix the panic mentioned in comment 11.

Reproduced again with MTV 2.2.0-103 (sha256:8b8c6d58cd656850ccff4e66ead1d9b22dad2676442184dfe00c5bc536793057) on cloud10 during warm migration of 10 VMs using VMware 6.7.
VM -> scale-rhel-scale-fio-50gb-70usage-vm-9 - on the third snapshot
error > pods "importer-10vms-mtv103-fio-2-hosts-vm-2147-stx25" not found

Please verify with mtv-operator-bundle-container-2.2.0-104 / iib:140982, or later.

Verified on:
1. Cloud10
2. MTV 2.2.0-104
3. 10 VMs
4. 2 ESXi hosts, VMware 6.7
5. PRECOPY_INTERVAL = 10 min
Cycle_1 = PASS - 7 snapshots per VM + cut-over
Cycle_2 = PASS - 7 snapshots per VM + cut-over
Cycle_3 = PASS - 5 snapshots per VM + cut-over
Cycle_4 = PASS - 5 snapshots per VM + cut-over
Cycle_5 = PASS - 13 snapshots per VM + cut-over

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (MTV 2.2.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2021:5066
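
For the verification request earlier in this report (checking that the Forklift controller log contains no panic), a check along the following lines might help. It is a rough sketch, not part of MTV: the function name `find_panic_lines` and the marker list are assumptions, with the marker strings taken from the panic line quoted in this report, and the log text is assumed to have already been saved or captured by whatever means is convenient.

```python
# Hypothetical marker strings, taken from the panic line quoted in this report.
PANIC_MARKERS = ("Observed a panic", "nil pointer dereference")


def find_panic_lines(log_text):
    """Return (line_number, line) pairs for log lines containing a panic marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in PANIC_MARKERS):
            hits.append((number, line))
    return hits


# Two sample lines copied from the controller log excerpt in this report.
sample_log = "\n".join([
    '{"level":"info","ts":1637752173.5547638,"logger":"plan|75gng",'
    '"msg":"Reconcile ended.","plan":"openshift-mtv/mig-plan-warm-mig","reQ":0}',
    'E1124 11:09:33.554941 1 runtime.go:78] Observed a panic: '
    '"invalid memory address or nil pointer dereference" '
    '(runtime error: invalid memory address or nil pointer dereference)',
])

for number, line in find_panic_lines(sample_log):
    print(f"line {number}: {line}")
```

An empty result for a full controller log would suggest that no panic of this kind was logged during the run; it does not rule out other failure modes.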