Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2024138

Summary: Warm migration fails at "Transfer disks" stage with "Importer pod not found" error
Product: Migration Toolkit for Virtualization
Reporter: Tzahi Ashkenazi <tashkena>
Component: Controller
Assignee: Sam Lucidi <slucidi>
Status: CLOSED ERRATA
QA Contact: Ilanit Stein <istein>
Severity: urgent
Docs Contact: Avital Pinnick <apinnick>
Priority: urgent
Version: 2.2.0
CC: dagur, dvaanunu, fbladilo, fdupont
Target Milestone: ---
Keywords: Regression
Target Release: 2.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-09 19:21:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: 10_vms_warm_error
  Flags: none

Description Tzahi Ashkenazi 2021-11-17 12:04:50 UTC
Created attachment 1842329 [details]
10_vms_warm_error

Description of problem:
During a warm migration of 10 VMs (5 VMs per ESXi host) from vSphere 6.7, with some I/O load on the VMs during the migration, two VMs failed. The following settings were used on the MTV side:
  PRECOPY_INTERVAL = 10
  MAX_VM_INFLIGHT = 10

1. One VM failed after 2 snapshots and exited with the following error:
   VM name: scale-rhel-scale-fio-50gb-70usage-vm-9
   Error from the UI:
   pods "importer-10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2147-2ttmk" not found
2. A second VM failed at the "CutOver" stage:
   VM name: scale-rhel-scale-fio-50gb-70usage-vm-4
   Error from the UI:
   pods "importer-10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2142-5cf9b" not found

3. The remaining 8 of the 10 VMs completed successfully:

[root@f01-h14-000-r640 ~]# oc get pods  |grep 10vms
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2139-kkzv7                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2140-jrt9z                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2141-84lxx                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2143-cd6tv                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2144-rjt87                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2145-vx44f                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2146-54cn9                    0/1     Completed   0          34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2148-lr4v8                    0/1     Completed   0          34m
[root@f01-h14-000-r640 ~]# oc get pods  |grep 10vms |wc -l
8
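
The listing above shows which pods completed, but not which VMs are missing. A minimal sketch of how the missing VM IDs could be derived from that output follows. It assumes (this is not stated in the report) that the ten VMs in the plan carry the consecutive IDs vm-2139 through vm-2148; the function name and constants are illustrative only.

```python
import re

# Pod names and status copied from the `oc get pods | grep 10vms` output above
# (column spacing trimmed).
POD_LISTING = """\
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2139-kkzv7 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2140-jrt9z 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2141-84lxx 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2143-cd6tv 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2144-rjt87 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2145-vx44f 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2146-54cn9 0/1 Completed 0 34m
10vms-2-hosts-warm-mtv-87-max-inflight-10-vm-2148-lr4v8 0/1 Completed 0 34m
"""

# ASSUMPTION: the plan's ten VMs have consecutive IDs vm-2139..vm-2148.
EXPECTED_IDS = set(range(2139, 2149))


def completed_vm_ids(listing):
    """Return the numeric VM IDs of all pods whose STATUS column is Completed."""
    ids = set()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "Completed":
            # Pod names end in "-vm-<id>-<random suffix>".
            match = re.search(r"-vm-(\d+)-[a-z0-9]+$", fields[0])
            if match:
                ids.add(int(match.group(1)))
    return ids


missing = sorted(EXPECTED_IDS - completed_vm_ids(POD_LISTING))
print(missing)  # [2142, 2147]
```

Under that assumption, the two missing IDs match the VM IDs in the two "not found" importer pod names reported in the description.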



Version-Release number of selected component (if applicable):

MTV Build 2.2.0-87
Cloud10 
OCP 4.9.7

Additional info:

The full logs for the failed VMs can be found here:
https://drive.google.com/drive/folders/1lWQp895qqoSuR2UfM8f1oWxi6-SoyNEe?usp=sharing

Comment 2 Fabien Dupont 2021-11-17 15:14:30 UTC
Thanks for reporting this issue. To gather more statistics, would it be possible to run the following tests?

1. Run the same test multiple times in the same cluster to understand if it was just bad luck

2. Run the same test in another cluster to understand if the problem is linked to the cluster

3. Run the same test with only 5 VMs to understand if it's related to the number of concurrent migrations

Comment 3 Tzahi Ashkenazi 2021-11-18 10:28:27 UTC
Reproduced again on Cloud10.

1 VM out of 10 failed warm migration with the same error: pods "importer-bz202413810vms-2-hosts-warm-mtv-87-max-inflight-vm-2145-8wrpf" not found

Comment 4 Tzahi Ashkenazi 2021-11-18 15:19:34 UTC
MTV build 2.2.0-87: results summary for 5 cycles (Cloud10)
    Cycle 1 - 2 VMs failed
    Cycle 2 - 1 VM failed
    Cycle 3 - Pass (5 snapshots per VM + cutover)
    Cycle 4 - Pass (5 snapshots per VM + cutover)
    Cycle 5 - 1 VM failed/stuck on the second snapshot
              pods "importer-bz202413810vms-2-hosts-warm-mtv-87-max-inflight-vm-2147-bpds2" not found

Comment 5 David Vaanunu 2021-11-21 16:32:24 UTC
Tested on Cloud38:
ocp4.9.7
cnv4.9.1-14
mtv2.2.0-87

warm setup:
10 vms
precopy - 10min


Running 3 cycles:
cycle 1 - passed
cycle 2 - 1 VM failed, snapshot #6 (name: - vm7)
cycle 3 - 1 VM failed, snapshot #4 (name: - vm9)

Each cycle had 1 pod that restarted with "nbd_pread: poll: Interrupted system call" - https://bugzilla.redhat.com/show_bug.cgi?id=2021504

Comment 9 Tzahi Ashkenazi 2021-11-22 13:04:52 UTC
Reproduction of BZ2024138 - warm migration results summary for 5 cycles:
1. Cloud10
2. CNV 4.9.1-23 (latest)
3. MTV 2.2.0-87

    Cycle 1 - 1 VM failed  (second snapshot) - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-mtfkq" not found
    Cycle 2 - 1 VM failed  (first snapshot)  - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-42qn5" not found
    Cycle 3 - 2 VMs failed (third snapshot)  - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2148-pgmkg" not found
                                             - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2141-dvx8p" not found
    Cycle 4 - 2 VMs failed (second snapshot) - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2145-9qmf5" not found
                                             - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2146-9zn9r" not found
    Cycle 5 - 1 VM failed  (first snapshot)  - "importer-10vms-2-hosts-warm-mtv-87-max-inflight-cnv-491-23-vm-2147-vw8mj" not found
    Reproduction rate: 7/50 VM migrations = 14%
It seems that this bug is not related to the CNV version (reproduced on both CNV 4.9.1-14 and 4.9.1-23).
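
For reference, the 7/50 figure follows from the per-cycle failure counts listed in this comment, with 10 VMs migrated in each of the 5 cycles. A small sketch of the arithmetic (variable names are illustrative only):

```python
# Failed VMs per cycle, taken from the cycle summary in this comment.
failures_per_cycle = [1, 1, 2, 2, 1]
VMS_PER_CYCLE = 10  # each cycle migrates the same 10 VMs

total_failed = sum(failures_per_cycle)
total_runs = len(failures_per_cycle) * VMS_PER_CYCLE
rate = total_failed / total_runs

print(f"{total_failed}/{total_runs} = {rate:.0%}")  # 7/50 = 14%
```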

Comment 10 Ilanit Stein 2021-11-28 06:55:28 UTC
When verifying this bug, please also check that the Forklift controller has not been reset with a panic message in its log.

Comment 11 Ilanit Stein 2021-11-29 16:20:39 UTC
Further to comment #10, this is the panic error that should not be in the Forklift controller main log:

{"level":"info","ts":1637752173.5547638,"logger":"plan|75gng","msg":"Reconcile ended.","plan":"openshift-mtv/mig-plan-warm-mig","reQ":0}
E1124 11:09:33.554941       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 611 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2828220, 0x45abd80)
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:48 +0x86
panic(0x2828220, 0x45abd80)
	/usr/lib/golang/src/runtime/panic.go:965 +0x1b9

(taken from bug 2024554 attached log)
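
The check requested in comment 10 could be automated along these lines. This is only a sketch: it assumes the controller log has already been saved as text by whatever means is convenient, and it uses the "Observed a panic" string from the excerpt above as the marker. The function name and sample constant are hypothetical.

```python
PANIC_MARKER = "Observed a panic"


def find_panic_lines(log_text):
    """Return (line_number, line) pairs for log lines that report a panic."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if PANIC_MARKER in line:
            hits.append((number, line.strip()))
    return hits


# First three lines of the excerpt quoted in this comment.
SAMPLE_LOG = r'''{"level":"info","ts":1637752173.5547638,"logger":"plan|75gng","msg":"Reconcile ended.","plan":"openshift-mtv/mig-plan-warm-mig","reQ":0}
E1124 11:09:33.554941       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 611 [running]:
'''

hits = find_panic_lines(SAMPLE_LOG)
for number, line in hits:
    print(f"line {number}: {line}")
```

An empty result for the full controller log would indicate that no panic of this form was logged during the run.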

Comment 12 Fabien Dupont 2021-12-01 10:57:43 UTC
Please verify with mtv-operator-bundle-container-2.2.0-103 / iib:140554, or later.

Additional info: that should also fix the panic mentioned in comment 11.

Comment 13 Tzahi Ashkenazi 2021-12-01 12:43:58 UTC
Reproduced again with MTV 2.2.0-103 (sha256:8b8c6d58cd656850ccff4e66ead1d9b22dad2676442184dfe00c5bc536793057)
on Cloud10 during warm migration of 10 VMs using VMware 6.7.

VM: scale-rhel-scale-fio-50gb-70usage-vm-9, on the third snapshot
Error: pods "importer-10vms-mtv103-fio-2-hosts-vm-2147-stx25" not found

Comment 14 Fabien Dupont 2021-12-02 09:59:00 UTC
Please verify with mtv-operator-bundle-container-2.2.0-104 / iib:140982, or later.

Comment 15 Tzahi Ashkenazi 2021-12-02 17:41:10 UTC
Verified on:
1. Cloud10
2. MTV 2.2.0-104
3. 10 VMs
4. 2 ESXi hosts, VMware 6.7
5. PRECOPY_INTERVAL = 10 min
Cycle_1 = PASS - 7 snapshots per VM + cut-over
Cycle_2 = PASS - 7 snapshots per VM + cut-over
Cycle_3 = PASS - 5 snapshots per VM + cut-over
Cycle_4 = PASS - 5 snapshots per VM + cut-over
Cycle_5 = PASS - 13 snapshots per VM + cut-over

Comment 18 errata-xmlrpc 2021-12-09 19:21:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (MTV 2.2.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:5066