Bug 1726439 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file.
Summary: [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: V2V
Version: 5.10.6
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: GA
: 5.10.10
Assignee: Daniel Berger
QA Contact: Ilanit Stein
Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On: 1719689
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-07-02 20:38 UTC by Satoe Imaishi
Modified: 2019-09-18 13:55 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1719689
Environment:
Last Closed: 2019-08-21 18:54:26 UTC
Category: ---
Cloudforms Team: V2V
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
evm.log (4.45 MB, application/gzip)
2019-07-17 15:08 UTC, Ilanit Stein
no flags Details
automation.log (2.43 MB, application/gzip)
2019-07-17 15:09 UTC, Ilanit Stein
no flags Details
v2v-import-wrapper.log (13.35 KB, text/plain)
2019-07-17 15:11 UTC, Ilanit Stein
no flags Details
v2v-import.log (3.53 MB, text/plain)
2019-07-17 15:20 UTC, Ilanit Stein
no flags Details

Comment 2 CFME Bot 2019-07-02 20:46:22 UTC
New commit detected on ManageIQ/manageiq-gems-pending/hammer:

https://github.com/ManageIQ/manageiq-gems-pending/commit/7300bf53e17784a7c12dc1133239bc8b76a61f8a
commit 7300bf53e17784a7c12dc1133239bc8b76a61f8a
Author:     Jason Frey <jfrey>
AuthorDate: Tue Jul  2 16:16:08 2019 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Tue Jul  2 16:16:08 2019 -0400

    Merge pull request #437 from djberg96/miq_ssh_util_updates

    MiqSshUtil updates

    (cherry picked from commit 02c9fb67797331640e2840682d8234f2f4a5415c)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1726439

 lib/gems/pending/util/MiqSshUtil.rb | 43 +-
 spec/util/miq_ssh_util_spec.rb | 57 +-
 2 files changed, 73 insertions(+), 27 deletions(-)

Comment 3 Ilanit Stein 2019-07-17 08:10:28 UTC
Verified on 
cfme-5.10.7.1
RHV-4.3.5.4-0.1.el7

No errors in evm.log during the following v2v VMware->RHV successful 2 migrations:
Mac concurrent migrations per provider: 20

1. 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 10. 
2. 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 5.

Comment 4 Ilanit Stein 2019-07-17 14:13:41 UTC
Moving bug back to ASSIGNED, due to that error in the bug description was reproduced, 
on a 3rd migration of 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 10. 
Same system of:
cfme-5.10.7.1
RHV-4.3.5.4-0.1.el7

This error appear in evm.log:

[----] E, [2019-07-17T05:59:02.202245 #46834:106af58] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Could not parse conversion state data from file '/tmp/v2v-import-20190717T122139-1858.state':   Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-17T05:59:02.202441 #46834:106af58] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/conversion_host.rb:151:in `rescue in get_conversion_state'
/var/www/miq/vmdb/app/models/conversion_host.rb:145:in `get_conversion_state'
/var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:210:in `get_conversion_state'
/var/www/miq/vmdb/app/models/infra_conversion_job.rb:94:in `poll_conversion'
/var/www/miq/vmdb/app/models/job/state_machine.rb:34:in `signal'
/var/www/miq/vmdb/app/models/miq_queue.rb:455:in `block in dispatch_method'
/usr/share/ruby/timeout.rb:93:in `block in timeout'
/usr/share/ruby/timeout.rb:33:in `block in catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:108:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:453:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:275:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:137:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:155:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:408:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:406:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:447:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:277:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `block in monitor'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block (2 levels) in monitor_loop'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:241:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'

That terminates the conversion. virt-v2v-wrapper log:
2019-07-17 13:00:33,815:DEBUG: Updated progress: 24.03 (virt-v2v-wrapper:1254)
2019-07-17 13:01:35,111:INFO: virt-v2v terminated with return code 0 (virt-v2v-wrapper:1901)

Comment 6 Ilanit Stein 2019-07-17 15:08:39 UTC
Created attachment 1591455 [details]
evm.log

VMs migration started July 17, @5:20. then it failed to parse the file @5:59

Comment 7 Ilanit Stein 2019-07-17 15:09:24 UTC
Created attachment 1591456 [details]
automation.log

VMs migration started July 17, @5:20. then it failed to parse the file @5:59

Comment 8 Ilanit Stein 2019-07-17 15:11:40 UTC
Created attachment 1591457 [details]
v2v-import-wrapper.log

Comment 9 Ilanit Stein 2019-07-17 15:20:41 UTC
Created attachment 1591459 [details]
v2v-import.log

Comment 11 Ilanit Stein 2019-07-17 15:40:43 UTC
[root@leopard03 import]# cat /tmp/v2v-import-20190717T122139-1858.state
{"throttling": {"cpu": null, "network": null}, "started": true, "disks": [{"path": "[HP-iscsi-1] v2v_migration_vm_12/v2v_migration_vm_12.vmdk", "progress": 24.03}], "pid": 1937, "disk_count": 1, "return_code": 0, "failed": false, "finished": true}[root@leopard03 import]#

Comment 12 Daniel Berger 2019-07-17 16:28:52 UTC
Based on this in the evm.log it looks like it's returning a blank string:

Conversion error: Could not parse conversion state data from file '/tmp/v2v-import-20190717T122139-1858.state':

Notice there is no text after the colon.

I would think that means the ssh call actually failed but we got a bogus return result (a blank string) from MiqSshUtil, instead of the real error. My fix should have taken care of that possibility.

The other possibility is that it succeeded but didn't come over the wire for some reason.

Comment 13 Daniel Berger 2019-07-17 17:21:26 UTC
Very strange, it looks like Net::SFTP#download! (what that method ultimately calls behind the scenes) is not raising an error, but returning a blank string in some cases.

I've made a local modification to the appliance that will sleep a few seconds, log a warning message and retry the download. If it's still blank it will raise a slightly different error.

Let's see how that goes.

Comment 14 Daniel Berger 2019-07-17 20:00:36 UTC
As an experiment I added a sleep(3) + retry inside the get_conversion_state method in conversion_host.rb if the result is blank. Only if it's blank a second time does it raise an error.

The result was that when Ilanit restarted the appliance and ran the migration it completed successfully. Afterwards there was one instance where it did resort to a retry, but then it was successful.

This should work as a solution, though I want to spend some time seeing if I can duplicate that behavior in a controlled manner.

Comment 15 Daniel Berger 2019-07-17 20:02:05 UTC
Meant to say "Afterwards I checked the evm.log file and there was one instance where I saw a warning indicating that it resorted to a retry..."

Comment 16 Fabien Dupont 2019-07-29 15:11:24 UTC
@ilanit, have you been able to run your test only with wrapper patch ? If it works, then it could mean that the retry is not needed and that Dan's PR can be closed.

Comment 17 Ilanit Stein 2019-07-30 12:40:22 UTC
@Fabien,
I ran only 2 runs of 20VMs/100GB disk so far, with the wrapper patch on my conversion hosts.
I'll need to do further runs, to confirm this patch.
I shall update here with the results.

Comment 19 Ilanit Stein 2019-08-13 06:36:24 UTC
In reply to comment 16:
bug 1719689 can be tracked, as we are testing now CFME-5.11.
I asked there to add the wrapper fix, in order to verify it,
since so far I run it successfully only twice, on our TLV lab,
and now after our transition to the RDU(US) lab, I'd like to continue testing it there.
So waiting for the fix to be applied.

Comment 20 Daniel Berger 2019-08-21 18:54:26 UTC
Seems to be fixed by the v2v wrapper patch. Closing for now.

Comment 21 Ilanit Stein 2019-09-05 12:50:05 UTC
This bug was last seen on large scale RHV-4.3.5/CFME-5.11.0.22.
However,
this RHV is not free to use ATM.
Once it is, I'll try to reproduce, and update.

Comment 22 Ilanit Stein 2019-09-18 13:55:59 UTC
I tried to reproduce it on scale RHV environment RHV-4.3.6, with CFME-5.11.0.24. however it did not reproduce so far.
if/once I'll see it again, I'll update this bug.


Note You need to log in before you can comment on or make changes to this bug.