New commit detected on ManageIQ/manageiq-gems-pending/hammer:
https://github.com/ManageIQ/manageiq-gems-pending/commit/7300bf53e17784a7c12dc1133239bc8b76a61f8a

commit 7300bf53e17784a7c12dc1133239bc8b76a61f8a
Author:     Jason Frey <jfrey>
AuthorDate: Tue Jul 2 16:16:08 2019 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Tue Jul 2 16:16:08 2019 -0400

    Merge pull request #437 from djberg96/miq_ssh_util_updates

    MiqSshUtil updates

    (cherry picked from commit 02c9fb67797331640e2840682d8234f2f4a5415c)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1726439

 lib/gems/pending/util/MiqSshUtil.rb | 43 +-
 spec/util/miq_ssh_util_spec.rb      | 57 +-
 2 files changed, 73 insertions(+), 27 deletions(-)
Verified on cfme-5.10.7.1, RHV-4.3.5.4-0.1.el7.

No errors in evm.log during the following 2 successful v2v VMware->RHV migrations (Max concurrent migrations per provider: 20):
1. 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 10.
2. 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 5.
Moving the bug back to ASSIGNED, since the error in the bug description was reproduced on a 3rd migration of 20 VMs/100GB disk, ISCSI->FC, Max concurrent migrations per conversion host: 10.
Same system: cfme-5.10.7.1, RHV-4.3.5.4-0.1.el7.

This error appears in evm.log:

[----] E, [2019-07-17T05:59:02.202245 #46834:106af58] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Could not parse conversion state data from file '/tmp/v2v-import-20190717T122139-1858.state': Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-17T05:59:02.202441 #46834:106af58] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/conversion_host.rb:151:in `rescue in get_conversion_state'
/var/www/miq/vmdb/app/models/conversion_host.rb:145:in `get_conversion_state'
/var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:210:in `get_conversion_state'
/var/www/miq/vmdb/app/models/infra_conversion_job.rb:94:in `poll_conversion'
/var/www/miq/vmdb/app/models/job/state_machine.rb:34:in `signal'
/var/www/miq/vmdb/app/models/miq_queue.rb:455:in `block in dispatch_method'
/usr/share/ruby/timeout.rb:93:in `block in timeout'
/usr/share/ruby/timeout.rb:33:in `block in catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:108:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:453:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:275:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:137:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:155:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:408:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:406:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:447:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:277:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `block in monitor'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block (2 levels) in monitor_loop'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:241:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'

That terminates the conversion.

virt-v2v-wrapper log:
2019-07-17 13:00:33,815:DEBUG: Updated progress: 24.03 (virt-v2v-wrapper:1254)
2019-07-17 13:01:35,111:INFO: virt-v2v terminated with return code 0 (virt-v2v-wrapper:1901)
Created attachment 1591455 [details]
evm.log

The VM migrations started on July 17 at 5:20; parsing of the state file then failed at 5:59.
Created attachment 1591456 [details]
automation.log

The VM migrations started on July 17 at 5:20; parsing of the state file then failed at 5:59.
Created attachment 1591457 [details]
v2v-import-wrapper.log
Created attachment 1591459 [details]
v2v-import.log
[root@leopard03 import]# cat /tmp/v2v-import-20190717T122139-1858.state
{"throttling": {"cpu": null, "network": null}, "started": true, "disks": [{"path": "[HP-iscsi-1] v2v_migration_vm_12/v2v_migration_vm_12.vmdk", "progress": 24.03}], "pid": 1937, "disk_count": 1, "return_code": 0, "failed": false, "finished": true}[root@leopard03 import]#
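As a side note, the state file content shown above parses cleanly as plain JSON, which points at the transfer rather than the file itself. A quick standalone check (not product code, just the content pasted above):

require 'json'

# Content copied verbatim from /tmp/v2v-import-20190717T122139-1858.state above
raw = '{"throttling": {"cpu": null, "network": null}, "started": true, ' \
      '"disks": [{"path": "[HP-iscsi-1] v2v_migration_vm_12/v2v_migration_vm_12.vmdk", "progress": 24.03}], ' \
      '"pid": 1937, "disk_count": 1, "return_code": 0, "failed": false, "finished": true}'

state = JSON.parse(raw)
puts state["finished"]    # => true
puts state["return_code"] # => 0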
Based on this entry in evm.log, it looks like it's returning a blank string:

Conversion error: Could not parse conversion state data from file '/tmp/v2v-import-20190717T122139-1858.state':

Notice there is no text after the final colon. I would think that means the ssh call actually failed, but we got a bogus return result (a blank string) from MiqSshUtil instead of the real error. My fix should have taken care of that possibility. The other possibility is that the call succeeded but the data didn't come over the wire for some reason.
Very strange. It looks like Net::SFTP#download! (what that method ultimately calls behind the scenes) is not raising an error, but is returning a blank string in some cases. I've made a local modification to the appliance that will sleep a few seconds, log a warning message and retry the download. If the result is still blank, it will raise a slightly different error. Let's see how that goes.
As an experiment I added a sleep(3) + retry inside the get_conversion_state method in conversion_host.rb if the result is blank. Only if it's blank a second time does it raise an error. The result was that when Ilanit restarted the appliance and ran the migration it completed successfully. Afterwards there was one instance where it did resort to a retry, but then it was successful. This should work as a solution, though I want to spend some time seeing if I can duplicate that behavior in a controlled manner.
Meant to say "Afterwards I checked the evm.log file and there was one instance where I saw a warning indicating that it resorted to a retry..."
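For reference, a rough sketch of the retry-on-blank workaround described above. The class, helper names and connection details are illustrative only, not the actual conversion_host.rb code:

require 'json'
require 'net/sftp'

# Minimal sketch of the retry described in the comments above.
# Host/user/password handling and method names are assumptions for
# illustration; the real ConversionHost implementation differs.
class ConversionStateFetcher
  def initialize(host, user, password)
    @host, @user, @password = host, user, password
  end

  def get_conversion_state(path)
    json_state = fetch_state_file(path)
    if json_state.to_s.strip.empty?
      # The real code would log through ManageIQ's _log; plain warn is used here.
      warn("State file '#{path}' came back blank, retrying once after a short sleep")
      sleep(3)
      json_state = fetch_state_file(path)
      raise "Could not get conversion state data from file '#{path}': result was blank" if json_state.to_s.strip.empty?
    end
    JSON.parse(json_state)
  rescue JSON::ParserError => err
    raise "Could not parse conversion state data from file '#{path}': #{err.message}"
  end

  private

  def fetch_state_file(path)
    content = nil
    # Net::SFTP#download! with no local target returns the file content as a string.
    Net::SFTP.start(@host, @user, :password => @password) do |sftp|
      content = sftp.download!(path)
    end
    content
  end
end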
@ilanit, have you been able to run your test with only the wrapper patch? If it works, then it could mean that the retry is not needed and that Dan's PR can be closed.
@Fabien, I have done only 2 runs of 20 VMs/100GB disk so far with the wrapper patch on my conversion hosts. I'll need to do further runs to confirm this patch, and I'll update here with the results.
The patch in question: https://github.com/oVirt/v2v-conversion-host/commit/aedb0ebcc8860a6bd6e7ea548de1bf4237b1bb17
In reply to comment 16: bug 1719689 can be tracked, as we are now testing CFME-5.11. I asked there to add the wrapper fix so I can verify it, since so far I have run it successfully only twice, in our TLV lab. Now, after our transition to the RDU (US) lab, I'd like to continue testing it there, so I'm waiting for the fix to be applied.
Seems to be fixed by the v2v wrapper patch. Closing for now.
This bug was last seen on a large-scale RHV-4.3.5/CFME-5.11.0.22 setup. However, that RHV environment is not free to use at the moment. Once it is, I'll try to reproduce and update.
I tried to reproduce it on a scale RHV environment, RHV-4.3.6 with CFME-5.11.0.24, but it has not reproduced so far. If/once I see it again, I'll update this bug.