Description of problem:

Migration plan (case 15a) failed on one VM out of 39 VMs due to the following error (taken from the v2v import log):

qemu-img: error while reading sector 82292736: Input/output error

I tried to find the root cause for this failure in the daemon logs, but I wasn't able to correlate the transfer.id a9373fb4-996f-43be-872e-0d00c7b3ba61 to a "ticket" in the daemon logs.

Version-Release number of selected component (if applicable):
CFME: 5.9.4.5.20180821202240_ced41ae
RHV: 4.2.6.4

How reproducible:
Not sure yet.

Steps to Reproduce:
1. Run a migration plan to migrate 40 VMs from VMware to RHV using a VM list as input. The transport method is VDDK.
2. The exact migration plan is detailed here: https://docs.google.com/document/d/1HfYMp92tI5ob3XsTHx3XX1Qhh-G_w9y4LmMizd7NAD4/edit?ts=5b67faab#

Actual results:
Migration plan fails on one VM.

Expected results:
The migration plan should complete.

Additional info:
All relevant logs for this migration can be downloaded from the Gdrive link in a private message below.
I looked at the logs yesterday; adding what I discovered.

1. Last successful read from vddk:

nbdkit: vddk[3]: debug: pread count=2097152 offset=42129686528
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 82284544, 4096, buffer)

2. Last successful write from rhv-upload-plugin:

nbdkit: python[1]: debug: pwrite count=2097152 offset=42127589376 fua=0

3. Both plugins are unloaded?!

nbdkit: debug: python: unload
nbdkit: debug: vddk: unload
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.

4. Not sure what this means:

nbdkit: python[1]: debug: connection cleanup with final status 1

5. qemu-img fails to read from nbdkit on the vddk side?

qemu-img: Unable to read from socket: Connection reset by peer

6. This looks like a result of the previous error, not an error on the imageio side:

qemu-img: error while writing sector 82284544: Input/output error
nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
nbdkit: vddk[3]: debug: connection cleanup with final status 1
qemu-img: Unable to read from socket: Connection reset by peer
qemu-img: error while reading sector 82292736: Input/output error

On the imageio daemon side we have no ERROR or WARNING in the logs; it seems that we have all the logs and nothing was rotated.

We cannot match the virt-v2v log and daemon.log because of bug 1625283.

So this looks like an issue on the vddk side, or maybe a side effect of canceling this transfer?

We have these logs with the transfer.id:

$ grep -r a9373fb4-996f-43be-872e-0d00c7b3ba61 .
./hosts/2018-09-02_18-32-00_b02-h23-r620_case_15a_02-09-conv_host/files/import/v2v-import-20180902T164827-39674.log:transfer.id = 'a9373fb4-996f-43be-872e-0d00c7b3ba61'
./hosts/2018-09-02_18-32-00_b02-h23-r620_case_15a_02-09-conv_host/files/import/v2v-import-20180902T164827-39674-wrapper.log:2018-09-02 16:58:29,795:INFO: Canceling transfer id=a9373fb4-996f-43be-872e-0d00c7b3ba61 for disk=f35a89e8-ca60-445e-a535-7bc41fb569b1 (virt-v2v-wrapper:626)

Richard, can this be a virt-v2v bug?
(In reply to Nir Soffer from comment #3)
> I looked at the logs yesterday, adding what I discovered.

I was able to open the GDrive, but I wasn't able to find out which log you were looking at, so my comments are confined to the excerpts of the log that you posted. As you will see below, I will need to see the full virt-v2v log to make a proper diagnosis. But anyway ...

> 1. Last successful read from vddk:
>
> nbdkit: vddk[3]: debug: pread count=2097152 offset=42129686528
> nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 82284544, 4096, buffer)
>
> 2. Last successful write from rhv-upload-plugin:
>
> nbdkit: python[1]: debug: pwrite count=2097152 offset=42127589376 fua=0
>
> 3. Both plugins are unloaded?!
>
> nbdkit: debug: python: unload
> nbdkit: debug: vddk: unload
> nbdkit: debug: VDDK call: VixDiskLib_Exit ()
> nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.

Note there are two independent copies of nbdkit running, but their debug logs are going to the same file (nbdkit itself only allows you to load a single plugin per instance, which is the clue here). Both copies of nbdkit use the --exit-with-parent feature, which causes Linux to send a kill signal to them when the parent (virt-v2v) exits. On exit nbdkit will unload the plugin. Thus it's expected that you'll see unload messages from both plugins appearing in the logs at about the same time.

> 4. Not sure what this means
>
> nbdkit: python[1]: debug: connection cleanup with final status 1

This message is a bit confusing; we should probably delete it. What it means is that the Python plugin was still processing a connection when it shut down, which is consistent with virt-v2v exiting on error, causing nbdkit to receive an asynchronous shutdown signal. Note that nbdkit does *not* guarantee that the close method is called on connections when it is shut down by a signal; we only guarantee that the unload method of the plugin is called.

> 5. qemu-img fails to read from nbdkit on the vddk side?
>
> qemu-img: Unable to read from socket: Connection reset by peer
>
> 6. This looks like a result of the previous error, not an error on the imageio side.
>
> qemu-img: error while writing sector 82284544: Input/output error
> nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
> nbdkit: vddk[3]: debug: connection cleanup with final status 1
> qemu-img: Unable to read from socket: Connection reset by peer
> qemu-img: error while reading sector 82292736: Input/output error

This is consistent with nbdkit shutting down and the qemu-img input-side NBD socket being abruptly closed.

> On the imageio daemon side we have no ERROR or WARNING in the logs; it
> seems that we have all the logs and nothing was rotated.
>
> We cannot match the virt-v2v log and daemon.log because of bug 1625283.
>
> So this looks like an issue on the vddk side, or maybe a side effect of
> canceling this transfer?

I would say all of this is consistent with virt-v2v exiting. We need to see the logs before the ones excerpted above to find out exactly why virt-v2v exited. If the transfer was cancelled, then I believe virt-v2v is sent an asynchronous signal causing it to exit, which would produce messages consistent with the ones seen above.

> Richard, can this be a virt-v2v bug?

Unclear. On the evidence so far I'd say no.
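The --exit-with-parent behaviour described above can be illustrated with a small, Linux-only Python sketch. This is not virt-v2v or nbdkit code; the process roles and the marker file are invented for the demo. The "middle" process plays the role of virt-v2v and the sleeping child plays the role of nbdkit: the child registers prctl(PR_SET_PDEATHSIG, SIGTERM), which is the kernel mechanism behind nbdkit's --exit-with-parent, so when the middle process dies the kernel delivers SIGTERM to the child.

```python
import os
import subprocess
import sys
import tempfile
import textwrap
import time

# The child stands in for nbdkit: it installs a SIGTERM handler that
# records the signal in a marker file, then sleeps.
CHILD = textwrap.dedent("""\
    import signal, sys, time

    def on_term(signum, frame):
        with open(sys.argv[1], "w") as fp:
            fp.write("got SIGTERM")
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_term)
    time.sleep(60)
""")

# The middle process stands in for virt-v2v: it starts the child with
# PR_SET_PDEATHSIG set (via preexec_fn, which runs in the child after
# fork), then dies abruptly, so the kernel signals the orphaned child.
MIDDLE = textwrap.dedent("""\
    import ctypes, os, signal, subprocess, sys, time

    PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>
    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    def set_pdeathsig():
        libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0)

    subprocess.Popen([sys.executable, "-c", sys.argv[1], sys.argv[2]],
                     preexec_fn=set_pdeathsig)
    time.sleep(1)  # give the child time to install its handler
    os._exit(0)    # the parent dies; the kernel signals the child
""")

def demo():
    marker = tempfile.mktemp(prefix="pdeathsig-demo-")
    subprocess.run([sys.executable, "-c", MIDDLE, CHILD, marker], check=True)
    # Wait for the orphaned child to handle the SIGTERM it received
    # when its parent died.
    for _ in range(50):
        if os.path.exists(marker):
            break
        time.sleep(0.1)
    with open(marker) as fp:
        result = fp.read()
    os.remove(marker)
    return result

if __name__ == "__main__":
    print(demo())  # -> got SIGTERM
```

Note the sketch assumes glibc (libc.so.6) and Linux semantics: the death signal is delivered when the parent thread exits, which matches the "both plugins are unloaded at the same time" observation above, since both nbdkit copies share the same parent.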
Created attachment 1480835 [details]
failing import v2v logs

Richard, attaching the logs I mentioned in comment 3.
Created attachment 1480839 [details]
imageio daemon logs for reference

We don't have a good way to correlate the v2v logs to the daemon log, so I'm attaching all of them.
Ilan, can you attach the relevant engine log matching:

egrep transfer.+a9373fb4-996f-43be-872e-0d00c7b3ba61

In the engine log we should have one line with both the ticket id and the transfer id. Based on that line we can find the correct daemon log for this transfer.
This part from the wrapper log is interesting:

2018-09-02 16:51:13,843:INFO: Copying disk 1/1 (virt-v2v-wrapper:190)
2018-09-02 16:51:13,843:DEBUG: Skipping progress update for unknown disk (virt-v2v-wrapper:230)
2018-09-02 16:51:13,844:INFO: Copying path: [netapp-v2v-datastore] 100_VM_TEST_19/100_VM_TEST_19.vmdk (virt-v2v-wrapper:205)
2018-09-02 16:51:13,844:DEBUG: Found path at correct index (virt-v2v-wrapper:253)
2018-09-02 16:51:13,844:DEBUG: Path '[netapp-v2v-datastore] 100_VM_TEST_19/100_VM_TEST_19.vmdk' has disk id='f35a89e8-ca60-445e-a535-7bc41fb569b1' (virt-v2v-wrapper:237)
2018-09-02 16:52:08,920:DEBUG: Updated progress: 2.00 (virt-v2v-wrapper:226)
2018-09-02 16:52:18,944:DEBUG: Updated progress: 3.00 (virt-v2v-wrapper:226)
2018-09-02 16:53:24,053:DEBUG: Updated progress: 6.01 (virt-v2v-wrapper:226)
2018-09-02 16:53:54,088:DEBUG: Updated progress: 7.01 (virt-v2v-wrapper:226)
2018-09-02 16:54:54,161:DEBUG: Updated progress: 8.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:19,198:DEBUG: Updated progress: 9.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:29,209:DEBUG: Updated progress: 10.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:44,238:DEBUG: Updated progress: 11.02 (virt-v2v-wrapper:226)
2018-09-02 16:55:54,257:DEBUG: Updated progress: 12.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:04,281:DEBUG: Updated progress: 13.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:19,306:DEBUG: Updated progress: 14.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:29,326:DEBUG: Updated progress: 15.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:39,344:DEBUG: Updated progress: 16.03 (virt-v2v-wrapper:226)
2018-09-02 16:56:54,367:DEBUG: Updated progress: 17.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:04,380:DEBUG: Updated progress: 18.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:19,403:DEBUG: Updated progress: 19.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:29,419:DEBUG: Updated progress: 20.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:44,441:DEBUG: Updated progress: 21.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:54,459:DEBUG: Updated progress: 22.03 (virt-v2v-wrapper:226)
2018-09-02 16:58:04,468:DEBUG: Updated progress: 23.04 (virt-v2v-wrapper:226)
2018-09-02 16:58:19,492:DEBUG: Updated progress: 24.04 (virt-v2v-wrapper:226)
2018-09-02 16:58:29,503:INFO: virt-v2v terminated with return code -9 (virt-v2v-wrapper:466)
2018-09-02 16:58:29,506:DEBUG: Cleanup phase (virt-v2v-wrapper:927)
2018-09-02 16:58:29,795:INFO: Canceling transfer id=a9373fb4-996f-43be-872e-0d00c7b3ba61 for disk=f35a89e8-ca60-445e-a535-7bc41fb569b1 (virt-v2v-wrapper:626)
2018-09-02 16:58:29,821:INFO: Removing disks: [] (virt-v2v-wrapper:636)
2018-09-02 16:58:29,885:INFO: Removing password files (virt-v2v-wrapper:935)
2018-09-02 16:58:29,886:INFO: Finished (virt-v2v-wrapper:957)

Looks like virt-v2v was terminated by SIGKILL.
I found another way to correlate the logs - using the disk.id in v2v-import.*.log:

$ grep -r f35a89e8-ca60-445e-a535-7bc41fb569b1 .
./imageio-logs/daemon.log.1:2018-09-02 16:51:25,213 INFO (Thread-1619) [tickets] [local] ADD ticket={u'uuid': u'7a16e39b-a4ba-4eb5-8c27-ef7d998532ba', u'ops': [u'write'], u'url': u'file:///rhev/data-center/mnt/blockSD/2f431d12-72a8-40a1-a3de-7e41a765dde6/images/f35a89e8-ca60-445e-a535-7bc41fb569b1/90b26d5d-2601-4f7b-b69e-2f7af14cf845', u'sparse': False, u'timeout': 300, u'size': 107374182400}

So the relevant ticket is 7a16e39b-a4ba-4eb5-8c27-ef7d998532ba.
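This correlation trick can also be scripted. The helper below is hypothetical (not part of imageio or the wrapper) and assumes only what the excerpt above shows: that the daemon logs each ticket as a Python dict literal after "ADD ticket=", with the disk id embedded in the ticket url.

```python
import ast
import re

# A sample "ADD ticket" line in the format seen in the daemon log above.
LINE = ("2018-09-02 16:51:25,213 INFO (Thread-1619) [tickets] [local] ADD "
        "ticket={u'uuid': u'7a16e39b-a4ba-4eb5-8c27-ef7d998532ba', "
        "u'ops': [u'write'], "
        "u'url': u'file:///rhev/data-center/mnt/blockSD/"
        "2f431d12-72a8-40a1-a3de-7e41a765dde6/images/"
        "f35a89e8-ca60-445e-a535-7bc41fb569b1/"
        "90b26d5d-2601-4f7b-b69e-2f7af14cf845', "
        "u'sparse': False, u'timeout': 300, u'size': 107374182400}")

def ticket_for_disk(lines, disk_id):
    """Return the uuid of the first ticket whose url contains disk_id.

    The ticket dict is parsed with ast.literal_eval, which accepts the
    u'...' literals that the (Python 2) daemon printed.  Returns None
    when no matching ADD line is found.
    """
    for line in lines:
        m = re.search(r"ADD ticket=(\{.*\})", line)
        if not m:
            continue
        ticket = ast.literal_eval(m.group(1))
        if disk_id in ticket["url"]:
            return ticket["uuid"]
    return None

print(ticket_for_disk([LINE], "f35a89e8-ca60-445e-a535-7bc41fb569b1"))
# -> 7a16e39b-a4ba-4eb5-8c27-ef7d998532ba
```

In practice you would pass it the lines of daemon.log* instead of the single sample line.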
Created attachment 1480843 [details]
relevant imageio daemon log
> Looks like virt-v2v was terminated by SIGKILL

I agree. virt-v2v-wrapper uses subprocess.Popen. The "return code -9" comes from printing the returncode attribute:

https://github.com/oVirt/ovirt-ansible-v2v-conversion-host/blob/d16cbc96e40f4aee411f8d338402e84acbb25e0c/files/virt-v2v-wrapper.py#L465

As you can see in the documentation for that attribute:

https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode

negative values indicate that the subprocess was killed by a signal, in this case signal 9 (SIGKILL), "terminate with extreme prejudice".

Do we have any idea what is sending SIGKILL?
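The returncode convention is easy to demonstrate on a POSIX system. This is a standalone illustration, not wrapper code: it starts a sleeping child, kills it with SIGKILL, and inspects the returncode.

```python
import signal
import subprocess
import sys

# Start a child that just sleeps, then deliver SIGKILL to it.
# subprocess reports death-by-signal as a negative returncode,
# which is where the wrapper's "return code -9" comes from.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
rc = child.wait()
print(rc)  # -> -9 on POSIX: killed by signal 9 (SIGKILL)
```

On Windows (where signal.SIGKILL does not exist) this convention does not apply, but both the conversion host and the wrapper run on Linux.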
If we look at the messages log, we see interesting events at around that time:

Sep 2 16:58:28 b02-h23-r620 systemd-logind: New session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd: Started Session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd: Starting Session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd-logind: Removed session 3262.
Sep 2 16:58:29 b02-h23-r620 kernel: nbdkit[10842]: segfault at 7fcc7768686e ip 00007fcc7768686e sp 00007fcc75f56dd8 error 14
Sep 2 16:58:29 b02-h23-r620 abrt-hook-ccpp: Process 40113 (nbdkit) of user 36 killed by SIGSEGV - dumping core
Sep 2 16:58:29 b02-h23-r620 systemd-logind: Removed session 3220.

I don't know if this is the nbdkit process related to the import, because virt-v2v does not log the process ids of its subprocesses.

Ilan, can you get the core dump mentioned by abrt-hook-ccpp?
Forget about the segfault. It's a known bug caused by a race during shutdown, already fixed upstream and available in nbdkit 1.2.6.
Created attachment 1480990 [details]
engine log from the failed migration plan execution

Nir, adding the engine.log as you requested in comment 7.
Created attachment 1480992 [details]
conversion host core dump for SIGSEGV

Nir, attaching the core dump as you requested in comment 12.