Description of problem:

Migration plan (case 15a) failed on one VM out of 39 VMs due to the following error (taken from the v2v import log):

qemu-img: error while reading sector 82292736: Input/output error

I tried to find the root cause for this failure in the daemon logs, but I wasn't able to correlate the transfer.id a9373fb4-996f-43be-872e-0d00c7b3ba61 to a "ticket" in the daemon logs.

Version-Release number of selected component (if applicable):
CFME: 5.9.4.5.20180821202240_ced41ae
RHV: 4.2.6.4

How reproducible:
Not sure yet.

Steps to Reproduce:
1. Run a migration plan to migrate 40 VMs from VMware to RHV using a VM list as input. The transport method is VDDK.
2. The exact migration plan is detailed here: https://docs.google.com/document/d/1HfYMp92tI5ob3XsTHx3XX1Qhh-G_w9y4LmMizd7NAD4/edit?ts=5b67faab#

Actual results:
Migration plan fails on one VM.

Expected results:
The migration plan should complete.

Additional info:
All relevant logs for this migration can be downloaded from the Gdrive link in a private message below.
I looked at the logs yesterday; adding what I discovered.

1. Last successful read from vddk:

nbdkit: vddk[3]: debug: pread count=2097152 offset=42129686528
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 82284544, 4096, buffer)

2. Last successful write from rhv-upload-plugin:

nbdkit: python[1]: debug: pwrite count=2097152 offset=42127589376 fua=0

3. Both plugins are unloaded?!

nbdkit: debug: python: unload
nbdkit: debug: vddk: unload
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.

4. Not sure what this means:

nbdkit: python[1]: debug: connection cleanup with final status 1

5. qemu-img fails to read from nbdkit on the vddk side?

qemu-img: Unable to read from socket: Connection reset by peer

6. This looks like a result of the previous error, not an error on the imageio side:

qemu-img: error while writing sector 82284544: Input/output error
nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
nbdkit: vddk[3]: debug: connection cleanup with final status 1
qemu-img: Unable to read from socket: Connection reset by peer
qemu-img: error while reading sector 82292736: Input/output error

On the imageio daemon side we have no ERROR or WARNING in the logs; it seems that we have all the logs and nothing was rotated.

We cannot match the virt-v2v log and daemon.log because of bug 1625283.

So this looks like an issue on the vddk side, or maybe a side effect of canceling this transfer?

We have these logs with the transfer.id:

$ grep -r a9373fb4-996f-43be-872e-0d00c7b3ba61 .
./hosts/2018-09-02_18-32-00_b02-h23-r620_case_15a_02-09-conv_host/files/import/v2v-import-20180902T164827-39674.log:transfer.id = 'a9373fb4-996f-43be-872e-0d00c7b3ba61'
./hosts/2018-09-02_18-32-00_b02-h23-r620_case_15a_02-09-conv_host/files/import/v2v-import-20180902T164827-39674-wrapper.log:2018-09-02 16:58:29,795:INFO: Canceling transfer id=a9373fb4-996f-43be-872e-0d00c7b3ba61 for disk=f35a89e8-ca60-445e-a535-7bc41fb569b1 (virt-v2v-wrapper:626)

Richard, can this be a virt-v2v bug?
(In reply to Nir Soffer from comment #3)
> I looked at the logs yesterday, adding what I discovered.

I was able to open the GDrive, but I wasn't able to find out which log you were looking at, so my comments are confined to the excerpts of the log that you posted. As you will see below, I will need to see the full virt-v2v log to make a proper diagnosis. But anyway ...

> 1. Last successful read from vddk:
>
> nbdkit: vddk[3]: debug: pread count=2097152 offset=42129686528
> nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 82284544, 4096, buffer)
>
> 2. Last successful write from rhv-upload-plugin:
>
> nbdkit: python[1]: debug: pwrite count=2097152 offset=42127589376 fua=0
>
> 3. Both plugins are unloaded?!
>
> nbdkit: debug: python: unload
> nbdkit: debug: vddk: unload
> nbdkit: debug: VDDK call: VixDiskLib_Exit ()
> nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.

Note there are two independent copies of nbdkit running, but their debug logs are going to the same file (nbdkit itself only allows you to load a single plugin per instance, which is the clue here). Both copies of nbdkit use the --exit-with-parent feature, which causes Linux to send a kill signal to them when the parent (virt-v2v) exits. On exit nbdkit will unload the plugin. Thus it's expected that you'll see unload messages from both plugins appearing in the logs at about the same time.

> 4. Not sure what this means
>
> nbdkit: python[1]: debug: connection cleanup with final status 1

This message is a bit confusing; we should probably delete it. What it means is that the Python plugin was still processing a connection when it shut down, which is consistent with virt-v2v exiting on error, causing nbdkit to receive an asynchronous shutdown signal. Note that nbdkit does *not* guarantee that the close method is called on connections when it is shut down by a signal; we only guarantee that the unload method of the plugin is called.

> 5. qemu-img fails to read from nbdkit on the vddk side?
>
> qemu-img: Unable to read from socket: Connection reset by peer
>
> 6. This looks like a result of the previous error, not an error on the imageio side.
>
> qemu-img: error while writing sector 82284544: Input/output error
> nbdkit: debug: VixDiskLibVim: VixDiskLibVim_Exit: Clean up.
> nbdkit: vddk[3]: debug: connection cleanup with final status 1
> qemu-img: Unable to read from socket: Connection reset by peer
> qemu-img: error while reading sector 82292736: Input/output error

This is consistent with nbdkit shutting down and the qemu-img input-side NBD socket being abruptly closed.

> On the imageio daemon side we have no ERROR or WARNING in the logs; it
> seems that we have all the logs and nothing was rotated.
>
> We cannot match the virt-v2v log and daemon.log because of bug 1625283.
>
> So this looks like an issue on the vddk side, or maybe a side effect of
> canceling this transfer?

I would say all of this is consistent with virt-v2v exiting. We need to see the logs before the ones excerpted above to find out exactly why virt-v2v exited. If the transfer was cancelled, then I believe virt-v2v is sent an asynchronous signal causing it to exit, which would produce messages consistent with the ones seen above.

> Richard, can this be a virt-v2v bug?

Unclear. On the evidence so far I'd say no.
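The --exit-with-parent behaviour described above can be illustrated with a small, Linux-only Python sketch. This is not virt-v2v or nbdkit code; the process roles and the marker file are invented for the demo. The "middle" process plays the role of virt-v2v and the sleeping child plays the role of nbdkit: the child registers prctl(PR_SET_PDEATHSIG, SIGTERM), which is the kernel mechanism behind nbdkit's --exit-with-parent, so when the middle process dies the kernel delivers SIGTERM to the child.

```python
import os
import subprocess
import sys
import tempfile
import textwrap
import time

# The child stands in for nbdkit: it installs a SIGTERM handler that
# records the signal in a marker file, then sleeps.
CHILD = textwrap.dedent("""\
    import signal, sys, time

    def on_term(signum, frame):
        with open(sys.argv[1], "w") as fp:
            fp.write("got SIGTERM")
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_term)
    time.sleep(60)
""")

# The middle process stands in for virt-v2v: it starts the child with
# PR_SET_PDEATHSIG set (via preexec_fn, which runs in the child after
# fork), then dies abruptly, so the kernel signals the orphaned child.
MIDDLE = textwrap.dedent("""\
    import ctypes, os, signal, subprocess, sys, time

    PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>
    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    def set_pdeathsig():
        libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0)

    subprocess.Popen([sys.executable, "-c", sys.argv[1], sys.argv[2]],
                     preexec_fn=set_pdeathsig)
    time.sleep(1)  # give the child time to install its handler
    os._exit(0)    # the parent dies; the kernel signals the child
""")

def demo():
    marker = tempfile.mktemp(prefix="pdeathsig-demo-")
    subprocess.run([sys.executable, "-c", MIDDLE, CHILD, marker], check=True)
    # Wait for the orphaned child to handle the SIGTERM it received
    # when its parent died.
    for _ in range(50):
        if os.path.exists(marker):
            break
        time.sleep(0.1)
    with open(marker) as fp:
        result = fp.read()
    os.remove(marker)
    return result

if __name__ == "__main__":
    print(demo())  # -> got SIGTERM
```

Note the sketch assumes glibc (libc.so.6) and Linux semantics: the death signal is delivered when the parent thread exits, which matches the "both plugins are unloaded at the same time" observation above, since both nbdkit copies share the same parent.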
Created attachment 1480835 [details]
failing import v2v logs

Richard, attaching the logs I mentioned in comment 3.
Created attachment 1480839 [details]
imageio daemon logs for reference

We don't have a good way to correlate the v2v logs to the daemon log, so I'm attaching all of them.
Ilan, can you attach the relevant engine log matching:

egrep transfer.+a9373fb4-996f-43be-872e-0d00c7b3ba61

In the engine log we should have one line with both the ticket id and the transfer id. Based on that line we can find the correct daemon log for this transfer.
This part from the wrapper log is interesting:

2018-09-02 16:51:13,843:INFO: Copying disk 1/1 (virt-v2v-wrapper:190)
2018-09-02 16:51:13,843:DEBUG: Skipping progress update for unknown disk (virt-v2v-wrapper:230)
2018-09-02 16:51:13,844:INFO: Copying path: [netapp-v2v-datastore] 100_VM_TEST_19/100_VM_TEST_19.vmdk (virt-v2v-wrapper:205)
2018-09-02 16:51:13,844:DEBUG: Found path at correct index (virt-v2v-wrapper:253)
2018-09-02 16:51:13,844:DEBUG: Path '[netapp-v2v-datastore] 100_VM_TEST_19/100_VM_TEST_19.vmdk' has disk id='f35a89e8-ca60-445e-a535-7bc41fb569b1' (virt-v2v-wrapper:237)
2018-09-02 16:52:08,920:DEBUG: Updated progress: 2.00 (virt-v2v-wrapper:226)
2018-09-02 16:52:18,944:DEBUG: Updated progress: 3.00 (virt-v2v-wrapper:226)
2018-09-02 16:53:24,053:DEBUG: Updated progress: 6.01 (virt-v2v-wrapper:226)
2018-09-02 16:53:54,088:DEBUG: Updated progress: 7.01 (virt-v2v-wrapper:226)
2018-09-02 16:54:54,161:DEBUG: Updated progress: 8.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:19,198:DEBUG: Updated progress: 9.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:29,209:DEBUG: Updated progress: 10.01 (virt-v2v-wrapper:226)
2018-09-02 16:55:44,238:DEBUG: Updated progress: 11.02 (virt-v2v-wrapper:226)
2018-09-02 16:55:54,257:DEBUG: Updated progress: 12.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:04,281:DEBUG: Updated progress: 13.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:19,306:DEBUG: Updated progress: 14.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:29,326:DEBUG: Updated progress: 15.02 (virt-v2v-wrapper:226)
2018-09-02 16:56:39,344:DEBUG: Updated progress: 16.03 (virt-v2v-wrapper:226)
2018-09-02 16:56:54,367:DEBUG: Updated progress: 17.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:04,380:DEBUG: Updated progress: 18.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:19,403:DEBUG: Updated progress: 19.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:29,419:DEBUG: Updated progress: 20.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:44,441:DEBUG: Updated progress: 21.03 (virt-v2v-wrapper:226)
2018-09-02 16:57:54,459:DEBUG: Updated progress: 22.03 (virt-v2v-wrapper:226)
2018-09-02 16:58:04,468:DEBUG: Updated progress: 23.04 (virt-v2v-wrapper:226)
2018-09-02 16:58:19,492:DEBUG: Updated progress: 24.04 (virt-v2v-wrapper:226)
2018-09-02 16:58:29,503:INFO: virt-v2v terminated with return code -9 (virt-v2v-wrapper:466)
2018-09-02 16:58:29,506:DEBUG: Cleanup phase (virt-v2v-wrapper:927)
2018-09-02 16:58:29,795:INFO: Canceling transfer id=a9373fb4-996f-43be-872e-0d00c7b3ba61 for disk=f35a89e8-ca60-445e-a535-7bc41fb569b1 (virt-v2v-wrapper:626)
2018-09-02 16:58:29,821:INFO: Removing disks: [] (virt-v2v-wrapper:636)
2018-09-02 16:58:29,885:INFO: Removing password files (virt-v2v-wrapper:935)
2018-09-02 16:58:29,886:INFO: Finished (virt-v2v-wrapper:957)

Looks like virt-v2v was terminated by SIGKILL.
I found another way to correlate the logs - using the disk.id in v2v-import.*.log:

$ grep -r f35a89e8-ca60-445e-a535-7bc41fb569b1 .
./imageio-logs/daemon.log.1:2018-09-02 16:51:25,213 INFO (Thread-1619) [tickets] [local] ADD ticket={u'uuid': u'7a16e39b-a4ba-4eb5-8c27-ef7d998532ba', u'ops': [u'write'], u'url': u'file:///rhev/data-center/mnt/blockSD/2f431d12-72a8-40a1-a3de-7e41a765dde6/images/f35a89e8-ca60-445e-a535-7bc41fb569b1/90b26d5d-2601-4f7b-b69e-2f7af14cf845', u'sparse': False, u'timeout': 300, u'size': 107374182400}

So the relevant ticket is 7a16e39b-a4ba-4eb5-8c27-ef7d998532ba.
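This correlation trick can also be scripted. The helper below is hypothetical (not part of imageio or the wrapper) and assumes only what the excerpt above shows: that the daemon logs each ticket as a Python dict literal after "ADD ticket=", with the disk id embedded in the ticket url.

```python
import ast
import re

# A sample "ADD ticket" line in the format seen in the daemon log above.
LINE = ("2018-09-02 16:51:25,213 INFO (Thread-1619) [tickets] [local] ADD "
        "ticket={u'uuid': u'7a16e39b-a4ba-4eb5-8c27-ef7d998532ba', "
        "u'ops': [u'write'], "
        "u'url': u'file:///rhev/data-center/mnt/blockSD/"
        "2f431d12-72a8-40a1-a3de-7e41a765dde6/images/"
        "f35a89e8-ca60-445e-a535-7bc41fb569b1/"
        "90b26d5d-2601-4f7b-b69e-2f7af14cf845', "
        "u'sparse': False, u'timeout': 300, u'size': 107374182400}")

def ticket_for_disk(lines, disk_id):
    """Return the uuid of the first ticket whose url contains disk_id.

    The ticket dict is parsed with ast.literal_eval, which accepts the
    u'...' literals that the (Python 2) daemon printed.  Returns None
    when no matching ADD line is found.
    """
    for line in lines:
        m = re.search(r"ADD ticket=(\{.*\})", line)
        if not m:
            continue
        ticket = ast.literal_eval(m.group(1))
        if disk_id in ticket["url"]:
            return ticket["uuid"]
    return None

print(ticket_for_disk([LINE], "f35a89e8-ca60-445e-a535-7bc41fb569b1"))
# -> 7a16e39b-a4ba-4eb5-8c27-ef7d998532ba
```

In practice you would pass it the lines of daemon.log* instead of the single sample line.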
Created attachment 1480843 [details]
relevant imageio daemon log
> Looks like virt-v2v was terminated by SIGKILL

I agree. virt-v2v-wrapper uses subprocess.Popen. The "return code -9" comes from printing the returncode attribute:

https://github.com/oVirt/ovirt-ansible-v2v-conversion-host/blob/d16cbc96e40f4aee411f8d338402e84acbb25e0c/files/virt-v2v-wrapper.py#L465

As you can see in the documentation for that attribute:

https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode

negative values indicate that the subprocess was killed by a signal, in this case signal 9 (SIGKILL), "terminate with extreme prejudice".

Do we have any idea what is sending SIGKILL?
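The returncode convention is easy to demonstrate on a POSIX system. This is a standalone illustration, not wrapper code: it starts a sleeping child, kills it with SIGKILL, and inspects the returncode.

```python
import signal
import subprocess
import sys

# Start a child that just sleeps, then deliver SIGKILL to it.
# subprocess reports death-by-signal as a negative returncode,
# which is where the wrapper's "return code -9" comes from.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
rc = child.wait()
print(rc)  # -> -9 on POSIX: killed by signal 9 (SIGKILL)
```

On Windows (where signal.SIGKILL does not exist) this convention does not apply, but both the conversion host and the wrapper run on Linux.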
If we look at the messages log, we see interesting events at around that time:

Sep 2 16:58:28 b02-h23-r620 systemd-logind: New session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd: Started Session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd: Starting Session 3262 of user root.
Sep 2 16:58:28 b02-h23-r620 systemd-logind: Removed session 3262.
Sep 2 16:58:29 b02-h23-r620 kernel: nbdkit[10842]: segfault at 7fcc7768686e ip 00007fcc7768686e sp 00007fcc75f56dd8 error 14
Sep 2 16:58:29 b02-h23-r620 abrt-hook-ccpp: Process 40113 (nbdkit) of user 36 killed by SIGSEGV - dumping core
Sep 2 16:58:29 b02-h23-r620 systemd-logind: Removed session 3220.

I don't know if this is the nbdkit process related to the import, because virt-v2v does not log the process ids of its subprocesses.

Ilan, can you get the core dump mentioned by abrt-hook-ccpp?
Forget about the segfault. It's a known bug caused by a race during shutdown, already fixed upstream and available in nbdkit 1.2.6.
Created attachment 1480990 [details]
engine log from the failed migration plan execution

Nir, adding the engine.log as you requested in comment 7.
Created attachment 1480992 [details]
conversion host core dump for SIGSEGV

Nir, attaching the core dump as you requested in comment 12.