Bug 1984775 - VMs Migration from a specific VMware fails the importer, on NfcFssrvrProcessErrorMsg
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.6
Hardware: Unspecified
OS: Unspecified
urgent
medium
Target Milestone: ---
Target Release: 4.8.1
Assignee: Matthew Arnold
QA Contact: Ilanit Stein
URL:
Whiteboard:
Duplicates: 1973193 (view as bug list)
Depends On:
Blocks: 2003691
 
Reported: 2021-07-22 08:27 UTC by Ilanit Stein
Modified: 2021-09-13 12:43 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2003691 (view as bug list)
Environment:
Last Closed: 2021-08-24 12:49:09 UTC
Target Upstream Version:
Embargoed:




Links:
Github kubevirt containerized-data-importer pull 1883 - last updated 2021-08-09 14:32:06 UTC
Red Hat Product Errata RHSA-2021:3259 - last updated 2021-08-24 12:49:30 UTC

Description Ilanit Stein 2021-07-22 08:27:20 UTC
Description of problem:
 
An MTV migration plan with 2 VMs fails, with the following error in the importer pod log:

I0721 09:28:24.883919 1 vddk-datasource.go:200] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes
I0721 09:28:24.883976 1 vddk-datasource.go:200] Log line from nbdkit: nbdkit: vddk[1]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
I0721 09:28:31.988105 1 vddk-datasource.go:200] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes

A migration plan with only one of the VMs did pass, but the copy was extremely slow, the above error still appeared in the log,
and the VM didn't start automatically after migration.
It was possible to start it manually, though.

Other VMs from the same VMware fail to migrate as well.

Richard Jones: 
=============
The log above seems to indicate an error inside the VDDK library when allocating memory.

It's pretty clearly running out of memory inside VDDK.  It
happens quite quickly too, probably within the first few read
requests.

Of course VDDK is a black box so we don't know specifically what's
going on inside it, but it wouldn't be a surprise if it needs to
allocate memory during a read.

If this is running inside a container, try increasing the cgroup
limits on the amount of RAM the container is allowed to use.  I'm not
clear if this is virt-v2v or you're using nbdkit directly, but for
virt-v2v there are some guidelines here:

https://libguestfs.org/virt-v2v.1.html#compute-power-and-ram

If using nbdkit directly, you shouldn't need nearly that much RAM, but
clearly you need more than you're giving it now.

Matthew Arnold:
==============
Thanks Rich, it's helpful to know that it's the local VDDK side that's failing. Just to confirm, this log is indeed from CDI running nbdkit in a container. The container is created and managed by CDI itself though, so I don't know of an easy way to adjust its cgroup limits on a live system. I will try to reproduce the bug with modifications to the code that creates the container, unless anyone else knows a trick for changing these limits as soon as it starts.
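
For illustration, a change along these lines is roughly what I have in mind. This is not the actual CDI code: it assumes the importer pod spec is built with the standard Kubernetes corev1 types, and the function name, container index, and 2Gi value are placeholders:

package importer

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// withLargerMemoryLimit raises the memory limit on the pod's first container,
// which is the cgroup limit that nbdkit/VDDK inside it will see.
// The 2Gi figure is only a placeholder for experimenting.
func withLargerMemoryLimit(pod *corev1.Pod) *corev1.Pod {
    pod.Spec.Containers[0].Resources.Limits = corev1.ResourceList{
        corev1.ResourceMemory: resource.MustParse("2Gi"),
    }
    return pod
}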

Richard Jones: 
=============
Changing the limits is the best thing to do.  However, there's
another thing that you could try if that turns out to be impossible.

nbdkit doesn't normally break up large requests from the NBD client.
E.g. if the client makes a request to read the maximum size block (32M),
then it will pass that to the plugin, which will ask VDDK to
make a 32M read.  In other words, VixDiskLib_Read is being called
here with count = 32M (actually in sectors, so divided by 512):

https://gitlab.com/nbdkit/nbdkit/-/blob/e510b9c0a061966d07e3f56c975a968f277913d1/plugins/vddk/vddk.c#L726

If we theorize that this is causing VDDK to allocate 32M per request,
you could adjust the client to make smaller requests.  E.g. nbdcopy
lets you adjust the maximum request size using the --request-size flag.  Or if
you can't do that, then insert the blocksize filter into the chain of
filters, which will break up large requests:

https://libguestfs.org/nbdkit-blocksize-filter.1.html

(e.g.: --filter=blocksize ... maxdata=1M)

Similarly if the client is making multiple requests in parallel (which
could allocate N * 32M) either reduce the amount of parallelism in the
client or use this filter:

https://libguestfs.org/nbdkit-noparallel-filter.1.html

(--filter=noparallel ... serialize=all-requests)
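
For concreteness, if the importer builds the nbdkit command line itself (CDI's vddk-datasource launches nbdkit), the two filters could be wired in roughly as sketched below. Only the filter names and their maxdata= / serialize= parameters come from the man pages above; the function, the placeholder connection parameters, and the 1M cap are assumptions, and authentication parameters are omitted for brevity:

package importer

import "os/exec"

// nbdkitCommand starts nbdkit with the VDDK plugin, capping the data moved per
// request at 1M (blocksize filter) and allowing only one request in flight at
// a time (noparallel filter), so VDDK never has to satisfy a large or highly
// parallel allocation.
func nbdkitCommand(server, thumbprint, moref, diskFile string) *exec.Cmd {
    return exec.Command("nbdkit",
        "--foreground",
        "--filter=blocksize",  // break up large NBD client requests
        "--filter=noparallel", // serialize requests
        "vddk",
        "server="+server,         // vCenter/ESXi host (placeholder)
        "thumbprint="+thumbprint, // SSL thumbprint (placeholder)
        "vm=moref="+moref,        // VM managed object reference
        "file="+diskFile,         // datastore path of the disk
        "maxdata=1M",             // blocksize filter: cap per-request data
        "serialize=all-requests", // noparallel filter: no parallelism
    )
}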

Version-Release number of selected component (if applicable):
OCP-4.7/CNV-2.6.6-44/MTV-2.4 release
VMware 6.5

How reproducible:
The issue doesn't reproduce on the same OCP cluster using another VMware 6.5 instance.

Additional info:
This issue is NOT related to the NFC maxMemory setting shown below:

Based on the info in https://bugzilla.redhat.com/show_bug.cgi?id=1614276#c24,
we checked the maxMemory value set on the ESXi host, and it was already set to 1000000000:

<!-- The nfc service -->
<nfcsvc>
  <path>libnfcsvc.so</path>
  <enabled>true</enabled>
  <maxMemory>1000000000</maxMemory>
  <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

Comment 2 Fabien Dupont 2021-08-05 07:02:21 UTC
Changing the component to Storage, since the fix is in CDI.

Comment 3 Fabien Dupont 2021-08-05 07:04:38 UTC
*** Bug 1973193 has been marked as a duplicate of this bug. ***

Comment 4 Ilanit Stein 2021-08-11 19:43:20 UTC
Verified on CNV-4.8.1-18,
by migrating the 2 VMs with 2 disks, from the same VMware,
for which this bug was reported.

Migration to NFS target storage passed.
The VMs were successfully started on the OpenShift side.
Comment 9 errata-xmlrpc 2021-08-24 12:49:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.1 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3259

