Bug 1614276 - VixDiskLib: VixDiskLib_Read: Read 4096 sectors at 57516160 failed. Error 2 (Memory allocation failed. Out of memory.) (DiskLib error 802: NBD_ERR_INSUFFICIENT_RESOURCES) at 5240.
Summary: VixDiskLib: VixDiskLib_Read: Read 4096 sectors at 57516160 failed. Error 2 (M...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libguestfs
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Richard W.M. Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard: V2V
Depends On: 1618717
Blocks:
 
Reported: 2018-08-09 10:04 UTC by Richard W.M. Jones
Modified: 2019-06-05 21:50 UTC (History)
13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-05 21:50:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
v2v-import-20180808T200531-45022.log (4.34 MB, text/plain)
2018-08-09 10:04 UTC, Richard W.M. Jones
VMware hostd configuration (27.36 KB, text/plain)
2018-08-16 05:14 UTC, Mor
Richard's config.xml file (25.38 KB, text/plain)
2018-08-16 08:03 UTC, Richard W.M. Jones
vddk-nfc_error-rhel8.1.log (2.36 MB, text/plain)
2019-05-12 15:00 UTC, mxie@redhat.com

Description Richard W.M. Jones 2018-08-09 10:04:29 UTC
Created attachment 1474623 [details]
v2v-import-20180808T200531-45022.log

Description of problem:

virt-v2v conversion failed with this VDDK error:

nbdkit: vddk[3]: debug: pread count=2097152 offset=29448273920
nbdkit: python[1]: debug: pwrite count=2097152 offset=29433593856 fua=0

nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 57516160, 4096, buffer)
nbdkit: vddk[3]: error: [NFC ERROR] NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 2097176 bytes
nbdkit: vddk[3]: debug: DISKLIB-LIB   : RWv failed ioId: #8757 (802) (34) .
nbdkit: vddk[3]: debug: VixDiskLib: Detected DiskLib error 802 (NBD_ERR_INSUFFICIENT_RESOURCES).
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Read: Read 4096 sectors at 57516160 failed. Error 2 (Memory allocation failed. Out of memory.) (DiskLib error 802: NBD_ERR_INSUFFICIENT_RESOURCES) at 5240.
nbdkit: vddk[3]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
nbdkit: vddk[3]: debug: sending error reply: Input/output error
qemu-img: error while reading sector 57516160: Input/output error

Version-Release number of selected component (if applicable):

libguestfs 1.36.10-6.15.rhvpreview.el7ev

How reproducible:

Unknown.

Steps to Reproduce:

It happened during conversion of a guest that has a single 100 GB disk.

Comment 2 Mor 2018-08-09 10:31:03 UTC
Package versions:
VMware-vix-disklib-6.5.2-6195444.x86_64
VMware ESXi 6.5.0, 7967591
VMware vCenter Version 6.5.0.20000 Build 8667236

Comment 3 Mor 2018-08-13 08:40:32 UTC
Currently, issue is reproducible on my system while running 40x100GB VM migrations on 4 conversion hosts.

Versions:
nbdkit-plugin-python2-1.2.3-1.el7ev.x86_64
nbdkit-1.2.3-1.el7ev.x86_64
nbdkit-plugin-vddk-1.2.3-1.el7.x86_64
nbdkit-plugin-python-common-1.2.3-1.el7ev.x86_64
libguestfs-tools-c-1.36.10-6.16.rhvpreview.el7ev.x86_64
python-libguestfs-1.36.10-6.16.rhvpreview.el7ev.x86_64
libguestfs-winsupport-7.2-2.el7.x86_64
libguestfs-1.36.10-6.16.rhvpreview.el7ev.x86_64
VMware-vix-disklib-6.5.2-6195444.x86_64
VMware ESXi 6.5.0, 7967591
VMware vCenter Version 6.5.0.20000 Build 8667236
RHV 4.2.5.3-0.1.el7ev
CFME 5.9.4.2

Comment 5 Mor 2018-08-13 10:46:57 UTC
Error is reflected in the vpxa process log on the ESXi host:
2018-08-13T08:30:13.226Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48366284 and allocating 2097176 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-08-13T08:30:13.227Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the requested memory
2018-08-13T08:30:13.227Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 2097176 bytes

Comment 6 Mor 2018-08-13 10:49:48 UTC
VMware article: https://kb.vmware.com/s/article/1022543

Comment 7 Mor 2018-08-13 11:11:14 UTC
Another one: https://kb.vmware.com/s/article/50442

Comment 8 Tomáš Golembiovský 2018-08-13 11:43:43 UTC
    After a quick search, this seems to be directly related to the number of concurrent conversions [1][2] rather than a bug in nbdkit or the plugin.

    A possible workaround is to increase the memory limit for the NFC service, via the <maxMemory> element in the file /etc/vmware/hostd/config.xml:

    <nfcsvc>
        <path>libnfcsvc.so</path>
        <enabled>true</enabled>
        <maxMemory>50331648</maxMemory>
        <maxStreamMemory>10485760</maxStreamMemory>
    </nfcsvc>


    [1] https://www.virtuallypeculiar.com/2018/02/vsphere-replication-jobs-fail-due-to.html
    [2] https://unitrends-support.force.com/UnitrendsBackup/articles/Article/000005944?retURL=%2FUnitrendsBackup%2Fapex%2FLiveAgentPreChatForm&popup=true
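The numbers in the logs line up with this explanation. A back-of-the-envelope check (values taken from the error messages in this bug; the 24-byte difference between the 2 MiB request and the 2097176-byte allocation is presumably NFC header overhead):

```python
# Figures taken verbatim from the NFC error messages in this bug.
DEFAULT_MAX_MEMORY = 50331648   # default <maxMemory> in hostd's config.xml (48 MiB)
REQUEST_SIZE = 2097176          # one 2 MiB read (2097152 bytes) plus overhead

# How many NFC read buffers fit under the default limit at once.
max_buffers = DEFAULT_MAX_MEMORY // REQUEST_SIZE
print(max_buffers)  # 23

# With ~20 conversion streams each keeping at least one buffer in flight,
# the next allocation pushes past the limit and fails with
# NBD_ERR_INSUFFICIENT_RESOURCES, matching the vpxa.log message above.
```

This also matches the later reproducer: 20 concurrent connections fail, and the hostd.log in comment 22 shows NFC already holding 48235234 bytes, just under 23 full buffers.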

Comment 9 Mor 2018-08-13 12:01:31 UTC
As per comment #7: I'm trying a rerun with 64 * 1024 * 1024 maxMemory for nfcsvc.

Comment 10 Richard W.M. Jones 2018-08-13 12:11:02 UTC
When we come up with a suitable formula I will add that information to
the virt-v2v manual, so let's leave this bug open and with the
current assignment.

Comment 11 Mor 2018-08-13 19:20:25 UTC
Tried 64 * 1024 * 1024 but I still get the error. Going to try 1 GB (1024 * 1024 * 1024); if that doesn't work, I think we might need to reconsider the maximum number of parallel migrations in the plan.

Comment 12 Mor 2018-08-14 13:58:36 UTC
1 GB still didn't solve the issue. It seems we are limited to a certain number of parallel migrations in our environment, or does this apply to every environment?

Richard, would it be possible for libguestfs functional QE to test 40 x 100 GB VMs with 3/4-full disks? Just to make sure we are not hitting a possible connection limit in VMware.

Comment 13 Richard W.M. Jones 2018-08-14 14:21:09 UTC
Our functional testing team don't test performance or limits and
aren't really set up to do this kind of testing.

Is it definitely hitting the same error (and not some different error)?

Presumably with a 1GB limit you must now be seeing this in the logs:

  "Cannot allocate any more memory as NFC is already using XX
   and allocating 2097176 will make it more than the maximum
   allocated: [1GB]"

Also note that after adjusting the limit you have to restart the
daemon, I think:

# /etc/init.d/hostd restart

Probably if you don't restart the daemon it won't actually change
the limit.

Comment 14 Mor 2018-08-15 11:40:27 UTC
It seems like the change to /etc/vmware/hostd/config.xml on the relevant ESXi host doesn't take effect, even after restarting the hostd and vpxa services and doing a complete reboot of the host.

I still get the same result:
2018-08-15T09:40:29.938Z warning vpxa[7363B70] [Originator@6876 sub=Libs opID=lro-358-559be7bf] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50332756 and allocating 48664 will make it more than the maximum allocated: 50331648. Please close some sessions and try again.

For the suggested XML:
      <nfcsvc>
         <path>libnfcsvc.so</path>
         <enabled>true</enabled>
         <maxMemory>62914560</maxMemory>
         <maxStreamMemory>10485760</maxStreamMemory>
      </nfcsvc>


In addition, just adding more info. URL:
https://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.vddk.pg.doc%2FvddkDataStruct.5.5.html&resultof=%22%4e%46%43%22%20%22%6e%66%63%22%20%22%53%65%73%73%69%6f%6e%22%20%22%73%65%73%73%69%6f%6e%22%20%22%4c%69%6d%69%74%73%22%20%22%6c%69%6d%69%74%22%20

It describes a limit of a 32 MB NFC buffer on the VMware host, and 52 connections through vCenter Server, including the per-host limits.

Comment 15 Richard W.M. Jones 2018-08-15 15:02:25 UTC
I can see from the error message that the limit has not been
changed at all (was 50331648, still is 50331648).  Obviously
editing that file and even rebooting doesn't change this.

I tried to reproduce this locally but could not.  I got up to
11 streams without seeing the error (couldn't go beyond that
because I don't have any more guests that I can stream in parallel).

Two thoughts:

(1) Does adjusting maxStreamMemory help?

(2) There are two files which contain this setting.  The second one
is /etc/vmware/hostd/default-config.xml
It might be worth trying to make the same change to this file too.

Comment 16 Daniel Gur 2018-08-15 15:05:37 UTC
Richard, how do you reproduce it with one transfer, while we see it only when migrating more than 10?

Comment 17 Richard W.M. Jones 2018-08-15 15:08:25 UTC
I made 11 separate connections.  I was running nbdkit directly, but it
should work similarly to how virt-v2v works.  Will attach my test script
as a private comment.

Comment 19 Mor 2018-08-15 15:47:49 UTC
I don't think 11 is enough to reproduce this bug. It should be around 19 or 20.

Comment 20 Mor 2018-08-15 15:55:53 UTC
BTW, I tried to change default-config.xml and adjusted its file permissions, but I get 'Operation not permitted' when trying to save the changes. It looks like something is blocking writes to that file.

Comment 21 Richard W.M. Jones 2018-08-15 19:41:21 UTC
(In reply to Mor from comment #19)
> I don't think 11 is enough to reproduce this bug. It should be around 19 or
> 20.

Yeah I suspect you are right.  However I didn't have enough guests
at the time to do 20 parallel conversions (I *think* that I have to
convert a separate guest each time -- but now I think about it a bit
more, maybe it will work with a single guest if I use a readonly
connection?).

Anyway I will continue experimenting tomorrow.

Comment 22 Richard W.M. Jones 2018-08-15 20:11:27 UTC
Yes I'm able to reproduce this now.  I used 20 concurrent *read-only*
connections to the same guest image.  The error seen is the same
as you describe and on the ESXi server I see in /var/log/hostd.log:

2018-08-15T20:11:21.339Z warning hostd[7F4D2B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48235234 and allocating 2097176 will make it more than the maximum allocated: 50331648. Please close some sessions and try again

I will play with the settings and see if I can find any way to fix it.

Comment 24 Richard W.M. Jones 2018-08-15 20:43:01 UTC
I increased the limit in /etc/vmware/hostd/config.xml to a billion:

      <nfcsvc>
         <path>libnfcsvc.so</path>
         <enabled>true</enabled>
         <maxMemory>1000000000</maxMemory>
         <maxStreamMemory>10485760</maxStreamMemory>
      </nfcsvc>

and then I restarted hostd by doing:

/etc/init.d/hostd restart

and I am now able to do 24 parallel conversions (was not able
to do 20 before).  I did not need to reboot the ESXi server.

Could you check that you didn't make a mistake while editing
config.xml?  I noticed in the /etc/init.d/hostd script that it
will actually replace config.xml with the default-config.xml
if it detects any error in config.xml (crazy, I know).

I am testing ESXi 6.0.0.

Comment 25 Mor 2018-08-16 05:07:23 UTC
(In reply to Richard W.M. Jones from comment #24)
> I increased the limit in /etc/vmware/hostd/config.xml to a billion:
> 
>       <nfcsvc>
>          <path>libnfcsvc.so</path>
>          <enabled>true</enabled>
>          <maxMemory>1000000000</maxMemory>
>          <maxStreamMemory>10485760</maxStreamMemory>
>       </nfcsvc>
> 
> and then I restarted hostd by doing:
> 
> /etc/init.d/hostd restart
> 
> and I am now able to do 24 parallel conversions (was not able
> to do 20 before).  I did not need to reboot the ESXi server.
> 
> Could you check that you didn't make a mistake while editing
> config.xml?  I noticed in the /etc/init.d/hostd script that it
> will actually replace config.xml with the default-config.xml
> if it detects any error in config.xml (crazy, I know).
> 
> I am testing ESXi 6.0.0.

I double-checked; I haven't made a mistake. My config.xml was not replaced (if 6.5 behaves like 6.0). I can still see my value after restarting hostd.

Comment 26 Mor 2018-08-16 05:14:30 UTC
Created attachment 1476346 [details]
VMware hostd configuration

Comment 28 Richard W.M. Jones 2018-08-16 08:03:08 UTC
Created attachment 1476371 [details]
Richard's config.xml file

Comment 29 Richard W.M. Jones 2018-08-16 08:41:13 UTC
Here's the change in my configuration:

diff -u default-config.xml config.xml 
--- default-config.xml
+++ config.xml
@@ -683,7 +683,7 @@
       <nfcsvc>
          <path>libnfcsvc.so</path>
          <enabled>true</enabled>
-         <maxMemory>50331648</maxMemory>
+         <maxMemory>1000000000</maxMemory>
          <maxStreamMemory>10485760</maxStreamMemory>
       </nfcsvc>

Comment 30 Richard W.M. Jones 2018-08-16 08:50:49 UTC
I repeated the same test this morning, running 24 parallel copies
successfully using my attached test script.

I then replaced config.xml with default-config.xml (ie. reducing
the limit back down to 50331648).  I reran the test with 20
parallel connections and was able to reproduce the bug again easily.

Comment 33 Daniel Gur 2018-08-16 14:04:26 UTC
Richard - What speed of transfer do you get  with 24 parallel connections to ESXi?

Comment 34 Richard W.M. Jones 2018-08-16 14:08:06 UTC
Pretty slow, but I didn't measure it.  These were happening over
the internet so it's not representative of any real world use case.

Comment 35 Daniel Gur 2018-08-19 10:22:10 UTC
Doc bug was opened
Bug 1618982 - [v2v][doc] Concurrent conversions via vSphere and VDDK should be limited to 10 (edit)  https://bugzilla.redhat.com/show_bug.cgi?id=1618982

Comment 36 Mykola Ivanets 2018-08-27 06:01:41 UTC
I have run into these ESXi limitations in our own VMware-backup-related product, and want to share some of what we learned.

1. Any call to VixDiskLib_Read using NBD[SSL] transport may fail with VIX_E_OUT_OF_MEMORY. According to VDDK 6.0 release notes: "Upon receiving this error, the caller can try again with a smaller I/O buffer, retry at a later time, perform other operations on the disk, or close the disk."

We implemented adaptive readers: start reading with an X MB I/O buffer, increase the buffer after every N successful reads, fall back to the previous value on failure, and decrease the buffer after M sequential failures.
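A rough illustration of that approach (this is a sketch, not the actual product code; the read_at callback, the buffer sizes, and the grow_after threshold are all illustrative):

```python
# Sketch of an adaptive NBD reader: grow the I/O buffer after a run of
# successful reads, shrink it again when the server reports an
# out-of-memory error (VIX_E_OUT_OF_MEMORY in the VDDK case, modelled
# here as MemoryError).  All names and sizes are illustrative.

MIB = 1024 * 1024

class AdaptiveReader:
    def __init__(self, read_at, initial=1 * MIB, maximum=8 * MIB,
                 minimum=64 * 1024, grow_after=16):
        self.read_at = read_at        # read_at(offset, length) -> bytes; raises MemoryError
        self.bufsize = initial
        self.maximum = maximum
        self.minimum = minimum
        self.grow_after = grow_after  # successful reads before growing the buffer
        self.successes = 0

    def read(self, offset, length):
        data = b""
        while len(data) < length:
            chunk = min(self.bufsize, length - len(data))
            try:
                data += self.read_at(offset + len(data), chunk)
            except MemoryError:
                if self.bufsize == self.minimum:
                    raise        # even the smallest buffer failed; give up
                # Server is out of NFC memory: halve the buffer and retry.
                self.bufsize = max(self.minimum, self.bufsize // 2)
                self.successes = 0
                continue
            self.successes += 1
            if self.successes >= self.grow_after and self.bufsize < self.maximum:
                self.bufsize = min(self.maximum, self.bufsize * 2)
                self.successes = 0
        return data
```

The point of the shrink-on-failure path is exactly the behaviour documented in the VDDK 6.0 release notes quoted below: on VIX_E_OUT_OF_MEMORY the disk handle remains valid, so retrying with a smaller buffer is safe.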

2. Earlier versions of ESXi/VDDK behaved even worse: VixDiskLib_Open/VixDiskLib_Read could hang when the NFC buffer limit was reached.

We track each VixDiskLib_Open/VixDiskLib_Read call and kill the hung thread if we don't get a response within a reasonable timeout.
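A minimal version of such a watchdog, assuming the blocking library call is wrapped in a worker thread (the function name and timeout are illustrative; note that Python can only abandon a hung daemon thread, not truly kill it):

```python
import threading

def call_with_timeout(fn, args=(), timeout=60.0):
    """Run fn(*args) in a worker thread; raise TimeoutError if it does not
    finish in time.  The hung thread is abandoned (it is a daemon thread),
    mirroring the kill-the-hung-thread strategy described above."""
    result = {}

    def worker():
        result["value"] = fn(*args)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        raise TimeoutError("call did not complete within %.1fs" % timeout)
    return result["value"]
```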

Here are the Release Notes for VDDK 5.5, 6.0 and 6.5 for your reference. You can see how VMware tried to fix the NFC issue:
---

VDDK 5.5 Release Notes (https://www.vmware.com/support/developer/vddk/vddk-550-releasenotes.html#knownissues)

"Connection may hang when NBD backup exceeds documented NFC buffer limit.
ESXi 5.x hosts have a 32MB buffer allocated for NFC connections. When the dedicated ESXi memory is consumed, the next connection attempted can hang, although other established connections will continue normally. The following message appears in the ESXi host's vpxa.log file: “[NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using bytes and allocating morebytes will make it more than the maximum allocated: 33554432. Please close some sessions and try again.” The workaround is to reduce the number of concurrent backups targeting the same ESXi host so as to remain within the documented limit. See KB 1022543."

VDDK 6.0 Release Notes (https://www.vmware.com/support/developer/vddk/vddk-600-releasenotes.html#resolvedissues)

"Error returned when ESXi host reaches NFC memory limit.
Previously, VDDK would hang during I/O operations if the host ran out of NFC memory. This has been fixed in this release. You must upgrade both VDDK and ESXi to get the fix. VDDK now returns the VIX_E_OUT_OF_MEMORY error, leaving the disk handle in a still-valid state. Upon receiving this error, the caller can try again with a smaller I/O buffer, retry at a later time, perform other operations on the disk, or close the disk."

VDDK 6.5 Release Notes (http://pubs.vmware.com/Release_Notes/en/developer/vddk/65/vsphere-vddk-65-release-notes.html#resolvedissues)

"Error not returned when ESXi host reaches NFC connection limit.
This issue may cause an apparent hang in backup, or slowness of vCenter Server. The VIX_E_OUT_OF_MEMORY error is returned in cases when an ESXi host reaches the NFC memory limit (a fixed issue), but not for other NFC limits such as the number of connections. Programs should check that the “Number of current connections” is less than the VMware recommended NFC session connection limit before making a new VixDiskLib_Open(Ex) call. This is not a defect but an NFC limitation. See section NFC Session Limits in the Virtual Disk Programming Guide. ESXi host limits depend on consumed buffer space (32MB) not just the number of connections. VixDiskLib_Open(Ex) uses one connection for every virtual disk accessed on an ESXi host. It is not possible to share a connection across virtual disks. Connections made through vCenter Server are limited to a maximum of 52, then the ESXi host limits listed above apply."

Comment 37 Richard W.M. Jones 2018-08-28 08:10:32 UTC
Documentation was added to the virt-v2v manual in:
https://github.com/libguestfs/libguestfs/commit/0d0b5511309182f163ae263e862ddb0235780917

Comment 38 Mykola Ivanets 2018-08-28 08:26:30 UTC
Yes, I saw. I added the comment just in case someone googles this ticket rather than the commit message.

Comment 41 mxie@redhat.com 2019-05-12 15:00:39 UTC
Created attachment 1567516 [details]
vddk-nfc_error-rhel8.1.log

Comment 43 Richard W.M. Jones 2019-06-05 21:50:32 UTC
This bug is fixed in 1.40 (by adding documentation) so the fix should be present in
RHEL 7.7.  I'm closing this bug as it's (a) not an actual bug in virt-v2v and
(b) the workaround is documented upstream, in RHEL 7 and RHEL 8.

