Created attachment 1474623 [details] v2v-import-20180808T200531-45022.log

Description of problem:

virt-v2v conversion failed with this VDDK error:

nbdkit: vddk[3]: debug: pread count=2097152 offset=29448273920
nbdkit: python[1]: debug: pwrite count=2097152 offset=29433593856 fua=0
nbdkit: vddk[3]: debug: VDDK call: VixDiskLib_Read (handle, 57516160, 4096, buffer)
nbdkit: vddk[3]: error: [NFC ERROR] NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 2097176 bytes
nbdkit: vddk[3]: debug: DISKLIB-LIB : RWv failed ioId: #8757 (802) (34) .
nbdkit: vddk[3]: debug: VixDiskLib: Detected DiskLib error 802 (NBD_ERR_INSUFFICIENT_RESOURCES).
nbdkit: vddk[3]: debug: VixDiskLib: VixDiskLib_Read: Read 4096 sectors at 57516160 failed. Error 2 (Memory allocation failed. Out of memory.) (DiskLib error 802: NBD_ERR_INSUFFICIENT_RESOURCES) at 5240.
nbdkit: vddk[3]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
nbdkit: vddk[3]: debug: sending error reply: Input/output error
qemu-img: error while reading sector 57516160: Input/output error

Version-Release number of selected component (if applicable):
libguestfs 1.36.10-6.15.rhvpreview.el7ev

How reproducible:
Unknown.

Steps to Reproduce:
It happened during conversion of a guest that has a single 100 GB disk.
Package versions:
VMware-vix-disklib-6.5.2-6195444.x86_64
VMware ESXi 6.5.0, 7967591
VMware vCenter Version 6.5.0.20000 Build 8667236
Currently, the issue is reproducible on my system while running 40 x 100 GB VM migrations on 4 conversion hosts.

Versions:
nbdkit-plugin-python2-1.2.3-1.el7ev.x86_64
nbdkit-1.2.3-1.el7ev.x86_64
nbdkit-plugin-vddk-1.2.3-1.el7.x86_64
nbdkit-plugin-python-common-1.2.3-1.el7ev.x86_64
libguestfs-tools-c-1.36.10-6.16.rhvpreview.el7ev.x86_64
python-libguestfs-1.36.10-6.16.rhvpreview.el7ev.x86_64
libguestfs-winsupport-7.2-2.el7.x86_64
libguestfs-1.36.10-6.16.rhvpreview.el7ev.x86_64
VMware-vix-disklib-6.5.2-6195444.x86_64
VMware ESXi 6.5.0, 7967591
VMware vCenter Version 6.5.0.20000 Build 8667236
RHV 4.2.5.3-0.1.el7ev
CFME 5.9.4.2
The error is reflected in the vpxa process log on the ESXi host:

2018-08-13T08:30:13.226Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48366284 and allocating 2097176 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-08-13T08:30:13.227Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the requested memory
2018-08-13T08:30:13.227Z warning vpxa[B964B70] [Originator@6876 sub=Libs opID=lro-10650-549a0804] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 2097176 bytes
VMware article: https://kb.vmware.com/s/article/1022543
Another one: https://kb.vmware.com/s/article/50442
After a quick search this seems to be directly related to the number of concurrent conversions [1][2] rather than a bug in nbdkit or its plugins. A possible workaround is to increase the memory limit for the NFC service, via the <maxMemory> element in the file /etc/vmware/hostd/config.xml:

<nfcsvc>
  <path>libnfcsvc.so</path>
  <enabled>true</enabled>
  <maxMemory>50331648</maxMemory>
  <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

[1] https://www.virtuallypeculiar.com/2018/02/vsphere-replication-jobs-fail-due-to.html
[2] https://unitrends-support.force.com/UnitrendsBackup/articles/Article/000005944?retURL=%2FUnitrendsBackup%2Fapex%2FLiveAgentPreChatForm&popup=true
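In practice, applying that workaround looks roughly like this (a sketch only; the hostname is a placeholder, it assumes SSH access to the ESXi host is enabled, and, as later comments in this bug confirm, the new limit only takes effect after hostd is restarted):

# ssh root@esxi.example.com
# vi /etc/vmware/hostd/config.xml      (raise <maxMemory> inside <nfcsvc>)
# /etc/init.d/hostd restart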
As per comment #7: I'm trying a rerun with 64 * 1024 * 1024 maxMemory for nfcsvc.
When we come up with a suitable formula I will add that information to the virt-v2v manual, so let's leave this bug open and with the current assignment.
Tried 64 * 1024 * 1024 but I still get the error. Going to try 1 GB (1024 * 1024 * 1024); if that doesn't work, I think we might need to reconsider the maximum number of parallel migrations in the plan.
1 GB still didn't solve the issue. It seems we are limited to a certain number of parallel migrations in our environment, or does that apply to every environment? Richard, would it be possible for libguestfs functional QE to test 40 x 100 GB VMs with disks about 3/4 full? Just to make sure we are not hitting a possible connection limit in VMware.
Our functional testing team doesn't test performance or limits and isn't really set up to do this kind of testing.

Is it definitely hitting the same error (and not some different error)? Presumably with a 1 GB limit you must now be seeing this in the logs:

"Cannot allocate any more memory as NFC is already using XX and allocating 2097176 will make it more than the maximum allocated: [1GB]"

Also note that after adjusting the limit you have to restart the daemon, I think:

# /etc/init.d/hostd restart

If you don't restart the daemon it probably won't actually change the limit.
It seems the change to /etc/vmware/hostd/config.xml on the relevant ESXi host doesn't take effect, even after restarting the hostd and vpxa services and doing a complete reboot of the host. I still get the same result:

2018-08-15T09:40:29.938Z warning vpxa[7363B70] [Originator@6876 sub=Libs opID=lro-358-559be7bf] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50332756 and allocating 48664 will make it more than the maximum allocated: 50331648. Please close some sessions and try again.

This is with the suggested XML:

<nfcsvc>
  <path>libnfcsvc.so</path>
  <enabled>true</enabled>
  <maxMemory>62914560</maxMemory>
  <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

In addition, just adding more info. URL:
https://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.vddk.pg.doc%2FvddkDataStruct.5.5.html&resultof=%22%4e%46%43%22%20%22%6e%66%63%22%20%22%53%65%73%73%69%6f%6e%22%20%22%73%65%73%73%69%6f%6e%22%20%22%4c%69%6d%69%74%73%22%20%22%6c%69%6d%69%74%22%20

This page describes a possible limit on the VMware host of a 32 MB NFC buffer, and of 52 connections through vCenter Server, including the per-host limit.
I can see from the error message that the limit has not been changed at all (it was 50331648, and still is 50331648). Obviously editing that file, and even rebooting, doesn't change this.

I tried to reproduce this locally but could not. I got up to 11 streams without seeing the error (I couldn't go beyond that because I don't have any more guests that I can stream in parallel).

Two thoughts:

(1) Does adjusting maxStreamMemory help?

(2) There are two files which contain this setting. The second one is /etc/vmware/hostd/default-config.xml. It might be worth trying to make the same change to this file too.
Richard, how do you reproduce it with one transfer, while we see it only when migrating more than 10?
I made 11 separate connections. I was running nbdkit directly, but it should work similarly to how virt-v2v works. Will attach my test script as a private comment.
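For anyone who wants to reproduce this kind of parallel read test, the general shape is something like the following (a sketch only, not the attached script; the vCenter hostname, thumbprint, moref, datastore path and password file are placeholders, and it assumes the nbdkit vddk plugin and qemu-img are installed):

#!/bin/sh
# Open N concurrent read-only NBD connections to the same guest disk via
# VDDK, then copy the disk over each connection.  Each nbdkit instance
# (and each running copy) holds one NFC session open against the ESXi host.
N=20
for i in $(seq 1 $N); do
    nbdkit -r -f -U /tmp/nbd$i.sock \
        vddk libdir=/usr/lib/vmware-vix-disklib-distrib \
        server=vcenter.example.com user=root password=+/tmp/passwd \
        thumbprint=01:23:45:...:ef vm=moref=vm-16 \
        file='[datastore1] guest/guest.vmdk' &
done
sleep 5     # give every nbdkit time to start and listen on its socket

pids=
for i in $(seq 1 $N); do
    # each output file needs as much local space as the guest disk
    qemu-img convert -O raw nbd:unix:/tmp/nbd$i.sock /var/tmp/out$i.img &
    pids="$pids $!"
done
wait $pids                      # wait for the copies to finish (or fail)
kill $(jobs -p) 2>/dev/null     # then shut down the nbdkit servers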
I don't think 11 is enough to reproduce this bug. It should be around 19 or 20.
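Rough arithmetic from the numbers in the logs (just an estimate on my part): the default maxMemory is 50331648 bytes and each failed request in the logs asks for 2097176 bytes (a 2 MiB read buffer plus a small NFC header), so at most about 24 such buffers fit (50331648 / 2097176 ≈ 24); with per-session overhead charged against the same pool, running out at around 19-20 concurrent streams is consistent with that limit.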
btw, I tried to change default-config.xml and adjusted its file permissions, but I get 'Operation not permitted' when trying to save the changes. Looks like something is blocking writes to that file.
(In reply to Mor from comment #19)
> I don't think 11 is enough to reproduce this bug. It should be around 19 or
> 20.

Yeah I suspect you are right. However I didn't have enough guests at the time to do 20 parallel conversions (I *think* that I have to convert a separate guest each time -- but now I think about it a bit more, maybe it will work with a single guest if I use a readonly connection?). Anyway I will continue experimenting tomorrow.
Yes, I'm able to reproduce this now. I used 20 concurrent *read-only* connections to the same guest image. The error seen is the same as you describe, and on the ESXi server I see in /var/log/hostd.log:

2018-08-15T20:11:21.339Z warning hostd[7F4D2B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48235234 and allocating 2097176 will make it more than the maximum allocated: 50331648. Please close some sessions and try again

I will play with the settings and see if I can find any way to fix it.
I increased the limit in /etc/vmware/hostd/config.xml to a billion:

<nfcsvc>
  <path>libnfcsvc.so</path>
  <enabled>true</enabled>
  <maxMemory>1000000000</maxMemory>
  <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

and then I restarted hostd by doing:

/etc/init.d/hostd restart

and I am now able to do 24 parallel conversions (I was not able to do 20 before). I did not need to reboot the ESXi server.

Could you check that you didn't make a mistake while editing config.xml? I noticed in the /etc/init.d/hostd script that it will actually replace config.xml with default-config.xml if it detects any error in config.xml (crazy, I know).

I am testing ESXi 6.0.0.
(In reply to Richard W.M. Jones from comment #24)
> I increased the limit in /etc/vmware/hostd/config.xml to a billion:
>
>   <nfcsvc>
>     <path>libnfcsvc.so</path>
>     <enabled>true</enabled>
>     <maxMemory>1000000000</maxMemory>
>     <maxStreamMemory>10485760</maxStreamMemory>
>   </nfcsvc>
>
> and then I restarted hostd by doing:
>
>   /etc/init.d/hostd restart
>
> and I am now able to do 24 parallel conversions (was not able
> to do 20 before). I did not need to reboot the ESXi server.
>
> Could you check that you didn't make a mistake while editing
> config.xml? I noticed in the /etc/init.d/hostd script that it
> will actually replace config.xml with the default-config.xml
> if it detects any error in config.xml (crazy, I know).
>
> I am testing ESXi 6.0.0.

I double-checked, and I didn't make a mistake. My config.xml was not replaced (if 6.5 behaves like 6.0), and I can still see my value after restarting hostd.
Created attachment 1476346 [details] VMware hostd configuration
Created attachment 1476371 [details] Richard's config.xml file
Here's the change in my configuration:

diff -u default-config.xml config.xml
--- default-config.xml
+++ config.xml
@@ -683,7 +683,7 @@
    <nfcsvc>
       <path>libnfcsvc.so</path>
       <enabled>true</enabled>
-      <maxMemory>50331648</maxMemory>
+      <maxMemory>1000000000</maxMemory>
       <maxStreamMemory>10485760</maxStreamMemory>
    </nfcsvc>
I repeated the same test this morning, running 24 parallel copies successfully using my attached test script. I then replaced config.xml with default-config.xml (ie. reducing the limit back down to 50331648). I reran the test with 20 parallel connections and was able to reproduce the bug again easily.
Richard - What speed of transfer do you get with 24 parallel connections to ESXi?
Pretty slow, but I didn't measure it. These were happening over the internet so it's not representative of any real world use case.
A doc bug was opened:

Bug 1618982 - [v2v][doc] Concurrent conversions via vSphere and VDDK should be limited to 10
https://bugzilla.redhat.com/show_bug.cgi?id=1618982
I have run into these ESXi limitations in our own VMware backup product, and want to share some of what we learned.

1. Any call to VixDiskLib_Read using the NBD[SSL] transport may fail with VIX_E_OUT_OF_MEMORY. According to the VDDK 6.0 release notes: "Upon receiving this error, the caller can try again with a smaller I/O buffer, retry at a later time, perform other operations on the disk, or close the disk." We implemented adaptive readers: we start reading with an X MB I/O buffer, increase the I/O buffer after every N successful reads, and fall back to the previous value on failure (or decrease the I/O buffer after several sequential failures). A sketch of this approach appears after the release notes below.

2. Earlier versions of ESXi/VDDK behaved even worse: VixDiskLib_Open/VixDiskLib_Read could hang when the NFC buffer limit was reached. We track each VixDiskLib_Open/VixDiskLib_Read call and kill the hung thread if we don't get a response after a reasonable timeout.

Here are the Release Notes for VDDK 5.5, 6.0 and 6.5 for your reference. You can see how VMware tried to fix the NFC issue:

VDDK 5.5 Release Notes (https://www.vmware.com/support/developer/vddk/vddk-550-releasenotes.html#knownissues):

"Connection may hang when NBD backup exceeds documented NFC buffer limit. ESXi 5.x hosts have a 32MB buffer allocated for NFC connections. When the dedicated ESXi memory is consumed, the next connection attempted can hang, although other established connections will continue normally. The following message appears in the ESXi host's vpxa.log file: “[NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using bytes and allocating morebytes will make it more than the maximum allocated: 33554432. Please close some sessions and try again.” The workaround is to reduce the number of concurrent backups targeting the same ESXi host so as to remain within the documented limit. See KB 1022543."

VDDK 6.0 Release Notes (https://www.vmware.com/support/developer/vddk/vddk-600-releasenotes.html#resolvedissues):

"Error returned when ESXi host reaches NFC memory limit. Previously, VDDK would hang during I/O operations if the host ran out of NFC memory. This has been fixed in this release. You must upgrade both VDDK and ESXi to get the fix. VDDK now returns the VIX_E_OUT_OF_MEMORY error, leaving the disk handle in a still-valid state. Upon receiving this error, the caller can try again with a smaller I/O buffer, retry at a later time, perform other operations on the disk, or close the disk."

VDDK 6.5 Release Notes (http://pubs.vmware.com/Release_Notes/en/developer/vddk/65/vsphere-vddk-65-release-notes.html#resolvedissues):

"Error not returned when ESXi host reaches NFC connection limit. This issue may cause an apparent hang in backup, or slowness of vCenter Server. The VIX_E_OUT_OF_MEMORY error is returned in cases when an ESXi host reaches the NFC memory limit (a fixed issue), but not for other NFC limits such as the number of connections. Programs should check that the “Number of current connections” is less than the VMware recommended NFC session connection limit before making a new VixDiskLib_Open(Ex) call. This is not a defect but an NFC limitation. See section NFC Session Limits in the Virtual Disk Programming Guide. ESXi host limits depend on consumed buffer space (32MB) not just the number of connections. VixDiskLib_Open(Ex) uses one connection for every virtual disk accessed on an ESXi host. It is not possible to share a connection across virtual disks. Connections made through vCenter Server are limited to a maximum of 52, then the ESXi host limits listed above apply."
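For illustration, the adaptive-reader loop looks roughly like this (a simplified sketch, not our production code; it assumes the standard VixDiskLib_Read signature from the VDDK C headers, and the tuning constants and the function name adaptive_read are invented for the example):

#include "vixDiskLib.h"          /* VDDK header: VixError, VIX_OK, uint8, ... */

#ifndef VIX_E_OUT_OF_MEMORY
#define VIX_E_OUT_OF_MEMORY 2    /* "Memory allocation failed. Out of memory." */
#endif

#define MIN_SECTORS    1024u     /* never read less than 512 KiB at a time */
#define START_SECTORS  4096u     /* start with 2 MiB reads */
#define MAX_SECTORS    16384u    /* grow up to 8 MiB reads */
#define GROW_AFTER     16u       /* successful reads before growing again */

/* Read 'total' sectors starting at 'start' into 'buf' (which must be
 * total * VIXDISKLIB_SECTOR_SIZE bytes), shrinking the per-call read size
 * whenever the server runs out of NFC memory. */
static VixError
adaptive_read (VixDiskLibHandle handle, VixDiskLibSectorType start,
               VixDiskLibSectorType total, uint8 *buf)
{
  VixDiskLibSectorType chunk = START_SECTORS;
  unsigned ok_in_a_row = 0;

  while (total > 0) {
    VixDiskLibSectorType n = total < chunk ? total : chunk;
    VixError err = VixDiskLib_Read (handle, start, n, buf);

    if (err == VIX_OK) {
      start += n;
      total -= n;
      buf += n * VIXDISKLIB_SECTOR_SIZE;
      if (++ok_in_a_row >= GROW_AFTER && chunk < MAX_SECTORS) {
        chunk *= 2;              /* reads are succeeding: try bigger buffers */
        ok_in_a_row = 0;
      }
    }
    else if (err == VIX_E_OUT_OF_MEMORY && chunk > MIN_SECTORS) {
      chunk /= 2;                /* NFC memory is tight: back off and retry */
      ok_in_a_row = 0;
    }
    else {
      return err;                /* out of memory even at the minimum size, or
                                    a different error: retry later or close
                                    the disk, as the release notes suggest */
    }
  }
  return VIX_OK;
}

Backing off on VIX_E_OUT_OF_MEMORY and only growing again after a run of successful reads keeps every reader on a host within the shared NFC pool without the readers having to coordinate with each other.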
Documentation was added to the virt-v2v manual in: https://github.com/libguestfs/libguestfs/commit/0d0b5511309182f163ae263e862ddb0235780917
Yes, I saw. I added the comment just in case someone googles this ticket rather than the commit message.
Created attachment 1567516 [details] vddk-nfc_error-rhel8.1.log
This bug is fixed in 1.40 (by adding documentation), so the fix should be present in RHEL 7.7. I'm closing this bug as (a) it's not an actual bug in virt-v2v, and (b) the workaround is documented upstream, in RHEL 7 and in RHEL 8.