Bug 1973193 - [MTV] Failed to migrate VMs from VMware65 - importer pod restart due to OutOfMemory - When Vsphere is an end point
Summary: [MTV] Failed to migrate VMs from VMware65 - importer pod restart due to OutOf...
Keywords:
Status: CLOSED DUPLICATE of bug 1984775
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.3
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.8.1
Assignee: Matthew Arnold
QA Contact: Ying Cui
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-06-17 11:59 UTC by David Vaanunu
Modified: 2021-08-05 07:04 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-05 07:04:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vm32 failed to migrate in first try - have OOM Errors (26.38 KB, text/plain)
2021-06-17 11:59 UTC, David Vaanunu
no flags Details
vm32 passed to migrate in a second try (40.68 KB, text/plain)
2021-06-17 12:00 UTC, David Vaanunu
no flags Details

Description David Vaanunu 2021-06-17 11:59:07 UTC
Created attachment 1791789 [details]
vm32 failed to migrate in first try - have OOM Errors

Description of problem:

While trying to migrate VMs (5/10 VMs), the importer pod restart due to OutOfMemory.

Tested flows:
Migrate 2 VMs using 2 ESX (1 VM per ESX) - Passed
Migrate 4 VMs using 2 ESX (2 VMs per ESX) - 2 VMs were passed. 1 VM from each ESX
Migrate 2 VMs using 1 ESX (Using 2 migration plans) - 1 VM passed without importer restart. 2nd VM passed after importer pod restart (Attached importer log vm32_1.log & vm32_2.log)


Version-Release number of selected component (if applicable):
CNV_2.6.3-16
MTV STAGE

How reproducible:


Steps to Reproduce:
1. Create a migration plan with 5/10 VMs using 1 ESX over VMware65 provider
2. Run the plan
3. Monitor importer pod status

Actual results:

Importer pods were restarted, failed to migrate VMs.

Expected results:
Migrate the VMs successfully.

Additional info:

Migrate 2 VMs from 1 ESX host (using 2 plans)
Attached importer logs: (Migrate same VM)
vm32_1.log - migrate VM failed
vm32_2.log - migrate VM passed


NFC ESX Configuration:
    <nfcsvc>
         <path>libnfcsvc.so</path>
         <enabled>true</enabled>
         <maxMemory>1000000000</maxMemory>
         <maxStreamMemory>10485760</maxStreamMemory>
      </nfcsvc>


vpxa log:

2021-06-16T16:29:12.970Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48234835 and allocating 24117272 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2021-06-16T16:29:12.970Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the temporary buffer of size 24117272
2021-06-16T16:29:12.970Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 24117272 bytes
2021-06-16T16:29:12.981Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcNetTcpRead: EOF should not be reached
2021-06-16T16:29:12.981Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcNet_Recv: requested 264, recevied only 0 bytes
2021-06-16T16:29:12.981Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcGetMessage: recv failed:
2021-06-16T16:29:12.981Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcServerProcessClientMsg: Failed to get message from client: NFC_NETWORK_ERROR (The operation experienced a network error)
2021-06-16T16:29:12.981Z warning vpxa[B3FFB70] [Originator@6876 sub=Libs opID=lro-556-67c68bef] [NFC ERROR] NfcServerLoop: NfcServer_HandleRead returned an error: NFC_NETWORK_ERROR (The operation experienced a network error)
2021-06-16T16:29:12.984Z error vpxa[B3FFB70] [Originator@6876 sub=vpxNfcServer opID=lro-556-67c68bef] Nfc server failed with return value : NFC_NETWORK_ERROR
2021-06-16T16:29:12.984Z verbose vpxa[B3FFB70] [Originator@6876 sub=vpxNfcServer opID=lro-556-67c68bef] Closing NFC session
2021-06-16T16:29:12.984Z info vpxa[B3FFB70] [Originator@6876 sub=vpxLro opID=lro-556-67c68bef] [VpxLRO] -- FINISH lro-556
2021-06-16T16:29:12.984Z info vpxa[B3FFB70] [Originator@6876 sub=Default opID=lro-556-67c68bef] [VpxLRO] -- ERROR lro-556 --  -- VpxNfcServerLro: vmodl.fault.SystemError:
--> Result:
--> (vmodl.fault.SystemError) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    reason = "Nfc server failed with return value : NFC_NETWORK_ERROR"
-->    msg = ""
--> }
--> Args:
-->
2021-06-16T16:29:13.170Z warning vpxa[94ACB70] [Originator@6876 sub=Libs opID=lro-559-53940489] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48234738 and allocating 24117272 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2021-06-16T16:29:13.170Z warning vpxa[94ACB70] [Originator@6876 sub=Libs opID=lro-559-53940489] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the temporary buffer of size 24117272
2021-06-16T16:29:13.170Z warning vpxa[94ACB70] [Originator@6876 sub=Libs opID=lro-559-53940489] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 24117272 bytes

Comment 1 David Vaanunu 2021-06-17 12:00:57 UTC
Created attachment 1791791 [details]
vm32 passed to migrate in a second try

Comment 6 Yan Du 2021-07-07 12:23:30 UTC
could you please provide the updates for the bug ?

Comment 7 Fabien Dupont 2021-07-07 13:41:02 UTC
We're too late for 2.6.6 or 4.8.0. Targeting 4.8.1.

Comment 8 Fabien Dupont 2021-07-07 15:19:20 UTC
If I read the VMware error message, it says that it cannot allocate more than 50331648, which is the default value for the NFC service. It looks like the maxMemory value is not applied.

I have found this VMware discussion where the person has a similar issue: https://communities.vmware.com/t5/VMware-vSphere-Discussions/NFC-session-memory-limit/td-p/2757411. The person has also set the value to 1000000000 and the ESXi host seems to not use it, possibly because it would be beyond an undocumented limit. The recommendation in the discussion is to increase the maxMemory progressively and verify how many concurrent sessions can be opened.

Could you try to set the value to 62914560 and check whether the log reflects it when it fails?

Comment 9 David Vaanunu 2021-07-08 09:28:48 UTC
Running plan with 10VMs , 1 ESX (F01-H06)
1. Set the max memory to 60MB
      <!-- The nfc service -->               
      <nfcsvc>
         <path>libnfcsvc.so</path>
         <enabled>true</enabled>                           
         <maxMemory>62914560</maxMemory>
         <maxStreamMemory>10485760</maxStreamMemory>
      </nfcsvc> 
2. Restart hosted & vpxa services on ESX host

Problem reproduced. (Using vsphere Endpoint)
Still used the default value - 50331648 (48MB)


vpxa.log
--------
2021-07-08T09:04:41.454Z verbose vpxa[10C8EB70] [Originator@6876 sub=vpxNfcServer opID=lro-416-6466af9e] Starting NFC server loop
2021-07-08T09:04:41.561Z verbose vpxa[10C8EB70] [Originator@6876 sub=vpxNfcServer opID=lro-416-6466af9e] Received request to access ds:///vmfs/volumes/602bc26a-b99c5ac8-88e0-bc97e17cf0f0/dc65-iscsi-scalevm-50gb-70usage-40/dc65-iscsi-scalevm-50gb-70usage-40.vmdk
2021-07-08T09:04:42.081Z warning vpxa[BEB2B70] [Originator@6876 sub=Libs opID=lro-412-61bbfb41] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48234932 and allocating 24117272 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2021-07-08T09:04:42.081Z warning vpxa[BEB2B70] [Originator@6876 sub=Libs opID=lro-412-61bbfb41] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the temporary buffer of size 24117272
2021-07-08T09:04:42.081Z warning vpxa[BEB2B70] [Originator@6876 sub=Libs opID=lro-412-61bbfb41] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 24117272 bytes
2021-07-08T09:04:42.084Z verbose vpxa[BEB2B70] [Originator@6876 sub=vpxNfcServer opID=lro-412-61bbfb41] Closing NFC session
2021-07-08T09:04:42.084Z info vpxa[BEB2B70] [Originator@6876 sub=vpxLro opID=lro-412-61bbfb41] [VpxLRO] -- FINISH lro-412
2021-07-08T09:04:43.197Z verbose vpxa[BFBAB70] [Originator@6876 sub=vpxNfcServer opID=lro-407-474dd277] Closing NFC session
2021-07-08T09:04:43.197Z info vpxa[BFBAB70] [Originator@6876 sub=vpxLro opID=lro-407-474dd277] [VpxLRO] -- FINISH lro-407
2021-07-08T09:04:43.241Z warning vpxa[10C2BB70] [Originator@6876 sub=Libs opID=lro-414-3e1554e2] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 48234835 and allocating 24117272 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2021-07-08T09:04:43.241Z warning vpxa[10C2BB70] [Originator@6876 sub=Libs opID=lro-414-3e1554e2] [NFC ERROR] NfcFssrvrReadEx: Failed to allocate the temporary buffer of size 24117272
2021-07-08T09:04:43.241Z warning vpxa[10C2BB70] [Originator@6876 sub=Libs opID=lro-414-3e1554e2] [NFC ERROR] Sending Nfc error 5: Failed to allocate the requested 24117272 bytes
2021-07-08T09:04:43.355Z info vpxa[BF99B70] [Originator@6876 sub=vpxLro opID=74741185-91] [VpxLRO] -- BEGIN lro-417 -- vpxa -- vpxapi.VpxaService.nfcMakeTicketByFiles -- 5270d3b4-8ab2-6837-c9e4-b5bc3021d95a
2021-07-08T09:04:43.356Z info vpxa[BF99B70] [Originator@6876 sub=vpxLro opID=74741185-91] [VpxLRO] -- FINISH lro-417


importer.log
------------
I0708 09:03:58.674976       1 vddk-datasource.go:807] Transferred 12353273856/53687091200 bytes (23%) at 143642719 bytes/second overall, last 1% was 530579456 bytes at 176859818 bytes/second
I0708 09:04:02.607429       1 vddk-datasource.go:807] Transferred 12884901888/53687091200 bytes (24%) at 143165576 bytes/second overall, last 1% was 531628032 bytes at 177209344 bytes/second
I0708 09:04:03.495210       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes
I0708 09:04:03.495244       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
I0708 09:04:08.019716       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes
I0708 09:04:08.019752       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
I0708 09:04:16.126019       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes
I0708 09:04:16.126071       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.
I0708 09:04:20.659840       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: [NFC ERROR]NfcFssrvrProcessErrorMsg: received NFC error 5 from server: Failed to allocate the requested 24117272 bytes
I0708 09:04:20.659859       1 nbdkit.go:217] Log line from nbdkit: nbdkit: vddk[1]: error: VixDiskLib_Read: Memory allocation failed. Out of memory.

Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:205
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:165
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374
E0708 09:26:28.754095       1 importer.go:191] pread: nbd_pread: read: command failed: Input/output error
Unable to transfer source data to target file
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	/remote-source/app/pkg/importer/data-processor.go:205
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	/remote-source/app/pkg/importer/data-processor.go:165
main.main
	/remote-source/app/cmd/cdi-importer/importer.go:189
runtime.main
	/usr/lib/golang/src/runtime/proc.go:204
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1374

Comment 10 Adam Litke 2021-08-04 17:32:32 UTC
Do you plan to have this merged and backported by Aug 10?  If not we will need to push it out to 4.8.2.

Comment 11 Matthew Arnold 2021-08-04 18:05:22 UTC
Yes, I have a pull request almost ready for 1984775, which should also fix this bug once it is merged to the right branch. It might be a good idea to make this one a duplicate of 1984775, or the other way around.

Comment 12 Fabien Dupont 2021-08-05 07:04:38 UTC

*** This bug has been marked as a duplicate of bug 1984775 ***


Note You need to log in before you can comment on or make changes to this bug.