Bug 2038050

Summary: [i40e sr-iov] failed to increase or reduce VF numbers when many VFs exist
Product: Red Hat Enterprise Linux 8
Reporter: Mingyu Shi <mshi>
Component: NetworkManager
Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA
QA Contact: Vladimir Benes <vbenes>
Severity: medium
Priority: medium
Version: 8.6
CC: bgalvani, ferferna, fge, jiji, jishi, lrintel, network-qe, rkhan, sfaye, sukulkar, till, vbenes
Target Milestone: rc
Keywords: Triaged
Target Release: 8.7
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: NetworkManager-1.40.2-1.el8
Doc Type: No Doc Update
Clones: 2150831
Last Closed: 2023-05-16 09:04:54 UTC
Type: Bug
Bug Blocks: 2150831

Description Mingyu Shi 2022-01-07 07:55:44 UTC
Description of problem:
When a PF has many VFs, nmstate cannot increase or reduce the number of VFs. What counts as "many" seems to depend on the host.

Version-Release number of selected component (if applicable):
nmstate-1.2.0-1.el8.x86_64
nispor-1.2.2-1.el8.x86_64
NetworkManager-1.36.0-0.3.el8.x86_64

driver: i40e
version: 4.18.0-357.el8.x86_64
firmware-version: 6.00 0x800036cb 1.1747.0
expansion-rom-version: 
bus-info: 0000:e3:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


How reproducible:
100%

Steps to Reproduce:
# cat many-vfs.yaml 
---
interfaces:
- name: ens4f0
  type: ethernet
  state: up
  ethernet:
    sr-iov:
      total-vfs: 64

[15:25:32@netqe-amd-01 ~/repo-nmstate/sriov]0# nmstatectl apply many-vfs.yaml 
/tmp/nmstatelog/2022-01-07-15:25:48-358932058.log
Desired state applied: 
---
interfaces:
- name: ens4f0
  type: ethernet
  state: up
  ethernet:
    sr-iov:
      total-vfs: 64
/tmp/nmstatelog/2022-01-07-15:25:48-358932058.0.log nmstatectl apply many-vfs.yaml return 0

# Reducing from 64 to 63 fails:
[15:25:56@netqe-amd-01 ~/repo-nmstate/sriov]0# sed 's/64/63/' many-vfs.yaml | nmstatectl apply -
/tmp/nmstatelog/2022-01-07-15:26:29-753811422.log
Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.2.0', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 74, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 355, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 419, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 89, in apply
    _apply_ifaces_state(plugins, net_state, verify_change, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 122, in _apply_ifaces_state
    plugin.apply_changes(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/plugin.py", line 233, in apply_changes
    NmProfiles(self.context).apply_config(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/profiles.py", line 99, in apply_config
    self._ctx.wait_all_finish()
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/context.py", line 217, in wait_all_finish
    raise tmp_error
libnmstate.error.NmstateLibnmError: Activate profile uuid:d7e9349a-a8e8-408b-95e8-e6fd76fc7a7d iface:ens4f0 type: ethernet failed: reason=<enum NM_ACTIVE_CONNECTION_STATE_REASON_DEVICE_DISCONNECTED of type NM.ActiveConnectionStateReason><enum NM_DEVICE_STATE_REASON_SRIOV_CONFIGURATION_FAILED of type NM.DeviceStateReason>
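The same 64 -> 63 change can be driven from Python instead of `sed | nmstatectl apply -`. This is a minimal sketch, assuming nmstate 1.x is installed and the script runs as root; the helper names `build_sriov_state` and `reproduce` are hypothetical, and only `libnmstate.apply` and `libnmstate.error.NmstateError` come from the library.

```python
import sys


def build_sriov_state(iface: str, total_vfs: int) -> dict:
    """Hypothetical helper: the desired state from many-vfs.yaml as a dict."""
    return {
        "interfaces": [
            {
                "name": iface,
                "type": "ethernet",
                "state": "up",
                "ethernet": {"sr-iov": {"total-vfs": total_vfs}},
            }
        ]
    }


def reproduce(iface: str = "ens4f0", start: int = 64, end: int = 63) -> int:
    """Hypothetical helper: apply `start` VFs, then `end` VFs.

    Returns 0 if the second apply succeeds, 1 if nmstate raises an error.
    """
    # Imported lazily so build_sriov_state() works without nmstate installed.
    import libnmstate
    from libnmstate.error import NmstateError

    libnmstate.apply(build_sriov_state(iface, start))
    try:
        libnmstate.apply(build_sriov_state(iface, end))
    except NmstateError as err:
        print(f"changing {start} -> {end} VFs failed: {err}", file=sys.stderr)
        return 1
    print(f"changing {start} -> {end} VFs succeeded")
    return 0
```

Calling `reproduce()` on an affected host should go through the same `apply` path shown in the traceback above; the interface name and VF counts are the ones from this report and would need adjusting elsewhere.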

Actual results:
Failed

Expected results:
No failure

Additional info:
Changing from 10 VFs to 9 or 11 works, but changing from 60 to 59 or 61 fails.
This extends https://bugzilla.redhat.com/show_bug.cgi?id=1938675 and is unrelated to VF profiles.

Comment 1 Mingyu Shi 2022-01-07 08:02:02 UTC
Also reproduced on netqe-amd-01.knqe.lab.eng.bos.redhat.com.
I first created 40 VFs on this interface; increasing the count to 41 then failed with a different error message:
# cat many-vfs.yaml 
---
interfaces:
- name: ens4f0
  type: ethernet
  state: up
  ethernet:
    sr-iov:
      total-vfs: 40

nmstatectl apply many-vfs.yaml
sed 's/40/41/' many-vfs.yaml | nmstatectl apply -

Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.2.0', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 74, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 338, in set
    return apply(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 355, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 419, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 89, in apply
    _apply_ifaces_state(plugins, net_state, verify_change, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 140, in _apply_ifaces_state
    _verify_change(plugins, net_state)
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 155, in _verify_change
    net_state.verify(current_state)
  File "/usr/lib/python3.6/site-packages/libnmstate/net_state.py", line 86, in verify
    self._ifaces.verify(current_state.get(Interface.KEY))
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 621, in verify
    verify_sriov_vf(iface, cur_ifaces)
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ethernet.py", line 169, in verify_sriov_vf
    f"Found VF ports count does not match desired "
libnmstate.error.NmstateVerificationError: Found VF ports count does not match desired 41, current is: ens4f0v0,ens4f0v1,ens4f0v3,ens4f0v4,ens4f0v5,ens4f0v6,ens4f0v7,ens4f0v8,ens4f0v9,ens4f0v10,ens4f0v11,ens4f0v12,ens4f0v13,ens4f0v14,ens4f0v15,ens4f0v16,ens4f0v17,ens4f0v18,ens4f0v19,ens4f0v20,ens4f0v21,ens4f0v22,ens4f0v23,ens4f0v24,ens4f0v25,ens4f0v26,ens4f0v27,ens4f0v28,ens4f0v29,ens4f0v30,ens4f0v31,ens4f0v32,ens4f0v33,ens4f0v34,ens4f0v35,ens4f0v36,ens4f0v37,ens4f0v38,ens4f0v39,ens4f0v40
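The verification error lists only 40 port names for a desired count of 41, and a quick scan of the list shows that `ens4f0v2` is the one absent. A small sketch for finding such gaps in a comma-separated port list, assuming the `<pf>v<N>` naming seen in the message above (the helper name is hypothetical):

```python
import re


def missing_vf_indices(pf: str, current: str, desired: int) -> list:
    """Return the VF indices in range(desired) with no '<pf>v<N>' port name.

    `current` is the comma-separated list printed after "current is:".
    """
    pattern = re.compile(rf"^{re.escape(pf)}v(\d+)$")
    found = set()
    for name in current.split(","):
        match = pattern.match(name.strip())
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(desired) if i not in found]
```

Feeding it the list from the error above with `pf="ens4f0"` and `desired=41` returns `[2]`.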

Comment 2 Gris Ge 2022-01-07 09:15:37 UTC
Acceptance criteria: nmstate should not fail when changing the SR-IOV total-vfs count within the range supported by the kernel.
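The "kernel support range" for a PF is exposed in sysfs as `sriov_totalvfs` under the network device's PCI device directory. A minimal sketch of checking a desired count against it, assuming the standard `/sys/class/net/<iface>/device/sriov_totalvfs` layout; the helper names are hypothetical, and the `sysfs` parameter only exists so the check can run against a fake tree:

```python
from pathlib import Path


def sriov_total_vfs(iface: str, sysfs: str = "/sys") -> int:
    """Read the maximum VF count the kernel reports for this PF."""
    path = Path(sysfs) / "class" / "net" / iface / "device" / "sriov_totalvfs"
    return int(path.read_text().strip())


def in_kernel_range(iface: str, desired: int, sysfs: str = "/sys") -> bool:
    """True if `desired` lies within [0, sriov_totalvfs] for `iface`."""
    return 0 <= desired <= sriov_total_vfs(iface, sysfs)
```

Under this criterion, any `total-vfs` value for which `in_kernel_range()` is true should apply without error.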

Comment 3 Gris Ge 2022-07-27 07:41:44 UTC
Hi Mingyu,

The SR-IOV code changed significantly in nmstate-1.3.1-1.el8. Could you test again to check whether this bug is also fixed?

Comment 4 Mingyu Shi 2022-08-31 02:14:41 UTC
Hi Gris,

Sorry for the late response.

Tested with:
nmstate-1.3.3-1.el8.x86_64
nispor-1.2.7-1.el8.x86_64
NetworkManager-1.40.0-1.el8.x86_64
openvswitch2.15-2.15.0-113.2.el8fdp.x86_64
Linux dell-per740-79.rhts.eng.pek2.redhat.com 4.18.0-422.el8.x86_64 #1 SMP Thu Aug 25 21:40:53 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
DISTRO=RHEL-8.7.0-20220829.1

driver: i40e
version: 4.18.0-422.el8.x86_64
firmware-version: 7.10 0x800075df 19.5.12
expansion-rom-version: 
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

I reported two different errors, in #comment0 and #comment1:
For #comment1 (increasing from 40 to 41), it works well now.
For #comment0 (reducing from 64 to 63), I still get the same error:
Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.3.3', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 74, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 338, in set
    return apply(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 355, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 419, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 89, in apply
    _apply_ifaces_state(plugins, net_state, verify_change, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 122, in _apply_ifaces_state
    plugin.apply_changes(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/plugin.py", line 233, in apply_changes
    NmProfiles(self.context).apply_config(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/profiles.py", line 100, in apply_config
    self._ctx.wait_all_finish()
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/context.py", line 217, in wait_all_finish
    raise tmp_error
libnmstate.error.NmstateLibnmError: Activate profile uuid:d1b8e2ce-9554-475f-87b2-c891f50de887 iface:ens1f0 type: ethernet failed: reason=<enum NM_ACTIVE_CONNECTION_STATE_REASON_DEVICE_DISCONNECTED of type NM.ActiveConnectionStateReason><enum NM_DEVICE_STATE_REASON_SRIOV_CONFIGURATION_FAILED of type NM.DeviceStateReason>

Comment 6 Mingyu Shi 2022-08-31 04:57:23 UTC
Everything works fine on RHEL 9
nmstate-2.1.3-1.el9.x86_64
NetworkManager-1.39.90-1.el9.x86_64

Comment 7 Gris Ge 2022-09-07 02:49:00 UTC
Hi Beniamino,

Could you take a look at the logs above regarding the NM_ACTIVE_CONNECTION_STATE_REASON_DEVICE_DISCONNECTED / NM_DEVICE_STATE_REASON_SRIOV_CONFIGURATION_FAILED failure when changing SR-IOV settings?

Thank you!

Comment 8 Beniamino Galvani 2022-10-05 04:49:54 UTC
>  <debug> [1661912355.4769] platform: (ens1f0) link:   VF 0 mac FE:68:46:1F:42:FE spoofchk 1 trust 0
>  <debug> [1661912355.4770] platform: (ens1f0) link:   VF 1 mac 96:71:B0:5A:41:FA spoofchk 1 trust 0
>  <debug> [1661912355.4770] platform: (ens1f0) link:   VF 2 mac 62:E6:BB:05:46:3A spoofchk 1 trust 0
>  ...
>  <debug> [1661912355.4782] platform: (ens1f0) link:   VF 63 mac 26:B6:D0:84:9A:C3 spoofchk 1 trust 0
>  <error> [1661912355.4783] device (ens1f0): failed to apply SR-IOV VFs

I think the problem is that we allocate a 4 KiB buffer for the netlink message; when there are many VFs with parameters, the buffer is not large enough. I'll prepare a patch for that.
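To see why a fixed 4 KiB buffer runs out, here is a rough size model of an RTM_NEWLINK request carrying an IFLA_VFINFO_LIST with one IFLA_VF_INFO nest per VF. It assumes each VF entry holds only IFLA_VF_MAC, IFLA_VF_SPOOFCHK and IFLA_VF_TRUST (the fields printed in the debug log above), using the struct sizes from `<linux/if_link.h>`; any other attributes NetworkManager actually adds are ignored, so this is an estimate, not NetworkManager's real message layout.

```python
NLMSG_HDRLEN = 16    # struct nlmsghdr
IFINFOMSG_LEN = 16   # struct ifinfomsg
NLA_HDRLEN = 4       # struct nlattr
BUFFER_SIZE = 4096   # the fixed 4 KiB buffer mentioned above


def nla_align(n: int) -> int:
    """Round up to the 4-byte netlink attribute alignment."""
    return (n + 3) & ~3


def nla_total(payload: int) -> int:
    """Bytes used by one attribute: header plus aligned payload."""
    return nla_align(NLA_HDRLEN + payload)


# ASSUMED per-VF attributes and their struct payload sizes.
VF_ATTR_PAYLOADS = {
    "IFLA_VF_MAC": 4 + 32,      # struct ifla_vf_mac: __u32 vf; __u8 mac[32]
    "IFLA_VF_SPOOFCHK": 4 + 4,  # struct ifla_vf_spoofchk: __u32 vf, setting
    "IFLA_VF_TRUST": 4 + 4,     # struct ifla_vf_trust: __u32 vf, setting
}


def per_vf_bytes() -> int:
    """Size of one IFLA_VF_INFO nest holding the assumed attributes."""
    inner = sum(nla_total(p) for p in VF_ATTR_PAYLOADS.values())
    return nla_align(NLA_HDRLEN + inner)


def request_bytes(num_vfs: int) -> int:
    """Headers + IFLA_VFINFO_LIST nest header + one nest per VF."""
    return NLMSG_HDRLEN + IFINFOMSG_LEN + NLA_HDRLEN + num_vfs * per_vf_bytes()


def max_vfs_that_fit(buffer_size: int = BUFFER_SIZE) -> int:
    """Largest VF count whose estimated request still fits in the buffer."""
    n = 0
    while request_bytes(n + 1) <= buffer_size:
        n += 1
    return n


for n in (10, 40, 59, 60, 64):
    verdict = "fits" if request_bytes(n) <= BUFFER_SIZE else "too big"
    print(n, request_bytes(n), verdict)
# 10 716 fits
# 40 2756 fits
# 59 4048 fits
# 60 4116 too big
# 64 4388 too big
```

Under these assumptions each VF costs 68 bytes and the request stops fitting at 60 VFs; with more attributes per VF the limit would be lower, so the real threshold depends on what NetworkManager actually serializes.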

Comment 9 Beniamino Galvani 2022-10-05 10:03:05 UTC
I have opened an upstream merge request for NM to increase the buffer size based on the number of VFs.
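The merge request itself is not linked here, so the exact sizing rule NetworkManager adopted is not shown. As an illustration only, a hypothetical rule might reserve room for every VF entry and round up to whole pages, never going below the old 4 KiB. The `per_vf` and `fixed` byte costs below are assumptions, not NetworkManager's values.

```python
PAGE_SIZE = 4096  # same as the old fixed 4 KiB buffer


def sriov_buffer_size(num_vfs: int, per_vf: int = 68, fixed: int = 36,
                      page: int = PAGE_SIZE) -> int:
    """Hypothetical sizing rule for the SR-IOV netlink request buffer.

    per_vf: ASSUMED bytes per IFLA_VF_INFO entry.
    fixed:  ASSUMED bytes for the netlink, ifinfomsg and list-nest headers.
    The result is rounded up to whole pages and is at least one page.
    """
    if num_vfs < 0:
        raise ValueError("num_vfs must be non-negative")
    needed = fixed + num_vfs * per_vf
    pages = max(1, -(-needed // page))  # ceiling division
    return pages * page
```

With these assumed costs, 59 VFs still fit in one page (4096 bytes), while 64 VFs get two pages (8192 bytes). Rounding to pages keeps the old size for small VF counts; computing the exact size would be an equally valid choice.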

Comment 10 Beniamino Galvani 2022-10-05 10:03:31 UTC
Should I reassign this bz to NM or clone it?

Comment 14 Vladimir Benes 2022-11-23 09:49:58 UTC
We do see some crashes; postponing ITM a bit.

Comment 19 Vladimir Benes 2022-11-29 18:36:12 UTC
We cannot reproduce the crash on 1.40.2-1, and the original machine was reinstalled; moving back to ON_QA.

Comment 22 errata-xmlrpc 2023-05-16 09:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968