Bug 1566712

Summary: Got nfpu_open failed ERR when starting OVS
Product: Red Hat Enterprise Linux 7 Reporter: Jean-Tsung Hsiao <jhsiao>
Component: openvswitchAssignee: Aaron Conole <aconole>
Status: CLOSED ERRATA QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.5CC: aconole, atragler, ctrautma, echaudro, jhsiao, kzhang, louis.peens, pablo.cascon, qding, rkhan, simon.horman, tredaelli
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-03 14:37:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1560628    

Description Jean-Tsung Hsiao 2018-04-12 20:23:49 UTC
Description of problem: Got nfpu_open failed ERR when starting OVS

2018-04-12T19:49:09.317Z|00024|dpdk|ERR|PMD: nfpu_open failed
2018-04-12T19:49:09.317Z|00025|dpdk|INFO|EAL: Releasing pci mapped resource for 0000:81:00.0
2018-04-12T19:49:09.317Z|00026|dpdk|INFO|EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7ff740000000
2018-04-12T19:49:09.317Z|00027|dpdk|INFO|EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7ff748000000
2018-04-12T19:49:09.317Z|00028|dpdk|INFO|EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7ff74c000000
2018-04-12T19:49:09.455Z|00029|dpdk|ERR|EAL: Requested device 0000:81:00.0 cannot be used


Version-Release number of selected component (if applicable):
[root@netqe7 openvswitch-2.9.0]# rpm -q openvswitch
openvswitch-2.9.0-15.el7fdp.x86_64
[root@netqe7 openvswitch-2.9.0]# uname -a
Linux netqe7.knqe.lab.eng.bos.redhat.com 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible: Reproducible


Steps to Reproduce:
1. Install NFP card on a server
2. Provision the server under 7.5
3. Install openvswitch-2.9.0-15.el7fdp
4. Bind NFP
5. Start OVS

Actual results:


Expected results:


Additional info:

Comment 2 Aaron Conole 2018-04-12 21:31:55 UTC
Most likely a permissions issue with:

nspv_aquire_process_lock()
  -> ...
  -> snprintf(lockname, sizeof(lockname), "/var/lock/nfp%d", desc->nfp);
  -> ...
  -> desc->lock = open(lockname, O_RDWR | O_CREAT, 0666);

/var/lock is owned by root:root

Perhaps we need a /var/lock/nfp/... directory which can be accessed by the ovs process and be where the nfp locks are stored.

Comment 3 Aaron Conole 2018-04-12 22:02:26 UTC
I have two patches to DPDK to test out.  Will provide a private build tomorrow.

Comment 4 Aaron Conole 2018-04-12 22:22:48 UTC
Patches posted as RFC:
http://dpdk.org/ml/archives/dev/2018-April/096911.html

Comment 5 Rashid Khan 2018-04-13 11:19:21 UTC
what is an NFP card?

Comment 6 Jean-Tsung Hsiao 2018-04-13 11:59:28 UTC
(In reply to Rashid Khan from comment #5)
> what is an NFP card?

Ethernet controller: Netronome Systems, Inc. Device 4000

Comment 7 Jean-Tsung Hsiao 2018-04-13 12:31:34 UTC
Testpmd also failed --- missing firmware. See log below.

Hi Aaron,

Where to get the file: Firmware file /lib/firmware/netronome/nic_dpdk_default.nffw ?

Thanks!

Jean

EAL: Detected 16 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL:   probe driver: 19ee:4000 net_nfp_pf
EAL:   using IOMMU type 1 (Type 1)
PMD: nfp_pf_pci_probe(): nspu ABI version: 0.21

PMD: No firmware detected. Resetting NFP...
PMD: Reset done.
PMD: Uploading firmware...
PMD: Firmware file /lib/firmware/netronome/nic_dpdk_default.nffw not found.PMD: nfp fw upload failed
EAL: Releasing pci mapped resource for 0000:81:00.0
EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7fb340000000
EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7fb348000000
EAL: Calling pci_unmap_resource for 0000:81:00.0 at 0x7fb34c000000
EAL: Requested device 0000:81:00.0 cannot be used
EAL: No probed ethernet devices
Interactive-mode selected
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=179456, size=2176, socket=0
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=179456, size=2176, socket=1
Done
testpmd>

Comment 8 Pablo Cascon (Netronome) 2018-04-13 15:27:35 UTC
To create the symbolic link please check which FW the netdev driver 'nfp' would load by running "dmesg | grep AMDA" . (normally the 'nfp' driver would have been loaded but might need to be manually loaded/binded to the device).

Example:

[ 3468.049908] nfp 0000:04:00.0: nfp: netronome/nic_AMDA0099-0001_2x25.nffw: found, loading...

cd /lib/firmware/netronome
ln -s nic_AMDA0099-0001_2x25.nffw nic_dpdk_default.nffw

(the FW that the NFP PMD is requesting is a symbolic link to the right FW. Later releases of DPDK - from 18.05 - don't need this symbolic link, backport?)

Comment 9 Jean-Tsung Hsiao 2018-04-13 17:51:21 UTC
(In reply to Pablo Cascon from comment #8)
> To create the symbolic link please check which FW the netdev driver 'nfp'
> would load by running "dmesg | grep AMDA" . (normally the 'nfp' driver would
> have been loaded but might need to be manually loaded/binded to the device).
> 
> Example:
> 
> [ 3468.049908] nfp 0000:04:00.0: nfp: netronome/nic_AMDA0099-0001_2x25.nffw:
> found, loading...
> 
> cd /lib/firmware/netronome
> ln -s nic_AMDA0099-0001_2x25.nffw nic_dpdk_default.nffw
> 
> (the FW that the NFP PMD is requesting is a symbolic link to the right FW.
> Later releases of DPDK - from 18.05 - don't need this symbolic link,
> backport?)

The suggestion works.

NOTE: For testpmd to work properly, need to add "--disable-hw-vlan".

Comment 10 Timothy Redaelli 2018-04-13 18:12:24 UTC
Hi,
I have some problems to find the commit that will be in 18.05 in order to avoid doing the symlink.

I tried to use the last dpdk git (fb73e096110a41b77448fe27fd9be8c489ec5d82) version, but I still have the same problem if I don't create the symlink

Thank you

Comment 11 Timothy Redaelli 2018-04-16 09:41:40 UTC
I found that the needed commits were merged on Saturday.
I'm backporting it on 17.11

Comment 14 Jean-Tsung Hsiao 2018-04-17 17:05:51 UTC
The issue has been fixed with openvswitch-2.9.0-17.el7fdp.x86_64. Will do further testing.

[root@netqe7 ~]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2018-04-17T16:41:53.232Z|00001|vlog|INFO|opened log file
/var/log/openvswitch/ovs-vswitchd.log
2018-04-17T16:41:53.236Z|00002|ovs_numa|INFO|Discovered 8 CPU cores on
NUMA node 0
2018-04-17T16:41:53.236Z|00003|ovs_numa|INFO|Discovered 8 CPU cores on
NUMA node 1
2018-04-17T16:41:53.236Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and
16 CPU cores
2018-04-17T16:41:53.236Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock:
connecting...
2018-04-17T16:41:53.237Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock:
connected
2018-04-17T16:41:53.238Z|00007|dpdk|INFO|Using DPDK 17.11.0
2018-04-17T16:41:53.238Z|00008|dpdk|INFO|DPDK Enabled - initializing...
2018-04-17T16:41:53.238Z|00009|dpdk|INFO|No vhost-sock-dir provided -
defaulting to /var/run/openvswitch
2018-04-17T16:41:53.238Z|00010|dpdk|INFO|IOMMU support for
vhost-user-client disabled.
2018-04-17T16:41:53.238Z|00011|dpdk|INFO|EAL ARGS: ovs-vswitchd -c
0x0808 --socket-mem 4096,1
2018-04-17T16:41:53.238Z|00012|dpdk|INFO|EAL: Detected 16 lcore(s)
2018-04-17T16:41:53.262Z|00013|dpdk|INFO|EAL: Probing VFIO support...
2018-04-17T16:41:53.262Z|00014|dpdk|INFO|EAL: VFIO support initialized
2018-04-17T16:41:59.884Z|00015|dpdk|INFO|EAL: PCI device 0000:03:00.0 on
NUMA socket 0
2018-04-17T16:41:59.884Z|00016|dpdk|INFO|EAL:   probe driver: 15b3:1007
net_mlx4
2018-04-17T16:41:59.886Z|00017|dpdk|INFO|PMD: net_mlx4: PCI information
matches, using device "mlx4_0" (VF: false)
2018-04-17T16:41:59.886Z|00018|dpdk|INFO|PMD: net_mlx4: 2 port(s) detected
2018-04-17T16:41:59.887Z|00019|dpdk|INFO|PMD: net_mlx4: port 1 MAC
address is e4:1d:2d:79:b5:51
2018-04-17T16:41:59.888Z|00020|dpdk|INFO|PMD: net_mlx4: port 2 MAC
address is e4:1d:2d:79:b5:52
2018-04-17T16:41:59.889Z|00021|dpdk|INFO|EAL: PCI device 0000:81:00.0 on
NUMA socket 1
2018-04-17T16:41:59.889Z|00022|dpdk|INFO|EAL:   probe driver: 19ee:4000
net_nfp_pf
2018-04-17T16:41:59.890Z|00023|dpdk|INFO|EAL:   using IOMMU type 1 (Type 1)
2018-04-17T16:42:00.214Z|00024|dpdk|INFO|PMD: firmware model found:
AMDA0099-0001
2018-04-17T16:42:00.214Z|00025|dpdk|INFO|PMD: NFP ethernet port table
reports 2 ports
2018-04-17T16:42:00.214Z|00026|dpdk|INFO|PMD: Port speed: 25000
2018-04-17T16:42:02.924Z|00027|dpdk|INFO|PMD: Firmware file found at
/lib/firmware/netronome/nic_AMDA0099-0001_2x25.nffw with size: 1386368
2018-04-17T16:42:02.925Z|00028|dpdk|INFO|PMD: Uploading the firmware ...
2018-04-17T16:42:08.294Z|00029|dpdk|INFO|PMD: Done
2018-04-17T16:42:08.297Z|00030|dpdk|INFO|PMD: nfp_pf_pci_probe(): Total
pf ports: 2
2018-04-17T16:42:08.297Z|00031|dpdk|INFO|PMD: nfp_net_init(): VER:
0x305, Maximum supported MTU: 9216
2018-04-17T16:42:08.297Z|00032|dpdk|INFO|PMD: nfp_net_init(): CAP:
0x78160233, PROMISC RXCSUM TXCSUM GATHER RSS
2018-04-17T16:42:08.297Z|00033|dpdk|INFO|PMD: nfp_net_init():
max_rx_queues: 32, max_tx_queues: 32
2018-04-17T16:42:08.423Z|00034|dpdk|INFO|PMD: nfp_net_init(): port 2
VendorID=0x19ee DeviceID=0x4000 mac=00:15:4d:13:07:8a
2018-04-17T16:42:08.423Z|00035|dpdk|INFO|PMD: nfp_net_init(): VER:
0x305, Maximum supported MTU: 9216
2018-04-17T16:42:08.423Z|00036|dpdk|INFO|PMD: nfp_net_init(): CAP:
0x78160233, PROMISC RXCSUM TXCSUM GATHER RSS
2018-04-17T16:42:08.423Z|00037|dpdk|INFO|PMD: nfp_net_init():
max_rx_queues: 31, max_tx_queues: 31
2018-04-17T16:42:08.524Z|00038|dpdk|INFO|PMD: nfp_net_init(): port 3
VendorID=0x19ee DeviceID=0x4000 mac=00:15:4d:13:07:8b
2018-04-17T16:42:08.524Z|00039|dpdk|INFO|EAL: PCI device 0000:83:00.0 on
NUMA socket 1
2018-04-17T16:42:08.524Z|00040|dpdk|INFO|EAL:   probe driver: 8086:154d
net_ixgbe
2018-04-17T16:42:08.524Z|00041|dpdk|INFO|EAL: PCI device 0000:83:00.1 on
NUMA socket 1
2018-04-17T16:42:08.524Z|00042|dpdk|INFO|EAL:   probe driver: 8086:154d
net_ixgbe
2018-04-17T16:42:08.525Z|00043|dpdk|INFO|DPDK Enabled - initialized
2018-04-17T16:42:08.528Z|00044|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.9.0
2018-04-17T16:42:08.528Z|00045|timeval|WARN|Unreasonably long 15290ms
poll interval (40ms user, 6723ms system)
2018-04-17T16:42:08.528Z|00046|timeval|WARN|faults: 22578 minor, 0 major
2018-04-17T16:42:08.528Z|00047|timeval|WARN|context switches: 345
voluntary, 11 involuntary
2018-04-17T16:42:08.528Z|00048|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour,  hash=aaa85002:
2018-04-17T16:42:08.528Z|00049|coverage|INFO|bridge_reconfigure        
0.2/sec     0.017/sec        0.0003/sec   total: 1
2018-04-17T16:42:08.528Z|00050|coverage|INFO|cmap_expand               
0.8/sec     0.133/sec        0.0022/sec   total: 8
2018-04-17T16:42:08.528Z|00051|coverage|INFO|miniflow_malloc           
1.0/sec     0.183/sec        0.0031/sec   total: 11
2018-04-17T16:42:08.528Z|00052|coverage|INFO|hmap_expand              
25.4/sec     6.317/sec        0.1053/sec   total: 379
2018-04-17T16:42:08.528Z|00053|coverage|INFO|txn_unchanged             
0.2/sec     0.050/sec        0.0008/sec   total: 3
2018-04-17T16:42:08.528Z|00054|coverage|INFO|poll_create_node          
2.8/sec     0.667/sec        0.0111/sec   total: 40
2018-04-17T16:42:08.528Z|00055|coverage|INFO|seq_change                
4.8/sec     1.167/sec        0.0194/sec   total: 70
2018-04-17T16:42:08.528Z|00056|coverage|INFO|pstream_open              
0.2/sec     0.017/sec        0.0003/sec   total: 1
2018-04-17T16:42:08.528Z|00057|coverage|INFO|stream_open               
0.2/sec     0.017/sec        0.0003/sec   total: 1
2018-04-17T16:42:08.528Z|00058|coverage|INFO|util_xalloc             
495.2/sec   123.767/sec        2.0628/sec   total: 7426
2018-04-17T16:42:08.528Z|00059|coverage|INFO|netdev_get_hwaddr         
0.4/sec     0.033/sec        0.0006/sec   total: 2
2018-04-17T16:42:08.528Z|00060|coverage|INFO|netlink_received          
0.2/sec     0.050/sec        0.0008/sec   total: 3
2018-04-17T16:42:08.528Z|00061|coverage|INFO|netlink_sent              
0.2/sec     0.017/sec        0.0003/sec   total: 1
2018-04-17T16:42:08.528Z|00062|coverage|INFO|91 events never hit

Comment 15 Jean-Tsung Hsiao 2018-04-17 17:36:50 UTC
Got the same issue when adding nfp dpdk interface to OVS-dpdk bridge:

2018-04-17T17:34:25.399Z|00051|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2018-04-17T17:34:25.399Z|00052|ofproto_dpif|INFO|netdev@ovs-netdev: VLAN header stack length probed as 1
2018-04-17T17:34:25.399Z|00053|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2018-04-17T17:34:25.399Z|00054|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports truncate action
2018-04-17T17:34:25.399Z|00055|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2018-04-17T17:34:25.399Z|00056|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports clone action
2018-04-17T17:34:25.399Z|00057|ofproto_dpif|INFO|netdev@ovs-netdev: Max sample nesting level probed as 10
2018-04-17T17:34:25.399Z|00058|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports eventmask in conntrack action
2018-04-17T17:34:25.399Z|00059|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_clear action
2018-04-17T17:34:25.399Z|00060|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state
2018-04-17T17:34:25.399Z|00061|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_zone
2018-04-17T17:34:25.399Z|00062|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_mark
2018-04-17T17:34:25.399Z|00063|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_label
2018-04-17T17:34:25.399Z|00064|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state_nat
2018-04-17T17:34:25.399Z|00065|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2018-04-17T17:34:25.399Z|00066|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2018-04-17T17:34:25.429Z|00067|bridge|INFO|bridge ovsbr0: added interface ovsbr0 on port 65534
2018-04-17T17:34:25.433Z|00068|bridge|INFO|bridge ovsbr0: using datapath ID 000006269728d840
2018-04-17T17:34:25.433Z|00069|connmgr|INFO|ovsbr0: added service controller "punix:/var/run/openvswitch/ovsbr0.mgmt"
2018-04-17T17:34:25.487Z|00070|dpdk|INFO|EAL: PCI device 0000:81:00.0 on NUMA socket 1
2018-04-17T17:34:25.487Z|00071|dpdk|INFO|EAL:   probe driver: 19ee:4000 net_nfp_pf
2018-04-17T17:34:25.488Z|00072|dpdk|INFO|EAL:   using IOMMU type 1 (Type 1)
2018-04-17T17:34:25.709Z|00073|dpdk|ERR|PMD: nfpu_open failed

Comment 16 Jean-Tsung Hsiao 2018-04-17 19:33:16 UTC
Hi Tim and Eelco,

Sorry, it was a false alarm.

In the process of yum remove and yum install, I got -15 back instead of -17 installed.

Thanks Tim for reminding me.

Next, to run ping/netperf traffic between guests over OVS-dpdk vxlan tunnel. This is another critical test --- bnxt failed on that.

Thanks!

Jean

Comment 17 Jean-Tsung Hsiao 2018-04-18 04:19:42 UTC
NFP passed the following test case --- netperf between guests over OVS-dpdk/NFP vxlan tunnel:

[root@localhost ~]# netperf -H 172.16.33.106 -l 60; netperf -H 172.16.33.106 -t TCP_MAERTS -l 60
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.16.33.106 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    60.00    3878.57   
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.16.33.106 () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    60.00    3801.05   
[root@localhost ~]#

Comment 19 Timothy Redaelli 2018-05-03 14:37:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1267