Bug 1325984

Summary: DPDK-OVS-vlan bridge setup; when nic bind to dpdk driver there is no connectivity between hosts
Product: Red Hat OpenStack Reporter: Eran Kuris <ekuris>
Component: openvswitch Assignee: Flavio Leitner <fleitner>
Status: CLOSED NOTABUG QA Contact: Ofer Blaut <oblaut>
Severity: high Docs Contact:
Priority: high    
Version: 8.0 (Liberty) CC: aconole, apevec, chrisw, edannon, ekuris, fbaudin, jhsiao, mlopes, pmatilai, rhos-maint, rkhan, srevivo, tfreger
Target Milestone: --- Keywords: Regression, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-21 05:26:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description    Flags
ovs            none
sosreport      none

Description Eran Kuris 2016-04-11 14:09:30 UTC
Description of problem: This looks like a regression, because the same setup worked for me in the past.
I installed a DPDK environment with a VLAN bridge.
When the physical port is bound to the DPDK driver, connectivity between the hosts is lost.
When the port is bound and we launch an instance, it boots without an IP address because it cannot reach the DHCP server.
When I unbind the port and run "dhclient" from the instance, it gets an IP address.
There is also an error in the "ovs-vsctl show" output: "could not open network device enp5s0f1 (No such device)"


[root@puma48 ~]# ovs-vsctl show
48411fb5-3081-4a79-ba11-19f3a49d7ed1
    Bridge br-vlan
        Port br-vlan
            Interface br-vlan
                type: internal
        Port "enp5s0f1"
            Interface "enp5s0f1"
                error: "could not open network device enp5s0f1 (No such device)"
        Port phy-br-vlan
            Interface phy-br-vlan
                type: patch
                options: {peer=int-br-vlan}
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
    Bridge br-int
        fail_mode: secure
        Port "vhu3cdf6fdc-9b"
            tag: 1
            Interface "vhu3cdf6fdc-9b"
                type: dpdkvhostuser
        Port br-int
            Interface br-int
                type: internal
        Port int-br-vlan
            Interface int-br-vlan
                type: patch
                options: {peer=phy-br-vlan}
    ovs_version: "2.4.0"
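
A minimal sketch, assuming the indented layout of the "ovs-vsctl show" output above, of how the interfaces that carry an error line could be picked out automatically. The function name failed_interfaces is hypothetical; it only parses text on stdin and does not talk to OVS itself.

```shell
#!/bin/sh
# Print "<interface>: <error text>" for every Interface block that carries
# an "error:" line in `ovs-vsctl show` output (read from stdin).
failed_interfaces() {
    awk '
        # Remember the most recent interface name, without its quotes.
        $1 == "Interface" { name = $2; gsub(/"/, "", name) }
        # An error line belongs to the interface seen just before it.
        $1 == "error:"    { sub(/^[ \t]*error:[ \t]*/, ""); print name ": " $0 }
    '
}

# Usage on a live host (needs ovs-vsctl on PATH):
#   ovs-vsctl show | failed_interfaces
```

On the output above this would report enp5s0f1 only, which matches the port that was later identified as the one already bound to DPDK.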

neutron-openvswitch-agent log:
2016-04-11 16:22:12.533 2960 ERROR neutron.agent.linux.utils [req-8698ad69-e997-4b7d-a260-53ff226be25e - - - - -]
Command: ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']
Exit code: 1
Stdin:
Stdout:
Stderr: ovs-ofctl: br-int is not a bridge or a socket

2016-04-11 16:22:12.534 2960 ERROR neutron.agent.common.ovs_lib [req-8698ad69-e997-4b7d-a260-53ff226be25e - - - - -] Unable to execute ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']. Exception:
Command: ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']
Exit code: 1
Stdin:
Stdout:
Stderr: ovs-ofctl: br-int is not a bridge or a socket
 
Version-Release number of selected component (if applicable):
[root@puma48 ~]# rpm -qa |grep dpdk
dpdk-2.2.0-3.el7.x86_64
dpdk-tools-2.2.0-3.el7.x86_64
openvswitch-dpdk-2.4.0-0.10346.git97bab959.3.el7_2.x86_64
[root@puma48 ~]# rpm -qa |grep  neutron
openstack-neutron-common-7.0.1-15.el7ost.noarch
openstack-neutron-7.0.1-15.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
python-neutron-7.0.1-15.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
[root@puma48 ~]# rpm -qa |grep  packstack
openstack-packstack-7.0.0-0.14.dev1702.g490e674.el7ost.noarch
openstack-packstack-puppet-7.0.0-0.14.dev1702.g490e674.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Run the installation on a VLAN setup using this guide: https://docs.google.com/document/d/1K_ku6_08ooq46dFLiE7fAJ0ByFdPCb0W_q6kKqF3Y0o/edit#

Actual results:


Expected results:


Additional info:

Comment 2 Aaron Conole 2016-04-12 13:01:28 UTC
Can you post the results of:

* ovs-vsctl list Bridge
* ovs-vsctl list Interface
* ip l

Thanks
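
A small sketch of how the three outputs requested above could be gathered into a single attachment. The helper name collect and the output file name are assumptions, not part of any OVS tooling.

```shell
#!/bin/sh
# Run each command (passed as one string per argument) and label its output,
# so that everything can be attached to the bug as a single file.
collect() {
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd"
        sh -c "$cmd" 2>&1 || printf '(command failed with status %s)\n' "$?"
    done
}

# Usage on the affected host (the output file name is arbitrary):
#   collect 'ovs-vsctl list Bridge' 'ovs-vsctl list Interface' 'ip l' > ovs-debug.txt
```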

Comment 3 Eran Kuris 2016-04-12 13:31:58 UTC
Created attachment 1146449 [details]
ovs

Comment 4 Eran Kuris 2016-04-12 13:32:52 UTC
Attached a file with all the output you asked for.

Comment 5 Flavio Leitner 2016-04-13 23:38:31 UTC
It sounds like you have stale interfaces in the OVS DB. I wonder if the host crashed or didn't shut down properly, which might lead to that.

Any chance you could stop the OSP services, delete all bridges and then restart the host?
That would make sure everything is fresh.

thanks,
fbl

Comment 6 Eran Kuris 2016-04-14 06:22:43 UTC
fbl,
I saw the issue on a fresh setup. If you want, we can set up a debug session and I will show you the setup.

Comment 7 Aaron Conole 2016-04-14 13:48:53 UTC
Eran, thanks for that command output - it is most helpful.

Please send the output of the following script:

#!/bin/bash
# Check that each bridge has its management and snoop sockets in one of the
# usual Open vSwitch run directories. (Arrays require bash, not plain sh.)

FILES=("br-int" "br-vlan")
FILE_SUFFIXES=(".mgmt" ".snoop")
FILE_PATHS=("/var/run/openvswitch" "/usr/var/run/openvswitch" "/usr/local/var/run/openvswitch")

found=0
for file in "${FILES[@]}"; do
    echo -n "Checking for files related to $file... "
    for suffix in "${FILE_SUFFIXES[@]}"; do
        for path in "${FILE_PATHS[@]}"; do
            # Use -e rather than -f: these entries are unix sockets, not regular files.
            if [ -e "${path}/${file}${suffix}" ]; then
                echo -n "found file for $file"
                found=$(( found + 1 ))
            fi
        done
    done
    echo " ... done"
done

# One file is expected per bridge and suffix combination.
expected=$(( ${#FILES[@]} * ${#FILE_SUFFIXES[@]} ))
if [ "$found" != "$expected" ]; then
   echo "E: Suspicous mismatch of files (${found} vs ${expected})"
   exit 1
fi
echo "I: Files seem to be in order"
exit 0

Comment 8 Flavio Leitner 2016-04-14 13:54:05 UTC
Could you please provide a sosreport?

It seems like "enp5s0f1" was added to the bridge, but it is the one bound to DPDK (the dpdk0 interface), right? That would point to incorrect installation steps, though the document looks good.

One improvement on that doc would be to use driverctl:
http://people.redhat.com/~pmatilai/dpdk-guide/setup/binding.html#vfio

Thanks!
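
A hedged sketch of what a driverctl-based binding could look like. The PCI address 0000:05:00.1 is an assumption inferred from the interface name enp5s0f1 and must be checked on the host (for example with "driverctl -v list-devices | grep -i net"). The run wrapper and the DRY_RUN variable are hypothetical helpers, so by default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical PCI address: replace with the address of the NIC to hand to DPDK.
PCI_ADDR="${PCI_ADDR:-0000:05:00.1}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

bind_to_vfio() {
    run driverctl set-override "$1" vfio-pci   # override is saved and survives reboots
    run driverctl list-overrides               # confirm the override was recorded
}

bind_to_vfio "$PCI_ADDR"
```

Running it with DRY_RUN=0 as root would perform the binding. "driverctl unset-override <address>" returns the device to its default kernel driver.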

Comment 9 Flavio Leitner 2016-04-14 13:55:44 UTC
Just to be clear: comment#8 complements comment#7. So, both are relevant.

Comment 10 Eran Kuris 2016-04-17 06:01:33 UTC
[root@puma48 ~]# ./bug.sh 
Checking for files related to br-int...  ... done
Checking for files related to br-vlan...  ... done
E: Suspicous mismatch of files (0 vs 4)
[root@puma48 ~]# ./bug.sh 
Checking for files related to br-int...  ... done
Checking for files related to br-vlan...  ... done
E: Suspicous mismatch of files (0 vs 4)

sosreport attached.

In a few days I'm leaving for a long vacation. Your contact from Neutron QE is edannon.

Comment 11 Eran Kuris 2016-04-17 06:04:09 UTC
Created attachment 1148049 [details]
sosreport

Comment 12 Flavio Leitner 2016-04-18 18:10:40 UTC
The sosreport is missing the openvswitch module; I'm not sure whether it wasn't enabled or didn't work.

Anyway, it seems we are missing the unix sockets in /var/run/openvswitch, or the tool is looking somewhere else.  Could you check whether you have that directory and what is inside it?

Do you have OVS tools installed somewhere else besides the ones provided by the RPM package?  A locally compiled OVS would most probably install to another path and look for the unix sockets in another place as well.

I said this by email, but for completeness: the stale interface "enp5s0f1" is likely the one bound to DPDK, so it shouldn't have been included in OVS at all. Please remove it. Also, if it was used by the kernel, chances are that it might not work when moved to DPDK.  Therefore, I'd recommend creating an ifcfg- file for it that disables the interface (ONBOOT=no, NM_CONTROLLED=no), then rebooting, double-checking that the interface is listed but in 'DOWN' state, then binding it to DPDK, starting OVS and so on.

Thanks!
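
A minimal sketch of the ifcfg file described above, assuming the RHEL 7 network-scripts location. The function name write_disabled_ifcfg is hypothetical, and only the two settings named in the comment plus the DEVICE key are written.

```shell
#!/bin/sh
# Write an ifcfg file that keeps the NIC from being brought up at boot.
# $1: interface name, $2: target directory (defaults to the RHEL 7 location).
write_disabled_ifcfg() {
    ifname=$1
    dir=${2:-/etc/sysconfig/network-scripts}
    cat > "$dir/ifcfg-$ifname" <<EOF
DEVICE=$ifname
ONBOOT=no
NM_CONTROLLED=no
EOF
}

# Usage (as root), followed by a reboot:
#   write_disabled_ifcfg enp5s0f1
```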

Comment 13 Eran Kuris 2016-04-19 04:11:37 UTC
Flavio, I don't have the directory /var/run/openvswitch,
which is strange:
[root@puma48 ~]# cat  /etc/neutron/plugins/ml2/openvswitch_agent.ini |grep -i openvswitch
# '/var/run/openvswitch' is the default value
vhostuser_socket_dir = /var/run/openvswitch

About your comment that the stale interface "enp5s0f1" is likely the one bound to DPDK and shouldn't have been included in OVS at all: I'm not sure why I need to remove it from the bridge. Otherwise, how will the nodes communicate?

Comment 14 Eran Kuris 2016-04-19 05:47:11 UTC
The directory exists after I restarted the services:
[root@puma48 ~]# cd /var/run/openvswitch
[root@puma48 openvswitch]# ll
total 8
srwx------ 1 root qemu 0 Apr 19 08:05 br-int.mgmt
srwx------ 1 root qemu 0 Apr 19 08:05 br-int.snoop
srwx------ 1 root qemu 0 Apr 19 08:05 br-vlan.mgmt
srwx------ 1 root qemu 0 Apr 19 08:05 br-vlan.snoop
srwx------ 1 root qemu 0 Apr 19 08:05 db.sock
srwx------ 1 root qemu 0 Apr 19 08:05 ovsdb-server.54708.ctl
-rw-r--r-- 1 root qemu 6 Apr 19 08:05 ovsdb-server.pid
srwx------ 1 root qemu 0 Apr 19 08:05 ovs-vswitchd.54724.ctl
-rw-rw-r-- 1 root qemu 6 Apr 19 08:05 ovs-vswitchd.pid
srwxrwxr-x 1 root qemu 0 Apr 19 08:05 vhu3cdf6fdc-9b
srwxrwxr-x 1 root qemu 0 Apr 19 08:05 vhuec23fdc2-ad
[root@puma48 ~]# grep -ir "/var/run/openvswitch/" /var/log/
/var/log/messages:Apr 19 08:04:51 puma48 neutron-openvswitch-agent: 2016-04-19 08:04:51.881 2960 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] dies due to the error: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (No such file or directory)
/var/log/messages:Apr 19 08:05:22 puma48 neutron-openvswitch-agent: 2016-04-19 08:05:22.022 2960 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (No such file or directory)
/var/log/messages:Apr 19 08:05:22 puma48 neutron-openvswitch-agent: 2016-04-19 08:05:22.023 2960 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] dies due to the error: ovsdb-client: failed to connect to "unix:/var/run/openvswitch/db.sock" (No such file or directory)
/var/log/messages:Apr 19 08:05:47 puma48 ovs-ctl: VHOST_CONFIG: bind to /var/run/openvswitch/vhuec23fdc2-ad
/var/log/messages:Apr 19 08:05:47 puma48 ovs-ctl: VHOST_CONFIG: bind to /var/run/openvswitch/vhu3cdf6fdc-9b

[root@puma48 openvswitch]# grep -ir "/var/run/openvswitch/" /var/log/openvswitch/
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:46.627Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:46.631Z|00007|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:47.282Z|00018|dpdk|INFO|Socket /var/run/openvswitch/vhuec23fdc2-ad created for vhost-user port vhuec23fdc2-ad
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:47.288Z|00021|dpdk|INFO|Socket /var/run/openvswitch/vhu3cdf6fdc-9b created for vhost-user port vhu3cdf6fdc-9b
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:47.289Z|00024|connmgr|INFO|br-vlan: added service controller "punix:/var/run/openvswitch/br-vlan.mgmt"
/var/log/openvswitch/ovs-vswitchd.log:2016-04-19T05:05:47.324Z|00026|connmgr|INFO|br-int: added service controller "punix:/var/run/openvswitch/br-int.mgmt"

Comment 15 Flavio Leitner 2016-04-19 17:45:44 UTC
(In reply to Eran Kuris from comment #13)
> Flavio I dont have this directory /var/run/openvswitch .
> which is strange : 
> [root@puma48 ~]# cat  /etc/neutron/plugins/ml2/openvswitch_agent.ini |grep
> -i openvswitch
> # '/var/run/openvswitch' is the default value
> vhostuser_socket_dir = /var/run/openvswitch

OVS keeps the sockets there, so if that directory disappears, OVS won't work well.  That directory is managed by the openvswitch systemd service.

The description says:
"""
2016-04-11 16:22:12.534 2960 ERROR neutron.agent.common.ovs_lib [req-8698ad69-e997-4b7d-a260-53ff226be25e - - - - -] Unable to execute ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']. Exception:
Command: ['ovs-ofctl', 'dump-flows', 'br-int', 'table=23']
"""

Looking at the journal I see this:
Apr 11 16:22:05 puma48.scl.lab.tlv.redhat.com systemd[1]: Starting Open vSwitch Internal Unit...
Apr 11 16:22:06 puma48.scl.lab.tlv.redhat.com ovs-ctl[4489]: Starting ovsdb-server [  OK  ]
Apr 11 16:22:06 puma48.scl.lab.tlv.redhat.com ovs-ctl[4489]: Starting ovs-vswitchd 2016-04-11T13:22:06Z|00001|dpdk|INFO|No -vhost_sock_dir provided - defaulting to /var/run/openvswitch
[...]
<the event reported happens here>
[...]
Apr 11 16:22:12 puma48.scl.lab.tlv.redhat.com ovs-ctl[4489]: Enabling remote OVSDB managers [  OK  ]
Apr 11 16:22:12 puma48.scl.lab.tlv.redhat.com systemd[1]: Started Open vSwitch Internal Unit.
Apr 11 16:22:12 puma48.scl.lab.tlv.redhat.com systemd[1]: Starting Open vSwitch...
Apr 11 16:22:12 puma48.scl.lab.tlv.redhat.com systemd[1]: Started Open vSwitch.

So, the errors seem to be a consequence of running commands while OVS wasn't ready.
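
One way to avoid racing the OVS startup is to wait for the relevant socket before issuing commands. This is a sketch under the assumption that the presence of the bridge management socket is a good enough readiness signal. The function name wait_for_path and the timeout values are hypothetical.

```shell
#!/bin/sh
# Wait until a path (for example a bridge management socket) exists,
# polling once per second.
# $1: path, $2: timeout in seconds (default 30).
# Returns 0 once the path exists, 1 if the timeout expires first.
wait_for_path() {
    path=$1
    remaining=${2:-30}
    while [ ! -e "$path" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage before querying flows on br-int:
#   wait_for_path /var/run/openvswitch/br-int.mgmt 60 && ovs-ofctl dump-flows br-int
```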
 
> About your comment that the stale interface "enp5s0f1" likely is the one
> bound to DPDK. So, it shouldn't have been included in OVS at   ,  I'm not
> sure why I  need to remove it from the bridge , otherwise how the nodes will
> communicate ?

You can't use enp5s0f1 and dpdk0 if they are the same NIC.  If you bound enp5s0f1 to DPDK, it is now called dpdk0 and you can't use enp5s0f1 anymore. The reverse is also true: you can't use the same interface with both DPDK and the kernel.  That is most probably the reason for the communication issue on the host.

You need to remove enp5s0f1 from the OVSDB, use ifcfg-enp5s0f1 to not bring up the device, then reboot. The interface should be listed by 'ip link' but it must be in DOWN state.  After that, you need to bind the interface to DPDK (which takes the interface out of kernel's control) and only then start openvswitch-dpdk.
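
The first and third steps above can be sketched as follows, using the bridge and interface names from this bug. The helper link_is_down is hypothetical and only inspects the one-line output of "ip -o link show" on stdin.

```shell
#!/bin/sh
# Succeeds when `ip -o link show dev <if>` output (read from stdin)
# reports the link in DOWN state.
link_is_down() {
    grep -q ' state DOWN '
}

# Sequence on the affected host (run as root):
#   ovs-vsctl --if-exists del-port br-vlan enp5s0f1   # drop the stale kernel port
#   ... disable the device via ifcfg-enp5s0f1 and reboot ...
#   ip -o link show dev enp5s0f1 | link_is_down && echo "safe to bind to DPDK"
```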

Comment 16 Flavio Leitner 2016-04-19 17:49:43 UTC
BTW, could you check if after a clean boot, you still see DMAR messages in the logs?

If yes, you might want to boot passing 'iommu=pt' to work around the issue which could be related to the connectivity issue as well.
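
A small sketch for checking whether the running kernel was booted with a given parameter. The function name cmdline_has is hypothetical. It matches whole words, so intel_iommu=on is not mistaken for iommu=pt.

```shell
#!/bin/sh
# Succeeds when the kernel command line contains the exact parameter.
# $1: parameter (e.g. iommu=pt), $2: command line (default: /proc/cmdline).
cmdline_has() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Usage:
#   cmdline_has iommu=pt || echo "iommu=pt is not set"
#   dmesg | grep -i dmar     # look for DMAR messages after a clean boot
# On RHEL 7 the parameter can be added with:
#   grubby --update-kernel=ALL --args=iommu=pt
```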

Comment 17 Eran Kuris 2016-04-20 06:27:47 UTC
Eyal, the setup is yours now, so please provide Flavio with the details.

Comment 18 Eyal Dannon 2016-04-20 06:57:10 UTC
I tried what you suggested and I still have the same issue.
As for the DMAR messages, they disappeared after a clean reboot with the configuration you suggested.

Comment 19 Eyal Dannon 2016-04-20 18:24:46 UTC
Flavio Leitner found that we were using the "uio_pci_generic" driver for the DPDK interface.
The supported driver is "vfio-pci", and with that driver the issue is gone.
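
A sketch of how the bound driver could be checked from sysfs before starting OVS, so that a uio_pci_generic binding is noticed early. The function names and the PCI address are hypothetical. The optional sysfs root argument exists only so the check can be exercised against a fake directory tree.

```shell
#!/bin/sh
# Print the driver currently bound to a PCI device, read from sysfs.
# $1: PCI address, $2: sysfs root (default /sys; overridable for testing).
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    [ -L "$link" ] || return 1          # no driver bound, or no such device
    basename "$(readlink "$link")"
}

# Warn when the device is not on vfio-pci, the driver this bug settled on.
check_dpdk_driver() {
    drv=$(bound_driver "$1" "$2") || { echo "W: $1 has no driver bound"; return 1; }
    if [ "$drv" != "vfio-pci" ]; then
        echo "W: $1 is bound to $drv, expected vfio-pci"
        return 1
    fi
    echo "I: $1 is bound to vfio-pci"
}

# Usage (0000:05:00.1 is a hypothetical address):
#   check_dpdk_driver 0000:05:00.1
```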

Comment 20 Rashid Khan 2016-04-20 20:14:55 UTC
Eyal, 
Can we close the issue then? 


Flavio, 
Others will do the same thing in the future: uio vs. vfio.
Can we put a big giant warning in dmesg or somewhere else if we see unsupported uio? Just a thought.
If you like the thought, then please start an enhancement request BZ and we will give it to someone on the team.

Comment 21 Martin Lopes 2017-07-24 00:33:13 UTC
Note: the google doc guide mentioned in the description has been published here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/network_functions_virtualization_configuration_guide/