Bug 1397299 - Explicit restart of openvswitch required for dpdk0 association
Summary: Explicit restart of openvswitch required for dpdk0 association
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Aaron Conole
QA Contact: Maxim Babushkin
URL:
Whiteboard:
Depends On:
Blocks: 1468751
Reported: 2016-11-22 08:51 UTC by Saravanan KR
Modified: 2022-08-16 13:51 UTC (History)

Fixed In Version: openvswitch-2.7.2-1.git20170719
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1468751
Environment:
Last Closed: 2018-02-16 15:28:18 UTC
Target Upstream Version:
Embargoed:

Links
System                 ID        Private  Priority  Status  Summary  Last Updated
Launchpad              1649267   0        None      None    None     2016-12-12 13:50:32 UTC
Red Hat Issue Tracker  OSP-4585  0        None      None    None     2022-08-16 13:51:32 UTC

Description Saravanan KR 2016-11-22 08:51:00 UTC
Description of problem:
During the deployment of OVS-DPDK, 'ovs-vsctl show' displays an error for the dpdk port binding:

Port "dpdk0"
    Interface "dpdk0"
        type: dpdk
        error: "could not open network device dpdk0 (Address family not supported by protocol)" 

If openvswitch is restarted, the error goes away and DPDK works as expected. To achieve this, the deployment currently needs a post-install script that restarts openvswitch: https://review.openstack.org/#/c/395431/1/doc/source/advanced_deployment/ovs_dpdk_config.rst@195


The same error also occurs when the DPDK compute node is rebooted. Again, restarting openvswitch makes the problem go away. We need to identify why the restart of openvswitch is required.
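To spot this condition quickly after a deployment or a reboot, the error lines can be filtered out of the 'ovs-vsctl show' output. The following is only a minimal sketch: list_ovs_port_errors is a hypothetical helper name, and it assumes the indented "Interface" / "error:" layout shown in the output above.

```shell
# Print "<interface>: <error>" for every interface that reports an error.
# Reads `ovs-vsctl show` output from stdin.
# list_ovs_port_errors is a hypothetical helper, not part of Open vSwitch.
list_ovs_port_errors() {
    awk '
        # Remember the most recent interface name, without its quotes.
        $1 == "Interface" { iface = $2; gsub(/"/, "", iface) }
        # Strip the "error:" label and print the message with that name.
        $1 == "error:"    { sub(/^[ \t]*error:[ \t]*/, ""); print iface ": " $0 }
    '
}
```

On a compute node it could be used as: ovs-vsctl show | list_ovs_port_errors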


Version-Release number of selected component (if applicable):
openvswitch.x86_64
2.5.0-14.git20160727.el7fdp
@rhos-10.0-rhel-7-fast-datapath


How reproducible:
100%

Steps to Reproduce:
Deploy with environment as guided in https://mojo.redhat.com/docs/DOC-1100744


Additional info:
All the required kernel args are set with the help of first-boot templates and the compute has been restarted before the actual configuration starts.

Comment 1 Maxim Babushkin 2016-11-22 09:13:34 UTC
Hi Saravanan,

I'm not facing this behavior.
In my post-install script I'm restarting the openvswitch service in addition to openvswitch-nonetwork.
For me, the service-restart section of post-install.yaml looks like the following:

systemctl daemon-reload
systemctl restart openvswitch-nonetwork
systemctl restart openvswitch

Comment 2 Saravanan KR 2016-11-23 06:09:50 UTC
(In reply to Maxim Babushkin from comment #1)
> Hi Saravanan,
> 
> I'm not facing this behavior.
> In my post-install script I'm restarting openvswitch service additional to
> openvswitch-nonetwork.
> For me the section of services restart in post-install.yaml looks like the
> following:
> 
> systemctl daemon-reload
> systemctl restart openvswitch-nonetwork
> systemctl restart openvswitch

The reason for this BZ is to understand why we need to restart openvswitch in the post-install script. As the compute node has already been rebooted after all the configuration changes, it should work when puppet enables openvswitch.



In my setup, when I restart the compute node (after the complete deployment), the output of "ovs-vsctl show" is as below:
    Bridge br-link
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port phy-br-link
            Interface phy-br-link
                type: patch
                options: {peer=int-br-link}
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                error: "could not open network device dpdk0 (No such device)"

The templates used for the deployment are at https://github.com/krsacme/tht-dpdk/tree/33651e4c10b28714727435e49c0da9665175d149

Comment 3 Flavio Leitner 2016-11-25 16:37:54 UTC
(In reply to Maxim Babushkin from comment #1)
> systemctl restart openvswitch-nonetwork

Please do not add dependencies on openvswitch-nonetwork service because it is
for internal use of OVS initialization.

> systemctl restart openvswitch

That should be enough if you need to restart OVS.

> error: "could not open network device dpdk0 (No such device)"

That means the DPDK port wasn't available when OVS started. You need to have all physical DPDK ports available when OVS initializes.

That behavior is changing upstream, so OVS 2.7 will most probably be able to hotplug DPDK ports.

The questions, then, are: how are you binding the NIC, and when?
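One way to answer the "how and when" question on a node is to read the driver binding from sysfs before OVS starts. The sketch below assumes the standard /sys/bus/pci/devices/<address>/driver symlink; pci_bound_driver is a hypothetical helper name and the PCI address in the usage line is only an example.

```shell
# Print the kernel driver a PCI device is currently bound to, or "none".
# $1: PCI address, for example 0000:05:00.0 (example value only)
# $2: sysfs root, defaults to /sys (overridable so it can be tested)
# pci_bound_driver is a hypothetical helper, not an existing tool.
pci_bound_driver() {
    dev="${2:-/sys}/bus/pci/devices/$1"
    if [ -L "$dev/driver" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$dev/driver")"
    else
        echo none
    fi
}
```

For example, "pci_bound_driver 0000:05:00.0" would be expected to print vfio-pci or igb_uio once the NIC has been bound for DPDK use.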

Comment 4 Maxim Babushkin 2016-11-28 12:37:08 UTC
From the tests:

'systemctl restart openvswitch' is enough to restart OVS.

The restart of openvswitch has to be placed in the post-install.yaml script.
If I try to restart the openvswitch service in the first-boot.yaml script, I get the following errors:


/var/log/openvswitch/ovs-vswitchd.log
netdev|WARN|could not create netdev dpdk0 of unknown type dpdk
bridge|WARN|could not open network device dpdk0 (Address family not supported by protocol)
netdev|WARN|could not create netdev dpdk1 of unknown type dpdk
bridge|WARN|could not open network device dpdk1 (Address family not supported by protocol)
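To collect these warnings on a node, the log can be filtered for WARN lines that mention a dpdkN port. This is only a sketch: dpdk_port_warnings is a hypothetical helper name, and the default path is the log file quoted above.

```shell
# Print ovs-vswitchd log lines at WARN level that mention a dpdkN port.
# $1: log file, defaults to the path quoted in this comment.
# dpdk_port_warnings is a hypothetical helper, not part of Open vSwitch.
dpdk_port_warnings() {
    grep -E '\|WARN\|.*dpdk[0-9]+' "${1:-/var/log/openvswitch/ovs-vswitchd.log}"
}
```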

Comment 5 Maxim Babushkin 2016-11-28 12:42:32 UTC
The bind is done with the heat templates, by using 'ovs_dpdk_port' within the OVS DPDK bridge.

The first-boot.yaml file contains a workaround that changes the openvswitch service so that the instance is able to boot up.
The actual bind happens after the first reboot of the compute node, when the post-install.yaml script restarts the openvswitch service.

Comment 6 Flavio Leitner 2016-12-01 13:50:29 UTC
(In reply to Maxim Babushkin from comment #5)
> The actual bind happen after the first reboot of the compute node, when
> post-install.yaml script restart the openvswitch service.

OK, so it reboots, then OVS is started by default, Neutron binds the NIC and OVS is restarted. Is that correct?

I've found this NeutronDpdkDriverType: "vfio-pci", but I can't tell when and how the NIC is being configured.

Comment 7 Flavio Leitner 2016-12-01 14:02:51 UTC
BTW, the openvswitch-nonetwork.service used in [1] doesn't exist in 
22.git20160727.el7fdp due to bz#1397049.

[1] https://github.com/krsacme/tht-dpdk/blob/33651e4c10b28714727435e49c0da9665175d149/first-boot.yaml#L48

We have one service per daemon now, so that needs to be updated.

Comment 8 Karthik Sundaravel 2016-12-02 06:07:40 UTC
(In reply to Flavio Leitner from comment #6)
> (In reply to Maxim Babushkin from comment #5)
> > The actual bind happen after the first reboot of the compute node, when
> > post-install.yaml script restart the openvswitch service.
> 
> OK, so it reboots, then OVS is started by default, Neutron binds the NIC and
> OVS is restarted. Is that correct?
> 
> I've found this NeutronDpdkDriverType: "vfio-pci", but I can't tell when and
> how the NIC is being configured.

The sequence of steps, AFAIK:

a) os-net-config binds the vfio-pci/igb_uio driver to the DPDK NIC [1].

b) The first-boot scripts [2] run. This script performs a reboot.

c) Puppet sets DPDK_OPTIONS in /etc/sysconfig/openvswitch [3] and enables the openvswitch service.

After step c) we still observe the error "could not open network device dpdk0 (No such device)". As a workaround we've restarted openvswitch.


[1] https://github.com/openstack/os-net-config/blob/35823f261506f9256c1a227dd4a2770a0508c62d/os_net_config/utils.py#L180
[2] https://github.com/krsacme/tht-dpdk/blob/33651e4c10b28714727435e49c0da9665175d149/first-boot.yaml#L48
[3] https://github.com/openstack/puppet-vswitch/blob/master/manifests/dpdk.pp#L67

Comment 9 Flavio Leitner 2016-12-02 17:30:39 UTC
OK, most probably udev is racing with openvswitch service.

Could you try patching the ovs-vswitchd.service? 

--- ovs-vswitchd.service.bk	2016-12-02 15:19:09.363393965 -0200
+++ ovs-vswitchd.service	2016-12-02 15:19:32.968918348 -0200
@@ -1,6 +1,7 @@
 [Unit]
 Description=Open vSwitch Forwarding Unit
-After=ovsdb-server.service
+Wants=systemd-udev-settle.service
+After=ovsdb-server.service systemd-udev-settle.service
 Requires=ovsdb-server.service
 ReloadPropagatedFrom=ovsdb-server.service
 AssertPathIsReadWrite=/var/run/openvswitch/db.sock


Thanks
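The same ordering can also be tried without editing the packaged unit file, by placing it in a systemd drop-in. This is a sketch only: it assumes the unit is named ovs-vswitchd.service as in the diff above, the file name udev-settle.conf is arbitrary, and write_udev_settle_dropin is a hypothetical helper name. Wants= and After= given in a drop-in are added to the values already in the unit.

```shell
# Write a drop-in that makes ovs-vswitchd wait for udev to settle.
# $1: systemd configuration root, defaults to /etc/systemd/system
# write_udev_settle_dropin and udev-settle.conf are illustrative names.
write_udev_settle_dropin() {
    dir="${1:-/etc/systemd/system}/ovs-vswitchd.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/udev-settle.conf" <<'EOF'
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
}
```

After writing the file, 'systemctl daemon-reload' is needed before the next start of the service picks up the new ordering.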

Comment 10 Vijay Chundury 2016-12-05 08:59:28 UTC
Karthik,
As Flavio suggested, can you test this scenario with the udev patch applied and the explicit restart of openvswitch removed?

Ideally, if this works, the post-install script that goes to Deepthi should not need any restarts :).

Regards
Vijay.

Comment 11 Yariv 2016-12-06 11:52:09 UTC
(In reply to Flavio Leitner from comment #9)
> OK, most probably udev is racing with openvswitch service.
> 
> Could you try patching the ovs-vswitchd.service? 
> 
> --- ovs-vswitchd.service.bk	2016-12-02 15:19:09.363393965 -0200
> +++ ovs-vswitchd.service	2016-12-02 15:19:32.968918348 -0200
> @@ -1,6 +1,7 @@
>  [Unit]
>  Description=Open vSwitch Forwarding Unit
> -After=ovsdb-server.service
> +Wants=systemd-udev-settle.service
> +After=ovsdb-server.service systemd-udev-settle.service
>  Requires=ovsdb-server.service
>  ReloadPropagatedFrom=ovsdb-server.service
>  AssertPathIsReadWrite=/var/run/openvswitch/db.sock
> 
> 
> Thanks

Did the Bengaluru team test the patch?

Comment 12 Maxim Babushkin 2016-12-06 11:54:01 UTC
Checking the patch right now.

Comment 13 Maxim Babushkin 2016-12-07 11:10:15 UTC
The patch suggested by Flavio applies to the 2.5.0-22 version, but it can't be applied to 2.5.0-14.

For the 2.5.0-22 version, the patch doesn't work.


For the 2.5.0-14 version, I tried to adapt the patch by making the equivalent changes to openvswitch-nonetwork.service, but without success.

Comment 14 Saravanan KR 2016-12-12 13:49:56 UTC
> That means the DPDK port wasn't available when OVS started. You need to have all physical DPDK ports available when OVS initializes.

Thanks, Flavio, for the pointer. The issue is that openvswitch is not restarted after /etc/sysconfig/openvswitch is modified with DPDK_OPTIONS. After adding the puppet code for that, it works. I have raised the review upstream:

https://review.openstack.org/#/c/409779/
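To check by hand whether a node is in this state, the modification time of /etc/sysconfig/openvswitch can be compared with the time the service last became active. This is only an illustration, not the puppet change in the review: both function names are hypothetical, GNU stat and date are assumed, and it assumes 'date -d' can parse the timestamp string that systemctl prints.

```shell
# Succeed (exit 0) when the config was modified after the service started,
# i.e. the running daemon predates the configuration change.
# $1: config mtime, $2: service start time, both in seconds since the epoch.
config_newer_than_service() {
    [ "$1" -gt "$2" ]
}

# Gather both timestamps on a live system (hypothetical helper).
# Returns 0 if a restart looks necessary, 1 if not, 2 on lookup failure.
ovs_needs_restart() {
    cfg_mtime=$(stat -c %Y /etc/sysconfig/openvswitch) || return 2
    started=$(systemctl show -p ActiveEnterTimestamp openvswitch | cut -d= -f2-)
    [ -n "$started" ] || return 2
    svc_start=$(date -d "$started" +%s) || return 2
    config_newer_than_service "$cfg_mtime" "$svc_start"
}
```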

Comment 15 Saravanan KR 2016-12-14 07:04:03 UTC
This bug covers 2 issues:
1) The DPDK port is not up during the deployment. The above review addresses this.
2) The DPDK port is not up if we restart the compute node after a successful deployment. This issue is still present and needs to be investigated.

Comment 16 Karthik Sundaravel 2016-12-14 07:09:14 UTC
The DPDK error shown by 'ovs-vsctl show' after a reboot is:

Port "dpdk0"
  Interface "dpdk0"
    type: dpdk
    error: "could not open network device dpdk0 (No such device)"

Comment 17 Karthik Sundaravel 2017-01-09 07:42:46 UTC
As Saravanan mentioned, the errors were seen in 2 cases. An update on the 2nd case.

[heat-admin@overcloud-compute-0 ~]$ cat /usr/lib/systemd/system/openvswitch-nonetwork.service
[Unit]
Description=Open vSwitch Internal Unit
After=syslog.target systemd-udev-settle.service
PartOf=openvswitch.service
Wants=openvswitch.service systemd-udev-settle.service

[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/sysconfig/openvswitch
ExecStart=/usr/share/openvswitch/scripts/ovs-ctl start \
          --system-id=random $OPTIONS
ExecStop=/usr/share/openvswitch/scripts/ovs-ctl stop
RuntimeDirectory=openvswitch
RuntimeDirectoryMode=0775
Group=qemu
UMask=0002

After using the above openvswitch-nonetwork.service file in OVS 2.5.0-14, with the changes suggested by Flavio, we are no longer able to reproduce this issue (attempted 100 times).

I think this issue needs to be addressed in OVS.
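To confirm that a given node actually carries this ordering, the unit file can be checked for systemd-udev-settle.service on an After= line. A minimal sketch, with unit_waits_for_udev_settle as a hypothetical helper name; it inspects only the file passed to it and does not account for drop-ins.

```shell
# Succeed when the unit file at $1 lists systemd-udev-settle.service
# on an After= line (drop-in directories are not inspected).
# unit_waits_for_udev_settle is a hypothetical helper name.
unit_waits_for_udev_settle() {
    grep -Eq '^After=(.* )?systemd-udev-settle\.service( |$)' "$1"
}
```

For example: unit_waits_for_udev_settle /usr/lib/systemd/system/openvswitch-nonetwork.service && echo "ordering present"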

Comment 26 Aaron Conole 2017-07-07 20:10:14 UTC
Let's keep this bug for just the initial work.  I have opened a new bug which will be used to track any backport effort to 2.6

