Bug 1463627 - ovs-vswitchd fails when dpdk is enabled in OSP12 puddle (RHEL7.4) with ovs 2.7
Summary: ovs-vswitchd fails when dpdk is enabled in OSP12 puddle (RHEL7.4) with ovs 2.7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: Emilien Macchi
QA Contact: Yariv
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-06-21 11:23 UTC by Saravanan KR
Modified: 2021-03-11 15:23 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-7.0.0-0.20170805163048.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 21:33:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
private ovs build (4.70 MB, application/x-rpm)
2017-07-05 18:57 UTC, Aaron Conole
Sos report for the private build 2.7.0-9.bz1463627.el7fdb (9.61 MB, application/x-xz)
2017-07-06 13:57 UTC, Karthik Sundaravel


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 478163 0 'None' MERGED Added OvS permission workaround for enabling DPDK 2020-10-26 12:50:26 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Saravanan KR 2017-06-21 11:23:18 UTC
Description of problem:
The ovs-vswitchd process fails when DPDK is enabled with the command below:
  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true


Version-Release number of selected component (if applicable):
openvswitch-2.7
osp12-puddle (2017-06-19.1)
RHEL7.4 nightly (http://download-node-02.eng.bos.redhat.com/composes/nightly/latest-RHEL-7/compose/Server/x86_64/os/Packages/)


Tried to deploy the OSP12 puddle with DPDK. The deployment failed and the compute nodes were not reachable. While trying to reproduce the issue on the controller node, found that enabling dpdk causes ovs-vswitchd to fail.
LOGS - http://pastebin.test.redhat.com/496216

Comment 1 Flavio Leitner 2017-06-21 14:07:42 UTC
Please attach an sosreport or at least the systemd logs since boot and the logs in /var/run/openvswitch/*

Thanks,
fbl

Comment 2 Saravanan KR 2017-06-22 07:25:30 UTC
(In reply to Flavio Leitner from comment #1)
> Please attach an sosreport or at least the systemd logs since boot and the
> logs in /var/run/openvswitch/*
> 
> Thanks,
> fbl

The sosreport is on Google Drive as it's more than 20 MB:
https://drive.google.com/open?id=0B2NDG0wO_XsqcDRETkN2bXlQNTg

Comment 3 Saravanan KR 2017-06-23 06:01:31 UTC
An observation: on the same RHEL 7.4 image, I tried the OvS 2.6 package (from fdp) instead of OvS 2.7, and it has the same issue. After setting dpdk-init=true, restarting openvswitch fails and the ovs-vswitchd service goes into the failed state.

Comment 4 Karthik Sundaravel 2017-06-28 09:08:27 UTC
We could reproduce it in a standalone RHEL 7.4 based VM with OvS 2.7.

The RHEL 7.4 VM image was obtained from [1] and the OvS RPM from [2].

[1] http://download-node-02.eng.bos.redhat.com/rel-eng/latest-RHEL-7/compose/Server/x86_64/images/rhel-guest-image-7.4-176.x86_64.qcow2

[2] http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.7.0/8.git20170530.el7fdb/x86_64/openvswitch-2.7.0-8.git20170530.el7fdb.x86_64.rpm

After installation of the openvswitch packages, I started the openvswitch service by doing "systemctl start openvswitch" and did "ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true".
After enabling dpdk, the openvswitch service fails.
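
The reproduction steps above can be sketched as a small script. This is only an illustration of the sequence described in this comment; the helper names `run` and `reproduce_dpdk_failure` and the `DRY_RUN` switch are hypothetical and not part of any OvS tooling. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reproduce_dpdk_failure: the sequence from this comment (hypothetical name).
reproduce_dpdk_failure() {
    # Start Open vSwitch with its default (non-DPDK) configuration.
    run systemctl start openvswitch
    # Enable DPDK initialization, as quoted in the report.
    run ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    # Show the resulting state of the forwarding daemon.
    run systemctl status ovs-vswitchd
}
```

Calling `reproduce_dpdk_failure` as root on a test VM runs the three steps; on an affected system the final status is expected to show the unit as failed, per the reports in this bug.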

Comment 5 Aaron Conole 2017-06-28 14:30:41 UTC
I tried viewing the sosreport, but I found that the tarball was corrupted.  Is it possible to get a complete version?

Just a guess - either hugepage configuration could be wrong, or there could be some kind of hardware issue with the i40e that they have. NOTE - that is a complete guess based on very minimal information.
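
One way to check the hugepage guess is to look at the hugepage counters in /proc/meminfo. The sketch below assumes the standard Linux meminfo field names (HugePages_Total, HugePages_Free, Hugepagesize); the function names are hypothetical. Both functions accept an optional path so that a meminfo copy saved elsewhere can be inspected instead of the live system.

```shell
#!/bin/sh
# hugepage_summary: print the hugepage counters from a meminfo-style file.
# Defaults to /proc/meminfo; pass another path to inspect a saved copy.
hugepage_summary() {
    meminfo="${1:-/proc/meminfo}"
    grep -E '^(HugePages_Total|HugePages_Free|Hugepagesize):' "$meminfo"
}

# hugepages_configured: succeed only if HugePages_Total is greater than zero.
hugepages_configured() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^HugePages_Total:/ { ok = ($2 > 0) } END { exit (ok ? 0 : 1) }' "$meminfo"
}
```

For example, `hugepages_configured || echo "no hugepages reserved"` flags a host where no hugepages have been reserved. It says nothing about how long allocation takes, so it only covers the misconfiguration half of the guess.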

Comment 6 Aaron Conole 2017-06-28 18:49:08 UTC
It appears that hugepage allocation is
very slow on those machines, and makes the system believe that
ovs-vswitchd has become unresponsive.

However, manually running the ovs-vswitchd:
 (cd /var/run/openvswitch && ovs-vswitchd \
    unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err \
    -vfile:info --mlockall --no-chdir \
    --log-file=/var/log/openvswitch/ovs-vswitchd.log \
    --pidfile=/var/run/openvswitch/ovs-vswitchd.pid)

allows the system to come up (notice the lack of --detach in the above
command).  If this is an acceptable workaround for now, please go with
that.  If you need something like systemd integration and all (for
instance because this is for a customer), let me know.  In the meantime
I will work with upstream on a solution that we can include in RHEL.

Comment 7 Saravanan KR 2017-06-30 09:05:49 UTC
We can't manually start vswitchd during the deployment, and without it the deployment fails. We tried removing the "--detach" option in /usr/share/openvswitch/scripts/ovs-lib, but it didn't help. Is there any other alternative that works with the deployment (such as a file modification)?

Comment 8 Aaron Conole 2017-07-05 18:57:53 UTC
Created attachment 1294704
private ovs build

Comment 9 Aaron Conole 2017-07-05 18:58:26 UTC
Attached a private OVS build with a possible remedy.  Please try the attached and let me know.

Comment 10 Karthik Sundaravel 2017-07-06 13:57:03 UTC
Created attachment 1294973
Sos report for the private build 2.7.0-9.bz1463627.el7fdb

Comment 11 Karthik Sundaravel 2017-07-06 13:58:19 UTC
We see failure in openvswitch service. Attached the sos report

Comment 12 Aaron Conole 2017-07-06 14:22:18 UTC
That SOS report is corrupted.  But it's okay, enough logs are there.

It looks like after reload, no systemctl daemon-reload was executed.  Not sure if that was due to my changes to the specfile, but if so, apologies.  After running systemctl daemon-reload, and systemctl restart openvswitch, I see the following:

[heat-admin@overcloud-computeovsdpdk-0 ~]$ systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2017-07-06 10:18:50 EDT; 21s ago
  Process: 35801 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 35801 (code=exited, status=0/SUCCESS)

Jul 06 10:18:50 overcloud-computeovsdpdk-0 systemd[1]: Starting Open vSwitch...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 systemd[1]: Started Open vSwitch.
Hint: Some lines were ellipsized, use -l to show in full.
[heat-admin@overcloud-computeovsdpdk-0 ~]$ systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2017-07-06 10:18:50 EDT; 1min 11s ago
  Process: 35591 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 35677 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random start $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 35703 (ovs-vswitchd)
   CGroup: /system.slice/ovs-vswitchd.service
           └─35703 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:e...

Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL:   probe ...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL: PCI devi...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL:   probe ...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL: PCI devi...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL:   probe ...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL: PCI devi...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-vswitchd[35703]: EAL:   probe ...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-ctl[35677]: [  OK  ]
Jul 06 10:18:50 overcloud-computeovsdpdk-0 ovs-ctl[35677]: Enabling remote OV...
Jul 06 10:18:50 overcloud-computeovsdpdk-0 systemd[1]: Started Open vSwitch F...
Hint: Some lines were ellipsized, use -l to show in full.
[heat-admin@overcloud-computeovsdpdk-0 ~]$ systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2017-07-06 10:17:33 EDT; 2min 35s ago
  Process: 35615 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd stop (code=exited, status=0/SUCCESS)
  Process: 35638 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd --no-monitor --system-id=random start $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 35667 (ovsdb-server)
   CGroup: /system.slice/ovsdb-server.service
           └─35667 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsys...

Jul 06 10:17:33 overcloud-computeovsdpdk-0 systemd[1]: Starting Open vSwitch ...
Jul 06 10:17:33 overcloud-computeovsdpdk-0 ovs-ctl[35638]: Starting ovsdb-ser...
Jul 06 10:17:33 overcloud-computeovsdpdk-0 ovs-vsctl[35668]: ovs|00001|vsctl|...
Jul 06 10:17:33 overcloud-computeovsdpdk-0 ovs-vsctl[35674]: ovs|00001|vsctl|...
Jul 06 10:17:33 overcloud-computeovsdpdk-0 ovs-ctl[35638]: Configuring Open v...
Jul 06 10:17:33 overcloud-computeovsdpdk-0 systemd[1]: Started Open vSwitch D...
Hint: Some lines were ellipsized, use -l to show in full.

Should be all set to go, now?
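
The manual check above (three `systemctl status` calls after `systemctl daemon-reload` and `systemctl restart openvswitch`) can be condensed into a single pass/fail check. This is a sketch only: the function name `ovs_units_active` and the `SYSTEMCTL` override variable are hypothetical, and it relies on `systemctl is-active --quiet`, which exits with status 0 when a unit is active (including "active (exited)" as shown for openvswitch.service).

```shell
#!/bin/sh
# SYSTEMCTL can be overridden to point at a different binary (hypothetical knob).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# ovs_units_active: report the state of the three units shown above and
# return non-zero if any of them is not active.
ovs_units_active() {
    rc=0
    for unit in openvswitch ovsdb-server ovs-vswitchd; do
        if "$SYSTEMCTL" is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            rc=1
        fi
    done
    return "$rc"
}
```

A typical sequence would be `systemctl daemon-reload && systemctl restart openvswitch && ovs_units_active`, run as root on the compute node.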

Comment 13 Saravanan KR 2017-07-17 07:40:02 UTC
We found that the ovs-ctl script file for the permission workaround was patched incorrectly: the patch did not account for OvS 2.7. After fixing it, we were able to enable DPDK successfully.

Posted the review for THT
https://review.openstack.org/#/c/478163/

Comment 20 errata-xmlrpc 2017-12-13 21:33:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

