Bug 1540017 - ovs-vswitchd systemd process times out in environments with large (200G) hugepages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Aaron Conole
QA Contact: Yariv
URL:
Whiteboard:
Depends On:
Blocks: 1540158
Reported: 2018-01-30 05:38 UTC by Jaison Raju
Modified: 2022-07-09 13:25 UTC

Fixed In Version: openvswitch-2.6.1-18.git20180130.el7ost
Doc Text:
Clone Of:
Clones: 1559571
Environment:
Last Closed: 2018-06-27 23:33:21 UTC
Target Upstream Version:
Embargoed:


Links
System                             ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker              NFV-807         0        None      None    None     2021-08-30 12:01:27 UTC
Red Hat Issue Tracker              OSP-4836        0        None      None    None     2022-07-09 13:25:43 UTC
Red Hat Knowledge Base (Solution)  3358421         0        None      None    None     2018-02-20 12:43:48 UTC
Red Hat Product Errata             RHSA-2018:2102  0        None      None    None     2018-06-27 23:35:37 UTC

Description Jaison Raju 2018-01-30 05:38:46 UTC
Description of problem:
When deploying on environments with 200G+ of hugepages, ovs-vswitchd is known to take longer to start than the systemd timeout (1.5 minutes).
With large hugepage allocations such as 400G, ovs-dpdk startup is known to take about 5 minutes.
https://mail.openvswitch.org/pipermail/ovs-git/2017-July/019944.html

Another concern is that the startup time does not depend on how much hugepage memory the DPDK configuration actually uses.
For example, an ovs-dpdk configuration that needs only 4G of hugepages takes the same time to start when 200G of hugepages are allocated on the host.
The problem is that map_all_hugepages() maps all free huge pages and only then selects the ones it needs. If there are 500 free huge pages (1G each) and the application needs only 1G per NUMA socket, mapping all of them is unreasonable.

http://dpdk.org/ml/archives/dev/2017-September/074621.html
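
To make the scaling concrete, here is a rough sketch that reports how much free hugepage memory a host has (which, per the description above, is what gets mapped at startup) and extrapolates a startup time from it. The function names are hypothetical and not part of OVS or DPDK. The estimate assumes startup time grows linearly with free hugepage memory, anchored only on the "400G takes about 5 minutes" figure quoted above, so treat it as an order-of-magnitude guess. Note that /proc/meminfo only covers the default hugepage size.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of OVS or DPDK.

# Print free hugepage memory in whole GB for the default hugepage size.
# Usage: free_hugepage_gb [meminfo-file]   (defaults to /proc/meminfo)
free_hugepage_gb() {
    awk '
        /^HugePages_Free:/ { free_pages = $2 }
        /^Hugepagesize:/   { page_kb = $2 }   # reported in kB
        END { printf "%d\n", free_pages * page_kb / 1024 / 1024 }
    ' "${1:-/proc/meminfo}"
}

# Rough startup estimate in seconds for a given number of free GB.
# ASSUMPTION: linear scaling from the 400G ~ 5 min (300 s) figure in this bug.
estimated_init_seconds() {
    echo $(( $1 * 300 / 400 ))
}
```

Under that assumption, `estimated_init_seconds 200` gives 150 seconds, which is already above the 90-second systemd start timeout.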

Version-Release number of selected component (if applicable):
RHOS

How reproducible:
Always

Comment 1 Jaison Raju 2018-01-30 06:16:48 UTC
I will raise a new bug for map_all_hugepages() mapping all free huge pages.

Comment 3 Jaison Raju 2018-01-30 14:14:45 UTC
# for i in openvswitch.service ovs-vswitchd.service ; do systemctl show $i | grep -i timeout ; done
TimeoutStartUSec=0
TimeoutStopUSec=1min 30s
JobTimeoutUSec=0
JobTimeoutAction=none
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
JobTimeoutUSec=0
JobTimeoutAction=none
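
Going by the loop order, the first four lines of that output belong to openvswitch.service and the last four to ovs-vswitchd.service, so the 1min 30s start timeout is the one on ovs-vswitchd. A small sketch that labels each unit and converts the value to seconds may make this easier to compare against an expected startup time. The function names are hypothetical, and the parser only handles the h, min and s units (an assumption; systemd can print other units as well).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Convert a systemd time span such as "1min 30s" or "5min" to seconds.
# ASSUMPTION: only h/min/s components are summed; anything else counts as 0.
span_to_seconds() {
    total=0
    for part in $1; do
        case "$part" in
            *min)    total=$(( total + ${part%min} * 60 )) ;;
            *ms|*us) ;;   # sub-second components are ignored
            *h)      total=$(( total + ${part%h} * 3600 )) ;;
            *s)      total=$(( total + ${part%s} )) ;;
        esac
    done
    echo "$total"
}

# Print the start timeout of each OVS unit, labeled by unit name.
show_start_timeouts() {
    for unit in openvswitch.service ovs-vswitchd.service; do
        span=$(systemctl show -p TimeoutStartUSec "$unit" | cut -d= -f2)
        echo "$unit: $(span_to_seconds "$span")s"
    done
}
```

Calling `show_start_timeouts` on the affected host prints one labeled line per unit.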

Comment 4 Jaison Raju 2018-01-31 09:23:05 UTC
$ git remote -v
origin	https://github.com/openvswitch/ovs.git (fetch)
$ git tag --contains c1c69e8a45ead25f4309ec3d340c805a10bcae79
v2.8.0
v2.8.1

Comment 19 errata-xmlrpc 2018-06-27 23:33:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2102

Comment 20 Andreas Karis 2019-06-17 13:27:43 UTC
Workaround for older versions of OVS: add the following to firstboot.yaml (see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/advanced_overcloud_customization/index#sect-Customizing_Configuration_on_First_Boot):
~~~
# -p: do not fail if the drop-in directory already exists
mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat <<'EOF' > /etc/systemd/system/ovs-vswitchd.service.d/timeout.conf
[Service]
TimeoutSec=300
EOF
systemctl daemon-reload
~~~


Verification:
~~~
[root@overcloud-compute-0 ~]# systemctl show ovs-vswitchd | grep -i timeout
TimeoutStartUSec=5min
TimeoutStopUSec=5min
DropInPaths=/etc/systemd/system/ovs-vswitchd.service.d/timeout.conf
JobTimeoutUSec=0
JobTimeoutAction=none
~~~
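
As the verification shows, TimeoutSec= raises both the start and the stop timeout. If only the start timeout should change, a variant along these lines might work; this is a sketch, not the documented workaround. The function name and the optional ROOT prefix (there only so the function can be tried outside a real host) are hypothetical, while TimeoutStartSec= is the standard systemd directive for the start timeout alone.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not the documented workaround.
# Usage: write_ovs_timeout_dropin SECONDS [ROOT]
#   ROOT is an optional path prefix (assumption, for dry runs); empty means "/".
write_ovs_timeout_dropin() {
    timeout=$1
    dir="${2:-}/etc/systemd/system/ovs-vswitchd.service.d"
    mkdir -p "$dir"   # safe to re-run: no error if the directory exists
    # Overwrite (not append) so repeated runs leave a single setting.
    cat > "$dir/timeout.conf" <<EOF
[Service]
TimeoutStartSec=$timeout
EOF
}
```

On a real host this would be called as `write_ovs_timeout_dropin 300`, followed by `systemctl daemon-reload` as in the workaround above.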

