Bug 1572797 - ovs-vswitchd dirties entire stack of all revalidator/handler pthreads immediately
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 19.B
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: FDP 19.D
Assignee: Flavio Leitner
QA Contact: qding
URL:
Whiteboard:
Duplicates: 1572619
Depends On:
Blocks:
 
Reported: 2018-04-27 21:41 UTC by Dan Williams
Modified: 2019-09-10 19:09 UTC

Fixed In Version: openvswitch-2.9.0-110.el7fdp
Doc Text:
Clone Of:
: 1720315 (view as bug list)
Environment:
Last Closed: 2019-07-10 12:57:13 UTC
Target Upstream Version:




Links
System                  ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:1748  None      None    None     2019-07-10 12:57:20 UTC

Description Dan Williams 2018-04-27 21:41:21 UTC
openvswitch-2.9.0-15.el7fdp.x86_64
(and reproducible with 2.8.1 and 2.9.0 on Fedora 27)

On RHEL 7.5 (40-core) and F27 (8-core) machines, just running ovs-vswitchd creates N+1 pthreads, where N is the number of CPU cores.  Each thread appears to get a stack of RLIMIT_STACK (ulimit -s) size.  All fine.

But the vswitchd appears to dirty the entire stack of each pthread immediately, resulting in very high RSS usage for the process.  Adjusting ulimit -s before running the vswitchd changes the RSS usage accordingly, indicating that the RSS segments are indeed pthread stacks.

Setting n-handler-threads and n-revalidator-threads in the OVSDB does change the number of threads (but not the RSS usage) and is a workaround.

Comment 2 Dan Williams 2018-04-27 21:42:59 UTC
For example, from smaps:

7f994efff000-7f994f7ff000 rw-p 00000000 00:00 0 
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                8192 kB
Pss:                8192 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      8192 kB
Referenced:         8192 kB
Anonymous:          8192 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:             8192 kB

On an 8-core system there will be 9 of these blocks.

Comment 3 Dan Williams 2018-04-28 01:43:20 UTC
*** Bug 1572619 has been marked as a duplicate of this bug. ***

Comment 4 Dan Winship 2018-09-28 20:12:50 UTC
Just to clarify for priority/severity purposes: this means that you can't even start OVS on a machine with a large number of cores and a non-ludicrous amount of RAM, because it will immediately get oom-killed (bug 1571379).

Comment 5 Flavio Leitner 2018-11-06 12:44:23 UTC
Compiled upstream OVS 
commit af26093ab19 ("connmgr: Improve interface for setting controllers.")
and this is the smaps without DPDK:

# cat /proc/31807/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
380kB
216kB
136kB
132kB
120kB
44kB
28kB
24kB
20kB
20kB
16kB
16kB
16kB
16kB
12kB
8kB

This is on netdev33, 24 CPUs.

Comment 6 Flavio Leitner 2018-11-06 13:35:53 UTC
Now using the version in comment#0:
# rpm -q openvswitch 
openvswitch-2.9.0-15.el7fdp.x86_64

cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
3360kB
268kB
256kB
132kB
132kB
116kB
112kB
48kB
36kB
28kB
24kB
20kB
16kB
16kB
16kB
16kB
16kB
16kB

Comment 7 Flavio Leitner 2018-11-06 13:38:58 UTC
However, once I create a bridge:
# ovs-vsctl add-br ovsbr0
# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB

I will try again with upstream code without DPDK.
fbl

Comment 8 Flavio Leitner 2018-11-06 13:47:30 UTC
# ovs-vsctl show
6d6269b2-3768-4c35-bf77-8c4d0c97f166
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.10.90"

Upstream without DPDK:

# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r |head -n 24
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB

Ok, I can reproduce the issue. Going to dig deeper to find out more details.
fbl

Comment 9 Flavio Leitner 2018-11-06 18:10:13 UTC
The memory consumption is a result of a combination of two things:

1) As mentioned in comment#0, pthread creates threads with a default stack size of RLIMIT_STACK.

       PTHREAD_CREATE(3), Notes:
[...]
       On Linux/x86-32, the default stack size for a new thread is 2 megabytes.  Under the NPTL
       threading implementation, if the RLIMIT_STACK soft resource limit at the time the program
       started has any value other than "unlimited", then it determines the default stack size of
       new threads.  Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly
       set in the attr argument used to create a thread, in order to obtain a stack size
       other than the default.

Our default is 8MB, so that's where the number comes from.

2) OVS uses mlockall:

Link to the code:
https://github.com/openvswitch/ovs/blob/c814545b43acc760a85d165c2c5676f06deccde1/vswitchd/ovs-vswitchd.c#L95

Relevant piece of code:
    if (want_mlockall) {
#ifdef HAVE_MLOCKALL
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
            VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
        }
#else

Documentation:
mlockall(2)

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the calling process.  This
       includes the pages of the code, data and stack segment, as well as shared libraries, user
       space kernel data, shared memory, and memory-mapped files.  All mapped pages are guaranteed
       to be resident in RAM when the call returns successfully; the pages are guaranteed to
       stay in RAM until later unlocked.
[...]

Note that it also mentions the stack segment.
In the Notes we have:
       Real-time processes that are using mlockall() to prevent  delays  on  page  faults  should
       reserve  enough  locked  stack pages before entering the time-critical section, so that no
       page fault can be caused by function calls.  This can be achieved by  calling  a  function
       that allocates a sufficiently large automatic variable (an array) and writes to the memory
       occupied by this array in order to touch these stack pages.  This way, enough  pages  will
       be mapped for the stack and can be locked into RAM.  The dummy writes ensure that not even
       copy-on-write page faults can occur in the critical section.


So (#1) explains why each stack is 8MB, and (#2) explains why the stacks are dirty/allocated right from the start.

fbl

Comment 10 Flavio Leitner 2018-11-06 19:08:13 UTC
Hi,

I am not sure how to proceed here.  OVS does have logic to set the minimum stack size required, which is much less than what the system provides. However, I am not sure we should enforce it as a limit, since the system allows for more.

https://github.com/openvswitch/ovs/blob/c814545b43acc760a85d165c2c5676f06deccde1/lib/ovs-thread.c#L358

static void
set_min_stack_size(pthread_attr_t *attr, size_t min_stacksize)
{
    size_t stacksize;
    int error;

    error = pthread_attr_getstacksize(attr, &stacksize);
    if (error) {
        ovs_abort(error, "pthread_attr_getstacksize failed");
    }

    if (stacksize < min_stacksize) {
        error = pthread_attr_setstacksize(attr, min_stacksize);
        if (error) {
            ovs_abort(error, "pthread_attr_setstacksize failed");
        }
    }
}

That is called as set_min_stack_size(&attr, 512 * 1024) in ovs_thread_create().

Then we go to mlockall(). That call is needed to make sure OVS doesn't get pushed out to swap if the system becomes overloaded. Here is the commit log:
commit 86a06318bdfbea056b04eb78bcdea5672d0b200e
Author: Ben Pfaff <blp@nicira.com>
Date:   Mon Nov 30 13:17:34 2009 -0800

    ovs-vswitchd: Add --mlockall option and enable on XenServer.
    
    On XenServer 5.5 we found that running 4 simultaneous vm-import operations
    on iSCSI caused so much disk and cache activity that (we suspect) parts of
    ovs-vswitchd were paged out to disk and were not paged back in for over
    10 seconds, causing the XenServer to fall off the network and the XenCenter
    connection to fail.
    
    Locking ovs-vswitchd into memory appears to avoid this problem.  Henrik
    reports that, with memory locking, importing 11 VMs simultaneously
    completed successfully.


Then there is the number of threads.  The assumption is that more CPUs means more workload, and therefore more threads are needed. How many? Good question, it depends on the use case.

fbl

Comment 11 Flavio Leitner 2018-11-06 19:50:18 UTC
Hi,

http://man7.org/linux/man-pages/man3/pthread_create.3.html
       Under the NPTL threading implementation, if the RLIMIT_STACK soft
       resource limit at the time the program started has any value other
       than "unlimited", then it determines the default stack size of new
       threads.  Using pthread_attr_setstacksize(3), the stack size
       attribute can be explicitly set in the attr argument used to create a
       thread, in order to obtain a stack size other than the default.  If
       the RLIMIT_STACK resource limit is set to "unlimited", a per-
       architecture value is used for the stack size.  Here is the value for
       a few architectures:

              ┌─────────────┬────────────────────┐
              │Architecture │ Default stack size │
              ├─────────────┼────────────────────┤
              │i386         │               2 MB │
              ├─────────────┼────────────────────┤
              │IA-64        │              32 MB │
              ├─────────────┼────────────────────┤
              │PowerPC      │               4 MB │
              ├─────────────┼────────────────────┤
              │S/390        │               2 MB │
              ├─────────────┼────────────────────┤
              │Sparc-32     │               2 MB │
              ├─────────────┼────────────────────┤
              │Sparc-64     │               4 MB │
              ├─────────────┼────────────────────┤
              │x86_64       │               2 MB │
              └─────────────┴────────────────────┘


Maybe we can set the default in OVS service to be 2MB?


SYSTEMD.EXEC(5)
[...]
PROCESS PROPERTIES
       LimitCPU=, LimitFSIZE=, LimitDATA=, LimitSTACK=, LimitCORE=, LimitRSS=, LimitNOFILE=, LimitAS=, LimitNPROC=, LimitMEMLOCK=, LimitLOCKS=, LimitSIGPENDING=, LimitMSGQUEUE=,
       LimitNICE=, LimitRTPRIO=, LimitRTTIME=
           Set soft and hard limits on various resources for executed processes. See setrlimit(2) for details on the resource limit concept.

[...]
Table 1. Resource limit directives, their equivalent ulimit shell commands and the unit used
[...]
           ┌─────────────────┬───────────────────┬────────────────────────────┐
           │Directive        │ ulimit equivalent │ Unit                       │
           ├─────────────────┼───────────────────┼────────────────────────────┤
           │LimitSTACK=      │ ulimit -s         │ Bytes                      │



So we would need to patch the ovs-vswitchd.service. E.g.:

# diff -u /usr/lib/systemd/system/ovs-vswitchd.service /etc/systemd/system/ovs-vswitchd.service 
--- /usr/lib/systemd/system/ovs-vswitchd.service        2018-03-27 05:20:24.000000000 -0400
+++ /etc/systemd/system/ovs-vswitchd.service    2018-11-06 14:35:55.669023084 -0500
@@ -13,6 +13,7 @@
 Environment=HOME=/var/run/openvswitch
 EnvironmentFile=/etc/openvswitch/default.conf
 EnvironmentFile=-/etc/sysconfig/openvswitch
+LimitSTACK=2M
 ExecStartPre=-/usr/bin/chown :hugetlbfs /dev/hugepages
 ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \

If reducing to 2M or even 1M is acceptable, I will run a PVP test with and without DPDK and see how much memory we would normally use.

Thanks,
fbl

Comment 18 Flavio Leitner 2019-02-28 16:29:45 UTC
Patch posted upstream:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-February/356850.html

Comment 22 qding 2019-06-18 06:09:51 UTC
[root@dell-per730-04 rpms]# rpm -q openvswitch
openvswitch-2.9.0-15.el7fdp.x86_64
[root@dell-per730-04 rpms]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               1481.860
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4599.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
[root@dell-per730-04 rpms]# 
[root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
3360kB
268kB
132kB
132kB
132kB
116kB
112kB
56kB
52kB
44kB
32kB
28kB
24kB
20kB
16kB
16kB
16kB
16kB
12kB
8kB
8kB
8kB
8kB
8kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
[root@dell-per730-04 rpms]# ovs-vsctl add-br ovsbr0
[root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
8192kB
3360kB
268kB
264kB
140kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
132kB
116kB
112kB
56kB
52kB
44kB
32kB
28kB
24kB
20kB
16kB
16kB
16kB
16kB
12kB
8kB
8kB
8kB
8kB
8kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
4kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
0kB
[root@dell-per730-04 rpms]# 




[root@dell-per730-04 rpms]# rpm -q openvswitch
openvswitch-2.9.0-110.el7fdp.x86_64
[root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24
3420kB
284kB
132kB
132kB
132kB
116kB
112kB
56kB
52kB
44kB
32kB
28kB
24kB
20kB
16kB
16kB
16kB
16kB
12kB
8kB
8kB
8kB
8kB
8kB
[root@dell-per730-04 rpms]# ovs-vsctl add-br ovsbr0
[root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24
3420kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
2048kB
[root@dell-per730-04 rpms]#

Comment 24 errata-xmlrpc 2019-07-10 12:57:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1748

