openvswitch-2.9.0-15.el7fdp.x86_64 (and reproducible with 2.8.1 and 2.9.0 on Fedora 27)

On RHEL 7.5 (40-core) and F27 (8-core) machines, just running ovs-vswitchd creates N+1 pthreads, where N is the number of CPU cores. Each of those threads appears to be allocated a stack of the ulimit RLIMIT_STACK size. All fine. But the vswitchd appears to dirty the entire stack of each pthread immediately, resulting in very high RSS usage for the process. Adjusting ulimit -s before running the vswitchd changes the RSS usage accordingly, indicating that the RSS segments are indeed pthread stacks. Setting n-handler-threads and n-revalidator-threads in the OVSDB does change the number of threads (though not the per-thread stack RSS) and is a workaround.
For example, from smaps:

7f994efff000-7f994f7ff000 rw-p 00000000 00:00 0
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                8192 kB
Pss:                8192 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      8192 kB
Referenced:         8192 kB
Anonymous:          8192 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:             8192 kB

On an 8-core system there will be 9 of these blocks.
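For a sense of scale (my arithmetic, assuming the default 8 MB RLIMIT_STACK): the 8-core machine ends up with 9 x 8 MB = 72 MB of dirty, locked stack, and a 40-core machine with 41 x 8 MB = 328 MB, for an otherwise idle ovs-vswitchd.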
*** Bug 1572619 has been marked as a duplicate of this bug. ***
Just to clarify for priority/severity purposes: this means that you can't even start OVS on a machine with a large number of cores and a non-ludicrous amount of RAM, because it will immediately get oom-killed (bug 1571379).
Compiled upstream OVS commit af26093ab19 ("connmgr: Improve interface for setting controllers.") and this is the smaps without DPDK:

# cat /proc/31807/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
380kB 216kB 136kB 132kB 120kB 44kB 28kB 24kB 20kB 20kB 16kB 16kB 16kB 16kB 12kB 8kB

This is on netdev33, 24 CPUs.
Now using the version in comment#0:

# rpm -q openvswitch
openvswitch-2.9.0-15.el7fdp.x86_64

cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r
3360kB 268kB 256kB 132kB 132kB 116kB 112kB 48kB 36kB 28kB 24kB 20kB 16kB 16kB 16kB 16kB 16kB 16kB
However, once I create a bridge:

# ovs-vsctl add-br ovsbr0
# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24
8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB

I will try again with upstream code without DPDK.

fbl
# ovs-vsctl show
6d6269b2-3768-4c35-bf77-8c4d0c97f166
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.10.90"

Upstream without DPDK:

# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24
8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB

Ok, I can reproduce the issue. Going to dig deeper to find out more details.

fbl
The memory consumption is a result of a combination of two things:

1) As mentioned in comment#0, pthread creates threads with a default stack size of RLIMIT_STACK.

PTHREAD_CREATE(3), Notes:
[...]
       On Linux/x86-32, the default stack size for a new thread is 2
       megabytes.  Under the NPTL threading implementation, if the
       RLIMIT_STACK soft resource limit at the time the program started
       has any value other than "unlimited", then it determines the
       default stack size of new threads.  Using
       pthread_attr_setstacksize(3), the stack size attribute can be
       explicitly set in the attr argument used to create a thread, in
       order to obtain a stack size other than the default.

Our default is 8MB, so that's where the number comes from.

2) OVS uses mlockall:

Link to the code:
https://github.com/openvswitch/ovs/blob/c814545b43acc760a85d165c2c5676f06deccde1/vswitchd/ovs-vswitchd.c#L95

Relevant piece of code:

    if (want_mlockall) {
#ifdef HAVE_MLOCKALL
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
            VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
        }
#else

Documentation: mlockall(2)

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and
       stack segment, as well as shared libraries, user space kernel
       data, shared memory, and memory-mapped files.  All mapped pages
       are guaranteed to be resident in RAM when the call returns
       successfully; the pages are guaranteed to stay in RAM until later
       unlocked.
[...]

Note that it mentions the stack segment as well. In the Notes we have:

       Real-time processes that are using mlockall() to prevent delays
       on page faults should reserve enough locked stack pages before
       entering the time-critical section, so that no page fault can be
       caused by function calls.  This can be achieved by calling a
       function that allocates a sufficiently large automatic variable
       (an array) and writes to the memory occupied by this array in
       order to touch these stack pages.  This way, enough pages will be
       mapped for the stack and can be locked into RAM.  The dummy
       writes ensure that not even copy-on-write page faults can occur
       in the critical section.

So, (#1) explains why 8MB, and (#2) explains why the pages are dirty/allocated right from the start.

fbl
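To make the interaction concrete, here is a minimal standalone reproducer (my own sketch, not OVS code): with mlockall(MCL_CURRENT | MCL_FUTURE) in effect, the full default-sized stack of every thread created afterwards is faulted in and locked immediately, so RSS grows by roughly RLIMIT_STACK per thread. Build with gcc -pthread (locking normally needs CAP_IPC_LOCK or ulimit -l unlimited) and compare Rss/Locked in /proc/<pid>/smaps with and without the mlockall() call:

/* Standalone sketch (not OVS code): show that mlockall(MCL_FUTURE) makes
 * each new thread's whole default stack resident and locked up front. */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
worker(void *arg)
{
    (void) arg;
    pause();                    /* Park the thread; its stack is already locked. */
    return NULL;
}

int
main(void)
{
    /* Lock current and all future mappings, as ovs-vswitchd --mlockall does. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        perror("mlockall");     /* Typically needs CAP_IPC_LOCK or ulimit -l unlimited. */
        return 1;
    }

    for (int i = 0; i < 8; i++) {   /* 8 threads * 8 MB default stack ~ 64 MB of RSS. */
        pthread_t tid;
        if (pthread_create(&tid, NULL, worker, NULL)) {
            perror("pthread_create");
            return 1;
        }
    }

    printf("now check: grep -E 'Rss|Locked' /proc/%d/smaps\n", (int) getpid());
    pause();
    return 0;
}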
Hi,

I am not sure how to proceed here. OVS does have logic to set the minimum stack size required, which is much less than what the system provides. However, I am not sure if we should enforce it, since the system allows for more.

https://github.com/openvswitch/ovs/blob/c814545b43acc760a85d165c2c5676f06deccde1/lib/ovs-thread.c#L358

static void
set_min_stack_size(pthread_attr_t *attr, size_t min_stacksize)
{
    size_t stacksize;
    int error;

    error = pthread_attr_getstacksize(attr, &stacksize);
    if (error) {
        ovs_abort(error, "pthread_attr_getstacksize failed");
    }

    if (stacksize < min_stacksize) {
        error = pthread_attr_setstacksize(attr, min_stacksize);
        if (error) {
            ovs_abort(error, "pthread_attr_setstacksize failed");
        }
    }
}

That is called as set_min_stack_size(&attr, 512 * 1024) in ovs_thread_create().

Then we get to mlockall(). That call is needed to make sure OVS doesn't get pushed to swap if the system becomes overloaded. Here is the commit log:

commit 86a06318bdfbea056b04eb78bcdea5672d0b200e
Author: Ben Pfaff <blp>
Date:   Mon Nov 30 13:17:34 2009 -0800

    ovs-vswitchd: Add --mlockall option and enable on XenServer.

    On XenServer 5.5 we found that running 4 simultaneous vm-import
    operations on iSCSI caused so much disk and cache activity that (we
    suspect) parts of ovs-vswitchd were paged out to disk and were not
    paged back in for over 10 seconds, causing the XenServer to fall off
    the network and the XenCenter connection to fail.

    Locking ovs-vswitchd into memory appears to avoid this problem.
    Henrik reports that, with memory locking, importing 11 VMs
    simultaneously completed successfully.

Then there is the number of threads. The assumption is that with more CPUs there is more workload, and therefore more threads are necessary. How many? Good question; it depends on the use case.

fbl
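Just to illustrate the option being weighed here: if we did want to cap the stack from inside OVS rather than via the service file, it could be the mirror image of the existing helper. This is purely a sketch; no such function exists in the tree today, and the 2 MB figure is only an example:

/* Hypothetical helper (not in OVS): clamp the thread stack to a maximum,
 * the mirror image of set_min_stack_size() quoted above.  Assumes the same
 * includes and ovs_abort() as lib/ovs-thread.c. */
static void
set_max_stack_size(pthread_attr_t *attr, size_t max_stacksize)
{
    size_t stacksize;
    int error;

    error = pthread_attr_getstacksize(attr, &stacksize);
    if (error) {
        ovs_abort(error, "pthread_attr_getstacksize failed");
    }

    if (stacksize > max_stacksize) {
        error = pthread_attr_setstacksize(attr, max_stacksize);
        if (error) {
            ovs_abort(error, "pthread_attr_setstacksize failed");
        }
    }
}

/* e.g. in ovs_thread_create(), after set_min_stack_size(&attr, 512 * 1024):
 *     set_max_stack_size(&attr, 2 * 1024 * 1024);
 */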
Hi,

http://man7.org/linux/man-pages/man3/pthread_create.3.html

       Under the NPTL threading implementation, if the RLIMIT_STACK soft
       resource limit at the time the program started has any value
       other than "unlimited", then it determines the default stack size
       of new threads.  Using pthread_attr_setstacksize(3), the stack
       size attribute can be explicitly set in the attr argument used to
       create a thread, in order to obtain a stack size other than the
       default.  If the RLIMIT_STACK resource limit is set to
       "unlimited", a per-architecture value is used for the stack size.
       Here is the value for a few architectures:

       ┌─────────────┬────────────────────┐
       │Architecture │ Default stack size │
       ├─────────────┼────────────────────┤
       │i386         │ 2 MB               │
       ├─────────────┼────────────────────┤
       │IA-64        │ 32 MB              │
       ├─────────────┼────────────────────┤
       │PowerPC      │ 4 MB               │
       ├─────────────┼────────────────────┤
       │S/390        │ 2 MB               │
       ├─────────────┼────────────────────┤
       │Sparc-32     │ 2 MB               │
       ├─────────────┼────────────────────┤
       │Sparc-64     │ 4 MB               │
       ├─────────────┼────────────────────┤
       │x86_64       │ 2 MB               │
       └─────────────┴────────────────────┘

Maybe we can set the default in the OVS service to be 2MB?

SYSTEMD.EXEC(5)
[...]
PROCESS PROPERTIES
       LimitCPU=, LimitFSIZE=, LimitDATA=, LimitSTACK=, LimitCORE=,
       LimitRSS=, LimitNOFILE=, LimitAS=, LimitNPROC=, LimitMEMLOCK=,
       LimitLOCKS=, LimitSIGPENDING=, LimitMSGQUEUE=, LimitNICE=,
       LimitRTPRIO=, LimitRTTIME=
           Set soft and hard limits on various resources for executed
           processes. See setrlimit(2) for details on the resource limit
           concept.
[...]
       Table 1. Resource limit directives, their equivalent ulimit shell
       commands and the unit used
[...]
       ┌─────────────────┬───────────────────┬────────────────────────────┐
       │Directive        │ ulimit equivalent │ Unit                       │
       ├─────────────────┼───────────────────┼────────────────────────────┤
       │LimitSTACK=      │ ulimit -s         │ Bytes                      │

So we would need to patch the ovs-vswitchd.service. E.g.:

# diff -u /usr/lib/systemd/system/ovs-vswitchd.service /etc/systemd/system/ovs-vswitchd.service
--- /usr/lib/systemd/system/ovs-vswitchd.service        2018-03-27 05:20:24.000000000 -0400
+++ /etc/systemd/system/ovs-vswitchd.service     2018-11-06 14:35:55.669023084 -0500
@@ -13,6 +13,7 @@
 Environment=HOME=/var/run/openvswitch
 EnvironmentFile=/etc/openvswitch/default.conf
 EnvironmentFile=-/etc/sysconfig/openvswitch
+LimitSTACK=2M
 ExecStartPre=-/usr/bin/chown :hugetlbfs /dev/hugepages
 ExecStartPre=-/usr/bin/chmod 0775 /dev/hugepages
 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl \

If reducing to 2M or even 1M is acceptable, I will run a PVP test with and without DPDK and see how much memory we would normally use.

Thanks,
fbl
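As a side note, instead of copying the whole unit into /etc/systemd/system, the same override could be done with a standard systemd drop-in (the file name below is just an example), followed by systemctl daemon-reload and a restart of ovs-vswitchd:

# /etc/systemd/system/ovs-vswitchd.service.d/stack.conf  (example name)
[Service]
LimitSTACK=2M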
Patch posted upstream: https://mail.openvswitch.org/pipermail/ovs-dev/2019-February/356850.html
[root@dell-per730-04 rpms]# rpm -q openvswitch openvswitch-2.9.0-15.el7fdp.x86_64 [root@dell-per730-04 rpms]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Thread(s) per core: 2 Core(s) per socket: 12 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 63 Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz Stepping: 2 CPU MHz: 1481.860 CPU max MHz: 3100.0000 CPU min MHz: 1200.0000 BogoMIPS: 4599.62 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 30720K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d [root@dell-per730-04 rpms]# [root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r 3360kB 268kB 132kB 132kB 132kB 116kB 112kB 56kB 52kB 44kB 32kB 28kB 24kB 20kB 16kB 16kB 16kB 16kB 12kB 8kB 8kB 8kB 8kB 8kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB [root@dell-per730-04 rpms]# ovs-vsctl add-br ovsbr0 [root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 8192kB 3360kB 268kB 264kB 140kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 132kB 116kB 112kB 56kB 52kB 44kB 32kB 28kB 24kB 20kB 16kB 16kB 16kB 16kB 12kB 8kB 8kB 8kB 8kB 8kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 
0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB 0kB [root@dell-per730-04 rpms]# [root@dell-per730-04 rpms]# rpm -q openvswitch openvswitch-2.9.0-110.el7fdp.x86_64 [root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24 3420kB 284kB 132kB 132kB 132kB 116kB 112kB 56kB 52kB 44kB 32kB 28kB 24kB 20kB 16kB 16kB 16kB 16kB 12kB 8kB 8kB 8kB 8kB 8kB [root@dell-per730-04 rpms]# ovs-vsctl add-br ovsbr0 [root@dell-per730-04 rpms]# cat /proc/$(pidof ovs-vswitchd)/smaps | grep Private_Dirty | awk '{ print $2$3 }' | sort -h -r | head -n 24 3420kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB 2048kB [root@dell-per730-04 rpms]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1748