Description of problem:

Since RHOSP10, a compute node can be configured to dedicate CPUs to the host and dedicate CPUs to the VMs/vSwitch. This is not specific to OVS-DPDK deployments; for instance, when only using SR-IOV and kernel OVS, such partitioning also makes sense. Example, dedicating the first physical core of each NUMA node of a compute node to the host:

HostCpusList: "'0,18,36,54'"

All OpenStack and hypervisor services will run on the HostCpusList, i.e. 4 CPUs in the example above. The problem is that ovs-vswitchd is not aware of such partitioning and spawns many userland threads, as it believes it can run on all of the CPUs (72 in my example):

10574 ?  -  9864:01 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Threads of that process (per-thread state and CPU-time columns trimmed):
  main/utility: ovs-vswitchd (x6), dpdk_watchdog1, urcu2, ct_clean3
  PMD:          pmd220, pmd221, pmd222, pmd223
  handlers:     handler728 .. handler781 (53 threads)
  revalidators: revalidator779, revalidator782 .. revalidator799 (19 threads)

Those threads are revalidators and handlers, and are a consequence of the OVS design (see https://www.youtube.com/watch?v=wUJupgOAIgY if you're curious).
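A quick way to see this on a node (a minimal sketch, assuming the default pidfile location shown in the command line above) is to compare the CPU affinity of ovs-vswitchd with the number of online CPUs:

  # how many CPUs ovs-vswitchd is allowed to run on vs. how many are online
  taskset -pc $(cat /var/run/openvswitch/ovs-vswitchd.pid)
  nproc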
Having more threads than available CPUs to run them on makes no sense; this has been confirmed by OVS SMEs. So when we dedicate 4 CPUs to the host, we should have exactly 4 handlers and 4 revalidators:

  ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=4
  ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4

Version-Release number of selected component (if applicable): since RHOSP10

How reproducible: just deploy with HostCpusList

Additional info, criticality: why do we care? Why is it important? When debugging a live production system during a SEV1, or simply investigating an sosreport, all of those unnecessary threads make the investigation harder. They are also completely useless, and if they all want to run at the same time, we will face a snowball effect. Because we have no time to trigger such corner-case testing proactively, not starting unnecessary threads can only be beneficial. So this is a bug, not an RFE. This is not a regression: it has always been a pending risk.
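To verify the effect of those two settings on a running node, one can count the handler and revalidator threads of ovs-vswitchd (a sketch, again assuming the default pidfile location):

  PID=$(cat /var/run/openvswitch/ovs-vswitchd.pid)
  # thread names are exposed as the per-thread command name
  ps -T -p "$PID" -o comm= | grep -c '^handler'
  ps -T -p "$PID" -o comm= | grep -c '^revalidator'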
The number of handler threads is the number of online CPUs minus the number of revalidator threads. The number of revalidator threads is the number of online CPUs divided by 4, plus 1. So it should not overcommit in the way comment #0 implies.

Regarding limiting the number of threads: if they don't have anything to do, the kernel will not schedule them. The old vswitchd had a duplicate-events issue where more threads could wake up, but even then they should quickly go back to sleep. There is a patch being pushed upstream relying on the epoll exclusive flag to wake up only a single thread.

Anyway, I don't know what this bug is requesting. Do you want OVS to start small and spawn more threads if needed? Do you want a fixed upper limit for the threads, i.e. with more than 16 CPUs, just assume 16? Do you need a CPU mask parameter to tell exactly how many threads should be created and on which CPUs? Please clarify.

Thanks
fbl
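As an illustration with the 72-CPU compute node from comment #0:

  revalidators = 72 / 4 + 1 = 19
  handlers     = 72 - 19    = 53

which matches the 19 revalidator and 53 handler threads in the process listing there.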
Thanks Flavio,

The context of this BZ is that in OpenStack NFV deployments, most of the CPUs are isolated and not available to ovs-vswitchd, as they are dedicated to running vCPUs and PMD threads. The number of revalidator and handler threads is calculated from the total number of CPUs, regardless of whether they are isolated or not. Most of the time there will be only 4 non-isolated CPUs (one core, 2 HT, per NUMA node), while the total number of CPUs is 72 or even more. So my proposal is to configure the number of revalidator and handler threads based on the non-isolated CPU list, which is known by the OpenStack installer and can therefore be configured by it.

Thanks!
Franck
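As a rough sketch of what the installer could do (illustrative only; HOST_CPUS is a placeholder for the HostCpusList value, assumed here to be a plain comma-separated list as in comment #0):

  HOST_CPUS="0,18,36,54"
  N=$(echo "$HOST_CPUS" | tr ',' '\n' | wc -l)
  ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads="$N"
  ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads="$N"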
(In reply to Franck Baudin from comment #3)

My concern is that the revalidator/handler workload depends on variables we can't predict (the traffic pattern and the flow table), so maybe that number of CPUs is enough, maybe not. However, if the goal is resource isolation, then I think this is on the right track.

OVS exposes the parameters to configure the number of threads, but not the CPU mask. Perhaps we could add a CPU mask to indicate which CPUs are allowed to run the threads, while the existing parameters define how many threads there are. Please let me know how you want to proceed.

fbl
Thanks Flavio! So we will calculate the revalidator/handler thread counts at the TripleO level, based on the number of non-isolated CPUs. Adding a parameter to OVS is not required at this point. Thanks again!

Franck
According to our records, this should be resolved by openstack-tripleo-heat-templates-8.3.1-87.el7ost. This build is available now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0760