Red Hat Bugzilla – Bug 1394402
[Doc] provide steps to achieve zero-loss networking with DPDK, openvswitch, vhostuser, and testpmd
Last modified: 2017-11-21 14:34:27 EST
Description of problem:

Achieving zero packet loss for NFV workloads on OpenStack with the current
guidelines may not always be possible. This BZ documents further steps to
achieve zero loss; it is not intended to track enhancements or fixes.

Version-Release number of selected component (if applicable):

OSP-10
openvswitch-2.5.0-*.git20160727
RHEL 7.3

Current guidelines make use of the cpu-partitioning tuned profile for an OSP
compute node and the VM running on that compute node, which does the following:

1) Uses the boot options nohz_full and rcu_nocbs, specifying a list of cpus
   which are reserved for openvswitch and the VM vcpus.
2) Uses systemd's CPUAffinity option to launch all user processes within a
   subset of the online cpus; this list is the inverse of the cpulist used
   for nohz_full and rcu_nocbs. This prevents user processes from running on
   the cpus reserved for openvswitch and the VM.
3) Uses the IRQBALANCE_BANNED_CPUS= option for irqbalance to avoid sending
   interrupts to the reserved cpus.

Both openvswitch and the VM's vcpus must be configured to use one thread per
cpu, each on a cpu from the reserved cpus list.

All of this is an effort to reduce, as much as possible, any interruptions to
these cpus while running the openvswitch, VM-vcpu, or inside-VM VNF threads.
However, CPUAffinity cannot remove or prevent all kernel threads from running
on these cpus. To keep most kernel threads off them, one must use the boot
option "isolcpus=<cpulist>", with the same cpulist as nohz_full and rcu_nocbs.
Isolcpus is engaged right at kernel boot, and thus can prevent many kernel
threads from being scheduled on those cpus.

To enable isolcpus, first find out which cpus are being used for rcu_nocbs and
nohz_full:

# cat /proc/cmdline

Look for the option, for example:

nohz_full=1,3,5,7,9,11,13

Take that cpulist and add the isolcpus option. One way to do this is with
the grubby utility:

grubby --update-kernel=`grubby --default-kernel` --args="isolcpus=<cpulist>"

This should be done on both the host and the VM. A reboot is required for the
option to take effect.

Note that not all kernel threads are removed from these cpus: kworker,
migration, and ksoftirqd will still be present on every cpu. However, the
kworker and migration threads should not have to run, and ksoftirqd typically
runs once per second for up to 20 microseconds, as long as only one user task
is running on that cpu.
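After the reboot, a quick way to confirm the isolation took effect (cpu 4 here
is just an example taken from the reserved list; the commands are standard
procps/coreutils):

grep -o 'isolcpus=[^ ]*' /proc/cmdline   # the option is now on the kernel command line
ps -eLo pid,psr,comm | awk '$2 == 4'     # threads on isolated cpu 4: expect only the
                                         # per-cpu kernel threads and your pinned task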
After this option is configured and openvswitch and the VM are running, you
can check which threads are scheduled on these cpus by looking at the contents
of /proc/sched_debug (on both the host and the VM).

For each cpu, there will be a section called "runnable tasks" like the
following:

runnable tasks:
            task   PID         tree-key  switches  prio     wait-time      sum-exec     sum-sleep
----------------------------------------------------------------------------------------------------------
    migration/47   244         0.000000       634     0      0.000000    112.216681      0.000000 1 /
    ksoftirqd/47   245     12404.353193        29   120      0.000000      0.241945      0.000000 1 /
    kworker/47:0   246      1072.586238        21   120      0.000000      0.105430      0.000000 1 /
   kworker/47:0H   247      7740.331124        14   100      0.000000      0.096559      0.000000 1 /
    kworker/47:1   625     13234.768505       302   120      0.000000      1.904559      0.000000 1 /
   kworker/47:1H 99955      7752.330393         2   100      0.000000      0.033163      0.000000 1 /

On the host, you may see an openvswitch or a qemu task listed here. On the VM,
you may see a thread for your VNF, such as testpmd. When that thread is
interrupted, its 'switches' count will increment. Monitoring this over time
and confirming that 'switches' does not increase verifies that no other thread
has interrupted it. Note that with non-RT kernels, IRQs can still interrupt
this thread without being reflected here, so it is also important to monitor
/proc/interrupts to check whether any are occurring on this cpu (a concrete
monitoring sketch follows at the end of this comment).

When using isolcpus, the scheduler will not attempt to load-balance tasks on
those cpus. This is usually fine, because there is only one task per cpu and
automatically moving tasks to or from these cpus is not desirable. One problem
can arise, however, when using OpenStack. When starting a VM, the Nova service
sets the list of allowed cpus for the Qemu emulator thread to the same cpulist
that is used for all of the vcpus. For example, if the vcpu pinning is:

vcpu:  host-cpu:
0      4
1      5
2      6
3      7

the Qemu emulator thread will only be able to run on host cpus 4-7. When using
isolcpus, this thread will initially be placed on the first cpu in the list,
in this case cpu4, and with isolcpus present it will never be load-balanced to
a different cpu. Therefore, the thread for vcpu0 and the emulator thread have
to share host cpu4. In some cases this causes significant performance
degradation, especially while booting the VM. To resolve this, the emulator
thread must be migrated to a different host cpu. You can achieve this with the
virsh command on the host:

# virsh emulatorpin <vm-name>

This shows the current range of cpus allowed for the emulator thread. To
change it, use:

# virsh emulatorpin <vm-name> --cpulist <cpulist>

Simply using the cpulist from the CPUAffinity option in systemd should work
here:

# grep CPUAffinity /etc/systemd/system.conf
CPUAffinity=0 1 2 3

Now run the virsh command to update the cpulist for the emulator thread,
making sure to use "," or "-" to describe the list of cpus:

# virsh emulatorpin <vm-name> --cpulist 0,1,2,3

This is not persistent, so any time a VM is started from Nova, the emulator
thread will need to be moved again. However, if you do not experience a
degradation (boot-up time is adequate and you don't use vcpu0 for packet
processing), this step is not necessary.
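As referenced above, a minimal monitoring sketch (testpmd is the example VNF
thread; the $(NF-6) field offset matches the RHEL 7.3 sched_debug layout shown
above and may differ on other kernels):

before=$(awk '/testpmd/ {print $(NF-6); exit}' /proc/sched_debug)  # 'switches' column
cat /proc/interrupts > /tmp/irq.before
sleep 60
after=$(awk '/testpmd/ {print $(NF-6); exit}' /proc/sched_debug)
echo "switches: $before -> $after"     # equal values: no preemption during the interval
diff /tmp/irq.before /proc/interrupts  # look for new counts in the isolated cpu's columns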
(In reply to Andrew Theurer from comment #0)
> [...]
> This is not persistent, so any time a VM is started from Nova, the emulator
> thread will need to be moved again. [...]

Manually changing CPU pinning behind Nova's back is *not* something that is
considered a supported scenario from the OpenStack product POV. It is
invisible to Nova and thus may cause the Nova scheduler to make incorrect
decisions for future guests, and it is liable to be reverted by Nova during
certain operations. IOW, if you do this and it subsequently breaks, you get to
keep both pieces. As such, we should *not* document this as a supported
configuration in the context of OpenStack.
(In reply to Daniel Berrange from comment #3)
> Manually changing CPU pinning behind Nova's back is *not* something that is
> considered a supported scenario from the OpenStack product POV. [...]

Will https://bugzilla.redhat.com/show_bug.cgi?id=1298079 permit configuring
this behavior in nova.conf?
(In reply to Andrew Theurer from comment #0)
> [...]
> Take that cpulist and add the isolcpus option. One way to do this is with
> the grubby utility:
>
> grubby --update-kernel=`grubby --default-kernel` --args="isolcpus=<cpulist>"

For the host, this is done already in the Director first-boot.yaml.

For the guest, a script such as the following (the awk input is /proc/cmdline,
and the inner grubby call must be enclosed in backticks):

isol_cpus=`awk '{ for (i = 1; i <= NF; i++) if ($i ~ /nohz/) print $i };' /proc/cmdline | cut -d"=" -f2`
if [ ! -z "$isol_cpus" ]; then
    grubby --update-kernel=`grubby --default-kernel` --args="isolcpus=$isol_cpus"
fi
> [...]
> Now run the virsh command to update the cpulist for the emulator thread,
> making sure to use "," or "-" to describe the list of cpus:
>
> # virsh emulatorpin <vm-name> --cpulist 0,1,2,3

A script to re-pin the emulator thread of every running VM to the CPUAffinity
cpulist (note the /g flag on the sed substitution, so that every space in the
CPUAffinity value becomes a comma):

#!/bin/bash
cpu_list=`grep -e "^CPUAffinity=.*" /etc/systemd/system.conf | sed -e 's/CPUAffinity=//' -e 's/ /,/g'`
if [ ! -z "$cpu_list" ]; then
    # the second column of 'virsh list' (after its two header lines) is the domain name
    virsh_list=`virsh list | sed -e '1,2d' -e 's/\s\+/ /g' | awk -F" " '{print $2}'`
    if [ ! -z "$virsh_list" ]; then
        for vm in $virsh_list; do
            virsh emulatorpin $vm --cpulist $cpu_list
        done
    fi
fi

> This is not persistent, so any time a VM is started from Nova, the emulator
> thread will need to be moved again. However, if you do not experience a
> degradation (boot-up time is adequate and you don't use vcpu0 for packet
> processing), this step is not necessary.
Regarding comment 3: we do not recommend that anyone re-pin the emulator
thread unless they experience specific performance problems. We realize this
is not desirable. I do believe
https://review.openstack.org/#/c/284094/10/specs/ocata/approved/libvirt-emulator-threads-policy.rst
will completely eliminate any desire to manually pin emulator threads.
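For reference, that spec proposes isolating the emulator thread via a Nova
flavor extra spec; assuming it lands as proposed, usage would look something
like this (the flavor name is a placeholder):

# reserve an additional dedicated host cpu for the emulator thread,
# separate from the pinned vcpus (proposed hw:emulator_threads_policy)
openstack flavor set nfv-flavor --property hw:emulator_threads_policy=isolate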