Bug 1394537

Summary: PMD core settings on an OVS-DPDK compute node do not survive a compute node reboot.
Product: Red Hat OpenStack Reporter: Maxim Babushkin <mbabushk>
Component: openvswitch-dpdk Assignee: Eelco Chaudron <echaudro>
Status: CLOSED NOTABUG QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 10.0 (Newton) CC: atelang, atheurer, dnavale, echaudro, fbaudin, fleitner, jskarvad, lbopf, mbabushk, nyechiel, oblaut, yrachman
Target Milestone: async Keywords: ZStream
Target Release: 10.0 (Newton)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
After a `tuned` profile is activated, the `tuned` service must start before the `openvswitch` service in order for the cores allocated to the PMD threads to be set correctly. As a workaround, you can change the `tuned` service by running the following script:

#!/bin/bash
tuned_service=/usr/lib/systemd/system/tuned.service
grep -q "network.target" $tuned_service
if [ "$?" -eq 0 ]; then
  sed -i '/After=.*/s/network.target//g' $tuned_service
fi
grep -q "Before=.*network.target" $tuned_service
if [ ! "$?" -eq 0 ]; then
  grep -q "Before=.*" $tuned_service
  if [ "$?" -eq 0 ]; then
    sed -i 's/^\(Before=.*\)/\1 network.target openvswitch.service/g' $tuned_service
  else
    sed -i '/After/i Before=network.target openvswitch.service' $tuned_service
  fi
fi
systemctl daemon-reload
systemctl restart openvswitch
exit 0
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-19 10:51:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1403309    
Bug Blocks:    
Attachments:
Description Flags
tuned_boot_fix.sh none

Description Maxim Babushkin 2016-11-13 08:23:10 UTC
Description of problem:
PMD core settings on an OVS-DPDK compute node do not survive a compute node reboot.

Version-Release number of selected component (if applicable):
RHOS 10
Product version: 10
Product core version: 10
Product build: 2016-11-04.2

openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
tuna-0.13-5.el7.noarch

How reproducible:
Set PMD cores on the compute node's OVS-DPDK instance and reboot the compute node.
Check the cores once the compute node is up and running.

Steps to Reproduce:
1. Install RHOS 10 OVS DPDK environment.
2. Set pmd cores for compute node ovs:
   # ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=40404040
3. Verify the configuration:
   # tuna -t ovs-vswitchd -CP | grep pmd
4. Reboot the compute node and check the configuration again.
   # tuna -t ovs-vswitchd -CP | grep pmd

Actual results:
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  1571   OTHER     0 0xaaaaaaaf        40          106          pmd102  
  1572   OTHER     0 0xaaaaaaaf        40        12036          pmd101  
  1573   OTHER     0 0xaaaaaaaf         0         2293          pmd103  
  1574   OTHER     0 0xaaaaaaaf         1         1678          pmd104

Expected results:
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  37073  OTHER     0        6        93            5          pmd134  
  37074  OTHER     0       14        80           10          pmd133  
  37075  OTHER     0       22        25            5          pmd135  
  37076  OTHER     0       30         0            5          pmd132

Additional info:
Configured the pmd cores on a compute node:
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=40404040

Verified the configuration:
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  37073  OTHER     0        6        93            5          pmd134  
  37074  OTHER     0       14        80           10          pmd133  
  37075  OTHER     0       22        25            5          pmd135  
  37076  OTHER     0       30         0            5          pmd132

Rebooted the compute node.
Once it is up, checked the configuration.
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  1571   OTHER     0 0xaaaaaaaf        40          106          pmd102  
  1572   OTHER     0 0xaaaaaaaf        40        12036          pmd101  
  1573   OTHER     0 0xaaaaaaaf         0         2293          pmd103  
  1574   OTHER     0 0xaaaaaaaf         1         1678          pmd104

Comment 1 Franck Baudin 2016-11-14 12:01:23 UTC
Maxim, did you also update DPDK_OPTIONS in
/etc/sysconfig/openvswitch?

Comment 2 Maxim Babushkin 2016-11-17 15:34:22 UTC
Franck, which options should be updated within the /etc/sysconfig/openvswitch?

Comment 3 Franck Baudin 2016-11-18 17:30:33 UTC
I'd like to read FBL's answer.

Comment 4 Flavio Leitner 2016-11-25 16:08:40 UTC
It should survive.

Could you please check what is the configuration after a reboot in the db?
ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask

Comment 5 Maxim Babushkin 2016-11-27 08:00:33 UTC
I made some additional tests.

If I set pmd-cpu-mask on a newly deployed overcloud, everything works just fine.

But once I set the tuned profile, configure the CPU cores, and run the grub2-mkconfig command, the output of the command turns into what is described above after a host reboot.

The output of the 'ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask' command remains the same both before the tuned profile is activated, when the configuration survives the reboot, and after I configure the tuned profile, when it shows a mess after a host reboot.

[root@compute-0 ~]# ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
"40404040"

Comment 6 Eelco Chaudron 2016-11-28 12:30:09 UTC
For the OVS PMD threads you should not use tuned for pinning the cores, as DPDK takes care of this. Can you share your config for tuned and what else you set up through grub?

Or even better, can you get me access to your setup and I'll take a peek?

Comment 7 Maxim Babushkin 2016-11-29 11:06:56 UTC
The use of a tuned profile on the overcloud compute node is required according to the following RFE [0] for CPUAffinity and IRQ repinning.

As part of the tuned config, a list of CPU cores is set to be cleared of IRQ interrupts. Currently, the cpu-partitioning profile is used.

This adds the following arguments to the grub config file:
nohz=on nohz_full=<number_of_cores> rcu_nocbs=<number_of_cores> intel_pstate=disable nosoftlockup

In addition, the following arguments are added to grub manually:
intel_iommu=on iommu=on hugepagesz=1GB default_hugepagesz=1GB hugepages=20
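
For illustration only (a sketch assuming the standard /etc/default/grub workflow; the cpu-partitioning profile injects its own arguments through tuned's bootloader plugin), the manual part looks roughly like this:

  # append to the existing GRUB_CMDLINE_LINUX in /etc/default/grub
  GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=on hugepagesz=1GB default_hugepagesz=1GB hugepages=20"
  # regenerate the grub configuration and reboot for the change to take effect
  grub2-mkconfig -o /boot/grub2/grub.cfg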



[0] - https://bugzilla.redhat.com/show_bug.cgi?id=1384845

Comment 8 Eelco Chaudron 2016-11-29 12:07:25 UTC
Can you let me know which cores you exclude on the host? They should not include the cores you assign to DPDK. 

Can you get me access to your setup so I can investigate more?

Comment 9 Maxim Babushkin 2016-11-29 13:56:32 UTC
The cores I'm setting for pmd-cpu-mask are taken from the excluded cores pool; otherwise, I get performance degradation because of the IRQ interrupts.

I will send you the setup connection details by mail.

Comment 10 Eelco Chaudron 2016-11-29 15:35:09 UTC
Hi Maxim,

I looked at your setup, and the DPDK PMD threads are kicked off the cores by your tuned settings.

  [root@compute-0 heat-admin]# more /etc/tuned/active_profile
  cpu-partitioning
  [root@compute-0 heat-admin]# more /etc/tuned/cpu-partitioning-variables.conf 
  # Examples:
  # isolated_cores=2,4-7
  isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30

As DPDK is a user-space process, setting the cores to kernel-only use (isolated_cores=) prevents them from being used by OVS for PMD.

So you should pick PMD cores that are not among the isolated_cores.
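
To make the overlap concrete, here is a small illustrative snippet (not a tool used in this bug) that lists which CPUs a given pmd-cpu-mask selects, so they can be compared against isolated_cores:

  mask=0x40404040                      # the pmd-cpu-mask used in this bug
  for cpu in $(seq 0 31); do
    [ $(( (mask >> cpu) & 1 )) -eq 1 ] && echo "CPU $cpu"
  done
  # prints CPUs 6, 14, 22 and 30, all of which appear in the isolated_cores list above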

Comment 11 Maxim Babushkin 2016-11-29 15:53:36 UTC
Hi Eelco,

Thanks for the debugging.

Andrew, could you please comment on this?

Comment 12 Franck Baudin 2016-12-07 11:19:35 UTC
(In reply to Eelco Chaudron from comment #10)
> Hi Maxim,
> 
> Looked at your setup and DPDK/PDM is kicked-off the cores by your tuned
> settings.
> 
>   [root@compute-0 heat-admin]# more /etc/tuned/active_profile
>   cpu-partitioning
>   [root@compute-0 heat-admin]# more
> /etc/tuned/cpu-partitioning-variables.conf 
>   # Examples:
>   # isolated_cores=2,4-7
>   isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
> 
> As DPDK is a user space process, setting the cores to kernel only use
> (isolated_cores=), prevents them from being used by OVS for PMD. 
> 
> So you should pick your PMD cores to not be one of the isolated_cores.

How can we isolate OVS PMD threads from Linux? This is the whole purpose of using a tuned profile. What is the proper solution?

Comment 13 Eelco Chaudron 2016-12-07 12:06:01 UTC
I think this is related to how tuned works; it seems to remove applications assigned to the cores you would like to isolate.

I do not know enough about tuned to tell you how to fix it that way. However, what will work is isolating the cores from the boot command line; this way tuned does not touch the assigned processes.

i.e. add something like "isolcpus=4,6,8,10,12,14,20,22,24,26,28,30" to GRUB_CMDLINE_LINUX and run grub2-mkconfig -o /boot/grub2/grub.cfg

Maybe you can consult the tuned people for this part. Maybe if tuned is started before ovs it might not be a problem?
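
As a hedged follow-up check (a sketch; grep and tuna are assumed to be available on the compute node, as elsewhere in this bug), after a reboot you could confirm that the kernel picked up the isolation and that the PMD threads kept their pinning:

  grep -o 'isolcpus=[^ ]*' /proc/cmdline      # the isolation list reached the kernel
  tuna -t ovs-vswitchd -CP | grep pmd         # PMD affinities still match pmd-cpu-mask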

Comment 14 Andrew Theurer 2016-12-07 22:15:41 UTC
tuned must be moving all processes that fall into the isolated cores when the profile is activated. This may be because tuned is being activated -after- openvswitch has started. Can you find out in what order these run? Or, after a reboot when the PMD placement is wrong, can you restart openvswitch and confirm the PMD threads are in the right place?
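
One possible way to answer the ordering question (a sketch using standard systemd tooling, assumed present on the compute node):

  # compare when the two units became active during the current boot
  systemctl show -p ActiveEnterTimestamp tuned.service openvswitch.service
  # or read the boot journal for both units in one timeline
  journalctl -b -u tuned -u openvswitch --no-pager | head -n 20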

Comment 15 Maxim Babushkin 2016-12-08 14:12:50 UTC
Andrew,

Yes, once the openvswitch service is restarted, the cores allocated for PMD are displayed correctly.

I tried to play with the systemd service order so that tuned would start after openvswitch, but after a reboot it still shows incorrect values. Only an openvswitch service restart fixes it.

Comment 16 Maxim Babushkin 2016-12-08 15:24:10 UTC
During system boot, the tuned service starts after openvswitch, which causes the issue.


[root@compute-0 ~]# 
[root@compute-0 ~]# systemctl restart openvswitch
[root@compute-0 ~]# 
[root@compute-0 ~]# 
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  7926   OTHER     0        4         0            1          pmd101  
  7927   OTHER     0        6         0            4          pmd102  
  7928   OTHER     0       20         1            3          pmd103  
  7929   OTHER     0       22      2816          228          pmd104  
[root@compute-0 ~]# 
[root@compute-0 ~]# 
[root@compute-0 ~]# 
[root@compute-0 ~]# systemctl restart tuned
[root@compute-0 ~]# 
[root@compute-0 ~]# 
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  7926   OTHER     0 0xaaafaaaf         0            5          pmd101  
  7927   OTHER     0 0xaaafaaaf         0            8          pmd102  
  7928   OTHER     0 0xaaafaaaf         1            6          pmd103  
  7929   OTHER     0 0xaaafaaaf      2816          236          pmd104  
[root@compute-0 ~]#

Comment 17 Andrew Theurer 2016-12-08 15:37:01 UTC
We'll need some guidance on getting systemd to start tuned before openvswitch; we don't want openvswitch delayed indefinitely if tuned is disabled for some reason.
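
One direction, offered here only as an assumption rather than a tested fix: a systemd After=tuned.service dependency is ordering-only, so it does not pull tuned in or block openvswitch when tuned is disabled. Expressed as a hypothetical drop-in:

  # /etc/systemd/system/openvswitch.service.d/10-after-tuned.conf  (hypothetical path)
  [Unit]
  After=tuned.service
  # followed by: systemctl daemon-reload

Note that later comments in this bug describe an ordering conflict with network.target, so this only addresses the "delayed indefinitely" concern.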

Comment 18 Andrew Theurer 2016-12-09 01:54:35 UTC
I have changed the systemd file for openvswitch to start after tuned, and this preserves the pinning I set up for the openvswitch PMD threads. I have also disabled tuned and rebooted to ensure openvswitch still starts, and it does. Maxim, can you change this file (the line starting with "After" has changed) and try again:

/usr/lib/systemd/system/openvswitch.service:

[Unit]
Description=Open vSwitch
Before=network.target network.service
After=network-pre.target tuned.service
PartOf=network.target
BindsTo=ovsdb-server.service
BindsTo=ovs-vswitchd.service

[Service]
Type=oneshot
ExecStart=/bin/true
ExecReload=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
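
A minimal sketch of how to verify the new ordering before rebooting (standard systemd commands, assumed available on the node):

  systemctl daemon-reload
  systemctl show -p After openvswitch.service                    # should now include tuned.service
  systemctl list-dependencies --after openvswitch.service | grep tuned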

Comment 19 Franck Baudin 2016-12-09 08:50:24 UTC
The proposed patch doesn't work on my platform. I'll let Maxim confirm, but I'm pessimistic; see below.

Looking at tuned.service, it is required to start after network.target:
[Unit]
After=syslog.target systemd-sysctl.service network.target

But in openvswitch.service, we see that openvswitch is ordered before network.target:
[Unit]
Before=network.target network.service
After=network-pre.target
PartOf=network.target

So we are in a "deadlock" situation. I tested Andrew's patch and it doesn't work on my platform, I believe because of the deadlock situation described above. So instead of changing the openvswitch unit, I changed the tuned one as follows:

[Unit]
Description=Dynamic System Tuning Daemon
Before=network.target network.service openvswitch.target
After=syslog.target systemd-sysctl.service
Requires=dbus.service polkit.service
Conflicts=cpupower.service

I'm not sure of other side effects of launching tuned before network.*, but it looks fine on my setup and the openvswitch PMD threads are pinned to the proper CPUs.

Comment 20 Jaroslav Škarvada 2016-12-09 12:57:28 UTC
I am currently not sure whether Tuned could be started before network.target, because there are scenarios where it's used for network performance tuning.

Actually the cpu-partitioning profile (as in tuned-profiles-cpu-partitioning-2.7.1-5.el7) does the following things regarding isolated_cores:

1) runs the defirqaffinity script to adjust IRQ affinity
2) patches the systemd config with a CPUAffinity setting to run init on inverse(isolated_cores), thus I think no children of init will run on isolated_cores
3) runs 'tuna --isolate' to move all threads away from the specified cores

Item 2) takes effect after a reboot, thus it shouldn't behave differently after a reboot. I think it could behave differently if you run the processes on isolated cores manually before Tuned is started.

However, I don't think that this magic with the ordering of services is the correct solution, because you would probably end up in the same situation if you change the Tuned profile or just restart it. I think a more robust solution would work better, e.g. the introduction of a process whitelist in the user configuration:

 isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
 ignore_processes=pmd*

What do you think about such a feature?

Comment 21 Jaroslav Škarvada 2016-12-09 13:02:24 UTC
(In reply to Jaroslav Škarvada from comment #20)
> However, I don't think that the magic with ordering of services is the
> correct solution, because you would probably end-up in the same situation if
> you change Tuned profile or just restart it. I think some more robust
> solution would work better like e.g. introduction of process whitelist in
> user configuration, e.g.:
> 
>  isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
>  ignore_processes=pmd*
> 
> What do you think about such feature?

Just brainstorming the best way to fix the problem; such a feature is not yet implemented in Tuned.

Comment 22 Franck Baudin 2016-12-09 13:17:08 UTC
(In reply to Jaroslav Škarvada from comment #21)
> (In reply to Jaroslav Škarvada from comment #20)
> > However, I don't think that the magic with ordering of services is the
> > correct solution, because you would probably end-up in the same situation if
> > you change Tuned profile or just restart it. I think some more robust
> > solution would work better like e.g. introduction of process whitelist in
> > user configuration, e.g.:
> > 
> >  isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
> >  ignore_processes=pmd*
> > 
> > What do you think about such feature?
> 
> Just brainstorming the best way how to fix the problem, such feature is not
> yet implemented in Tuned.

This would work well; we just need to define a list of regexps for ignore_processes.

While waiting for this feature to be implemented, would it make sense to update the tuned unit, or is the implementation above quite straightforward?

Comment 23 Jaroslav Škarvada 2016-12-09 13:42:28 UTC
(In reply to Franck Baudin from comment #22)
> (In reply to Jaroslav Škarvada from comment #21)
> > (In reply to Jaroslav Škarvada from comment #20)
> > > However, I don't think that the magic with ordering of services is the
> > > correct solution, because you would probably end-up in the same situation if
> > > you change Tuned profile or just restart it. I think some more robust
> > > solution would work better like e.g. introduction of process whitelist in
> > > user configuration, e.g.:
> > > 
> > >  isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
> > >  ignore_processes=pmd*
> > > 
> > > What do you think about such feature?
> > 
> > Just brainstorming the best way how to fix the problem, such feature is not
> > yet implemented in Tuned.
> 
> This would work well, we just need to define a list of regexp for
> ignore_processes.
>
OK, please clone this bug (or open a new one) against Tuned as an RFE.
 
> While waiting for this feature to be implemented, would it make sense to
> update the tuned Unit or is the above implementation above quite
> straightforward?

If you could patch it in your product, it should work as a workaround, but I cannot deliver it, because:

- the service file is in the main engine (the tuned package)
- from the fast-datapath-rhel-7 dist-git branch we delivered only the tuned-profiles-cpu-partitioning package, and the main engine (the tuned package) is shared from RHEL

Comment 24 Maxim Babushkin 2016-12-11 13:14:20 UTC
Franck, your suggestion works as expected.
Do you want to implement it as a workaround in first-boot.yaml?

Comment 25 Franck Baudin 2016-12-12 17:52:43 UTC
Yes Maxim, as a temporary workaround, please put a comment with the BZ number around this section. A one-line sed command followed by a "systemctl daemon-reload" should do the trick. We do not have to include it in the RHOSP10 documentation; it's too late.
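
For reference, a hedged sketch of the kind of one-liner meant here; it mirrors the logic of the attached tuned_boot_fix.sh (which also handles the case of an existing Before= line):

  # drop network.target from tuned's After= and order tuned before network and openvswitch
  sed -i -e '/^After=/s/ *network.target//' -e '/^After=/i Before=network.target openvswitch.service' /usr/lib/systemd/system/tuned.service
  systemctl daemon-reload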

Comment 26 Maxim Babushkin 2016-12-14 14:35:36 UTC
Created attachment 1231760 [details]
tuned_boot_fix.sh

Comment 27 Eelco Chaudron 2016-12-14 19:25:45 UTC
Can we close out this BZ, as the issue itself is not related to openvswitch but to tuned? I see bz 1403309 exists for the tuned enhancement.

Comment 29 Eelco Chaudron 2016-12-19 10:51:14 UTC
Closing this BZ as it's not an OVS problem but a tuned one, which is handled in bz 1403309.