Bug 1394537
Summary: Setting pmd cores on compute node with OVS-DPDK unable to survive a compute reboot
Product: Red Hat OpenStack
Component: openvswitch-dpdk
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Version: 10.0 (Newton)
Target Milestone: async
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Reporter: Maxim Babushkin <mbabushk>
Assignee: Eelco Chaudron <echaudro>
CC: atelang, atheurer, dnavale, echaudro, fbaudin, fleitner, jskarvad, lbopf, mbabushk, nyechiel, oblaut, yrachman
Keywords: ZStream
Doc Type: Known Issue
Doc Text:
After a `tuned` profile is activated, the `tuned` service must start before the `openvswitch` service in order to set the cores allocated to the PMD threads correctly.
As a workaround, you can change the `tuned` service unit by running the following script:

#!/bin/bash
tuned_service=/usr/lib/systemd/system/tuned.service

# Drop network.target from the After= line, if present.
grep -q "network.target" $tuned_service
if [ "$?" -eq 0 ]; then
    sed -i '/After=.*/s/network.target//g' $tuned_service
fi

# Make sure tuned is ordered before network.target and openvswitch.
grep -q "Before=.*network.target" $tuned_service
if [ ! "$?" -eq 0 ]; then
    grep -q "Before=.*" $tuned_service
    if [ "$?" -eq 0 ]; then
        # Append to the existing Before= line.
        sed -i 's/^\(Before=.*\)/\1 network.target openvswitch.service/g' $tuned_service
    else
        # No Before= line yet; insert one above the After= line.
        sed -i '/After/i Before=network.target openvswitch.service' $tuned_service
    fi
fi

systemctl daemon-reload
systemctl restart openvswitch
exit 0
Last Closed: 2016-12-19 10:51:14 UTC
Type: Bug
Bug Depends On: 1403309
Description (Maxim Babushkin, 2016-11-13 08:23:10 UTC)
Maxim, did you also update DPDK_OPTIONS in /etc/sysconfig/openvswitch?

Franck, which options should be updated within /etc/sysconfig/openvswitch?

I'd like to read FBL's answer. It should survive. Could you please check what the configuration in the db is after a reboot?

ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask

I made some additional tests. If I set pmd-cpu-mask on a newly deployed overcloud, everything works just fine. But once I set the tuned profile, configure the CPU cores, and run the grub2-mkconfig command, the output of the command turns into what is described above after a host reboot. The output of the 'ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask' command remains the same throughout: before the tuned profile activation the configuration survives the reboot, but once I configure the tuned profile it shows a mess after a host reboot.

[root@compute-0 ~]# ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
"40404040"

For the OVS PMD threads you should not use tuned for pinning the cores, as DPDK takes care of this. Can you share your config for tuned and what else you set up through grub? Or even better, can you give me access to your setup and I'll take a peek.

The usage of a tuned profile on the overcloud compute node is required according to the following RFE [0] for CPUAffinity and IRQ repinning. As part of the tuned config, a list of CPU cores is set to be cleaned of IRQ interrupts. Currently, the cpu-partitioning profile is used. This adds the following arguments to the grub config file:

nohz=on nohz_full=<number_of_cores> rcu_nocbs=<number_of_cores> intel_pstate=disable nosoftlockup

In addition, the following arguments were added to grub manually:

intel_iommu=on iommu=on hugepagesz=1GB default_hugepagesz=1GB hugepages=20

[0] - https://bugzilla.redhat.com/show_bug.cgi?id=1384845

Can you let me know which cores you exclude on the host? They should not include the cores you assign to DPDK. Can you give me access to your setup so I can investigate more?
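As a side note for readers, the quoted pmd-cpu-mask value can be decoded into core numbers and compared against an exclusion list, which makes the overlap question concrete. A minimal sketch in plain bash (no OVS needed); the exclusion list below is illustrative:

```shell
#!/bin/bash
# Decode an OVS pmd-cpu-mask hex string into core numbers and report any
# core that also appears in an exclusion list. "40404040" is the value
# quoted above; the exclusion list is an illustrative example.
mask_hex="40404040"
excluded="4,6,8,10,12,14,20,22,24,26,28,30"

mask=$((16#$mask_hex))
pmd_cores=""
for c in $(seq 0 63); do
    if (( (mask >> c) & 1 )); then
        pmd_cores="$pmd_cores $c"
    fi
done
echo "PMD cores:$pmd_cores"    # mask 0x40404040 sets bits 6, 14, 22, 30

overlap=""
for c in $pmd_cores; do
    case ",$excluded," in
        *",$c,"*) overlap="$overlap $c" ;;
    esac
done
echo "overlap with excluded:$overlap"
```

With this example exclusion list, every decoded PMD core overlaps, which mirrors the situation in this bug: the PMD cores were deliberately taken from the excluded pool.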
The cores I'm setting for pmd-cpu-mask are taken from the excluded cores pool; otherwise I will get performance degradation because of the IRQ interrupts. I will send you the setup connection details by mail.

Hi Maxim,

I looked at your setup, and DPDK/PMD is kicked off the cores by your tuned settings.

[root@compute-0 heat-admin]# more /etc/tuned/active_profile
cpu-partitioning
[root@compute-0 heat-admin]# more /etc/tuned/cpu-partitioning-variables.conf
# Examples:
# isolated_cores=2,4-7
isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30

As DPDK is a user space process, setting the cores to kernel-only use (isolated_cores=) prevents them from being used by OVS for PMD. So you should pick your PMD cores such that none of them is in isolated_cores.

Hi Eelco,

Thanks for the debugging. Andrew, can you please comment on this?

(In reply to Eelco Chaudron from comment #10)
> Looked at your setup and DPDK/PMD is kicked-off the cores by your tuned
> settings.
> [...]
> So you should pick your PMD cores to not be one of the isolated_cores.

How can we isolate OVS PMD threads from Linux? This is the whole purpose of using a tuned profile; what is the proper solution?

I think this is related to how tuned works; it seems to move applications assigned to the cores you would like to isolate. I do not know enough about tuned to tell you how to fix it that way. However, what will work is isolating the cores from the boot command line; this way tuned does not touch the assigned processes, i.e.
add something to GRUB_CMDLINE_LINUX like "isolcpus=4,6,8,10,12,14,20,22,24,26,28,30" and run:

grub2-mkconfig -o /boot/grub2/grub.cfg

Maybe you can consult the tuned people for this part. Maybe if tuned is started before OVS it might not be a problem?

tuned must be moving all processes that fall into the isolated cores when the profile is activated. This may be because tuned is activated after openvswitch has started. Can you find out in what order these run? Or, after a reboot, when the pmd placement is wrong, can you restart openvswitch and confirm the pmd threads are in the right place?

Andrew,

Yes, once the openvswitch service is restarted, the cores allocated for PMD are displayed correctly. I tried to play with the systemd service order, but after a reboot it still shows an incorrect value; just an openvswitch service restart fixes it. During system boot, the tuned service is started after openvswitch, which causes the issue.

[root@compute-0 ~]# systemctl restart openvswitch
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  7926   OTHER   0         4      0     1   pmd101
  7927   OTHER   0         6      0     4   pmd102
  7928   OTHER   0        20      1     3   pmd103
  7929   OTHER   0        22   2816   228   pmd104
[root@compute-0 ~]# systemctl restart tuned
[root@compute-0 ~]# tuna -t ovs-vswitchd -CP | grep pmd
  7926   OTHER   0   0xaaafaaaf     0     5   pmd101
  7927   OTHER   0   0xaaafaaaf     0     8   pmd102
  7928   OTHER   0   0xaaafaaaf     1     6   pmd103
  7929   OTHER   0   0xaaafaaaf  2816   236   pmd104

We'll need some guidance on getting systemd to start tuned before openvswitch; we don't want openvswitch delayed indefinitely if tuned is disabled for some reason.

I have changed the systemd file for openvswitch to start after tuned, and this preserves the pinning I set up for the openvswitch PMD threads. I have also disabled tuned and rebooted to ensure openvswitch still starts, and it does.
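For readers following along, the per-thread affinity that tuna prints can also be read directly from /proc, which helps on hosts where tuna is not installed. This is a minimal illustrative sketch: it inspects this shell's own status file as a stand-in; a real check would substitute a pmd thread ID taken from the ovs-vswitchd process.

```shell
#!/bin/bash
# Read the allowed-CPU mask and list for a task from /proc.
# "self" is a stand-in; for a pmd thread you would use its TID,
# e.g. /proc/<pid-of-ovs-vswitchd>/task/<tid>/status.
tid="self"
mask=$(awk '/^Cpus_allowed:/ {print $2}' /proc/$tid/status)
cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/$tid/status)
echo "Cpus_allowed:      $mask"
echo "Cpus_allowed_list: $cpus"
```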
Maxim, can you change this file (the line starting with "After" has changed) and try again?

/usr/lib/systemd/system/openvswitch.service:

[Unit]
Description=Open vSwitch
Before=network.target network.service
After=network-pre.target tuned.service
PartOf=network.target
BindsTo=ovsdb-server.service
BindsTo=ovs-vswitchd.service

[Service]
Type=oneshot
ExecStart=/bin/true
ExecReload=/bin/true
ExecStop=/bin/true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

The proposed patch doesn't work on my platform. I'll let Maxim confirm, but I'm pessimistic; see below.

Looking at tuned.service, it is required to start after network.target:

[Unit]
After=syslog.target systemd-sysctl.service network.target

But in openvswitch.service, we see that openvswitch is ordered before network.target:

[Unit]
Before=network.target network.service
After=network-pre.target
PartOf=network.target

So we are in a "deadlock" situation. I tested Andrew's patch; it doesn't work on my platform, I believe because of the deadlock situation described above. So instead of changing the openvswitch unit, I changed the tuned one as follows:

[Unit]
Description=Dynamic System Tuning Daemon
Before=network.target network.service openvswitch.target
After=syslog.target systemd-sysctl.service
Requires=dbus.service polkit.service
Conflicts=cpupower.service

I'm not sure of other side effects of launching tuned before network.*, but it looks fine on my setup, and the openvswitch PMD threads are pinned on the proper CPUs.

I am currently not sure whether Tuned can be started before network.target, because there are scenarios where it's used for network performance tuning.
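As an aside (not something proposed in this thread): editing the shipped unit files under /usr/lib/systemd/system is fragile, since a package update overwrites them. The usual systemd mechanism for this kind of ordering experiment is a drop-in override. A hedged sketch of what one might look like; the file path is illustrative:

```ini
# /etc/systemd/system/openvswitch.service.d/10-after-tuned.conf (hypothetical path)
[Unit]
After=tuned.service
```

followed by `systemctl daemon-reload`. Note this only orders openvswitch after tuned; it does not by itself resolve the network.target ordering deadlock described above.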
Actually, the cpu-partitioning profile (as in tuned-profiles-cpu-partitioning-2.7.1-5.el7) does the following regarding isolated_cores:

1) runs the defirqaffinity script to adjust IRQ affinity
2) patches the systemd config with a CPUAffinity setting to run init on inverse(isolated_cores); thus I think no child of init will run on the isolated_cores
3) runs 'tuna --isolate' to move all threads away from the specified cores

Step 2) takes effect after a reboot, thus it shouldn't behave differently after a reboot. I think it could behave differently if you run the processes manually on isolated cores before Tuned is started.

However, I don't think that the magic with the ordering of services is the correct solution, because you would probably end up in the same situation if you change the Tuned profile or just restart it. I think some more robust solution would work better, e.g. the introduction of a process whitelist in the user configuration:

isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
ignore_processes=pmd*

What do you think about such a feature?

(In reply to Jaroslav Škarvada from comment #20)
> However, I don't think that the magic with ordering of services is the
> correct solution [...] What do you think about such feature?

Just brainstorming the best way to fix the problem; such a feature is not yet implemented in Tuned.

(In reply to Jaroslav Škarvada from comment #21)
> (In reply to Jaroslav Škarvada from comment #20)
> > However, I don't think that the magic with ordering of services is the
> > correct solution, because you would probably end up in the same
> > situation if you change the Tuned profile or just restart it.
> > I think some more robust solution would work better, e.g. the
> > introduction of a process whitelist in the user configuration:
> >
> > isolated_cores=4,6,8,10,12,14,20,22,24,26,28,30
> > ignore_processes=pmd*
> >
> > What do you think about such feature?
>
> Just brainstorming the best way how to fix the problem, such feature is
> not yet implemented in Tuned.

This would work well; we just need to define a list of regexps for ignore_processes.

While waiting for this feature to be implemented, would it make sense to update the tuned unit, or is the implementation above quite straightforward?

(In reply to Franck Baudin from comment #22)
> This would work well, we just need to define a list of regexp for
> ignore_processes.

OK, please clone this bug (or open a new one) against Tuned as an RFE.

> While waiting for this feature to be implemented, would it make sense to
> update the tuned Unit or is the implementation above quite straightforward?
If you could patch it in your product, it should work as a workaround, but I cannot deliver it, because:

- the service file is in the main engine (the tuned package)
- from the fast-datapath-rhel-7 dist-git branch we delivered only the tuned-profiles-cpu-partitioning package, and the main engine (the tuned package) is shared from RHEL

Franck, your suggestion works as expected. Do you want to implement it as a workaround in first-boot.yaml?

Yes Maxim, as a temporary workaround, please put a comment with the BZ number around this section. A one-line sed command followed by a "systemctl daemon-reload" should do the trick. We do not have to include it in the RHOSP10 documentation; it's too late.

Created attachment 1231760 [details]
tuned_boot_fix.sh
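The one-line sed approach mentioned above could look roughly like the following. This is a hedged sketch applied to a temporary copy of a sample tuned unit (so the transformation is visible without touching the real file); the attached tuned_boot_fix.sh is the actual workaround.

```shell
#!/bin/bash
# Sketch of the sed-based reordering from this BZ: drop network.target from
# tuned's After= line and add a Before= ordering on network.target and
# openvswitch. Runs against a temp copy of a sample unit, not the real file.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Dynamic System Tuning Daemon
After=syslog.target systemd-sysctl.service network.target
EOF

sed -i -e 's/^\(After=.*\) network.target/\1/' \
       -e '/^After=/i Before=network.target openvswitch.service' "$unit"

before_line=$(grep '^Before=' "$unit")
echo "$before_line"
# On a real system this would be followed by:
#   systemctl daemon-reload && systemctl restart openvswitch
```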
Can we close out this BZ, as the issue itself is not related to openvswitch but to tuned? I see bz 1403309 exists for the tuned enhancement.

Closing this BZ as it is not an OVS problem but a tuned one, which is handled in bz 1403309.