Bug 2203142

Summary: RFE: Allow skipping rollback when changing profile or restarting tuned
Product: Red Hat Enterprise Linux 9
Reporter: Martin Sivák <msivak>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED ERRATA
QA Contact: Robin Hack <rhack>
Severity: urgent
Docs Contact:
Priority: high
Version: 9.2
CC: bhu, blitton, bwensley, fbaudin, jeder, jlelli, jmencak, jorton, jskarvad, kkarampo, pibanezr
Target Milestone: rc
Keywords: FutureFeature, TestCaseNeeded, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tuned-2.21.0-0.1.rc1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-07 08:56:19 UTC
Type: Story
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2188812, 2188934

Description Martin Sivák 2023-05-11 10:13:53 UTC
Description of problem:

A tuned restart unloads the active profile, reverts all values to their defaults, and then applies the profile again.

This causes issues on SR-IOV enabled systems where the queue count is configured via a tuned profile. A tuned restart reverts the queue count to the number of CPUs and then reduces it again to the configured value. This change causes an SR-IOV device reset, and applications using the device get confused and lose packets.
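
To make the sequence concrete, here is a minimal Python sketch that models a restart as a series of writes to one setting. It is not tuned code: the function name `restart_writes` and the numbers (64 CPUs, 8 queues) are assumptions chosen for illustration.

```python
def restart_writes(current, default, configured, rollback=True):
    """Return the (old, new) value changes a restart would cause.

    Hypothetical model: with rollback, the value is first reset to
    its default and then set to the configured value again.
    """
    changes = []
    value = current
    steps = ([default] if rollback else []) + [configured]
    for target in steps:
        if target != value:
            # Each real change is a write the device would see.
            changes.append((value, target))
            value = target
    return changes


cpus = 64    # assumed default queue count (one queue per CPU)
queues = 8   # assumed queue count set by the profile

print(restart_writes(queues, cpus, queues, rollback=True))
print(restart_writes(queues, cpus, queues, rollback=False))
```

With rollback the setting changes twice (8 to 64, then 64 to 8) even though the final value equals the starting value; without rollback no write is needed at all.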


The same happens when the profile changes, but that is less disruptive, because the administrator initiated the change and should be aware of the consequences.

Version-Release number of selected component (if applicable):

RHEL 9.2, but effectively all versions are affected.

How reproducible:

Always. Configure sysctls, CFS values, CPU affinity, or NIC queue counts, then restart tuned.

Actual results:

Workloads are disrupted.

Expected results:

No disruption of values that have not changed.
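
One way to picture this expectation is to compare the live state with the desired profile and write only the differences. The sketch below assumes settings can be represented as a flat dictionary; the `plan_changes` helper and the setting names are hypothetical and do not reflect how tuned is implemented.

```python
def plan_changes(live, desired):
    """Return only the settings whose live value differs from the profile."""
    return {key: value for key, value in desired.items()
            if live.get(key) != value}


# Hypothetical setting names and values, for illustration only.
live = {"nic.queues": 8, "cpu.affinity": "0-3"}
desired = {"nic.queues": 8, "cpu.affinity": "0-7"}

print(plan_changes(live, desired))
```

Here only `cpu.affinity` would be rewritten; the unchanged queue count is left alone, so the device is never reset.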
Additional info:


More information:

This restart can happen during an OCP upgrade of the master nodes (possibly during RHEL's yum update too?), and the effect on the worker nodes is unexpected.

OCP seldom upgrades profiles without a reboot, so tuned almost always starts on a clean system. A manual or upgrade-related tuned restart therefore does not need the rollback, because after the next reboot tuned will be configuring a clean system again.

In other words: we can configure tuned to ignore the rollback just for this use case if this is made configurable.

Comment 4 Jiří Mencák 2023-05-22 14:51:54 UTC
Actually, a configuration option to explicitly disable the rollback on TuneD shutdown might be enough.

Still a WiP (needs more testing on RHOCP), but feel free to test and review.  Should already work:
https://github.com/redhat-performance/tuned/pull/533
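
Such an option can be pictured as a policy that is consulted before each rollback. The following sketch is a hypothetical model: the `RollbackPolicy` values, the event names, and the `should_rollback` helper are assumptions for illustration and are not taken from the linked pull request.

```python
from enum import Enum


class RollbackPolicy(Enum):
    # Hypothetical policy names, not the real configuration values.
    ALWAYS = "always"            # roll back on profile switch and on stop
    NOT_ON_EXIT = "not_on_exit"  # roll back on profile switch only


def should_rollback(policy, event):
    """Decide whether settings are reverted for a given daemon event."""
    if event == "profile_switch":
        return True
    if event == "daemon_stop":
        return policy is RollbackPolicy.ALWAYS
    raise ValueError(f"unknown event: {event}")


print(should_rollback(RollbackPolicy.NOT_ON_EXIT, "daemon_stop"))
print(should_rollback(RollbackPolicy.NOT_ON_EXIT, "profile_switch"))
```

Under this model a restart with the second policy leaves the applied values in place, while an explicit profile switch still reverts the old profile first.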

Comment 6 Bryan Litton 2023-05-25 12:31:34 UTC
Hi,

Adding to Juri's bump of severity and priority, I would like to add some context: this fix is needed as part of a larger set of fixes to address latency performance for our Telco RAN solution on OCP 4.13. This is currently blocking our key partners from using 4.13, and we are under pressure to give them a target OCP 4.13.z release in which these issues will be fixed.

I've gone ahead and requested this be backported to 9.2 as soon as possible. Any priority you could give to get this verified in 9.3 and backported would be greatly appreciated by the Telco program.

Comment 8 Jiří Mencák 2023-05-26 11:13:18 UTC
Hi Bryan,
thank you for the additional context.

(In reply to Bryan Litton from comment #6)
> I've gone ahead and requested this be backported to 9.2 as soon as possible.
> Any priority you could give to get this verified in 9.3 and backported would
> be greatly appreciated by the Telco program.

If the request is to ship this feature in OCP only, then I believe we do not need to backport
to RHEL 9.2 as OCP uses TuneD via FDP (Fast Data Path).

Comment 28 errata-xmlrpc 2023-11-07 08:56:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (tuned bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6703