Bug 2095829 - TuneD: Need to modify method for setting EPP/EPB on Intel processors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: tuned
Version: 9.1
Hardware: x86_64
OS: Linux
Target Milestone: rc
Assignee: Jaroslav Škarvada
QA Contact: Robin Hack
Docs Contact: Šárka Jana
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-10 16:58 UTC by Joe Mario
Modified: 2023-05-09 10:43 UTC

Fixed In Version: tuned-2.20.0-0.1.rc1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-09 08:26:31 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System                  ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   RHELPLAN-125015  0        None      None    None     2022-06-10 17:23:23 UTC
Red Hat Product Errata  RHBA-2023:2585   0        None      None    None     2023-05-09 08:26:37 UTC

Description Joe Mario 2022-06-10 16:58:10 UTC
Description of problem:

The problem:
TuneD has always set EPB (energy_perf_bias) directly on Intel processors.  It did so via the /usr/bin/x86_energy_perf_policy binary.  Writing EPB directly was the only way, and it worked fine for years.

On newer Intel processors, we should no longer write EPB.  Instead, we should only write energy_performance_preference from a list of available values.  On some newer Intel processors, writing EPB/EPP=0 may result in lower performance.

See: https://www.kernel.org/doc/html/v5.17/admin-guide/pm/intel_pstate.html#energy-vs-performance-hints

Here's the flow for how TuneD should set EPP/EPB:

EPP_DIR=/sys/devices/system/cpu/cpufreq/policyX/
EPB_DIR=/sys/devices/system/cpu/cpuX/power/

1) If ($EPP_DIR/energy_performance_available_preferences exists) {
     a) Check that file for available EPP preferences to verify
        the desired EPP value is available.
     b) Then write the desired (string) policy to
        $EPP_DIR/energy_performance_preference.
   }
   else {  /* energy_performance_available_preferences does not exist */
     a) Then write energy_perf_bias (sysfs file $EPB_DIR/energy_perf_bias)
        using the x86_energy_perf_policy binary.
        This should only be true on older processors.
   }

2) The above logic is similar for setting a governor, though 
   TuneD is already checking for valid governors first. (No change needed).
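
Step 1 of the flow above can be sketched in Python roughly as follows. This is only an illustration, not TuneD's code: the function name `set_epp`, its status strings, and the `sysfs_root` parameter are hypothetical. It assumes the EPP files live in the per-policy cpufreq directory (`cpufreq/policyX/`), which is where the intel_pstate documentation linked above places them.

```python
from pathlib import Path


def set_epp(requested, policy=0, sysfs_root="/sys/devices/system/cpu"):
    """Write `requested` as the EPP of one cpufreq policy if the kernel offers it.

    Returns "written", "invalid" (value not offered by the kernel), or
    "no-epp" (the EPP interface is missing, so the caller falls back to EPB).
    """
    policy_dir = Path(sysfs_root) / "cpufreq" / f"policy{policy}"
    available_file = policy_dir / "energy_performance_available_preferences"
    if not available_file.exists():
        return "no-epp"

    # The file holds a space-separated list, for example
    # "default performance balance_performance balance_power power".
    available = available_file.read_text().split()
    if requested not in available:
        return "invalid"

    (policy_dir / "energy_performance_preference").write_text(requested)
    return "written"
```

Passing a scratch directory as `sysfs_root` lets the logic be exercised without root privileges; on a real system the write needs root.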


Version-Release number of selected component (if applicable):

Although I've opened this BZ against RHEL-9.x, it is important to backport any fix into the open RHEL-8.x releases.



Comment 1 Joe Mario 2022-06-14 11:51:35 UTC
Hi Jaroslav:
Here's an example of an older processor (Intel Haswell) running RHEL-9 (5.14.0-8.el9.x86_64), which does not have the newer "energy_performance_available_preferences" file.  
In this case where that file does not exist, continue to use the current TuneD method of setting energy_perf_bias using /usr/bin/x86_energy_perf_policy.

[root@shakperf ~]# cat t.sh
#!/bin/bash
cd /sys/devices/system/cpu
grep . cpufreq/policy0/* 
grep . intel_pstate/* 
grep . cpu0/power/* 

[root@shakperf ~]# sh t.sh
  cpufreq/policy0/affected_cpus:0
  cpufreq/policy0/cpuinfo_max_freq:3600000
  cpufreq/policy0/cpuinfo_min_freq:1200000
  cpufreq/policy0/cpuinfo_transition_latency:20000
  cpufreq/policy0/related_cpus:0
  cpufreq/policy0/scaling_available_governors:conservative ondemand userspace powersave performance schedutil 
  cpufreq/policy0/scaling_cur_freq:2134050
  cpufreq/policy0/scaling_driver:intel_cpufreq
  cpufreq/policy0/scaling_governor:performance
  cpufreq/policy0/scaling_max_freq:3600000
  cpufreq/policy0/scaling_min_freq:3600000
  cpufreq/policy0/scaling_setspeed:<unsupported>
  intel_pstate/max_perf_pct:100
  intel_pstate/min_perf_pct:100
  intel_pstate/no_turbo:0
  intel_pstate/num_pstates:25
  intel_pstate/status:passive
  intel_pstate/turbo_pct:53
  cpu0/power/control:auto
>>  cpu0/power/energy_perf_bias:0
  cpu0/power/pm_qos_resume_latency_us:0
  cpu0/power/runtime_active_time:0
  cpu0/power/runtime_status:unsupported
  cpu0/power/runtime_suspended_time:0

Comment 7 Jan Žerdík 2022-11-14 10:07:36 UTC
I'm sorry. Did you find any way to identify newer cpus and how we should behave when intel_pstate is disabled?

Comment 8 Joe Mario 2022-11-29 17:44:22 UTC
Hi Jan:

RE: Did you find any way to identify newer cpus and how we should behave when intel_pstate is disabled?
There should be no need to identify cpus.

The algorithm below from comment 0 is still valid for Intel cpus, with one temporary qualification noted below:

> EPP_DIR=/sys/devices/system/cpu/cpufreq/policyX/
> EPB_DIR=/sys/devices/system/cpu/cpuX/power/
>
>   If ($EPP_DIR/energy_performance_available_preferences exists) {
>     a) Check that file for available EPP preferences to verify
>        the desired EPP value is available.
>     b) Then write the desired (string) policy to
>        $EPP_DIR/energy_performance_preference.
>   }
>   else {  /* energy_performance_available_preferences does not exist */
>     a) Then write energy_perf_bias (sysfs file $EPB_DIR/energy_perf_bias)
>        using the x86_energy_perf_policy binary.
>        This should only be true on older processors.
>   }

That last comment "should only be true on older processors" needs qualification.
Currently it will also be true when intel_pstate is disabled.  There should be an upstream patch that Prarit is looking for that will correct this.  With that patch, the above algorithm will work correctly with or without intel_pstate.  But TuneD does not need to wait for Prarit's patch. 

To summarize, TuneD need not identify newer cpus.  It should just use the above algorithm.
@prarit Please confirm the above algorithm is valid.  (It should be.)

Comment 9 Joe Mario 2022-11-30 14:05:59 UTC
Jan:
Here is a clearer version of the algorithm that TuneD should use.
Prarit: Please confirm.

if (energy_performance_available_preferences exists) {
   if (the user-requested preference is one of the available preferences) {
      write that value to the energy_performance_preference file
   } else {
      return // Because the requested value is not valid.
   }
} else {  /* energy_performance_available_preferences does not exist */
   /*
    * This case should only be true on Intel cpus older than Skylake,
    * or on Skylake & newer cpus not booted with the intel_pstate driver enabled
    * and without an upcoming kernel patch [1].
    */
   if (an energy_perf_bias value was also requested) {
      write it directly to energy_perf_bias using the x86_energy_perf_policy binary [2]
   } else {
      return
   }
}

[1] There's an upstream kernel patch Prarit is looking to backport that will always
    make the energy_performance_{preference,available_preferences} files available
    even if the kernel was booted with intel_pstate=disable.
    TuneD does not need to wait for this patch to implement the above algorithm.

[2] The x86_energy_perf_policy binary should be used because in some cases the
    energy_perf_bias file will not exist.  If that file exists, the x86_energy_perf_policy
    binary will use it.  Otherwise it will write the appropriate MSR directly to update the
    energy_perf_bias value.

Joe
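
The else branch and footnote [2] amount to delegating the EPB write to the tool. Here is a minimal Python sketch of that delegation, not TuneD's actual code: `set_epb_with_tool` and its injectable `run` parameter are hypothetical names. The `--cpu` and `--epb` options are an assumption based on the x86_energy_perf_policy(8) man page and should be checked against the installed version.

```python
import subprocess


def set_epb_with_tool(epb, cpu, run=subprocess.run):
    """Ask x86_energy_perf_policy to set energy_perf_bias for one CPU.

    The tool decides by itself whether to go through the sysfs file or to
    write the MSR, which is the reason footnote [2] gives for using it.
    Returns True if the tool exited successfully.
    """
    if not 0 <= int(epb) <= 15:  # EPB is a 4-bit field
        raise ValueError(f"EPB must be in 0..15, got {epb}")
    # Assumed command line; verify against the man page of the installed tool.
    argv = ["x86_energy_perf_policy", "--cpu", str(cpu), "--epb", str(epb)]
    result = run(argv, check=False)
    return result.returncode == 0
```

Injecting `run` keeps the sketch testable on machines where the tool is not installed.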

Comment 10 Prarit Bhargava 2022-12-09 14:07:31 UTC
Sorry for the delay.  I was busy and then I had to go back through email to refresh my memory of the process.  The algorithm described in comment #9 is correct.

P.

Comment 11 Joe Mario 2023-01-17 21:17:08 UTC
Hi Jan:
Per our offline discussions, here is an updated algorithm.
Please delete the one from comment #9.

The reasons for this update are:

The algorithm from comment #9 opens the possibility that some customer will complain about performance because we'll no longer be setting EPB to 0 as we've been doing all along.   When I look through all the ./arch/x86/kernel/cpu/intel.c history, it's pretty clear that EPB=0 is performance mode and the upstream default of EPB=6 is the "normal" mode.  (Even though our testing has shown no performance regression when setting EPB=6 vs EPB=0.)

Recall that the motivation for all this is to avoid setting an EPB value on any cpu that doesn't like that value (currently only Alder Lake, where bad things can happen if EPB<=6).   The good news is the kernel will correctly adjust the user-specified value, as long as the user updates EPB via the sysfs file.

The x86_energy_perf_policy tool will bypass the kernel and will write the EPB MSR directly if the sysfs EPB file does not exist. 
We know this happens on pre-Skylake cpus which is OK.  But we can't guarantee it won't happen on newer cpus where it won't always be OK.

Here's the updated algorithm:

if (lscpu Flags show 'hwp_epp') {
    /*
     * The cpu is Skylake or newer.
     */
    if (energy_performance_available_preferences exists) {
        /*
         * The intel_pstate driver is being used.
         */
        if (the user requested preference is an available preference) {
            write that value to the energy_performance_preference file
        }
    }

    if (an energy_perf_bias value was also requested) {
        /*
         * Certain cpus newer than Skylake have restrictions on what
         * EPB values can be specified.  The kernel manages that but
         * only if the sysfs files are written to directly.
         * We don't want to use the x86_energy_perf_policy tool here
         * because under certain circumstances, it may write the
         * EPB MSR directly, bypassing any kernel checks.
         */
        if (energy_perf_bias file exists) {
            write that value directly to the sysfs energy_perf_bias file
        }
    }
    return

} else {
    /*
     * Getting here means the cpu does not support 'hwp_epp',
     * e.g. it's older than Skylake.
     * It's important here to use the x86_energy_perf_policy tool, as
     * it will know how to write the EPB MSR directly in case the EPB
     * file does not exist.
     */
    if (an energy_perf_bias value was requested) {
        write it directly to energy_perf_bias using the x86_energy_perf_policy tool
    }
    return
}

Joe
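
The updated algorithm can be expressed as a side-effect-free decision function, which makes the branches easy to check. This Python sketch is illustrative only and is not taken from the upstream PR: the function names and action labels are hypothetical. It also assumes the flags lscpu reports can be read from the "flags" line of /proc/cpuinfo.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the flag set of the first CPU from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def plan_energy_settings(cpu_flags, epp_available, epb_file_exists,
                         requested_epp=None, requested_epb=None):
    """Return the list of actions the updated algorithm would take.

    cpu_flags:       set of CPU flag names
    epp_available:   list of available EPP strings, or None if the
                     energy_performance_available_preferences file is missing
    epb_file_exists: whether the sysfs energy_perf_bias file exists
    """
    actions = []
    if "hwp_epp" in cpu_flags:
        # Skylake or newer: only touch sysfs, so the kernel can vet values.
        if epp_available is not None and requested_epp in epp_available:
            actions.append(("write_sysfs_epp", requested_epp))
        if requested_epb is not None and epb_file_exists:
            actions.append(("write_sysfs_epb", requested_epb))
    elif requested_epb is not None:
        # Older cpu: the tool may write the EPB MSR directly if needed.
        actions.append(("run_x86_energy_perf_policy", requested_epb))
    return actions
```

For example, with `hwp_epp` present but neither sysfs file available, the function returns an empty list. That matches the intent of never falling back to a direct MSR write on newer cpus.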

Comment 14 Jaroslav Škarvada 2023-02-08 15:35:59 UTC
Upstream PR:
https://github.com/redhat-performance/tuned/pull/479

Comment 22 errata-xmlrpc 2023-05-09 08:26:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (tuned bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2585

