Bug 559357 - RHEL5: powernow-k8 driver no longer skips values
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-01-27 21:13 UTC by Prarit Bhargava
Modified: 2010-10-13 19:16 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-13 19:16:01 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg output from 2.6.18-128.7.1.el5 -- scaling worked (19.75 KB, text/plain)
2010-01-28 03:17 UTC, D. Hugh Redelmeier
dmesg output from 2.6.18-164.el5 -- scaling does not work (19.74 KB, text/plain)
2010-01-28 03:19 UTC, D. Hugh Redelmeier
dmesg output from 2.6.18-164.el5.bz559357 -- scaling works with offending patch removed (19.99 KB, text/plain)
2010-01-29 04:26 UTC, D. Hugh Redelmeier
acpidump of misbehaving machine (64.03 KB, application/octet-stream)
2010-01-29 22:49 UTC, D. Hugh Redelmeier
_PSS disassembled, with added comments (2.25 KB, text/plain)
2010-01-29 22:53 UTC, D. Hugh Redelmeier
Initial patch (1.31 KB, patch)
2010-02-02 19:19 UTC, Prarit Bhargava
dmesg output from 2.6.18-164.el5.bz559357b16 -- scaling does not work (19.73 KB, application/text)
2010-02-04 02:10 UTC, D. Hugh Redelmeier

Description Prarit Bhargava 2010-01-27 21:13:09 UTC
The patch to fix this problem,
linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch, causes my
HP Pavilion a530n to be stuck at the top CPU frequency.

The BIOS problem is that there are two entries with bad frequencies
(0x9999999!) followed by three fine entries.

Before this patch, the bad entries were ignored (with warnings) and the good
entries were accepted.  Here's an extract from dmesg:
powernow-k8: Pre-initialization of ACPI failed
powernow-k8: Found 1 AMD Athlon(tm) 64 Processor 3200+ processors (1 cpu cores)
(version 2.20.00)
powernow-k8: invalid freq entries 3300000 kHz vs. 2147483048 kHz
powernow-k8: invalid freq entries 3300000 kHz vs. 2147483048 kHz
powernow-k8: 0 : fid 0xc (2000 MHz), vid 0x2
powernow-k8: 1 : fid 0xa (1800 MHz), vid 0x6
powernow-k8: 2 : fid 0x0 (800 MHz), vid 0xa
powernow-k8: ph2 null fid transition 0xc

After this patch, the existence of the bad entries means that all entries are
ignored.  Here's an extract from dmesg:
powernow-k8: Pre-initialization of ACPI failed
powernow-k8: Found 1 AMD Athlon(tm) 64 Processor 3200+ processors (1 cpu cores)
(version 2.20.00)
ACPI: Invalid BIOS _PSS frequency: 0x9999999 MHz
powernow-k8: BIOS error: maxvid exceeded with pstate 2
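
A note on the two numbers in these extracts: 0x9999999 is 161061273 in decimal, and multiplying that by 1000 in 32-bit unsigned arithmetic wraps to 2147483048, which matches the kHz figure in the older kernel's warning.  That the driver converts MHz to kHz this way is an assumption inferred from the numbers, not something the logs state.  The helper names below are hypothetical; this is only a minimal sketch of the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: MHz -> kHz in 32-bit unsigned arithmetic.
 * The product wraps modulo 2^32 when it does not fit. */
static uint32_t mhz_to_khz_u32(uint32_t mhz)
{
    return (uint32_t)(mhz * 1000u);
}

/* Hypothetical helper: the same conversion widened to 64 bits,
 * which cannot wrap for any 32-bit input. */
static uint64_t mhz_to_khz_u64(uint32_t mhz)
{
    return (uint64_t)mhz * 1000u;
}

/* Usage: the bogus _PSS frequency from this report. */
static void check_reported_value(void)
{
    assert(mhz_to_khz_u64(0x9999999u) == 161061273000ULL);
    assert(mhz_to_khz_u32(0x9999999u) == 2147483048u);
}
```

If that reading is right, both messages describe the same bad BIOS value; only the units and the integer width differ.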

Is there any way that the bad entries can be skipped but the good ones
accepted?

Before the patch, the bad entries were skipped by code in
kernel-2.6.18/vanilla/arch/i386/kernel/cpu/cpufreq/powernow-k8.c and
kernel-2.6.18/linux-2.6.18.x86_64/arch/i386/kernel/cpu/cpufreq/powernow-k8.c. 
Look for "invalid freq entries".
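
To make the request concrete, here is a minimal sketch of the skip-the-bad, keep-the-good behaviour described above.  It is not the code from powernow-k8.c: the struct, the function name, and the MAX_PLAUSIBLE_MHZ bound are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: anything above 10 GHz is treated as a bogus BIOS value. */
#define MAX_PLAUSIBLE_MHZ 10000u

/* Hypothetical, simplified P-state entry (frequency only). */
struct pstate {
    uint32_t freq_mhz;
};

/* Copy the plausible entries from in[0..n) to out[], skipping the rest.
 * Returns the number of entries kept; out must have room for n entries. */
static size_t keep_valid_pstates(const struct pstate *in, size_t n,
                                 struct pstate *out)
{
    size_t kept = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i].freq_mhz == 0 || in[i].freq_mhz > MAX_PLAUSIBLE_MHZ)
            continue;   /* skip this entry, keep scanning the table */
        out[kept++] = in[i];
    }
    return kept;
}
```

Fed the five entries from this machine (two of 0x9999999 MHz plus 2000, 1800 and 800 MHz), such a filter would keep the three plausible ones instead of rejecting the whole table.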

Alternatively, is there a kernel parameter that I could use to bypass this
problem?  There is no chance that the BIOS will be fixed at this late date.

Comment 1 Prarit Bhargava 2010-01-27 21:14:25 UTC
Hugh,

Could you please do a 

x86info -a | grep Pstate

and paste the contents in this BZ?

Thanks,

P.

Comment 3 D. Hugh Redelmeier 2010-01-27 21:56:17 UTC
That pipeline produces nothing on standard output (i.e. "Pstate" does not appear in the x86info output).  stderr gets "munmap: Invalid argument".

Peering into the output of the x86info command, I wonder if this is what you need to see.  It comes at the end.

 FID changes won't happen
 VID changes won't happen
 Voltage ID codes: Maximum=2.000V Startup=1.900V Currently=2.000V
 Frequency ID codes: Maximum=9.0x Startup=9.0x Currently=9.0x
 Decoding BIOS PST tables (maxfid=c, startvid=2)
 Found PSB header at 0x2b10db6d30a0
 Table version: 0x14
 Sorry, only v1.2 tables supported right now

[For the benefit of other readers: I originally tacked my comment onto 
https://bugzilla.redhat.com/show_bug.cgi?id=500311.  Prarit decided it should be a separate bz.]

Comment 4 Prarit Bhargava 2010-01-27 23:09:53 UTC
Hugh, could you run sosreport on your system and attach the report here?

Thanks,

P.

Comment 5 D. Hugh Redelmeier 2010-01-28 03:16:16 UTC
It looks as if the sosreport program might gather confidential information.  Certainly sosreport says "[the information collected] will be considered confidential information."  I don't think that you can do that with a bz attachment.

I should say that I am not currently a Red Hat customer.  I'm using CentOS.

What would you like to know about my system?

Some information:

- it is running CentOS 5.4 on x86_64 with all updates

- the last released kernel that allowed clock scaling was 2.6.18-128.7.1

- the next kernel that I tried did not do clock scaling: 2.6.18-164.el5

- system web page from vendor: http://h10025.www1.hp.com/ewfrf/wc/product?product=404646&lc=en&cc=ca&dlc=en&lang=en&tmp_track_link=ot_we/prodlink/en_ca/404646/loc:0&cc=ca

- motherboard specifications: http://h10025.www1.hp.com/ewfrf/wc/document?docname=c00064822&lc=en&dlc=en&cc=ca&product=404646&lang=en

I will attach dmesg output for each.  That should show a fair bit of detail about the hardware.  The first message highlights the bit of the dmesg log that seems relevant.

Comment 6 D. Hugh Redelmeier 2010-01-28 03:17:44 UTC
Created attachment 387222 [details]
dmesg output from 2.6.18-128.7.1.el5 -- scaling worked

Comment 7 D. Hugh Redelmeier 2010-01-28 03:19:44 UTC
Created attachment 387224 [details]
dmesg output from 2.6.18-164.el5 -- scaling does not work

Comment 8 Prarit Bhargava 2010-01-28 14:04:53 UTC
Reporter refuses to do a sosreport and is running CentOS.

CLOSED as INSUFFICIENT_DATA.

P.

Comment 9 D. Hugh Redelmeier 2010-01-29 04:24:18 UTC
If you give me a secure and confidential way to submit an sosreport, I will do so.


I built kernel 2.6.18-164.11.1.el5, suppressing the patch linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch.  The resulting kernel seems to work and does allow frequency scaling.  I will attach dmesg output.

Comment 10 D. Hugh Redelmeier 2010-01-29 04:26:43 UTC
Created attachment 387493 [details]
dmesg output from 2.6.18-164.el5.bz559357 -- scaling works with offending patch removed

Comment 11 D. Hugh Redelmeier 2010-01-29 22:49:31 UTC
Created attachment 387665 [details]
acpidump of misbehaving machine

This should be the only system information required to understand why the kernel code can no longer scale the CPU frequency.

When the acpidump was executed, it produced the message "Wrong checksum for OEMB!".

Comment 12 D. Hugh Redelmeier 2010-01-29 22:53:47 UTC
Created attachment 387666 [details]
_PSS disassembled, with added comments

This is the "human readable" version of _PSS on the system.  It turns out that the invalid entries are at the end.

If the working entries are to be believed, the power consumption is over 50 Watts higher in idle mode when CPU frequency scaling is not used.  Ouch.

Comment 13 D. Hugh Redelmeier 2010-01-30 06:21:39 UTC
See also http://bugzilla.kernel.org/show_bug.cgi?id=15174

Comment 14 Prarit Bhargava 2010-02-01 20:38:52 UTC
Hugh, I *think* I might have a quick solution.  Do you have the ability to install the kernel source, patch, and build?  (It seems like you do given the data you've provided me in this BZ)

P.

Comment 15 D. Hugh Redelmeier 2010-02-02 03:32:11 UTC
Thanks, Prarit Bhargava

Yes, I have the ability to patch, build, and install a kernel.  My base is a CentOS kernel.  I don't know all the differences between a CentOS and RHEL kernel, but any patch you propose would surely apply.

Since upstream has expressed interest in this problem, you might wish to post to the bz entry I mentioned in #13.  There is no point in RHEL and the Mainline kernel diverging uselessly.

I don't urgently need a fix since I have one already, as mentioned in #9.  Even so, I would be happy to test a fix you propose.  It would be really good if it could also be tested on a machine that required the original linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch.

Comment 16 Prarit Bhargava 2010-02-02 19:19:47 UTC
Created attachment 388367 [details]
Initial patch

Hugh,

Here's a patch that might fix the problem.  Please let me know how it goes...

If it succeeds or fails, could you please post the dmesg log from the boot?

Thanks,

P.

Comment 17 D. Hugh Redelmeier 2010-02-02 21:00:25 UTC
Thanks, Prarit.

Do you wish me to test with or without linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch?

Comment 18 Prarit Bhargava 2010-02-03 13:39:37 UTC
(In reply to comment #17)
> Thanks, Prarit.
> 
> Do you wish me to test with or without
> linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch?  

Hey Hugh -- I'd like you to test with linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch.

P.

Comment 19 D. Hugh Redelmeier 2010-02-04 02:10:04 UTC
Created attachment 388688 [details]
dmesg output from 2.6.18-164.el5.bz559357b16 -- scaling does not work

dmesg from stock 2.6.18-164.el5 + "initial patch" (see #16).

This kernel does not frequency scale as far as I can tell.

Comment 20 D. Hugh Redelmeier 2010-02-04 16:00:03 UTC
I just went back and checked the BUILD directory.  Yes, the patch is reflected in the source code that I built and tested.

Prarit: could you have a look at the patch I put in the kernel.org bz entry?  I don't know enough to be sure that it is the right approach, but I imagine that you do.

Comment 21 D. Hugh Redelmeier 2010-03-26 03:24:53 UTC
After working on this for a while upstream, this is the conclusion:

The patch linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch backported into the RHEL kernel introduced the problem observed on my machine.

The original patch, in the upstream kernel, would not have caused this problem.

The reason is that the upstream patch was applied after http://git.moblin.org/cgit.cgi/acpica/commit/?id=ffd0eca830ee3f762e387fe5519fe34fc44b0231

This missing patch eliminates package elements beyond the number that the package specifies.

In the case of my machine, the _PSS package says it has 3 elements.  It actually has more, but the extra ones are invalid.  The invalid elements cause code from linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch to discard my machine's whole _PSS.  Upstream would only consider the 3 valid elements.
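
The interaction can be sketched in a few lines of C.  This is an illustration under stated assumptions, not ACPICA or RHEL code: the struct, the function names, and the 10 GHz plausibility bound are hypothetical.  It models an all-or-nothing frequency check applied either to every element present or only to the number of elements the package declares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: anything above 10 GHz counts as an invalid frequency. */
#define MAX_PLAUSIBLE_MHZ 10000u

/* Hypothetical model of a _PSS package: the element count it declares,
 * the number of elements actually present, and their frequencies. */
struct pss_package {
    size_t declared_count;
    size_t actual_count;
    const uint32_t *freq_mhz;
};

/* All-or-nothing check: one bad frequency rejects the whole table. */
static bool all_entries_valid(const uint32_t *freq, size_t n)
{
    assert(freq != NULL);
    for (size_t i = 0; i < n; i++)
        if (freq[i] == 0 || freq[i] > MAX_PLAUSIBLE_MHZ)
            return false;
    return true;
}

/* Check every element present, including any beyond the declared count. */
static bool accept_without_truncation(const struct pss_package *p)
{
    return all_entries_valid(p->freq_mhz, p->actual_count);
}

/* Drop elements beyond the declared count first, then check the rest. */
static bool accept_with_truncation(const struct pss_package *p)
{
    size_t n = p->declared_count < p->actual_count
             ? p->declared_count : p->actual_count;
    return all_entries_valid(p->freq_mhz, n);
}
```

With a package that declares 3 elements but carries 2000, 1800, 800 MHz followed by two 0x9999999 entries, the first function rejects the table and the second accepts it, which is the difference between the two kernels as described above.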

So: since linux-2.6-acpi-check-_pss-frequency-to-prevent-cpufreq-crash.patch has been backported and included in the RHEL kernel, I suggest that ffd0eca830ee3f762e387fe5519fe34fc44b0231 (or a successor) be adopted.

To understand this better, please read https://bugzilla.kernel.org/show_bug.cgi?id=15174

Comment 22 Mark Langsdorf 2010-10-01 16:44:37 UTC
Could this be retested with the latest 5.x driver?  AMD has changed the way we handle p-states and I don't think this should be an issue any longer.

Comment 23 D. Hugh Redelmeier 2010-10-12 15:31:58 UTC
@Mark Langsdorf:

I don't know what you mean by the latest 5.5 driver.  I have updated to kernel-2.6.18-194.17.1.el5 and the problem is still there.

As I understand it, the problem is in code common to Intel and AMD, before the AMD-specific code is executed.  Furthermore, the problem is in RHEL, not kernel.org: RHEL cherry-picked patches; they adopted one that created this problem (and solved others for Intel) but not another that avoided it.

Comment 24 Prarit Bhargava 2010-10-13 17:31:58 UTC
Sorry D. Hugh, I've been really busy with some other critical issues and finally had time to come back to this.

I'm putting together a patch based on the suggested patch in the kernel.org bug.  I'm testing it on some "known good" systems and then I'll attach it to this BZ for you to test.

Would that work for you?

Again, sorry for the long delay,

P.

Comment 25 D. Hugh Redelmeier 2010-10-13 18:09:30 UTC
@Prarit:

That would be great for me.  I just wonder if it is worthwhile for Red Hat.

1) I haven't heard of others hitting this problem, so it probably doesn't affect many.  (I admit that not everyone affected by a problem reports it or even recognizes it.)

2) There is a small chance that the "fix" would break other systems that are currently working.  Not only does the code have to be correct, it has to not tickle any BIOS bugs that are currently latent.  I imagine that chance is very slight since kernel.org has used this fix already, but it still must be considered.

3) I can live with my current work-around: whenever a new kernel is shipped, I simply rebuild it with Patch 24199 removed.

Comment 26 Prarit Bhargava 2010-10-13 19:16:01 UTC
(In reply to comment #25)
> @Prarit:
> 
> That would be great for me.  I just wonder if it is worthwhile for Red Hat.
> 
> 1) I haven't heard of others hitting this problem, so it probably doesn't
> affect many.  (I admit that not everyone affected by a problem reports it or
> even recognizes it.)

Right -- and I suspect that some people may not have noticed.

> 
> 2) There is a small chance that the "fix" would break other systems that are
> currently working.  Not only does the code have to be correct, it has to not
> tickle any BIOS bugs that are currently latent.  I imagine that chance is very
> slight since kernel.org has used this fix already, but it still must be
> considered.

Yeah -- I'm thinking of using a boot parameter to enable the new check.

> 
> 3) I can live with my current work-around: whenever a new kernel is shipped, I
> simply rebuild it with Patch 24199 removed.

Okay ... you're more than welcome to do that.  I'll close this as WONTFIX for now.  If you have any problems please feel free to ping me directly.

P.

