205305 – cpu scaling fails, locked on lowest setting at all times

Summary: cpu scaling fails, locked on lowest setting at all times

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-09-05 22:18 UTC by David Nielsen
Modified:	2007-11-30 22:11 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-02-01 14:53:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Output from /proc/cpuinfo under heavy load (7.11, 6.88, 5.35) (1.25 KB, text/plain) 2006-09-05 22:18 UTC, David Nielsen	no flags
dmesg output (40.68 KB, text/plain) 2006-09-05 22:18 UTC, David Nielsen	no flags
dmesg output with cpufreq.debug=7 set and printk inserted (23.88 KB, text/plain) 2006-09-21 15:55 UTC, David Nielsen	no flags
dmesg since boot with cpufreq.debug=7 (33.23 KB, text/plain) 2006-09-23 00:50 UTC, Thomas J. Baker	no flags
disassembled DSDT for the system (159.96 KB, text/plain) 2006-10-11 12:45 UTC, David Nielsen	no flags

Description David Nielsen 2006-09-05 22:18:15 UTC

Description of problem:
My lovely AMD64 X2 doesn't scale ondemand as expected, it remains locked at the
lowest performance setting at all times which is rather unfortunate.

Version-Release number of selected component (if applicable):
cpuspeed-1.2.1-1.40.fc6
kernel-2.6.17-1.2617.2.1.fc6

How reproducible:
100%

Steps to Reproduce:
1. Boot lovely AMD64 X2 setup
  
Actual results:
Both cores set at 1GHz at all times

Expected results:
Correct scaling through to 2.2GHz per core under load

Additional info:

Comment 1 David Nielsen 2006-09-05 22:18:15 UTC

Created attachment 135611 [details]
Output from /proc/cpuinfo under heavy load (7.11, 6.88, 5.35)

Comment 2 David Nielsen 2006-09-05 22:18:49 UTC

Created attachment 135612 [details]
dmesg output

Comment 3 David Nielsen 2006-09-09 17:10:46 UTC

This posting to LKML indicates that this is a kernel issue, so I'm moving the bug
into DaveJ territory.

http://www.ussg.iu.edu/hypermail/linux/kernel/0609.1/0380.html

I looked over the git changelogs for cpufreq.c which I gather would be the
vector for the wrong return and I found:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3bcb09a35641f2840bd59d8f82154f830dca282c

The interesting part of the change goes:
+ err = -EBUSY;
+ if (__find_governor(governor->name) == NULL) {
+         err = 0;
+         list_add(&governor->governor_list, &cpufreq_governor_list);

Since we always seem to return -EBUSY according to the posting, that would mean

+ if (__find_governor(governor->name) == NULL) {

is the cause of all this havoc. It does not seem that we ever pass that test, or
err would have been set to 0 and all would be well.
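
To make the control flow of the quoted lines concrete, here is a minimal Python model of them. It is a sketch and not kernel code: `register_governor`, `find_governor`, and the plain list are hypothetical stand-ins for the registration function, `__find_governor()`, and `cpufreq_governor_list`. Read in isolation, the hunk returns -EBUSY only when a governor with the same name is already on the list.

```python
import errno

# Stand-in for cpufreq_governor_list (assumption: a plain list of names).
governor_list = []

def find_governor(name):
    """Model of __find_governor(): the name if registered, else None."""
    return name if name in governor_list else None

def register_governor(name):
    """Model of the quoted hunk: start with -EBUSY, clear it on success."""
    err = -errno.EBUSY
    if find_governor(name) is None:
        err = 0
        governor_list.append(name)
    return err

first = register_governor("ondemand")   # name not yet on the list
second = register_governor("ondemand")  # same name registered again
print(first, second)
```

If that model is right, a persistent -EBUSY would point at a repeated registration of the same governor name rather than at the test never being reached.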

Comment 4 Dave Jones 2006-09-12 00:54:27 UTC

Those messages about being unable to turn on the fan are somewhat disturbing.
Maybe the ACPI maintainers have some clues what's going on here?

Comment 5 David Nielsen 2006-09-12 01:15:38 UTC

Don't get me wrong, the fan is running but ACPI has been complaining like that
for ages.

I have one open for that issue mainly because it's spewing that message all over
my active VC making it rather unusable.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199812

Comment 6 Ryan Skadberg 2006-09-19 20:13:44 UTC

I'm seeing something similar.  Sadly, it doesn't seem to be the exact same thing.

I am running on an Intel Centrino.  When I first boot up, things work fine and the
CPU scales as you would expect.  Some time later (usually not too long), this stops
working and it sticks at the lowest setting.  I've tried running things like:

/usr/bin/cpufreq-set --max 1.4Ghz
/usr/bin/cpufreq-set -f 1.4GHz

to no avail, it still continues to sit at the 600MHz setting.

Just to add a little more fun, every once in a while, it will start scaling
again correctly for a bit and then stop again.

It looks like this bug was added around the beginning of Sept and that's about
when I started seeing this (maybe even a week or so earlier), so it seems that
something changed in that timeframe that caused this to happen.

Comment 7 Dave Jones 2006-09-19 21:07:59 UTC

Ryan, is there anything in dmesg when it does this limiting ?
(Try booting with cpufreq.debug=7 too for more info)

Comment 8 Ryan Skadberg 2006-09-20 17:21:05 UTC

So, some interesting progress on this from my end.  After seeing the fan
messages in David's output, I decided to check my fan.  I noticed it didn't seem
to be running.  I played around a bit and didn't seem to be able to force it on.
I also noticed the computer was running at about 54 degrees Celsius.

I opened up my laptop, cleaned out the fan (which was full of crap) and
rebooted.  The fan spun right up.  Since doing this, my computer seems to be
scaling no problem.  I forced the CPU speed up to 1.4GHz and it has stayed there
since (in the past, it would eventually reset itself to 600MHz).  I looked
at the temperature now and it is at 45 degrees Celsius.

So, it SEEMS like my machine may have been overheating and the computer was
trying to compensate by lowering the CPU level.  Is there a hard limit somewhere
in the kernel (again probably something that would have been added in late
August/early September) for temperature?  Seems like this may have been causing
my issues.

Will let you know if this stops working, but I've been running for about 2 1/2
hours without issue.

Comment 9 Ryan Skadberg 2006-09-20 22:54:50 UTC

After 8 hours, things still seem to be working just fine.  Seems like it was the
heat, for me at least.

Comment 10 David Nielsen 2006-09-20 23:43:01 UTC

In case it matters in terms of limiting the variables, this is a Shuttle SN95G5V3
(motherboard is a Shuttle FN95) and the CPU is an AMD64 X2 4400+ with 1 MB of
cache per core (I'm unsure of AMD's current naming scheme).

This, unlike Ryan's issue, is not a heat issue. I checked by manually
recompiling the kernel with CPU scaling factored out and running the CPU at
maximum load for days (encoding Ogg Theora files of my DVD collection). The fan
speed scaled fine with the emitted heat under those circumstances, and no
crash or other indication of overheating occurred.

Comment 11 Venkatesh Pallipadi 2006-09-21 02:04:13 UTC

David,

When ondemand stops working, does /sys/..../cpufreq/cpuinfo_max_freq drop down
to the lowest freq? I guess it does, and if so, it should be happening due to a
call from the thermal code. If it is indeed happening that way, can you add some
print messages in drivers/acpi/processor_thermal.c, in
acpi_thermal_cpufreq_decrease() and acpi_thermal_cpufreq_notifier(),
and check when those are getting called?

Thanks,
Venki

Comment 12 David Nielsen 2006-09-21 02:32:26 UTC

Stops working?? It doesn't appear to start working. It's locked hard on 1GHz.

Output of cpufreq-info (reduced to one CPU for readability):

analyzing CPU 1:
  driver: powernow-k8
  CPUs which need to switch frequency at the same time: 0 1
  hardware limits: 1000 MHz - 2.20 GHz
  available frequency steps: 2.20 GHz, 2.00 GHz, 1.80 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, performance
  current policy: frequency should be within 1000 MHz and 1000 MHz.
                  The governor "userspace" may decide which speed to use
                  within this range.
  current CPU frequency is 1000 MHz (asserted by call to hardware).

acpitool claims that the fan is on, so I don't get why I get all those ACPI
messages about not being able to turn the fan on.

It seems to me that ACPI is getting a tad confused here.

It's very non-responsive to changing the governor to performance (or any other
for that matter); there is not even error output or messages in dmesg.

I'll try to add the print messages and report back.

Comment 13 Venkatesh Pallipadi 2006-09-21 13:43:58 UTC

Oh. I thought you were trying to use the 'ondemand' governor and that failed.
Looks like you were just saying that the CPU frequency doesn't scale on demand
with the load. And the above messages say you were using the 'userspace'
governor too. My guess is that the problem is not about how the governor is
behaving, but about how the kernel finds out what freqs the CPU can run at, and
whether there is something in the kernel (like thermal) that is limiting this
frequency.

If you are recompiling the kernel, you should also make sure you have 
CPU_FREQ_DEBUG config option enabled and boot with cpufreq.debug=7. That will 
give a lot more messages related to cpufreq, which can give some hint about 
the problem.

Comment 14 David Nielsen 2006-09-21 15:55:16 UTC

Created attachment 136872 [details]
dmesg output with cpufreq.debug=7 set and printk inserted

dmesg output from kernel-2.6.17-1.2647 with cpufreq.debug=7 set and printk
calls inserted at various spots in the requested functions. 

* Next time I promise to remember to terminate my strings, it has been too long
since I coded anything real.

Comment 15 Len Brown 2006-09-21 19:50:02 UTC

Thanks for the debug log, David.
Seems that the ACPI thermal module is trying to passively cool
this system by lowering the maximum frequency, and that cpufreq
is doing exactly as requested.  After that fails, the thermal
module tries to enable a fan, which claims to fail.

(we should probably have some debug messages in thermal to make
 this easier to discover...)
  
What do you see when you dump the contents of /proc/acpi/thermal_zone/*/*

Comment 16 David Nielsen 2006-09-21 20:06:13 UTC

cooling mode:   active
<polling disabled>
state:                   passive 
temperature:             53 C
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xffff810003f6a298 
active[0]:               50 C: devices=0xffff81007ff8c810
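
Reading this dump against its own trip points: the zone reports 53 C, which is past the 50 C passive trip point, so the "passive" state (and the frequency cap that goes with it) is consistent with what the thermal module believes. A small Python sketch of that comparison, assuming the text layout shown in this comment; `parse_zone` and `passive_active` are hypothetical helper names, not part of any existing tool.

```python
import re

# Sample copied from the thermal_zone dump in this comment.
SAMPLE = """\
state:                   passive
temperature:             53 C
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xffff810003f6a298
active[0]:               50 C: devices=0xffff81007ff8c810
"""

def parse_zone(text):
    """Return {field: degrees C} for every line shaped like 'name:   NN C'."""
    temps = {}
    for line in text.splitlines():
        m = re.match(r"^([^:]+):\s+(\d+) C", line)
        if m:
            temps[m.group(1).strip()] = int(m.group(2))
    return temps

def passive_active(temps):
    """True when the reported temperature has reached the passive trip point."""
    return temps["temperature"] >= temps["passive"]

zone = parse_zone(SAMPLE)
print(zone, passive_active(zone))
```

With only 10 C between the passive trip point and the critical one, a machine idling in the low 50s would spend most of its time in passive cooling.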

Comment 17 Thomas J. Baker 2006-09-22 14:33:33 UTC

This is definitely happening for me on an Intel Pentium M laptop. The system has
been up for 1:20 and CPU scaling was working for some part of that time.

dmesg does show this kernel error (seems unrelated though):

=============================================
[ INFO: possible recursive locking detected ]
2.6.17-1.2647.fc6 #1
---------------------------------------------
java/3787 is trying to acquire lock:
 (slock-AF_INET6){-+..}, at: [<c05b392e>] sk_clone+0xd4/0x2d8

but task is already holding lock:
 (slock-AF_INET6){-+..}, at: [<f8b1d4c9>] tcp_v6_rcv+0x327/0x736 [ipv6]

other info that might help us debug this:
1 lock held by java/3787:
 #0:  (slock-AF_INET6){-+..}, at: [<f8b1d4c9>] tcp_v6_rcv+0x327/0x736 [ipv6]

stack backtrace:
 [<c04051ee>] show_trace_log_lvl+0x58/0x171
 [<c0405802>] show_trace+0xd/0x10
 [<c040591b>] dump_stack+0x19/0x1b
 [<c043b9e1>] __lock_acquire+0x778/0x99c
 [<c043c176>] lock_acquire+0x4b/0x6d
 [<c061539b>] _spin_lock+0x19/0x28
 [<c05b392e>] sk_clone+0xd4/0x2d8
 [<c05dc49b>] inet_csk_clone+0xf/0x72
 [<c05ed2d9>] tcp_create_openreq_child+0x1b/0x3a1
 [<f8b1c155>] tcp_v6_syn_recv_sock+0x271/0x5b3 [ipv6]
 [<c05ed834>] tcp_check_req+0x1d5/0x2e9
 [<f8b1b441>] tcp_v6_do_rcv+0x142/0x340 [ipv6]
 [<f8b1d883>] tcp_v6_rcv+0x6e1/0x736 [ipv6]
 [<f8b03a6f>] ip6_input+0x1c3/0x296 [ipv6]
 [<f8b03fdf>] ipv6_rcv+0x1d2/0x21f [ipv6]
 [<c05b9ab6>] netif_receive_skb+0x2e2/0x366
 [<c05bb42f>] process_backlog+0x99/0xfa
 [<c05bb612>] net_rx_action+0x9d/0x196
 [<c04293bf>] __do_softirq+0x78/0xf2
 [<c040668b>] do_softirq+0x5a/0xbe
 [<c04291b6>] local_bh_enable_ip+0xa9/0xcf
 [<c0615339>] _spin_unlock_bh+0x25/0x28
 [<c05b272f>] release_sock+0xb0/0xb8
 [<c05f5552>] inet_stream_connect+0x113/0x206
 [<c05b1692>] sys_connect+0x67/0x84
 [<c05b1d04>] sys_socketcall+0x8c/0x186
 [<c0403faf>] syscall_call+0x7/0xb
DWARF2 unwinder stuck at syscall_call+0x7/0xb
Leftover inexact backtrace:

I'm going to reboot with the cpufreq.debug=7 and see if that will reveal more.

Comment 18 Thomas J. Baker 2006-09-22 14:34:32 UTC

I guess I should add that I'm running 2.6.17-1.2647.fc6.

Comment 19 Thomas J. Baker 2006-09-23 00:50:20 UTC

Created attachment 136983 [details]
dmesg since boot with cpufreq.debug=7

With 2.6.18-1.2689.fc6 and cpufreq.debug=7, the initial range is (600000 -
2000000 kHz) and later gets changed to (600000 - 600000 kHz).
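
One way to catch the moment the range collapses is to compare the policy limit with the hardware limit in sysfs. A minimal Python sketch, assuming the standard cpufreq files `cpuinfo_max_freq` and `scaling_max_freq` (values in kHz) in the per-CPU `cpufreq` directory; `read_khz` and `is_clamped` are hypothetical helper names.

```python
from pathlib import Path

# Standard per-CPU cpufreq directory; change the CPU number as needed.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read_khz(cpufreq_dir, name):
    """Read one cpufreq sysfs file and return its value in kHz."""
    return int((Path(cpufreq_dir) / name).read_text().strip())

def is_clamped(cpufreq_dir=CPUFREQ_DIR):
    """True when the policy maximum is below what the hardware supports."""
    hw_max = read_khz(cpufreq_dir, "cpuinfo_max_freq")
    policy_max = read_khz(cpufreq_dir, "scaling_max_freq")
    return policy_max < hw_max

if __name__ == "__main__":
    if CPUFREQ_DIR.is_dir():
        print("policy max clamped:", is_clamped())
    else:
        print("no cpufreq directory found at", CPUFREQ_DIR)
```

Run periodically alongside dmesg timestamps, a check like this could show whether the clamp lines up with the thermal messages.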

Comment 20 Thomas J. Baker 2006-09-24 21:47:27 UTC

This bug is like one I had in FC4 (#137995) where it would get my max CPU freq
wrong, and the same workaround applies to this bug:

 echo -n "2000000" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

It went away with FC5 but is back again. With the 2689 kernel, this is the only
way I can get my cpu above 600MHz (the min.)

Comment 21 David Nielsen 2006-09-24 22:06:31 UTC

I have to disappoint you: manually setting the frequency does not work around the
issue. The ACPI system appears to be locking the frequency because it mistakenly
thinks it can't turn on the fan. It appears to be a malfunctioning failsafe.

Comment 22 Venkatesh Pallipadi 2006-09-24 22:40:41 UTC

It is related to the thermal driver. Rightly or wrongly, the thermal driver
thinks that the temperature is too high and tries to reduce the frequency to
control the temperature and/or tries to turn the fan on.
David: What do you see when you dump the contents
of /proc/acpi/thermal_zone/*/*

Comment 23 David Nielsen 2006-09-24 23:00:54 UTC

I assume by dump you mean read (I used cat). You can see the result in
comment #16, but here goes another sampling for my friends at Intel.

cooling mode:   active
<polling disabled>
state:                   passive 
temperature:             54 C
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xffff810003f6c298 
active[0]:               50 C: devices=0xffff81007ff8e810 

The fan is running in what the BIOS calls "smart fan" mode. I could adjust that
to have it run at full speed, making all manner of noise, to see if it
continues to scale down even when thermal issues are absolutely not present.

I also checked the CPU fan for technical errors or excessive amounts of dust;
none that I could determine were present. No problems were present on the
chipset fan either.

Comment 24 David Nielsen 2006-09-25 00:52:10 UTC

son of a .....

After updating my system and doing a reboot, I found this:

analyzing CPU 1:
  driver: powernow-k8
  CPUs which need to switch frequency at the same time: 0 1
  hardware limits: 1000 MHz - 2.20 GHz
  available frequency steps: 2.20 GHz, 2.00 GHz, 1.80 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, performance
  current policy: frequency should be within 1000 MHz and 2.20 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 2.20 GHz (asserted by call to hardware).

I tested that it worked by switching governors, and it seems that whatever Dave
did in the recent update volley made the issue go away. I don't like it
when things just stop being broken for no apparent reason, but in this case I'm
overjoyed.

Comment 25 David Nielsen 2006-09-28 23:46:24 UTC

The bug appears to be back with kernel-2.6.18-1.2699.fc6; it was perfectly fine
with kernel-2.6.18-1.2693.fc6.

Comment 26 David Nielsen 2006-10-11 12:45:06 UTC

Created attachment 138235 [details]
disassembled DSDT for the system

I've been playing a bit with various settings and it seems to work somewhat
reliably if you set the low noise fan setting but not when using the default
smart fan setting or the ultra low noise fan setting.

I've attached a disassembled DSDT for the system in the hope that it might
help.

Comment 27 David Nielsen 2007-01-11 23:06:45 UTC

This seems to have gone away. Is anyone else still experiencing this bug? If not,
then I think we can close this.
