920289 – Regular hard freezes with 3.9 kernels (intel_pstate_driver)

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	19
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	952244

Reported:	2013-03-11 17:31 UTC by Adam Williamson
Modified:	2013-10-08 19:15 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Clones:	952244
Environment:
Last Closed:	2013-10-08 17:31:47 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
photo of the trace (324.22 KB, image/jpeg) 2013-03-11 17:31 UTC, Adam Williamson

Description Adam Williamson 2013-03-11 17:31:24 UTC

Created attachment 708543
photo of the trace

Running 3.9.0-0.rc1.git0.5.1.fc19.x86_64 - which is a scratch build from the Rawhide spec as of 3.9.0-0.rc1.git0.5.fc19 with http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=50db54340c0412752577001b5ab3b54e6f3b9383 added - I've been seeing the system hard freeze every so often (once or twice a day). Usually it just completely freezes at the desktop, nothing in the system logs after a reboot and no ssh access possible, so I can't get much useful info. But yesterday I saw a kernel oops. Not 100% sure it's the same bug - it could be some _other_ one - but either way, it's obviously worth reporting. I took a picture of the trace: I'll attach it.

System is a self-built one: i7-2600k CPU on a P8P67 Deluxe, with Nouveau 9600 GT graphics. I can provide lspci output or whatever else if the hardware's important.

Comment 1 Josh Boyer 2013-03-11 17:34:31 UTC

Can you add "pause_on_oops=60" to your command line and try to capture the top of the stacktrace?

Comment 2 Adam Williamson 2013-03-13 21:18:59 UTC

Added and running. Of course, it hasn't crashed since then...

Comment 3 Dave Jones 2013-03-19 15:39:32 UTC

there may be a better trace of this in bug 923102.

Comment 4 Dirk Brandewie 2013-03-19 16:53:46 UTC

Can you attach your kernel config? I will try to reproduce this. I don't have a system that can use the nouveau driver, though.

Comment 5 Adam Williamson 2013-03-19 18:07:53 UTC

I still haven't seen this again. I think it got fixed at rc2 or rc3, for me.

Comment 6 Dave Jones 2013-03-19 18:51:22 UTC

the trace in 923102 is from rc3, so this clearly isn't fixed.

let's just dupe this bug there, because that at least seems to be debuggable.

*** This bug has been marked as a duplicate of bug 923102 ***

Comment 7 Dave Jones 2013-03-21 14:50:25 UTC

re-opening, as it's not the same bug.

Adam, 3.9.0-0.rc3.git1.2 has a patch that should reduce the spew a little so hopefully we can see the top of the trace.

Comment 8 Adam Williamson 2013-03-21 17:06:03 UTC

OK, I'll grab that and try not to do anything very important while I wait for it to explode =)

Comment 9 Adam Williamson 2013-03-21 17:13:30 UTC

In case it helps, from yum history (how did we ever live without that?) it looks like I ran the nodebug build of kernel-3.9.0-0.rc2.git0.3.fc19.x86_64 for about a week, between 2013-03-12 and 2013-03-19, which was the time I didn't see the bug. On 2013-03-19 I updated to kernel-3.9.0-0.rc3.git0.5.fc20.x86_64 (also from nodebug), and it started happening again.

Comment 10 Adam Williamson 2013-03-22 20:32:29 UTC

I've been running 3.9.0-0.rc3.git1.3.fc19.x86_64 for over a day now, and have not hit the bug. I wonder if the changes to kernel config for Rawhide post-f19 affect this for me somehow?

Comment 12 Parag Warudkar 2013-03-28 15:48:57 UTC

The bug was never easily reproducible. I have been running mainline 3.9.x (almost daily recompiles), and the first time this happened to me was, as far as I remember, soon after I enabled the PSTATE timer config option. After that it happened twice, followed by a lull. Then with yesterday's git pull it surfaced again. So I am pretty sure it is still there, at least in mainline; it just takes time to pop up.

I posted some initial analysis and latest oops photo yesterday - http://marc.info/?l=linux-kernel&m=136443537224510&w=2

Comment 13 Fedora End Of Life 2013-04-03 19:29:38 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 14 Dirk Brandewie 2013-04-04 16:00:32 UTC

I was finally able to see this happen and capture debug info. Here is the patch to keep the driver from setting up a race with itself.

I will get this queued for the next RC.


commit f404a661b000499b002919ffda43c8cb8c5d614d
Author: Dirk Brandewie <dirk.brandewie>
Date:   Thu Apr 4 08:55:29 2013 -0700

    cpufreq/intel_pstate: Set timer timeout correctly
    
    The current calculation of the delay time is wrong and a cut and paste
    error from a previous experimental driver.  This can result in the
    timeout being set to jiffies + 1 which setup the driver to race with
    it's self if the apic timer interrupt happen at just the right time.
    
    
    https://bugzilla.redhat.com/show_bug.cgi?id=920289
    
    Reported-by: Adam Williamson <awilliam>
    Reported-by: Parag Warudkar <parag.lkml>
    
    Signed-off-by: Dirk Brandewie <dirk.brandewie>
---
 drivers/cpufreq/intel_pstate.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 43ffe1c..4d6b988 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -502,7 +502,6 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
 
 	sample_time = cpu->pstate_policy->sample_rate_ms;
 	delay = msecs_to_jiffies(sample_time);
-	delay -= jiffies % delay;
 	mod_timer_pinned(&cpu->timer, jiffies + delay);
 }
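
To see why the removed line can produce a timeout of jiffies + 1, here is a minimal userspace sketch of the arithmetic. It is not the driver code: `old_timer_delay`, `new_timer_delay`, and `check_example` are hypothetical names, a plain `unsigned long` stands in for the kernel's `jiffies` counter, and the period of 10 jiffies is an assumed value chosen only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the delay calculation before the patch.
 * "now" stands in for the kernel's jiffies counter; "delay" is the sample
 * period already converted to jiffies (must be non-zero).
 */
unsigned long old_timer_delay(unsigned long now, unsigned long delay)
{
	/* The removed line: align the expiry to a multiple of the period. */
	delay -= now % delay;
	return delay;
}

/* Model of the calculation after the patch: always wait a full period. */
unsigned long new_timer_delay(unsigned long now, unsigned long delay)
{
	(void)now; /* the current time no longer affects the delay */
	return delay;
}

/* Worked example with an assumed period of 10 jiffies. */
void check_example(void)
{
	/* 19 % 10 == 9, so the old code arms the timer just 1 jiffy ahead. */
	assert(old_timer_delay(19, 10) == 1);
	/* On an exact multiple the old code still waits the whole period. */
	assert(old_timer_delay(20, 10) == 10);
	/* The patched calculation waits the full period regardless of "now". */
	assert(new_timer_delay(19, 10) == 10);
}
```

With the subtraction in place, the delay shrinks to a single jiffy whenever the counter sits one tick short of a multiple of the period, which matches the jiffies + 1 case the commit message describes.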

Comment 15 Adam Williamson 2013-04-04 21:26:50 UTC

Thanks for that, sorry I wasn't able to provide better debugging info.

Comment 16 Adam Williamson 2013-04-11 03:25:41 UTC

I just booted 3.9.0-0.rc6.git0.1.fc19.x86_64 and it hard froze in less than 30 minutes. Didn't show the trace though :/ I'll see if I can catch it and see if it's different now.

Comment 17 Josh Boyer 2013-09-18 20:54:00 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 18 Josh Boyer 2013-10-08 17:31:47 UTC

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 19 Adam Williamson 2013-10-08 19:15:52 UTC

I haven't been in the same country as the system lately, so I couldn't check this...I'll re-open it if it comes up again when I'm home.
