Bug 768416

Summary: Machine locks up
Product: [Fedora] Fedora
Reporter: Pascal Patry <iscy>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 16
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-02-07 19:10:04 UTC
Attachments:
Description          Flags
/var/log/messages    none
/var/log/messages    none

Description Pascal Patry 2011-12-16 15:42:14 UTC
Created attachment 547848 [details]
/var/log/messages

Description of problem:
The machine locks up after 24 to 48 hours of uptime.

Version-Release number of selected component (if applicable):
Fedora 16 - Kernel 3.1.5-1

How reproducible:
I haven't noticed any trigger other than time.

Additional info:
The relevant part of /var/log/messages is attached.

I used to get _raw_spin_lock issues on kernel 3.1.0; since I updated, this problem has started to occur.

Comment 1 Josh Boyer 2011-12-16 16:05:50 UTC
Can you recreate this without the nvidia module loaded?

Comment 2 Pascal Patry 2011-12-16 16:21:29 UTC
Short answer: Yes, long answer..

I have another machine, running on:
Linux sheol 2.6.33.6-147.fc13.x86_64 #1 SMP

Yes, I agree, it's a bit old, but it easily reaches uptimes of more than 200 days. That computer doesn't have the same hardware, but it has the same silent graphics card and uses the exact same nvidia module. I know that the module taints the kernel and that the two kernels are different versions, but that setup has proved itself to be quite stable.

If you really want me to disable that module and reproduce it, I can do it.

Comment 3 Josh Boyer 2011-12-16 16:29:39 UTC
(In reply to comment #2)
> Short answer: Yes, long answer..
> 
> I have another machine, running on:
> Linux sheol 2.6.33.6-147.fc13.x86_64 #1 SMP

That's irrelevant to this bug report, sorry.

> If you really want me to disable that module and reproduce it, I can do it.

Disabling the nvidia module and reproducing on the 3.1.5 kernel is really the only way to make progress here.

Comment 4 Pascal Patry 2011-12-16 16:32:59 UTC
Sure. I also grabbed the debug package to get more info. I'll post as soon as I have reproduced it.

Comment 5 Pascal Patry 2011-12-28 06:14:22 UTC
Created attachment 549789 [details]
/var/log/messages

As promised, this is the /var/log/messages including the kernel stack trace for this problem, without 'nvidia' tainting the kernel. It took 11 days before locking up.

Kernel is 3.1.5-2.fc16.x86_64

Comment 6 Josh Boyer 2012-01-04 20:37:36 UTC
We have a similar oops in _raw_spin_lock from a different user in bug 771559. They hit this quite a while after they resumed from a suspend. Did you happen to also resume from a suspend/hibernate at some point during the uptime?

Comment 7 Pascal Patry 2012-01-04 23:16:02 UTC
No, this workstation never goes to sleep/suspend. It runs 24/7 and doesn't even have a screen saver...

Neither user interaction nor load is necessary. Most of the time it locks up while being used, but it has also happened overnight. I also got it after ~38 hours of uptime.

Comment 8 Pascal Patry 2012-01-14 21:39:18 UTC
Still reproducible with the latest kernel package (3.1.8-2.fc16.x86_64).

Comment 9 Pascal Patry 2012-01-14 21:49:12 UTC
Looks like someone put their finger on this issue a few days ago:
https://lkml.org/lkml/2012/1/9/114

Comment 10 Pascal Patry 2012-02-07 18:35:39 UTC
Currently on 3.2.2-1.fc16.x86_64 with an uptime of six and a half days. No issues to report yet. If the problem was really caused by the issue linked in comment #9, then 3.2.2 has the fix and I shouldn't be able to reproduce it.

Comment 11 Josh Boyer 2012-02-07 19:10:04 UTC
Agreed.  Let's close this one out for now.  If you see it again on 3.2.2 or newer, please reopen.