Bug 746485 - System crashes on rc9 but not rc8
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: F16-accepted/F16FinalFreezeExcept
Reported: 2011-10-16 10:38 EDT by Bruno Wolff III
Modified: 2011-10-20 00:02 EDT

See Also:
Fixed In Version: kernel-3.1.0-0.rc10.git0.1.fc16
Doc Type: Bug Fix
Last Closed: 2011-10-20 00:02:15 EDT


Attachments
traceback captured with netconsole (13.49 KB, text/plain)
2011-10-16 10:39 EDT, Bruno Wolff III

Description Bruno Wolff III 2011-10-16 10:38:03 EDT
Description of problem:
I have been seeing intermittent crashes with the rc9 kernel. Sometimes a crash happens very quickly; other times a system stays up for days. The busier systems seem to crash sooner.

This does not happen with rc8.

I have attached a traceback captured with netconsole that hopefully will point to where the problem is.

Version-Release number of selected component (if applicable):
kernel-PAE-3.1.0-0.rc9.git0.0.fc16.i686

How reproducible:
Highly variable. I have two machines that have been up for days now, and two that typically crash within hours, though this morning one was crashing within minutes. That may or may not have been related to a RAID resync in progress. The two machines with the more serious issues both accept inbound email (using qmail) and may have noticeably different disk access patterns from the others. All of the machines use LUKS and software RAID.

Comment 1 Bruno Wolff III 2011-10-16 10:39:21 EDT
Created attachment 528393
traceback captured with netconsole
Comment 2 Bruno Wolff III 2011-10-16 10:41:08 EDT
Because this doesn't seem to affect all systems, I am proposing this for NTH instead of blocker.
Comment 3 Bruno Wolff III 2011-10-16 10:46:53 EDT
One other NTH note is that rc9 is currently in testing, not stable. (I thought it had moved there but misremembered.) The NTH will only apply if it gets moved to stable before final.
Comment 4 Bruno Wolff III 2011-10-16 10:50:40 EDT
I misread the tags. It is currently tagged for both f16 and f16-updates-testing. So it is already an issue for final.
Comment 5 Chuck Ebbert 2011-10-17 23:54:20 EDT
Looks like some kind of subtle deadlock:

CPU 0: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 1: [<c0435e84>] account_group_exec_runtime+0x2c/0x49
       [<c0435fc4>] update_curr+0x123/0x139

CPU 2: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 3: [<c04363ba>] task_rq_lock+0x43/0x70
       [<c0441414>] task_sched_runtime+0x1f/0x9f
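
For anyone reading the trace entries above: each one is a return address in brackets followed by symbol+offset/size, where the offset is the distance in bytes from the start of the function and the size is the length of the function. A minimal sketch in C that splits such a line into its parts is shown here. The names `struct frame` and `parse_frame` are hypothetical and exist only for this illustration; they are not part of any kernel tool.

```c
#include <stdio.h>
#include <string.h>

/* One decoded entry, e.g. "[<c043640f>] __task_rq_lock+0x28/0x46".
 * (Hypothetical type, for illustration only.) */
struct frame {
    unsigned long addr;    /* address inside the function */
    char symbol[64];       /* function name */
    unsigned long offset;  /* bytes from the start of the function */
    unsigned long size;    /* total size of the function in bytes */
};

/* Parse one trace line. Any prefix such as "CPU 0: " is skipped by
 * searching for the opening bracket.
 * Returns 0 on success, -1 if the line does not match the format. */
int parse_frame(const char *line, struct frame *out)
{
    const char *p = strchr(line, '[');
    if (p == NULL)
        return -1;

    /* %lx accepts an optional "0x" prefix, so it handles both the
     * bare address and the 0x-prefixed offset and size. */
    int n = sscanf(p, "[<%lx>] %63[^+]+%lx/%lx",
                   &out->addr, out->symbol, &out->offset, &out->size);
    return n == 4 ? 0 : -1;
}
```

Calling `parse_frame("CPU 0: [<c043640f>] __task_rq_lock+0x28/0x46", &f)` fills `f` with the address, the symbol name `__task_rq_lock`, offset 0x28 and size 0x46.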
Comment 6 Bruno Wolff III 2011-10-17 23:57:49 EDT
This might be the bug discussed in this thread:
http://lkml.org/lkml/2011/10/7/45
I'll look at testing the patch that seemed to work for him.
Comment 7 Chuck Ebbert 2011-10-18 01:25:43 EDT
CPUs 0 and 2 are at kernel/sched.c:954:
                raw_spin_lock(&rq->lock);

CPU 1 is at kernel/sched_stats.h:333:
        spin_lock(&cputimer->lock);

CPU 3 is at kernel/sched.c:973:
                raw_spin_lock(&rq->lock);
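
One reading consistent with these positions is a lock-ordering (ABBA) inversion: one path takes rq->lock and then wants cputimer->lock, while another takes cputimer->lock and then wants rq->lock. This report does not state which CPU holds which lock, so the following is only a user-space sketch under that assumption. It uses pthread mutexes as stand-ins for the kernel spinlocks, and the function names `inversion_is_stuck` and `ordered_path` are hypothetical.

```c
#include <errno.h>
#include <pthread.h>

/* User-space stand-ins; the names only echo the kernel locks above. */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cputimer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Report whether a blocking lock call on m would have to wait. */
static int would_block(pthread_mutex_t *m)
{
    int rc = pthread_mutex_trylock(m);
    if (rc == 0)
        pthread_mutex_unlock(m);  /* it was free: undo the acquisition */
    return rc == EBUSY;
}

/* Recreate the state of an ABBA inversion without hanging. Both
 * "paths" run on one thread here so the example is deterministic:
 * path A holds rq_lock and wants cputimer_lock, path B holds
 * cputimer_lock and wants rq_lock.
 * Returns 1 if neither path could make progress, 0 otherwise. */
int inversion_is_stuck(void)
{
    pthread_mutex_lock(&rq_lock);        /* path A takes its first lock */
    pthread_mutex_lock(&cputimer_lock);  /* path B takes its first lock */

    int a_blocked = would_block(&cputimer_lock);  /* A's second lock */
    int b_blocked = would_block(&rq_lock);        /* B's second lock */

    pthread_mutex_unlock(&cputimer_lock);
    pthread_mutex_unlock(&rq_lock);
    return a_blocked && b_blocked;
}

/* If every path agrees on one order (rq_lock, then cputimer_lock),
 * the cycle cannot form. Returns 0 on success, -1 on error. */
int ordered_path(void)
{
    if (pthread_mutex_lock(&rq_lock) != 0)
        return -1;
    if (pthread_mutex_lock(&cputimer_lock) != 0) {
        pthread_mutex_unlock(&rq_lock);
        return -1;
    }
    pthread_mutex_unlock(&cputimer_lock);
    pthread_mutex_unlock(&rq_lock);
    return 0;
}
```

With real threads and blocking lock calls, the state that `inversion_is_stuck` detects would leave both paths waiting indefinitely; with spinlocks, as in the kernel, the CPUs would spin instead of sleeping. Agreeing on a single acquisition order, or not holding one lock while taking the other, is the usual way to avoid it.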
Comment 8 Chuck Ebbert 2011-10-18 11:06:31 EDT
More recent patch:
http://article.gmane.org/gmane.linux.kernel/1204676
Comment 9 Chuck Ebbert 2011-10-18 17:10:09 EDT
Patch committed, will be in the next test kernel.
Comment 10 Bruno Wolff III 2011-10-18 17:40:11 EDT
Thanks. I'll test it as soon as it shows up. My build with the later patch is still running and may not finish until after I go to sleep tonight. I might try running the kernel I built with the older patch just to confirm it is really the same issue.
Comment 11 Bruno Wolff III 2011-10-19 10:11:19 EDT
I currently have three systems running 3.1.0-0.rc10.git0.1.fc16.i686.PAE. So far no problems, but it's too soon to declare victory. I'll have an x86_64 system running the analogous kernel in a couple of hours. If everything is still working late tonight, then it is very likely the problem is fixed.
Comment 12 Fedora Update System 2011-10-19 10:48:40 EDT
kernel-3.1.0-0.rc10.git0.1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.1.0-0.rc10.git0.1.fc16
Comment 13 Bruno Wolff III 2011-10-19 18:09:44 EDT
The two machines that were typically crashing in under a couple of hours have been up for about 8 and 6 hours now, so I think there is a pretty good chance the problem is fixed.
Comment 14 Fedora Update System 2011-10-19 22:22:56 EDT
Package kernel-3.1.0-0.rc10.git0.1.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.1.0-0.rc10.git0.1.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14609
then log in and leave karma (feedback).
Comment 15 Fedora Update System 2011-10-20 00:02:15 EDT
kernel-3.1.0-0.rc10.git0.1.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
