746485 – System crashes on rc9 but not rc8

Summary: System crashes on rc9 but not rc8

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	F16-accepted, F16FinalFreezeExcept

Reported:	2011-10-16 14:38 UTC by Bruno Wolff III
Modified:	2011-10-20 04:02 UTC
CC List:	7 users
Fixed In Version:	kernel-3.1.0-0.rc10.git0.1.fc16
Clone Of:
Environment:
Last Closed:	2011-10-20 04:02:15 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
traceback captured with netconsole (13.49 KB, text/plain) 2011-10-16 14:39 UTC, Bruno Wolff III

Description Bruno Wolff III 2011-10-16 14:38:03 UTC

Description of problem:
I have been seeing intermittent crashes when using the rc9 kernel. A crash can happen very quickly, or a system can stay up for days. The busier systems seem to crash sooner.

This does not happen with rc8.

I have attached a traceback captured with netconsole that hopefully will point to where the problem is.

Version-Release number of selected component (if applicable):
kernel-PAE-3.1.0-0.rc9.git0.0.fc16.i686

How reproducible:
Highly variable. I have two machines that have been up for days now, and two that typically crash within hours, though this morning one was crashing within minutes. That may or may not have been related to a RAID resync in progress. The two machines with the more serious problems both accept inbound email (using qmail) and may have noticeably different disk access patterns from the others. All of the machines use LUKS and software RAID.

Comment 1 Bruno Wolff III 2011-10-16 14:39:21 UTC

Created attachment 528393
traceback captured with netconsole

Comment 2 Bruno Wolff III 2011-10-16 14:41:08 UTC

Because this doesn't seem to affect all systems, I am proposing this for NTH instead of blocker.

Comment 3 Bruno Wolff III 2011-10-16 14:46:53 UTC

One other NTH note is that rc9 is currently in testing, not stable. (I thought it had moved there but misremembered.) The NTH will only apply if it gets moved to stable before final.

Comment 4 Bruno Wolff III 2011-10-16 14:50:40 UTC

I misread the tags. It is currently tagged for both f16 and f16-updates-testing. So it is already an issue for final.

Comment 5 Chuck Ebbert 2011-10-18 03:54:20 UTC

Looks like some kind of subtle deadlock:

CPU 0: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 1: [<c0435e84>] account_group_exec_runtime+0x2c/0x49
       [<c0435fc4>] update_curr+0x123/0x139

CPU 2: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 3: [<c04363ba>] task_rq_lock+0x43/0x70
       [<c0441414>] task_sched_runtime+0x1f/0x9f
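
For anyone triaging a similar trace, a minimal sketch of how the frames above could be parsed and the CPUs grouped by the function each one is stuck in. The helper names (`parse_trace`, `group_by_top_frame`) are hypothetical and are not part of any kernel tooling; the trace text is copied from this comment.

```python
import re
from collections import defaultdict

# Trace excerpt copied from the comment above.
TRACE = """\
CPU 0: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 1: [<c0435e84>] account_group_exec_runtime+0x2c/0x49
       [<c0435fc4>] update_curr+0x123/0x139

CPU 2: [<c043640f>] __task_rq_lock+0x28/0x46
       [<c0440fd6>] wake_up_new_task+0x3a/0xa3

CPU 3: [<c04363ba>] task_rq_lock+0x43/0x70
       [<c0441414>] task_sched_runtime+0x1f/0x9f
"""

# A frame looks like: [<address>] symbol+0xoffset/0xsize
FRAME = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")
CPU = re.compile(r"^CPU (\d+):")


def parse_trace(text):
    """Return {cpu: [(symbol, offset, size), ...]}, innermost frame first."""
    stacks = defaultdict(list)
    cpu = None
    for line in text.splitlines():
        cpu_match = CPU.match(line)
        if cpu_match:
            cpu = int(cpu_match.group(1))
        frame = FRAME.search(line)
        if frame and cpu is not None:
            _addr, symbol, offset, size = frame.groups()
            stacks[cpu].append((symbol, int(offset, 16), int(size, 16)))
    return dict(stacks)


def group_by_top_frame(stacks):
    """Return {innermost symbol: [cpus stuck there]}."""
    groups = defaultdict(list)
    for cpu, frames in sorted(stacks.items()):
        groups[frames[0][0]].append(cpu)
    return dict(groups)


stacks = parse_trace(TRACE)
groups = group_by_top_frame(stacks)
for symbol, cpus in groups.items():
    print(symbol, cpus)
```

Grouping by the innermost frame makes it easy to see that CPUs 0 and 2 are stuck at the same site while CPUs 1 and 3 are each somewhere different.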

Comment 6 Bruno Wolff III 2011-10-18 03:57:49 UTC

This might be the bug discussed in this thread:
http://lkml.org/lkml/2011/10/7/45
I'll look at testing the patch that seemed to work for him.

Comment 7 Chuck Ebbert 2011-10-18 05:25:43 UTC

CPUs 0 and 2 are at kernel/sched.c:954:
                raw_spin_lock(&rq->lock);

CPU 1 is at kernel/sched_stats.h:333:
        spin_lock(&cputimer->lock);

CPU 3 is at kernel/sched.c:973:
                raw_spin_lock(&rq->lock);
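
As an illustration only: two locks are involved above (rq->lock and cputimer->lock), which is the usual shape of a lock-order inversion. The sketch below checks a set of acquisition orders for a cycle. The orders in `PATHS` are an assumption made for the example (that one path takes rq->lock before cputimer->lock and another takes them in the reverse order); this report does not confirm which locks each CPU already held. The function names are hypothetical.

```python
from collections import defaultdict

# ASSUMED acquisition orders, for illustration only. The bug report shows
# where each CPU is spinning, not which locks it already holds.
PATHS = {
    "update_curr -> account_group_exec_runtime": ["rq->lock", "cputimer->lock"],
    "hypothetical caller -> task_sched_runtime": ["cputimer->lock", "rq->lock"],
}


def build_order_graph(paths):
    """Add an edge A -> B whenever some path takes lock B while holding A."""
    graph = defaultdict(set)
    for locks in paths.values():
        for i, held in enumerate(locks):
            for later in locks[i + 1:]:
                graph[held].add(later)
    return graph


def find_cycle(graph):
    """Depth-first search; return the locks forming a cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GREY:
                # Back edge: nxt is already on the current DFS stack.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cycle = find_cycle(build_order_graph(PATHS))
print("lock-order cycle:", cycle)
```

A cycle in this graph means two paths can each hold the lock the other is waiting for. With a single consistent order the function returns None.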

Comment 8 Chuck Ebbert 2011-10-18 15:06:31 UTC

More recent patch:
http://article.gmane.org/gmane.linux.kernel/1204676

Comment 9 Chuck Ebbert 2011-10-18 21:10:09 UTC

Patch committed, will be in the next test kernel.

Comment 10 Bruno Wolff III 2011-10-18 21:40:11 UTC

Thanks. I'll test it as soon as it shows up. My build with the later patch is still running and may not finish until after I go to sleep tonight. I might try running the kernel I built with the older patch just to confirm it is really the same issue.

Comment 11 Bruno Wolff III 2011-10-19 14:11:19 UTC

I currently have three systems running 3.1.0-0.rc10.git0.1.fc16.i686.PAE. So far no problems, but it's too soon to declare victory. I'll have an x86_64 system running the analogous kernel in a couple of hours. If everything is still working late tonight, it is very likely that the problem is fixed.

Comment 12 Fedora Update System 2011-10-19 14:48:40 UTC

kernel-3.1.0-0.rc10.git0.1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.1.0-0.rc10.git0.1.fc16

Comment 13 Bruno Wolff III 2011-10-19 22:09:44 UTC

The two machines that were typically crashing in under a couple of hours have been up for about 8 and 6 hours now, so I think there is a pretty good chance the problem is fixed.

Comment 14 Fedora Update System 2011-10-20 02:22:56 UTC

Package kernel-3.1.0-0.rc10.git0.1.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.1.0-0.rc10.git0.1.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14609
then log in and leave karma (feedback).

Comment 15 Fedora Update System 2011-10-20 04:02:15 UTC

kernel-3.1.0-0.rc10.git0.1.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
