Description of problem:
I have been seeing intermittent crashes when using the rc9 kernel. A crash can happen very quickly, or a system can stay up for days. The busier systems seem to crash faster. This does not happen with rc8. I have attached a traceback captured with netconsole that will hopefully point to where the problem is.

Version-Release number of selected component (if applicable):
kernel-PAE-3.1.0-0.rc9.git0.0.fc16.i686

How reproducible:
Highly variable. I have two machines that have been up for days now, and two that typically crash within hours, though this morning one was crashing within minutes. That may or may not have been related to a RAID resync that was in progress. The two machines with the more serious issues both accept inbound email (using qmail) and may have noticeably different disk access patterns from the other ones. All of the machines use LUKS and software RAID.
Created attachment 528393: traceback captured with netconsole
Because this doesn't seem to affect all systems, I am proposing this for NTH instead of blocker.
One other NTH note is that rc9 is currently in testing, not stable. (I thought it had moved there but misremembered.) The NTH will only apply if it gets moved to stable before final.
I misread the tags. It is currently tagged for both f16 and f16-updates-testing. So it is already an issue for final.
Looks like some kind of subtle deadlock:

CPU 0:
  [<c043640f>] __task_rq_lock+0x28/0x46
  [<c0440fd6>] wake_up_new_task+0x3a/0xa3
CPU 1:
  [<c0435e84>] account_group_exec_runtime+0x2c/0x49
  [<c0435fc4>] update_curr+0x123/0x139
CPU 2:
  [<c043640f>] __task_rq_lock+0x28/0x46
  [<c0440fd6>] wake_up_new_task+0x3a/0xa3
CPU 3:
  [<c04363ba>] task_rq_lock+0x43/0x70
  [<c0441414>] task_sched_runtime+0x1f/0x9f
This might be the bug discussed in this thread: http://lkml.org/lkml/2011/10/7/45 I'll look at testing the patch that seemed to work for him.
CPUs 0 and 2 are at kernel/sched.c:954:
  raw_spin_lock(&rq->lock);
CPU 1 is at kernel/sched_stats.h:333:
  spin_lock(&cputimer->lock);
CPU 3 is at kernel/sched.c:973:
  raw_spin_lock(&rq->lock);
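For anyone following along, the traces above are consistent with a lock-order inversion between rq->lock and cputimer->lock, although they do not prove it. Below is a minimal userspace sketch in Python, not kernel code, of how such an inversion can be spotted from the order in which each code path takes its locks. The function find_lock_cycle and the path names are hypothetical. The first ordering reflects that update_curr runs with rq->lock already held when it reaches for cputimer->lock (CPU 1). The second ordering is an assumption: the trace only shows CPU 3 waiting on rq->lock, not what it already holds.

```python
from collections import defaultdict


def find_lock_cycle(paths):
    """Return a list of lock names forming an ordering cycle, or None.

    `paths` maps a (hypothetical) code-path name to the list of locks that
    path acquires, in acquisition order, while still holding the earlier ones.
    """
    # Build "held before wanted" edges for every path.
    edges = defaultdict(set)
    for order in paths.values():
        for i, held in enumerate(order):
            for wanted in order[i + 1:]:
                edges[held].add(wanted)

    done = set()  # locks whose outgoing edges are fully explored

    def visit(lock, stack):
        if lock in stack:
            # Reached a lock we are already "holding": that is a cycle.
            return stack[stack.index(lock):] + [lock]
        if lock in done:
            return None
        for nxt in sorted(edges.get(lock, ())):
            cycle = visit(nxt, stack + [lock])
            if cycle:
                return cycle
        done.add(lock)
        return None

    for lock in sorted(edges):
        cycle = visit(lock, [])
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    paths = {
        # update_curr is called with rq->lock held, then takes cputimer->lock.
        "runtime accounting (CPU 1)": ["rq->lock", "cputimer->lock"],
        # ASSUMPTION: this path already holds cputimer->lock when it asks
        # for rq->lock; the trace alone does not show that.
        "cputimer read (CPU 3, assumed)": ["cputimer->lock", "rq->lock"],
    }
    print(find_lock_cycle(paths))
```

If both paths took the two locks in the same order, the function would return None. With a cycle like this one, CPUs 0 and 2 would simply be bystanders queued on an rq->lock that never gets released.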
More recent patch: http://article.gmane.org/gmane.linux.kernel/1204676
Patch committed, will be in the next test kernel.
Thanks. I'll test it as soon as it shows up. My build with the later patch is still running and may not finish until after I go to sleep tonight. I might try running the kernel I built with the older patch just to confirm it is really the same issue.
I currently have three systems running 3.1.0-0.rc10.git0.1.fc16.i686.PAE. So far no problems, but it's too soon to declare victory. I'll have an x86_64 machine running the analogous kernel in a couple of hours. If everything is still working late tonight, then it is very likely that the problem is fixed.
kernel-3.1.0-0.rc10.git0.1.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/kernel-3.1.0-0.rc10.git0.1.fc16
The two machines that were typically crashing in under a couple of hours have been up for about 8 and 6 hours now, so I think there is a pretty good chance the problem is fixed.
Package kernel-3.1.0-0.rc10.git0.1.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.1.0-0.rc10.git0.1.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14609
then log in and leave karma (feedback).
kernel-3.1.0-0.rc10.git0.1.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.