Description of problem:
The following happened:

[ 1780.187231] BUG: unable to handle kernel paging request at ffffffff01820660
[ 1780.187861] IP: [<ffffffff8107a064>] dup_mm+0x394/0x7f0
[ 1780.188004] PGD 1c07067 PUD 0
[ 1780.188004] Oops: 0000 [#1] SMP
[ 1780.188004] CPU 0
.....
[ 1780.188004] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[ 1780.188004] in_atomic(): 0, irqs_disabled(): 1, pid: 17011, name: mandb
.....

clearly because man-db.cron decided that it was time to get busy.

Version-Release number of selected component (if applicable):
kernel-3.2.0-0.rc3.git1.1.fc17.x86_64

How reproducible:
I do not know yet.

Additional info:
'mandb' as an oops trigger reminds me of a bug with a long history which was eventually tracked down not so long ago. Unfortunately I am unable to find the corresponding bug report. IIRC, despite the same trigger, the other issue was different.

I have recently seen "sleeping function called from invalid context at kernel/rwsem.c:21" on the same installation. This is bug 743453. Hm, that one does not look so familiar.
Created attachment 539470 [details] 3.2.0-0.rc3.git1.1.fc17.x86_64 oopsing in dup_mm+0x394/0x7f0
Oh, I noticed only now. The attached oops is "tainted" due to a long-standing bug 537697, promising a possible DEADLOCK, which is still there; but it is getting boring to add the same comment with every kernel update.
I am afraid that my machine just crashed running 3.2.0-0.rc3.git1.1.fc17.x86_64, and the last entry in the logs before that crash was "starting man-db.cron" in /var/log/cron. Nothing else anywhere, and the screen was dark when this happened, so I am not absolutely sure this was the same thing, but it is pretty likely. So it may be repeatable. OTOH running 'mandb', or 'mandb -c', from a command line does not seem to do anything nasty to the kernel. Sigh, déjà vu once again.
This smells like memory corruption. Can you run memtest86 for a while to rule out bad hardware?
(In reply to comment #4)
> Can you run memtest86 for a while to rule out bad hardware ?

OK. This memory was checked not so long ago, but these things happen. I will see what I can do. Tomorrow I am supposed to have an eye surgery so I may be out of circulation for some time.
(In reply to comment #4)
> Can you run memtest86 for a while to rule out bad hardware ?

So far I have run three full cycles of memtest86 v4.20. That is over three hours of testing. No errors were found. I have to stop these runs for the time being. If there is bad hardware somewhere, it does not seem to be obviously bad.
Very curious. Do you use suspend/resume or hibernate at all when this bug happens?
(In reply to comment #7)
> Do you use suspend/resume or hibernate at all when this bug happens ?

No. This is a desktop machine running assorted test setups, and although it was observed on some occasions that it can suspend (sometimes better, sometimes not so great), such occasions are exceedingly rare. None of that here. Also "happens" is really "happened", unless you include the mystery crash mentioned in comment #3.
3.2.0-0.rc4.git4.2.fc17 is building now, which has a very expensive debug option turned on that might catch something earlier. Give it a try and see if it gives any different traces.
(In reply to comment #9)
> 3.2.0-0.rc4.git4.2.fc17 is building now which has a very expensive debug option
> turned on that might catch something earlier.

All right. As soon as my post-surgery sight starts to behave a bit better I will give it a run. I think I will try to run a loop doing 'mandb -c' from cron, and I will see how far that gets me.
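For the record, a reproduction harness along these lines could be a small wrapper script driven from cron. This is a purely illustrative sketch (the script name, log path, and round count are assumptions, not from the report): it logs each round so that, if the machine oopses mid-run, the last completed round is known.

```shell
#!/bin/sh
# run_loop CMD ROUNDS -- run CMD up to ROUNDS times, announcing each round,
# so the last completed round survives in the log after a crash.
run_loop() {
    cmd=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        i=$((i + 1))
        echo "round $i: starting"
        # Word-splitting of $cmd is intended here.
        $cmd || { echo "round $i: command failed"; return 1; }
        echo "round $i: done"
    done
}
```

In the scenario above this would be invoked from a crontab entry (path illustrative) such as `0 * * * * /usr/local/bin/mandb-loop.sh`, with the script calling something like `run_loop "mandb -c" 20 >> /var/log/mandb-loop.log 2>&1`.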
Created attachment 542361 [details]
dmesg with an oops from 3.2.0-0.rc4.git4.2.fc17.x86_64

> 3.2.0-0.rc4.git4.2.fc17 is building now which has a very expensive debug option

I tried running 'mandb -c' from cron in a loop using 3.2.0-0.rc4.git4.2.fc17. On the second round this ended up with an oops that looked like this:

BUG: unable to handle kernel paging request at ffffffff80c2dcf0
IP: [<ffffffff8129a85c>] exit_shm+0x1c/0x90
PGD 1c07067 PUD 1c0b063 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC

Just in case, the whole dmesg output with a trace, up to "Fixing recursive fault ...", is attached. Nothing else was recorded.
Hmm, that doesn't really give any new clues as to what's going on. Was rc3.git1 the first time you saw this? I'm wondering if bisecting this bug might be the best plan. If you have a last-known-good kernel that isn't from too long before that first broken build, that might narrow the search somewhat.
(In reply to comment #12)
> was rc3.git1 the first time you saw this ?

There was in the past a bug which was triggered pretty reliably by cron running mandb. I am afraid that I somewhat lost track of it, and bugzilla searches seem to be more and more unreliable; I failed to find it. IIRC Vivek Goyal eventually tracked it down. I have no idea if the proposed patch was eventually accepted into the mainline or is waiting for better times. I could not compare traces, but from a vague memory they could be somewhat related (I may be completely wrong).

Yes, with recent kernels rc3.git1 was the first one on which I observed something like that, but this is not that reliable. 'mandb' is not an absolute trigger, and cron runs it weekly. What happened in comment #3 was that the previous mandb run was left incomplete and it clearly tried again.

> I'm wondering if bisecting this bug might be the best plan.

How do you bisect Fedora kernels? As a matter of fact I do not even have a git tree for any kernel at the moment. That does not mean I cannot pull one, but right now I do not even know from where. If you have a reference to detailed bisecting instructions handy, that would be highly appreciated.

> if you have a last-known good kernel ....

That seems to be somewhat fuzzy, I am afraid. Also, tickling that bug is an open-ended proposition.
> There was in the past a bug which was triggered pretty reliably by cron
> running mandb. [...] IIRC Vivek Goyal eventually tracked it down.

That sounds like the cfq bug that got fixed fairly recently. Probably unrelated to this.

> How do you bisect Fedora kernels? As a matter of fact I do not even have
> a git tree for any kernel at the moment.

You could try the rpms already built, by grabbing them from
http://koji.fedoraproject.org/koji/packageinfo?packageID=8
just to see if you can narrow it down to a specific build. It's going to be time-consuming, but I don't really have any better ideas right now, unless we start seeing other people report a similar bug.

> That seems to be somewhat fuzzy, I am afraid. Also, tickling that bug is an
> open-ended proposition.

That's what's going to make this a fairly long-winded process. Confirming a 'good' kernel will take a while, I guess.
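Without a git tree, bisection over Koji builds reduces to a manual binary search over the ordered list of build NVRs between the last good and first bad kernel: fetch the middle build (e.g. with `koji download-build --arch=x86_64 <nvr>`), install it, run the mandb loop, and discard the half of the list ruled out by the result. A trivial helper to pick the next build to test, purely illustrative (the intermediate NVRs in the comment are placeholders; the real list would come from Koji):

```shell
#!/bin/sh
# midpoint BUILD... -- print the middle element of an ordered list of
# candidate kernel build NVRs, i.e. the next build to install and test.
# Repeat with the remaining half of the list until one build is left.
midpoint() {
    n=$#
    mid=$(( (n + 1) / 2 ))
    shift $((mid - 1))
    echo "$1"
}
# Example, with the endpoints mentioned in this report and made-up
# in-between builds kernel-A and kernel-B:
# midpoint kernel-3.2.0-0.rc3.git0.1.fc17 kernel-A kernel-B kernel-3.2.0-0.rc3.git1.1.fc17
```

With an unreliable trigger like this one, each "good" verdict is only probabilistic, so each candidate build would need many loop rounds before the list can be halved.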
(In reply to comment #14)
> You could try the rpms already built, by grabbing them from
> http://koji.fedoraproject.org/koji/packageinfo?packageID=8
> just to see if you can narrow it down to a specific build.

So far I know that I did not see this with 3.2.0-0.rc3.git0.1.fc17.x86_64 (which may only mean that I did not bump into the problem there) and I noticed it starting from 3.2.0-0.rc3.git1.1.fc17.x86_64. Unfortunately that is the oldest kernel I have available now. Yes, I know about koji.

> That's what's going to make this a fairly long-winded process. Confirming a
> 'good' kernel will take a while, I guess.

I realize that and, again, I am not sure how to do bisection with Fedora kernels. On top of it, my wife will likely kill me. She already complains loudly that after my eye surgery I sit way too long in front of computer screens. I am afraid that she may be right.
(In reply to comment #14)
> You could try the rpms already built, by grabbing them from
> http://koji.fedoraproject.org/koji/packageinfo?packageID=8
> just to see if you can narrow it down to a specific build.

I cannot even get kernel-3.2.0-0.rc3.git0.1.fc17.x86_64.rpm; 403 from koji. Sigh!
> On top of it, my wife will likely kill me. She already complains loudly
> that after my eye surgery I sit way too long in front of computer screens.
> I am afraid that she may be right.

The wife is always right ;-)
(In reply to comment #16)
> I cannot even get kernel-3.2.0-0.rc3.git0.1.fc17.x86_64.rpm; 403 from koji.

Now I can retrieve binary packages from koji again, and with 3.2.0-0.rc3.git0.1.fc17.x86_64 I tried over twenty cycles of 'mandb -c' running as a cron job. On my test machine that was a three-hour run. Nothing bad happened. OTOH six cycles of the same with the current 3.2.0-0.rc4.git5.1.fc17.x86_64 also went through without any incidents. What all that really means I have no idea.
It appears that I got one of those oopses again while running the 3.2.0-0.rc5.git2.2.fc17.x86_64 kernel. I cannot be really sure what it was, as I was away at the time, and later I found the machine totally locked up with a dark screen. I only know that at that time cron was running the 'mandb -c' loop and it did not get very far (I collect output from it). Only when I am not watching. There are absolutely no traces in /var/log/messages, and abrt also failed to catch anything.
Created attachment 550093 [details]
the whole oops as registered by dmesg with 3.2.0-0.rc7.git0.1.fc17.x86_64

I got one more of these, this time while running 3.2.0-0.rc7.git0.1.fc17.x86_64. Again while in mandb from a cron job. It looks like it reproduces only when not expected. :-)

The whole works - "fixing but reboot is needed" and "sleeping function called from invalid context at kernel/rwsem.c:21". The call trace is a bit different, although dup_mm is there. I attach what of this oops was registered by dmesg. The call trace itself looks like this:

Call Trace:
 [<ffffffff8116ab32>] __pte_alloc+0x32/0x150
 [<ffffffff8116b7fe>] copy_pte_range+0x37e/0x430
 [<ffffffff8116e733>] copy_page_range+0x2d3/0x490
 [<ffffffff81675368>] ? mutex_lock_nested+0x2f8/0x3a0
 [<ffffffff8107a134>] dup_mm+0x384/0x7f0
 [<ffffffff8107b62a>] copy_process+0x105a/0x1750
 [<ffffffff8107be8b>] do_fork+0x11b/0x460
 [<ffffffff8116981c>] ? might_fault+0x5c/0xb0
 [<ffffffff810230c8>] sys_clone+0x28/0x30
 [<ffffffff81680b63>] stub_clone+0x13/0x20
 [<ffffffff816807c2>] ? system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [<ffffffff8004b3bb>] 0xffffffff8004b3ba
RSP <ffff88006ab89b50>
CR2: ffffffff8004b3bb
---[ end trace e0647bd33f68339d ]---
Created attachment 556104 [details]
similar oops trace for 3.2.1-4.fc17.x86_64

Oopses of the same sort as described show up from time to time. Unfortunately not in a really predictable manner, nor in a way I can even semi-reliably try to reproduce. Here is the latest example, for kernel-3.2.1-4.fc17.x86_64. It differs in some details from what was registered previously. Attached for the record.
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
Is this still happening with 3.9-rc kernels for F19?
(In reply to comment #23)
> Is this still happening with 3.9-rc kernels for F19?

I did not see that for a very long time.