Bug 759328 - kernel 3.2.0-0.rc3.git1.1.fc17 oopses with "paging request" at dup_mm+0x394/0x7f0
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-12-02 01:59 UTC by Michal Jaegermann
Modified: 2013-04-05 18:36 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-05 18:36:21 UTC
Type: ---


Attachments
3.2.0-0.rc3.git1.1.fc17.x86_64 oopsing in dup_mm+0x394/0x7f0 (5.21 KB, text/plain)
2011-12-02 02:01 UTC, Michal Jaegermann
dmesg with an oops from 3.2.0-0.rc4.git4.2.fc17.x86_64 (57.88 KB, text/plain)
2011-12-08 04:28 UTC, Michal Jaegermann
the whole oops are registered by dmesg with 3.2.0-0.rc7.git0.1.fc17.x86_64 (5.76 KB, text/plain)
2011-12-30 20:38 UTC, Michal Jaegermann
similar oops trace for 3.2.1-4.fc17.x86_64 (10.10 KB, text/plain)
2012-01-18 19:47 UTC, Michal Jaegermann

Description Michal Jaegermann 2011-12-02 01:59:57 UTC
Description of problem:

The following happened:

[ 1780.187231] BUG: unable to handle kernel paging request at ffffffff01820660
[ 1780.187861] IP: [<ffffffff8107a064>] dup_mm+0x394/0x7f0
[ 1780.188004] PGD 1c07067 PUD 0
[ 1780.188004] Oops: 0000 [#1] SMP
[ 1780.188004] CPU 0
.....
[ 1780.188004] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[ 1780.188004] in_atomic(): 0, irqs_disabled(): 1, pid: 17011, name: mandb
.....

clearly because man-db.cron decided that it is time to get busy.


Version-Release number of selected component (if applicable):
kernel-3.2.0-0.rc3.git1.1.fc17.x86_64

How reproducible:
I do not know yet

Additional info:
'mandb' as an oops trigger reminds me of a bug with a long history which was eventually tracked down not so long ago. Unfortunately I am unable to find the corresponding bug report. IIRC, despite the same trigger, the other issue was different.

I have recently seen "sleeping function called from invalid context at kernel/rwsem.c:21" on the same installation. This is bug 743453. Hm, that does not look so familiar.

Comment 1 Michal Jaegermann 2011-12-02 02:01:24 UTC
Created attachment 539470
3.2.0-0.rc3.git1.1.fc17.x86_64 oopsing in dup_mm+0x394/0x7f0

Comment 2 Michal Jaegermann 2011-12-02 02:08:43 UTC
Oh, I noticed only now. The attached oops is "tainted" due to the long-standing bug 537697, promising a possible DEADLOCK, which is still there; but it is getting boring to add the same comment with every kernel update.

Comment 3 Michal Jaegermann 2011-12-04 21:44:30 UTC
I am afraid that my machine just crashed while running 3.2.0-0.rc3.git1.1.fc17.x86_64, and the last entry in the logs before that crash was "starting man-db.cron" in /var/log/cron. There was nothing else anywhere and the screen was dark when this happened, so I am not absolutely sure that this was the same thing, but it is pretty likely. So it may be repeatable.

OTOH running 'mandb', or 'mandb -c', from a command line does not seem to do anything nasty to the kernel. Sigh, déjà vu once again.

Comment 4 Dave Jones 2011-12-05 19:40:55 UTC
this smells like memory corruption.

Can you run memtest86 for a while to rule out bad hardware ?

Comment 5 Michal Jaegermann 2011-12-05 21:25:57 UTC
(In reply to comment #4)

> Can you run memtest86 for a while to rule out bad hardware ?

OK. This memory was checked not so long ago, but these things happen.

I will see what I can do. Tomorrow I am supposed to have eye surgery, so I may be out of circulation for some time.

Comment 6 Michal Jaegermann 2011-12-06 00:53:37 UTC
(In reply to comment #4)
> 
> Can you run memtest86 for a while to rule out bad hardware ?

So far I have run three full cycles of memtest86 v4.20. That is over three hours of testing. No errors were found. I have to stop these runs for the time being. If there is bad hardware somewhere, it does not seem to be obviously bad.

Comment 7 Dave Jones 2011-12-06 15:12:48 UTC
very curious.  Do you use suspend/resume or hibernate at all when this bug happens ?

Comment 8 Michal Jaegermann 2011-12-06 16:20:55 UTC
(In reply to comment #7)
> Do you use suspend/resume or hibernate at all when this bug happens ?

No. This is a desktop machine running assorted test setups, and although it was observed on some occasions that it can suspend, sometimes better, sometimes not so great, such occasions are exceedingly rare. None of that here.

Also "happens" is really "happened", unless you include here the mystery crash mentioned in comment #3.

Comment 9 Dave Jones 2011-12-06 21:38:30 UTC
3.2.0-0.rc4.git4.2.fc17 is building now which has a very expensive debug option turned on that might catch something earlier.

Give it a try, and see if that gives any different traces.

Comment 10 Michal Jaegermann 2011-12-06 23:06:20 UTC
(In reply to comment #9)
> 3.2.0-0.rc4.git4.2.fc17 is building now which has a very expensive debug option
> turned on that might catch something earlier.

All right. As soon as my post-surgery sight starts to behave a bit better I will give it a run. I think that I will try to run from cron a loop doing
'mandb -c' and I will see how far that gets me.
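
A loop of this kind could be sketched roughly as below. This is only an illustration, not the script actually used in this report: the function name `run_cycles` and the log format are assumptions made here. Each cycle is written and synced to disk right away, so that the record of how far the loop got has a better chance of surviving a hard lockup.

```python
import os
import subprocess
import time


def run_cycles(cmd, cycles, log_path):
    """Run `cmd` repeatedly and append one log line per finished cycle.

    Hypothetical helper: returns the list of exit codes, one per cycle.
    """
    codes = []
    with open(log_path, "a") as log:
        for i in range(1, cycles + 1):
            started = time.strftime("%Y-%m-%d %H:%M:%S")
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            codes.append(result.returncode)
            log.write(f"{started} cycle {i} exit {result.returncode}\n")
            # Push the line to disk now; after a lockup nothing buffered
            # in memory would be recoverable.
            log.flush()
            os.fsync(log.fileno())
    return codes
```

From a cron job this might be called as `run_cycles(["mandb", "-c"], 20, "/var/tmp/mandb-loop.log")`, where the cycle count and the log path are placeholders.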

Comment 11 Michal Jaegermann 2011-12-08 04:28:55 UTC
Created attachment 542361
dmesg with an oops from 3.2.0-0.rc4.git4.2.fc17.x86_64

> 3.2.0-0.rc4.git4.2.fc17 is building now which has a very expensive debug option

I tried running 'mandb -c' from cron in a loop using 3.2.0-0.rc4.git4.2.fc17.
On the second round this ended up with an oops that looked like this:

 BUG: unable to handle kernel paging request at ffffffff80c2dcf0
 IP: [<ffffffff8129a85c>] exit_shm+0x1c/0x90
 PGD 1c07067 PUD 1c0b063 PMD 0 
 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC

Just in case, the whole dmesg output, with a trace and up to "Fixing recursive fault ...", is attached. Nothing else was recorded.

Comment 12 Dave Jones 2011-12-08 16:47:56 UTC
hmm. that doesn't really give any new clues as to what's going on.

was rc3.git1 the first time you saw this ? I'm wondering if bisecting this bug might be the best plan.

if you have a last-known good kernel that isn't from too long before that first broken build, that might narrow the search somewhat.

Comment 13 Michal Jaegermann 2011-12-08 17:17:09 UTC
(In reply to comment #12)
> 
> was rc3.git1 the first time you saw this ?

There was in the past a bug which was triggered pretty reliably by cron running mandb. I am afraid that I somewhat lost track of it, and bugzilla searches seem to be more and more unreliable. I failed to find it. IIRC Vivek Goyal eventually tracked it down. No idea if a proposed patch was eventually accepted in the mainline or is waiting for better times. I could not compare traces, but from vague memory they could be somewhat related (but I may be completely wrong).

Yes, among recent kernels rc3.git1 was the first one where I observed something like that, but this is not that reliable. 'mandb' is not an absolute trigger and cron runs it weekly. What happened in comment #3 was that the previous mandb run was left incomplete and was clearly tried again.

> I'm wondering if bisecting this bug
> might be the best plan.

How are you bisecting Fedora kernels? As a matter of fact I do not even have at this moment a git tree for any kernel. That does not mean that I cannot pull one, but right now I do not even know from where.

If you have a reference to detailed bisecting instructions handy, that would be highly appreciated.

> if you have a last-known good kernel ....

That seems to be somewhat fuzzy, I am afraid.  Also tickling that bug is an open-ended proposition.

Comment 14 Dave Jones 2011-12-08 20:09:33 UTC
> There was in the past a bug which was triggered pretty reliably by cron running
> mandb.  I am afraid that I somewhat lost track of it and bugzilla searches seem
> to be more and more unreliable.  I failed to find it.  IIRC Vivek Goyal
> eventually tracked it down.  No idea if a proposed patch was eventually
> accepted in the mainline or is waiting for better times.  I could not compare
> traces but from vague memory they could be somewhat related (but I may be
> completely wrong).

That sounds like the cfq bug that got fixed fairly recently. Probably unrelated to this.
 
> How are you bisecting Fedora kernels?  As a matter of fact I do not even have
> at this moment a git tree for any kernel.  Does not mean that I cannot pull one
> but right now I do not even know from where.

You could try the rpm's already built by grabbing them from http://koji.fedoraproject.org/koji/packageinfo?packageID=8
just to see if you can narrow it down to a specific build.

it's going to be time-consuming, but I don't really have any better ideas right now, unless we start seeing other people report a similar bug.

> That seems to be somewhat fuzzy, I am afraid.  Also tickling that bug is an
> open-ended proposition.

that's what's going to make this a fairly long-winded process. confirming a 'good' kernel will take a while I guess.
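
For what it is worth, narrowing the problem down to a specific build amounts to a binary search over the builds ordered from oldest to newest. A minimal sketch follows, assuming the oldest build in the list is good and the newest is bad; the function name `first_bad_build`, the `is_bad` callback and the `build-N` labels are hypothetical and not taken from koji.

```python
def first_bad_build(builds, is_bad):
    """Return the earliest bad build in `builds` (ordered oldest to newest).

    Assumes builds[0] is good, builds[-1] is bad, and that every build
    after the first bad one is bad as well.
    """
    if len(builds) < 2:
        raise ValueError("need at least one good and one bad build")
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi]


# Placeholder labels standing in for real kernel builds.
builds = [f"build-{i}" for i in range(8)]
culprit = first_bad_build(builds, lambda b: int(b.split("-")[1]) >= 5)
```

Since this oops is intermittent, a "good" verdict from `is_bad` is never certain. In practice the callback would have to stand for many test cycles on that build, and a wrong "good" answer sends the search in the wrong direction.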

Comment 15 Michal Jaegermann 2011-12-08 22:16:05 UTC
(In reply to comment #14)

> 
> You could try the rpm's already built by grabbing them from
> http://koji.fedoraproject.org/koji/packageinfo?packageID=8
> just to see if you can narrow it down to a specific build.

So far I know that I did not see that with 3.2.0-0.rc3.git0.1.fc17.x86_64
(which may mean only that I did not bump into the problem there) and I noticed it
starting from 3.2.0-0.rc3.git1.1.fc17.x86_64.  Unfortunately this is the oldest
kernel I have available now.  Yes, I know about koji.

> that's what's going to make this a fairly long-winded process. confirming a
> 'good' kernel will take a while I guess.

I realize that and, again, I am not sure how to do bisection with Fedora kernels.

On top of it my wife will likely kill me.  She already complains loudly that
after my eye surgery I sit way too long in front of computer screens.  I am afraid that she may be right.

Comment 16 Michal Jaegermann 2011-12-08 22:24:05 UTC
(In reply to comment #14)

> You could try the rpm's already built by grabbing them from
> http://koji.fedoraproject.org/koji/packageinfo?packageID=8
> just to see if you can narrow it down to a specific build.

I cannot even get kernel-3.2.0-0.rc3.git0.1.fc17.x86_64.rpm; 403 from koji.  Sigh!

Comment 17 Dave Jones 2011-12-09 17:57:26 UTC
> On top of it my wife will likely kill me.  She already complains loudly
> that
> after my eye surgery I sit way too long in front of computer screens.  I am
> afraid that she may be right.

the wife is always right ;-)

Comment 18 Michal Jaegermann 2011-12-11 04:39:40 UTC
(In reply to comment #16)
> 
> I cannot even get kernel-3.2.0-0.rc3.git0.1.fc17.x86_64.rpm; 403 from koji. 

Now I can retrieve binary packages from koji again, and with 3.2.0-0.rc3.git0.1.fc17.x86_64 I tried over twenty cycles of 'mandb -c' running as a cron job. On my test machine this was a three-hour run. Nothing bad happened.

OTOH six cycles of the same with the current 3.2.0-0.rc4.git5.1.fc17.x86_64 also went through without any incidents.  What all that really means I have no idea.

Comment 19 Michal Jaegermann 2011-12-16 04:26:24 UTC
It appears that I again got one of those oopses while running the 3.2.0-0.rc5.git2.2.fc17.x86_64 kernel. I cannot be really sure what that was, as I was away at the time; later I found the machine totally locked up with a dark screen. I only know that at that time cron was running the 'mandb -c' loop and it did not get very far (I collect the output from that). It happens only when I am not watching. There are absolutely no traces in /var/log/messages, and abrt also failed to catch anything.

Comment 20 Michal Jaegermann 2011-12-30 20:38:42 UTC
Created attachment 550093
the whole oops are registered by dmesg with 3.2.0-0.rc7.git0.1.fc17.x86_64

I got one more of these, this time while running 3.2.0-0.rc7.git0.1.fc17.x86_64. Again while in mandb from a cron job. It looks like it reproduces itself only when not expected. :-) The whole works: "fixing but reboot is needed" and "sleeping function called from invalid context at kernel/rwsem.c:21".

The call trace is a bit different, although dup_mm is there. I attach what dmesg registered of this oops. The call trace itself looks like this:

Call Trace:
 [<ffffffff8116ab32>] __pte_alloc+0x32/0x150
 [<ffffffff8116b7fe>] copy_pte_range+0x37e/0x430
 [<ffffffff8116e733>] copy_page_range+0x2d3/0x490
 [<ffffffff81675368>] ? mutex_lock_nested+0x2f8/0x3a0
 [<ffffffff8107a134>] dup_mm+0x384/0x7f0
 [<ffffffff8107b62a>] copy_process+0x105a/0x1750
 [<ffffffff8107be8b>] do_fork+0x11b/0x460
 [<ffffffff8116981c>] ? might_fault+0x5c/0xb0
 [<ffffffff810230c8>] sys_clone+0x28/0x30
 [<ffffffff81680b63>] stub_clone+0x13/0x20
 [<ffffffff816807c2>] ? system_call_fastpath+0x16/0x1b
Code:  Bad RIP value. 
RIP  [<ffffffff8004b3bb>] 0xffffffff8004b3ba
 RSP <ffff88006ab89b50>
CR2: ffffffff8004b3bb
---[ end trace e0647bd33f68339d ]---
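
To compare traces like this one across kernels, it can help to reduce each frame to symbol, offset and function size. The sketch below is an assumption-laden illustration, not a tool used in this report: the regular expression and the name `parse_frames` are made up here, and the pattern only covers the `[<address>] symbol+0xoff/0xsize` layout seen above. A leading `?` conventionally marks a frame the unwinder is not sure about.

```python
import re

# [<ffffffff8107a134>] dup_mm+0x384/0x7f0, optionally with "? " before the symbol
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(text):
    """Extract call-trace frames from oops text as a list of dicts."""
    frames = []
    for line in text.splitlines():
        m = FRAME.search(line)
        if m:
            frames.append({
                "addr": int(m.group(1), 16),
                "reliable": m.group(2) is None,  # False when marked with "?"
                "symbol": m.group(3),
                "offset": int(m.group(4), 16),
                "size": int(m.group(5), 16),
            })
    return frames


sample = """\
 [<ffffffff8116e733>] copy_page_range+0x2d3/0x490
 [<ffffffff81675368>] ? mutex_lock_nested+0x2f8/0x3a0
 [<ffffffff8107a134>] dup_mm+0x384/0x7f0
"""
frames = parse_frames(sample)
symbols = [f["symbol"] for f in frames if f["reliable"]]
```

Lines without a `symbol+0x...` part, such as the "Bad RIP value" line above, are simply skipped.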

Comment 21 Michal Jaegermann 2012-01-18 19:47:22 UTC
Created attachment 556104
similar oops trace for 3.2.1-4.fc17.x86_64

Oopses of the same sort as described show up from time to time. Unfortunately they do not occur in a really predictable manner, nor in a way that I can even semi-reliably try to reproduce. Here is the latest example, for kernel-3.2.1-4.fc17.x86_64. It differs in some details from what was registered previously. Attached for the record.

Comment 22 Fedora End Of Life 2013-04-03 15:37:53 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 23 Justin M. Forbes 2013-04-03 16:54:48 UTC
Is this still happening with 3.9-rc kernels for F19?

Comment 24 Michal Jaegermann 2013-04-03 17:00:32 UTC
(In reply to comment #23)
> Is this still happening with 3.9-rc kernels for F19?

I have not seen that for a very long time.

