Bug 428329

Summary: Oops: unable to handle kernel paging request at virtual address 60001018
Product: [Fedora] Fedora
Reporter: Christopher Beland <beland>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 8
CC: esandeen, james, jfrieben
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-01-09 05:45:26 UTC
Attachments:
 291329: Output in /var/log/messages (flags: none)
 294937: Latest oops (flags: none)
 295096: Kernel oops section in /var/log/messages for kernel 2.6.25-0.50.rc2.fc9 (flags: none)

Description Christopher Beland 2008-01-10 20:49:37 UTC
While logged into Gnome, I experienced a kernel oops requiring a hard system
reboot to recover.  (The system did not freeze, but the normal shutdown sequence
did not succeed, and gdm crashed during the attempt.)  The last thing I did was
start audacious from the command line.

I forced fsck.ext3 to scan the filesystem, but it did not report any problems.

This is with kernel-2.6.23.9-85.fc8.

Comment 1 Christopher Beland 2008-01-10 20:49:37 UTC
Created attachment 291329 [details]
Output in /var/log/messages

Comment 2 Chuck Ebbert 2008-01-11 00:00:46 UTC
Hmm, that 60001018 looks familiar...

Bug 270141 and bug 426863 appear to be the same thing.

Comment 3 Chuck Ebbert 2008-01-11 00:40:56 UTC
   f:   55                      push   %ebp
  10:   57                      push   %edi
  11:   89 c7                   mov    %eax,%edi
  13:   56                      push   %esi
  14:   53                      push   %ebx
  15:   8b 70 bc                mov    0xffffffbc(%eax),%esi ... esi <- block_i
  18:   8b 80 9c 00 00 00       mov    0x9c(%eax),%eax       ... eax <- ei
  1e:   85 f6                   test   %esi,%esi
  20:   8b 98 64 01 00 00       mov    0x164(%eax),%ebx      ... ebx <- base for rsv_lock
  26:   74 2f                   je     0x57
  28:   8d 6e 14                lea    0x14(%esi),%ebp       ... ebp <- rsv

So block_i == 60001000
   rsv == 60001014

00000000 <.text>:
   0:   83 7d 04 00             cmpl   $0x0,0x4(%ebp)
OOPS =>        if (!rsv_is_empty(&rsv->rsv_window)) {

   4:   74 26                   je     0x2c
   6:   8d 83 00 41 00 00       lea    0x4100(%ebx),%eax
   c:   e8 72 a6 cf d1          call   0xd1cfa683
  11:   83 7d 04 00             cmpl   $0x0,0x4(%ebp)

Seems to happen more than once in this code.
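
The block_i and rsv values above follow from the faulting address and the two displacements in the disassembly. A minimal sketch of that arithmetic, assuming the fault comes from the `cmpl $0x0,0x4(%ebp)` at the OOPS marker (the variable names are illustrative only):

```shell
#!/bin/sh
# Values taken from the oops summary and the disassembly in this comment.
fault_addr=0x60001018   # "unable to handle kernel paging request at ..."
cmp_disp=0x4            # displacement in: cmpl $0x0,0x4(%ebp)
lea_disp=0x14           # displacement in: lea 0x14(%esi),%ebp

# %ebp (rsv) is the faulting address minus the cmpl displacement.
rsv=$(printf '0x%x' $(($fault_addr - $cmp_disp)))
# %esi (block_i) is rsv minus the lea displacement.
block_i=$(printf '0x%x' $(($rsv - $lea_disp)))

echo "rsv     = $rsv"
echo "block_i = $block_i"
```

Working backwards like this only shows which pointer held the bad value; it says nothing about how that value got there.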


Comment 4 Eric Sandeen 2008-01-11 03:21:53 UTC
hm, kswapd thread in the other bug too (can't tell on the 3rd bug, truncated
oops)....

Comment 5 Eric Sandeen 2008-01-11 03:47:53 UTC
Christopher, I see that you reported all 3 bugs, #428329, #270141, and
#426863... feel free to just re-open or update an existing bug next time :)

It's interesting that you have the same bad value in all 3...

Similar oops: http://article.gmane.org/gmane.linux.kernel/584999

It's also in kswapd, but, it has a different bad address.

Nice long discussion there, but no good ideas.  Seems to me like either
something stomped on this memory, or use after free... tho that seems unlikely
since you have the same bad value all 3x.  I'd ask you to run memtest86 but
since one other person hit the same thing in the same callchain... hrm.

Comment 6 Eric Sandeen 2008-01-11 04:45:47 UTC
Christopher, how often are you hitting this?  I wonder if running with a debug
kernel variant would yield any more info next time, if you hit it often.

Comment 7 Eric Sandeen 2008-01-11 05:00:57 UTC
http://ubuntuforums.org/archive/index.php/t-337310.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/0403.html
https://bugzilla.novell.com/show_bug.cgi?id=213905

I'm going to ask after all... will you go ahead & let memtest run for a while?

Comment 9 Christopher Beland 2008-01-14 17:20:43 UTC
> Christopher, how often are you hitting this?

When I get a kernel oops, I almost always report it.  I don't know why I didn't
find the previous reports, but they are probably the only times I've seen this
problem.

I ran memtest last night and it did not detect any problems.

Comment 10 Eric Sandeen 2008-01-18 18:45:13 UTC
This is very odd.  Enough people have hit this, it seems real, but there's not a
lot to go on.  Running with a debug kernel, in hopes that it gets hit again,
might offer some clues.  Or, setting up to get a system dump on an oops would
also be a huge help.

I'll look through the code some more...

Comment 11 Christopher Beland 2008-01-18 19:46:40 UTC
I'm not sure what you mean by "debug kernel".  I do have the kernel-debuginfo
and kernel-debuginfo-common RPMs installed.

I read through this guide to kdump:
 http://fedoranews.org/mediawiki/index.php/Using_Kexec_and_Kdump_in_Rawhide
which explains how to get a system dump from a kernel panic.  Will a dump get
triggered automatically on oops if I put "crashkernel=64M@16M" in the kernel's
startup command in grub, and I do "chkconfig kdump on"?
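
A rough way to check the pieces this question depends on, offered as a sketch rather than a definitive answer. It assumes the running kernel exposes /sys/kernel/kexec_crash_loaded and /proc/sys/kernel/panic_on_oops; the function names are made up for illustration. As far as I know, kdump captures on a panic, so with panic_on_oops set to 0 an ordinary oops may not start a dump by itself.

```shell
#!/bin/sh
# Print the crashkernel= argument found in a kernel command line, if any.
crashkernel_arg() {
    for word in $1; do
        case $word in
            crashkernel=*) printf '%s\n' "$word"; return 0 ;;
        esac
    done
    return 1
}

# Report three things worth checking before relying on kdump.
kdump_preflight() {
    cmdline=$(cat /proc/cmdline 2>/dev/null || true)
    if arg=$(crashkernel_arg "$cmdline"); then
        echo "reserved memory requested: $arg"
    else
        echo "no crashkernel= on the running kernel's command line"
    fi
    # 1 should mean a capture kernel is loaded (file may be absent on some kernels).
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded 2>/dev/null || echo unknown)"
    # 0 means an oops is not escalated to a panic.
    echo "panic_on_oops: $(cat /proc/sys/kernel/panic_on_oops 2>/dev/null || echo unknown)"
}

kdump_preflight
```

The script only reads state; it changes nothing on the system.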

Comment 12 Christopher Beland 2008-01-19 18:51:43 UTC
This just happened again, and once again fsck didn't find any problems in the
filesystem.

I haven't used my wireless card since my last reboot, so that can't be the
cause.  I suspend and resume all the time, but one thing I did yesterday that I
don't usually do is hibernate and resume.

I've taken the kdump-related steps in comment 11, in case that helps.  If there
are any hibernation-related diagnostics I should do, let me know.

Comment 13 Eric Sandeen 2008-01-19 20:15:53 UTC
Sorry I didn't get back to you yet on comment #11.  Honestly, I always have to
read up on how to make kdump work, myself.  :)  So, did you get a system dump
then?  If you can share that it might yield some very good clues!

Thanks,

-Eric

Comment 14 Christopher Beland 2008-01-19 20:54:26 UTC
Alas, no, I hadn't added the kernel argument before the latest oops.  But I've
just hibernated and resumed, so I'm expecting it to happen again at any moment...

Comment 15 Eric Sandeen 2008-01-19 23:14:42 UTC
If you didn't get a dump this time, I think you can test it with echo c >
/proc/sysrq-trigger (maybe preceded by a few "syncs" for the filesystem's
benefit) to see if it's working, before the next fleeting panic.
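
The test described above could be wrapped so that it cannot be run by accident. This is only a sketch: the function name and the `--really` switch are invented for illustration, and the real run needs root and will take the machine down immediately.

```shell
#!/bin/sh
# Deliberately crash the kernel to test the crash dump setup.
# Without "--really" this is a dry run that changes nothing.
trigger_test_crash() {
    if [ "$1" != "--really" ]; then
        echo "dry run: would sync, then write 'c' to /proc/sysrq-trigger"
        return 0
    fi
    # Flush dirty data first, for the filesystem's benefit.
    sync
    sync
    echo c > /proc/sysrq-trigger
}

trigger_test_crash "${1:-}"
```

After the machine comes back up, check whether a dump appeared under /var/crash.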

Hm, and another report somewhat along the same lines...

http://article.gmane.org/gmane.linux.kernel/626582

and a bug I closed due to tainting, but perhaps related:

https://bugzilla.redhat.com/show_bug.cgi?id=208488

Comment 16 Christopher Beland 2008-01-23 17:50:29 UTC
I just read at http://www.ibm.com/developerworks/library/l-fs8.html that some
laptop hard drives throw away their write caches when being put into a low-power
state, which can cause filesystem corruption.  Could this cause memory
corruption when I hibernate, if I am unlucky in my timing?

Comment 17 Eric Sandeen 2008-01-23 17:56:25 UTC
I'd have to read up on how hibernate works, but I thought that at least recent
code did a block device freeze, which should get everything safely on disk...

Comment 18 Christopher Beland 2008-01-23 22:07:31 UTC
The same article says that some hard drives say they have committed things to
disk from write cache when they actually haven't, and problems obviously result
when the two are combined.

Comment 19 Christopher Beland 2008-02-14 19:27:40 UTC
Created attachment 294937 [details]
Latest oops

Another oops, this time at "EIP is at ext3_discard_reservation+0x1c/0x4d
[ext3]".  I didn't get a dump in /var/crash, though, and I'm not sure why.  I
did reboot from a LiveCD (to do an fsck) before rebooting again from my hard
drive.  (fsck -f didn't find any filesystem problems.)

Comment 20 Eric Sandeen 2008-02-14 19:54:42 UTC
Was this also after any suspend activity?

You can use "echo c > /proc/sysrq-trigger" to trigger a "crash" and see if your
crashdump utility is set up properly...

Comment 21 Eric Sandeen 2008-02-14 21:14:55 UTC
*** Bug 426863 has been marked as a duplicate of this bug. ***

Comment 22 Eric Sandeen 2008-02-14 21:40:10 UTC
I'm thinking this is most likely use after free... I wonder if we could get you
set up & running with a kernel which would catch that, with CONFIG_SLAB_DEBUG or
whatever is appropriate for this kernel...
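
To see whether an installed kernel already has slab debugging enabled, its config file can be searched. A minimal sketch, assuming the config is installed as /boot/config-<version>; the option names searched for here (CONFIG_DEBUG_SLAB, CONFIG_SLUB_DEBUG, CONFIG_SLUB_DEBUG_ON) are my assumption about which ones matter for this kernel, since the comment above does not pin one down.

```shell
#!/bin/sh
# List slab-debugging options that are enabled in a kernel config file.
# The option names are an assumption; adjust them for the kernel in question.
slab_debug_options() {
    grep -E '^CONFIG_(DEBUG_SLAB|SLUB_DEBUG|SLUB_DEBUG_ON)=y' "$1"
}

config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    slab_debug_options "$config" || echo "no slab debugging options enabled in $config"
else
    echo "cannot read $config"
fi
```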

Comment 23 Christopher Beland 2008-02-16 00:45:06 UTC
Yes, the latest crash happened after several days of uptime, after which I'd
slept and hibernated and restored several times.

I did "echo c > /proc/sysrq-trigger".  A bunch of text flew by on the console. 
Near the end it looked like an attempted run of "fsck" that couldn't find
/etc/fstab.  I didn't get any files in /var/crash. The kernel line in
/etc/grub.conf I'm using is:

 kernel /boot/vmlinuz-2.6.23.15-137.fc8 ro root=LABEL=/1 rhgb quiet usbcore.autosuspend=1 crashkernel=64M@16M

"/sbin/chkconfig kdump --list" produces:
 kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off
and I'm using runlevel 5.

Did I do something wrong or incompletely?

If someone can package it in an RPM, I'm happy to run any kernel which would
help debug this.

Comment 24 Joachim Frieben 2008-02-17 10:54:09 UTC
Created attachment 295096 [details]
Kernel oops section in /var/log/messages for kernel 2.6.25-0.50.rc2.fc9

This happened on a current "rawhide" x86_64 box after booting kernel
2.6.25-0.50.rc2.fc9. No additional modules such as "madwifi" had been
installed yet. My home partition uses "ext4dev" but looking at the
initial report this does not appear to be at the root of the issue.
Strangely enough, this oops usually occurs when I'm building some RPM as
an ordinary user in /home. The compiler package is gcc-4.3.0-0.9.
I'm not sure when this issue occurred for the first time - I would
guess sometime this month.

Comment 25 Eric Sandeen 2008-02-18 03:13:11 UTC
Joachim, please open a new bug for that.  It looks completely unrelated to this
bug, and in fact is probably an ext4 problem.

Thanks,
-Eric

Comment 26 Eric Sandeen 2008-03-01 21:59:24 UTC
Christopher, can you try running
http://koji.fedoraproject.org/packages/kernel/2.6.23.15/137.fc8/i686/kernel-debug-2.6.23.15-137.fc8.i686.rpm

(yum install kernel-debug, probably)

It's the same version of your kernel but with debugging bells & whistles turned
on.  If you hit it again it might yield more info, though a crashdump would
probably still be best...

-Eric

Comment 27 Christopher Beland 2008-03-02 07:25:33 UTC
OK, I'm running kernel-debug-2.6.23.15-137.fc8.i686.rpm now.

Comment 28 Bug Zapper 2008-11-26 09:22:45 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 29 Bug Zapper 2009-01-09 05:45:26 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.