Bug 524201 - Kernel panic does not reload not kernel
Summary: Kernel panic does not reload not kernel
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kexec-tools
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-09-18 11:00 UTC by Quentin Armitage
Modified: 2010-02-18 13:58 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-02-18 13:58:34 UTC
Type: ---
Embargoed:


Attachments
dmesg log file (37.09 KB, text/plain), 2009-09-21 14:15 UTC, Quentin Armitage
Console output following sysrq invoked panic (2.10 KB, text/plain), 2009-09-21 14:15 UTC, Quentin Armitage
Panic module source code (636 bytes, application/octet-stream), 2009-09-21 14:16 UTC, Quentin Armitage
patch to debug shutdown process (919 bytes, application/octet-stream), 2009-09-21 16:18 UTC, Neil Horman
Patch to remove dma-debug commits (53.50 KB, patch), 2009-09-30 07:53 UTC, Quentin Armitage
kernel config file (95.51 KB, text/plain), 2009-10-05 08:04 UTC, Quentin Armitage

Description Quentin Armitage 2009-09-18 11:00:22 UTC
Description of problem:
If I load a module that calls panic(), or if I do echo c >/proc/sysrq-trigger, the system hangs, and no crashdump kernel is loaded. There is minor screen corruption after the panic, but nothing else happens, and the system stops responding.

Version-Release number of selected component (if applicable):
kexec-tools-2.0.0-27.fc12.i686
kernel-2.6.31-23.fc12.i686

How reproducible:
Always

Steps to Reproduce:
1. echo c >/proc/sysrq-trigger
  
Actual results:
Minor screen corruption, and crashdump kernel does not load. System has totally died.

Expected results:
crashdump kernel loads and dumps old kernel

Additional info:

Comment 1 Neil Horman 2009-09-18 18:14:45 UTC
Can you attach a serial console to the system, configure it to operate with:
console=ttyS<number of port>,38400n8
and see if it records any output?

Also please attach the dmesg log of the normal boot of the kernel prior to calling echo c > /proc/sysrq-trigger

Also, check /proc/sys/kernel/panic_on_oops and /proc/sys/kernel/sysrq to ensure that sysrq is enabled and a panic will result from a sysrq-c.

Also, it would be good to ensure that the kdump kernel is loaded via /sys/kernel/kexec_crash_loaded
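
The three checks requested above can be gathered in one go. This is only a minimal sketch, assuming a POSIX shell; the function name check_kdump_state and its optional root-prefix argument are illustrative and not part of kexec-tools:

```shell
# Print the current value of each setting mentioned above.
# $1 is an optional root prefix (an assumption for illustration, useful
# when inspecting a copied tree); leave it empty to read the live files.
check_kdump_state() {
    root="${1:-}"
    for f in proc/sys/kernel/panic_on_oops \
             proc/sys/kernel/sysrq \
             sys/kernel/kexec_crash_loaded; do
        if [ -r "$root/$f" ]; then
            printf '/%s = %s\n' "$f" "$(cat "$root/$f")"
        else
            printf '/%s = (unreadable)\n' "$f"
        fi
    done
}
```

Calling check_kdump_state with no argument reads the live /proc and /sys files. A value of 1 in kexec_crash_loaded means a crash kernel is currently loaded.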

Comment 2 Quentin Armitage 2009-09-21 14:15:02 UTC
Created attachment 361947
dmesg log file

Comment 3 Quentin Armitage 2009-09-21 14:15:59 UTC
Created attachment 361948
Console output following sysrq invoked panic

Comment 4 Quentin Armitage 2009-09-21 14:16:34 UTC
Created attachment 361949
Panic module source code

Comment 5 Quentin Armitage 2009-09-21 14:21:38 UTC
Re comment #1:

Console output for echo c >/proc/sysrq-trigger attached (id=361948)

dmesg log attached (id=361947)

$ cat /proc/sys/kernel/panic_on_oops
0
$ cat /proc/sys/kernel/sysrq
0
$ cat /sys/kernel/kexec_crash_loaded
1

Setting panic_on_oops and sysrq to 1 appears to make no difference. The console output attached was with both panic_on_oops and sysrq set to 1.

I have also created a module that causes a panic when loaded - source attached (id=361949). The output on the console when it is loaded is:
Kernel panic - not syncing: Panic due to module loaded

This was also loaded when panic_on_oops and sysrq were set to 1.

Comment 6 Neil Horman 2009-09-21 16:18:32 UTC
Created attachment 361971
patch to debug shutdown process

Ok, so it looks like we're not even booting the second kernel. Can you spin your kernel with this patch in place? It should help give us some idea of where we're hanging on shutdown.

Comment 7 Quentin Armitage 2009-09-21 23:01:18 UTC
With the patched kernel, I get the following:
IN CRASH_KEXEC
GOT KEXEC_MUTEX
TRYING TO SETUP KEXEC
Calling MACHINE_CRASH_SHUTDOWN
TRYING MACHINE_KEXEC

and then no further output.

Comment 8 Neil Horman 2009-09-22 13:16:42 UTC
Great, we're into the machine-specific reboot code. What arch/system are you seeing this on?

Comment 9 Quentin Armitage 2009-09-22 13:29:15 UTC
The system is a Dell D800 laptop, i686. The output of uname is:
$ uname -a
Linux samson.armitage.org.uk 2.6.31-33.fc12.i686 #1 SMP Thu Sep 17 15:56:11 EDT 2009 i686 i686 i386 GNU/Linux

Comment 10 Quentin Armitage 2009-09-22 15:14:28 UTC
I'm not sure if this is of any interest, but I also have F-11 installed on the same system. I have booted that, with crashkernel=64M bootparam, and then did echo c >/proc/sysrq-trigger. Although I couldn't see any output messages due to complete screen corruption, (I wasn't running it with a serial console), it was clearly dumping something to disk, and after rebooting there was a /var/crash/2009-09-22...../vmcore file about 2Gb in size.

So it would appear that the crashkernel reboot works on the system for F-11, and that it is a problem confined to Rawhide.

Comment 11 Neil Horman 2009-09-22 16:38:00 UTC
Ok, do me a favor and try to boot the rawhide kernel with F-11's userspace if you could, please. I'd like to eliminate the possibility that this is a user space utility error. If that fails we can assume this is strictly a kernel problem, and I can start a bisect. Give me the version of the most recent F-11 kernel that you have success with, and I'll provide you with a series of kernels so that we can determine where this has broken.

Thanks!

Comment 12 Quentin Armitage 2009-09-22 16:59:56 UTC
I've never done this before, so I assume that I use the Rawhide bootstring, but change the root= to the F-11 root. Is that correct?

It seems to me that there would be a problem with that, since the kernel modules that reside in the root filesystem will not be the right ones. Do I need to copy some kernel specific directories from the Rawhide root filesystem to the F-11 root filesystem? Or have I got the wrong end of the stick?

Comment 13 Neil Horman 2009-09-22 18:25:42 UTC
The easiest way to do it is to boot your F-11 system, and simply install the rawhide kernel rpm there, then reboot and select the new kernel.

Comment 14 Quentin Armitage 2009-09-22 20:47:00 UTC
There seems to be a rather long dependency chain for installing kernel-2.6.31-33.fc12.i686.rpm on my F-11 system. So far, I have got to needing:
audit-libs-2.0-3.fc12.i686.rpm
audit-libs-devel-2.0-3.fc12.i686.rpm
audit-libs-python-2.0-3.fc12.i686.rpm
dash-0.5.5.1-3.fc12.i686.rpm
dracut-002-3.git8eb16b08.fc12.noarch.rpm
dracut-kernel-002-3.git8eb16b08.fc12.noarch.rpm
e2fsprogs-1.41.9-3.fc12.i686.rpm
e2fsprogs-libs-1.41.9-3.fc12.i686.rpm
glibc-2.10.90-23.i686.rpm
glibc-common-2.10.90-23.i686.rpm
glibc-devel-2.10.90-23.i686.rpm
glibc-headers-2.10.90-23.i686.rpm
grubby-7.0.7-1.fc12.i686.rpm
kernel-2.6.31-33.fc12.i686.rpm
kernel-firmware-2.6.31-33.fc12.noarch.rpm
libuuid-2.16-10.fc12.i686.rpm
prelink-0.4.2-2.fc12.i686.rpm
ql2100-firmware-1.19.38-3.fc12.noarch.rpm
ql2200-firmware-2.02.08-3.fc12.noarch.rpm
ql23xx-firmware-3.03.27-3.fc12.noarch.rpm
ql2400-firmware-4.04.09-1.fc12.noarch.rpm
ql2500-firmware-4.04.09-1.fc12.noarch.rpm
util-linux-ng-2.16-10.fc12.i686.rpm

and now it is reporting conflicts between the new packages and other existing packages.

Unless I am missing a point here, I am not sure that installing the Rawhide kernel on F-11 is viable.

Comment 15 Neil Horman 2009-09-23 01:08:47 UTC
The kernel firmware packages are likely in conflict. You don't actually need them all, though. The most direct approach is to simply get the rawhide src rpm, extract it, and do a manual make config; make; make modules_install; make install

Comment 16 Quentin Armitage 2009-09-23 22:53:04 UTC
Sorry for the delay in responding. It took a while to get a kernel build, what with running out of disk space on 1 build, and make config being quite difficult with all the questions it asked.

Anyway, I have built a kernel based on 2.6.31-33.fc12 (i.e. all patches applied, and using the generated config file).

F-11 booting from kernel 2.6.30.5-43.fc11.i586 and kexec-tools 2.0.0.16.fc11 successfully dumps memory (i.e. creates /var/crash/2009-09-..../*) following panic, and reboots.

F-11 booting from my built kernel (2.6.31-33.fc12 built on F-11) with same kexec-tools hangs after panic and does not dump memory.

Comment 17 Neil Horman 2009-09-24 13:34:17 UTC
Ok, that's good to know. So this should be a simple bisection. Are you familiar with cvs? With this CVSROOT:
:pserver:anonymous.redhat.com:/cvs/dist

you can check out the rpms/kernel project with this command:
cvs co rpms/kernel

The rawhide kernel tree is under the rpms/kernel/devel subdirectory, which you can assemble with the make prep target.  The rawhide tree has tags on it for kernels starting from pre 2.6.29 to the latest.  You should be able to bisect down to the kernel that starts failing, and that will give us a good idea of the commits that may have contributed to your failure.

Thanks!

Comment 18 Quentin Armitage 2009-09-24 16:43:41 UTC
I'm not really familiar with cvs, but I can work my way through it (I'm more familiar with SVN).

It seems to me that the CVSROOT should be :pserver:anonymous.redhat.com:/cvs/pkgs rather than :pserver:anonymous.redhat.com:/cvs/dist; is that correct? Using the pkgs version, I only get kernels up to FC-6, and the devel version is dated sometime in 2007.

I have done a cvs co rpms/kernel/F-12, since as far as I can see that is where the current Rawhide kernels are, and this seems to provide the HEAD version only. What I am not clear about is how to get versions earlier than HEAD.

When I do a make prep i686, will that also make the PAE version, and if so, how can I stop the PAE version being built? A kernel build on my system takes about 2 hours, so I am clearly keen to minimize what needs to be done (is there any way I could do scratch builds on Koji to speed things up?).

I assume that what we are wanting to do is build the Rawhide kernels on the F-11 system, as before. I had wondered about installing earlier (F12) Rawhide kernels on my Rawhide system and trying the binary chop that way (since it would avoid having to rebuild the kernels), but I installed the last F12 Rawhide 2.6.30 stable kernel from Koji (2.6.30-6.fc12.i586), and that still exhibited the problem, so I guess that's not a useful way forward.

Comment 19 Neil Horman 2009-09-24 17:02:55 UTC
The dist tree is the one you want. pkgs is old; dist has kernels up through today. Some of the naming history should be available on the Fedora wiki if you're interested in how that came to be.

If you want older versions, you need to use the:
 cvs log kernel.spec
command (you can get the log of any file you like, but the spec has all the labels). Anyway, that shows you all the labels, which are conveniently named after the kernel versions they match with. Then you use the:
cvs update -r <label>
command to checkout a particular version.  From there you do a:
make prep
which will assemble the source tree from that cvs version, which you can cd to and build as per a normal kernel.  Note you don't make prep a particular config, the individual configs are extracted in the make prep stage, and placed in the kernel-<version>/linux-<version>/configs subdirectory, where you will find all the appropriate configs for all the kernel flavors.  Just copy the appropriate config to .config at the top of the kernel tree and type:
make oldconfig
that will set up the tree you have extracted with the config you want automatically (it shouldn't ask you for any input, like doing a normal make config does). From there you can do a make; make modules_install; make install like with any kernel build.
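
The per-label build steps above could be strung together roughly as follows. This is a sketch under stated assumptions, not a tested recipe: it assumes a POSIX shell started in the rpms/kernel checkout directory, the helper names run and build_label are made up for illustration, and the label, tree path and config name must be supplied by hand because they depend on the version being built.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build one tagged kernel, following the steps described above.
#   $1 = CVS label taken from `cvs log kernel.spec`
#   $2 = path of the prepared tree (kernel-<version>/linux-<version>)
#   $3 = config file name inside that tree's configs/ subdirectory
build_label() {
    label="$1"; tree="$2"; config="$3"
    run cvs update -r "$label" &&
    run make prep &&
    run cp "$tree/configs/$config" "$tree/.config" &&
    run make -C "$tree" oldconfig &&
    run make -C "$tree" &&
    run make -C "$tree" modules_install &&
    run make -C "$tree" install
}
```

Setting DRY_RUN=1 before calling build_label prints the command sequence without running anything, which is a cheap way to check the paths before committing to a two-hour build.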

If you would like to use koji, you of course can do scratch builds there, just issue a:
make test-srpm
after you check out a given version from cvs, and that will make a srpm for you, suitable for uploading to koji.  I had assumed since you were having dependency issues, you would want to avoid that, but if you're doing this all on your rawhide userspace, that should be fine (as you mention above).  I would suggest trying to boot the kernel version that worked on F-11 first on your rawhide user space to make sure you have a working start point.

As long as the last working kernel that you just built boots as a kdump kernel under F-12, I think that's a fine way forward. The goal is just to find the kernel where this failure started occurring. It looks like there are a few hundred labels between the last working kernel and the most recent. A bisect of that should only require about 8 or 9 kernel builds.
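
The figure of 8 or 9 builds follows from halving the candidate range on every build. A small sketch of that arithmetic, assuming a POSIX shell (the function name bisect_steps is made up for illustration):

```shell
# Worst-case number of test builds needed to bisect n candidates,
# i.e. the smallest number of halvings that covers n: ceil(log2(n)).
bisect_steps() {
    n="$1"
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}
```

For example, bisect_steps 300 prints 9 and bisect_steps 256 prints 8.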

Comment 20 Quentin Armitage 2009-09-25 14:00:33 UTC
I have installed kernel-2.6.30.6-53.fc11.i586 (the latest F-11 kernel, which successfully creates crashdumps) on my Rawhide system (I have found that I can install the kernels using rpm -i --oldpackage, whereas the problem I had before was when trying to install with yum). Although it doesn't successfully create a crashdump, the kernel does reboot after a panic, and starts executing the init script in the initrd. So I am seeing that as working from the point of view of what we are trying to do at the moment.

The problem is that there is no progression from there to F12/Rawhide. The only real progression I can see is that the first F12/Rawhide kernel (that built) was 2.6.30-0.34.rc0.git8.fc12, and this was forked from 2.6.29-21.fc11. I tried 2.6.30-0.34.rc0.git8.fc12 on my Rawhide user-space, and that failed to reboot after a panic, as did 2.6.29-21.fc11. I then tried 2.6.29-21.fc11 on my F11 user-space, and that also failed to reboot after a panic.

As far as I can see, we have:
On F-11: 2.6.29-21.fc11 (broken) -> 2.6.29.6-217.fc11 (works) -> 2.6.30.5-43.fc11 (works) -> 2.6.30.6-53.fc11 (works)
On Rawhide: 2.6.29-21.fc11 (broken) -> 2.6.30-0.34.rc0.git8.fc12 (broken) -> 2.6.31-33.fc12 (broken)

So it would seem that somewhere along the line in F-11, but after the fork for F-12/Rawhide, the panic reboot started working.
For Rawhide, we have a sequence where we don't have any known working point. There may be some kernels somewhere in the middle of the sequence that worked, but then again, it could be that none of them does.

Unfortunately I cannot see any sequence that we can bisect to find a transition from working to broken. We could, presumably, bisect the F-11 kernels to find the transition from broken to working, and then see if that change has been applied to the Rawhide kernels. I'm not sure if that is a useful thing to do or not.
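
A bisect for a broken-to-working transition like the one proposed here can be sketched as a binary search over an ordered list of kernel versions. This is only an illustration, assuming a POSIX shell; first_working is a made-up name, and the test command passed to it is hypothetical, since in practice each check means booting the kernel, triggering a panic and noting the result by hand.

```shell
# Find the first working kernel in a list ordered oldest to newest,
# assuming every kernel before some point is broken and every kernel
# from that point on works.
#   $1 = file with one kernel version per line
#   $2 = command that takes a version and exits 0 if the panic reboot
#        works (hypothetical; it must not write to stdout)
first_working() {
    list="$1"; works="$2"
    lo=1
    hi=$(grep -c '' "$list")
    # Invariant: the answer lies in lines lo..hi; the last line is
    # assumed to work, so it is never tested.
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        version=$(sed -n "${mid}p" "$list")
        if "$works" "$version"; then
            hi=$mid
        else
            lo=$((mid + 1))
        fi
    done
    sed -n "${lo}p" "$list"
}
```

The same search finds a working-to-broken transition if the test command's exit status is inverted.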

Comment 21 Neil Horman 2009-09-25 14:10:23 UTC
Actually I think that's exactly the right thing to do. It sounds like some upstream fix got applied to F-11, but it never made it to rawhide, and the current rawhide kernel hasn't reached that point in the devel stream.

So I think that bisect is your option.  The only other choice I could see would be, get an upstream kernel, build it with the rawhide config, and if that works, just use that until such time as rawhide catches up to that point upstream.

Comment 22 Quentin Armitage 2009-09-25 18:25:56 UTC
Using my F-11 user-space, I have done a bisect of the F-11 kernels that were successfully built in Koji. The result is that kernel-2.6.29.1-42.rc1.fc11.i586 is the last kernel that didn't reboot on panic, and kernel-2.6.29.1-46.fc11.i586 is the first kernel that does reboot; it also successfully creates a crashdump and then reboots again.

I hope this gives you enough to work on now. I am of course happy to try any patches or whatever else might be necessary.

Comment 23 Neil Horman 2009-09-25 19:56:42 UTC
Interesting.  A diff on the F-11 kernel spec file on those 2 tags shows that these 2 patches were dropped in that time frame:

linux-2.6-debug-dma-api.patch
dma-api-debug-fixes.patch

There were some other build config changes, but they relate to debug builds, so I think these 2 patches are the key.

They relate to upstream commits f2f45e5f3c921c73c913e9a9c00f21ec01c86b4d and 8ddc951c73cbc317148c0b9973dde81eece57e4c respectively (the former might have some supporting commits). The comments indicate that those were removed because they were merged upstream, but I don't see us pulling them in with the latest F-11 kernels, so it seems like something about them might be causing us problems. I would suggest that you extract those two commits from an upstream tree, and then reverse apply them to the latest rawhide kernel (using the patch -R option). If we remove those patches, and the resulting kernel allows kdump to work, I think we can start investigating what it is about those patches that might have broken us.

Comment 24 Quentin Armitage 2009-09-30 07:24:20 UTC
I have attempted the suggestion of reverse applying the two commits to the latest Rawhide kernel, but there are the following issues.

Firstly, the commits cannot simply be applied with patch -R due to subsequent changes to the source files upstream. The subsequent changes to the e1000e driver (re commit 8ddc951c73cbc317148c0b9973dde81eece57e4c) are such that I cannot work out how to reverse the specific commit; on the other hand, I do not have an e1000e in my system, so I suspect that it is unlikely to be the culprit.

The supporting commits for f2f45e5f3c921c73c913e9a9c00f21ec01c86b4d are:
187f9c3f05373df4f7cbae2e656450acdbba7558
2118d0c548e8a2205e1a29eb5b89e5f2e9ae2c8b
5ee00bd4691e7364bb7b62e2068d473cd5cb9320
30dfa90cc8c4c9621d8d5aa9499f3a5df3376307
3b1e79ed734f58ac41ca0a287ff03ca355f120ad
6bf078715c1998d4d10716251cc10ce45908594c
59d3daafa17265f01149df8eab3fb69b9b42cb2e
788dcfa6f17424695823152890d30da09f62f9c3
2d62ece14fe04168a7d16688ddd2d17ac472268c
f62bc980e6fd26434012c0d5676ecb17179d9ee4
972aa45ceaf65376f33aa75958fcaefc9e752fa4
6bfd4498764d6201399849d2e80fda95db7742c0
b9d2317e0c4aed02afd20022083b2a485289605d
948408ba3e2a67ed0f95e18ed5be1c622c2c5fc3
a31fba5d68cebf8f5fefd03e079dab94875e25f5
ac26c18bd35d982d1ba06020a992b1085fefc3e2
and later commit 1bf20f0dc5629032ddd07617139d9fbca66c1642 is also needed to be reversed since it uses the features provided by the above commits.
Again, these commits cannot simply be reversed with patch -R, but I have produced a patch that effectively reverses all these out, which I will attach, but it's probably not that interesting.

The upshot is that reversing this latter set of commits out of 2.6.31.1-48.fc12, while leaving 8ddc951c73cbc317148c0b9973dde81eece57e4c (e1000e) in place, does not resolve the problem: the resulting kernel still does not reboot after a panic.

Unless you have any other thoughts, I think the way forward is to do a bisect on the changes from kernel-2.6.29.1-42.rc1.fc11.i586 to kernel-2.6.29.1-46.fc11.i586, starting with reapplying the above commits, and finding which change broke it.

I expect it will take me a few days to work through that.

BTW, can one do incremental changes and build of the kernel? By that, I mean if I apply a patch to the kernel, is there a way to rebuild it by only recompiling the module affected, or does a full rebuild need to be done every time?

Comment 25 Quentin Armitage 2009-09-30 07:53:10 UTC
Created attachment 363143
Patch to remove dma-debug commits

Comment 26 Neil Horman 2009-09-30 21:33:18 UTC
I think your approach sounds like a fine idea. Let me know.

Comment 27 Quentin Armitage 2009-10-04 11:52:56 UTC
Got there at last, after a few red herrings.

Applying the config-nodebug from kernel-2.6.29.1-42.rc1.fc11 to kernel-2.6.29.1-46.fc11 causes the panic reboot to stop working.

Comment 28 Neil Horman 2009-10-05 00:04:14 UTC
Ok, I'll look more closely at that.  Just so that I'm clear:

1) When you say you applied config-nodebug to the -46 kernel, you understand that that's just a partial config, right? You should be applying a config from the configs subdirectory after you do a make prep. It will have a name in the format kernel-<version>-<arch>-[PAE|largesmp|debug|etc].config

2) If you apply a config from the -46 tree to the -46, does it also fail (i.e. is the config application a don't-care state)?

Comment 29 Quentin Armitage 2009-10-05 08:04:22 UTC
Created attachment 363644
kernel config file

I have built the kernel using rpmbuild. First I copied the -46 sources to the rpmbuild/SOURCES directory, and the kernel.spec to the rpmbuild/SPECS directory. I then copied config-nodebug from the -42 sources to the rpmbuild/SOURCES directory (i.e. overwriting the -46 version).

I edited the rpmbuild/SPECS/kernel.spec file to make the following changes (the last two changes were simply to save build time):
< # % define buildid .local
---
> %define buildid .config_nodebug
94c94
< %define with_debuginfo %{?_without_debuginfo: 0} %{?!_without_debuginfo: 1}
---
> %define with_debuginfo %{?_without_debuginfo: 0} %{?!_without_debuginfo: 0}
200c200
< %define with_pae 1
---
> %define with_pae 0

and then executed rpmbuild -bb --target i586 --nodeps rpmbuild/SPECS/kernel.spec.

The /boot/config-2.6.29.1-46.config_nodebug.fc11.i586 file attached shows that the contents of the -42 config-nodebug have been incorporated into the kernel.

Comment 30 Quentin Armitage 2009-10-05 08:17:26 UTC
Apologies, I didn't answer question 2 in comment #28, but I am not sure it applies given the description in #29 above.

As a further example of what I did, again using rpmbuild and the method I described above, I built a kernel based on the -46 sources, but added the linux-2.6-debug-dma-api.patch and dma-api-debug-fixes.patch patches from the -42 sources (editing kernel.spec appropriately). That kernel successfully rebooted after a panic. I then copied the -42 config-nodebug into the rpmbuild/SOURCES directory, built that kernel and it did not reboot after a panic.

After this, I produced the kernel described in #29 above to finally show that the only change needed to the -46 sources was the application of the -42 config-nodebug.

Comment 31 Neil Horman 2009-10-05 14:02:24 UTC
Ah, ok, I understand now. Thank you, I'll take a closer look at the sources based on what you've told me here.

Comment 32 Neil Horman 2009-10-19 20:18:58 UTC
So those two kernels use versions 1.31 and 1.32 of config-nodebug. One of the changes between those two versions is that CONFIG_DMA_API_DEBUG gets turned off. Could you by any chance try a -46 build with just CONFIG_DMA_API_DEBUG toggled in that version of the config-nodebug file? Thanks!
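
Toggling a single option in a config fragment can be done by hand in an editor; as a rough sketch, assuming a POSIX shell and the usual kernel config line formats (OPTION=y, or "# OPTION is not set"), a helper might look like this. The name set_config_option is made up for illustration, and which direction to toggle is left to the tester.

```shell
# Set one option in a kernel config fragment to enabled (y) or disabled (n).
#   $1 = config file, $2 = option name, $3 = y or n
set_config_option() {
    file="$1"; opt="$2"; val="$3"
    if [ "$val" = y ]; then
        line="$opt=y"
    else
        line="# $opt is not set"
    fi
    scratch="$file.new.$$"
    # Drop any existing setting for the option, then append the new one.
    sed -e "/^$opt=/d" -e "/^# $opt is not set\$/d" "$file" > "$scratch" &&
    echo "$line" >> "$scratch" &&
    mv "$scratch" "$file"
}
```

For instance, set_config_option config-nodebug CONFIG_DMA_API_DEBUG y would enable the option in a local copy of the file before rebuilding.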

Comment 33 Bug Zapper 2009-11-16 12:36:13 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 34 Neil Horman 2010-02-18 13:58:34 UTC
Closing due to lack of response.

