Description of problem: If I load a module that calls panic(), or if I do echo c >/proc/sysrq-trigger, the system hangs and no crashdump kernel is loaded. There is minor screen corruption after the panic, but nothing else happens, and the system stops responding. Version-Release number of selected component (if applicable): kexec-tools-2.0.0-27.fc12.i686 kernel-2.6.31-23.fc12.i686 How reproducible: Always Steps to Reproduce: 1. echo c >/proc/sysrq-trigger Actual results: Minor screen corruption, and the crashdump kernel does not load. The system has totally died. Expected results: The crashdump kernel loads and dumps the old kernel.
Can you attach a serial console to the system, configure it to operate with: console=ttyS<number of port>,38400n8 and see if it records any output? Also please attach the dmesg log of a normal boot of the kernel prior to calling echo c > /proc/sysrq-trigger. Also check /proc/sys/kernel/panic_on_oops and /proc/sys/kernel/sysrq to ensure that sysrq is enabled and that a panic will result from a sysrq-c. Finally, it would be good to verify that the kdump kernel is loaded via /sys/kernel/kexec_crash_loaded.
Created attachment 361947 [details] dmesg log file
Created attachment 361948 [details] Console output following sysrq invoked panic
Created attachment 361949 [details] Panic module source code
Re comment #1: Console output for echo c >/proc/sysrq-trigger attached (id=361948). dmesg log attached (id=361947). $ cat /proc/sys/kernel/panic_on_oops 0 $ cat /proc/sys/kernel/sysrq 0 $ cat /sys/kernel/kexec_crash_loaded 1 Setting panic_on_oops and sysrq to 1 appears to make no difference; the console output attached was captured with both set to 1. I have also created a module that causes a panic when loaded - source attached (id=361949). The output on the console when it is loaded is: Kernel panic - not syncing: Panic due to module loaded This too was with panic_on_oops and sysrq set to 1.
Created attachment 361971 [details] patch to debug shutdown process Ok, so it looks like we're not even booting the second kernel. Can you spin your kernel with this patch in place? It should help give us some idea of where we're hanging on shutdown.
With the patched kernel, I get the following: IN CRASH_KEXEC GOT KEXEC_MUTEX TRYING TO SETUP KEXEC Calling MACHINE_CRASH_SHUTDOWN TRYING MACHINE_KEXEC and then no further output.
Great, we're into the machine-specific reboot code. What arch/system are you seeing this on?
The system is a Dell D800 laptop, i686. The output of uname is: $ uname -a Linux samson.armitage.org.uk 2.6.31-33.fc12.i686 #1 SMP Thu Sep 17 15:56:11 EDT 2009 i686 i686 i386 GNU/Linux
I'm not sure if this is of any interest, but I also have F-11 installed on the same system. I booted that, with the crashkernel=64M boot parameter, and then did echo c >/proc/sysrq-trigger. Although I couldn't see any output messages due to complete screen corruption (I wasn't running it with a serial console), it was clearly dumping something to disk, and after rebooting there was a /var/crash/2009-09-22...../vmcore file about 2 GB in size. So it would appear that the crashkernel reboot works on this system for F-11, and that the problem is confined to Rawhide.
Ok, do me a favor and try to boot the rawhide kernel with F-11's userspace if you could, please. I'd like to eliminate the possibility that this is a user-space utility error. If that fails we can assume this is strictly a kernel problem, and I can start a bisect. Give me the version of the most recent F-11 kernel that you have success with, and I'll provide you a series of kernels so that we can determine where this has broken. Thanks!
I've never done this before, so I assume that I use the Rawhide bootstring, but change the root= to the F-11 root. Is that correct? It seems to me that there would be a problem with that, since the kernel modules that reside in the root filesystem will not be the right ones. Do I need to copy some kernel specific directories from the Rawhide root filesystem to the F-11 root filesystem? Or have I got the wrong end of the stick?
The easiest way to do it is to boot your F-11 system, and simply install the rawhide kernel rpm there, then reboot and select the new kernel.
There seems to be a rather long dependency chain for installing kernel-2.6.31-33.fc12.i686.rpm on my F-11 system. So far, I have got to needing: audit-libs-2.0-3.fc12.i686.rpm audit-libs-devel-2.0-3.fc12.i686.rpm audit-libs-python-2.0-3.fc12.i686.rpm dash-0.5.5.1-3.fc12.i686.rpm dracut-002-3.git8eb16b08.fc12.noarch.rpm dracut-kernel-002-3.git8eb16b08.fc12.noarch.rpm e2fsprogs-1.41.9-3.fc12.i686.rpm e2fsprogs-libs-1.41.9-3.fc12.i686.rpm glibc-2.10.90-23.i686.rpm glibc-common-2.10.90-23.i686.rpm glibc-devel-2.10.90-23.i686.rpm glibc-headers-2.10.90-23.i686.rpm grubby-7.0.7-1.fc12.i686.rpm kernel-2.6.31-33.fc12.i686.rpm kernel-firmware-2.6.31-33.fc12.noarch.rpm libuuid-2.16-10.fc12.i686.rpm prelink-0.4.2-2.fc12.i686.rpm ql2100-firmware-1.19.38-3.fc12.noarch.rpm ql2200-firmware-2.02.08-3.fc12.noarch.rpm ql23xx-firmware-3.03.27-3.fc12.noarch.rpm ql2400-firmware-4.04.09-1.fc12.noarch.rpm ql2500-firmware-4.04.09-1.fc12.noarch.rpm util-linux-ng-2.16-10.fc12.i686.rpm and now it is reporting conflicts between the new packages and other existing packages. Unless I am missing a point here, I am not sure that installing the Rawhide kernel on F-11 is viable.
The kernel firmware stuff is likely in conflict. You don't actually need them all, though. Your most direct bet is to simply get the rawhide src rpm, extract it, and do a manual make config; make; make modules_install; make install.
Sorry for the delay in responding. It took a while to get a kernel built, what with running out of disk space on one build, and make config being quite difficult with all the questions it asked. Anyway, I have built a kernel based on 2.6.31-33.fc12 (i.e. all patches applied, and using the generated config file). F-11 booting from kernel 2.6.30.5-43.fc11.i586 and kexec-tools 2.0.0.16.fc11 successfully dumps memory (i.e. creates /var/crash/2009-09-..../*) following a panic, and reboots. F-11 booting from my built kernel (2.6.31-33.fc12 built on F-11) with the same kexec-tools hangs after a panic and does not dump memory.
Ok, that's good to know. So this should be a simple bisection. Are you familiar with cvs? With this CVSROOT: :pserver:anonymous.redhat.com:/cvs/dist you can check out the rpms/kernel project with this command: cvs co rpms/kernel The rawhide kernel tree is under the rpms/kernel/devel subdirectory, which you can assemble with the make prep target. The rawhide tree has tags on it for kernels starting from pre-2.6.29 to the latest. You should be able to bisect down to the kernel that starts failing, and that will give us a good idea of the commits that may have contributed to your failure. Thanks!
I'm not really familiar with cvs, but I can work my way through it (I'm more familiar with SVN). It seems to me that the CVSROOT should be :pserver:anonymous.redhat.com:/cvs/pkgs rather than :pserver:anonymous.redhat.com:/cvs/dist; is that correct? Using the pkgs version, I only get kernels up to FC-6, and the devel version is dated sometime in 2007. I have done a cvs co rpms/kernel/F-12, since as far as I can see that is where the current Rawhide kernels are, and this seems to provide the HEAD version only. What I am not clear about is how to get versions earlier than HEAD. When I do a make prep i686, will that also make the PAE version, and if so, how can I stop the PAE version being built? A kernel build on my system takes about 2 hours, so I am clearly keen to minimize what needs to be done (is there any way I could do scratch builds on Koji to speed things up?). I assume that what we are wanting to do is build the Rawhide kernels on the F-11 system, as before. I had wondered about installing earlier (F12) Rawhide kernels on my Rawhide system and trying the binary chop that way (since it would avoid having to rebuild the kernels), but I installed the last F12 Rawhide 2.6.30 stable kernel from Koji (2.6.30-6.fc12.i586), and that still exhibited the problem, so I guess that's not a useful way forward.
The dist tree is the one you want. pkgs is old; dist has kernels up through today. Some of the naming history should be available on the fedora wiki if you're interested in how that came to be. If you want older versions, you need to use the: cvs log kernel.spec command (you can get the log of any file you like, but the spec has all the labels). Anyhow, that shows you all the labels, which are conveniently named after the kernel versions they match. Then you use the: cvs update -r <label> command to check out a particular version. From there you do a: make prep which will assemble the source tree from that cvs version, which you can cd to and build as per a normal kernel. Note you don't make prep a particular config; the individual configs are extracted in the make prep stage and placed in the kernel-<version>/linux-<version>/configs subdirectory, where you will find all the appropriate configs for all the kernel flavors. Just copy the appropriate config to .config at the top of the kernel tree and type: make oldconfig That will set up the tree you have extracted with the config you want automatically (it shouldn't ask you for any input, like doing a normal make config does). From there you can do a make; make modules_install; make install like with any kernel build. If you would like to use koji, you can of course do scratch builds there; just issue a: make test-srpm after you check out a given version from cvs, and that will make an srpm for you, suitable for uploading to koji. I had assumed, since you were having dependency issues, that you would want to avoid that, but if you're doing this all on your rawhide userspace, that should be fine (as you mention above). I would suggest first trying to boot the kernel version that worked on F-11 on your rawhide user space, to make sure you have a working start point. As long as the last working kernel that you just built boots as a kdump kernel under F-12, I think that's a fine way forward.
The goal is just to find the kernel where this failure started occurring. It looks like there are a few hundred labels between the last working kernel and the most recent; a bisect of that should only require about 8 or 9 kernel builds.
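The 8-or-9-builds estimate is just binary search: with a known-broken and a known-working endpoint, each build halves the candidate range. A sketch of that search loop in shell, with invented label names and a stubbed-out is_broken in place of the real cvs-update/build/boot cycle:

```shell
# Bisection over an ordered list of CVS labels (names invented for
# illustration). In practice is_broken would "cvs update -r <label>",
# build and install the kernel, and test whether kdump fires on panic.
set -- kernel-2_6_29_1-42 kernel-2_6_29_1-43 kernel-2_6_29_1-44 \
       kernel-2_6_29_1-45 kernel-2_6_29_1-46

# Stub: pretend every build up to -44 is broken.
is_broken() {
    case "$1" in
        *-4[234]) return 0 ;;  # broken
        *)        return 1 ;;  # works
    esac
}

# 0-based positional-parameter indexing helper.
nth() { n=$1; shift; shift "$n"; printf '%s\n' "$1"; }

lo=0; hi=$(( $# - 1 ))        # lo: known broken, hi: known working
while [ $(( hi - lo )) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if is_broken "$(nth "$mid" "$@")"; then
        lo=$mid               # failure moved forward
    else
        hi=$mid               # success moved backward
    fi
done
last_broken=$(nth "$lo" "$@")
first_working=$(nth "$hi" "$@")
echo "last broken:   $last_broken"
echo "first working: $first_working"
```

Each iteration costs one kernel build, so a range of a few hundred labels converges in log2(n), i.e. 8 or 9 builds.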
I have installed kernel-2.6.30.6-53.fc11.i586 (the latest F-11 kernel, which successfully creates crashdumps) on my Rawhide system (I have found that I can install the kernels using rpm -i --oldpackage, whereas the problem I had before was when trying to install with yum). Although it doesn't successfully create a crashdump, the kernel does reboot after a panic and starts executing the init script in the initrd, so I am seeing that as working from the point of view of what we are trying to do at the moment. The problem is that there is no progression from there to F12/Rawhide. The only real progression I can see is that the first F12/Rawhide kernel (that built) was 2.6.30-0.34.rc0.git8.fc12, and this was forked from 2.6.29-21.fc11. I tried 2.6.30-0.34.rc0.git8.fc12 on my Rawhide user-space, and that failed to reboot after a panic, as did 2.6.29-21.fc11. I then tried 2.6.29-21.fc11 on my F11 user-space, and that also failed to reboot after a panic. As far as I can see, we have: On F-11: 2.6.29-21.fc11 (broken) -> 2.6.29.6-217.fc11 (works) -> 2.6.30.5-43.fc11 (works) -> 2.6.30.6-53.fc11 (works) On Rawhide: 2.6.29-21.fc11 (broken) -> 2.6.30-0.34.rc0.git8.fc12 (broken) -> 2.6.31-33.fc12 (broken) So it would seem that somewhere along the line in F-11, but after the fork for F-12/Rawhide, the panic reboot started working. For Rawhide, we have a sequence with no known working point. There may be some kernels somewhere in the middle of the sequence that worked, but then again, it could be that none of them does. Unfortunately I cannot see any sequence that we can bisect to find a transition from working to broken. We could, presumably, bisect the F-11 kernels to find the transition from broken to working, and then see if that change has been applied to the Rawhide kernels. I'm not sure if that is a useful thing to do or not.
Actually, I think that's exactly the right thing to do. It sounds like some upstream fix got applied to F-11 but never made it to rawhide, and the current rawhide kernel hasn't reached that point in the devel stream. So I think that bisect is your option. The only other choice I could see would be to get an upstream kernel, build it with the rawhide config, and if that works, just use that until such time as rawhide catches up to that point upstream.
Using my F-11 user-space, I have done a bisect of the F-11 kernels that were successfully built in Koji. The result is that kernel-2.6.29.1-42.rc1.fc11.i586 is the last kernel that didn't reboot on panic, and kernel-2.6.29.1-46.fc11.i586 is the first kernel that does reboot; it also successfully creates a crashdump and then reboots again. I hope this gives sufficient information to work on now. I am of course happy to try any patches or whatever might be necessary.
Interesting. A diff of the F-11 kernel spec file at those 2 tags shows that these 2 patches were dropped in that time frame: linux-2.6-debug-dma-api.patch dma-api-debug-fixes.patch There were some other build config changes, but they relate to debug builds, so I think these 2 patches are the key. They relate to upstream commits f2f45e5f3c921c73c913e9a9c00f21ec01c86b4d and 8ddc951c73cbc317148c0b9973dde81eece57e4c respectively (the former might have some supporting commits). The comments indicate that those were removed because they were merged upstream, but I don't see us pulling them in with the latest F-11 kernels, so it seems like something about them might be causing us problems. I would suggest that you extract those two commits from an upstream tree and then reverse apply them to the latest rawhide kernel (using the patch -R option). If we remove those patches and the resulting kernel allows kdump to work, I think we can start investigating what it is about those patches that might have broken us.
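For reference, reverse-applying a change with patch -R works like this on a tiny self-contained example (file names and contents invented; in the real case commit.patch would be the output of git show on the upstream commit, applied from the top of the kernel tree with -p1):

```shell
# Demonstrate forward- and reverse-applying a unified diff with patch.
# Everything here is synthetic; "commit.patch" stands in for the
# extracted upstream commit.
workdir=$(mktemp -d)
cd "$workdir"

printf 'line one\nline two\n' > file.orig
printf 'line one\nline two patched\n' > file.new
diff -u file.orig file.new > commit.patch || true  # diff exits 1 on differences

cp file.orig file.txt
patch file.txt < commit.patch      # apply: file.txt now matches file.new
patch -R file.txt < commit.patch   # reverse-apply: back to the original
cmp file.txt file.orig && echo "reverse apply restored the original"
```

If later upstream changes touch the same lines, the reverse application fails with rejects, which is exactly the situation described in the following comment; the rejects then have to be resolved by hand.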
I have attempted the suggestion of reverse applying the two commits to the latest Rawhide kernel, but there are the following issues. Firstly, the commits cannot simply be applied with patch -R due to subsequent changes to the source files upstream. The subsequent changes to the e1000e driver (re commit 8ddc951c73cbc317148c0b9973dde81eece57e4c) are such that I cannot work out how to reverse the specific commit; on the other hand, I do not have an e1000e in my system, so I suspect that is unlikely to be the culprit. The supporting commits for f2f45e5f3c921c73c913e9a9c00f21ec01c86b4d are: 187f9c3f05373df4f7cbae2e656450acdbba7558 2118d0c548e8a2205e1a29eb5b89e5f2e9ae2c8b 5ee00bd4691e7364bb7b62e2068d473cd5cb9320 30dfa90cc8c4c9621d8d5aa9499f3a5df3376307 3b1e79ed734f58ac41ca0a287ff03ca355f120ad 6bf078715c1998d4d10716251cc10ce45908594c 59d3daafa17265f01149df8eab3fb69b9b42cb2e 788dcfa6f17424695823152890d30da09f62f9c3 2d62ece14fe04168a7d16688ddd2d17ac472268c f62bc980e6fd26434012c0d5676ecb17179d9ee4 972aa45ceaf65376f33aa75958fcaefc9e752fa4 6bfd4498764d6201399849d2e80fda95db7742c0 b9d2317e0c4aed02afd20022083b2a485289605d 948408ba3e2a67ed0f95e18ed5be1c622c2c5fc3 a31fba5d68cebf8f5fefd03e079dab94875e25f5 ac26c18bd35d982d1ba06020a992b1085fefc3e2 and the later commit 1bf20f0dc5629032ddd07617139d9fbca66c1642 also needs to be reversed, since it uses the features provided by the above commits. Again, these commits cannot simply be reversed with patch -R, but I have produced a patch that effectively reverses all of them out, which I will attach, though it's probably not that interesting. The upshot is that reversing out this latter set of commits, but leaving 8ddc951c73cbc317148c0b9973dde81eece57e4c (e1000e), on 2.6.31.1-48.fc12 does not resolve the problem, and so the resulting kernel does not reboot after a panic.
Unless you have any other thoughts, I think the way forward is to do a bisect on the changes from kernel-2.6.29.1-42.rc1.fc11.i586 to kernel-2.6.29.1-46.fc11.i586, starting with reapplying the above commits, and finding which change broke it. I expect it will take me a few days to work through that. BTW, can one do incremental changes and builds of the kernel? By that I mean, if I apply a patch to the kernel, is there a way to rebuild it by only recompiling the module affected, or does a full rebuild need to be done every time?
Created attachment 363143 [details] Patch to remove dma-debug commits
I think your approach sounds like a fine idea. Let me know.
Got there at last, after a few red herrings. Applying the config-nodebug from kernel-2.6.29.1-42.rc1.fc11 to kernel-2.6.29.1-46.fc11 causes the panic reboot to stop working.
Ok, I'll look more closely at that. Just so that I'm clear: 1) When you say you apply the config-nodebug to the -46 kernel, you understand that that's just a partial config, right? You should be applying a config from the configs subdirectory after you do a make prep; it will have a name in the format kernel-<version>-<arch>-[PAE|largesmp|debug|etc].config 2) If you apply a config from the -46 tree to the -46 kernel, does it also fail (i.e. is the config application a don't-care state)?
Created attachment 363644 [details] kernel config file I have built the kernel using rpmbuild. First I copied the -46 sources to the rpmbuild/SOURCES directory, and the kernel.spec to the rpmbuild/SPECS directory. I then copied config-nodebug from the -42 sources to the rpmbuild/SOURCES directory (i.e. overwriting the -46 version). I edited the rpmbuild/SPECS/kernel.spec file to make the following changes (the last two changes were simply to save build time): < # % define buildid .local --- > %define buildid .config_nodebug 94c94 < %define with_debuginfo %{?_without_debuginfo: 0} %{?!_without_debuginfo: 1} --- > %define with_debuginfo %{?_without_debuginfo: 0} %{?!_without_debuginfo: 0} 200c200 < %define with_pae 1 --- > %define with_pae 0 and then executed rpmbuild -bb --target i586 --nodeps rpmbuild/SPECS/kernel.spec. The /boot/config-2.6.29.1-46.config_nodebug.fc11.i586 file attached shows that the contents of the -42 config-nodebug have been incorporated into the kernel.
Apologies, I didn't answer question 2 in comment #28, but I am not sure it applies given the description in #29 above. As a further example of what I did, again using rpmbuild and the method I described above, I built a kernel based on the -46 sources, but added the linux-2.6-debug-dma-api.patch dma-api-debug-fixes.patch patches from the -42 sources (editing kernel.spec appropriately). That kernel successfully rebooted after a panic. I then copied the -42 config-nodebug into the rpmbuild/SOURCES directory, built that kernel and it did not reboot after a panic. After this, I produced the kernel described in #29 above to finally show that the only change needed to the -46 sources was the application of the -42 config-nodebug.
Ah, ok, I understand now. Thank you; I'll take a closer look at the sources based on what you've told me here.
So those two kernels use versions 1.31 and 1.32 of config-nodebug. One of the changes between those two versions is that CONFIG_DMA_API_DEBUG gets turned off. Could you by any chance try a -46 build with just CONFIG_DMA_API_DEBUG toggled in that version of the config-nodebug file? Thanks!
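Toggling a single option in a config-nodebug style file can be done mechanically; a minimal sketch on an invented two-line config fragment, using the kernel's "# CONFIG_FOO is not set" convention for disabled options (in the real tree you would edit the copy of config-nodebug in rpmbuild/SOURCES before running rpmbuild):

```shell
# Flip CONFIG_DMA_API_DEBUG off in a config fragment (contents invented).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
CONFIG_DMA_API_DEBUG=y
CONFIG_DEBUG_OBJECTS=y
EOF

# Rewrite the =y line into the kernel's "is not set" form, leaving
# every other option untouched.
sed 's/^CONFIG_DMA_API_DEBUG=y$/# CONFIG_DMA_API_DEBUG is not set/' \
    "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

grep DMA_API_DEBUG "$cfg"   # prints: # CONFIG_DMA_API_DEBUG is not set
```

Running make oldconfig after merging the edited fragment would then carry the toggled value into the final .config without prompting.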
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Closing due to lack of response.