From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8b5) Gecko/20051008 Fedora/1.5-0.5.0.beta2 Firefox/1.4.1

Description of problem:
2.6.13-1.1622_FC5 does not have this problem; it started on 2.6.13-1.1623_FC5. When I boot up, the system hangs shortly after rc.sysinit makes the first change to the root filesystem, overwriting /var/log/dmesg. Sometimes it makes a bit of additional progress, but it never gets very far. Since rhgb is generally already loaded at that point, you get the impression that the system just froze, but SysRq is still functional. If you switch to VT1 and issue some SysRq command that produces output, you'll still get a chance to observe an oops on hung ext3 commits after a minute or so of inactivity. I haven't observed the oops without issuing SysRq commands, but maybe I just wasn't sufficiently patient.

Version-Release number of selected component (if applicable):
kernel-2.6.13-1.1624_FC5

How reproducible:
Always

Steps to Reproduce:
1. Boot up with 20051024's rawhide

Actual Results:
It freezes after saying it enabled all swap partitions, but adding debugging code to /etc/rc.d/rc.sysinit shows it actually gets at least as far as overwriting /var/log/dmesg. Sometimes it hangs immediately, sometimes it takes a few additional seconds to hang.

Expected Results:
It shouldn't hang; the kernel from two days before didn't.

Additional info:
I'll post a picture with the oops momentarily.
Created attachment 120339
Reduced picture of the soft lockup oops

This is barely readable, but I hope it's enough. There's the tail of a SysRq-T output before the soft lockup oops, so you can see that this time it got as far as checking for new hardware. The end of the oops is the same as the one I got several other times, before I managed to stop the system from switching from 80x60 to 80x25 on boot.
Sounds like this could be a dupe of 171615 and 171632. Could you try the kernel at http://people.redhat.com/davej/kernels/Fedora/devel/ please?
Looks like the same problem, indeed. I'll try 1626 when my rawhide update completes, but from the two other bug reports, I won't hold my breath. My box is a UP notebook, and the only oddity I can think of is the use of external disks on both USB and Firewire, with root on LVM on raid 1, with one of the raid 1 members on one of the external disks, and some additional raid 1 (additional swap included) between the two external disks. A minor oddity, eh? :-)
Created attachment 120373
lspci output
If 1.1622 is working, then this is rc5-git2. git2 has very few patches. We already turned off the powernow patch, and I built without the hugetlb patch. What is left in there that is relevant to the architecture are some drm, dccp, tcp, and posix timers patches. Also, there are a few Fedora patches (autofs-lookup and serial-of). Any educated guesses? I can build the kernel and try. I did build git5 with no success, so if it is posix timers, they have not got it right yet.
For the record... The only difference between 2.6.13-1.1622_FC5 and 2.6.13-1.1623_FC5 was that the latter had CONFIG_CC_OPTIMIZE_FOR_SIZE=y. The -git patches were not being applied because the %patch2 command was commented out. This unfortunately makes it both easy and difficult to fix the problem :-/
Yes, indeed. I rebuilt 1.1629 with git7 actually applied (and with the size optimization set to N); the system booted, but froze after working for two minutes. So there is still something wrong in the git patch. I am now building again with only the posix-thread patches from the git tree applied, to see if they are the ones causing the problem.
OK. Adding all of the posix/thread related patches from git up to this time builds and runs fine. I have been running for an hour without problems.
This should be fixed in 1629.
*** Bug 171632 has been marked as a duplicate of this bug. ***
It is fixed, indeed, as in, the problem no longer occurs. Until someone decides to turn -Os on again.

I know I've seen this very same failure before, so I figured I'd track it down. I built the entire kernel with -Os and it failed. Then I rebuilt only arch/x86_64/lib/bitops.o with -O2 and it worked fine. Then I compared the code of this file, compiled with -Os and with -O2, and the only significant difference was that with -O2 find_first_zero_bit() was inlined into find_next_zero_bit().

So I renamed find_first_zero_bit to __find_first_zero_bit, made it always_inline, created a new find_first_zero_bit that just calls the always_inline function, and got find_next_zero_bit to call the always_inline function. At that point, the code in both object files is equivalent, so it should all work, right? Well, it still doesn't, and I'm totally confused as to why.

(As for how to get the kernel to not recompile everything when I change from -O2 to -Os or vice versa: I commented out the addition of -O2 and -Os in the top-level Makefile, created `compile.Os' and `compile.O2' scripts that run CC with the corresponding option appended to the command line, then set up a soft link pointing to one or the other, and ran `make bzImage CC=/that/soft/link'. :-)

May I suggest that we keep this bug open so that we can eventually switch to an -Os kernel on amd64?
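For readers following along, the renaming experiment described above can be sketched roughly as follows. This is a userspace approximation, not the kernel source: the real arch/x86_64/lib/bitops.c helpers are written with inline asm, whereas the loop body here is a portable stand-in, and the `size_t`-based signatures and the `BITS_PER_LONG` definition are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Forced-inline worker: every caller gets its own copy at -Os and -O2 alike.
 * Portable stand-in for the asm-based kernel helper (assumption). */
static inline __attribute__((always_inline)) size_t
__find_first_zero_bit(const unsigned long *addr, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (!(addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
            return i;
    }
    return size; /* no zero bit found */
}

/* Out-of-line entry point that keeps the original name. */
size_t find_first_zero_bit(const unsigned long *addr, size_t size)
{
    assert(addr != NULL);
    return __find_first_zero_bit(addr, size);
}

/* Calls the always_inline worker directly, so the generated code no longer
 * depends on the optimizer's inlining decision. */
size_t find_next_zero_bit(const unsigned long *addr, size_t size, size_t offset)
{
    if (offset >= size)
        return size;

    size_t word = offset / BITS_PER_LONG;
    size_t bit = offset % BITS_PER_LONG;

    if (bit != 0) {
        /* Treat the bits below offset as set so that they are skipped. */
        unsigned long w = addr[word] | ((1UL << bit) - 1);
        if (w != ~0UL) {
            size_t pos = word * BITS_PER_LONG +
                         __find_first_zero_bit(&w, BITS_PER_LONG);
            return pos < size ? pos : size;
        }
        word++;
    }

    size_t base = word * BITS_PER_LONG;
    if (base >= size)
        return size;
    return base + __find_first_zero_bit(addr + word, size - base);
}
```

With this layout the -Os and -O2 objects should differ only in details unrelated to inlining, which is what made the remaining failure so puzzling.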
Is this perhaps a gcc problem, just triggered by the kernel? Just wondering...
Created attachment 120544
Patch that enables a kernel optimized for size to work

Nope, just the usual bug in asm statements that different compiler optimizations often expose. The patch file contains a long explanation of the bug and the various minor changes I made while fixing it.
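The attached patch is not reproduced in this thread, so the following is only a generic illustration of this class of bug, assuming GCC-style inline asm on x86-64 (with a plain C fallback elsewhere). `load_word` and `store_then_load` are hypothetical names, not functions from the kernel or the patch. The point is that an asm statement has to declare everything it reads and writes; otherwise the result depends on inlining and scheduling choices that change between -Os and -O2.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: NOT the code from the attached patch. */
static inline unsigned long load_word(const unsigned long *p)
{
    assert(p != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    unsigned long v;
    /*
     * The "m"(*p) input tells the compiler that the asm reads the
     * pointed-to memory. A variant that passes only the pointer value,
     *     asm("movq (%1), %0" : "=r"(v) : "r"(p));
     * hides that read, so the compiler is free to move an earlier
     * store to *p past the asm, or to drop the store altogether.
     */
    __asm__("movq %1, %0" : "=r"(v) : "m"(*p));
    return v;
#else
    return *p; /* portable fallback for other compilers/targets */
#endif
}

/* The store to slot must be visible to the asm inside load_word(). */
unsigned long store_then_load(unsigned long value)
{
    unsigned long slot = value;
    slot += 1;
    return load_word(&slot);
}
```

An under-constrained asm can appear to work for a long time, because it only misbehaves when the optimizer happens to take advantage of the missing information.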
Fixed upstream, and in rawhide for a while.