Red Hat Bugzilla – Bug 171672
x86_64 -Os kernel hangs after rc.sysinit overwrites dmesg
Last modified: 2015-01-04 17:22:46 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8b5) Gecko/20051008 Fedora/1.5-0.5.0.beta2 Firefox/1.4.1
Description of problem:
2.6.13-1.1622_FC5 does not have this problem; it started with 2.6.13-1.1623_FC5. When I boot up, the system hangs shortly after rc.sysinit makes the first change to the root filesystem, overwriting /var/log/dmesg. Sometimes it makes a bit of additional progress, but it never gets very far. Since rhgb is generally already loaded at that point, you get the impression that the system just froze, but SysRq is still functional. If you switch to VT1 and issue a SysRq command that produces some output, you can observe an oops on hung ext3 commits after a minute or so of inactivity. I haven't observed the oops without issuing SysRq commands, but maybe I just wasn't sufficiently patient.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot up with 20051024's rawhide
Actual Results: It freezes after saying it enabled all swap partitions, but adding debugging code to /etc/rc.d/rc.sysinit shows it actually gets at least as far as overwriting /var/log/dmesg. Sometimes it hangs immediately, sometimes it takes a few additional seconds to hang.
Expected Results: It shouldn't hang; the kernel from two days before didn't.
I'll post a picture with the oops momentarily.
Created attachment 120339
Reduced picture of the soft lockup oops
This is barely readable, but I hope it's enough. There's the tail of a SysRq-T
output before the lockup oops, so you can see that this time it got as far as
checking for new hardware. The end of the oops is the same as what I got
several other times, before I managed to stop the system from switching from
80x60 to 80x25 on boot.
sounds like this could be a dupe of 171615 and 171632
Could you try the kernel at http://people.redhat.com/davej/kernels/Fedora/devel/
Looks like the same problem, indeed. I'll try 1626 when my rawhide update
completes, but from the two other bug reports, I won't hold my breath.
My box is a UP notebook, and the only oddity I can think of is the use of
external disks on both USB and Firewire, with root on LVM on raid 1, with one of
the raid 1 members on one of the external disks, and some additional raid 1
(additional swap included) between the two external disks. A minor oddity, eh? :-)
Created attachment 120373
If 1.1622 is working, then this is rc5-git2. git2 has very few patches. We already
turned off the powernow patch, and I built without the hugetlb patch. What is left
in there that is relevant to the architecture are some drm, dccp, tcp, and posix timers patches.
Also, there are a few Fedora patches (autofs-lookup and serial-of). Any educated guesses?
I can build the kernel and try. I did build git5 with no success, so if it is posix timers, they
have not got it right yet.
For the record... The only difference between 2.6.13-1.1622_FC5 and
2.6.13-1.1623_FC5 was that the latter had CONFIG_CC_OPTIMIZE_FOR_SIZE=y. The
-git patches were not being applied because the %patch2 command was commented
out. This unfortunately makes it both easy and difficult to fix the problem :-/
Yes, indeed. I rebuilt 1.1629 with git7 actually applied (this has optimize
set to N) and the system booted, but after working for two minutes it froze. So
there is still something wrong in the git patch. I am now building again with only
the posix-thread patches applied from the git tree to see if they are the ones causing
the problem.
OK. Adding all of the posix/thread-related patches from git up to this time
builds and runs fine. I have been running for an hour without problems.
this should be fixed in 1629
*** Bug 171632 has been marked as a duplicate of this bug. ***
It is fixed, indeed, as in, the problem no longer occurs. Until someone decides
to turn -Os on again. I know I've seen this very same failure before, so I
figured I'd track it down.
So I built the entire kernel with -Os and it failed. Then I rebuilt only
arch/x86_64/lib/bitops.o with -O2 and it would work fine. Then I compared the
code of this file, compiled with -Os and -O2, and the only significant
difference was that with -O2 find_first_zero_bit() would be inlined into
find_next_zero_bit(). So I rename find_first_zero_bit to __find_first_zero_bit,
make it always_inline, create a new find_first_zero_bit that just calls the
always_inline function, and get find_next_zero_bit to call the always_inline
function. At that point, the code in both object files is equivalent, so it
should all work, right? Well, it still doesn't, and I'm totally confused as to why.
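For readers following along, the refactoring described above can be sketched roughly as follows. This is only an illustration of the rename-and-wrap pattern, not the code from the kernel tree: the real arch/x86_64/lib/bitops.c used inline asm, whereas the body below is a hypothetical portable C loop, and the signatures and BITS_PER_LONG definition are assumptions made so the sketch builds in user space. As noted above, this change alone did not make the -Os kernel work.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* The old find_first_zero_bit(), renamed and forced inline so that the
 * generated code does not depend on -O2 vs -Os inlining heuristics.
 * Hypothetical plain-C body standing in for the original asm. */
static inline __attribute__((always_inline)) unsigned long
__find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
    unsigned long i;

    assert(addr != NULL || size == 0);
    for (i = 0; i < size; i++) {
        if (!(addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
            return i;   /* index of the first clear bit */
    }
    return size;        /* no clear bit found */
}

/* New out-of-line entry point that just calls the always_inline helper. */
unsigned long find_first_zero_bit(const unsigned long *addr,
                                  unsigned long size)
{
    return __find_first_zero_bit(addr, size);
}

/* find_next_zero_bit() also calls the helper directly, so the scan is
 * inlined here at any optimization level. */
unsigned long find_next_zero_bit(const unsigned long *addr,
                                 unsigned long size, unsigned long offset)
{
    /* Finish the partially scanned word one bit at a time. */
    while (offset < size && offset % BITS_PER_LONG != 0) {
        if (!(addr[offset / BITS_PER_LONG] &
              (1UL << (offset % BITS_PER_LONG))))
            return offset;
        offset++;
    }
    if (offset >= size)
        return size;

    /* The remainder starts on a word boundary. */
    return offset + __find_first_zero_bit(addr + offset / BITS_PER_LONG,
                                          size - offset);
}
```

Both functions return `size` when no zero bit is found, which is an assumption of this sketch rather than something stated in the report.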
(As for how to get the kernel to not recompile everything when I change from -O2
to -Os or vice-versa, I commented out the addition of -O2 and -Os in the
top-level Makefile, created `compile.Os' and `compile.O2' scripts that run CC
with the corresponding option appended to the command line, then set up a
soft-link to point to one or the other, and run `make bzImage
May I suggest that we keep this bug open so that we can eventually switch to
an -Os kernel on amd64?
Is this perhaps a gcc problem, just triggered by the kernel?
Created attachment 120544
Patch that enables a kernel optimized for size to work
Nope, just the usual bug in asm statements that different compiler
optimizations often expose. The patch file contains a long explanation of the
bug and the various minor changes I made while fixing it.
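As a generic illustration of this class of bug (not taken from the attached patch, whose details are not reproduced in this report), here is a minimal sketch of an asm statement whose constraints tell the compiler everything the instruction does. The function name fill_words is hypothetical. A version that listed dst and n as input-only operands and omitted the "memory" clobber might appear to work at one optimization level and break at another, because the compiler would be free to assume those registers and the buffer are unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store 'value' into n consecutive 64-bit words starting at dst. */
void fill_words(uint64_t *dst, uint64_t value, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)
    /* "rep stosq" advances RDI and counts RCX down to zero, so both
     * are declared read-write ("+D", "+c") rather than input-only.
     * The "memory" clobber tells the compiler that the asm writes
     * memory it cannot see through the operand list. */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(n)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback for other compilers and architectures. */
    while (n--)
        *dst++ = value;
#endif
}
```

With incomplete constraints the failure depends on register allocation and inlining decisions, which is consistent with a bug that shows up only when switching between -O2 and -Os.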
Fixed upstream, and in rawhide for a while.