171672 – x86_64 -Os kernel hangs after rc.sysinit overwrites dmesg

Summary: x86_64 -Os kernel hangs after rc.sysinit overwrites dmesg

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	171632 (view as bug list)
Depends On:
Blocks:

Reported:	2005-10-25 03:59 UTC by Alexandre Oliva
Modified:	2015-01-04 22:22 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-12-28 05:16:20 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Reduced picture of the soft lockup oops (59.95 KB, image/jpeg) 2005-10-25 04:06 UTC, Alexandre Oliva
lspci output (1.60 KB, text/plain) 2005-10-25 16:50 UTC, Alexandre Oliva
Patch that enables a kernel optimized for size to work (7.55 KB, patch) 2005-10-29 21:43 UTC, Alexandre Oliva

Description Alexandre Oliva 2005-10-25 03:59:06 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8b5) Gecko/20051008 Fedora/1.5-0.5.0.beta2 Firefox/1.4.1

Description of problem:
2.6.13-1.1622_FC5 does not have this problem; it started on 2.6.13-1.1623_FC5.  When I boot up, the system hangs shortly after rc.sysinit makes the first change to the root filesystem, overwriting /var/log/dmesg.  Sometimes it makes a bit of additional progress, but it never goes very far.  Since at that point rhgb is generally already loaded, you get the impression that the system just froze, but SysRq is still functional.  If you switch to VT1 and issue some SysRq command that produces some output, you'll still get a chance to observe an oops on hung ext3 commits, after one minute of inactivity or so.  I haven't observed the oops without issuing SysRq commands, but maybe I just wasn't sufficiently patient.

Version-Release number of selected component (if applicable):
kernel-2.6.13-1.1624_FC5

How reproducible:
Always

Steps to Reproduce:
1. Boot up with the 20051024 rawhide

Actual Results:  It freezes after saying it enabled all swap partitions, but adding debugging code to /etc/rc.d/rc.sysinit shows it actually gets at least as far as overwriting /var/log/dmesg.  Sometimes it hangs immediately, sometimes it takes a few additional seconds to hang.

Expected Results:  It shouldn't hang; the kernel from two days before didn't.

Additional info:

I'll post a picture with the oops momentarily.

Comment 1 Alexandre Oliva 2005-10-25 04:06:58 UTC

Created attachment 120339
Reduced picture of the soft lockup oops

This is barely readable, but I hope it's enough.  There's the tail of a SysRq-T
output before the lock up oops, so you can see that this time it got as far as
checking for new hardware.  The end of the oops is the same as that I got
several other times, before I managed to stop the system from switching from
80x60 to 80x25 on boot.

Comment 2 Dave Jones 2005-10-25 06:20:29 UTC

sounds like this could be a dupe of 171615 and 171632

Could you try the kernel at http://people.redhat.com/davej/kernels/Fedora/devel/ please?

Comment 3 Alexandre Oliva 2005-10-25 16:48:59 UTC

Looks like the same problem, indeed.  I'll try 1626 when my rawhide update
completes, but from the two other bug reports, I won't hold my breath.

My box is a UP notebook, and the only oddity I can think of is the use of
external disks on both USB and Firewire, with root on LVM on raid 1, with one of
the raid 1 members on one of the external disks, and some additional raid 1
(additional swap included) between the two external disks.  A minor oddity, eh? :-)

Comment 4 Alexandre Oliva 2005-10-25 16:50:15 UTC

Created attachment 120373
lspci output

Comment 5 Sammy 2005-10-25 20:32:48 UTC

If 1.1622 is working, then this is rc5-git2, and git2 has very few patches. We already
turned off the powernow patch, and I built without the hugetlb patch. What is left
in there that is relevant to the architecture are some drm, dccp, tcp, and posix timers patches.
There are also a few Fedora patches (autofs-lookup and serial-of). Any educated guesses?
I can build the kernel and try. I did build git5 with no success, so if it is posix timers, they
have not got it right yet.

Comment 6 Alexandre Oliva 2005-10-27 17:11:09 UTC

For the record...  The only difference between 2.6.13-1.1622_FC5 and
2.6.13-1.1623_FC5 was that the latter had CONFIG_CC_OPTIMIZE_FOR_SIZE=y.  The
-git patches were not being applied because the %patch2 command was commented
out.  This unfortunately makes it both easy and difficult to fix the problem :-/

Comment 7 Sammy 2005-10-27 18:15:58 UTC

Yes, indeed. I rebuilt 1.1629 with git7 actually applied (this has optimize
set to N) and the system booted, but after working for two minutes it froze. So
there is still something wrong in the git patch. I am now building again with only
the posix-thread patches from the git tree applied, to see if they are the ones
causing the problem.

Comment 8 Sammy 2005-10-27 18:59:53 UTC

OK. A kernel with all of the posix/thread related patches from git up to this time
builds and runs fine. I have been running for an hour without problems.

Comment 9 Dave Jones 2005-10-27 23:23:22 UTC

this should be fixed in 1629

Comment 10 Dave Jones 2005-10-27 23:24:07 UTC

*** Bug 171632 has been marked as a duplicate of this bug. ***

Comment 11 Alexandre Oliva 2005-10-29 16:21:33 UTC

It is fixed, indeed, as in, the problem no longer occurs.  Until someone decides
to turn -Os on again.  I know I've seen this very same failure before, so I
figured I'd track it down.

So I built the entire kernel with -Os and it failed.  Then I rebuilt only
arch/x86_64/lib/bitops.o with -O2 and it would work fine.  Then I compared the
code of this file, compiled with -Os and -O2, and the only significant
difference was that with -O2 find_first_zero_bit() would be inlined into
find_next_zero_bit().  So I renamed find_first_zero_bit to __find_first_zero_bit,
made it always_inline, created a new find_first_zero_bit that just calls the
always_inline function, and got find_next_zero_bit to call the always_inline
function.  At that point, the code in both object files is equivalent, so it
should all work, right?  Well, it still doesn't, and I'm totally confused as to why.
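
For readers following along, the refactoring described above can be sketched
roughly as follows.  This is a portable C stand-in, not the kernel's code: the
real arch/x86_64/lib/bitops.c relies on asm statements (see comment 13), and
the loop body, the BITS_PER_LONG definition, and the exact signatures here are
assumptions made for illustration.  Only the shape matters: one always_inline
helper, a thin out-of-line wrapper, and find_next_zero_bit calling the helper
directly.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Assumed stand-in for the asm helper: index of the first zero bit in the
 * first `size` bits at addr, or `size` if every bit is set. */
static inline __attribute__((always_inline)) unsigned long
__find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
    unsigned long i;

    for (i = 0; i < size; i++)
        if (!(addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
            return i;
    return size;
}

/* Out-of-line entry point that only forwards to the inline helper. */
unsigned long find_first_zero_bit(const unsigned long *addr,
                                  unsigned long size)
{
    return __find_first_zero_bit(addr, size);
}

/* Calls the helper directly, so the search is inlined here at -O2 and -Os. */
unsigned long find_next_zero_bit(const unsigned long *addr,
                                 unsigned long size, unsigned long offset)
{
    unsigned long word, bit, res;

    if (offset >= size)
        return size;
    word = offset / BITS_PER_LONG;
    bit = offset % BITS_PER_LONG;
    if (bit) {
        /* Treat the bits below `offset` in the first word as already set. */
        unsigned long w = addr[word] | ((1UL << bit) - 1);

        res = __find_first_zero_bit(&w, BITS_PER_LONG);
        if (res < BITS_PER_LONG) {
            res += word * BITS_PER_LONG;
            return res < size ? res : size;
        }
        word++;
    }
    if (word * BITS_PER_LONG >= size)
        return size;
    res = __find_first_zero_bit(addr + word, size - word * BITS_PER_LONG);
    return word * BITS_PER_LONG + res;
}
```

With this layout the optimization level no longer decides whether the search
is inlined into find_next_zero_bit, which is what makes the two object files
comparable.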

(As for how to get the kernel to not recompile everything when I change from -O2
to -Os or vice-versa, I commented out the addition of -O2 and -Os in the
top-level Makefile, created `compile.Os' and `compile.O2' scripts that run CC
with the corresponding option appended to the command line, then set up a
soft-link to point to one or the other, and ran `make bzImage
CC=/that/soft/link'. :-) )

May I suggest that we keep this bug open such that we can eventually switch to
an -Os kernel on amd64?

Comment 12 Horst H. von Brand 2005-10-29 21:34:34 UTC

Is this perhaps a gcc problem, just triggered by the kernel?

Just wondering...

Comment 13 Alexandre Oliva 2005-10-29 21:43:51 UTC

Created attachment 120544
Patch that enables a kernel optimized for size to work

Nope, just the usual bug in asm statements that different compiler
optimizations often expose.  The patch file contains a long explanation of the
bug and the various minor changes I made while fixing it.
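
The patch itself is not reproduced in this report, so the snippet below is not
its content.  It is only a generic illustration, assuming GCC-style extended
asm on x86_64, of the class of bug meant here: an asm statement whose
constraints do not tell the compiler everything the instructions read or
modify, so that a different optimization level can legitimately produce
broken code.  The function names add_one and load_word are hypothetical.

```c
#if defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)

/* BUGGY variant, shown only as a comment:
 *     __asm__ ("addq $1, %0" : : "r" (x));
 * Here x is declared as a pure input, so the compiler may assume the
 * register still holds the old value after the asm runs. */
static inline unsigned long add_one(unsigned long x)
{
    /* "+r" declares the register as both read and written. */
    __asm__ ("addq $1, %0" : "+r" (x) : : "cc");
    return x;
}

/* The "m" input tells the compiler that the asm reads *p, so a pending
 * store to *p has to be completed before this statement. */
static inline unsigned long load_word(const unsigned long *p)
{
    unsigned long v;

    __asm__ ("movq %1, %0" : "=r" (v) : "m" (*p));
    return v;
}

#else

/* Portable fallbacks so the sketch still builds on other targets. */
static inline unsigned long add_one(unsigned long x) { return x + 1; }
static inline unsigned long load_word(const unsigned long *p) { return *p; }

#endif
```

Whether an under-constrained asm statement misbehaves depends on inlining and
register allocation, which is why switching between -O2 and -Os can expose it.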

Comment 14 Dave Jones 2005-12-28 05:16:20 UTC

Fixed upstream, and in rawhide for a while.
