Bug 442285 - Solve real cause of #437701 and rebuild memtest86+ with GCC 4.3
Solve real cause of #437701 and rebuild memtest86+ with GCC 4.3
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: memtest86+ (Show other bugs)
rawhide
All Linux
low Severity medium
: ---
: ---
Assigned To: Jarod Wilson
Fedora Extras Quality Assurance
: Reopened
Depends On:
Blocks: F10Target F11Target
  Show dependency treegraph
 
Reported: 2008-04-13 17:41 EDT by Robert Scheck
Modified: 2013-01-09 23:39 EST (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-25 12:42:02 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Robert Scheck 2008-04-13 17:41:11 EDT
Description of problem:
Please solve real cause of bug #437701 and rebuild memtest86+ with GCC 4.3,
shipping and or requiring legacy components is a worse thing.

Version-Release number of selected component (if applicable):
memtest86+-2.01-3

Actual results:
GCC 3.4 is used to build memtest86+ because of a GCC or memtest86+ bug.

Expected results:
Solve real cause of #437701 and rebuild memtest86+ with GCC 4.3

Additional info:
This is likely to be a post F9 thing.
Comment 1 Bug Zapper 2008-05-14 05:23:18 EDT
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 2 Robert Scheck 2008-07-27 10:34:59 EDT
Ping? Maybe Jakub is able to provide some help here?
Comment 3 Robert Scheck 2008-11-01 14:55:40 EDT
Ping?
Comment 4 Bug Zapper 2008-11-25 21:12:34 EST
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 5 Robert Scheck 2008-12-28 15:58:45 EST
Warren - any news here? Jakub, can you please help investigating here to
solve the issue...I think, you're a good C guy? Adding Jesse that we maybe
can get rid of that before Fedora 11?
Comment 6 Warren Togami 2008-12-29 04:13:49 EST
memtest86+ version 2.10 announcement mentioned:
Warning : GCC 4.2+ is not yet supported. Thanks to the new pointers over/underflows limitations introduced with that version ! Memtest86+ compiled with GCC 4.2+ will result in a non-working binary (hangs in the first seconds). We're working on a fix but there is really many parts of the code to check & change everywhere. So, please use GCC 4.1 instead.

So this isn't our fault.  This is heavy lifting that must happen upstream.
Comment 7 Fedora Admin XMLRPC Client 2009-03-10 21:17:36 EDT
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 8 Jesse Keating 2009-04-13 20:31:36 EDT
... and it looks like it isn't happening for F11.  I don't think we'd delay the release for this, moving over to target.
Comment 9 Bug Zapper 2009-06-09 05:32:49 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 10 Jarod Wilson 2009-07-24 17:04:25 EDT
I've been asked to look into this, and have been getting up to speed on things intertwined with other tasks, and just started in on some bisection of .o files. I've got a build with all gcc 4.4 build .o files, except for test.o and patn.o, and its already most of the way through a test run (already way further than a pure gcc 4.4 build got). Assuming this pass completes, I'll check if its only one or both of these causing failure, then start looking at the respective source files themselves. Its looking like fixing this won't actually be all that bad.
Comment 11 Jarod Wilson 2009-07-24 17:22:07 EDT
Narrowed the failure down to test.o, looking over its source now...
Comment 12 Jakub Jelinek 2009-07-24 17:28:56 EDT
Does it fail even with -O0 when built with gcc 4.4?  Or when built with -O2 -fno-inline -fno-inline-small-functions?  If inlining isn't necessary to reproduce the failure, you can as well continue doing the binary search among functions within that file (put half of functions into one file, half in another, remove static and/or add prototypes where needed).
Or, if it doesn't fail at -O0, you can use __attribute__((__optimize__(0))) on various functions to compile them non-optimized, while keeping the rest optimized.
Comment 13 Jarod Wilson 2009-07-24 18:50:13 EDT
(In reply to comment #12)
> Does it fail even with -O0 when built with gcc 4.4?

It works just fine built with -O0

> Or when built with -O2
> -fno-inline -fno-inline-small-functions?

Fails the same as before with these options.

(stock is -Os, fwiw)

> If inlining isn't necessary to
> reproduce the failure, you can as well continue doing the binary search among
> functions within that file (put half of functions into one file, half in
> another, remove static and/or add prototypes where needed).
> Or, if it doesn't fail at -O0, you can use __attribute__((__optimize__(0))) on
> various functions to compile them non-optimized, while keeping the rest
> optimized.  

I'll give the latter a shot next. Thanks much!
Comment 14 Jarod Wilson 2009-07-24 18:59:29 EDT
So given that it was failing on test #2, I'd been figuring it was most likely the function addr_tst2() in test.c at fault, so I threw __attribute__((__optimize(0))) on that first, and that indeed the resulting binary does work, so it confirms my suspicion about the guilty function.

Still looking over the code, but after a few reads through, I'm not sure what it could be. I'm assuming probably something in the /* Check for overflow */ area is getting optimized out, and that's probably where I'll tinker next, but its now 7pm on Friday night, and the kids are wanting some attention. ;)
Comment 15 Jarod Wilson 2009-07-24 19:01:16 EDT
Oh, wait... I think I might see it now... But yeah, kid time first...
Comment 16 Jakub Jelinek 2009-07-24 19:08:47 EDT
Relying on undefined signed overflow or something similar?
Comment 17 Jakub Jelinek 2009-07-24 19:25:12 EDT
if (pe + SPINSZ > pe)
or
if (pe - SPINSZ < pe)
or
if (pe + SPINSZ*4 > pe)
for pe being pointer is obviously all optimized away, pointers in C never wrap around and so for positive SPINSZ all these can be assumed always true.
Just use
if ((((uintptr_t)pe) + SPINSZ * sizeof (ulong)) > (uintptr_t)pe)
and similar.  Alternatively you could use -fno-strict-overflow, but it is better to just fix the buggy code.
Comment 18 Jarod Wilson 2009-07-24 22:47:18 EDT
> Just use
> if ((((uintptr_t)pe) + SPINSZ * sizeof (ulong)) > (uintptr_t)pe)
> and similar.

So I tried this (actually, a number of different variations of it too), but its still blowing up the same as before. Must be doing something wrong, or missing something else... Going to table it 'til Monday, but we're close...
Comment 19 Jarod Wilson 2009-07-27 16:28:07 EDT
In a hurry to try to get this figured out last week, I think I may have stumbled on one of the last few steps... After back-tracking, building with -O0 works, but the attribute addition to addr_tst2() only route is failing, same as before. Going to retrace my steps from there forward now...
Comment 20 Jarod Wilson 2009-07-27 17:26:43 EDT
Okay... So I was right the first time, adding the attribute to addr_tst2() does work. However, its definitely not the pe comparison that is tripping things up (not to say that it doesn't also need to be fixed). Looking like somehow the BAILR bit is getting optimized out. Here's the minimal working diff:

--- rpmbuild/BUILD/memtest86+-2.11/test.c	2008-11-15 19:18:14.000000000 -0500
+++ memtest86+-2.11/test.c	2009-07-27 13:20:34.885941221 -0400
@@ -130,6 +130,15 @@ void addr_tst1()
 	}
 }
 
+static int bail_check()
+{
+	if (bail)
+		return 1;
+	else
+		return 0;
+
+} __attribute__((__optimize__(0)))
+
 /*
  * Memory address test, own address
  */
@@ -181,7 +190,9 @@ void addr_tst2()
 				: "D" (p), "d" (pe)
 			);
 			do_tick();
-			BAILR
+			//BAILR
+			if (bail_check())
+				return;
 		} while (!done);
 	}
 
@@ -243,7 +254,9 @@ void addr_tst2()
 				: "ecx"
 			);
 			do_tick();
-			BAILR
+			//BAILR
+			if (bail_check())
+				return;
 		} while (!done);
 	}
 }


BAILR is simply:
#define BAILR           if (bail) return;
Comment 21 Jarod Wilson 2009-07-27 17:49:34 EDT
Oh fun. The plot thickens. Its not actually addr_tst2() where we're running into trouble. That correlates to Test #1, not #2. I'd assumed it started at Test #1, because Test #0 happens so fast, you never see it on-screen. Test #2, where we're dying, is actually "Moving inversions, ones & zeros", which corresponds to movinvr(). So why the above helps, I dunno... Digging some more...
Comment 22 Jarod Wilson 2009-07-27 17:54:37 EDT
Oh crap. I think I'm dumb, and don't know what I'm doing with where I'm putting the attribute... Just looked at the docs, yep, I'm Doing It Wrong. Ugh. Ignore the last several comments, lemme redo this the right way.
Comment 23 Jarod Wilson 2009-07-27 18:20:47 EDT
Alright, I'm back on track now... And I have a theory as to what is going on. It seems that gcc thinks 'bail' is always zero, and so the BAILR lines are always optimized out. bail is defined in main.c, then referenced as an extern int in both test.c and config.c. It can be set to non-zero by config.c, but is never touched by test.c. Pretty much any effort I take to NOT optimize something that touches bail within test.c results in a working Test #2, which would explain why my earlier misguided attempts worked, even though I was prodding the wrong function... I don't know if this is a gcc buglet, or if the code is just doing something it really shouldn't... Or if I'm way off and don't know what I'm talking about. ;)
Comment 24 Jakub Jelinek 2009-07-27 18:37:11 EDT
gcc -S -fverbose-asm -Wall -march=i486 -m32 -Os -fomit-frame-pointer -fno-builtin -ffreestanding test.c
grep bail test.s  | sort | uniq -c
     21 	cmpl	$0, bail	#, bail
grep BAILR test.c | wc -l
21
So, at least in test.c that doesn't seem to be the case.
Have you tried compiling test.c with -march=i486 -Os -fomit-frame-pointer -fno-builtin  -fftreestanding -fno-strict-overflow
(or s/-fno-strict-overflow/-fwrapv)?
Comment 25 Jarod Wilson 2009-07-28 09:45:23 EDT
> So, at least in test.c that doesn't seem to be the case.

Huh, so it isn't. I'm rather confused why simply un-optimizing the parts that touch bail seems to make a difference then...

> Have you tried compiling test.c with -march=i486 -Os -fomit-frame-pointer
> -fno-builtin  -fftreestanding -fno-strict-overflow

I thought I had, but apparently not. Just tried that, and the resulting binary does work as expected.
Comment 26 Jarod Wilson 2009-07-30 21:16:03 EDT
(In reply to comment #18)
> > Just use
> > if ((((uintptr_t)pe) + SPINSZ * sizeof (ulong)) > (uintptr_t)pe)
> > and similar.
> 
> So I tried this (actually, a number of different variations of it too), but its
> still blowing up the same as before. Must be doing something wrong, or missing
> something else... Going to table it 'til Monday, but we're close...  

Finally figured it out... This doesn't work:

if ((((uintptr_t)pe) + SPINSZ) > (uintptr_t)pe)

But this does:

if ((uintptr_t)(pe + SPINSZ) > (uintptr_t)pe)

Is that expected?
Comment 27 Jarod Wilson 2009-07-30 21:16:54 EDT
Moving to rawhide and reassigning.
Comment 28 Jarod Wilson 2009-08-05 14:45:54 EDT
So while casting the result of the addition gets us through test #2 (and 3-6), test #7 still fails spectacularly, for different reasons. Marking the movinvr() function as -O0 gets it working, so I started refactoring the code to try to narrow down where it was going wrong. If I simply move this block (the first do while loop) out of movinvr() and into its own function, called by movinvr(), then a full -Os compiled build works. Without moving the code, it results in test #7 failing.

----8<----
do {
    /* Check for overflow */
    if ((uintptr_t)(pe + SPINSZ) > (uintptr_t)pe) {
        pe += SPINSZ;
    } else {
        pe = end;
    }
    if (pe >= end) {
        pe = end;
        done++;
    }
    if (p == pe ) {
        break;
    }
/* Original C code replaced with hand tuned assembly code */
/*
    for (; p < pe; p++) {
        *p = rand();
    }
 */

    asm __volatile__ (
        "jmp L200\n\t"
        ".p2align 4,,7\n\t"
        "L200:\n\t"
        "call rand\n\t"
        "movl %%eax,(%%edi)\n\t"
        "addl $4,%%edi\n\t"
        "cmpl %%ebx,%%edi\n\t"
        "jb L200\n\t"
        : "=D" (p)
        : "D" (p), "b" (pe)
        : "eax"
    );

    do_tick();
    BAILR
} while (!done);
----8<----

I'm pretty much at a loss why moving that code *unaltered* beyond putting it in a separate function works...
Comment 29 Jarod Wilson 2009-08-17 17:01:44 EDT
Also works within the loop if movinvr() is tagged with __attribute__((__optimize__(2))). About the only thing left that I can think of is perhaps a register in addition to eax is getting clobbered at some point. Will do a bit of poking in this direction...
Comment 30 Jarod Wilson 2009-08-17 17:22:14 EDT
Indeed, simply adding edx to the clobber list makes things fully functional without the loop moved external to movinvr(). I'm guessing perhaps the call to rand() somehow stomps on it, since the rest of the asm doesn't use it at all... Going to call it good with this change, flip the switch, and rebuild memtest86+ w/o resorting to gcc34 now.
Comment 31 Jarod Wilson 2009-09-17 09:44:25 EDT
We've been including gcc 4.4-built memtest86+ packages in rawhide for about a month now, haven't seen any new bug reports come in, so I'll go ahead and close this as fixed.
Comment 32 Fedora Update System 2009-10-20 12:41:19 EDT
memtest86+-4.00-2.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/memtest86+-4.00-2.fc11
Comment 33 Robert Scheck 2009-12-25 12:25:52 EST
Reopening, because of "BuildRequires: compat-gcc-34" - why do we need this?
Comment 34 Robert Scheck 2009-12-25 12:42:02 EST
We don't need this, fixed with memtest86+-4.00-3.fc13.
Comment 35 Fedora Update System 2010-01-18 19:57:26 EST
memtest86+-4.00-2.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.