Description of problem: tla currently fails building on PPC due to what appears to be the recent ABI change in rawhide. Version-Release number of selected component (if applicable): tla-1.3.3-5 How reproducible: Always Steps to Reproduce: 1. checkout tla package from Extras CVS 2. rebuild it on PPC 3. Actual results: ================ unit-ar tests ================ /home/jwboyer/rpm/BUILD/tla-1.3.3/src/hackerlab/tests/arrays-tests/unit-ar.c:53:botched invariant 0 == ((unsigned long)ar & (16 - 1)) PANIC: exiting on botched invariant make[3]: *** [tests-timestamp] Error 2 make[3]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.3/src/=build/hackerlab/tests/arrays-tests' make[2]: *** [test] Error 2 make[2]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.3/src/=build/hackerlab/tests' make[1]: *** [test] Error 2 make[1]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.3/src/=build/hackerlab' make: *** [test] Error 2 make: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.3/src/=build' error: Bad exit status from /var/tmp/rpm-tmp.74562 (%check) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.74562 (%check) Expected results: It builds correctly. Additional info: This appears to be some kind of issue between the computed MACHINE_ALIGNMENT macro and the actual values returned from the testcase. I've pinged some people on the fedora-ppc list about this and will look into it further when I have time.
I have a local update to 1.3.4 available, verified to build on Rawhide x86_64 and FC4 ppc. I suppose you don't mind if I go ahead and commit it? I don't have access to a Rawhide ppc box though. Do you? (By the way, for some reason I wasn't auto-added to Cc on this bug, so I'd appreciate if you could keep an eye on that and add manually if needed in the future.)
(In reply to comment #1) > I have a local update to 1.3.4 available, verified to build on Rawhide x86_64 > and FC4 ppc. I suppose you don't mind if I go ahead and commit it? I don't > have access to a Rawhide ppc box though. Do you? No I don't mind at all. That was going to be my next step anyway :). I do have access to a rawhide PPC box so I can do a test build there. I don't think an upgrade with fix this, but I might be pleasantly surprised. > (By the way, for some reason I wasn't auto-added to Cc on this bug, so I'd > appreciate if you could keep an eye on that and add manually if needed in the > future.) I just noticed that. I'll ping Elliot as to why owners.list didn't auto-add you and remember to manually add you until it gets fixed.
Ok, committed but not tagged yet nor any builds requested. I don't have anything to add to the package at the moment, so feel free...
(In reply to comment #3) > Ok, committed but not tagged yet nor any builds requested. I don't have > anything to add to the package at the moment, so feel free... Same fail: ================ unit-ar tests ================ /home/jwboyer/rpm/BUILD/tla-1.3.4/src/hackerlab/tests/arrays-tests/unit-ar.c:53:botched invariant 0 == ((unsigned long)ar & (16 - 1)) PANIC: exiting on botched invariant make[3]: *** [tests-timestamp] Error 2 make[3]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.4/src/=build/hackerlab/tests/arrays-tests' make[2]: *** [test] Error 2 make[2]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.4/src/=build/hackerlab/tests' make[1]: *** [test] Error 2 make[1]: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.4/src/=build/hackerlab' make: *** [test] Error 2 make: Leaving directory `/home/jwboyer/rpm/BUILD/tla-1.3.4/src/=build' error: Bad exit status from /var/tmp/rpm-tmp.95932 (%check) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.95932 (%check) Crap. Have to go digging to figure this out.
Ok, did some more digging. The testcase calls ar_copy, which will copy an array to another array. ar_copy looks like this (after adding some printfs of my own): void * ar_copy (void * base, alloc_limits limits, size_t szof) { void * answer; answer = 0; printf("ar_size: %ld\n", ar_size(base, limits, szof)); printf("answer: %ld\n", (unsigned long) answer); ar_setsize (&answer, limits, ar_size (base, limits, szof), szof); printf("answer: %ld\n", (unsigned long) answer); mem_move (answer, base, szof * ar_size (base, limits, szof)); return answer; } ar_setsize calls ar_ref, which in turn calls lim_malloc. lim_malloc calls must_malloc, which in turn just calls malloc. So in the output below: ================ unit-ar tests ================ sizes: 360, arrays: 268689456 ar_size: 360 answer: 0 answer: 268690920 sizes: 360, arrays: 268690920 /home/jwboyer/rpm/BUILD/tla-1.3.4/src/hackerlab/tests/arrays-tests/unit-ar.c:53:botched invariant 0 == ((unsigned long)ar & (16 - 1)) PANIC: exiting on botched invariant we see that answer was NULL before calling ar_setsize. After it goes through ar_setsize -> ar_ref -> lim_malloc -> malloc, the address of 268690920 (or 0x1003E5E8) is returned. This isn't 16 byte aligned. So it looks like malloc really is returning a non-16 byte aligned address...
That would probably be a glibc bug then. Reassigning.
As I said in an email to the fedora-ppc list, I can recreate this will a simple program too. Here it is below: [jwboyer@net2-102 ~]$ gcc -o malloc-check malloc-check.c [jwboyer@net2-102 ~]$ ./malloc-check foo is not 16-byte aligned: 10011008 [jwboyer@net2-102 ~]$ #include <stdio.h> #include <stdlib.h> int main(int argc, char **argv) { void *foo = NULL; void *bar = NULL; int i; for (i = 0; i < 1000; i++) { foo = malloc(360); bar = malloc(360); if ((unsigned long) foo & 0xf) { printf("foo is not 16-byte aligned: %lx\n", (unsigned long) foo); return -1; } if ((unsigned long) bar & 0xf) { printf("bar is not 16-byte aligned: %lx\n", (unsigned long) bar); return -1; } if (i % 150) { free(foo); free(bar); foo = NULL; bar = NULL; } } return 0; }
If this is ppc 32-bit, then relying on malloc returning 16 byte aligned addresses is wrong. malloc only guarantees 2 * sizeof (void *) alignment, on ppc32 that's 8 byte. If you need more, you should use posix_memalign or memalign.
(In reply to comment #8) > If this is ppc 32-bit, then relying on malloc returning 16 byte aligned > addresses is wrong. malloc only guarantees 2 * sizeof (void *) alignment, > on ppc32 that's 8 byte. If you need more, you should use > posix_memalign or memalign. Hm.. that's good and bad news at the same time. Good because glibc is ok, bad because now I have to go hack tla to generate the correct MACHINE_ALIGNMENT macro on PPC32. Currently, it "discovers" machine alignment by walking through the standard types and getting the sizeof(). sizeof(long double) returns 16, which is where it's getting it's value from. I suppose ignoring long double would work. I'll hack up a patch shortly.
Created attachment 125287 [details] Patch to ignore long double for alignment purposes This patch allows tla to build successfully on ppc again. It's not the prettiest thing in that it completely ignores long double in the gen-alignment program, but I personally consider that programs behavior to be semi-broken to begin with. I can't think of a better way to work around the problem at the moment either. Ville, if you have no objections I'll commit this to CVS and kick off a rebuild.
Sure, go ahead. Out of curiosity, is __PPC__ defined on ppc64, and does this patch do the right thing on it too?
(In reply to comment #11) > Sure, go ahead. Out of curiosity, is __PPC__ defined on ppc64, and does this > patch do the right thing on it too? Yes, __PPC__ is defined on ppc64. I can't find a predefined macro that is only defined on ppc32 that we can use. So it'd have to be something like: #ifndef __PPC__ and __PPC64__ ... However, the package doesn't build on PPC64 for other reasons as well. And the FE buildsys always uses --target ppc to rpmbuild. I plan on opening a bug against upstream tla for this issue, so I think this patch will get us by until it's fixed upstream.
Upstream bug: https://savannah.gnu.org/bugs/?func=detailitem&item_id=15915
tla built successfully on rawhide.
Wow. I always knew that `gen-alignment' would eventually come back and bite me somehow -- I just never knew exactly how. Now I do. If I'm following the reports correctly, the real bug is in the invariant check. The invariant check makes a false assumption that the machine's strictest alignment requirements are determined by the size of the largest scalar type. The macro MACHINE_ALIGNMENT is poorly named. It should really be something like: CONSERVATIVE_ESTIMATE_OF_MACHINE_ALIGNMENT The primary use of MACHINE_ALIGNMENT is to safely wrap `malloc' in an allocator that adds a header to allocated memory. `ar.c' is an example of such an allocator. The user asks for N bytes. We want a K byte header. So we allocate: N + ALIGNMENT * ((K + ALIGNMENT - 1) / ALIGNMENT) bytes, put the header at byte 0, and start the user data at byte: ALIGNMENT * ((K + ALIGNMENT - 1) / ALIGNMENT) The invariant that we are truly trying to maintain is that, so long as `malloc' returns memory suitably aligned for anything, so does our wrapper (such as the `ar' functions). The actual invariant test in the code tests for a stricter condition that happens not to be true in this case. It's the invariant, not `gen-alignment', that is bogus. -t
One more thing: the "fix" of ignoring sizeof (long double) introduces a bug and should be removed (if it's there). -t
(In reply to comment #16) > One more thing: the "fix" of ignoring sizeof (long double) > introduces a bug and should be removed (if it's there). Which bug is that? Perhaps I've misunderstood all the cases MACHINE_ALIGNMENT is used for, but in my testing the patch at least allowed tla to build on Fedora Extras. I'll gladly revert the patch and add a patch for the real fix, as soon as I know what that is.
> Which bug is that? Allocation functions should return memory suitably aligned for any type. If gen-alignment ignores some scalar types, then some hackerlab functions can *theoretically* return memory that is not suitably aligned for those types. A better work-around and arguably a real fix might be to simply delete the call to `invariant' that caused the build to fail. -t
At any rate, I don't mean to be harsh, but "at least allowed tla to build" is kind of a random approach. We all do that kind of thing from time to time to solve some immediate nebishy little problem but it isn't an appropriate approach to take for something that is going to be distributed to others who are counting on some level of quality control. -t
(In reply to comment #18) > > Which bug is that? > > Allocation functions should return memory suitably aligned for > any type. If gen-alignment ignores some scalar types, then some > hackerlab functions can *theoretically* return memory that is not > suitably aligned for those types. Yes, sure. However, in this specific case the hackerlab functions are relying on malloc to return addresses for such alignment and that isn't valid. > A better work-around and arguably a real fix might be to simply > delete the call to `invariant' that caused the build to fail. I can't understand how this is any better. The invariant was there to check for the exact condition you described above, right? So if we delete the invariant, it's still not going to check for alignment on long double types. If hackerlab functions want to guarantee alignment for all types, they need to use memalign as Jakub suggested above. That seems to be more of a real fix than just deleting the invariant. I'm not trying to argue here, I'm just trying to understand where you're coming from given that I'm pretty new to the tla code.
(In reply to comment #19) > At any rate, I don't mean to be harsh, but "at least allowed tla > to build" is kind of a random approach. We all do that kind of > thing from time to time to solve some immediate nebishy little > problem but it isn't an appropriate approach to take for something > that is going to be distributed to others who are counting on > some level of quality control. That's all very fair and I appreciate the critcism. However, in this specific case it wasn't random. I looked into the problem and based on what I saw, disabling long double wouldn't cause any real-world problems. It's a workaround, not a fix and I understand that completely. Now, it's always a possibility that I've misunderstood something in the code here. I've opened the upstream bug to make sure that this problem is tracked and resolved in the correct manner. I could have simply "called it good" and let it go at that, but I didn't. And we still have time before FE5 is released so there is time to fix this properly. I'm reopening the bug so people don't get the wrong impression that this is really fixed.
> Yes, sure. However, in this specific case the hackerlab functions are relying > on malloc to return addresses [aligned for any type] and that isn't valid. Is that really true? This is a genuine question -- I'm unclear what various standards require on this point. I'm currently content to have libhackerlab assume that `malloc' returns an address suitably aligned for any type because, regardless of what the standards say, to implement `malloc' any other way is dumb. >> A better work-around and arguably a real fix might be to simply >> delete the call to `invariant' that caused the build to fail. > I can't understand how this is any better. The invariant was there to check > for the exact condition you described above, right? So if we delete the > invariant, it's still not going to check for alignment on long double types. So, long double is 16 bytes on PPC-32 and glibc malloc returns memory aligned on 8 byte boundaries. I'm *assuming* that PPC-32 can read or write long doubles on any 8 byte boundary even though a long double is 16 bytes. If glibc malloc is sometimes not returning memory suitably aligned for long double values then, I don't care what the standards say -- that is a serious glibc bug. Meanwhile, the libhackerlab invariant is testing for a condition that is stronger than needed and that can not hold true on a reasonable system. The whole MACHINE_ALIGNMENT hack is meant only to preserve the alignment guarantee of (a sane) `malloc' in situations like `ar.c'. It doesn't have to literally figure out the machine's alignment requirements. The invariant was probably handy when I wrote the code on an x86 box and ported it to solaris, but it looks from the discussion in this bug list that, in general, it's a bogus invariant. > If hackerlab functions want to guarantee alignment for all types, they need to > use memalign as Jakub suggested above. That seems to be more of a real fix > than just deleting the invariant. AFAIK, memalign is for handling alignment requirements that go *beyond* the machine architecture requirements. For example, an I/O device or the OS might require that a chunk of memory be aligned on a *page* boundary, even though no machine instruction has any such requirement. That's what memalign is for. Meanwhile: my_type * x = malloc (sizeof (my_type)); should always yield NULL or a valid pointer for a `my_type' lvalue. No? > I'm not trying to argue here, I'm just trying to understand where you're > coming from given that I'm pretty new to the tla code. And I'm trying not to be a jackass, to be helpful, and to explain what is admittedly the pretty subtle reasoning behind MACHINE_ALIGNMENT. I'm open to the possibility I'm full of it on these points but I don't think I am. We're cool. -t
(In reply to comment #22) > >> A better work-around and arguably a real fix might be to simply > >> delete the call to `invariant' that caused the build to fail. > > > I can't understand how this is any better. The invariant was there to check > > for the exact condition you described above, right? So if we delete the > > invariant, it's still not going to check for alignment on long double types. > > So, long double is 16 bytes on PPC-32 and glibc malloc returns memory aligned on > 8 byte boundaries. I'm *assuming* that PPC-32 can read or write long doubles > on any 8 byte boundary even though a long double is 16 bytes. > > If glibc malloc is sometimes not returning memory suitably aligned for > long double values then, I don't care what the standards say -- that is a > serious glibc bug. I've added Jakub on CC again. Jakub, any change you could expand upon this situation a bit? > Meanwhile, the libhackerlab invariant is testing for a condition that is > stronger than needed and that can not hold true on a reasonable system. > The whole MACHINE_ALIGNMENT hack is meant only to preserve the alignment > guarantee of (a sane) `malloc' in situations like `ar.c'. It doesn't > have to literally figure out the machine's alignment requirements. The > invariant was probably handy when I wrote the code on an x86 box and > ported it to solaris, but it looks from the discussion in this bug list that, > in general, it's a bogus invariant. > > > If hackerlab functions want to guarantee alignment for all types, they need to > > use memalign as Jakub suggested above. That seems to be more of a real fix > > than just deleting the invariant. > > AFAIK, memalign is for handling alignment requirements that go *beyond* the > machine architecture requirements. For example, an I/O device or the OS might > require that a chunk of memory be aligned on a *page* boundary, even though > no machine instruction has any such requirement. That's what memalign is > for. > > Meanwhile: > > my_type * x = malloc (sizeof (my_type)); > > should always yield NULL or a valid pointer for a `my_type' lvalue. No? I think the pointer should be valid, just not always aligned to sizeof(my_tpye). If that is the case, then I think I agree that the invariant can probably go.
I noticed this in today's rawhide report: glibc-2.3.91-1 -------------- * Thu Mar 02 2006 Jakub Jelinek <jakub> 2.3.91-1 - update from CVS - fixes for various arches - ensure malloc returns pointers aligned to at least MIN (2 * sizeof (size_t), __alignof__ (long double)) (only on ppc32 this has not been the case lately with addition of 128-bit long double, #182742) Jakub, can I assume that ppc32 malloc addresses not being aligned to sizeof(long double) really was a bug then?
I just rebuilt tla without the hack patch and it succeeded. In talking with David this morning, it was indeed a glibc bug. Patch reverted in CVS and a build has been queued up. Once that completes, I'll close both this and the upstream bug.
Yes, unfortunately this bugreport initially didn't mention the magic keyword long double anywhere :(. ISO C99 requires that malloc returns pointers suitably aligned for all ISO C99 standard types and long double is one of them. BTW, sizeof (long double) is completely wrong thing to check when you check alignment, you need to check __alignof__ (long double). On many architectures, some types have different alignment than their size, e.g. on sparc32 or s390 long double is 8 byte aligned, but 16 byte in size.
Can you please re-add the workaround, we decided to revert the glibc change for FC5, it is too dangerous this late in the devel cycle. While we fixed #183895, we have yet to discuss how to fix #183894, which will be much harder.
(In reply to comment #27) > Can you please re-add the workaround, we decided to revert the glibc change > for FC5, it is too dangerous this late in the devel cycle. Damn. > While we fixed #183895, we have yet to discuss how to fix #183894, which > will be much harder. Yes, I think we can add it. Tom, do you agree that removing the invariant for this particular failure would be the workaround for now?
I'm slightly dubious about the plan to ship glibc with this known to be misbehaving -- it would be nicer to make sure _new_ allocations are all aligned to 16 bytes, while still allowing free() of objects which are less aligned than that. But since it can be fixed in an update to FC5 glibc, since 'long double' does actually _work_ with 8-byte alignment anyway (despite the ABI requiring 16 bytes), and since 'long double' is fairly esoteric, I suppose it isn't the end of the world.
Ok, I've removed the invariant check in Extras. When glibc is fixed, I'll revert this patch. In the meantime, tla should build and function properly.
Not to be overly pedantic but I think it is more correct to *not* revert the patch even after glibc is fixed. The invariant should be fixed in the upstream Arch, too. The MACHINE_ALIGNMENT macro (poorly named) is a conservative but reasonable guess about alignment requirements. The invariant wrongly insists that the native `malloc' use the same value. Make sense? -t
Sure, that makes sense. And I like it when people are pedantic, it makes me understand things better. I'll rephrase: I'll be sure to drop the patch when we pull in the next upstream update that contains the same or a similar fix. ;)
It may be worth reviewing the code too (and I've noted this thread for upstream) to double check that the bogus alignment assumption about malloc isn't presumed -- especially in the `ar_' module. I don't personally have time to do so, at the moment. -t
As the upstream for now, I have noted of this problem and will review the ar_ code and try to come up with a correct fix.
I have a fix in the latest tla snapshot. I would like to test the fix in a PPC32 environment. Any help for this is appreciated.