malloc/tst-malloc-stats-cancellation hangs when I use the following configure line:
../configure CFLAGS="-v -w -g -O2 -iplugindir=/usr/lib/gcc/armv7hl-redhat-linux-gnueabi/11/plugin -fplugin=annobin" --prefix=/usr --with-nonshared-cflags="-fplugin=annobin -fplugin-arg-annobin-disable" --disable-werror
...but not when I use this:
../configure CFLAGS="-v -w -g -O2" --prefix=/usr --disable-werror
I'm not sure how the hang is related to annobin, but here is what happens: a child thread is cancelled, but the cancellation does not complete cleanly and a lock on stderr is never released. The parent then tries to acquire that lock after the child has been cancelled, and waits on it until the test times out.
In theory annobin should have no effect on the execution of any binary to which it has been applied. The plugin just creates a non-loadable note section and some extra symbols in the symbol table. In practice those extra symbols can sometimes be problematic, and maybe that is the case in this particular scenario.
Without knowing more about why the lock is being held, it is hard to say more. But one possible place to look is any ARM-specific code in the thread library. In particular, is there any code that scans the symbol table of ARM binaries, possibly looking for function symbols or the like?
ARM EABI uses non-DWARF exception handling. Perhaps that's why it's disturbed by annobin data and the extra symbols?
If it is the annobin symbols that are causing a problem, then you *might* be able to make the test work by stripping them out. For example:
objcopy --strip-unneeded a.out a.stripped
Of course this might also break the ARM unwinder by removing symbols that it needs, so no guarantees that it won't make things worse...
Proposed as a Blocker for 35-beta by Fedora user pbrobinson using the blocker tracking app because:
This is actually a mass rebuild blocker but we don't have the ability to add that so adding it here so it's tracked somewhere.
This may be fixed by annobin-9.72-1.fc35. Arjun - please can you check ?
(In reply to Nick Clifton from comment #5)
> This may be fixed by annobin-9.72-1.fc35. Arjun - please can you check ?
Thanks, Nick! I'm on it.
"This is actually a mass rebuild blocker but we don't have the ability to add that so adding it here so it's tracked somewhere."
That's what the prioritized bug tracker is for:
Given your recent results, I think that there were actually two problems:
1. The hang in pthread cancellation. I think this was not caused
by the annobin problem (below) but by something else. A
recent commit to the glibc sources appears to have fixed the
problem, even if annobin is used when compiling the sources.
2. When a relocatable link is performed on ARM object files that
have been annotated by the annobin plugin, the resulting
unwind information is corrupt. I think that this has been
fixed in the annobin-9.72-1.fc35 build.
Do you agree? If so, then I think that we can close this BZ. If 1)
is true but 2) is not, then it would be better to open a separate BZ
for it. But if 1) is false, then more investigation is needed,
although I am not sure where.
So, I tested with "-Wl,--force-group-allocation" for libc_pic.os and
that seems to remove the hang. i.e.:
* Without the option but with annobin turned on: it hangs
* With the option and with annobin turned on: it does not hang
Note that this is at a glibc commit that was already hanging.
What we know now:
1. A hang started occurring at glibc commit "C1" (say).
2. Any *one* of three events appears to remove the hang:
* turning off annobin
* building libc_pic.os with --force-group-allocation
* fast-forwarding glibc to commit "C2"
Does this pinpoint any more about where bug #1 might lie?
> Does this pinpoint any more about where bug #1 might lie?
Yes, I think that it is safe to say that there is a latent problem with ARM unwind information and annobin-annotated code. Commit C1 exposed this problem (which has presumably existed for a long time but is only now coming to light), and commit C2 has hidden it again.
I had really hoped that annobin-9.73 would fix this problem, as it contains ARM specific code to disable the generation of section groups. (I believe annobin's use of section groups to be the underlying cause of the problem).
So back to the drawing board for me I guess.
In today's Prioritized Bugs meeting, we accepted this as a Prioritized Bug.
If anyone has additional input or can do additional testing, please comment.
This bug appears to have been reported against 'rawhide' during the Fedora 35 development cycle.
Changing version to 35.
In today's Prioritized Bugs meeting, we agreed that this bug is no longer a prioritized bug as the mass rebuild seems to have completed successfully without a fix.