Bug 1951492 - A glibc test hangs upon pthread cancellation when glibc is compiled with annobin turned on
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: annobin
Version: 35
Hardware: armv7hl
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nick Clifton
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARMTracker
Reported: 2021-04-20 09:16 UTC by Arjun Shankar
Modified: 2021-08-11 15:38 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug



Description Arjun Shankar 2021-04-20 09:16:10 UTC
malloc/tst-malloc-stats-cancellation hangs when I use the following configure line:

../configure CFLAGS="-v -w -g -O2 -iplugindir=/usr/lib/gcc/armv7hl-redhat-linux-gnueabi/11/plugin -fplugin=annobin" --prefix=/usr --with-nonshared-cflags="-fplugin=annobin -fplugin-arg-annobin-disable" --disable-werror

...but not when I use this:

../configure CFLAGS="-v -w -g -O2" --prefix=/usr --disable-werror

I'm not sure how the hang is related to annobin, but here is what I see: a child thread is cancelled, and the cancellation does not occur cleanly. A lock on stderr is not released, so when the parent tries to acquire that lock after the child's cancellation, it ends up waiting on it until the test times out.
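
For illustration, here is a minimal sketch of the general pattern behind this kind of hang. It is not the glibc test itself. The mutex `demo_lock` is a hypothetical stand-in for the lock on stderr, and the function names are made up for this example. The child takes the lock and registers a cleanup handler that releases it; the parent cancels the child and then checks whether the lock is free. If the cleanup handler did not run during cancellation, the lock would stay held and a blocking acquire in the parent would wait forever.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for the stdio lock on stderr. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: runs in the cancelled thread while it unwinds. */
static void release_lock(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *child(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_lock);
    pthread_cleanup_push(release_lock, &demo_lock);
    for (;;)
        pause();            /* cancellation point: the child is cancelled here */
    pthread_cleanup_pop(1); /* never reached; pairs with the push above */
    return NULL;
}

/* Returns 1 if the lock was released by the cancellation cleanup,
   0 if it is still held (the situation that leads to a hang),
   and -1 on an unexpected pthread error. */
int cancel_releases_lock(void)
{
    pthread_t tid;
    void *result;

    if (pthread_create(&tid, NULL, child, NULL) != 0)
        return -1;
    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0 || result != PTHREAD_CANCELED)
        return -1;

    /* Use trylock so that a stale lock is reported instead of hanging. */
    if (pthread_mutex_trylock(&demo_lock) != 0)
        return 0;
    pthread_mutex_unlock(&demo_lock);
    return 1;
}
```

To try it, call `cancel_releases_lock()` from a `main` function and build with `-pthread`. With a working cancellation path it should return 1; a return value of 0 would correspond to the stale lock described above.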

Comment 1 Nick Clifton 2021-04-20 12:46:42 UTC
In theory annobin should have no effect on the execution of any binary to which it has been applied.  The plugin just creates a non-loadable note section and some extra symbols in the symbol table.  In practice those extra symbols can sometimes be problematic, and maybe this is the case in this particular scenario.

Without knowing more about why the lock is being held, it is hard to say any more.  But a possible place to look is any ARM-specific code in the thread library.  In particular, is there any code that scans the symbol table of ARM binaries, possibly looking for function symbols or the like?

Comment 2 Florian Weimer 2021-04-20 13:26:31 UTC
ARM EABI uses non-DWARF exception handling. Perhaps that's why it's disturbed by annobin data and the extra symbols?

Comment 3 Nick Clifton 2021-04-21 12:15:34 UTC
Hi Arjun,

  If it is the annobin symbols that are causing a problem, then you *might* be able to make the test work by stripping them out.  For example:

    objcopy --strip-unneeded a.out a.stripped

  Of course this might also break the ARM unwinder by removing symbols that it needs, so no guarantees that it won't make things worse...

Cheers
  Nick

Comment 4 Fedora Blocker Bugs Application 2021-05-19 08:02:17 UTC
Proposed as a Blocker for 35-beta by Fedora user pbrobinson using the blocker tracking app because:

 This is actually a mass rebuild blocker but we don't have the ability to add that so adding it here so it's tracked somewhere.

Comment 5 Nick Clifton 2021-05-20 12:27:15 UTC
This may be fixed by annobin-9.72-1.fc35.  Arjun - please can you check ?

Comment 6 Arjun Shankar 2021-05-20 14:04:41 UTC
(In reply to Nick Clifton from comment #5)
> This may be fixed by annobin-9.72-1.fc35.  Arjun - please can you check ?

Thanks, Nick! I'm on it.

Comment 7 Adam Williamson 2021-05-20 16:12:32 UTC
"This is actually a mass rebuild blocker but we don't have the ability to add that so adding it here so it's tracked somewhere."

That's what the prioritized bug tracker is for:
https://docs.fedoraproject.org/en-US/program_management/prioritized_bugs/

Comment 8 Nick Clifton 2021-05-21 08:22:33 UTC
Hi Arjun,

  Given your recent results, I think that there were actually two problems:

    1. The hang in pthread cancellation.  This I think was not caused
       by the annobin problem (below) but rather something else.  A
       recent commit to the glibc sources appears to have fixed the
       problem, even if annobin is used when compiling the sources.

    2. When a relocatable link is performed on ARM object files that
       have been annotated by the annobin plugin, the resulting 
       unwind information is corrupt.  I think that this has been 
       fixed in the annobin-9.72-1.fc35 build.

  Do you agree ?  If so, then I think that we can close this BZ.  If 1)
  is true but 2) is not, then it would be better to open a separate BZ
  for it.  But if 1) is false, then more investigation is needed,
  although I am not sure where.

Cheers
  Nick

Comment 9 Arjun Shankar 2021-05-24 15:06:52 UTC
Hi Nick!

So, I tested with "-Wl,--force-group-allocation" for libc_pic.os and
that seems to remove the hang. i.e.:

* Without the option but with annobin turned on: it hangs
* With the option and with annobin turned on: it does not hang

Note that this is at a glibc commit that was already hanging.

What we know now:

1. A hang started occurring at glibc commit "C1" (say).

2. Any *one* of three events appears to remove the hang:
 * turning off annobin
 * building libc_pic.os with --force-group-allocation
 * fast-forwarding glibc to commit "C2"

Does this pinpoint any more about where bug #1 might lie?

Comment 10 Nick Clifton 2021-05-28 10:33:41 UTC
Hi Arjun,

> Does this pinpoint any more about where bug #1 might lie?

  Yes - I think that it is safe to say that there is a latent problem with ARM unwind information and annobin-annotated code.  Commit C1 exposed this problem (which presumably has existed for a long time, but is only now coming to light), and commit C2 has hidden it again.

  I had really hoped that annobin-9.73 would fix this problem, as it contains ARM specific code to disable the generation of section groups.  (I believe annobin's use of section groups to be the underlying cause of the problem).

  So back to the drawing board for me I guess.

Cheers
  Nick

Comment 11 Ben Cotton 2021-06-16 15:28:40 UTC
In today's Prioritized Bugs meeting[1], we accepted this as a Prioritized Bug.

[1] https://meetbot.fedoraproject.org/fedora-meeting-1/2021-06-16/fedora_prioritized_bugs_and_issues.2021-06-16-15.01.log.html#l-46

Comment 12 Ben Cotton 2021-07-09 15:32:30 UTC
If anyone has additional input or can do additional testing, please comment.

Comment 13 Ben Cotton 2021-08-10 12:59:22 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 35 development cycle.
Changing version to 35.

Comment 14 Ben Cotton 2021-08-11 15:38:57 UTC
In today's Prioritized Bugs meeting, we agreed that this bug is no longer a prioritized bug as the mass rebuild seems to have completed successfully without a fix.

https://meetbot.fedoraproject.org/fedora-meeting-1/2021-08-11/fedora_prioritized_bugs_and_issues.2021-08-11-15.00.html

