Bug 2322754

Summary: SIGSEGV in __llvm_write_binary_ids on aarch64
Product: [Fedora] Fedora Reporter: Josh Stone <jistone>
Component: llvmAssignee: Tom Stellard <tstellar>
Status: CLOSED COMPLETED QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rawhideCC: dmalcolm, dvlasenk, fweimer, jakub, jchecahi, jistone, josmyers, kkleine, nickc, npopov, scottt.tw, sergesanspaille, siddharth.kde, sipoyare, suraj.ghimire7, tbaeder, tstellar, tuliom, yahmad
Target Milestone: ---Keywords: Regression
Target Release: ---   
Hardware: aarch64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2025-01-07 15:19:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Josh Stone 2024-10-30 16:25:24 UTC
When a program is compiled with coverage instrumentation, it crashes while writing the file during exit (post-main). I only see this on aarch64 rawhide, not on other arches nor on aarch64 F41 with the same toolchain versions.

I originally ran into this in Rust, but I've reproduced it with Clang here.

Reproducible: Always

Steps to Reproduce:
1. dnf install clang compiler-rt
2. echo 'int main() {}' >main.c
3. clang -fprofile-instr-generate -fcoverage-mapping main.c -o main
4. ./main
Actual Results:  
Segmentation fault (core dumped)

and default.profraw was created but empty.

Expected Results:  
No visible output, and coverage data in default.profraw.

(gdb) bt
#0  0x00000000004048c0 in __llvm_write_binary_ids ()
#1  0x00000000004040c4 in lprofWriteDataImpl ()
#2  0x0000000000403f2c in lprofWriteData ()
#3  0x00000000004022c0 in writeFile ()
#4  0x0000000000402104 in __llvm_profile_write_file ()
#5  0x0000fffff7e2dfcc in __run_exit_handlers (status=0, listp=0xfffff7fb0690 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:118
#6  0x0000fffff7e2e0c0 in __GI_exit (status=<optimized out>) at exit.c:148
#7  0x0000fffff7e162e0 in __libc_start_call_main (main=main@entry=0x4012b4 <_start+52>, argc=argc@entry=1, argv=argv@entry=0xffffffffeda8) at ../sysdeps/nptl/libc_start_call_main.h:74
#8  0x0000fffff7e163bc in __libc_start_main_impl (main=0x4012b4 <_start+52>, argc=1, argv=0xffffffffeda8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:360
#9  0x00000000004012b0 in _start ()

Comment 1 Josh Stone 2024-10-30 17:16:00 UTC
This seems to be a binutils regression -- it works with 2.43.50-2.fc42, but crashes with -4 and -5.

Comment 2 Florian Weimer 2024-10-30 17:19:32 UTC
Given what __llvm_write_binary_ids does, I would expect that this is the result of a binutils change.

We used to have:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001b 0x00001b R   0x1
        [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0093b8 0x0093b8 R E 0x10000
  LOAD           0x00fcf8 0x000000000041fcf8 0x000000000041fcf8 0x006510 0x008788 RW  0x10000
  DYNAMIC        0x00fd38 0x000000000041fd38 0x000000000041fd38 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x0078c4 0x00000000004078c4 0x00000000004078c4 0x000474 0x000474 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fcf8 0x000000000041fcf8 0x000000000041fcf8 0x000308 0x000308 R   0x1

Now it's:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x000230 0x000230 R   0x8
  INTERP         0x000294 0x0000000000400294 0x0000000000400294 0x00001b 0x00001b R   0x1
        [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0093f8 0x0093f8 R E 0x10000
  LOAD           0x00fcf8 0x000000000041fcf8 0x000000000041fcf8 0x0087a8 0x0087a8 RW  0x10000
  DYNAMIC        0x00fd38 0x000000000041fd38 0x000000000041fd38 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000270 0x0000000000400270 0x0000000000400270 0x000024 0x000024 R   0x4
  NOTE           0x018480 0x0000000000428480 0x0000000000428480 0x000020 0x000020 R   0x4
  GNU_EH_FRAME   0x007904 0x0000000000407904 0x0000000000407904 0x000474 0x000474 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fcf8 0x000000000041fcf8 0x000000000041fcf8 0x000308 0x000308 R   0x1

There are two NOTE segments. Probably related to this binutils change:

commit bf6d7087de0a7351fd1dfd5f41522a7f4f576180
Author: Nick Clifton <nickc>
Date:   Thu Sep 19 16:45:30 2024 +0100

    ld: Move the .note.build-id section to near the start of the memory map.
    
    This helps GDB to locate the debug information associated with a core dump.
    Core dumps include the first page of an executable's image, and if this
    page include the .note.build-id section then GDB can find it and then track
    down a debug info file for that build-id.

It's unclear if this is a binutils bug, or if the code in LLVM has never worked with multiple NOTE segments (or certain type of NOTE segments).

Comment 3 Josh Stone 2024-10-31 00:09:02 UTC
Here's a scratch build without stripping static debuginfo, especially for compiler-rt:
https://koji.fedoraproject.org/koji/taskinfo?taskID=125373051

Note that the second NOTE has an additional 0x10000 memory offset relative to its file offset. With gdb's help, I noticed a problem here:

https://github.com/llvm/llvm-project/blob/847f4ef21b4a953bb6dd6477791e8d95b6db2509/compiler-rt/lib/profile/InstrProfilingPlatformLinux.c#L212-L231

    /*
     * When examining notes in file, use p_offset, which is the offset within
     * the elf file, to find the start of notes.
     */
    if (ProgramHeader[I].p_memsz == 0 ||
        ProgramHeader[I].p_memsz == ProgramHeader[I].p_filesz) {
      Note = (const ElfW(Nhdr) *)((uintptr_t)ElfHeader +
                                  ProgramHeader[I].p_offset);
      NotesEnd = (const ElfW(Nhdr) *)((const char *)(Note) +
                                      ProgramHeader[I].p_filesz);
    } else {
      /*
       * When examining notes in memory, use p_vaddr, which is the address of
       * section after loaded to memory, to find the start of notes.
       */
      Note =
          (const ElfW(Nhdr) *)((uintptr_t)ElfHeader + ProgramHeader[I].p_vaddr);
      NotesEnd =
          (const ElfW(Nhdr) *)((const char *)(Note) + ProgramHeader[I].p_memsz);
    }

For both notes, we're hitting the first case because the sizes are equal, even though we're examining memory.
So for the second note, that's computing "Note = 0x400000 + 0x018480 = 0x418480" when it is actually loaded at 0x428480.

But the else case wouldn't be better either, since ElfHeader+vaddr would double-count the base 0x400000. I guess it would work for PIE or SOs though.

There's nothing aarch64-specific here, but I guess we're just lucky on other arches that the bad address still happens to land within a loaded segment.

Comment 4 Nick Clifton 2024-10-31 14:45:17 UTC
(In reply to Josh Stone from comment #3)
 
> Note that the second NOTE has an additional 0x10000 memory offset relative
> to its file offset. 

I am assuming that this is done in order to save space in the file.  It the
second NOTE section were to be placed at a file offset of 0x028480 there would
be a 0x10000 byte gap in the file.

>     /*
>      * When examining notes in file, use p_offset, which is the offset within
>      * the elf file, to find the start of notes.
>      */
>     if (ProgramHeader[I].p_memsz == 0 ||
>         ProgramHeader[I].p_memsz == ProgramHeader[I].p_filesz) {
>       Note = (const ElfW(Nhdr) *)((uintptr_t)ElfHeader +
>                                   ProgramHeader[I].p_offset);
>       NotesEnd = (const ElfW(Nhdr) *)((const char *)(Note) +
>                                       ProgramHeader[I].p_filesz);

This seems wrong to me.  Using a file offset to locate data in a memory
image ought to be incorrect.  (Or is this code looking at a memory
mapped copy of the file ?)  But I expect that I am misunderstanding
something ... :-)


>     } else {
>       /*
>        * When examining notes in memory, use p_vaddr, which is the address of
>        * section after loaded to memory, to find the start of notes.
>        */
>       Note =
>           (const ElfW(Nhdr) *)((uintptr_t)ElfHeader +
> ProgramHeader[I].p_vaddr);
>       NotesEnd =
>           (const ElfW(Nhdr) *)((const char *)(Note) +
> ProgramHeader[I].p_memsz);
>     }
> > But the else case wouldn't be better either, since ElfHeader+vaddr would
> double-count the base 0x400000. I guess it would work for PIE or SOs though.

Again I am probably confused here, but why is ElfHeader being added to vaddr
in the first place.  Isn't vaddr supposed to be the full address of the in-memory
copy of the NOTE section ?


> There's nothing aarch64-specific here, but I guess we're just lucky on other
> arches that the bad address still happens to land within a loaded segment.

There is a possibility that this issue is also related to:

  https://bugzilla.redhat.com/show_bug.cgi?id=2321588

Which appears to be another linker bug related to the patch to move
the .note.build-id section back to the start of an executable's image.

Comment 5 Josh Stone 2024-10-31 17:58:17 UTC
(In reply to Nick Clifton from comment #4)
> (In reply to Josh Stone from comment #3)
>  
> > Note that the second NOTE has an additional 0x10000 memory offset relative
> > to its file offset. 
> 
> I am assuming that this is done in order to save space in the file.  It the
> second NOTE section were to be placed at a file offset of 0x028480 there
> would
> be a 0x10000 byte gap in the file.

Yeah, sure, I don't expect the file to have the gap. The difference is that before, there was no such gap to contend with, at least as far as NOTEs were concerned.

Before, we had one NOTE at offset 0x254 size 0x44. After, we have one NOTE at offset 0x270 size 0x24, and the second at file offset 0x18480 size 0x20 -- with an extra 0x10000 LOAD offset in memory. So it seems like the build-id note was already near the start of the memory map before, at least in this small case, but the change split them so that the remaining notes are now placed elsewhere.

I do agree LLVM has a bug in handling all these offsets, even if I don't understand what the binutils change accomplished.

> >     /*
> >      * When examining notes in file, use p_offset, which is the offset within
> >      * the elf file, to find the start of notes.
> >      */
> >     if (ProgramHeader[I].p_memsz == 0 ||
> >         ProgramHeader[I].p_memsz == ProgramHeader[I].p_filesz) {
> >       Note = (const ElfW(Nhdr) *)((uintptr_t)ElfHeader +
> >                                   ProgramHeader[I].p_offset);
> >       NotesEnd = (const ElfW(Nhdr) *)((const char *)(Note) +
> >                                       ProgramHeader[I].p_filesz);
> 
> This seems wrong to me.  Using a file offset to locate data in a memory
> image ought to be incorrect.  (Or is this code looking at a memory
> mapped copy of the file ?)  But I expect that I am misunderstanding
> something ... :-)

I agree that this is wrong -- I don't even understand how those "if" conditions were supposed to distinguish anything. AFAIK the program headers will look the same either way.

> >     } else {
> >       /*
> >        * When examining notes in memory, use p_vaddr, which is the address of
> >        * section after loaded to memory, to find the start of notes.
> >        */
> >       Note =
> >           (const ElfW(Nhdr) *)((uintptr_t)ElfHeader +
> > ProgramHeader[I].p_vaddr);
> >       NotesEnd =
> >           (const ElfW(Nhdr) *)((const char *)(Note) +
> > ProgramHeader[I].p_memsz);
> >     }
> > > But the else case wouldn't be better either, since ElfHeader+vaddr would
> > double-count the base 0x400000. I guess it would work for PIE or SOs though.
> 
> Again I am probably confused here, but why is ElfHeader being added to vaddr
> in the first place.  Isn't vaddr supposed to be the full address of the
> in-memory
> copy of the NOTE section ?

If it's a PIE or SO, then the file's program header can't know where the runtime vaddr will be. e.g. "readelf -Wl /bin/ls" shows vaddrs near zero, and so adding ElfHeader represents the base address at runtime.

(but the note's filesz and memsz are still equal, so it seems like we won't reach this "else" case at all)

Comment 6 Josh Stone 2024-11-01 21:17:44 UTC
I filed upstream https://github.com/llvm/llvm-project/issues/114605

Comment 7 Nikita Popov 2025-01-07 15:19:05 UTC
Has been fixed upstream in https://github.com/llvm/llvm-project/pull/114907 and backported in https://src.fedoraproject.org/rpms/llvm/pull-request/339, so closing this issue.