RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1879780 - [LLNL 8.X BUG] register x8 and quad sized NEON registers are not properly saved when using ld_audit
Summary: [LLNL 8.X BUG] register x8 and quad sized NEON registers are not properly sav...
Keywords:
Status: CLOSED DUPLICATE of bug 2047981
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.3
Hardware: aarch64
OS: Linux
unspecified
high
Target Milestone: alpha
: 8.7
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
: 1880069 (view as bug list)
Depends On: 1991591
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-17 00:52 UTC by Ben Woodard
Modified: 2023-07-18 14:30 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2003291 (view as bug list)
Environment:
Last Closed: 2022-07-25 09:19:24 UTC
Type: Bug
Target Upstream Version: 2.35
Embargoed:


Attachments (Terms of Use)
reproducer (2.74 KB, application/gzip)
2020-09-17 00:52 UTC, Ben Woodard
no flags Details
UNTESTED proposed patch (9.19 KB, application/mbox)
2020-09-18 04:23 UTC, Ben Woodard
no flags Details
Part 2 of patch (4.36 KB, patch)
2020-09-22 01:38 UTC, Ben Woodard
no flags Details | Diff
Tested patch that fixes the x8 problem (7.65 KB, patch)
2020-09-22 05:59 UTC, Ben Woodard
no flags Details | Diff
V2 of proposed patch (9.09 KB, patch)
2020-09-23 01:05 UTC, Ben Woodard
no flags Details | Diff
Latest version of the patch (9.85 KB, patch)
2020-09-24 19:40 UTC, Ben Woodard
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Sourceware 26643 0 P2 NEW register x8 and quad sized NEON registers are not properly saved when using ld_audit on aarch64 2021-02-03 10:34:08 UTC

Description Ben Woodard 2020-09-17 00:52:00 UTC
Created attachment 1715148 [details]
reproducer

Description of problem:
According to the Arm Cortex-A Series Programmers Guide for ARMv8-A It is part of the ABI for the architecture.

See Section 9.1.1 Parameters in general-purpose registers on page 9-3 where it states:

   Registers with a special purpose (X8, X16-X18, X29, X30)
   • X8 is the indirect result register. This is used to pass the address location of an indirect result, for example, where a function returns a large structure.gister should be preserved. 

You can see that this register is saved in the normal code path which is what is normally run. See sysdeps/aarch64/dl-trampoline.S:48 :
	/* Save arguments.  */
	stp	x8, x9, [sp, #-(80+8*16)]!
	cfi_adjust_cfa_offset (80+8*16)
	cfi_rel_offset (x8, 0)
	cfi_rel_offset (x9, 8)

Side note:
I do not understand why x9 is saved there. According to the ABI spec. X9-X15 are Caller-saved temporary registers and so I would expect X9 to need to be saved if it was going to be used. However, I do not see any use of x9 in the code between the point at which it was saved in line 48 and when it is restored in line 113.

However, when you use LD_AUDIT the rtld shifts into what it calls profiling mode. In that profiling mode x8 is not properly saved. See there is no x8 being saved around line 185 in _dl_runtime_profile:

	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
	cfi_rel_offset (x0, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 0)
	cfi_rel_offset (x1, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 8)
	stp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
	cfi_rel_offset (x2, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 0)
	cfi_rel_offset (x3, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 8)
	stp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
	cfi_rel_offset (x4, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 0)
	cfi_rel_offset (x5, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 8)
	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
	cfi_rel_offset (x6, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 0)
	cfi_rel_offset (x7, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 8)

Yeah you got the Parameter and result registers but where is the Indirect result location register? After that block of code, you immediately start saving the d registers.

Version-Release number of selected component (if applicable):
All - this bug has been there since the beginning of the aarch64 port

How reproducible:
Always

Steps to Reproduce:
There are two reproducers in the uploaded tar.gz which I got for the original reporter. One is very simple and should work but variations in the compiler can change how the structure is returned and cause that reproducer not to work. The other reproducer is a much bigger bit of code which always forces the use of x8 but it is much more code.


Additional info:
This bug report was provided by developers at Rice University Jonathon Anderson, John Mellor-Crummey, Xiaozhu Meng, Mark Krentel working on behalf of LLNL

Comment 1 Ben Woodard 2020-09-17 00:54:53 UTC
Note that this bug also affects RHEL7-alt, and Fedora

Comment 2 Carlos O'Donell 2020-09-17 02:28:02 UTC
I've reached out to Arm to discuss this issue.

Comment 3 Ben Woodard 2020-09-17 16:01:32 UTC
It is explicitly requested that you open this BZ to ARM as we believe that this will help resolve the issue in a timely manner.

Also note that this affects upstream as well.

Comment 5 Ben Woodard 2020-09-17 23:34:26 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1880069

When you start looking at the code in _dl_runtime_profile it looks like it has been copied from a previous version of the processor where there were no Q registers in the V register group. In other words the V registers were only uint64_t's not uint128_t's. 

x29 ends up being loaded with a pointer to a struct of type la_aarch64_regs 

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
  uint64_t lr_xreg[8];
  uint64_t lr_dreg[8];
  uint64_t lr_sp;
  uint64_t lr_lr;
} La_aarch64_regs;

This doesn't include space for x8 the indirect result register. Nor does it have space for the full sized v registers. Furthermore when you look at line 136 where it describes the stack frame layout

	   Stack frame layout:
	   [sp,   #...] lr
	   [sp,   #...] &PLTGOT[n]
	   [sp,    #96] La_aarch64_regs
	   [sp,    #48] La_aarch64_retval
	   [sp,    #40] frame size return from pltenter
	   [sp,    #32] dl_profile_call saved x1
	   [sp,    #24] dl_profile_call saved x0
	   [sp,    #16] t1
	   [sp,     #0] x29, lr   <- x29

what _dl_runtime_resolve actually pushes on the stack the La_aarch64_regs part doesn't match the what _dl_runtime_profile is expecting. The size of the structure is quite different.

To be able to support those the structure would have to be:

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
  uint64_t    lr_xreg[10];
  __uint128_t lr_vreg[8]; /* or uint64_t lr_vreg[16]; */
  uint64_t    lr_sp;
  uint64_t    lr_lr;
} La_aarch64_regs;

The fact that the number of X regs that are pushed onto the stack is 10 rather than 9 becomes even more perplexing. Reading the ABI spec I would have thought that there is no reason to push x9.

Even though x9 is on the stack at the time when _dl_runtime_resolve is called my reading is that it doesn't have to be in the La_aarch64_regs structure so the structure could be:

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
  uint64_t    lr_xreg[9];
  __uint128_t lr_vreg[8]; /* or uint64_t lr_vreg[16]; */
  uint64_t    lr_sp;
  uint64_t    lr_lr;
} La_aarch64_regs;

Personally, I would use a sto rather than a stp up in _dl_profile_resolve and just push x8 by itself.

In that modification it should be noted that I changed the name of the NEON registers from dreg to vreg since that more accurately names that part of the register file.

even this is inadequate to contain the registers which are specified by SVE/SVE2

There may be some ABI implications to changing this structure but I would argue that this only impacts users of LD_AUDIT and that the fact that these bugs have been latent for so long suggests that the users who reported this to me are the first to make use of this interface.

Comment 6 Ben Woodard 2020-09-18 04:23:37 UTC
Created attachment 1715295 [details]
UNTESTED proposed patch

This is an untested patch which suggests the approach that I think is needed to fix this problem. I will try to test it tomorrow. However, someone with much more experience with ARM assembly code needs to look at it.

Comment 7 Ben Woodard 2020-09-18 14:15:23 UTC
*** Bug 1880069 has been marked as a duplicate of this bug. ***

Comment 8 Ben Woodard 2020-09-18 23:10:11 UTC
This document is important in understanding the topic: 
https://developer.arm.com/documentation/100986/0000/

Comment 9 Ben Woodard 2020-09-18 23:15:16 UTC
Solving the problem for SVE is more challenging due to the large and variable length registers. 
I think that I would like to do something like this:

typedef struct La_sve_bits
{
  __uint128_t lr_zreg[8];
  uint16_t    lr_preg[3];
}; /* 134 bytes in size */

typedef struct La_aarch64_regs
{
  unsigned int version:8;
  unsigned int sve_len:8;
  uint64_t lr_xreg[9];
  uint64_t lr_sp;
  uint64_t lr_lr;
  union {
    __uint128_t        lr_vreg[8];
    struct La_sve_bits lr_sve_reg[1];
  };
} La_aarch64_regs;

There are several ideas here:
1) The structure would be versioned so that if we need to update it at some point you can read the version number of the structure and know how to interpret it. God help us if we ever need more than 2-3 versions of this structure but let's give it a whole 8 bits.
2) I moved the SP and the LR up in the structure so that I could change the size of the SVE registers without overwriting them. 
3) The lr_vreg would be what I've got now. The NEON quad sized floats. However, now it is stuffed in an anonymous union.
4) SVE obviously has different sized registers depending on the processor implementation. However the internal size is a multiple of len and so I put sve_len as one of the fields. Currently it can be  1-16. Let's be generous and give it a whole 8 bits. In the non-sve case the sve_len will be 0; this can happen on an individual function call basis. Most functions will not be using the SVE registers. We don't want to be throwing around more than 2K unless we really have to.
5) Then in the case where we do use SVE sve_len will be set to the processor's true SVE register length in bits/128.
So in the case of a processor like A64FX which has a SVE register length of 512 the sve_len would be set to 4 and there would then be 3 more lr_sve_reg structures immediately following the one found inside the La_aarch64_regs structure.
Admittedly this would take some fiddling to reconstruct the full registers used to pass parameters to SVE but it would seem to me to be a better approach then making all the La_aarch64_regs 2376 bytes long.

In the default case the size of the structure would only be 226 bytes long. IMHO reasonable.
In the case of the A64FX it would be 628 bytes.

Comment 10 Ben Woodard 2020-09-22 01:38:23 UTC
Created attachment 1715610 [details]
Part 2 of patch

I'll squash these into one patch later. This fixes the alignment problem that I introduced and addresses a place problem where missed renaming a structure member.

Comment 11 Ben Woodard 2020-09-22 05:59:55 UTC
Created attachment 1715632 [details]
Tested patch that fixes the x8 problem

This patch was tested to work on the x8 reproducer attached to this bug.

$ LD_AUDIT=../break-glibc/arm-audit-bug/simple/libauditor.so ../break-glibc/arm-audit-bug/simple/main
Segmentation fault (core dumped)
$ builddir=`pwd` LD_AUDIT=../break-glibc/arm-audit-bug/simple/libauditor.so CONV_PATH="${builddir}"/iconvdata LOCPATH="${builddir}"/localedata LC_ALL=C  "${builddir}"/elf/ld-linux-aarch64.so.1 --library-path "${builddir}":"${builddir}"/math:"${builddir}"/elf:"${builddir}"/dlfcn:"${builddir}"/nss:"${builddir}"/nis:"${builddir}"/rt:"${builddir}"/resolv:"${builddir}"/mathvec:"${builddir}"/support:"${builddir}"/crypt:"${builddir}"/nptl ../break-glibc/arm-audit-bug/simple/main
$

The NEON Vector register problem is also fixed but a reproducer has not been developed yet.
The analogous SVE problem is not yet solved.

Comment 12 Ben Woodard 2020-09-23 01:05:56 UTC
Created attachment 1715873 [details]
V2 of proposed patch

The La_aarch64_retval structure also needed to be expanded to accommodate the quad sized V registers at the same time, the ABI also specifies that additional registers can be used for return values, therefore these were also added to the La_aarch64_retval structure.

Comment 13 Ben Woodard 2020-09-24 19:40:28 UTC
Created attachment 1716412 [details]
Latest version of the patch

Comment 14 Ben Woodard 2020-09-24 19:42:20 UTC
Current upstream status - paused pending upstream consideration about what to do with SVE

Comment 16 Jon Masters 2020-09-28 15:49:00 UTC
Note that AAPCS64 says:

"The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved 7; it is the responsibility of the caller to preserve larger values."

Thus you don't need a 128-bit quantity to save those V registers, only the lower 64-bits. It looks like just the indirect result register is at issue here?

Comment 17 Jon Masters 2020-09-28 16:23:42 UTC
Ok. Having talked about this some more, it seems there are two separate issues:

1). The existing profile trampoline entry/exit isn't doing the needed save/restore on x86 on the local stack. Non-controversial issue.

2). A separate INTERNAL GLIBC ABI nothing to do with AAPCS64 issue in the audit interface where users desire to log x86 plus full SIMD/SVE register state, requiring a change to the INTERNAL GLIBC audit structures.

We should split this issue into the trivial fix, and the audit fix. As separate issues.

Jon.

Comment 18 Jon Masters 2020-09-28 16:24:08 UTC
Oops - x8 not x86 :)

Comment 25 Ben Woodard 2021-08-19 20:24:24 UTC
This problem as well as the related one with regards to SVE have been looked at upstream. They have been proposed for glibc-2.35 but they haven't been extensively tested yet nor have they been accepted. 

git clone https://sourceware.org/git/glibc.git
git checkout --track remotes/origin/azanella/ld-audit-fixes

Most of the work was done in: 
0020-elf-Fix-runtime-linker-auditing-on-aarch64-BZ-26643.patch                                                                                                                                        
0021-elf-Add-SVE-support-for-aarch64-rtld-audit.patch  

It just needs to be backported to RHEL8 and RHEL9.

Comment 38 Florian Weimer 2022-07-25 09:19:24 UTC
Delivered via bug 2047981.

*** This bug has been marked as a duplicate of bug 2047981 ***


Note You need to log in before you can comment on or make changes to this bug.