1464211 – valgrind: Mask CPUID support in HWCAP on aarch64


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	valgrind
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Mark Wielaard
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1464085 1467952

Reported:	2017-06-22 16:23 UTC by Florian Weimer
Modified:	2018-06-14 12:40 UTC
CC List:	6 users
Fixed In Version:	valgrind-3.13.0-4.fc26
Clone Of:	1464085
Environment:
Last Closed:	2017-07-07 23:05:15 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
valgrind output (4.10 KB, text/plain) 2018-06-13 19:36 UTC, Rob Clark	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
KDE Software Compilation	381556	0	NOR	RESOLVED	arm64: Handle feature registers access on 4.11 Linux kernel or later	2019-12-10 05:25:19 UTC

Description Florian Weimer 2017-06-22 16:23:16 UTC

+++ This bug was initially created as a clone of Bug #1464085 +++

valgrind currently does not know anything about the CPUID flag added to the HWCAP auxv entry in kernel 4.11.  It passes this flag through to applications, but it will then choke when the application uses it, like this:

ARM64 front end: branch_etc
disInstr(arm64): unhandled instruction 0xD5380000
disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
==924== valgrind: Unrecognised instruction at address 0x11f548.
==924==    at 0x11F548: init_cpu_features (cpu-features.c:32)
==924==    by 0x11F548: dl_platform_init (dl-machine.h:241)
==924==    by 0x11F548: _dl_sysdep_start (dl-sysdep.c:231)
==924==    by 0x10981B: _dl_start_final (rtld.c:412)
==924==    by 0x109AAB: _dl_start (rtld.c:520)

The crashing instruction is the mrs in the glibc startup code, which means that currently no applications run under valgrind:

  if (hwcap & HWCAP_CPUID)
    {
      register uint64_t id = 0;
      asm volatile ("mrs %0, midr_el1" : "=r"(id));
      cpu_features->midr_el1 = id;
    }
  else
    cpu_features->midr_el1 = 0;
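
For reference, the unhandled word 0xD5380000 from the log above can be decoded by hand. The following is a minimal sketch, assuming the standard A64 layout for system register reads (op0/op1/CRn/CRm/op2/Rt); the struct and function names are made up for illustration and are not taken from valgrind or glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the fields of an A64 MRS instruction. */
struct mrs_fields {
    unsigned op0, op1, crn, crm, op2, rt;
};

/* True if the fixed MRS bits are present (bits 31..20 == 0xD53). */
bool is_mrs(uint32_t insn)
{
    return (insn & 0xFFF00000u) == 0xD5300000u;
}

/* Split an MRS instruction word into its system register fields. */
struct mrs_fields decode_mrs(uint32_t insn)
{
    struct mrs_fields f;
    f.op0 = 2u + ((insn >> 19) & 0x1u); /* o0 bit selects op0 = 2 or 3 */
    f.op1 = (insn >> 16) & 0x7u;
    f.crn = (insn >> 12) & 0xFu;
    f.crm = (insn >> 8) & 0xFu;
    f.op2 = (insn >> 5) & 0x7u;
    f.rt  = insn & 0x1Fu;               /* destination register number */
    return f;
}
```

Applied to 0xD5380000 this yields op0=3, op1=0, CRn=0, CRm=0, op2=0, Rt=0, which is a read of MIDR_EL1 into x0 and matches the mrs in the glibc snippet.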

Perhaps valgrind should mask all the HWCAP bits it knows nothing about.

Workaround: Run with “LD_HWCAP_MASK=1”.
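
The masking idea suggested above could look roughly like the sketch below. This is not valgrind's code: the function name and the idea of passing the supported set as a parameter are assumptions for illustration. The only fixed fact used is that HWCAP_CPUID is bit 11 of AT_HWCAP on arm64 Linux.

```c
/* Bit 11 of AT_HWCAP on arm64 Linux (HWCAP_CPUID in the kernel headers);
   redefined under a sketch-local name so the example is self-contained. */
#define SKETCH_HWCAP_CPUID (1UL << 11)

/* Hypothetical helper: keep only the capability bits the emulator is
   known to handle and clear everything else, including bits the kernel
   may add in the future. */
unsigned long mask_unknown_hwcap(unsigned long host_hwcap,
                                 unsigned long supported)
{
    return host_hwcap & supported;
}
```

With a host value of 0x8ff and a supported set of 0x0ff, the result is 0x0ff: the CPUID bit is cleared, so a guarded mrs like the one above would be skipped. Passing 0 as the supported set clears every bit.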

Comment 1 Mark Wielaard 2017-06-23 10:52:41 UTC

See also upstream https://bugs.kde.org/show_bug.cgi?id=381556
arm64: Handle feature registers access on 4.11 Linux kernel or later

For now this is worked around in valgrind-3.13.0-3.fc27, as suggested in the original description of this bug:

--- a/coregrind/m_initimg/initimg-linux.c
+++ b/coregrind/m_initimg/initimg-linux.c
@@ -703,6 +703,12 @@ Addr setup_client_stack( void*  init_sp,
                   (and anything above) are not supported by Valgrind. */
                auxv->u.a_val &= VKI_HWCAP_S390_TE - 1;
             }
+#           elif defined(VGP_arm64_linux)
+            {
+               /* Linux 4.11 started populating this for arm64, but we
+                  currently don't support any. */
+               auxv->u.a_val = 0;
+            }
 #           endif
             break;
 #        if defined(VGP_ppc64be_linux) || defined(VGP_ppc64le_linux)

Keeping this bug open to see how upstream resolves this.

Comment 2 Fedora Update System 2017-06-29 20:11:01 UTC

valgrind-3.13.0-4.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-4315a2f0cd

Comment 3 Fedora Update System 2017-06-30 20:25:29 UTC

valgrind-3.13.0-4.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-4315a2f0cd

Comment 4 Fedora Update System 2017-07-07 23:05:15 UTC

valgrind-3.13.0-4.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 5 Rob Clark 2018-06-13 19:15:57 UTC

(In reply to Mark Wielaard from comment #1)
> See also upstream https://bugs.kde.org/show_bug.cgi?id=381556
> arm64: Handle feature registers access on 4.11 Linux kernel or later
> 
> For now this is worked around in valgrind-3.13.0-3.fc27, as suggested in the
> original description of this bug:
> 
> --- a/coregrind/m_initimg/initimg-linux.c
> +++ b/coregrind/m_initimg/initimg-linux.c
> @@ -703,6 +703,12 @@ Addr setup_client_stack( void*  init_sp,
>                    (and anything above) are not supported by Valgrind. */
>                 auxv->u.a_val &= VKI_HWCAP_S390_TE - 1;
>              }
> +#           elif defined(VGP_arm64_linux)
> +            {
> +               /* Linux 4.11 started populating this for arm64, but we
> +                  currently don't support any. */
> +               auxv->u.a_val = 0;
> +            }
>  #           endif
>              break;
>  #        if defined(VGP_ppc64be_linux) || defined(VGP_ppc64le_linux)
> 
> Keeping this bug open to see how upstream resolves this.


hmm, I just saw the same issue on rawhide (valgrind 1:3.13.0-18.fc29).. did a patch get lost from the spec file?

Comment 6 Mark Wielaard 2018-06-13 19:24:33 UTC

(In reply to Rob Clark from comment #5)
> hmm, I just saw the same issue on rawhide (valgrind 1:3.13.0-18.fc29).. did
> a patch get lost from the spec file?

The patch (valgrind-3.13.0-arm64-hwcap.patch) is there (and still the same, no change upstream), and applied. Is the issue exactly the same as in the description? Could you paste the command line and the valgrind error message?

Comment 7 Rob Clark 2018-06-13 19:36:20 UTC

cmdline:

  valgrind --leak-check=yes ./deqp-gles31 --deqp-case=dEQP-GLES31.functional.ssbo.layout.random.arrays_of_arrays.1

(debugging some dEQP test crashes in mesa/freedreno)

output (without LD_HWCAP_MASK=1 which works around the issue) (also attached):

==32073== Memcheck, a memory error detector
==32073== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==32073== Using Valgrind-3.13.0.SVN and LibVEX; rerun with -h for copyright info
==32073== Command: ./deqp-gles31 --deqp-visibility=hidden --deqp-case=dEQP-GLES31.functional.ssbo.layout.random.arrays_of_arrays.1 --deqp-log-filename=results/dEQP-GLES31.functional.ssbo.layout.random.arrays_of_arrays.1.qpa
==32073== 
ARM64 front end: branch_etc
disInstr(arm64): unhandled instruction 0xD5380000
disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
==32073== valgrind: Unrecognised instruction at address 0x40150cc.
==32073==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==32073==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==32073==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==32073==    by 0x40018C3: _dl_start_final (rtld.c:411)
==32073==    by 0x4001B3F: _dl_start (rtld.c:520)
==32073==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)
==32073== Your program just tried to execute an instruction that Valgrind
==32073== did not recognise.  There are two possible reasons for this.
==32073== 1. Your program has a bug and erroneously jumped to a non-code
==32073==    location.  If you are running Memcheck and you just saw a
==32073==    warning about a bad jump, it's probably your program's fault.
==32073== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32073==    i.e. it's Valgrind's fault.  If you think this is the case or
==32073==    you are not sure, please let us know and we'll try to fix it.
==32073== Either way, Valgrind will now raise a SIGILL signal which will
==32073== probably kill your program.
==32073== 
==32073== Process terminating with default action of signal 4 (SIGILL): dumping core
==32073==  Illegal opcode at address 0x40150CC
==32073==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==32073==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==32073==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==32073==    by 0x40018C3: _dl_start_final (rtld.c:411)
==32073==    by 0x4001B3F: _dl_start (rtld.c:520)
==32073==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)

valgrind: m_coredump/coredump-elf.c:506 (fill_fpu): Assertion 'Unimplemented functionality' failed.
valgrind: valgrind

host stacktrace:
==32073==    at 0x3803E0FC: show_sched_status_wrk (m_libcassert.c:378)
==32073==    by 0x3803E22B: report_and_quit (m_libcassert.c:449)
==32073==    by 0x3803E387: vgPlain_assert_fail (m_libcassert.c:515)
==32073==    by 0x380706FB: fill_fpu.isra.4 (coredump-elf.c:506)
==32073==    by 0x380708CF: dump_one_thread (coredump-elf.c:563)
==32073==    by 0x380708CF: make_elf_coredump (coredump-elf.c:667)
==32073==    by 0x380708CF: vgPlain_make_coredump (coredump-elf.c:748)
==32073==    by 0x3805654F: default_action (m_signals.c:1937)
==32073==    by 0x3805654F: deliver_signal (m_signals.c:1997)
==32073==    by 0x38056D0B: vgPlain_synth_sigill (m_signals.c:2106)
==32073==    by 0x380982DB: vgPlain_scheduler (scheduler.c:1577)
==32073==    by 0x380A939F: thread_wrapper (syswrap-linux.c:103)
==32073==    by 0x380A939F: run_a_thread_NORETURN (syswrap-linux.c:156)
==32073==    by 0xFFFFFFFFFFFFFFFF: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 32073)
==32073==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==32073==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==32073==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==32073==    by 0x40018C3: _dl_start_final (rtld.c:411)
==32073==    by 0x4001B3F: _dl_start (rtld.c:520)
==32073==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Comment 8 Rob Clark 2018-06-13 19:36:52 UTC

Created attachment 1451010
valgrind output

Comment 9 Florian Weimer 2018-06-13 19:47:16 UTC

That's from the midr_el1 read:

  /* If there was no useful tunable override, query the MIDR if the kernel
     allows it.  */
  if (midr == UINT64_MAX)
    {
      if (hwcap & HWCAP_CPUID)
	asm volatile ("mrs %0, midr_el1" : "=r"(midr));
      else
	midr = 0;
    }

So it looks like we get the wrong (host) hwcap value without masking.

Comment 10 Florian Weimer 2018-06-13 19:48:17 UTC

It might be helpful to run “LD_SHOW_AUXV=1 /bin/true” with and without valgrind.
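
The same check can also be made from inside a process. The sketch below assumes glibc's getauxval() from <sys/auxv.h> and uses illustrative function names; it tests bit 11 of AT_HWCAP, which is HWCAP_CPUID on arm64 Linux (on other architectures that bit means something else).

```c
#include <stdbool.h>
#include <sys/auxv.h> /* getauxval, AT_HWCAP */

/* Bit 11 of AT_HWCAP on arm64 Linux; sketch-local name. */
#define SKETCH_HWCAP_CPUID (1UL << 11)

/* True if the given AT_HWCAP value advertises the CPUID feature. */
bool hwcap_has_cpuid(unsigned long hwcap)
{
    return (hwcap & SKETCH_HWCAP_CPUID) != 0;
}

/* Same check against the auxiliary vector of the current process,
   which under valgrind is the vector valgrind built for the client. */
bool process_sees_cpuid(void)
{
    return hwcap_has_cpuid(getauxval(AT_HWCAP));
}
```

For example, a value of 0x8ff has bit 11 (0x800) set, so hwcap_has_cpuid(0x8ff) is true, while 0x0ff does not advertise CPUID.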

Comment 11 Rob Clark 2018-06-13 20:05:40 UTC

so, quick disclaimer: I'm running a non-standard kernel atm. If any kernel config/etc could affect this, I can retry w/ a vanilla kernel (but not immediately, and possibly not on the same device)

(In reply to Florian Weimer from comment #10)
> It might be helpful to run “LD_SHOW_AUXV=1 /bin/true” with and without
> valgrind.

[robclark@db820c:~]$ LD_SHOW_AUXV=1 /bin/true
AT_SYSINFO_EHDR: 0xffff81924000
AT_HWCAP:        8ff
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0xaaaac8ba2040
AT_PHENT:        56
AT_PHNUM:        9
AT_BASE:         0xffff818f6000
AT_FLAGS:        0x0
AT_ENTRY:        0xaaaac8ba38d0
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_RANDOM:       0xfffff1883f68
AT_EXECFN:       /bin/true
AT_PLATFORM:     aarch64
[robclark@db820c:~]$ 
[robclark@db820c:~]$ LD_SHOW_AUXV=1 valgrind --leak-check=yes /bin/true
AT_SYSINFO_EHDR: 0xffff9eb51000
AT_HWCAP:        8ff
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x400040
AT_PHENT:        56
AT_PHNUM:        9
AT_BASE:         0xffff9eb23000
AT_FLAGS:        0x0
AT_ENTRY:        0x4011d0
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_RANDOM:       0xffffc66278c8
AT_EXECFN:       /usr/local/bin/valgrind
AT_PLATFORM:     aarch64
==1668== Memcheck, a memory error detector
==1668== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==1668== Using Valgrind-3.13.0.SVN and LibVEX; rerun with -h for copyright info
==1668== Command: /bin/true
==1668== 
ARM64 front end: branch_etc
disInstr(arm64): unhandled instruction 0xD5380000
disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
==1668== valgrind: Unrecognised instruction at address 0x40150cc.
==1668==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==1668==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==1668==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==1668==    by 0x40018C3: _dl_start_final (rtld.c:411)
==1668==    by 0x4001B3F: _dl_start (rtld.c:520)
==1668==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)
==1668== Your program just tried to execute an instruction that Valgrind
==1668== did not recognise.  There are two possible reasons for this.
==1668== 1. Your program has a bug and erroneously jumped to a non-code
==1668==    location.  If you are running Memcheck and you just saw a
==1668==    warning about a bad jump, it's probably your program's fault.
==1668== 2. The instruction is legitimate but Valgrind doesn't handle it,
==1668==    i.e. it's Valgrind's fault.  If you think this is the case or
==1668==    you are not sure, please let us know and we'll try to fix it.
==1668== Either way, Valgrind will now raise a SIGILL signal which will
==1668== probably kill your program.
==1668== 
==1668== Process terminating with default action of signal 4 (SIGILL): dumping core
==1668==  Illegal opcode at address 0x40150CC
==1668==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==1668==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==1668==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==1668==    by 0x40018C3: _dl_start_final (rtld.c:411)
==1668==    by 0x4001B3F: _dl_start (rtld.c:520)
==1668==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)

valgrind: m_coredump/coredump-elf.c:506 (fill_fpu): Assertion 'Unimplemented functionality' failed.
valgrind: valgrind

host stacktrace:
==1668==    at 0x3803E0FC: show_sched_status_wrk (m_libcassert.c:378)
==1668==    by 0x3803E22B: report_and_quit (m_libcassert.c:449)
==1668==    by 0x3803E387: vgPlain_assert_fail (m_libcassert.c:515)
==1668==    by 0x380706FB: fill_fpu.isra.4 (coredump-elf.c:506)
==1668==    by 0x380708CF: dump_one_thread (coredump-elf.c:563)
==1668==    by 0x380708CF: make_elf_coredump (coredump-elf.c:667)
==1668==    by 0x380708CF: vgPlain_make_coredump (coredump-elf.c:748)
==1668==    by 0x3805654F: default_action (m_signals.c:1937)
==1668==    by 0x3805654F: deliver_signal (m_signals.c:1997)
==1668==    by 0x38056D0B: vgPlain_synth_sigill (m_signals.c:2106)
==1668==    by 0x380982DB: vgPlain_scheduler (scheduler.c:1577)
==1668==    by 0x380A939F: thread_wrapper (syswrap-linux.c:103)
==1668==    by 0x380A939F: run_a_thread_NORETURN (syswrap-linux.c:156)
==1668==    by 0xFFFFFFFFFFFFFFFF: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 1668)
==1668==    at 0x40150CC: init_cpu_features (cpu-features.c:72)
==1668==    by 0x40150CC: dl_platform_init (dl-machine.h:208)
==1668==    by 0x40150CC: _dl_sysdep_start (dl-sysdep.c:231)
==1668==    by 0x40018C3: _dl_start_final (rtld.c:411)
==1668==    by 0x4001B3F: _dl_start (rtld.c:520)
==1668==    by 0x4001047: ??? (in /usr/lib64/ld-2.27.9000.so)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Comment 12 Mark Wielaard 2018-06-13 20:14:44 UTC

hohum, so that shows the HWCAP of valgrind itself, which then execs /bin/true and crashes before showing the auxv. Maybe try:

 LD_HWCAP_MASK=1 LD_SHOW_AUXV=1 valgrind -q /bin/true

Comment 13 Rob Clark 2018-06-14 11:50:06 UTC

heh, so this makes my problem a bit more obvious.. at one point in the past I had built my own valgrind (in /usr/local/bin which was ahead of /usr/bin in $PATH).. so in fact the problem all along was not with fedora's valgrind but pebkac ;-)

/me reaches for brown paper bag

------
[robclark@db820c:~]$ LD_HWCAP_MASK=1 LD_SHOW_AUXV=1 valgrind -q /bin/true
AT_SYSINFO_EHDR: 0xffffb56ca000
AT_HWCAP:        8ff
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x400040
AT_PHENT:        56
AT_PHNUM:        9
AT_BASE:         0xffffb569c000
AT_FLAGS:        0x0
AT_ENTRY:        0x4011d0
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_RANDOM:       0xffffd156b538
AT_EXECFN:       /usr/local/bin/valgrind
AT_PLATFORM:     aarch64
AT_HWCAP:        8ff
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x108040
AT_PHENT:        56
AT_PHNUM:        9
AT_BASE:         0x4000000
AT_FLAGS:        0x0
AT_ENTRY:        0x1098d0
AT_UID:          1000
AT_EUID:         1000
AT_GID:          1000
AT_EGID:         1000
AT_SECURE:       0
AT_RANDOM:       0xfff000fda
AT_EXECFN:       /bin/true
AT_PLATFORM:     aarch64

Comment 14 Mark Wielaard 2018-06-14 12:40:40 UTC

(In reply to Rob Clark from comment #13)
> heh, so this makes my problem a bit more obvious.. at one point in the past
> I had built my own valgrind (in /usr/local/bin which was ahead of /usr/bin
> in $PATH).. so in fact the problem all along was not with fedora's valgrind
> but pebkac ;-)
> 
> /me reaches for brown paper bag

No worries. Thanks for walking through it with us.
If there is any reason in the future to build an upstream valgrind please let me know. I am happy to backport any fixes to the fedora package.
