2257546 – jq: valgrind error when compiled with sse4.1

Bug 2257546 - jq: valgrind error when compiled with sse4.1

Summary: jq: valgrind error when compiled with sse4.1

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	valgrind
Sub Component:
Version:	40
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Mark Wielaard
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2024-01-10 02:59 UTC by Yaakov Selkowitz
Modified:	2024-03-23 00:41 UTC (History)
CC List:	16 users (show)
Fixed In Version:	valgrind-3.22.0-7.fc40
Clone Of:
Environment:
Last Closed:	2024-03-23 00:41:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Testcase; may be incorrect; see comment 6 (777 bytes, text/plain) 2024-03-06 08:52 UTC, Julian Seward	no flags	Details
Alternative impl for `ptest`. DO NOT LAND. (1.72 KB, text/plain) 2024-03-06 08:54 UTC, Julian Seward	no flags	Details
a proper fix (11.87 KB, patch) 2024-03-10 08:47 UTC, Julian Seward	no flags	Details \| Diff
Show Obsolete (2) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Fedora Package Sources	jq pull-request 7	0	None	None	None	2024-01-24 03:28:36 UTC

Description Yaakov Selkowitz 2024-01-10 02:59:12 UTC

jq 1.7.1, currently in rawhide, is failing some tests when built for ELN.  The testsuite uses valgrind, which is throwing the same error in both failed tests:

==1826901== Conditional jump or move depends on uninitialised value(s)
==1826901==    at 0x487945B: UnknownInlinedFun (jq_test.c:58)
==1826901==    by 0x487945B: run_jq_tests (jq_test.c:99)
==1826901==    by 0x487B58C: jq_testsuite (jq_test.c:39)
==1826901==    by 0x10C401: main (main.c:591)
==1826901== 

The source file in question is https://github.com/jqlang/jq/blob/master/src/jq_test.c

While this is being triggered in ELN due to its default of x86-64-v3, through trial and error I managed to narrow it down to SSE4.1, and it can also be seen in rawhide by adding -msse4.1, or negated with -march=x86-64-v[23] -mno-sse4.1, in CFLAGS or %__cflags_arch_x86_64.

FWIW, jq 1.6 built fine in December with the same gcc.

Reproducible: Always

Comment 1 Yaakov Selkowitz 2024-01-19 09:19:07 UTC

This issue persists in gcc-14.0.1-0.1.

Comment 2 Jakub Jelinek 2024-01-19 10:43:34 UTC

This is just valgrind unable to cope with what gcc validly emits for the strcmp calls.

Short self-contained testcase - rh2257546.c:
#include <stdio.h>

long long j;

static int checkfail(const char* buf) {
  return __builtin_strcmp(buf, "%%FAIL\n") == 0 || __builtin_strcmp(buf, "%%FAIL IGNORE MSG\n") == 0;
}

void
foo ()
{
  char prog[4096];
  for (int i = 0; i < 64; ++i)
    {
      if (!fgets(prog, sizeof(prog), stdin)) break;
      if (checkfail(prog))
        ++j;
    }
}

int
main ()
{
  foo ();
}

gcc -o rh2257546{,.c} -O3 -msse4 -g
echo a | valgrind ./rh2257546

Before https://gcc.gnu.org/r9-922 gcc has been emitting either strcmp calls or repz cmpsb, but since that commit
gcc has been emitting
movabsq $5269295441534919973, %r13
movabsq $6002489360460500551, %r12
movabsq $2898627278546213, %rbp
...
        movq    8(%rsp), %rdx
        movq    (%rsp), %rax
        xorq    %r12, %rdx
        xorq    %r13, %rax
        orq     %rax, %rdx
        jne     .L8
        cmpw    $2631, 16(%rsp)
        je      .L17
...
        cmpq    %rbp, (%rsp)
        jne     .L18
etc. where 2898627278546213 is "%%FAIL\n\0", 5269295441534919973 "%%FAIL I", 6002489360460500551 is "GNORE MS" and 2631 is "G\n".
So, even when the prog buffer contains uninitialized bytes after the terminating '\0', it reads 8 or 16 bytes out of the buffer and compares
it against the expected strings.  This is a valid optimization, the compiler knows that the buffer is on the stack and the bytes can be read from it,
and if there is some '\0' and random uninitialized bytes after it before the 8 or 16 bytes mark, it doesn't matter, while the comparison will compare
the uninitialized bytes against some expected value, if there is the '\0' terminator earlier, it won't compare equal, which is all the program
cares about.
Now, valgrind must have some workaround for this, because it doesn't trigger valgrind diagnostics (e.g. when compiled by gcc 9, 10, 11, 12).
The generated code changed again with https://gcc.gnu.org/r13-1607 and instead of doing the 128-bit comparison in general purpose registers
(i.e. that movq 8(%rsp), %rdx; movq (%rsp), %rax; xorq %r12, %rdx; xorq %r13, %rax; orq %rax, %rdx; jne .L8 it now emits
       movdqa  (%rsp), %xmm0
       pxor    .LC2(%rip), %xmm0
       ptest   %xmm0, %xmm0
       je      .L17
i.e. uses vector instructions to do the equivalent.  Or with -mavx2 it could do e.g.
        vmovdqa (%rsp), %xmm0
        vpxor   .LC2(%rip), %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        je      .L18
).  So, I think valgrind needs to repeat the hack it does for gpr comparisons for such vector code too.

Comment 3 Mark Wielaard 2024-01-22 14:21:40 UTC

I have to do some more analysis to understand what is really going on. memcheck has --expensive-definedness-checks which are normally enabled, which turns the VEX CmpEQ/CmpNE into expensive, but accurate checks (see memcheck/mc_translate.c expensiveCmpEQorNE for an explanation). These are only defined for 8, 16, 32 and 64 byte comparisons. I guess we will have to extend this to 128/256 sizes.

Comment 4 Jakub Jelinek 2024-01-22 14:30:48 UTC

I'm afraid it is even more complicated than that, because I bet pxor/ptest or vpxor/vptest instructions don't invoke any kind of cheap or expensive EQ/NE comparison, rather a vector xor and a test whether all elements of the vector1 & vector2 (in this case both operands are the same) are zero.
Perhaps valgrind can somehow pattern match a {,v}pxor used solely in {,v}ptest as a 128-bit or 256-bit expensive comparison.

Comment 5 Aoife Moloney 2024-02-15 23:09:28 UTC

This bug appears to have been reported against 'rawhide' during the Fedora Linux 40 development cycle.
Changing version to 40.

Comment 6 Julian Seward 2024-03-06 08:50:45 UTC

I looked at this.  Some comments:

(1) I tried to write a standalone test case, but was (and still am) unclear
about whether the "other" string in the comparison should be the same as the
zero-terminated string up to the point of the zero termination, or not.  I'll
attach it in the next comment.  It would be helpful if someone can tell me if
this accurately reflects the scenario as encountered by the gcc-generated
code.

(2) I looked in detail at the IR translation of ptest and of the memcheck
instrumentation, but cannot see why it would produce an uninit-value error.
(This merely means my powers of analysis are inadequate ..)

(3) I made an alternative and easier-to-analyse implementation of ptest.  This
also gets an uninit-value on my test case.

So the critical point is (1), really.

Comment 7 Julian Seward 2024-03-06 08:52:23 UTC

Created attachment 2020339 [details]
Testcase; may be incorrect; see comment 6

Comment 8 Julian Seward 2024-03-06 08:54:10 UTC

Created attachment 2020340 [details]
Alternative impl for `ptest`.  DO NOT LAND.

Comment 9 Jakub Jelinek 2024-03-06 09:14:27 UTC

I think it doesn't really matter if there is a match in all the characters before '\0' or not, that changes just the outcome of the function, not whether there is a false positive valgrind diagnostics on it or not.
What matters is that the '\0' in one of the strings appears anywhere but on the last character in the 16-byte (or 32-byte) chunk that is loaded together.
And GCC performs this optimization only if it knows it is ok to load the extra bytes, just that their value doesn't matter for the outcome of the function (such as when it is allocated on the stack and it knows how large the object in which it appears is, or guess in theory it could do the same with heap allocated memory too).
If there is a difference, it will then deal with the individual bytes somehow, but that can necessarily happen only on the last chunk that is being processed, all the others will be equal and so that is what it optimizes for.

Comment 10 Julian Seward 2024-03-06 11:24:07 UTC

I have a fix for the 128-bit case now.  Q: do I need to also handle the
equivalent 256-bit case [regarding gcc support] ?

The problem seems to be -- at least in part -- that ptest's computation of the
C flag "pollutes" %rflags, even though we don't care about it for the
following jz.  I traced that to V not knowing that

  AndV128(t22,XorV128(t22,V128{0xFFFF})) == V128){0x0000}

Folding that out in ir_opt.c improves matters.

Comment 11 Mark Wielaard 2024-03-06 21:47:49 UTC

(In reply to Julian Seward from comment #10)
> I have a fix for the 128-bit case now.  Q: do I need to also handle the
> equivalent 256-bit case [regarding gcc support] ?

x86-64-v3 does imply AVX/AVX2, which does support the 256-bit case using ymm registers.
But I haven't been able to generate a 256-bit case with -march=x86-64-v3 on gcc 13.
But maybe it changed with gcc 14?
Jakub, do you know if gcc 14 will try to use the 256 variant?

Comment 12 Jakub Jelinek 2024-03-07 11:27:21 UTC

Consider
__attribute__((noipa, noinline, noclone)) void
foo (char *p)
{
  *p = '\0';
}

__attribute__((noipa, noinline, noclone)) int
bar (void)
{
  char buf[4096];
  foo (buf);
  return __builtin_strcmp (buf, "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz") == 0;
}

__attribute__((noipa, noinline, noclone)) int
baz (void)
{
  char buf[4096];
  foo (buf);
  return __builtin_strcmp (buf, "abcdefghijklmnopqrstuvwxyz") == 0;
}

int
main ()
{
  bar ();
  baz ();
}
With just -O2 -msse4 this emits 128-bit ptest in baz and strcmp call in bar, with -O2 -mavx the same but with VEX encoded 128-bit vector instructions.
With e.g. -O2 -mavx2 -mprefer-vector-width=256 or -O2 -mavx2 -mtune=znver2 it emits 256-bit vptest in bar and 128-bit in baz.
With e.g. -O2 -mavx512f or -O2 -march=znver4 it emits 512-bit vectors:
        vmovdqu8        (%rsp), %zmm0
        vpcmpd  $0, .LC1(%rip), %zmm0, %k0
        kortestw        %k0, %k0
        jnc     .L4
The 256-bit vector case is
        vmovdqa .LC1(%rip), %ymm0
        vpxor   (%rsp), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L4

Comment 13 Mark Wielaard 2024-03-07 11:57:46 UTC

So with the current gcc defaults, even when someone uses -mavx, -mavx2 or -march=x86-64-v3 we should get away with just supporting 128bit.

But as soon as someone tweaks/tunes that using -mprefer-vector-width=256 or -mtune=znver2 we will start seeing 256bit vectors.
Or even 512bit, but then valgrind still doesn't support those.

Comment 14 Julian Seward 2024-03-10 08:47:08 UTC

Created attachment 2020933 [details]
a proper fix

Here is what I hope is a fairly comprehensive fix for the problem.

* amd64 front end: redo the translation into IR for PTEST, so as to use only
  IROps which we know Memcheck can do exact instrumentation for.  Handling for
  both the 128- and 256-bit cases is has been changed.

* ir_opt.c: add some constant folding rules to support the above.  In
  particular, for the case `ptest %reg, %reg` (the same reg twice), we want
  rflags.C to be set to a defined-1 even if %reg is completely undefined.
  Doing that requires folding `x and not(x)` to zero when x has type V128 or
  V256.

* memcheck/tests/amd64/rh2257546_{128,256}.c: new test cases

This seems to work on amd64.  I haven't tested on any other targets.  It might
be worth testing them a bit carefully because the ir_opt.c changes could
possibly change the IR presented to backends on other targets, although I
doubt that that will be a problem.

Comment 15 Mark Wielaard 2024-03-11 21:36:15 UTC

Created a new package with the patch from comment #14.
valgrind-3.22.0-7.fc41 (rawhide)
https://koji.fedoraproject.org/koji/buildinfo?buildID=2418600

Test results on aarch64, i686, ppc64le, s390x and x86_64 look good.

Now waiting for the eln koji builder to pick this up.

Comment 16 Mark Wielaard 2024-03-11 22:13:04 UTC

That was quick. valgrind-3.22.0-7.eln136
https://koji.fedoraproject.org/koji/buildinfo?buildID=2418628

Yaakov, would it be possible to rebuild jq with the valgrind testcase enabled so we can see if it fixes things?

Comment 17 Yaakov Selkowitz 2024-03-11 22:56:17 UTC

There is still one test failure: https://koji.fedoraproject.org/koji/taskinfo?taskID=114819276

Comment 18 Mark Wielaard 2024-03-11 23:42:49 UTC

(In reply to Yaakov Selkowitz from comment #17)
> There is still one test failure:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=114819276

Thanks. I see FAIL: tests/shtest in the build.log, but not the test-suite.log with the valgrind output.
Do you happen to have the test-suite.log available?

Comment 19 Yaakov Selkowitz 2024-03-12 00:27:32 UTC

+ valgrind --error-exitcode=1 --leak-check=full --suppressions=/builddir/build/BUILD/jq-jq-1.7.1/tests/onig.supp --suppressions=/builddir/build/BUILD/jq-jq-1.7.1/tests/local.supp /builddir/build/BUILD/jq-jq-1.7.1/jq -b -n --jsonargs null invalid
==572251== Memcheck, a memory error detector
==572251== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==572251== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==572251== Command: /builddir/build/BUILD/jq-jq-1.7.1/jq -b -n --jsonargs null invalid
==572251== 
/builddir/build/BUILD/jq-jq-1.7.1/jq: invalid JSON text passed to --jsonargs
Use /builddir/build/BUILD/jq-jq-1.7.1/jq --help for help with command-line options,
or see the jq manpage, or online docs  at https://jqlang.github.io/jq
==572251== 
==572251== HEAP SUMMARY:
==572251==     in use at exit: 5,613 bytes in 7 blocks
==572251==   total heap usage: 212 allocs, 205 frees, 26,937 bytes allocated
==572251== 
==572251== 117 (24 direct, 93 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 7
==572251==    at 0x4841827: malloc (vg_replace_malloc.c:442)
==572251==    by 0x486E5FC: jv_mem_alloc (jv_alloc.c:141)
==572251==    by 0x486E62B: jv_invalid_with_msg (jv.c:150)
==572251==    by 0x4883066: jv_parse_sized_custom_flags (jv_parse.c:893)
==572251==    by 0x10AA9A: main (main.c:371)
==572251== 
==572251== LEAK SUMMARY:
==572251==    definitely lost: 24 bytes in 1 blocks
==572251==    indirectly lost: 93 bytes in 1 blocks
==572251==      possibly lost: 0 bytes in 0 blocks
==572251==    still reachable: 5,496 bytes in 5 blocks
==572251==         suppressed: 0 bytes in 0 blocks
==572251== Reachable blocks (those to which a pointer was found) are not shown.
==572251== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==572251== 
==572251== For lists of detected and suppressed errors, rerun with: -s
==572251== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
+ EC=1

Comment 20 Mark Wielaard 2024-03-12 10:44:06 UTC

Interesting, no idea why that fails on x86_64 only. Will try to investigate, but it seems unrelated to the partial uninitialised vector issue.

Comment 21 Mark Wielaard 2024-03-13 17:08:36 UTC

(In reply to Mark Wielaard from comment #20)
> Interesting, no idea why that fails on x86_64 only. Will try to investigate,
> but it seems unrelated to the partial uninitialised vector issue.

Bit of a mystery. It replicates in an fedora-eln-x86_64 mockbuild. There you get the leak, so --error-exitcode=1 gives you an exit code of 1 instead of the expected one of 2. Making the test fail.
But I cannot replicate on fedora 39. Odd.

Comment 22 Mark Wielaard 2024-03-13 17:26:11 UTC

The leak is real. The following patch makes the test pass even on eln:

$ cat jq-jq-1.7.1-invalid-JSON-jv_free.patch 
--- jq-jq-1.7.1/src/main.c.orig	2024-03-13 18:16:26.526453278 +0100
+++ jq-jq-1.7.1/src/main.c	2024-03-13 18:18:14.163956313 +0100
@@ -371,6 +371,7 @@
         jv v =  jv_parse(argv[i]);
         if (!jv_is_valid(v)) {
           fprintf(stderr, "%s: invalid JSON text passed to --jsonargs\n", progname);
+          jv_free(v);
           die();
         }
         ARGS = jv_array_append(ARGS, v);

Comment 23 Fedora Update System 2024-03-19 00:08:31 UTC

FEDORA-2024-bd86296829 (valgrind-3.22.0-7.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-bd86296829

Comment 24 Fedora Update System 2024-03-19 01:55:22 UTC

FEDORA-2024-bd86296829 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-bd86296829`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-bd86296829

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 25 Fedora Update System 2024-03-23 00:41:21 UTC

FEDORA-2024-bd86296829 (valgrind-3.22.0-7.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.