Bug 2035807

Summary: Valgrind crashes with illegal instruction error on s390x when trying to build snapd package for EPEL 9
Product: Red Hat Enterprise Linux 9
Component: valgrind
valgrind sub component: system-version
Version: CentOS Stream
Hardware: s390x
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Keywords: Patch, Triaged, Upstream
Flags: pm-rhel: mirror+
Target Milestone: rc
Target Release: ---
Reporter: Neal Gompa <ngompa13>
Assignee: Mark Wielaard <mjw>
QA Contact: Jesus Checa <jchecahi>
CC: arnez, bstinson, fche, fweimer, jakub, jchecahi, jwboyer, maciek.borzecki, ohudlick, tdawson
Fixed In Version: valgrind-3.18.1-8.el9
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-05-17 12:48:06 UTC
Attachments:
  Mock build log of snapd-2.54.1-1.el9 failing with Valgrind crashing (flags: none)

Description Neal Gompa 2021-12-27 18:39:06 UTC
Created attachment 1847988 [details]
Mock build log of snapd-2.54.1-1.el9 failing with Valgrind crashing

Description of problem:
When trying to build snapd on s390x, Valgrind crashes with an illegal instruction error, causing the tests to fail.

Version-Release number of selected component (if applicable):
1:3.18.1-5.el9

How reproducible:
Always

Steps to Reproduce:
1. Build snapd for EPEL 9 on s390x at the following commit: https://src.fedoraproject.org/rpms/snapd/c/3d93cdbdc3dadedbf46e2ae5b13a358362549462

Actual results:
Valgrind dies with the following error: "==4036598== valgrind: Unrecognised instruction at address 0x48d5ca4."


Expected results:
Valgrind passes as it does on Fedora.

Comment 1 Mark Wielaard 2022-01-03 12:28:14 UTC
I am on vacation this week, back Jan 11. But let's see if we can make some progress anyway.

/usr/bin/valgrind ./libsnap-confine-private/unit-tests
==4036598== Memcheck, a memory error detector
==4036598== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4036598== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==4036598== Command: ./libsnap-confine-private/unit-tests
==4036598== 
# random seed: R02Sbb6363cb17884bef5e524f51f99e4a24
1..138
ok 1 /fault-injection
vex s390->IR: specification exception: E700 0008 40C5
==4036598== valgrind: Unrecognised instruction at address 0x48d5ca4.
==4036598==    at 0x48D5CA4: ??? (in /usr/lib64/libglib-2.0.so.0.6800.4)
==4036598==    by 0x48D9B41: ??? (in /usr/lib64/libglib-2.0.so.0.6800.4)
==4036598==    by 0x48DA107: g_test_run_suite (in /usr/lib64/libglib-2.0.so.0.6800.4)
==4036598==    by 0x48DA147: g_test_run (in /usr/lib64/libglib-2.0.so.0.6800.4)
==4036598==    by 0x10BAD3: UnknownInlinedFun (unit-tests.c:28)
==4036598==    by 0x10BAD3: main (unit-tests-main.c:21)

- Does this only happen on centos9?
  I believe CentOS and Fedora have different base architecture defaults,
  so this could be a difference between z12 vs z13/

- Could you install debuginfo for libglib-2.0.so.0.6800.4
  and/or could you disassemble the binary so we can see the instruction at 0x48d5ca4?

Comment 2 Neal Gompa 2022-01-03 13:09:10 UTC
(In reply to Mark Wielaard from comment #1)
> ...
> - Does this only happen on centos9?
>   I believe CentOS and Fedora have different base architecture defaults,
>   so this could be a difference between z12 vs z13/
> 

It only happens on CentOS Stream 9. Fedora Rawhide is fine still.

Comment 3 Mark Wielaard 2022-01-03 15:27:07 UTC
(In reply to Neal Gompa from comment #2)
> ...
> It only happens on CentOS Stream 9. Fedora Rawhide is fine still.

Thanks. So it likely is a z13 or z14 only issue. But without knowing the actual code that is at address 0x48d5ca4 it is hard to say what is going on without access to a z13/z14 capable s390x machine.

As far as I can tell E700 xxxx xxC5 is a VFLR instruction, which is an arch12 (z14) only instruction.
But it is one that valgrind should implement, so I am not really clear on why it produces a specification exception.

Do you have access to a machine where this fails? Could you disassemble /usr/lib64/libglib-2.0.so.0.6800.4 around address 0x48d5ca4?

Comment 4 Mark Wielaard 2022-01-03 15:36:06 UTC
Also could you make sure that the machine where you are running does actually support the z14 instruction set?
I see you are building with -march=z196 but then using a library which seems to use z14 instructions.
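
On a Linux s390x host, one way to answer the hardware question is to look at the ELF hardware-capability word. The following is only a minimal sketch: it assumes the vector-enhancements facility 1 is reported as bit value 8192 (which, as far as I know, glibc names HWCAP_S390_VXE; please verify against your system headers), and hwcap_has_vxe is a hypothetical helper name, not part of any library.

```c
#include <stdbool.h>

/* Assumed value of the s390 "vector-enhancements facility 1" hwcap bit.
   Check it against HWCAP_S390_VXE in your system headers before relying on it. */
#define ASSUMED_HWCAP_S390_VXE 8192UL

/* Hypothetical helper: true if the given AT_HWCAP word advertises the
   facility that z14 (arch12) vector instructions such as wflrx require. */
bool hwcap_has_vxe(unsigned long hwcap)
{
    return (hwcap & ASSUMED_HWCAP_S390_VXE) != 0;
}
```

On the machine in question this would be called as hwcap_has_vxe(getauxval(AT_HWCAP)) after including <sys/auxv.h>.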

Comment 5 Neal Gompa 2022-01-03 16:13:24 UTC
I don't have access to s390x hardware. I only observe this in the Fedora Koji build system. I'm sorry, I can't provide any more detail. :(

Comment 6 Neal Gompa 2022-01-03 16:14:52 UTC
Here's the Koji build task that failed: https://koji.fedoraproject.org/koji/taskinfo?taskID=80531908

Comment 7 Florian Weimer 2022-01-03 16:47:21 UTC
(In reply to Neal Gompa from comment #6)
> Here's the Koji build task that failed:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=80531908

Thanks. It used glib2-2.68.4-3.el9.s390x.

Unfortunately the valgrind output does not show the relative (in-object) offset, so I have to guess.

$ s390x-linux-gnu-objdump -d --reloc  usr/lib64/libglib-2.0.so.0.6800.4 | grep ca4:.*e7
   7fca4:	e7 00 00 08 40 c5 	wflrx	%v0,%v0,0,0

That's the only hit. m3 is 4, so it's extended format. Extended format does not seem to be implemented:

static const HChar *
s390_irgen_VFLR(UChar v1, UChar v2, UChar m3, UChar m4, UChar m5)
{
   s390_insn_assert("vflr", m3 == 3 || (s390_host_has_vxe && m3 == 2));

   if (m3 == 3)
      s390_vector_fp_convert(Iop_F64toF32, Ity_F64, Ity_F32, True,
                             v1, v2, m3, m4, m5);
   else
      s390_vector_fp_convert(Iop_F128toF64, Ity_F128, Ity_F64, True,
                             v1, v2, m3, m4, m5);

   return "vflr";
}

This was added with Vector-enhancements facility 1 in the 12th edition (arch12, that is, z14), so I think it's valid for RHEL 9 binaries.
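
The "ca4:" guess above works because shared objects are mapped at page-aligned addresses, so the low 12 bits of the runtime address reported by valgrind should match the low 12 bits of the in-object offset. A small sketch of that check, assuming 4 KiB pages and a segment whose file offset and virtual address agree modulo the page size; both helper names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_MASK_4K 0xfffULL  /* assumes 4 KiB pages */

/* True if a runtime address and an in-object offset can refer to the same
   byte of a page-aligned mapping, i.e. their low 12 bits agree. */
bool same_page_offset(uint64_t runtime_addr, uint64_t object_offset)
{
    return (runtime_addr & PAGE_MASK_4K) == (object_offset & PAGE_MASK_4K);
}

/* Load base implied by a runtime address and a candidate in-object offset. */
uint64_t implied_load_base(uint64_t runtime_addr, uint64_t object_offset)
{
    return runtime_addr - object_offset;
}
```

With the values from this report, same_page_offset(0x48d5ca4, 0x7fca4) holds, and the implied load base 0x4856000 is itself page-aligned, which is consistent with the guess.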

Comment 8 Andreas Arnez 2022-01-05 19:29:35 UTC
(In reply to Florian Weimer from comment #7)
> ...
> That's the only hit. m3 is 4, so it's extended format. Extended format does
> not seem to be implemented:
> ...
Actually, it is.  But the code checks for the wrong format code.  Instead of checking for 4, it checks for 2.  This is a typo.

I created a Valgrind Bug for tracking and attached a possible fix:
  https://bugs.kde.org/show_bug.cgi?id=447991
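
To illustrate the typo, here is a hedged sketch that pulls the m3 field out of the failing instruction bytes (e7 00 00 08 40 c5) and compares the check quoted in comment 7 with one that uses format code 4. It assumes the VRR-a layout in which m3 occupies bits 32-35, i.e. the high nibble of the fifth byte. The function names are hypothetical, and this is not the literal patch attached to the Valgrind bug.

```c
#include <stdbool.h>
#include <stdint.h>

/* M3 field of a 6-byte VRR-a instruction: bits 32-35, i.e. the high
   nibble of the fifth byte (layout assumed, see the text above). */
unsigned vrr_m3(const uint8_t insn[6])
{
    return insn[4] >> 4;
}

/* Format check as in the code quoted in comment 7 (accepts m3 == 2). */
bool vflr_m3_ok_old(unsigned m3, bool host_has_vxe)
{
    return m3 == 3 || (host_has_vxe && m3 == 2);
}

/* Check with the format code corrected to 4, as described in this comment.
   Sketch only; not the literal upstream patch. */
bool vflr_m3_ok_fixed(unsigned m3, bool host_has_vxe)
{
    return m3 == 3 || (host_has_vxe && m3 == 4);
}
```

For the bytes above, vrr_m3 yields 4, which the old check rejects (hence the specification exception) and the corrected check accepts on a host with the vector-enhancements facility.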

Comment 9 Mark Wielaard 2022-01-12 15:53:39 UTC
The fix looks good; testing a Fedora Rawhide build with that patch now.

Comment 10 Mark Wielaard 2022-01-12 17:18:36 UTC
Did an rpmbuild --rebuild snapd-2.54.1-1.fc36.src.rpm with the old valgrind-3.18.1-6.el9.s390x.rpm, which replicated the issue.
Then installed the new Fedora Rawhide valgrind-3.18.1-8.fc36.s390x.rpm, which contains the proposed fix, and the rpmbuild succeeded.
All valgrind invocations in the build.log look fine with the patched valgrind.

Comment 11 Jesus Checa 2022-01-14 14:46:22 UTC
The following snippet reproduces the issue in old build valgrind-3.18.1-6.el9:

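/* Reproducer: needs an s390x host with vector-enhancements facility 1 (z14). */
int main(void){
    /* wflrx is VFLR with m3 == 4 (extended format) */
    asm("wflrx %v0,%v0,0,0");
    return 0;
}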

Verified that it doesn't reproduce with new build valgrind-3.18.1-8.el9. wflrx instruction doesn't cause valgrind to raise a SIGILL or report a specification exception anymore.

Comment 16 errata-xmlrpc 2022-05-17 12:48:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: valgrind), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2401