Created attachment 1477022 [details]
fflas-ffpack spec file that uses openblas instead of atlas

Description of problem:
I have been asked to switch packages I maintain from atlas to openblas. However, I ran into trouble with the fflas-ffpack package. The testsuite segfaults on s390x. See https://koji.fedoraproject.org/koji/taskinfo?taskID=29190683 for a scratch build demonstrating the problem. All tests pass on other architectures, and they pass on s390x with both atlas and the reference blas implementation.

Version-Release number of selected component (if applicable):
openblas-0.3.2-2.fc30

How reproducible:
Always

Steps to Reproduce:
1. fedpkg clone fflas-ffpack
2. Replace fflas-ffpack.spec with the attached version, which uses openblas
3. Build for s390x

Actual results:
Segfaults in the testsuite

Expected results:
Passing testsuite

Additional info:
Recommend following https://fedoraproject.org/wiki/Architectures/s390x#Shell_access_for_debugging to obtain the core file.
https://fedoraproject.org/wiki/Architectures/s390x#Notes_for_application_developers_and_package_maintainers suggests increasing stack size to fix SIGSEGV when running pcre tests. Indeed, the workaround is still there: https://src.fedoraproject.org/rpms/pcre/blob/master/f/pcre.spec#_170 It's worth a try.
Our public s390x guest is out of service at the moment, but it would be interesting to see what happens when openblas-0.3.2-1.fc29 is used (it incorrectly used a z13-based kernel). Adding to my to-do list ...
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> https://fedoraproject.org/wiki/Architectures/s390x#Notes_for_application_developers_and_package_maintainers
> suggests increasing stack size to fix SIGSEGV when running pcre tests.
> Indeed, the workaround is still there:
> https://src.fedoraproject.org/rpms/pcre/blob/master/f/pcre.spec#_170
>
> It's worth a try.

Sadly, no, increasing the stack limit did not change the outcome: https://koji.fedoraproject.org/koji/taskinfo?taskID=29209212
Building locally with mock --forcearch s390x, I see this for one of the test programs that dumped core:

[mockbuild@3dd16e3c6a964f7f8f24f6d548e7b042 tests]$ ./test-ftrsm
Checking with Modular<double> mod 523427
terminate called after throwing an instance of 'FailureTrsmCheck'
Aborted (core dumped)

So it isn't a segfault; it's an abort because something computed the wrong value and nothing caught the resulting exception. Sorry for erroneously calling the failure a segfault.

Looking through the test results, I see several test failures. Some result in a core file and some don't. The bottom line is that openblas is computing values that the fflas-ffpack test suite considers incorrect. I need to see what the nonmatching values are to diagnose the problem. Sadly, I have now hit the limits of mock --forcearch:

(gdb) run
Starting program: /builddir/build/BUILD/fflas_ffpack-2.3.2/tests/test-ftrsm
qemu: Unsupported syscall: 26
warning: Could not trace the inferior process.
Error:
warning: ptrace: Function not implemented
During startup program exited with code 127.

Dan, is there any chance that public s390x guest might come back? If not, I would appreciate any help those with access to s390x hardware can give.

Footnotes:
[1] Almost. With ATLAS, I had to disable the test-lu and test-echelon tests on ppc64 and ppc64le, because of bug 1410633. With openblas, those tests pass on ppc64 and ppc64le, but the following tests fail on s390x:
- test-ftrtri
- test-ftrmv
- test-ftrsm (aborts)
- test-ftrsm-check (aborts)
- test-ftrmm
- test-pluq-check (aborts)
- test-fsytrf
- test-invert-check (aborts)
- test-det-check
- test-echelon

So to get passing tests, I should build with openblas on all arches except s390x, and use atlas on s390x. But that's ugly and horrible and I don't want to do it if there is any chance at all that the problem with openblas + s390x can be identified and fixed.
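For anyone triaging the list above, here is a rough Python sketch of how the "aborts" versus plain failures could be told apart automatically. It is not part of the fflas-ffpack test harness; the function names classify and run_tests are made up for illustration, and it only relies on the documented behaviour of subprocess, which reports a process killed by a signal as a negative return code.

```python
import signal
import subprocess


def classify(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode == 0:
        return "pass"
    if returncode < 0:
        # subprocess reports "killed by signal N" as the return code -N,
        # so an uncaught C++ exception (abort) shows up as -SIGABRT.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"fail (exit {returncode})"


def run_tests(paths):
    """Run each test binary (hypothetical paths) and classify its result."""
    results = {}
    for path in paths:
        completed = subprocess.run([path], capture_output=True)
        results[path] = classify(completed.returncode)
    return results
```

Something like run_tests(["./test-ftrsm", "./test-ftrmm"]) would then separate the programs that die with SIGABRT from those that merely exit nonzero.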
I commented out the floating point tests for the failing test programs, and sure enough, the integer tests pass. I thought I would try rebuilding openblas with -ffloat-store or maybe -ffp-contract=off, so I grabbed the SRPM, inserted that everywhere that %{optflags} appears and kicked off an s390x mock build.

I am seeing a large number of files compiled without %{optflags}. This is probably due to lines 394 through 400 of the spec file:

%if 0%{?rhel} == 5
# Gfortran too old to recognize -frecursive
COMMON="%{optflags} -fPIC"
FCOMMON="%{optflags} -fPIC"
%else
FCOMMON="%{optflags} -fPIC -frecursive"
%endif

Notice the lack of a COMMON definition in the second case. Could that cause this issue?
Created attachment 1478543 [details]
test suite log file

And this is the info from some of the aborts.

[sharkcz@devel10 fflas-ffpack]$ coredumpctl info 45560
PID: 45560 (test-invert-che)
UID: 1000 (sharkcz)
GID: 1012 (sharkcz)
Signal: 6 (ABRT)
Timestamp: Fri 2018-08-24 09:15:27 EDT (13min ago)
Command Line: ./test-invert-check
Executable: /home/sharkcz/fflas-ffpack/fflas_ffpack-2.3.2/tests/test-invert-check
Control Group: /user.slice/user-1000.slice/session-59.scope
Unit: session-59.scope
Slice: user-1000.slice
Session: 59
Owner UID: 1000 (sharkcz)
Boot ID: cad3ea6c02cb4ef7aa5c17cbc3bae66f
Machine ID: 9f494311b8fe4625a05e6f0acd9c4b3f
Hostname: devel10.s390.bos.redhat.com
Storage: /var/lib/systemd/coredump/core.test-invert-che.1000.cad3ea6c02cb4ef7aa5c17cbc3bae66f.45560.1535116527000000.lz4
Message: Process 45560 (test-invert-che) of user 1000 dumped core.

Stack trace of thread 45560:
#0 0x0000020032cbe454 raise (libc.so.6)
#1 0x0000020032ca3ce8 abort (libc.so.6)
#2 0x00000200328ab150 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6)
#3 0x00000200328a8a5e n/a (libstdc++.so.6)
#4 0x00000200328a8ac0 _ZSt9terminatev (libstdc++.so.6)
#5 0x00000200328a8d96 __cxa_throw (libstdc++.so.6)
#6 0x000002aa284491d2 _ZNK6FFPACK18CheckerImplem_PLUQIN6Givaro7ModularIddEEE5checkEPKdmN5FFLAS10FFLAS_DIAGEmPmS9_ (test-invert-check)
#7 0x000002aa2844967e _ZN6FFPACK9Protected11GaussJordanIN6Givaro7ModularIddEEEEmRKT_mmNS5_11Element_ptrEmmmmPmS9_NS_13FFPACK_LU_TAGE (test-invert-check)
#8 0x000002aa2844a396 _ZN6FFPACK21ReducedRowEchelonFormIN6Givaro7ModularIddEEEEmRKT_mmNS4_11Element_ptrEmPmS8_bNS_13FFPACK_LU_TAGE (test-invert-check)
#9 0x000002aa28409e84 main (test-invert-check)
#10 0x0000020032ca4172 __libc_start_main (libc.so.6)
#11 0x000002aa2840a204 _start (test-invert-check)

[sharkcz@devel10 fflas-ffpack]$ coredumpctl info 45367
PID: 45367 (test-pluq-check)
UID: 1000 (sharkcz)
GID: 1012 (sharkcz)
Signal: 6 (ABRT)
Timestamp: Fri 2018-08-24 09:14:30 EDT (20min ago)
Command Line: ./test-pluq-check
Executable: /home/sharkcz/fflas-ffpack/fflas_ffpack-2.3.2/tests/test-pluq-check
Control Group: /user.slice/user-1000.slice/session-59.scope
Unit: session-59.scope
Slice: user-1000.slice
Session: 59
Owner UID: 1000 (sharkcz)
Boot ID: cad3ea6c02cb4ef7aa5c17cbc3bae66f
Machine ID: 9f494311b8fe4625a05e6f0acd9c4b3f
Hostname: devel10.s390.bos.redhat.com
Storage: /var/lib/systemd/coredump/core.test-pluq-check.1000.cad3ea6c02cb4ef7aa5c17cbc3bae66f.45367.1535116470000000.lz4 (inaccessible)
Message: Process 45367 (test-pluq-check) of user 1000 dumped core.

Stack trace of thread 45367:
#0 0x00000200083be454 raise (libc.so.6)
#1 0x00000200083a3ce8 abort (libc.so.6)
#2 0x0000020007fab150 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6)
#3 0x0000020007fa8a5e n/a (libstdc++.so.6)
#4 0x0000020007fa8ac0 _ZSt9terminatev (libstdc++.so.6)
#5 0x0000020007fa8d96 __cxa_throw (libstdc++.so.6)
#6 0x000002aa0a84542e _ZN5FFLAS19CheckerImplem_ftrsmIN6Givaro7ModularIddEEE5checkENS_10FFLAS_SIDEENS_10FFLAS_UPLOENS_15FFLAS_TRANSPOSEENS_10FFLAS_DIAGEmmPKdmSA_m (test-pluq>
#7 0x000002aa0a845712 _ZN6FFPACK5_PLUQIN6Givaro7ModularIddEEEEmRKT_N5FFLAS10FFLAS_DIAGEmmNS4_11Element_ptrEmPmSA_m (test-pluq-check)
#8 0x000002aa0a809994 _ZN6FFPACK4PLUQIN6Givaro7ModularIddEEEEmRKT_N5FFLAS10FFLAS_DIAGEmmNS4_11Element_ptrEmPmSA_m (test-pluq-check)
#9 0x00000200083a4172 __libc_start_main (libc.so.6)
#10 0x000002aa0a80ab94 _start (test-pluq-check)

Stack trace of thread 45370:
#0 0x000002000821b2f8 n/a (libgomp.so.1)
#1 0x00000200082188e2 n/a (libgomp.so.1)
#2 0x00000200083080fe start_thread (libpthread.so.0)
#3 0x0000020008479f96 thread_start (libc.so.6)

Stack trace of thread 45368:
#0 0x000002000821b2f8 n/a (libgomp.so.1)
#1 0x00000200082188e2 n/a (libgomp.so.1)
#2 0x00000200083080fe start_thread (libpthread.so.0)
#3 0x0000020008479f96 thread_start (libc.so.6)

Stack trace of thread 45369:
#0 0x000002000821b2f8 n/a (libgomp.so.1)
#1 0x00000200082188e2 n/a (libgomp.so.1)
#2 0x00000200083080fe start_thread (libpthread.so.0)
#3 0x0000020008479f96 thread_start (libc.so.6)
(In reply to Jerry James from comment #5)
>
> Dan, is there any chance that public s390x guest might come back? If not, I
> would appreciate any help those with access to s390x hardware can give.

The plan from the Marist people is/was to have the hypervisor ready again today, so there is hope it won't take long to have the public guest back.
backtrace from gdb for the "45560" abort

(gdb) where
#0 0x0000020032cbe454 in raise () from /lib64/libc.so.6
#1 0x0000020032ca3ce8 in abort () from /lib64/libc.so.6
#2 0x00000200328ab150 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00000200328a8a5e in ?? () from /lib64/libstdc++.so.6
#4 0x00000200328a8ac0 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00000200328a8d96 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x000002aa284491d2 in FFPACK::CheckerImplem_PLUQ<Givaro::Modular<double, double> >::check (Q=0x2aa605c3920, P=0x2aa605c3330, r=189, Diag=FFLAS::FflasUnit, lda=378, A=0x2003309a010, this=<synthetic pointer>)
    at ../fflas-ffpack/utils/fflas_memory.h:90
#7 FFPACK::PLUQ<Givaro::Modular<double, double> > (BCThreshold=256, Q=0x2aa605c3920, P=0x2aa605c3330, lda=<optimized out>, A=0x2003309a010, N=189, M=189, Diag=FFLAS::FflasUnit, Fi=...)
    at ../fflas-ffpack/ffpack/ffpack_pluq.inl:662
#8 FFPACK::RowEchelonForm<Givaro::Modular<double, double> > (LuTag=FFPACK::FfpackTileRecursive, transform=true, Qt=0x2aa605c3920, P=0x2aa605c3330, lda=<optimized out>, A=0x2003309a010, N=189, M=189, F=...)
    at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:67
#9 FFPACK::ReducedRowEchelonForm<Givaro::Modular<double, double> > (F=..., M=189, N=189, A=0x2003309a010, lda=<optimized out>, P=0x2aa605c3330, Qt=0x2aa605c3920, transform=true, LuTag=FFPACK::FfpackTileRecursive)
    at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:121
#10 0x000002aa2844967e in FFPACK::Protected::GaussJordan<Givaro::Modular<double, double> > (F=..., M=189, N=189, A=0x2003309a010, lda=378, colbeg=0, rowbeg=0, colsize=189, P=0x2aa605c3330, Q=0x2aa605c3920, LuTag=FFPACK::FfpackGaussJordanTile)
    at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:144
#11 0x000002aa2844a396 in FFPACK::ReducedRowEchelonForm<Givaro::Modular<double, double> > (LuTag=FFPACK::FfpackGaussJordanTile, transform=true, Qt=0x2aa605c3920, P=0x2aa605c3330, lda=378, A=0x2003309a010, N=<optimized out>, M=<optimized out>, F=...)
    at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:111
#12 FFPACK::Invert<Givaro::Modular<double, double> > (F=..., M=<optimized out>, A=0x2003309a010, lda=378, nullity=@0x3ffcb87dfb4: 833249088)
    at ../fflas-ffpack/ffpack/ffpack_invert.inl:51
#13 0x000002aa28409e84 in main (argc=<optimized out>, argv=<optimized out>) at test-invert-check.C:80
(In reply to Jerry James from comment #6)
> I am seeing a large number of files compiled without %{optflags}. This is
> probably due to lines 394 through 400 of the spec file:
>
> %if 0%{?rhel} == 5
> # Gfortran too old to recognize -frecursive
> COMMON="%{optflags} -fPIC"
> FCOMMON="%{optflags} -fPIC"
> %else
> FCOMMON="%{optflags} -fPIC -frecursive"
> %endif
>
> Notice the lack of a COMMON definition in the second case. Could that cause
> this issue?

Good catch.
Unfortunately, it didn't cause this issue. So here is what I am noticing now. Take a look in build.log for the latest build. On s390x only, no other architecture, the test suite reports test failures, over 100 failures, in fact. So there are two bugs here:
(1) test failures don't cause %check to fail the build; and
(2) tests are failing on s390x.

Here is an example test failure:

SGEMM PASSED THE TESTS OF ERROR-EXITS
SGEMM PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)
SSYMM PASSED THE TESTS OF ERROR-EXITS
SSYMM PASSED THE COMPUTATIONAL TESTS ( 1296 CALLS)
STRMM PASSED THE TESTS OF ERROR-EXITS
******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
        EXPECTED RESULT   COMPUTED RESULT
    1      0.186813          0.373626
******* STRMM FAILED ON CALL NUMBER:
506: STRMM ('L','U','N','U', 1, 1, 1.0, A, 2, B, 2) .

This strongly suggests to me that the fflas-ffpack test suite is right: openblas is computing incorrect results on s390x. Looking through the failures, I see something interesting: the computed result is exactly two times the expected result in every failure I have looked at so far. There is probably an off-by-one bit shift error somewhere in the s390x support code.

Might I also suggest the use of %ldconfig_scriptlets in place of explicit invocations of ldconfig?
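To check the "exactly two times" observation across a whole build.log rather than by eye, something along these lines might help. This is only a sketch: it assumes every mismatch is printed as an index followed by the expected and computed values on one line, as in the STRMM example, and the names count_failures and ratios are invented here. The failure count could also be used to make %check exit nonzero.

```python
import re

# One table row from the BLAS test output: index, expected value, computed value.
# Assumption: all mismatches are printed in this three-column form.
ROW = re.compile(
    r"^\s*\d+\s+(-?\d+\.\d+(?:[eE][-+]?\d+)?)\s+(-?\d+\.\d+(?:[eE][-+]?\d+)?)\s*$"
)


def count_failures(log_text: str) -> int:
    """Count routines that the test driver reported as failed."""
    return log_text.count("FAILED ON CALL NUMBER")


def ratios(log_text: str) -> list:
    """Return computed/expected for each mismatch row with a nonzero expected value."""
    found = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            expected, computed = float(match.group(1)), float(match.group(2))
            if expected != 0.0:
                found.append(computed / expected)
    return found


sample = """\
 STRMM PASSED THE TESTS OF ERROR-EXITS
 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
         EXPECTED RESULT   COMPUTED RESULT
     1      0.186813          0.373626
 ******* STRMM FAILED ON CALL NUMBER:
"""
print(count_failures(sample), ratios(sample))
```

If every ratio in the real log comes out as 2.0, that would support the doubling theory; any other values would point to more than one problem.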
Sorry, I was imprecise: I'm talking about the openblas build.log, and test failures in the openblas test suite.
I also notice a lot of warnings like this:

BUILDSTDERR: xerbla.c: In function 'cblas_xerbla':
BUILDSTDERR: xerbla.c:16:35: warning: format '%d' expects argument of type 'int', but argument 3 has type 'blasint' {aka 'long int'} [-Wformat=]
BUILDSTDERR: fprintf(stderr, "Parameter %d to routine %s was incorrect\n", p, rout);
BUILDSTDERR: ~^ ~
BUILDSTDERR: %ld

That means that fprintf is only accessing 32 bits of the 64-bit value passed to it. On little endian architectures, you can often get away with this, as the upper 32 bits are often zero, and you fortuitously get the lower 32 bits. On a big endian architecture like s390x, though, you get the upper 32 bits (which are often zero). For error messages, maybe we don't care, but there may be non-error messages in the code base where this does matter. These warnings should be fixed (e.g., by specifying %ld and casting the argument to long, in case blasint is shorter than a long on some architectures).
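The byte-order point can be illustrated with a small Python model. It only mimics reading the first four bytes of an eight-byte integer as it would sit in memory; how a mismatched varargs argument is really fetched depends on the platform ABI, so treat this as an analogy rather than a reproduction of the fprintf behaviour. The helper name first_four_bytes_as_int is made up for the example.

```python
import struct


def first_four_bytes_as_int(value: int, byteorder: str) -> int:
    """Store value as a 64-bit integer, then read back only its first 4 bytes."""
    fmt = "<q" if byteorder == "little" else ">q"
    raw = struct.pack(fmt, value)  # 8 bytes in the chosen byte order
    return int.from_bytes(raw[:4], byteorder, signed=True)


# Parameter number 3, stored as a 64-bit blasint:
print(first_four_bytes_as_int(3, "little"))  # low half comes first
print(first_four_bytes_as_int(3, "big"))     # high half comes first
```

In this model the little endian read still yields 3, while the big endian read yields 0, which is the kind of silent difference described above.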
(In reply to Jerry James from comment #11)
> Unfortunately, it didn't cause this issue. So here is what I am noticing
> now. Take a look in build.log for the latest build. On s390x only, no
> other architecture, the test suite reports test failures, over 100 failures,
> in fact. So there are two bugs here:
> (1) test failures don't cause %check to fail the build; and
> (2) tests are failing on s390x.

Yay... Would you mind reporting the issues to OpenBLAS upstream? You appear to know much more about the problem than I.
Reported upstream. I don't speak Fortran or s390x assembly, and I don't have access to any real s390x systems at the moment, so I probably won't be much help debugging this.
(In reply to Dan Horák from comment #8)
> (In reply to Jerry James from comment #5)
> >
> > Dan, is there any chance that public s390x guest might come back? If not, I
> > would appreciate any help those with access to s390x hardware can give.
>
> The plan from the Marist people is/was to have the hypervisor ready again
> today, so there is hope it won't take long to have the public guest back.

And it is up again. Beware it's a z13 machine, so openblas needs to be built with TARGET= to enable the generic backend to match the HW Fedora supports (zEC12 and newer)
(In reply to Dan Horák from comment #16)
> And it is up again. Beware it's a z13 machine, so openblas needs to be built
> with TARGET= to enable the generic backend to match the HW Fedora supports
> (zEC12 and newer)

That's already been done.
(In reply to Susi Lehtola from comment #17)
> (In reply to Dan Horák from comment #16)
> > And it is up again. Beware it's a z13 machine, so openblas needs to be built
> > with TARGET= to enable the generic backend to match the HW Fedora supports
> > (zEC12 and newer)
>
> That's already been done.

Right, but it's needed when building openblas from sources directly on the public guest.
It looks like it's fixed by https://github.com/martin-frbg/OpenBLAS/commit/f3fd44a731c1997b1d79d4d16abc25d78dce88a7 and the fix will be included in 0.3.3.
(In reply to Dominik 'Rathann' Mierzejewski from comment #19)
> It looks like it's fixed by
> https://github.com/martin-frbg/OpenBLAS/commit/f3fd44a731c1997b1d79d4d16abc25d78dce88a7
> and the fix will be included in 0.3.3.

Dan's already building fixed packages: https://koji.fedoraproject.org/koji/buildinfo?buildID=1140405
I'm building openblas-0.3.2-5.fc30, which includes the fix, right now. We should know in a while whether it fixes some of the issues appearing on s390x.
And all fflas-ffpack tests pass now that the new openblas build is in the buildroot. Going to test some other packages too.
With the current rawhide buildroot, I see only FAIL: test-fgemm on aarch64 when using the spec file from the attachment: https://koji.fedoraproject.org/koji/taskinfo?taskID=29919285
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle. Changing version to '31'.
I guess this can be closed, Susi?
I think so.