Bug 1512229 - elpa testsuite failures
Summary: elpa testsuite failures
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: elpa
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Dominik 'Rathann' Mierzejewski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: PPCTracker x86Tracker
Reported: 2017-11-11 21:49 UTC by Dominik 'Rathann' Mierzejewski
Modified: 2019-12-17 10:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-27 19:47:35 UTC



Description Dominik 'Rathann' Mierzejewski 2017-11-11 21:49:54 UTC
Description of problem:
Some tests in elpa testsuite fail on i686, ppc64le and s390x.

Version-Release number of selected component (if applicable):
2017.05.003-1

How reproducible:
Always.

Steps to Reproduce:
1. Build current elpa master branch --with check

Actual results:
i686    (openmpi hang)
ppc64le (18/50 FAIL for serial+openmp, 25/73 FAIL for mpich+openmp)
s390x   (25/50 FAIL for serial, 39/73 FAIL for mpi)

Expected results:
all PASS

Comment 1 Alexander Ploumistos 2017-11-11 22:40:59 UTC
Is there a koji or copr log someplace?

I glanced at the code upstream and I saw this in test/Fortran/test.F90 (which was updated in the last couple of months):

https://gitlab.mpcdf.mpg.de/elpa/elpa/blob/master/test/Fortran/test.F90

(it starts at line 233)

#ifdef TEST_QR_DECOMPOSITION
#if TEST_GPU == 1
#ifdef WITH_MPI
     call mpi_finalize(mpierr)
#endif
     stop 77
#endif
   if (nblk .lt. 64) then
     if (myid .eq. 0) then
       print *,"At the moment QR decomposition need blocksize of at least 64"
     endif
     if ((na .lt. 64) .and. (myid .eq. 0)) then
       print *,"This is why the matrix size must also be at least 64 or only 1 MPI task can be used"
     endif


Also, the last committed version of test/shared/test_setup_mpi.F90

https://gitlab.mpcdf.mpg.de/elpa/elpa/blob/master/test/shared/test_setup_mpi.F90

has this comment:

"Clean exit if QR is skipped"


Could this be what's causing the failures or is that blocksize completely irrelevant to arch word size?

Comment 2 Dominik 'Rathann' Mierzejewski 2017-11-11 23:12:28 UTC
Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs in this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be able to find aarch64, armv7hl and i686 in this scratch build once it completes in about 36 hours (armv7hl is slowest): https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .

On ppc64le, what you described in comment #1 does account for only a few of the failures in each failing case. Do you know why these are not failing on x86_64? Is it because SIMD kernels are used instead of generic ones? But then, why is ppc64 not failing?

On s390x, there are more of these "exit status: 77" errors, but they still don't account for all failures.
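
For reference when reading the logs: under the Automake test harness convention, an exit status of 0 is a pass, 77 is a skipped test, 99 is a hard error, and anything else is a failure. The `stop 77` in the snippet from comment #1 would therefore normally be reported as SKIP rather than FAIL. A minimal sketch of that mapping follows; the function name `classify_exit_status` is hypothetical, and whether the elpa test wrapper scripts pass the status through unchanged is an assumption.

```python
def classify_exit_status(status: int) -> str:
    """Map a test's exit status to a result label (Automake convention)."""
    if status == 0:
        return "PASS"
    if status == 77:
        # Conventional "test was skipped" status, e.g. from `stop 77`.
        return "SKIP"
    if status == 99:
        # Conventional "hard error" status.
        return "ERROR"
    return "FAIL"


if __name__ == "__main__":
    for status in (0, 1, 77, 99):
        print(status, classify_exit_status(status))
```

If a log shows status 77 counted among the failures, that would suggest the status is being altered or reinterpreted somewhere between the test binary and the harness.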

Comment 3 Alexander Ploumistos 2017-11-12 19:25:01 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs
> in this scratch build:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be
> able to find aarch64, armv7hl and i686 in this scratch build once it
> completes in about 36 hours (armv7hl is slowest):
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .

Why does it take that long? I've been looking at the i686 build log since yesterday and it seems stuck at the openmpi tests. 

> On ppc64le, what you described in comment #1 does account for only a few of
> the failures in each failing case. Do you know why these are not failing on
> x86_64? Is it because SIMD kernels are used instead of generic ones? But
> then, why is ppc64 not failing?

I have no idea. Perhaps someone on devel could answer that. Have you asked upstream by any chance?

Comment 4 Alexander Ploumistos 2017-11-14 13:26:47 UTC
That's interesting: the i686 openmpi checks timed out, but they completed on armv7hl. I think upstream should have an explanation for that; I would expect both builds to fail if it were a word size issue.

Comment 5 Dominik 'Rathann' Mierzejewski 2017-12-29 13:06:45 UTC
New scratch build, since the previous ones were garbage-collected already: https://koji.fedoraproject.org/koji/taskinfo?taskID=23941175 .

Comment 6 Dominik 'Rathann' Mierzejewski 2018-01-11 13:44:15 UTC
OK, so the i686 build gets stuck randomly in various double precision tests under OpenMPI. Across three local mock -r fedora-rawhide-i386 runs, I saw the following getting stuck, just eating CPU cycles:
test_real_double_hermitian_multiply_1stage_all_layouts.sh
test_real_double_eigenvalues_1stage_all_layouts.sh
test_real_double_eigenvectors_2stage_all_kernels_all_layouts.sh

For the time being, I'm going to disable testing with OpenMPI on i686. I reported this upstream and they're investigating.
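
One way to keep a randomly hanging test from stalling a whole local run is to wrap each test in a per-test timeout. The sketch below is only an illustration of that idea, not what the package or upstream does: the helper name `run_with_timeout` and the time limit are hypothetical, and the commands in the demo are stand-ins for the test scripts listed above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return "PASS", "FAIL", or "TIMEOUT"."""
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this.
        return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else "FAIL"


if __name__ == "__main__":
    # Stand-in commands; a real run would pass a test script path instead.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(quick, timeout_s=10))
    print(run_with_timeout(stuck, timeout_s=1))
```

Note that only the direct child is killed on timeout; processes spawned by an MPI launcher may need separate cleanup.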

Comment 7 Fedora End Of Life 2018-02-20 15:34:31 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 8 Dominik 'Rathann' Mierzejewski 2018-07-30 13:58:02 UTC
With elpa-2018.05.001, I observe the following failures on rawhide:
s390x:   serial+openmp  6
ppc64le: serial+openmp 12, mpich+openmp  8
ppc64:   serial+openmp  8, mpich+openmp  5
x86_64:  serial+openmp 20, mpich+openmp 12
aarch64: serial+openmp 10, mpich+openmp  8, openmpi+openmp 6
i686:    serial+openmp  9, mpich+openmp  8
armv7hl: serial+openmp 11, mpich+openmp  9, openmpi+openmp 8

Comment 9 Jan Kurik 2018-08-14 10:16:55 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 10 Christoph Junghans 2018-11-05 21:18:26 UTC
I can reproduce the hang with gromacs-2018.3 on i686 and armv7hl, but only with OpenMPI. However, if I reorder the test execution from (openmpi, mpich, serial) to (serial, mpich, openmpi), it passes, at least on i686.

Comment 11 Ben Cotton 2019-10-31 20:33:41 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Ben Cotton 2019-11-27 19:47:35 UTC
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 13 Dominik 'Rathann' Mierzejewski 2019-12-06 14:25:51 UTC
I tried building 2019.05.002 on rawhide and the good news is, OpenMPI tests are no longer hanging on i686. The bad news is, I got quite a few new failures:

# Test suite failures
# Build time: minutes
# Number of tests: serial - 83, MPI - 127
#         build time serial serial+omp mpich mpich+omp openmpi openmpi+omp
# aarch64         20      3          9     3         7       3           3
# armv7hl        N/A    N/A        N/A   N/A       N/A     N/A         N/A
# i686            24      0          8     0         8       0           0
# ppc64le         13      9         16    10        14      10          10
# s390x           11      0          5     0         2       0           0
# x86_64          23      0         18     0        12       0           0

armv7hl is not building because OpenMPI has broken dependencies (bug 1780584).
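
To make the pattern in the table above easier to see, here is a small sketch that lists, per architecture, which build variants have any failures. The data rows are copied from the table; the function name `failing_variants` and the choice to report `None` for rows without data are assumptions made for illustration.

```python
# Rows copied from the table above: arch, build time, then six failure counts.
TABLE = """\
aarch64         20      3          9     3         7       3           3
armv7hl        N/A    N/A        N/A   N/A       N/A     N/A         N/A
i686            24      0          8     0         8       0           0
ppc64le         13      9         16    10        14      10          10
s390x           11      0          5     0         2       0           0
x86_64          23      0         18     0        12       0           0
"""

COLUMNS = ["serial", "serial+omp", "mpich", "mpich+omp", "openmpi", "openmpi+omp"]


def failing_variants(table=TABLE):
    """Return {arch: [variants with failures]}, or None where no data exists."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        arch, counts = fields[0], fields[2:]  # fields[1] is the build time
        if "N/A" in counts:
            result[arch] = None
            continue
        result[arch] = [col for col, n in zip(COLUMNS, counts) if int(n) > 0]
    return result


if __name__ == "__main__":
    for arch, variants in failing_variants().items():
        print(arch, variants)
```

Read this way, the failures on i686, s390x and x86_64 are confined to the serial+omp and mpich+omp variants, while aarch64 and ppc64le fail in every variant.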

Comment 14 Dominik 'Rathann' Mierzejewski 2019-12-17 10:35:58 UTC
Work in progress with upstream. No more failures on s390x and ARM. 1 failure on x86 (serial+omp) and 9-10 failures on ppc64le remaining.

