Description of problem:
Some tests in elpa testsuite fail on i686, ppc64le and s390x.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Build the current elpa master branch with the --with check option.
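For reference, a minimal sketch of how a --with check conditional typically gates the testsuite in an RPM spec. This is an assumption about the packaging convention, not a quote from the actual elpa spec:

```spec
# Hypothetical spec fragment: testsuite disabled by default,
# enabled with 'rpmbuild --with check' (or 'mock --with check').
%bcond_with check

%check
%if %{with check}
make check
%endif
```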
i686 (openmpi hang)
ppc64le (18/50 FAIL for serial+openmp, 25/73 FAIL for mpich+openmp)
s390x (25/50 FAIL for serial, 39/73 FAIL for mpi)
Is there a Koji or Copr log someplace?
I glanced at the code upstream and I saw this in test/Fortran/test.F90 (which was updated in the last couple of months):
(it starts at line 233)
  #if TEST_GPU == 1
     if (nblk .lt. 64) then
       if (myid .eq. 0) then
         print *,"At the moment QR decomposition need blocksize of at least 64"

     if ((na .lt. 64) .and. (myid .eq. 0)) then
       print *,"This is why the matrix size must also be at least 64 or only 1 MPI task can be used"
Also, the last committed version of test/shared/test_setup_mpi.F90
has this comment:
"Clean exit if QR is skipped"
Could this be what's causing the failures, or is that blocksize completely irrelevant to the architecture word size?
Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs in this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be able to find aarch64, armv7hl and i686 in this scratch build once it completes in about 36 hours (armv7hl is slowest): https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .
On ppc64le, what you described in comment #1 accounts for only a few of the failures in each failing case. Do you know why these are not failing on x86_64? Is it because SIMD kernels are used instead of generic ones? But then, why is ppc64 not failing?
On s390x, there are some more of these "exit status: 77" errors, but they still don't account for all the failures.
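For context on those errors (my reading, not something stated in the elpa sources): Automake's test harness treats exit status 77 as SKIP rather than FAIL, which fits the "Clean exit if QR is skipped" comment. A stand-alone sketch of that convention, with an illustrative blocksize check:

```shell
# Sketch of the Automake skip convention: a test that detects an
# unsupported configuration exits with status 77 to be counted as
# SKIP, not FAIL. The script and the nblk value are illustrative.
cat > skip_demo.sh <<'EOF'
#!/bin/sh
nblk=32   # hypothetical blocksize below the QR minimum of 64
if [ "$nblk" -lt 64 ]; then
  echo "blocksize too small for QR decomposition, skipping"
  exit 77
fi
EOF
sh skip_demo.sh
echo "exit status: $?"
# prints: blocksize too small for QR decomposition, skipping
# prints: exit status: 77
```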
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs
> in this scratch build:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be
> able to find aarch64, armv7hl and i686 in this scratch build once it
> completes in about 36 hours (armv7hl is slowest):
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .
Why does it take that long? I've been looking at the i686 build log since yesterday and it seems stuck at the openmpi tests.
> On ppc64le, what you described in comment #1 does account for only a few of
> the failures in each failing case. Do you know why these are not failing on
> x86_64? Is it because SIMD kernels are used instead of generic ones? But
> then, why is ppc64 not failing?
I have no idea. Perhaps someone on devel could answer that. Have you asked upstream by any chance?
That's interesting: the i686 openmpi checks timed out, but they completed on armv7hl. I think upstream should have an explanation for that; I would expect both builds to fail if it were a word-size issue.
New scratch build, since the previous ones were garbage-collected already: https://koji.fedoraproject.org/koji/taskinfo?taskID=23941175 .
Ok, so the i686 build gets stuck randomly in various double-precision tests under OpenMPI. Across three local mock -r fedora-rawhide-i386 runs, I saw the following tests getting stuck, just eating CPU cycles:
For the time being, I'm going to disable testing with OpenMPI on i686. I reported this upstream and they're investigating.
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.
With elpa-2018.05.001, I observe the following failures on rawhide:
s390x: serial+openmp 6
ppc64le: serial+openmp 12, mpich+openmp 8
ppc64: serial+openmp 8, mpich+openmp 5
x86_64: serial+openmp 20, mpich+openmp 12
aarch64: serial+openmp 10, mpich+openmp 8, openmpi+openmp 6
i686: serial+openmp 9, mpich+openmp 8
armv7hl: serial+openmp 11, mpich+openmp 9, openmpi+openmp 8
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.
I can reproduce the hang with gromacs-2018.3 on i686 and armv7hl, but only with OpenMPI. However, if I reorder the test execution from (openmpi, mpich, serial) to (serial, mpich, openmpi), it passes, at least on i686.
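The reordering can be sketched as a simple loop over the build flavours. The flavour names and the run_suite stand-in below are illustrative, not the actual spec logic:

```shell
# Hypothetical sketch: run the serial flavour first and OpenMPI last,
# so a hang in the OpenMPI suite cannot block the earlier results.
run_suite() {
  # stand-in for 'make check' inside the corresponding build directory
  echo "running testsuite: $1"
}

for flavour in serial mpich openmpi; do
  run_suite "$flavour"
done
# prints: running testsuite: serial
# prints: running testsuite: mpich
# prints: running testsuite: openmpi
```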
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 29 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.
I tried building 2019.05.002 on rawhide and the good news is, OpenMPI tests are no longer hanging on i686. The bad news is, I got quite a few new failures:
# Test suite failures
# Build time: minutes
# Number of tests: serial - 83, MPI - 127
#
#           build time  serial  serial+omp  mpich  mpich+omp  openmpi  openmpi+omp
# aarch64       20         3        9         3        7         3          3
# armv7hl      N/A        N/A      N/A       N/A      N/A       N/A        N/A
# i686          24         0        8         0        8         0          0
# ppc64le       13         9       16        10       14        10         10
# s390x         11         0        5         0        2         0          0
# x86_64        23         0       18         0       12         0          0
armv7hl is not building because OpenMPI has broken dependencies (bug 1780584).
Work in progress with upstream. No more failures on s390x and ARM; 1 failure on x86 (serial+omp) and 9-10 failures on ppc64le remain.