Description of problem:
Some tests in the elpa test suite fail on i686, ppc64le and s390x.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Build current elpa master branch --with check
i686 (openmpi hang)
ppc64le (18/50 FAIL for serial+openmp, 25/73 FAIL for mpich+openmp)
s390x (25/50 FAIL for serial, 39/73 FAIL for mpi)
Is there a koji or copr log someplace?
I glanced at the code upstream and I saw this in test/Fortran/test.F90 (which was updated in the last couple of months):
(it starts at line 233)
#if TEST_GPU == 1
if (nblk .lt. 64) then
if (myid .eq. 0) then
print *,"At the moment QR decomposition need blocksize of at least 64"
if ((na .lt. 64) .and. (myid .eq. 0)) then
print *,"This is why the matrix size must also be at least 64 or only 1 MPI task can be used"
Also, the last committed version of test/shared/test_setup_mpi.F90
has this comment:
"Clean exit if QR is skipped"
Could this be what's causing the failures, or is that block size unrelated to the architecture's word size?
Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs in this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be able to find aarch64, armv7hl and i686 in this scratch build once it completes in about 36 hours (armv7hl is slowest): https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .
On ppc64le, what you described in comment #1 does account for only a few of the failures in each failing case. Do you know why these are not failing on x86_64? Is it because SIMD kernels are used instead of generic ones? But then, why is ppc64 not failing?
On s390x, there are more of these "exit status: 77" errors, but they still don't account for all of the failures.
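To separate the exit-status-77 cases from the rest, something like the following could be run over a build log. This is only a sketch: it relies on the Automake test harness convention that exit status 77 means a skipped test and 99 a hard error, and it assumes each result line in the log contains the text "(exit status: N)". The test names in the sample are hypothetical, not taken from the koji logs.

```python
import re
from collections import Counter

# Automake test harness convention: 0 = pass, 77 = skipped, 99 = hard error,
# anything else = ordinary failure.
def classify(status: int) -> str:
    if status == 0:
        return "PASS"
    if status == 77:
        return "SKIP"
    if status == 99:
        return "ERROR"
    return "FAIL"

# Assumed log format: each result line carries "(exit status: N)".
EXIT_RE = re.compile(r"exit status: (\d+)")

def tally(log_text: str) -> Counter:
    """Count results per category for every exit status found in the log."""
    return Counter(classify(int(m.group(1))) for m in EXIT_RE.finditer(log_text))

# Hypothetical sample, for illustration only.
sample = """\
test_a (exit status: 77)
test_b (exit status: 1)
test_c (exit status: 134)
"""

for category, count in sorted(tally(sample).items()):
    print(category, count)
```

Whatever remains in the FAIL and ERROR buckets after the 77s are set aside is what still needs an explanation.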
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs
> in this scratch build:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be
> able to find aarch64, armv7hl and i686 in this scratch build once it
> completes in about 36 hours (armv7hl is slowest):
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .
Why does it take that long? I've been looking at the i686 build log since yesterday and it seems stuck at the openmpi tests.
> On ppc64le, what you described in comment #1 does account for only a few of
> the failures in each failing case. Do you know why these are not failing on
> x86_64? Is it because SIMD kernels are used instead of generic ones? But
> then, why is ppc64 not failing?
I have no idea. Perhaps someone on devel could answer that. Have you asked upstream by any chance?
That's interesting: the i686 openmpi checks timed out, but they completed on armv7hl. I think upstream should have an explanation for that; I would expect both builds to fail if it were a word-size issue.
New scratch build, since the previous ones were garbage-collected already: https://koji.fedoraproject.org/koji/taskinfo?taskID=23941175 .
OK, so the i686 build gets stuck randomly in various double-precision tests under OpenMPI. Across three local mock -r fedora-rawhide-i386 runs, I saw the following getting stuck, just eating CPU cycles:
For the time being, I'm going to disable testing with OpenMPI on i686. I reported this upstream and they're investigating.
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.
With elpa-2018.05.001, I observe the following failures on rawhide:
s390x: serial+openmp 6
ppc64le: serial+openmp 12, mpich+openmp 8
ppc64: serial+openmp 8, mpich+openmp 5
x86_64: serial+openmp 20, mpich+openmp 12
aarch64: serial+openmp 10, mpich+openmp 8, openmpi+openmp 6
i686: serial+openmp 9, mpich+openmp 8
armv7hl: serial+openmp 11, mpich+openmp 9, openmpi+openmp 8
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.
I can reproduce the hang with gromacs-2018.3 on i686 and armv7hl, but only with OpenMPI. However, if I reorder the test execution from (openmpi, mpich, serial) to (serial, mpich, openmpi), it passes, at least on i686.
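For anyone who wants to experiment with the ordering locally without waiting on a stuck build, here is a rough sketch of a driver that runs the suites in a chosen order and reports a hang as a timeout instead of blocking. The suite commands below are placeholders (small Python one-liners standing in for the real serial, mpich and openmpi test invocations), and the helper names are made up for this example; only `subprocess.run` with its `timeout` argument is real.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'PASS', 'FAIL(<code>)', or 'TIMEOUT' if it runs too long."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else f"FAIL({proc.returncode})"

def run_in_order(suites, order, timeout_s):
    """Run the named suites one after another in the given order."""
    return [(name, run_with_timeout(suites[name], timeout_s)) for name in order]

# Placeholder commands: replace with the real test invocations per MPI flavor.
py = sys.executable
suites = {
    "serial": [py, "-c", "pass"],
    "mpich": [py, "-c", "raise SystemExit(1)"],
    "openmpi": [py, "-c", "import time; time.sleep(60)"],  # simulates a hang
}

results = run_in_order(suites, ["serial", "mpich", "openmpi"], timeout_s=5.0)
for name, outcome in results:
    print(name, outcome)
```

Swapping the order list between (openmpi, mpich, serial) and (serial, mpich, openmpi) would show whether the hang follows the OpenMPI suite or depends on what ran before it.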