Bug 1512229 - elpa testsuite fails partially on i686, ppc64le and s390x
Summary: elpa testsuite fails partially on i686, ppc64le and s390x
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: elpa
Version: 29
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Dominik 'Rathann' Mierzejewski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: ZedoraTracker PPCTracker x86Tracker
Reported: 2017-11-11 21:49 UTC by Dominik 'Rathann' Mierzejewski
Modified: 2018-11-05 21:18 UTC (History)

Description Dominik 'Rathann' Mierzejewski 2017-11-11 21:49:54 UTC
Description of problem:
Some tests in the elpa testsuite fail on i686, ppc64le and s390x.

Version-Release number of selected component (if applicable):
2017.05.003-1

How reproducible:
Always.

Steps to Reproduce:
1. Build current elpa master branch --with check

Actual results:
i686    (openmpi hang)
ppc64le (18/50 FAIL for serial+openmp, 25/73 FAIL for mpich+openmp)
s390x   (25/50 FAIL for serial, 39/73 FAIL for mpi)

Expected results:
all PASS

Comment 1 Alexander Ploumistos 2017-11-11 22:40:59 UTC
Is there a koji or copr log someplace?

I glanced at the code upstream and I saw this in test/Fortran/test.F90 (which was updated in the last couple of months):

https://gitlab.mpcdf.mpg.de/elpa/elpa/blob/master/test/Fortran/test.F90

(it starts at line 233)

#ifdef TEST_QR_DECOMPOSITION
#if TEST_GPU == 1
#ifdef WITH_MPI
     call mpi_finalize(mpierr)
#endif
     stop 77
#endif
   if (nblk .lt. 64) then
     if (myid .eq. 0) then
       print *,"At the moment QR decomposition need blocksize of at least 64"
     endif
     if ((na .lt. 64) .and. (myid .eq. 0)) then
       print *,"This is why the matrix size must also be at least 64 or only 1 MPI task can be used"
     endif


Also, the last committed version of test/shared/test_setup_mpi.F90

https://gitlab.mpcdf.mpg.de/elpa/elpa/blob/master/test/shared/test_setup_mpi.F90

has this comment:

"Clean exit if QR is skipped"


Could this be what's causing the failures, or is that blocksize completely unrelated to the architecture's word size?

Comment 2 Dominik 'Rathann' Mierzejewski 2017-11-11 23:12:28 UTC
Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs in this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be able to find aarch64, armv7hl and i686 in this scratch build once it completes in about 36 hours (armv7hl is slowest): https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .

On ppc64le, what you described in comment #1 accounts for only a few of the failures in each failing case. Do you know why these are not failing on x86_64? Is it because SIMD kernels are used instead of generic ones? But then, why is ppc64 not failing?

On s390x, there are more of these "exit status: 77" errors, but they still don't account for all the failures.
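
For reference, Automake-style test harnesses reserve exit status 77 for a skipped test and 99 for a hard error, so the `stop 77` path quoted in comment #1 is not the same thing as a failed check. Below is a minimal sketch, assuming the suite runs under such a harness, of how skips could be separated from real failures when tallying results. The test names and exit codes are made up for illustration; nothing here is taken from the koji logs.

```python
from collections import Counter

# Exit-status convention of Automake-style test harnesses.
SKIP_STATUS = 77
HARD_ERROR_STATUS = 99


def classify(exit_status):
    """Map a test's exit status to PASS, SKIP, ERROR or FAIL."""
    if exit_status == 0:
        return "PASS"
    if exit_status == SKIP_STATUS:
        return "SKIP"
    if exit_status == HARD_ERROR_STATUS:
        return "ERROR"
    return "FAIL"


def tally(results):
    """Count outcomes for a mapping of test name -> exit status."""
    return Counter(classify(status) for status in results.values())


# Hypothetical results, for illustration only.
example = {
    "test_a.sh": 0,
    "test_qr.sh": 77,
    "test_b.sh": 1,
}
print(tally(example))
```

Counting this way would show how many of the reported failures are genuine and how many are tests that chose to skip themselves.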

Comment 3 Alexander Ploumistos 2017-11-12 19:25:01 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> Thanks for your interest. You can find ppc64, ppc64le, s390x and x86_64 logs
> in this scratch build:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23053097 . You'll be
> able to find aarch64, armv7hl and i686 in this scratch build once it
> completes in about 36 hours (armv7hl is slowest):
> https://koji.fedoraproject.org/koji/taskinfo?taskID=23064548 .

Why does it take that long? I've been looking at the i686 build log since yesterday and it seems stuck at the openmpi tests.

> On ppc64le, what you described in comment #1 accounts for only a few of
> the failures in each failing case. Do you know why these are not failing on
> x86_64? Is it because SIMD kernels are used instead of generic ones? But
> then, why is ppc64 not failing?

I have no idea. Perhaps someone on devel could answer that. Have you asked upstream by any chance?

Comment 4 Alexander Ploumistos 2017-11-14 13:26:47 UTC
That's interesting: the i686 openmpi checks timed out, but they completed on armv7hl. I think upstream should have an explanation for that; I would expect both builds to fail if it were a word-size issue.

Comment 5 Dominik 'Rathann' Mierzejewski 2017-12-29 13:06:45 UTC
New scratch build, since the previous ones were garbage-collected already: https://koji.fedoraproject.org/koji/taskinfo?taskID=23941175 .

Comment 6 Dominik 'Rathann' Mierzejewski 2018-01-11 13:44:15 UTC
Ok, so the i686 build gets stuck at random in various double-precision tests under OpenMPI. Across three local mock -r fedora-rawhide-i386 runs, I saw the following tests hang while eating CPU cycles:
test_real_double_hermitian_multiply_1stage_all_layouts.sh
test_real_double_eigenvalues_1stage_all_layouts.sh
test_real_double_eigenvectors_2stage_all_kernels_all_layouts.sh

For the time being, I'm going to disable testing with OpenMPI on i686. I reported this upstream and they're investigating.
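
One way to keep a hang like this from stalling a whole build is to run each test under a time limit. The following is a minimal sketch of that idea, not what the elpa spec file or upstream harness does; `run_with_timeout` is a hypothetical helper, and the sleeping child process is only a stand-in for a stuck test.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit status, or None if it exceeded timeout_s."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising on timeout.
        return None
    return completed.returncode


# Stand-in for a hanging test: a child process that sleeps for 30 seconds.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
status = run_with_timeout(hanging, timeout_s=1)
print("timed out" if status is None else f"exit status {status}")
```

Note that only the directly launched process is killed here; for an MPI test started through a launcher, the individual ranks may need separate cleanup.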

Comment 7 Fedora End Of Life 2018-02-20 15:34:31 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 8 Dominik 'Rathann' Mierzejewski 2018-07-30 13:58:02 UTC
With elpa-2018.05.001, I observe the following failures on rawhide:
s390x:   serial+openmp  6
ppc64le: serial+openmp 12, mpich+openmp  8
ppc64:   serial+openmp  8, mpich+openmp  5
x86_64:  serial+openmp 20, mpich+openmp 12
aarch64: serial+openmp 10, mpich+openmp  8, openmpi+openmp 6
i686:    serial+openmp  9, mpich+openmp  8
armv7hl: serial+openmp 11, mpich+openmp  9, openmpi+openmp 8

Comment 9 Jan Kurik 2018-08-14 10:16:55 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 10 Christoph Junghans 2018-11-05 21:18:26 UTC
I can reproduce the hang with gromacs-2018.3 on i686 and armv7hl, but only with OpenMPI. However, if I reorder the test execution from (openmpi, mpich, serial) to (serial, mpich, openmpi), it passes, at least on i686.

