Bug 1220161 - FTBFS of elpa if build host has > 4 cpu cores.
Summary: FTBFS of elpa if build host has > 4 cpu cores.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: elpa
Version: epel7
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Dominik 'Rathann' Mierzejewski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-05-10 15:19 UTC by Tuomo Soini
Modified: 2017-06-28 23:19 UTC (History)

Fixed In Version: elpa-2015.11.001-6.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-28 23:19:16 UTC
Type: Bug
Embargoed:


Attachments
Mock build logs without modifications and without ncpus hack (82.34 KB, application/octet-stream)
2015-05-24 09:41 UTC, Tuomo Soini

Description Tuomo Soini 2015-05-10 15:19:33 UTC
Description of problem:

I tried to build elpa on a host with 24 CPU cores. The build seemed to go fine, but after 6 hours mock gave up during the tests because of a timeout. I doubled the timeout, and after 12 hours mock gave up again.

I checked the elpa spec and found elpa-rpm.patch, which changes CPU core usage substantially by applying a change similar to this one to several tests:

diff -up mpich/Makefile.am.r mpich/Makefile.am
--- mpich/Makefile.am.r        2015-03-17 16:05:37.000000000 +0100
+++ mpich/Makefile.am  2015-03-20 10:59:37.967517516 +0100
@@ -204,47 +204,47 @@ check_SCRIPTS = \
 
 TESTS = $(check_SCRIPTS)
 elpa1_test_real.sh:
-      echo 'mpiexec -n 2 ./elpa1_test_real@SUFFIX@ $$TEST_FLAGS' > elpa1_test_real.sh
+      echo 'mpiexec -n `getconf _NPROCESSORS_ONLN` ./elpa1_test_real@SUFFIX@ $$TEST_FLAGS' > elpa1_test_real.sh
       chmod +x elpa1_test_real.sh
 
After I removed all changes to Makefile.am, the build succeeded on the first try.

I saw from the changelog that there have been similar problems with different CPU architectures before. I see the Fedora build system only uses 4 CPU cores for x86_64. I even tried building in a VM with only 8 CPU cores, but the timeout after 12 hours still happens and the build doesn't finish.

I'd strongly suggest removing this unnecessary change from elpa-rpm.patch so that the package actually builds on multi-core machines.

This unnecessary optimization prevents that now.
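
For illustration only, a middle ground between the original hard-coded `-n 2` and the uncapped `getconf _NPROCESSORS_ONLN` would be to cap the number of MPI processes at a ceiling. This is a minimal sketch under stated assumptions, not what elpa-rpm.patch or the eventual fix does: the helper `cap_procs`, the variable `MAX_TEST_PROCS` and the default ceiling of 4 are hypothetical names and values chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the smaller of the online CPU count and a ceiling.
cap_procs() {
    online=$1    # number of online CPUs, e.g. from getconf _NPROCESSORS_ONLN
    ceiling=$2   # upper bound on MPI processes used by the test suite
    if [ "$online" -lt "$ceiling" ]; then
        echo "$online"
    else
        echo "$ceiling"
    fi
}

# MAX_TEST_PROCS is an assumed, illustrative variable; it defaults to 4 here.
NPROCS=$(cap_procs "$(getconf _NPROCESSORS_ONLN)" "${MAX_TEST_PROCS:-4}")

# Only print the command a generated test script would run.
echo "mpiexec -n $NPROCS ./elpa1_test_real"
```

On the 24-core host described above this would request 4 processes, while a 2-core builder would still use both cores.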

I tested builds on epel7, epel6/x86_64 and epel6/i686, and I see the same timeout problem on all of them.

Tested versions:

2015.02.002-4.el7
2015.02.002-4.el6

Comment 1 Dominik 'Rathann' Mierzejewski 2015-05-14 11:04:50 UTC
There might be a bug in the openmpi packages present in epel buildroots. The tests are not timing out on Fedora rawhide and 22. I observed this with openmpi earlier (bug 1144408), but it fixed itself with recent OpenMPI packages. Are you seeing the timeouts for mpich tests as well? Please disable openmpi tests and retry.

Comment 2 Dominik 'Rathann' Mierzejewski 2015-05-14 11:08:44 UTC
Also, massive parallelization is kind of the raison d'être for this library, so if running the testsuite doesn't scale, then it's a bug that needs to be fixed, not worked around by decreasing the number of processes running.

Comment 3 Tuomo Soini 2015-05-14 11:38:30 UTC
It's not about the testsuite failing. It's about the testsuite taking a ridiculous amount of time to run, which doesn't make any sense, and build systems timing out.

With -n 2 the package build takes around 3 hours to complete, and 90% of that time is taken by the test suite. With -n 8 it doesn't complete in 12h, which was my absolute maximum timeout for the build.

This is not a traditional FTBFS problem because there is no failure in the build. The packager's modifications for the "intended" behaviour of the test suite are now the reason for the timeout, not the software itself.

Comment 4 Dominik 'Rathann' Mierzejewski 2015-05-15 10:32:32 UTC
You haven't answered my question. Maybe you missed it, so let me repeat:

Are you seeing the timeouts for mpich tests as well? Please disable openmpi tests and retry.

Comment 5 Tuomo Soini 2015-05-15 14:24:40 UTC
I did answer, but you didn't read it. There was only a build system timeout, no test timeouts. When I initially raised the build timeout from 6h to 12h, the tests just got a little further.

Yes, the problem may lie in other components, but this packaging change is the obvious trigger for the bad behaviour.
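
The thread does not say which mock option was used to raise the timeout. As a hedged sketch, mock's `--rpmbuild_timeout` option takes a value in seconds, so the 6h and 12h limits mentioned above would be converted roughly as follows. The helper `hours_to_seconds`, the chroot name and the SRPM file name are illustrative assumptions, not taken from the attached logs.

```shell
#!/bin/sh
# Hypothetical helper: convert whole hours to seconds.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

TIMEOUT=$(hours_to_seconds 12)

# Only print the command; the chroot and SRPM names are placeholders.
echo "mock --rpmbuild_timeout=$TIMEOUT -r epel-7-x86_64 --rebuild elpa.src.rpm"
```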

Comment 6 Dominik 'Rathann' Mierzejewski 2015-05-17 15:20:12 UTC
You still haven't answered my question, so I'm repeating it for the last time: is this timeout issue occurring for both openmpi and mpich or only openmpi?

If it's only openmpi, then I'll implement a workaround for openmpi tests, but I don't see any reason not to use the full capacity of the build system for the testsuite. This is especially important for ARM builds.

Also, I experienced these timeouts on my own machine which has 4 cores, so this belies your claim that it only happens with >4 CPU cores.

Comment 7 Tuomo Soini 2015-05-24 09:41:31 UTC
Created attachment 1029158 [details]
Mock build logs without modifications and without ncpus hack

The log file names should indicate the build environment and the timeout options given to mock.

Comment 8 Fedora Update System 2017-06-12 12:08:22 UTC
elpa-2015.11.001-5.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-7a1372d77b

Comment 9 Fedora Update System 2017-06-12 13:42:11 UTC
openmx-3.8.1-9.el7 elpa-2015.11.001-5.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9161cc56d2

Comment 10 Fedora Update System 2017-06-13 03:48:38 UTC
elpa-2015.11.001-5.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-7a1372d77b

Comment 11 Fedora Update System 2017-06-13 10:12:22 UTC
elpa-2015.11.001-6.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-7a1372d77b

Comment 12 Fedora Update System 2017-06-14 01:34:59 UTC
elpa-2015.11.001-6.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-7a1372d77b

Comment 13 Fedora Update System 2017-06-14 07:49:43 UTC
elpa-2015.11.001-6.el7, openmx-3.8.1-9.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9161cc56d2

Comment 14 Fedora Update System 2017-06-17 19:42:24 UTC
elpa-2015.11.001-6.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2017-06-28 23:19:16 UTC
elpa-2015.11.001-6.el7, openmx-3.8.1-9.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.
