Bug 1858522 - Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: hwloc
Version: rawhide
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jiri Hladky
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: FedoraTracker 1863077
 
Reported: 2020-07-18 18:07 UTC by Antonio T. sagitter
Modified: 2020-08-04 02:09 UTC
6 users

Fixed In Version: hwloc-2.2.0-1.fc33
Clone Of:
Environment:
Last Closed: 2020-08-04 02:09:42 UTC
Type: Bug
Embargoed:



Description Antonio T. sagitter 2020-07-18 18:07:04 UTC
Description of problem:
OpenMPI tests of MUMPS are failing on Rawhide s390x only:

+ export OMPI_MCA_rmaps_base_oversubscribe=1
+ OMPI_MCA_rmaps_base_oversubscribe=1
+ ./ssimpletest
[buildvm-s390x-09:2509570] *** Process received signal ***
[buildvm-s390x-09:2509570] Signal: Segmentation fault (11)
[buildvm-s390x-09:2509570] Signal code: Address not mapped (1)
[buildvm-s390x-09:2509570] Failing at address: 0xfffffffffffff000
[buildvm-s390x-09:2509570] [ 0] [0x3fffdafcee0]
[buildvm-s390x-09:2509570] [ 1] /lib64/libhwloc.so.15(+0x44870)[0x3ff831c4870]
[buildvm-s390x-09:2509570] [ 2] /lib64/libhwloc.so.15(hwloc_topology_load+0xe6)[0x3ff83196ae6]
[buildvm-s390x-09:2509570] [ 3] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0xfe2)[0x3ff836040d2]
[buildvm-s390x-09:2509570] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_ess_hnp.so(+0x508c)[0x3ff82a0508c]
[buildvm-s390x-09:2509570] [ 5] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_init+0x2d2)[0x3ff83a112d2]
[buildvm-s390x-09:2509570] [ 6] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_daemon+0x26a)[0x3ff839bc72a]
[buildvm-s390x-09:2509570] [ 7] /lib64/libc.so.6(__libc_start_main+0x10a)[0x3ff836abb7a]
[buildvm-s390x-09:2509570] [ 8] orted(+0x954)[0x2aa11300954]
[buildvm-s390x-09:2509570] *** End of error message ***
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[buildvm-s390x-09.s390.fedoraproject.org:2509569] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Version-Release number of selected component (if applicable):
MUMPS-5.3.1-3
openmpi-4.0.4-1

How reproducible:
Building MUMPS on Rawhide

Actual results:
https://koji.fedoraproject.org/koji/taskinfo?taskID=47387705

Comment 1 Orion Poplawski 2020-07-18 20:04:29 UTC
This looks to be hwloc-related.  I'd like to see if updating to 2.2.0 resolves it.  I've filed https://src.fedoraproject.org/rpms/hwloc/pull-request/2

Comment 2 Orion Poplawski 2020-08-04 02:09:42 UTC
Hopefully fixed with hwloc-2.2.0-1.fc33.  Please reopen if it is not.

