Description of problem:
OpenMPI tests of MUMPS are failing on Rawhide s390x only:

+ export OMPI_MCA_rmaps_base_oversubscribe=1
+ OMPI_MCA_rmaps_base_oversubscribe=1
+ ./ssimpletest
[buildvm-s390x-09:2509570] *** Process received signal ***
[buildvm-s390x-09:2509570] Signal: Segmentation fault (11)
[buildvm-s390x-09:2509570] Signal code: Address not mapped (1)
[buildvm-s390x-09:2509570] Failing at address: 0xfffffffffffff000
[buildvm-s390x-09:2509570] [ 0] [0x3fffdafcee0]
[buildvm-s390x-09:2509570] [ 1] /lib64/libhwloc.so.15(+0x44870)[0x3ff831c4870]
[buildvm-s390x-09:2509570] [ 2] /lib64/libhwloc.so.15(hwloc_topology_load+0xe6)[0x3ff83196ae6]
[buildvm-s390x-09:2509570] [ 3] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0xfe2)[0x3ff836040d2]
[buildvm-s390x-09:2509570] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_ess_hnp.so(+0x508c)[0x3ff82a0508c]
[buildvm-s390x-09:2509570] [ 5] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_init+0x2d2)[0x3ff83a112d2]
[buildvm-s390x-09:2509570] [ 6] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_daemon+0x26a)[0x3ff839bc72a]
[buildvm-s390x-09:2509570] [ 7] /lib64/libc.so.6(__libc_start_main+0x10a)[0x3ff836abb7a]
[buildvm-s390x-09:2509570] [ 8] orted(+0x954)[0x2aa11300954]
[buildvm-s390x-09:2509570] *** End of error message ***
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[buildvm-s390x-09.s390.fedoraproject.org:2509569] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Version-Release number of selected component (if applicable):
MUMPS-5.3.1-3
openmpi-4.0.4-1

How reproducible:
Building MUMPS on Rawhide

Actual results:
https://koji.fedoraproject.org/koji/taskinfo?taskID=47387705
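Note that the segfault happens in orted during orte_init, before any MUMPS code runs, so any Open MPI program run as a singleton should hit the same abort. A minimal sketch to check that (the file name hello_mpi.c and the build commands below are just an illustration, not part of the MUMPS test suite):

/* hello_mpi.c - minimal MPI program to check whether MPI_Init itself
 * fails on this builder, independent of MUMPS. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Presumably building this with mpicc hello_mpi.c -o hello_mpi (after loading the openmpi environment module) and running it the same way ssimpletest is run above would abort identically, which would show the problem is in the Open MPI/hwloc stack rather than in MUMPS.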
This looks to be hwloc-related. I'd like to see whether updating to hwloc 2.2.0 resolves it. I've filed https://src.fedoraproject.org/rpms/hwloc/pull-request/2
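Since frame 2 of the backtrace is hwloc_topology_load, it may be possible to reproduce the crash without MPI at all. A rough sketch against the public hwloc API (nothing MUMPS- or Open MPI-specific is assumed; the file name hwloc_load.c is made up):

/* hwloc_load.c - load the hardware topology the way
 * opal_hwloc_base_get_topology eventually does (per the backtrace),
 * to see whether plain hwloc already segfaults on this s390x builder. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;

    if (hwloc_topology_init(&topology) < 0) {
        fprintf(stderr, "hwloc_topology_init failed\n");
        return 1;
    }
    /* frame 2 in the backtrace above */
    if (hwloc_topology_load(topology) < 0) {
        fprintf(stderr, "hwloc_topology_load failed\n");
        hwloc_topology_destroy(topology);
        return 1;
    }
    printf("topology depth: %d\n", hwloc_topology_get_depth(topology));
    hwloc_topology_destroy(topology);
    return 0;
}

Compiled with something like gcc hwloc_load.c -o hwloc_load $(pkg-config --cflags --libs hwloc): if this also crashes with the current Rawhide hwloc but not with 2.2.0, that would confirm the fix belongs in hwloc.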
Hopefully fixed with hwloc-2.2.0-1.fc33. Please reopen if it is not.