Bug 2141137 - openmpi: mpiexec hangs on f38 in koji
Summary: openmpi: mpiexec hangs on f38 in koji
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: libfabric
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 2137308 2137389
Reported: 2022-11-08 22:13 UTC by marcindulak
Modified: 2022-11-15 19:15 UTC (History)

Fixed In Version: libfabric-1.16.1-3.fc38
Clone Of:
Environment:
Last Closed: 2022-11-12 00:15:06 UTC
Type: Bug
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ofiwg libfabric pull 8227 | 0 | None | Merged | prov/net: fix error path in xnet_enable_rdm | 2022-11-15 23:26:25 UTC
Github | open-mpi ompi issues 11055 | 0 | None | closed | mpiexec hangs on Fedora koji builders - with IPv6 support | 2022-11-15 23:26:21 UTC

Description marcindulak 2022-11-08 22:13:57 UTC
Description of problem:

mpiexec hangs on f38 (https://koji.fedoraproject.org/koji/taskinfo?taskID=93958805) but works on f37 (https://koji.fedoraproject.org/koji/taskinfo?taskID=93958801).

I see the same type of hang in bug #2137389, where mpich does not hang.
I'm unable to reproduce the issue locally with openmpi in Docker Hub's fedora:38@sha256:c7dfa518d9db440fb02362c0f9b014c0e1b8e04bc0f6bf540d1d5ac2ecb43453 image.


Version-Release number of selected component (if applicable):

4.1.4-5.fc38

How reproducible:

in koji

Steps to Reproduce:
1. Save this as openmpi-test.spec

# https://github.com/open-mpi/ompi/issues/10324#issuecomment-1136363475
# https://github.com/open-mpi/ompi/issues/6850
# https://www.mail-archive.com/users@lists.open-mpi.org//msg26012.html

Name:			openmpi-test
Version:		1.0.0
Release:		1%{?dist}
Summary:		openmpi test

License:		GPLv3+

BuildRequires:		openssh-clients
BuildRequires:		openmpi-devel
BuildRequires:		gcc
BuildRequires:		strace
BuildRequires:		hostname
BuildRequires:		time

%description

%check

export TIMEOUT_OPTS='--preserve-status --kill-after 10 60'
export OMP_NUM_THREADS=1

%{_openmpi_load}
timeout ${TIMEOUT_OPTS} time strace -f -e execve -- env -i PATH=$MPI_BIN:/bin mpiexec --mca btl self,tcp --mca btl_tcp_if_include 127.0.0.1/24 --mca plm_base_verbose 10 --allow-run-as-root -np 2 hostname

# https://github.com/mikaem/mpi-examples/blob/master/helloworld.cpp
cat <<EOF > hello.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    printf("Hello world! from rank %d"
           " out of %d processors\n",
           world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
EOF
mpicc hello.c -o hello
timeout ${TIMEOUT_OPTS} time strace -f -e execve -- env -i PATH=$MPI_BIN:/bin mpiexec --mca btl self,tcp --mca btl_tcp_if_include 127.0.0.1/24 --mca plm_base_verbose 10 --allow-run-as-root -np 2 ./hello

%{_openmpi_unload}

2. rpmbuild -bs openmpi-test.spec
3. koji build --nowait --scratch f38 ~/rpmbuild/SRPMS/openmpi-test-1.0.0-1.fc36.src.rpm
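
The %check section above relies on timeout(1) to turn a hang into a build failure. As a rough sketch of how a hang could be told apart from an ordinary failure, the helper below drops --preserve-status so that GNU timeout reports its own exit codes (124 on timeout, 137 when the command had to be killed with SIGKILL). The function name classify_run is hypothetical and not part of the spec above; with --preserve-status, as used in the spec, a timed-out command would instead typically show up as status 143 (128 + SIGTERM).

```shell
#!/bin/sh
# Hypothetical helper (not part of the spec above): run a command under GNU
# timeout(1) and print a one-line classification of the outcome.
# Usage: classify_run SECONDS COMMAND [ARGS...]
classify_run() {
    limit=$1
    shift
    # Without --preserve-status, GNU timeout exits with 124 when the time
    # limit is reached; --kill-after sends SIGKILL if SIGTERM is ignored.
    timeout --kill-after 10 "$limit" "$@"
    status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "hang (timed out after ${limit}s)" ;;
        137) echo "hang (had to be killed with SIGKILL)" ;;
        *)   echo "failed with status $status" ;;
    esac
    return $status
}
```

For example, classify_run 60 mpiexec -np 2 ./hello would print one of the lines above after the command's own output.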

Actual results:

hang

Expected results:

Hello world! from rank 1 out of 2 processors
Hello world! from rank 0 out of 2 processors

Additional info:

It's a strange issue; it could be an openmpi and/or koji problem.

Comment 1 Orion Poplawski 2022-11-11 00:16:09 UTC
Does anyone know what change triggered this?

Comment 2 Orion Poplawski 2022-11-11 04:26:28 UTC
I suspect this was caused by enabling IPv6 support in openmpi.  I've reverted that and am building it now.

Comment 3 Orion Poplawski 2022-11-11 05:24:36 UTC
Nope, that didn't appear to help.  I've filed https://github.com/open-mpi/ompi/issues/11055 upstream.

Comment 4 Orion Poplawski 2022-11-11 05:25:24 UTC
Has anyone found any failures in koschei?  I haven't yet.  Why would that be?

Comment 5 Orion Poplawski 2022-11-12 00:15:06 UTC
Turned out to be a libfabric issue. Should be fixed now.
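
For anyone checking whether a buildroot or local install already has the fix, a minimal sketch follows. It compares a libfabric version-release string against this bug's "Fixed In Version" value using GNU sort -V, which only approximates RPM's own version comparison. The function names version_ge and has_fix are hypothetical.

```shell
#!/bin/sh
# Fixed-in version taken from this bug's "Fixed In Version" field.
FIXED="1.16.1-3.fc38"

# version_ge A B: succeed when A sorts at or after B under GNU `sort -V`.
# Assumption: this approximates, but is not identical to, RPM's comparison.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# has_fix VERSION-RELEASE: report whether the given libfabric build has the fix.
has_fix() {
    if version_ge "$1" "$FIXED"; then
        echo "libfabric $1: includes the fix"
    else
        echo "libfabric $1: older than $FIXED"
    fi
}
```

On a Fedora system this could be called as has_fix "$(rpm -q --qf '%{VERSION}-%{RELEASE}' libfabric)".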

