Bug 1659852 - openmpi program killed with Illegal Instruction signal
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: libfabric
Version: 34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Orion Poplawski
QA Contact: Fedora Extras Quality Assurance
Reported: 2018-12-17 02:50 UTC by Orion Poplawski
Modified: 2022-06-08 00:24 UTC (History)

Last Closed: 2022-06-08 00:24:53 UTC
Type: Bug

Links
GitHub ofiwg/libfabric issue 4702 (closed): Do not call into psm2 when not needed (last updated 2020-10-13 01:26:34 UTC)

Description Orion Poplawski 2018-12-17 02:50:55 UTC
Description of problem:

I'm testing builds of openmpi 3.1 in a COPR.  I'm seeing many tests fail on Fedora Rawhide x86_64 with Illegal Instruction errors.

Backtrace:
/lib64/libpsm2.so.2(+0x46c14)[0x7f013b9afc14]
/lib64/libpsm2.so.2(+0x46eb5)[0x7f013b9afeb5]
/lib64/libpsm2.so.2(+0x4bbcb)[0x7f013b9b4bcb]
/lib64/libpsm2.so.2(psm2_init+0x221)[0x7f013b98be61]
/lib64/libfabric.so.1(+0xc846f)[0x7f013b27846f]
/lib64/libfabric.so.1(fi_getinfo+0x296)[0x7f013b1c79c6]
/usr/lib64/openmpi/lib/openmpi/mca_mtl_ofi.so(+0x579a)[0x7f013b8e479a]
/usr/lib64/openmpi/lib/libmpi.so.40(ompi_mtl_base_select+0xa4)[0x7f01423bffb4]
/usr/lib64/openmpi/lib/openmpi/mca_pml_cm.so(+0x5cee)[0x7f013ba9acee]
/usr/lib64/openmpi/lib/libmpi.so.40(mca_pml_base_select+0x1e4)[0x7f01423c88e4]
/usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x6ba)[0x7f01423561fa]
/usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f0142385a72]
../pddrive(+0xfe4c)[0x55cf75f2ce4c]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f0141e3cee3]
../pddrive(+0x103fe)[0x55cf75f2d3fe]

That address appears to contain a VEX-encoded (AVX) instruction:

   46c14:       c5 f9 ef c0             vpxor  %xmm0,%xmm0,%xmm0

Is this a bug in libpsm2 incorrectly trying to call AVX2 code, or perhaps libfabric incorrectly trying to use libpsm2 on non-AVX2-capable hardware? Or something else?

Version-Release number of selected component (if applicable):
libpsm2-11.2.23-1.fc30.x86_64


Additional info:
I'm unable to reproduce this error outside of COPR, so perhaps it's triggered by something specific about the COPR hardware, which seems to be:

model name	: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm pti tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat
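For reference, a flags line like the one above can be checked for a given feature mechanically. The following is a minimal C sketch, not part of any package discussed here. The function name cpu_flags_contain is hypothetical, and it assumes the flags are passed in as a single whitespace-separated string, such as the text after the colon in /proc/cpuinfo. It matches whole words only, so "avx" does not match "avx2".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `flag` occurs as a whole
 * whitespace-separated word inside `flags`.
 */
bool cpu_flags_contain(const char *flags, const char *flag)
{
    assert(flags != NULL && flag != NULL);

    size_t n = strlen(flag);
    if (n == 0)
        return false;

    const char *p = flags;
    while ((p = strstr(p, flag)) != NULL) {
        /* The match must start at the beginning or after a separator... */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and must end at the end of the string or before a separator. */
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return true;
        p += 1; /* keep searching after a partial-word hit */
    }
    return false;
}
```

Applied to the X5690 flags above, such a check finds sse4_2 but neither avx nor avx2, which is consistent with the illegal-instruction signal.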

Comment 1 Honggang LI 2018-12-17 02:58:38 UTC
Can you please provide a reproducer?

Comment 2 Orion Poplawski 2018-12-17 03:28:12 UTC
If you make use of https://copr.fedorainfracloud.org/coprs/g/scitech/openmpi3.1/ on Fedora Rawhide and then try to build superlu_dist, that is what the above backtrace is from.  In %check it runs:

mpirun -n 4 ../pddrive -r 2 -c 2 g20.rua

which fails.  If running on fewer than 4 cores, you'll need to add:

export OMPI_MCA_rmaps_base_oversubscribe=1

If you apply for membership in the FAS group scitech, you can submit builds in the COPR in case you cannot reproduce locally (as I could not).

Comment 3 Orion Poplawski 2018-12-17 03:58:14 UTC
I see that libpsm2-11.2.23-1.fc30.x86_64 is built with -mavx2.  That seems wrong for a general purpose x86_64 library. Build log: https://kojipkgs.fedoraproject.org//packages/libpsm2/11.2.23/1.fc30/data/logs/x86_64/build.log

Comment 4 John Reiser 2018-12-17 04:36:43 UTC
The source assumes incorrectly that the run-time hardware will be at least as capable as the compile-time hardware.

buildflags.mak (included from Makefile):
    #
    # test if compiler supports 32B(AVX2)/64B(AVX512F) move instruction.
    #
    ifeq (${CC},icc)
      MAVX2=-march=core-avx2 -DPSM_AVX512
    else 
      MAVX2=-mavx2
    endif
    RET := $(shell echo "int main() {}" | ${CC} ${MAVX2} -E -dM -xc - 2>&1 | grep -q AVX2 ; echo $$?)
    ifeq (0,${RET})
      BASECFLAGS += ${MAVX2}
    else 
        $(error Compiler does not support AVX2 )
    endif

Fix: delete all those lines, and also the lines which test for -mavx512f.

Comment 5 Dominik 'Rathann' Mierzejewski 2018-12-17 11:58:29 UTC
Indeed, compiling with non-Fedora-mandated compiler flags should be avoided and needs justification. Assuming the code doesn't support runtime-CPU-detection, you could try building twice, once for vanilla x86_64 (without -march) and second time with -mavx2/-mavx512f and putting the AVX-enabled binaries in /usr/lib64/haswell/ or /usr/lib64/haswell/avx512_1/. See: https://clearlinux.org/blogs/transparent-use-library-packages-optimized-intel-architecture .

Comment 6 aravind.gopalakrishnan 2018-12-17 18:46:34 UTC
Looks like the CPU you are using in the test environment is quite old and does not support AVX2 instructions. Intel Omni-Path does not support CPUs without AVX2, hence AVX2 is enabled by default at compile time. You can force disable the use of AVX2 instructions at build time by setting PSM_DISABLE_AVX2=1.

Comment 7 Orion Poplawski 2018-12-17 18:51:15 UTC
Packages built for Fedora need to run without modification on all supported hardware.  If psm2 is going to be used by any Fedora packages it will need to be built in a way to support non-AVX2 hardware, or else packages will need to drop psm2 support.

Comment 8 aravind.gopalakrishnan 2018-12-17 20:18:01 UTC
You can export PSM_DISABLE_AVX2=1 in the specfile (libpsm2.spec.in) and build for all architectures. That should allow the psm2 included in Fedora to work on all supported hardware.

Comment 9 Orion Poplawski 2018-12-19 03:15:59 UTC
Setting PSM_DISABLE_AVX2=1 with 11.2.68 simply replaces -mavx2 with -mavx, which is not sufficient.

But it appears that -mavx is required to build as without it I get:

gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection  -pthread -Wall -Werror -D_DEFAULT_SOURCE -D_SVID_SOURCE -D_BSD_SOURCE -O3 -g3 -fpic -fPIC -D_GNU_SOURCE  -funwind-tables -Wno-strict-aliasing -Wformat-security -I/builddir/build/BUILD/libpsm2-11.2.68/include -I/builddir/build/BUILD/libpsm2-11.2.68/mpspawn -I/builddir/build/BUILD/libpsm2-11.2.68/include/linux-x86_64 -I/usr/include/uapi -I/builddir/build/BUILD/libpsm2-11.2.68 -I/builddir/build/BUILD/libpsm2-11.2.68/ptl_ips -I/builddir/build/BUILD/libpsm2-11.2.68/build_release -I/builddir/build/BUILD/libpsm2-11.2.68/opa/.. -I/builddir/build/BUILD/libpsm2-11.2.68/opa/../ptl_ips -c /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c -o /builddir/build/BUILD/libpsm2-11.2.68/build_release/opa/opa_dwordcpy-x86_64.o
gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection  -pthread -Wall -Werror -D_DEFAULT_SOURCE -D_SVID_SOURCE -D_BSD_SOURCE -O3 -g3 -fpic -fPIC -D_GNU_SOURCE  -funwind-tables -Wno-strict-aliasing -Wformat-security -I/builddir/build/BUILD/libpsm2-11.2.68/include -I/builddir/build/BUILD/libpsm2-11.2.68/mpspawn -I/builddir/build/BUILD/libpsm2-11.2.68/include/linux-x86_64 -I/usr/include/uapi -I/builddir/build/BUILD/libpsm2-11.2.68 -I/builddir/build/BUILD/libpsm2-11.2.68/ptl_ips -I/builddir/build/BUILD/libpsm2-11.2.68/build_release -I/builddir/build/BUILD/libpsm2-11.2.68/opa/.. -I/builddir/build/BUILD/libpsm2-11.2.68/opa/../ptl_ips -c /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_sysfs.c -o /builddir/build/BUILD/libpsm2-11.2.68/build_release/opa/opa_sysfs.o
gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection  -pthread -Wall -Werror -D_DEFAULT_SOURCE -D_SVID_SOURCE -D_BSD_SOURCE -O3 -g3 -fpic -fPIC -D_GNU_SOURCE  -funwind-tables -Wno-strict-aliasing -Wformat-security -I/builddir/build/BUILD/libpsm2-11.2.68/include -I/builddir/build/BUILD/libpsm2-11.2.68/mpspawn -I/builddir/build/BUILD/libpsm2-11.2.68/include/linux-x86_64 -I/usr/include/uapi -I/builddir/build/BUILD/libpsm2-11.2.68 -I/builddir/build/BUILD/libpsm2-11.2.68/ptl_ips -I/builddir/build/BUILD/libpsm2-11.2.68/build_release -I/builddir/build/BUILD/libpsm2-11.2.68/opa/.. -I/builddir/build/BUILD/libpsm2-11.2.68/opa/../ptl_ips -c /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_syslog.c -o /builddir/build/BUILD/libpsm2-11.2.68/build_release/opa/opa_syslog.o
gcc  -g3 -fpic -c /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64-fast.S -o /builddir/build/BUILD/libpsm2-11.2.68/build_release/opa/opa_dwordcpy-x86_64-fast.o
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c: In function 'hfi_pio_blockcpy_256':
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:206:12: error: AVX vector return without AVX enabled changes the ABI [-Werror=psabi]
    __m256i tmp0 = _mm256_load_si256(sp);
            ^~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:913:1: error: inlining failed in call to always_inline '_mm256_store_si256': target specific option mismatch
 _mm256_store_si256 (__m256i *__P, __m256i __A)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:209:4: note: called from here
    _mm256_store_si256((__m256i *)(dp + 1), tmp1);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:913:1: error: inlining failed in call to always_inline '_mm256_store_si256': target specific option mismatch
 _mm256_store_si256 (__m256i *__P, __m256i __A)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:208:4: note: called from here
    _mm256_store_si256((__m256i *)dp, tmp0);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:907:1: error: inlining failed in call to always_inline '_mm256_load_si256': target specific option mismatch
 _mm256_load_si256 (__m256i const *__P)
 ^~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:207:19: note: called from here
    __m256i tmp1 = _mm256_load_si256(sp + 1);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:907:1: error: inlining failed in call to always_inline '_mm256_load_si256': target specific option mismatch
 _mm256_load_si256 (__m256i const *__P)
 ^~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:206:19: note: called from here
    __m256i tmp0 = _mm256_load_si256(sp);
                   ^~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:913:1: error: inlining failed in call to always_inline '_mm256_store_si256': target specific option mismatch
 _mm256_store_si256 (__m256i *__P, __m256i __A)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:217:4: note: called from here
    _mm256_store_si256((__m256i *)(dp + 1), tmp1);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:913:1: error: inlining failed in call to always_inline '_mm256_store_si256': target specific option mismatch
 _mm256_store_si256 (__m256i *__P, __m256i __A)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:216:4: note: called from here
    _mm256_store_si256((__m256i *)dp, tmp0);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:919:1: error: inlining failed in call to always_inline '_mm256_loadu_si256': target specific option mismatch
 _mm256_loadu_si256 (__m256i_u const *__P)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:215:19: note: called from here
    __m256i tmp1 = _mm256_loadu_si256(sp + 1);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:41,
                 from /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:57:
/usr/lib/gcc/x86_64-redhat-linux/8/include/avxintrin.h:919:1: error: inlining failed in call to always_inline '_mm256_loadu_si256': target specific option mismatch
 _mm256_loadu_si256 (__m256i_u const *__P)
 ^~~~~~~~~~~~~~~~~~
/builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c:214:19: note: called from here
    __m256i tmp0 = _mm256_loadu_si256(sp);
                   ^~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[1]: Leaving directory '/builddir/build/BUILD/libpsm2-11.2.68/opa'

So, where do we go from here?
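One commonly used way around this kind of error, sketched here under assumptions rather than taken from the libpsm2 source, is to enable AVX only for individual functions with GCC's target attribute and to choose between implementations at run time with __builtin_cpu_supports. The function names below are hypothetical, and the 64-byte copy merely stands in for the real PIO copy routines. The rest of the file is still compiled with the distribution's baseline flags.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Baseline version: plain C, runs on any x86_64 CPU. */
static void copy64_generic(uint8_t *dst, const uint8_t *src)
{
    memcpy(dst, src, 64);
}

/* AVX version: only this one function is compiled with AVX enabled. */
__attribute__((target("avx")))
static void copy64_avx(uint8_t *dst, const uint8_t *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)(src + 32));
    _mm256_storeu_si256((__m256i *)dst, lo);
    _mm256_storeu_si256((__m256i *)(dst + 32), hi);
}

/* Pick the implementation on the running CPU, not on the build host. */
void copy64(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);

    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        copy64_avx(dst, src);
    else
        copy64_generic(dst, src);
}
```

With this layout no global -mavx or -mavx2 flag is needed, so code outside copy64_avx should not pick up AVX instructions from the compiler.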

Comment 10 russell.w.mcguire 2018-12-20 01:15:31 UTC
Why is RHEL putting and testing OmniPath into machines that are not supported.

March 2016 original public release notes for OmniPath:
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Software_10_0_RN_J16607_v3_0.pdf

It states that Haswell or newer CPUs are required.

Is it common to put incompatible hardware together and require it to operate?

Comment 11 Honggang LI 2018-12-20 03:37:56 UTC
(In reply to russell.w.mcguire from comment #10)
> Why is RHEL putting and testing OmniPath into machines that are not
> supported.

No, RHEL will not ship OmniPath for machines with old CPUs. This bug is against Fedora, not RHEL.

However, I would suggest closing this bug as NOTABUG because the old CPU is unsupported.

Comment 12 Dominik 'Rathann' Mierzejewski 2018-12-20 10:37:31 UTC
(In reply to Honggang LI from comment #11)
> (In reply to russell.w.mcguire from comment #10)
> > Why is RHEL putting and testing OmniPath into machines that are not
> > supported.
> 
> No, RHEL will not ship OmniPath for machines with old CPUs. This bug is
> against Fedora, not RHEL.
> 
> However, I would suggest closing this bug as NOTABUG because the old CPU is
> unsupported.

No. Fedora still supports plain x86_64 (i.e. SSE2-only), so failing to run on such (admittedly old) hardware is still a bug. If you disagree, feel free to open a FESCo ticket.

Comment 13 Mamoru TASAKA 2018-12-21 15:19:36 UTC
(In reply to Orion Poplawski from comment #9)
> Setting PSM_DISABLE_AVX2=1 with 11.2.68 simply replaces -mavx2 with -mavx,
> which is not sufficient.
> 
> But it appears that -mavx is required to build as without it I get:

<snip>

> /builddir/build/BUILD/libpsm2-11.2.68/opa/opa_dwordcpy-x86_64.c: In function
> 'hfi_pio_blockcpy_256': <======================

> So, where do we go from here?

Looking at the code, these hfi_pio_blockcpy_XXXXX functions
implement the "PIO block copying routine". When the CPU supports a "higher" vector instruction set,
a "higher" copying routine is selected; see psm_hal_gen1/psm_hal_gen1_spio.c
for example:

https://github.com/intel/opa-psm2/blob/8a12e84dc7e3a89eb81f7d0d2fba13c5d9d9c484/psm_hal_gen1/psm_hal_gen1_spio.c#L160

These lines first define the ctrl->spio_blockcpy_routines[i] methods, then
call get_cpuid (line 172) to determine which spio_blockcpy_routines[] method can
actually be used, and store it in ctrl->spio_blockcpy_selected.

As hfi_pio_blockcpy_64() is written in "pure C", I guess we can assume we can
always use it as ctrl->spio_blockcpy_selected.

(Or maybe we can fix the selection method that determines ctrl->spio_blockcpy_selected.
 Ideally, if the CPU does not actually support AVX, hfi_pio_blockcpy_64() should be
 selected _even if_ hfi_pio_blockcpy_256 or similar is enabled *at compilation time*.)
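The selection described above can be illustrated with a small table-driven sketch. This is not the psm_hal_gen1_spio.c code: the capability bits, the routine table and select_routine are hypothetical, and real code would derive the capability mask from CPUID. The point is only that the plain-C routine is always eligible and a wider routine is chosen only when the running CPU reports every feature it needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits; real code would fill these in from CPUID. */
enum { CAP_AVX = 1u << 0, CAP_AVX512F = 1u << 1 };

struct copy_routine {
    const char *name;        /* label only, for illustration */
    unsigned required_caps;  /* every bit here must be set on the running CPU */
};

/* Ordered from least to most demanding; entry 0 is the plain-C fallback. */
static const struct copy_routine routines[] = {
    { "blockcpy_64",  0 },
    { "blockcpy_256", CAP_AVX },
    { "blockcpy_512", CAP_AVX | CAP_AVX512F },
};

/* Return the most demanding routine whose requirements the CPU satisfies. */
const struct copy_routine *select_routine(unsigned cpu_caps)
{
    size_t n = sizeof(routines) / sizeof(routines[0]);
    const struct copy_routine *best = &routines[0];

    /* The fallback must never depend on an optional CPU feature. */
    assert(routines[0].required_caps == 0);

    for (size_t i = 1; i < n; i++) {
        unsigned need = routines[i].required_caps;
        if ((cpu_caps & need) == need)
            best = &routines[i];
    }
    return best;
}
```

On a CPU that reports no optional features, select_routine(0) returns the blockcpy_64 entry regardless of which wider routines were compiled in.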

Comment 14 Mamoru TASAKA 2018-12-21 15:33:48 UTC
So is the method in psm_hal_gen1_spio.c for detecting the supported instruction set wrong for the Intel(R) Xeon(R) series?

Comment 15 russell.w.mcguire 2018-12-21 20:12:17 UTC
> No. Fedora still supports plain x86_64 (i.e. SSE2-only), so failing to run
> on such (admittedly old) hardware is still a bug. If you disagree, feel free
> to open a FESCo ticket.

I think I see another combined issue that has caused this to arise now and not in the past.

libfabric is being used here, and this likely came recently as a new default within OpenMPI.
libfabric will attempt to initialize ALL providers even if their hardware is not present, in effect forcing execution of libpsm2 on unsupported hardware.

Technically, one solution is NOT building libfabric with libpsm2 for THIS older machine configuration, as the libfabric on this machine is incompatible with its hardware.
However, I don't like the idea of removing libpsm2, as this test case is unique to this machine configuration and Intel wants libpsm2 to remain enabled by default within libfabric.

So the real question here is: does this test machine actually have OmniPath hardware present, with the executed code paths resulting from a real init? Or is there no OmniPath hardware present, and this is basic init code that would normally just return an error but cannot, because some variation of memcpy() is being invoked with AVX instructions?

My goal here is to understand the environment. One solution might be to simply ensure that the libpsm2 init pathways are clean and run only SSE4.2 instructions (say, via some #pragmas) and leave the rest of the program stack unaffected. Removing AVX2, or even falling back to AVX1, will have a negative impact on performance for HPC customers. It would be best to keep the instructions in the code, but just clean up the init pathways for unsupported machines.

The bottom line is that this older platform is NOT compatible, so we need to clean up enough to keep libpsm2 in the distro while maintaining performance (and thus the entire reason for purchasing a 100Gbps card).

Thoughts?

Comment 16 Orion Poplawski 2018-12-21 23:17:36 UTC
This is probably the more fruitful approach - to get openmpi and/or libfabric to avoid calling into psm2 when not needed.  The machine(s) in question at the moment are the COPR builders - really no idea what hardware they have.

Comment 17 russell.w.mcguire 2018-12-23 05:05:09 UTC
I think there were additions to the psm2 provider in libfabric recently to avoid calling into libpsm2, and thus psm2_init(), if the hfi1 OmniPath driver is not actually running on the machine (i.e. /dev/hfi1_<N> is not present). If this patch can be pulled in, it should resolve this issue.
Perhaps we can find the version of libfabric and the psm2 provider being tested and see if we find this patch to address the problem?

Comment 18 aravind.gopalakrishnan 2018-12-26 18:04:05 UTC
I looked at the code in libfabric v1.6.0 (in prov/psm2/src/psmx2_init.c). psmx2_unit_active() does check for the presence of an active unit at fi_getinfo() time and, if none is present, it errors out (you may need to set FI_LOG_LEVEL=info to see a relevant error message).
(https://github.com/ofiwg/libfabric/blob/master/prov/psm2/src/psmx2_init.c#L269)

So the question now is whether any real hardware is present on the COPR builders. As Russ mentioned in comment #10, the Release Notes state that Haswell or newer CPUs are required. [FYI, a link to a newer version of the release notes: https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Fabric_Software_10_8_RN_K21143_v3_0.pdf ]

If there is OPA hardware on the systems, could you please remove it and retry? (With Open MPI OFI MTL, you may also have to set "-mca mtl_ofi_provider_include sockets" parameter on command line as well)

Comment 19 Levi Morrison 2019-05-31 16:36:28 UTC
I'm pretty sure I'm hitting this issue. Here's the machine's processor:

model name	: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d

It has InfiniBand installed, and does not have OPA hardware installed. This same OS image gets used on a machine which does have OPA, which is why we have psm2 installed.

Do you know which commits or versions of libfabric contain the /dev/hfi1_* check? I would like to help move this forward if I can.

Comment 20 Ben Cotton 2019-08-13 16:59:23 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 22 Ben Cotton 2020-11-03 16:51:20 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 31 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 23 Ben Cotton 2020-11-24 18:14:38 UTC
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 24 Ben Boeckel 2021-05-03 21:15:43 UTC
This is still an issue in Fedora 33. I'm seeing SIGILL in Fedora-using containers under `libfabric`:

Stack trace (most recent call last):
#12   Object "", at 0xffffffffffffffff, in 
#11   Object "/builds/gitlab-kitware-sciviz-ci/build/tests/kd-tree-test2", at 0x40c1ad, in _start
#10   Object "/usr/lib64/libc-2.32.so", at 0x7f8b991e01e1, in __libc_start_main
#9    Object "/builds/gitlab-kitware-sciviz-ci/build/tests/kd-tree-test2", at 0x40ad8f, in main
#8    Object "/usr/lib64/openmpi/lib/libmpi.so.40.20.5", at 0x7f8b9960744a, in PMPI_Init_thread
#7    Object "/usr/lib64/openmpi/lib/libmpi.so.40.20.5", at 0x7f8b99666f94, in ompi_mpi_init
#6    Object "/usr/lib64/openmpi/lib/libmpi.so.40.20.5", at 0x7f8b99627fca, in mca_bml_base_init
#5    Object "/usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so", at 0x7f8b9616f177, in mca_bml_r2_component_init
#4    Object "/usr/lib64/openmpi/lib/libopen-pal.so.40.20.5", at 0x7f8b98f7c988, in mca_btl_base_select
#3    Object "/usr/lib64/openmpi/lib/openmpi/mca_btl_usnic.so", at 0x7f8b9615a32f, in usnic_component_init
#2    Object "/usr/lib64/libfabric.so.1.15.1", at 0x7f8b95f5357c, in fi_getinfo
#1    Object "/usr/lib64/libfabric.so.1.15.1", at 0x7f8b95f4fc26, in fi_ini
#0    Object "/usr/lib64/libfabric.so.1.15.1", at 0x7f8b9605a270, in fi_psm3_ini
Illegal instruction (Illegal operand [0x7f8b9605a270])

Comment 25 david08741 2021-05-17 10:37:34 UTC
I downgraded to libfabric-1.11.2-1.fc33.x86_64 and I get no error if libfabric is called with valid arguments.

Also 1.12.0-0.1 is working for me; only libfabric-1.12.1-1 is broken.

Let me know if I can test anything to help with this.

Comment 26 Honggang LI 2021-05-17 10:42:11 UTC
(In reply to david08741 from comment #25)
> I downgraded to libfabric-1.11.2-1.fc33.x86_64 and I get no error if
> libfabric is called with valid arguments.
> 
> Also 1.12.0-0.1 is working for me; only libfabric-1.12.1-1 is broken.

libfabric-1.12.1-1 is the first release that supports psm3.

Comment 27 Ben Cotton 2021-11-04 14:01:36 UTC
This message is a reminder that Fedora 33 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '33'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 33 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 30 Ben Boeckel 2021-11-05 14:19:06 UTC
Fedora 34 still has AVX256 instructions in `libpsm2`; not sure if they're guarded by runtime checks or not (just by inspecting the disassembly).

Comment 31 Ben Cotton 2022-05-12 16:21:23 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 32 Ben Cotton 2022-06-08 00:24:53 UTC
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07.

Fedora Linux 34 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release.

Thank you for reporting this bug and we are sorry it could not be fixed.

