Bug 1792236 - update libfabric (1.7.0 to 1.7.2) to mainstream minor-release
Summary: update libfabric (1.7.0 to 1.7.2) to mainstream minor-release
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libfabric
Version: 7.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.9
Assignee: Honggang LI
QA Contact: zguo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-17 10:51 UTC by eloi.gaudry@fft.be
Modified: 2020-09-29 19:25 UTC (History)
3 users

Fixed In Version: libfabric-1.7.2-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 19:25:25 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
startup log with FI_VERBOSE_LEVEL=debug (66.99 KB, text/plain)
2020-01-17 11:55 UTC, eloi.gaudry@fft.be
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:3870 0 None None None 2020-09-29 19:25:42 UTC

Description eloi.gaudry@fft.be 2020-01-17 10:51:55 UTC
Description of problem:
Various OFI issues occurring with InfiniBand hardware were corrected upstream:
https://github.com/ofiwg/libfabric/releases/tag/v1.7.2


Version-Release number of selected component (if applicable):
https://github.com/ofiwg/libfabric/releases/tag/v1.7.2

How reproducible:
yum install 

Steps to Reproduce:
1. yum install libibverbs libibverbs-utils ibutils infiniband-diags opensm libfabric
2. systemctl status rdma
3. systemctl enable rdma
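As a minimal sketch of verifying the reproduce precondition, the installed libfabric version can be compared against the fixed build (libfabric-1.7.2-1.el7 per "Fixed In Version"); the hard-coded version string below is a stand-in for the real `rpm -q` output.

```shell
# Check whether the installed libfabric predates the fixed build.
# "1.7.0" is a stand-in for: rpm -q --qf '%{VERSION}\n' libfabric
installed="1.7.0"
fixed="1.7.2"
# sort -V orders version strings numerically, so head -n1 yields the
# older of the two.
oldest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
  echo "libfabric $installed predates the fix ($fixed)"
fi
```

If the echo fires, the host still carries the 1.7.0 package and is affected by the upstream 1.7.2 fixes.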

Actual results:
Run an MPI job using an InfiniBand provider (FI_PROVIDER) and many issues may arise, especially linked to CQ errors

Expected results:
No errors when running with IB

Additional info:

Comment 2 Honggang LI 2020-01-17 10:59:26 UTC
(In reply to eloi.gaudry from comment #0)
 
> Actual results:
> Run an MPI job using an InfiniBand provider (FI_PROVIDER) and many issues
> may arise, especially linked to CQ errors

Please provide the details of the MPI job. We need to know the shell command you ran and
the output of the MPI job.

Comment 3 eloi.gaudry@fft.be 2020-01-17 11:54:25 UTC
Here is an issue happening at the end of the computation for instance:

libfabric:ofi_rxm:ep_ctrl:rxm_listener_close():1964<warn> Unable to close msg EQ
libfabric:ofi_rxm:ep_ctrl:rxm_ep_close():1994<warn> Unable to close msg CQ
Abort(807494159) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(367)...............: MPI_Finalize failed
PMPI_Finalize(278)...............:
MPID_Finalize(1033)..............:
MPIDI_OFI_mpi_finalize_hook(1536): OFI endpoint close failed (ofi_init.c:1536:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)


Here is the debug information from OFI at startup:
see attached slog.txt 
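For reference, a sketch of how such a debug log can be captured: current libfabric releases read FI_LOG_LEVEL (FI_VERBOSE_LEVEL is not a documented libfabric variable), and the mpiexec invocation and application name below are illustrative, not this site's actual command.

```shell
# Turn on libfabric debug logging and pin the verbs provider, then
# redirect stderr (where libfabric logs) to a file for attachment.
export FI_LOG_LEVEL=debug
export FI_PROVIDER=verbs
# Only attempt the run where an MPI launcher is actually installed;
# ./my_mpi_app is a hypothetical application binary.
if command -v mpiexec >/dev/null 2>&1; then
  mpiexec -n 2 ./my_mpi_app 2> slog.txt || true
fi
```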

Here is the ibstat output for the computing node:
[root@nodeivb01 ~]# ibstat
CA 'mlx4_0'
	CA type: MT26428
	Number of ports: 1
	Firmware version: 2.9.1000
	Hardware version: b0
	Node GUID: 0x002590ffff07f920
	System image GUID: 0x002590ffff07f923
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 1
		LMC: 0
		SM lid: 8
		Capability mask: 0x0251086a
		Port GUID: 0x002590ffff07f921
		Link layer: InfiniBand

And the command line used to start the job (nothing fancy):
mpiexec.hydra -genvall -print-rank-map -genv I_MPI_JOB_CONTEXT=406419_11611 -machinefile=/home/eg/Tests/nacelle/hostfile.406419 -n 2 -rr -cleanup /opt/fft/actran_product/Actran_2021.b.130092/bin/actranpy_mp --mem=20000 --parallel=axisymmetricorder --debug --scratch=/scratch/cluster/406419/scratch --inputfile=/home/eg/Tests/nacelle/nacelle_case1.dat --threads=1 --report=report.406419
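Before launching a job like the one above, the providers the node actually exposes can be listed with fi_info, which ships with libfabric; targeting "verbs" here is an assumption based on the ofi_rxm warnings in the error output, not a confirmed detail of this site's FI_PROVIDER setting.

```shell
# List the verbs-capable OFI providers on this node, if the
# libfabric utilities are installed; fall back to a hint otherwise.
if command -v fi_info >/dev/null 2>&1; then
  fi_info -p verbs || true
else
  echo "fi_info not installed; install the libfabric package"
fi
```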

Comment 4 eloi.gaudry@fft.be 2020-01-17 11:55:04 UTC
Created attachment 1653043 [details]
startup log with FI_VERBOSE_LEVEL=debug

Comment 5 Honggang LI 2020-01-17 12:43:54 UTC
Hi,

 As we are in the late stage of RHEL-7.8, only blocker issues will be addressed in RHEL-7.8.
If you think this is a blocker issue, please provide business justification. Otherwise,
this issue will be deferred to RHEL-7.9.

Thanks

Comment 6 eloi.gaudry@fft.be 2020-01-17 12:47:01 UTC
(In reply to Honggang LI from comment #5)
> Hi,
> 
>  As we are in the late stage of RHEL-7.8, only blocker issues will be addressed
> in RHEL-7.8.
> If you think this is a blocker issue, please provide business
> justification. Otherwise,
> this issue will be deferred to RHEL-7.9.
> 
> Thanks

Hi, 
I can only speak for customers that would still be using 7.7 (or 7.5, as far as I know).
I'm a developer for a company that provides numerical simulation software; if the dependencies for this package and its own dependencies are too extensive, then I can understand technically why this would be deferred.
Thanks,

Comment 7 Honggang LI 2020-01-17 12:56:24 UTC
For a released RHEL version, such as RHEL-7.7, Red Hat fixes issues via z-stream errata.
Before a bug is addressed by z-stream, it must be fixed in an upcoming release.

If you want this bug fixed for RHEL-7.7/7.6/7.5, it must first be fixed in RHEL-7.8
or RHEL-7.9, and then it _MAY_ be fixed in RHEL-7.7 with the right business justification.

If this bug is fixed in RHEL-7.9 and there is no business justification for a z-stream, there will
be no fix for RHEL-7.7; customers will have to update to RHEL-7.9.

Comment 8 eloi.gaudry@fft.be 2020-01-17 12:58:49 UTC
ok, thanks for the explanations.

Comment 14 eloi.gaudry@fft.be 2020-03-26 15:26:06 UTC
I guess that upgrading the ucx package simultaneously would make sense too (the current upstream stable version is 1.7.x, while RHEL-7 and RHEL-8 are stuck at 1.4.x).

Comment 15 Honggang LI 2020-03-26 23:47:49 UTC
(In reply to eloi.gaudry from comment #14)
> I guess that upgrading the ucx package simultaneously would make sense too
> (the current upstream stable version is 1.7.x, while RHEL-7 and RHEL-8 are
> stuck at 1.4.x).

1) One bug for one issue. If you want a ucx update, please file a new bug for ucx.
2) We already updated to ucx-1.6 for RHEL-8.2.
3) ucx is really buggy with openmpi; we have observed several ucx-1.6 issues on
RHEL-8.
4) ucx-1.4 -> ucx-1.7 changes the API, so we would have to rebuild/update all packages
that depend on ucx, for example openmpi. This is not acceptable for RHEL-7.9, as it is in
the third phase of the product life cycle, which means only bug fixes are allowed. ucx-1.4 ->
ucx-1.7 fixes bugs, but also changes the API and adds new features, which is not allowed.

Comment 16 eloi.gaudry@fft.be 2020-03-27 07:51:20 UTC
I understand your point.

Maybe the right move would be to just update to 1.5.x (no big API change), as this version has become mandatory for IntelMPI (commercial) based applications.
There are tons of bug fixes, plus it was tested on lots of configurations.

If you think this doesn't make sense, I will not open another ticket.
Thanks,

Comment 17 Honggang LI 2020-03-27 08:34:32 UTC
ucx (master)]$ git log --oneline v1.4.0..v1.5.0  | wc -l
721

ucx (master)]$ git diff v1.4.0..v1.5.0  | wc -l
28478

Sorry, we can't: too many changes for product phase three.

Comment 20 errata-xmlrpc 2020-09-29 19:25:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (rdma-core bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3870

