Bug 1792236
| Summary: | update libfabric (1.7.0 to 1.7.2) to mainstream minor-release | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | eloi.gaudry <eloi.gaudry> | ||||
| Component: | libfabric | Assignee: | Honggang LI <honli> | ||||
| Status: | CLOSED ERRATA | QA Contact: | zguo <zguo> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 7.7 | CC: | honli, rdma-dev-team, zguo | ||||
| Target Milestone: | rc | Keywords: | Bugfix | ||||
| Target Release: | 7.9 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | libfabric-1.7.2-1.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2020-09-29 19:25:25 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
eloi.gaudry@fft.be
2020-01-17 10:51:55 UTC
(In reply to eloi.gaudry from comment #0)
> Actual results:
> Run an MPI job using an InfiniBand provider (FI_PROVIDER) and lots of issues
> may arise, especially linked to CQ errors

Please provide the details of the MPI job. We need to know the shell command you ran and the output of the MPI job.

Here is an issue that happens at the end of the computation, for instance:

    libfabric:ofi_rxm:ep_ctrl:rxm_listener_close():1964<warn> Unable to close msg EQ
    libfabric:ofi_rxm:ep_ctrl:rxm_ep_close():1994<warn> Unable to close msg CQ
    Abort(807494159) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
    PMPI_Finalize(367)...............: MPI_Finalize failed
    PMPI_Finalize(278)...............:
    MPID_Finalize(1033)..............:
    MPIDI_OFI_mpi_finalize_hook(1536): OFI endpoint close failed (ofi_init.c:1536:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)

Here is the debug information from OFI at startup: see attached slog.txt

Here is the ibstat output for the computing node:

    [root@nodeivb01 ~]# ibstat
    CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff07f920
        System image GUID: 0x002590ffff07f923
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 40
            Base lid: 1
            LMC: 0
            SM lid: 8
            Capability mask: 0x0251086a
            Port GUID: 0x002590ffff07f921
            Link layer: InfiniBand

And the command line used to start the job (nothing fancy):

    mpiexec.hydra -genvall -print-rank-map -genv I_MPI_JOB_CONTEXT=406419_11611 -machinefile=/home/eg/Tests/nacelle/hostfile.406419 -n 2 -rr -cleanup /opt/fft/actran_product/Actran_2021.b.130092/bin/actranpy_mp --mem=20000 --parallel=axisymmetricorder --debug --scratch=/scratch/cluster/406419/scratch --inputfile=/home/eg/Tests/nacelle/nacelle_case1.dat --threads=1 --report=report.406419

Created attachment 1653043 [details]
startup log with FI_VERBOSE_LEVEL=debug
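The details requested above (package version, provider list, ibstat output) can be gathered in one pass. The following is only a minimal sketch, not the reporter's actual procedure. It assumes the `fi_info` utility shipped with libfabric and the documented `FI_LOG_LEVEL` variable (the attachment title mentions `FI_VERBOSE_LEVEL`, which may be a different spelling of the same idea). The function name `collect_fabric_info`, the default file name, and the default provider `verbs` are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Sketch: gather libfabric diagnostics into one file for a bug report.
# collect_fabric_info, fabric-report.txt and the "verbs" default are
# hypothetical; adjust them to the provider actually set in FI_PROVIDER.
collect_fabric_info() {
    out=${1:-fabric-report.txt}
    provider=${2:-verbs}
    {
        echo "== libfabric package =="
        if command -v rpm >/dev/null 2>&1; then
            rpm -q libfabric 2>&1
        else
            echo "rpm not available"
        fi

        echo "== providers =="
        if command -v fi_info >/dev/null 2>&1; then
            # -p restricts the listing to one provider; debug logging goes to stderr
            FI_LOG_LEVEL=debug fi_info -p "$provider" 2>&1
        else
            echo "fi_info not available"
        fi

        echo "== ibstat =="
        if command -v ibstat >/dev/null 2>&1; then
            ibstat 2>&1
        else
            echo "ibstat not available"
        fi
    } > "$out"
    # Print the report path so callers can attach the file
    echo "$out"
}
```

Calling `collect_fabric_info report.txt verbs` on the compute node writes the three sections to `report.txt`; tools that are missing are noted in the report rather than causing a failure.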
Hi,

As we are in the late stage of RHEL-7.8, only blocker issues will be addressed in RHEL-7.8. If you think this is a blocker issue, please provide a business justification. Otherwise, this issue will be deferred to RHEL-7.9.

Thanks

(In reply to Honggang LI from comment #5)
> Hi,
>
> As we are in the late stage of RHEL-7.8, only blocker issues will be addressed
> in RHEL-7.8.
> If you think this is a blocker issue, please provide a business
> justification. Otherwise,
> this issue will be deferred to RHEL-7.9.
>
> Thanks

Hi,

I can only speak for customers that would still be using 7.7 (or 7.5, as far as I know). I'm a developer for a company that provides numerical simulation software. If the set of dependencies for this package, and their own dependencies, is too large, then I can understand technically why this would be deferred.

Thanks,

For an already released RHEL version, such as RHEL-7.7, Red Hat fixes issues via z-stream errata. Before a bug can be addressed in a z-stream, it must be fixed in the upcoming release. If you want this bug fixed for RHEL-7.7/7.6/7.5, it must first be fixed in RHEL-7.8 or RHEL-7.9, and then it _MAY_ be fixed in RHEL-7.7 with the right business justification. If this bug is fixed in RHEL-7.9 and there is no business justification for a z-stream, there will be no fix for RHEL-7.7; customers will have to update to RHEL-7.9.

OK, thanks for the explanations.
I guess that upgrading the ucx package at the same time would make sense too (the current upstream stable version is 1.7.x, while RHEL-7 and RHEL-8 are stuck at 1.4.x).

(In reply to eloi.gaudry from comment #14)
> I guess that upgrading the ucx package at the same time would make sense too
> (the current upstream stable version is 1.7.x, while RHEL-7 and RHEL-8 are stuck
> at 1.4.x).

1) One bug per issue. If you want a ucx update, please file a new bug for ucx.
2) We have updated ucx to 1.6 for RHEL-8.2.
3) ucx is really buggy with openmpi; we have observed several ucx-1.6 issues on RHEL-8.
4) ucx-1.4 --> ucx-1.7 changes the API, so we would have to rebuild or update all packages that depend on ucx, for example openmpi. This is not acceptable for RHEL-7.9, as it is in the third production phase, which means only bug fixes are allowed. ucx-1.4 -> ucx-1.7 fixes bugs, but it also changes the API and adds new features, which is not allowed.

I understand your point. Maybe the right move would be to just update to 1.5.x (no big API change), as this version has become mandatory for Intel MPI (commercial) based applications. There are tons of bug fixes, and it was tested on lots of configurations. If you think this doesn't make sense, I will not open another ticket.

Thanks,

    ucx (master)]$ git log --oneline v1.4.0..v1.5.0 | wc -l
    721
    ucx (master)]$ git diff v1.4.0..v1.5.0 | wc -l
    28478

Sorry, we can't: there are too many changes for production phase three.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (rdma-core bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3870
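To see whether a given host already carries the update named in "Fixed In Version" (libfabric-1.7.2-1.el7), the installed version can be compared against 1.7.2. This is a rough sketch rather than an official check: the helper name `version_ge` is hypothetical, and the comparison relies on `sort -V` from GNU coreutils, which is assumed to be present.

```shell
#!/bin/sh
# Sketch: report whether the installed libfabric is at least the fixed version.
# version_ge is a hypothetical helper; it returns 0 when $1 >= $2.
version_ge() {
    # With version sort, the smaller of the two versions comes first
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

FIXED=1.7.2   # upstream version taken from the "Fixed In Version" field

if command -v rpm >/dev/null 2>&1 && rpm -q libfabric >/dev/null 2>&1; then
    # head -n 1 guards against multiple installed architectures
    installed=$(rpm -q --qf '%{VERSION}\n' libfabric | head -n 1)
    if version_ge "$installed" "$FIXED"; then
        echo "libfabric $installed includes the update"
    else
        echo "libfabric $installed predates $FIXED"
    fi
else
    echo "libfabric is not installed via rpm on this host"
fi
```

Only the upstream version is compared here; the release suffix (`-1.el7`) is ignored, so a z-stream rebuild of an older version would not be detected by this check.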