Bug 1770184 - undocumented openmpi update breaks multiple applications
Summary: undocumented openmpi update breaks multiple applications
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: openmpi
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Honggang LI
QA Contact: Afom T. Michael
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-08 12:01 UTC by Dave Love
Modified: 2020-08-14 02:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-14 02:34:36 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Dave Love 2019-11-08 12:01:25 UTC
Description of problem:

Builds of several MPI applications in EPEL and elsewhere have started to fail following RHEL 8.1 updates.  They now crash in tests:

https://koschei.fedoraproject.org/affected-by/openmpi-devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.1&release2=3.el8&collection=epel8

I can't see this change in the release notes, and I'd expect at least NetCDF, which is one of the failures, to be part of MPI testing.  OpenMPI maintenance in RHEL has long been problematic, despite expertise in the Fedora community.  Is there something they can do to help?

Version-Release number of selected component (if applicable):

openmpi-0:4.0.1-3.el8.x86_64

Comment 1 Jarod Wilson 2019-11-09 01:26:02 UTC
(In reply to Dave Love from comment #0)
> Description of problem:
> 
> Builds of several MPI applications in EPEL and elsewhere have started to
> fail following RHEL 8.1 updates.  They now crash in tests:
> 
> https://koschei.fedoraproject.org/affected-by/openmpi-
> devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.
> 1&release2=3.el8&collection=epel8
> 
> I can't see this change in the release notes,

It's covered by rdma stack update errata, was done at the request of various people.

> and I'd expect at least
> NetCDF, which is one of the failures, to be part of MPI testing.

Nope, it's not.

> OpenMPI
> maintenance in RHEL has long been problematic, despite expertise in the
> Fedora community.  Is there something they can do to help?

Part of the problem here is that the internal package maintainers aren't actually OpenMPI *users*, and since somewhat recently taking over package maintainership, I've never been aware of anything to test with other than the mpitests package we ship. Part of the problem is UCX though, which we had demands to enable, but it seems that the version of UCX in 8.1 is... not great. We've got UCX 1.6.1 coming in 8.2 that ought to improve things. Not sure what the solution is for current 8.1 issues though...

Comment 2 Orion Poplawski 2019-11-09 03:23:17 UTC
The netcdf failure seems UCX related - I think we recently dealt with that in Fedora.

Comment 3 Dave Love 2019-11-11 12:52:04 UTC
> > I can't see this change in the release notes,
> 
> It's covered by rdma stack update errata, was done at the request of various
> people.

The only reference I can see is to a secret bug report about a fix to make it
installable against opensm, without mentioning a major version change.

> Part of the problem here is that the internal package maintainers aren't
> actually OpenMPI *users*, and since somewhat recently taking over package
> maintainership, I've never been aware of anything to test with other than
> the mpitests package we ship.

That's what I mean.  Orion can doubtless suggest other tests, from Fedora experience in particular.

> Part of the problem is UCX though, which we
> had demands to enable, but it seems that the version of UCX in 8.1 is... not
> great. We've got UCX 1.6.1 coming in 8.2 that ought to improve things. Not
> sure what the solution is for current 8.1 issues though...

Yes, it's a known disaster area, along with problematic upstream openmpi maintenance.
The ofi mtl probably works, with worse latency.  The openib btl should also work, as before.
They can be defaulted in openmpi-mca.conf.
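
A minimal sketch of what such defaults might look like, assuming the file uses Open MPI's usual MCA parameter-file syntax (one "name = value" pair per line, as in the system-wide openmpi-mca-params.conf). The helper render_mca_params and the two dictionaries are hypothetical names introduced for illustration. The component selections (cm pml with the ofi mtl, or ob1 pml with the openib btl) are untested assumptions about how the two fallbacks above would be chosen. Open MPI 4.0 may need further parameters before openib is permitted on InfiniBand hardware; that is not shown here.

```python
# Hypothetical helper: renders parameters in MCA parameter-file syntax,
# one "name = value" pair per line.
def render_mca_params(params):
    return "".join(f"{name} = {value}\n" for name, value in params.items())


# Assumed fallback 1: route point-to-point traffic through the ofi mtl.
# An mtl is driven by the cm pml, so selecting cm also keeps UCX's pml out.
OFI_DEFAULTS = {"pml": "cm", "mtl": "ofi"}

# Assumed fallback 2: the openib btl "as before", driven by the ob1 pml.
# vader (shared memory) and self (loopback) cover on-node communication.
OPENIB_DEFAULTS = {"pml": "ob1", "btl": "openib,vader,self"}

if __name__ == "__main__":
    # Print the text that would go into the MCA parameter file.
    print(render_mca_params(OFI_DEFAULTS), end="")
```

Only one of the two sets would go into the file at a time. The same selections can also be tried per job with mpirun's --mca option before making them the default.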

Comment 4 Don Dutile (Red Hat) 2019-11-11 14:00:11 UTC
Dave,
Are you using an EPEL version of UCX, or the RHEL version of the pkg?

I know the EPEL version of UCX for RHEL 7 directly conflicts with RHEL's, and creates crashes.
I don't track, nor do we support, EPEL-based pkgs, so if you use an EPEL-based version of a pkg that's also in RHEL, that's the likely mismatch.
As Jarod stated, UCX has been problematic with version-compatibility w/MPI, APIs, etc., thus the need to stay on RHEL-only pkgs in this space.

Comment 5 Orion Poplawski 2019-11-11 15:11:04 UTC
There is no UCX package in EPEL8, and it has been retired from EPEL7.

Comment 6 Dave Love 2019-11-11 16:53:27 UTC
> As Jarod stated, UCX has been problematic with version-compatibility w/MPI,
> APIs, etc.,

We know -- that's the point!
Per the link, there are failures in Fedora CI due to RHEL 8.1 appearing, and the tests are only running single-node, not with potentially more interesting failures on an RDMA fabric.

Comment 7 Don Dutile (Red Hat) 2019-11-11 19:32:34 UTC
(In reply to Dave Love from comment #6)
> > As Jarod stated, UCX has been problematic with version-compatibility w/MPI,
> > APIs, etc.,
> 
> We know -- that's the point!
> Per the link, there are failures in Fedora CI due to RHEL 8.1 appearing, and
I'm not sure what the link is between Fedora CI and RHEL-8.1 ...

> the tests are only running single-node, not with potentially more
> interesting failures on an RDMA fabric.

Dave,
Honestly, upstream UCX is doing a poor job on API compatibility.
RHEL has been running into the compat problems as RHEL is typically not 100% in lock-step with upstream.
IMO, the purpose of a user-API is to provide a stable interface, which UCX is not doing.
RHEL doesn't have this problem with other rdma-related libs (libibverbs, OFI, per-device libverbs, etc.), or at least not as frequently; with UCX it happens on nearly every minor & major update.
RHEL notifies upstream when we run into these issues, and tries to work with upstream per normal processes to get bug fixes added to proper release(s).

Once the national labs add RHEL-8(.1), I expect these UCX issues to get resolved more rapidly upstream, as the consumers/users will be using RHEL-8.1, and will report UCX update errors sooner.

Comment 8 Honggang LI 2020-07-13 10:03:10 UTC
(In reply to Dave Love from comment #0)

> https://koschei.fedoraproject.org/affected-by/openmpi-
> devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.
> 1&release2=3.el8&collection=epel8

The page says:
" No packages are known to be affected by this upgrade. "


Can we close this bug now? thanks

Comment 9 Dave Love 2020-08-13 15:09:11 UTC
I can't remember where I saw this, and I haven't had a chance to see whether anything relevant had checks disabled so that it would build, but I suppose it can be closed for now, at least.

I thought I'd said before, but didn't, that it's troublesome that Red Hat ships things known to be problematic, without an indication in release notes, and expects customers to sort it out.  It seems openmpi is broken for a considerable time at least once in every RHEL release.  That makes it difficult to make the case for packaged software, if nothing else.

