Bug 1770184 - undocumented openmpi update breaks multiple applications
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: openmpi
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Honggang LI
QA Contact: Afom T. Michael
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-08 12:01 UTC by Dave Love
Modified: 2020-08-14 02:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-14 02:34:36 UTC
Type: Bug
Target Upstream Version:


Description Dave Love 2019-11-08 12:01:25 UTC
Description of problem:

Builds of several MPI applications in EPEL and elsewhere have started to fail following RHEL 8.1 updates.  They now crash in tests:

https://koschei.fedoraproject.org/affected-by/openmpi-devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.1&release2=3.el8&collection=epel8

I can't see this change in the release notes, and I'd expect at least NetCDF, which is one of the failures, to be part of MPI testing.  OpenMPI maintenance in RHEL has long been problematic, despite expertise in the Fedora community.  Is there something they can do to help?

Version-Release number of selected component (if applicable):

openmpi-0:4.0.1-3.el8.x86_64

Comment 1 Jarod Wilson 2019-11-09 01:26:02 UTC
(In reply to Dave Love from comment #0)
> Description of problem:
> 
> Builds of several MPI applications in EPEL and elsewhere have started to
> fail following RHEL 8.1 updates.  They now crash in tests:
> 
> https://koschei.fedoraproject.org/affected-by/openmpi-devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.1&release2=3.el8&collection=epel8
> 
> I can't see this change in the release notes,

It's covered by rdma stack update errata, was done at the request of various people.

> and I'd expect at least
> NetCDF, which is one of the failures, to be part of MPI testing.

Nope, it's not.

> OpenMPI
> maintenance in RHEL has long been problematic, despite expertise in the
> Fedora community.  Is there something they can do to help?

Part of the problem here is that the internal package maintainers aren't actually OpenMPI *users*, and since somewhat recently taking over package maintainership, I've never been aware of anything to test with other than the mpitests package we ship. Part of the problem is UCX though, which we had demands to enable, but it seems that the version of UCX in 8.1 is... not great. We've got UCX 1.6.1 coming in 8.2 that ought to improve things. Not sure what the solution is for current 8.1 issues though...

Comment 2 Orion Poplawski 2019-11-09 03:23:17 UTC
The netcdf failure seems UCX related - I think we recently dealt with that in Fedora.

Comment 3 Dave Love 2019-11-11 12:52:04 UTC
> > I can't see this change in the release notes,
> 
> It's covered by rdma stack update errata, was done at the request of various
> people.

The only reference I can see is to a secret bug report about a fix to make it
installable against opensm, without mentioning a major version change.

> Part of the problem here is that the internal package maintainers aren't
> actually OpenMPI *users*, and since somewhat recently taking over package
> maintainership, I've never been aware of anything to test with other than
> the mpitests package we ship.

That's what I mean.  Orion can doubtless suggest other tests from Fedora experience, in particular.

> Part of the problem is UCX though, which we
> had demands to enable, but it seems that the version of UCX in 8.1 is... not
> great. We've got UCX 1.6.1 coming in 8.2 that ought to improve things. Not
> sure what the solution is for current 8.1 issues though...

Yes, it's a known disaster area, along with problematic upstream openmpi maintenance.
The ofi mtl probably works, with worse latency.  The openib btl should also work, as before.
They can be defaulted in openmpi-mca-params.conf.
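
A minimal sketch of what such defaults might look like, assuming the choice is between the cm PML with the ofi MTL and the ob1 PML with the openib BTL. The component lists are illustrative assumptions rather than settings tested against RHEL 8.1, and the helper names are hypothetical. Open MPI reads `name = value` lines from its system-wide MCA parameter file (openmpi-mca-params.conf in a stock install) and also honours `OMPI_MCA_<name>` environment variables, so the sketch renders both forms.

```python
# Sketch only: render MCA defaults that steer Open MPI away from the UCX PML.
# The component choices are assumptions for illustration, not verified fixes.

# Option A: the "cm" PML driving the OFI (libfabric) MTL.
OFI_DEFAULTS = {
    "pml": "cm",
    "mtl": "ofi",
}

# Option B: the "ob1" PML driving the openib BTL, plus loopback and shared memory.
OPENIB_DEFAULTS = {
    "pml": "ob1",
    "btl": "openib,self,vader",
    # Assumption: Open MPI 4.0 skips openib on InfiniBand ports unless this is set.
    "btl_openib_allow_ib": "1",
}


def as_conf_lines(params):
    """Format parameters as 'name = value' lines for the MCA parameter file."""
    return [f"{name} = {value}" for name, value in params.items()]


def as_environment(params):
    """Format the same parameters as OMPI_MCA_* environment variables."""
    return {f"OMPI_MCA_{name}": value for name, value in params.items()}
```

Printing `"\n".join(as_conf_lines(OFI_DEFAULTS))` gives lines that could be appended to the parameter file, while `as_environment(...)` is more convenient for trying a setting in a single build or test run before making it the default.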

Comment 4 Don Dutile (Red Hat) 2019-11-11 14:00:11 UTC
Dave,
Are you using an EPEL version of UCX, or the RHEL version of the pkg?

I know the EPEL version of UCX for RHEL 7 directly conflicts with RHEL's, and creates crashes.
I don't track, nor do we support, EPEL-based pkgs, so if you use an EPEL-based version of a pkg that's also in RHEL, that's the likely mismatch.
As Jarod stated, UCX has been problematic with version-compatibility w/MPI, APIs, etc., thus the need to stay on RHEL-only pkgs in this space.

Comment 5 Orion Poplawski 2019-11-11 15:11:04 UTC
There is no UCX package in EPEL8, and it has been retired from EPEL7.

Comment 6 Dave Love 2019-11-11 16:53:27 UTC
> As Jarod stated, UCX has been problematic with version-compatibility w/MPI,
> APIs, etc.,

We know -- that's the point!
Per the link, there are failures in Fedora CI due to RHEL 8.1 appearing, and the tests are only running single-node, not with potentially more interesting failures on an RDMA fabric.

Comment 7 Don Dutile (Red Hat) 2019-11-11 19:32:34 UTC
(In reply to Dave Love from comment #6)
> > As Jarod stated, UCX has been problematic with version-compatibility w/MPI,
> > APIs, etc.,
> 
> We know -- that's the point!
> Per the link, there are failures in Fedora CI due to RHEL 8.1 appearing, and
I'm not sure what links Fedora CI to RHEL-8.1 ...

> the tests are only running single-node, not with potentially more
> interesting failures on an RDMA fabric.

Dave,
Honestly, upstream UCX is doing a poor job on API compatibility.
RHEL has been running into the compat problems as RHEL is typically not 100% in lock-step with upstream.
IMO, the purpose of a user-API is to provide a stable interface, which UCX is not doing.
RHEL doesn't have this problem with other rdma-related libs (libibverbs, OFI, per-device libverbs, etc.), or at least not as frequently as with UCX (nearly every minor & major update).
RHEL notifies upstream when we run into these issues, and tries to work with upstream per normal processes to get bug fixes added to proper release(s).

Once the national labs add RHEL-8(.1), I expect these UCX issues to get resolved more rapidly upstream, as the consumers/users will be using RHEL-8.1, and will report UCX update errors sooner.

Comment 8 Honggang LI 2020-07-13 10:03:10 UTC
(In reply to Dave Love from comment #0)

> https://koschei.fedoraproject.org/affected-by/openmpi-devel?epoch1=0&version1=3.1.2&release1=5.el8&epoch2=0&version2=4.0.1&release2=3.el8&collection=epel8

The page says:
"No packages are known to be affected by this upgrade."


Can we close this bug now? Thanks.

Comment 9 Dave Love 2020-08-13 15:09:11 UTC
I can't remember where I saw this, and I haven't had a chance to see whether anything relevant had checks disabled so that it would build, but I suppose it can be closed for now, at least.

I thought I'd said before, but didn't, that it's troublesome that Red Hat ships things known to be problematic, without an indication in release notes, and expects customers to sort it out.  It seems openmpi is broken for a considerable time at least once in every RHEL release.  That makes it difficult to make the case for packaged software, if nothing else.

