Bug 1158864
Summary: | split compat-openmpi into subpackages by openmpi version | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Dave Love <dave.love> | ||||
Component: | compat-openmpi | Assignee: | Michal Schmidt <mschmidt> | ||||
Status: | CLOSED ERRATA | QA Contact: | Mike Stowell <mstowell> | ||||
Severity: | unspecified | Docs Contact: | Lenka Špačková <lkuprova> | ||||
Priority: | unspecified | ||||||
Version: | 6.6 | CC: | dledford, jshortt, lkuprova, mschmidt, mstowell, yanwang, zguo | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | compat-openmpi-1.4.3-5.el6 | Doc Type: | Release Note | ||||
Doc Text: |
Changes in Open MPI distribution
Open MPI is an open source Message Passing Interface implementation. The _compat-openmpi_ package, which provides earlier versions of Open MPI for backward compatibility with previous minor releases of Red Hat Enterprise Linux 6, has been split into several subpackages based on the Open MPI version.
The names of the subpackages (and their respective environment module names on the x86_64 architecture) are:
* _openmpi-1.4_ (openmpi-1.4-x86_64)
* _openmpi-1.4-psm_ (openmpi-1.4-psm-x86_64)
* _openmpi-1.5.3_ (compat-openmpi-x86_64, aliased as openmpi-1.5.3-x86_64)
* _openmpi-1.5.3-psm_ (compat-openmpi-psm-x86_64, aliased as openmpi-1.5.3-psm-x86_64)
* _openmpi-1.5.4_ (openmpi-1.5.4-x86_64)
* _openmpi-1.8_ (openmpi-x86_64, aliased as openmpi-1.8-x86_64)
The "yum install openmpi" command in Red Hat Enterprise Linux 6.8 installs the _openmpi-1.8_ package for maximum compatibility with Red Hat Enterprise Linux 6.7. A later version of Open MPI is available in the _openmpi-1.10_ package.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2016-05-11 01:23:10 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1269638, 1270066 | ||||||
Attachments: |
|
I found that the openmpi change is mentioned in the notes, but apparently only buried in the RDMA section with no public information about it.

Also, from its changelog, the compat-openmpi package is supposed to cope with this -- it has the 1.5 libraries. However, it isn't required by the 1.8 package (so you wouldn't know it was required for compatibility if not pulled in by another package), it's labelled as version 1.4, and it needs the relevant module loading (which won't be done by existing scripts). The compat- version number is confusing as I wouldn't expect mixing runtime components between 1.4 and 1.5 to work.

Right, compat-openmpi provides the old versions of openmpi. It should have been documented better. In RHEL 6.8 we may update openmpi once more. Then compat-openmpi would subsume openmpi-1.8 too. I agree it is odd that all the old versions are in a single package whose version number refers to only one of the provided versions. And indeed the fact that they are all installed in the same directory is fragile to say the least.

I'm thinking about splitting compat-openmpi into subpackages (compat-openmpi14, compat-openmpi15, compat-openmpi18).
The module names will change, that's inevitable.

So "openmpi-1.8 is an incompatible update" is not a bug, but the peculiar compat-openmpi packaging is. Changing the title accordingly.

*** Bug 1159807 has been marked as a duplicate of this bug. ***

(In reply to Michal Schmidt from comment #5)
> Right, compat-openmpi provides the old versions of openmpi. It should have been documented better. In RHEL 6.8 we may update openmpi once more.

And break everything on HPC systems _again_? Fair enough to have a more recent version available, but I thought that was what SCLs were for in the RHEL world, although they're pretty much interchangeable with environment modules.

> Then compat-openmpi would subsume openmpi-1.8 too.
> I agree it is odd that all the old versions are in a single package whose version number refers to only one of the provided versions. And indeed the fact that they are all installed in the same directory is fragile to say the least.

If you mean everything lives under %_libdir/compat-openmpi, it's simply broken. The mpi libraries, the compiler wrappers, the orte support, and the MCA components all need to be compatible.

> I'm thinking about splitting compat-openmpi into subpackages (compat-openmpi14, compat-openmpi15, compat-openmpi18).
> The module names will change, that's inevitable.

Does anyone at Red Hat understand HPC generally, and how openmpi works in particular? I don't mean to sound angry, but it seems no-one does, and someone should.

> So "openmpi-1.8 is an incompatible update" is not a bug, but the peculiar compat-openmpi packaging is. Changing the title accordingly.

I consider it a bug if the important stuff on our system fundamentally breaks between point releases without the incompatible change even being mentioned in the release notes. (You can get some idea from the churn in EPEL as a result, and some of that compounded the problems; we have loads of user-compiled and locally-packaged binaries on a service supporting a whole "research-based" university.) People run RHEL-ish systems for stability and I thought there was even a promise of binary compatibility.

(In reply to Dave Love from comment #8)
> (In reply to Michal Schmidt from comment #5)
> > Right, compat-openmpi provides the old versions of openmpi. It should have been documented better. In RHEL 6.8 we may update openmpi once more.
>
> And break everything on HPC systems _again_?

Unfortunately, yes. Although I think we can find a way to minimize it in the future, for this update, it will likely involve some pain.

> Fair enough to have a more recent version available, but I thought that was what SCLs were for in the RHEL world, although they're pretty much interchangeable with environment modules.

OpenMPI is a bit of a red-headed step child in that it isn't just a runtime MPI stack. Because of the evolving nature of the btls it contains, and the fact that it talks directly to RDMA hardware, the updates to OpenMPI are actually hardware enablement updates. I know in comment #2 you mention that you finally found a release note about the OpenMPI update but it was buried in the RDMA section of the release notes. When we first added OpenMPI to RHEL it was to support RDMA networks and clusters. In truth, as far as Red Hat is concerned, OpenMPI and RDMA are inextricably tied. And for most of our OpenMPI-using customers, this is true in their eyes as well. That all said, I think we can do a better job on how we handle the updates.

> > Then compat-openmpi would subsume openmpi-1.8 too. I agree it is odd that all the old versions are in a single package whose version number refers to only one of the provided versions. And indeed the fact that they are all installed in the same directory is fragile to say the least.
>
> If you mean everything lives under %_libdir/compat-openmpi, it's simply broken. The mpi libraries, the compiler wrappers, the orte support, and the MCA components all need to be compatible.

Agreed. The current compat-openmpi is broken. We need a complete, separate install tree for each compat version.

> > I'm thinking about splitting compat-openmpi into subpackages (compat-openmpi14, compat-openmpi15, compat-openmpi18).
> > The module names will change, that's inevitable.
>
> Does anyone at Red Hat understand HPC generally, and how openmpi works in particular? I don't mean to sound angry, but it seems no-one does, and someone should.

We do. Most of our customers use multiple MPIs.
OpenMPI being one (and each release of OpenMPI being separate), mvapich, mvapich2, Intel MPI, and some others. We provide some of these out of the box, some are added after the fact by the customer. In almost all cases, because we don't provide a full cluster scheduling setup out of the box, the customers have their own cluster scheduling software setup, and that usually includes their own means of controlling which MPI is used for each and every job they run. As a result, our out-of-the-box MPI selection setup has probably gotten less exposure than someone might expect it to.

> > So "openmpi-1.8 is an incompatible update" is not a bug, but the peculiar compat-openmpi packaging is. Changing the title accordingly.
>
> I consider it a bug if the important stuff on our system fundamentally breaks between point releases without the incompatible change even being mentioned in the release notes. (You can get some idea from the churn in EPEL as a result, and some of that compounded the problems; we have loads of user-compiled and locally-packaged binaries on a service supporting a whole "research-based" university.) People run RHEL-ish systems for stability and I thought there was even a promise of binary compatibility.

The RDMA stack in general does not follow the normal RHEL ABI promise. This has been true ever since we included the stack. It evolves at such a pace that this sort of ABI promise simply isn't possible. Since all of the MPIs we ship (with the exception of mpich/mpich2, which are only in rhel7 and not rhel6) are directly tied to our RDMA support, they follow the same rule and do not have an active ABI guarantee. We *try* to preserve ABI, we wouldn't even have the broken attempt in compat-openmpi otherwise, but it isn't guaranteed.

The real question is how we can make things better for you and still keep happy our other RDMA customers, who expect the latest release on a treadmill.
I'm inclined to say that Michal's suggestion of subpackages is the best way to go. And I would go further and say the parent openmpi package should pull in the subpackages during updates. And that all packages, including the current package, should use versioned module files, and that we should have a default module file that is a symlink to the current module file version. Then the recommendation for customers would be that once you compile against a version of openmpi, just record what version you compiled against, and on all future runs of the software, use the versioned module name to get the right mpi. This would protect you against future upgrades.

(In reply to Doug Ledford from comment #9)
> (In reply to Dave Love from comment #8)
> > If you mean everything lives under %_libdir/compat-openmpi, it's simply broken. The mpi libraries, the compiler wrappers, the orte support, and the MCA components all need to be compatible.
>
> Agreed. The current compat-openmpi is broken. We need a complete, separate install tree for each compat version.

Yes. I separated them already in compat-openmpi-1.4.3-3.el6.

> ...
> I'm inclined to say that Michal's suggestion of subpackages is the best way to go. And I would go further and say the parent openmpi package should pull in the subpackages during updates.

I believe I have already taken care of this by using Obsoletes directives, so that for example upgrading from openmpi-1.5.4 (in RHEL 6.3/6.4/6.5) should install both openmpi-1.10.1 and compat-openmpi15-1.5.4.

> And that all packages, including the current package, should use versioned module files, and that we should have a default module file that is a symlink to the current module file version. Then the recommendation for customers would be that once you compile against a version of openmpi, just record what version you compiled against, and on all future runs of the software, use the versioned module name to get the right mpi.
> This would protect you against future upgrades.

So in a RHEL X.Y release "ls -l /etc/modulefiles/mpi" would look something like this:

  -rw-r--r--. 1 root root ... openmpi-1.4-x86_64
  -rw-r--r--. 1 root root ... openmpi-1.5-x86_64
  -rw-r--r--. 1 root root ... openmpi-1.8-x86_64
  lrwxrwxrwx. 1 root root ... openmpi-x86_64 -> openmpi-1.8-x86_64

then in RHEL X.(Y+1) it might look like this:

  -rw-r--r--. 1 root root ... openmpi-1.10-x86_64
  -rw-r--r--. 1 root root ... openmpi-1.4-x86_64
  -rw-r--r--. 1 root root ... openmpi-1.5-x86_64
  -rw-r--r--. 1 root root ... openmpi-1.8-x86_64
  lrwxrwxrwx. 1 root root ... openmpi-x86_64 -> openmpi-1.10-x86_64

I like this.

The only remaining question is what to do in RHEL 6.8 specifically. RHEL 6.7 obviously did not use this module naming scheme, so customers that currently depend on openmpi-1.8 use its module name "openmpi-x86_64". There will be an openmpi-1.10 in RHEL 6.8. Let's say it will provide a module called "openmpi-1.10-x86_64". But should it alias the module name "openmpi-x86_64" to itself, or should this name remain dedicated to openmpi-1.8 for the rest of RHEL 6 life?

So for RHEL-6.8 I decided to go with the more backwards compatible approach.

First, a summary of what was shipped as the "openmpi" package in each RHEL-6 release:

  openmpi-1.4.1-4.3.el6  RHEL-6.0
  openmpi-1.4.3-1.1.el6  RHEL-6.1
  openmpi-1.5.3-3.el6    RHEL-6.2
  openmpi-1.5.4-1.el6    RHEL-6.3, RHEL-6.4
  openmpi-1.5.4-2.el6    RHEL-6.5
  openmpi-1.8.1-1.el6    RHEL-6.6, RHEL-6.7

The versions up to and including 1.5.3 also had *-psm variants.

The "compat-openmpi" package in RHEL-6.7 was a chimera of openmpi v1.4.3 and v1.5.3. Only the v1.5.3 parts had any chance of working, because during the build process it overwrote many of v1.4.3's files.
In RHEL-6.8 the compat-openmpi-1.4.3-5.el6 source package will generate the following binary packages:

  openmpi-1.4-1.4.3-5.el6
  openmpi-1.4-devel-1.4.3-5.el6
  openmpi-1.4-psm-1.4.3-5.el6
  openmpi-1.4-psm-devel-1.4.3-5.el6
  openmpi-1.5.3-1.5.3-5.el6
  openmpi-1.5.3-devel-1.5.3-5.el6
  openmpi-1.5.3-psm-1.5.3-5.el6
  openmpi-1.5.3-psm-devel-1.5.3-5.el6
  openmpi-1.5.4-1.5.4-5.el6
  openmpi-1.5.4-devel-1.5.4-5.el6
  openmpi-1.8-1.8.1-5.el6
  openmpi-1.8-devel-1.8.1-5.el6

where:

openmpi-1.4 and openmpi-1.5.4 will use directory names /etc/openmpi-$VERSION/, /usr/lib64/openmpi-$VERSION-x86_64 and module names "openmpi-$VERSION-x86_64".

openmpi-1.5.3 will keep using the directory names /etc/compat-openmpi-x86_64, /usr/lib64/compat-openmpi and so on, and the module name "compat-openmpi-x86_64", to maintain compatibility with what was shipped as compat-openmpi in RHEL-6.7. The module file will be symlinked as "openmpi-1.5.3-x86_64".

openmpi-1.8 will keep using the directory names /etc/openmpi-x86_64, /usr/lib64/openmpi and so on, and the module name "openmpi-x86_64", to maintain compatibility with openmpi from RHEL-6.6 & 6.7. The module file will be symlinked as "openmpi-1.8-x86_64". openmpi-1.8-devel Provides "openmpi-devel".

The source package openmpi-1.10.2-2.el6 will generate binary packages:

  openmpi-1.10-1.10.2-2.el6
  openmpi-1.10-devel-1.10.2-2.el6

where openmpi-1.10 will ship files under /etc/openmpi-1.10-x86_64, /usr/lib64/openmpi-1.10 and so on, and the module "openmpi-1.10-x86_64".

Therefore upgrading from RHEL-6.7 should not break anything for openmpi-1.8.1 users and switching to v1.10 will be entirely opt-in.

For RHEL 7 I'd like to switch to the forward-compatible scheme where the directory and module names "openmpi-$VERSION", "openmpi-$VERSION-x86_64" are used consistently by all openmpi versions, and the module name "openmpi-x86_64" is always assigned to the latest openmpi version (or maybe controlled with the alternatives mechanism).

Sounds reasonable to me.
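
For anyone scripting against this layout, the naming rules above can be written down as a small lookup table. The following is only a rough Python sketch, not anything shipped in the packages: the table contents come from the release note and the comment above, while the names RHEL68_MODULES, versioned_module_name, package_for_module and default_alias_target are made up for illustration. The last function models the proposed forward-compatible scheme, where the unversioned default follows the latest version.

```python
import re

# Module names on x86_64 as listed in the release note and the comment above.
# Each entry maps a subpackage to (primary module name, alias or None).
# The dictionary and function names here are hypothetical, for illustration.
RHEL68_MODULES = {
    "openmpi-1.4": ("openmpi-1.4-x86_64", None),
    "openmpi-1.4-psm": ("openmpi-1.4-psm-x86_64", None),
    "openmpi-1.5.3": ("compat-openmpi-x86_64", "openmpi-1.5.3-x86_64"),
    "openmpi-1.5.3-psm": ("compat-openmpi-psm-x86_64", "openmpi-1.5.3-psm-x86_64"),
    "openmpi-1.5.4": ("openmpi-1.5.4-x86_64", None),
    "openmpi-1.8": ("openmpi-x86_64", "openmpi-1.8-x86_64"),
    "openmpi-1.10": ("openmpi-1.10-x86_64", None),
}


def versioned_module_name(package):
    """Return the module name that embeds the Open MPI version.

    This is the name a job script would record to stay pinned to the
    version it was compiled against.
    """
    primary, alias = RHEL68_MODULES[package]
    return alias if alias is not None else primary


def package_for_module(module_name):
    """Return the subpackage that provides a module name (primary or alias)."""
    for package, names in RHEL68_MODULES.items():
        if module_name in names:
            return package
    raise KeyError(module_name)


_VERSIONED = re.compile(r"^openmpi-(\d+(?:\.\d+)*)-x86_64$")


def default_alias_target(module_names):
    """Pick the target of the unversioned default in the forward-compatible
    scheme: the highest version, compared numerically rather than as text."""
    candidates = []
    for name in module_names:
        match = _VERSIONED.match(name)
        if match:
            version = tuple(int(part) for part in match.group(1).split("."))
            candidates.append((version, name))
    if not candidates:
        raise ValueError("no versioned openmpi module names given")
    return max(candidates)[1]
```

The numeric comparison matters because a plain text comparison ranks "openmpi-1.10-x86_64" below "openmpi-1.8-x86_64", which would leave the default pointing at the older version.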
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0967.html |
Created attachment 952074 [details]
example session

Description of problem:

RHEL 6.6 updates openmpi 1.5 to openmpi 1.8. This isn't noted in the list of package updates, and it isn't at all compatible with the previous version 1.5. At least:
* existing Fortran programs no longer run;
* mpirun arguments have changed, and common ones at least produce a lot of warning noise now;
* MCA components and their configuration/defaults have changed (e.g. paffinity is absent).

In mitigation, the update is blocked if you have packaged versions of things installed from current EPEL or self-built, but this will be a real annoyance if EPEL packages start being rebuilt against the new version.

An example session is attached, showing the result of the upgrade. This is a realistic example, though core binding would normally be defaulted in the MCA parameters, and is one of the defaults which changed in 1.8. I haven't debugged why binding now fails, but I have self-built openmpi-1.8 rpms working OK on our production system in parallel with 1.6 ones (compatible with 1.5).

Version-Release number of selected component (if applicable):

1.8.1