Bug 676879 (mpiexec)

Summary: Review Request: mpiexec - MPI job launcher that uses the PBS task interface directly
Product: Fedora
Version: rawhide
Component: Package Review
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Hardware: All
OS: Linux
Reporter: Christos Triantafyllidis <christos.triantafyllidis>
Assignee: Nobody's working on this, feel free to take it <nobody>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: ctrianta, dledford, fedora-package-review, fenlason, mrunge, notting, sergiobelkin, steve.traylen, susi.lehtola, tomspur
Bug Blocks: 201449
Last Closed: 2013-04-30 16:32:51 UTC

Description Christos Triantafyllidis 2011-02-11 16:51:04 UTC
Spec URL: http://svn.hellasgrid.gr/svn/code.grid.auth.gr/mpiexec/trunk/mpiexec.spec
SRPM URL: http://koji.afroditi.hellasgrid.gr/packages/mpiexec/0.83/2_torque_2.3.13.el5/src/mpiexec-0.83-2_torque_2.3.13.el5.src.rpm
Description: 
Mpiexec gathers node settings from PBS, prepares the run environment for the MPI library, and starts tasks through the PBS task manager interface. It attempts to duplicate mpirun as closely as possible, while getting everything correct and being faster than rsh. As a side effect, PBS maintains proper accounting of all tasks of a parallel job and can terminate everything on job abort.

Comment 1 Christos Triantafyllidis 2011-02-11 16:59:07 UTC
Hm... this is my first package, so (if accepted) I'll need sponsorship...

Comment 2 Jason Tibbitts 2011-02-11 17:19:41 UTC
I suggest that you read these two documents and follow the advice therein:
http://fedoraproject.org/wiki/Join_the_package_collection_maintainers
http://fedoraproject.org/wiki/How_to_get_sponsored_into_the_packager_group

Comment 3 Sergio Belkin 2011-02-18 19:23:32 UTC
Hi Christos,

I am (still) not a member of the packaging group; however, I hope you find the following points useful:

1. Fix the Group.
2. Fix the Source URL.
3. Fix the license: enter the right "Short Name" as listed on http://fedoraproject.org/wiki/Licensing:Main#Good_Licenses.

Greets

Comment 4 Christos Triantafyllidis 2011-02-18 23:01:24 UTC
Many thanks Sergio!

(In reply to comment #3)
> Hi Christos,
> 
> I am (still) not a member of the packaging group; however, I hope you find the following points useful:
> 
> 1. Fix the Group.
Fixed! I searched for what others use, and "Applications/Engineering" seems to be the most popular for mpiexec.

> 2. Fix the Source URL
Done.

> 3. Fix the license: Enter the right "Short Name" as is listed on
> http://fedoraproject.org/wiki/Licensing:Main#Good_Licenses.

Ooops... Done.

> 
> Greets

I've also updated it to the latest upstream version.

The updated spec is in the same location:
http://svn.hellasgrid.gr/svn/code.grid.auth.gr/mpiexec/trunk/mpiexec.spec

Updated SRPM:
http://koji.afroditi.hellasgrid.gr/packages/mpiexec/0.84/1_torque_2.3.13.el5/src/mpiexec-0.84-1_torque_2.3.13.el5.src.rpm

Many thanks once more for reviewing this.

BTW, what is the best way to note that this is built against torque 2.3.13? (Currently I'm using the Release tag for this.) I guess one could build it against other queuing systems (if they are in the Fedora/EPEL repos).


Christos

Comment 5 Steve Traylen 2011-02-23 18:14:57 UTC
How does this relate to 

mpich2 which contains:
/usr/lib/mpich2/bin/mpiexec
and
openmpi
which contains
/usr/lib/openmpi/bin/mpiexec


http://fedoraproject.org/wiki/PackagingDrafts/MPI

may have some clues.

As for your last question about signifying which torque version you build against: this is implicit, since the package in a particular EPEL or Fedora release is built against whichever torque is in that platform. You don't need the "torque_XXX" in the release.

Also, the resulting requires:
$ rpm -qp --requires mpiexec-0.84-1_torque_2.3.13.fc14.x86_64.rpm  | grep torque
libtorque.so.2()(64bit)

should be enough to tie the version.

The PBS location changes between EPEL5, EPEL6, and Fedora, so I recommend some conditionals:
http://fedoraproject.org/wiki/DistTag#Conditionals
However, this review is by default for rawhide, so it must be correct for that.
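
For example, something along these lines; the prefixes are only illustrative guesses, so check what the torque package on each target actually provides (and that mpiexec's configure really takes --with-pbs):

%if 0%{?el5}
# EPEL5: hypothetical PBS prefix for the older torque layout
%global pbs_prefix /usr/local/pbs
%else
# EPEL6 and Fedora: hypothetical default prefix
%global pbs_prefix /usr
%endif

%build
%configure --with-pbs=%{pbs_prefix}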

I see there is a GPL LICENSE file but also a "LICENSE.mvapich" file, which is probably BSD. Is this dual licensed or something? Can you clarify?

Steve.

Comment 6 Christos Triantafyllidis 2011-02-24 08:28:59 UTC
Hey Steve,
   thanks for picking this up! Find my answers/comments inline.

(In reply to comment #5)
> How does this relate to 
> 
> mpich2 which contains:
> /usr/lib/mpich2/bin/mpiexec
> and
> openmpi
> which contains
> /usr/lib/openmpi/bin/mpiexec
> 
> 
> http://fedoraproject.org/wiki/PackagingDrafts/MPI
> 
> may have some clues.
> 
I would say that this is yet another mpiexec :).

The benefit of this package is that it integrates with PBS, which means it takes advantage of features like correct accounting for MPI jobs. It also addresses compatibility with many MPI implementations (including openmpi and mpich2). See the vendor's explanation of why there are so many mpiexecs:
http://www.osc.edu/~djohnson/mpiexec/index.php#Too_many_mpiexecs


> As for your last question about signifying which torque version you build against:
> this is implicit, since the package in a particular EPEL or Fedora release is built
> against whichever torque is in that platform. You don't need the "torque_XXX"
> in the release.
> 
> Also, the resulting requires:
> $ rpm -qp --requires mpiexec-0.84-1_torque_2.3.13.fc14.x86_64.rpm | grep torque
> libtorque.so.2()(64bit)
> 
> should be enough to tie the version.

Actually, my concern here is that this mpiexec supports other PBS implementations (e.g. OpenPBS or PBSPro). Currently they are not in Fedora/EPEL, but what if they appear later? Is it safe/OK to remove this from the release for now and revisit it if/when they appear?

> 
> The PBS location changes between EPEL5, EPEL6, and Fedora, so I recommend some conditionals:
> http://fedoraproject.org/wiki/DistTag#Conditionals
> However, this review is by default for rawhide, so it must be correct for that.

This sounds trivial. Is it sufficient to distinguish between EPEL5, EPEL6, and Fedora, or should I also distinguish between torque releases? Even better, is there any way to get this path from the currently installed torque version?

> 
> I see there is a GPL LICENSE file but also a "LICENSE.mvapich" file, which is
> probably BSD. Is this dual licensed or something? Can you clarify?

According to the last line of the vendor's description page:
http://www.osc.edu/~djohnson/mpiexec/index.php#Description

"Mpiexec is free software and is licensed for use under the GNU General Public License, version 2." 

Is this sufficient, or should I clarify this with the vendor?

> 
> Steve.

Many thanks once more for reviewing this.

Christos

Comment 7 Christos Triantafyllidis 2011-02-24 10:41:08 UTC
(In reply to comment #6)

> > 
> > I see there is a GPL LICENSE file but also a "LICENSE.mvapich" file, which is
> > probably BSD. Is this dual licensed or something? Can you clarify?
> 
> According to the last line of the vendor's description page:
> http://www.osc.edu/~djohnson/mpiexec/index.php#Description
> 
> "Mpiexec is free software and is licensed for use under the GNU General Public
> License, version 2." 
> 
> Is this sufficient, or should I clarify this with the vendor?
> 

A closer look revealed that the latest vendor version (0.84) includes files under a different license (3-clause BSD, i.e. without the advertising clause). Given that these sources cannot be split into a separate package, I'll change the license to the multiple-license form (GPLv2 and BSD).

I've also found that the vendor doesn't install the license files (nor the changelog and README) into the documentation area; I'll add them in the next release.
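
In the spec this should look roughly like the following (the exact documentation file names still need to be confirmed against the tarball):

License:        GPLv2 and BSD
...
%files
%doc LICENSE LICENSE.mvapich ChangeLog README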

So I'm waiting for your reply on the naming (whether to add "torque" somewhere or not), and then I'll build the new package/spec.

Christos

Comment 8 Steve Traylen 2011-02-24 11:47:25 UTC
I don't think torque should be present in the name anywhere; this is implicit in the library requires.

The other significant batch system in Fedora is of course gridengine.

I also think it may be worth contacting the MPI group.

Should this mpiexec use the "modules" technology that the other mpiexecs use? That was my real question. It may be worth adding the maintainers of those packages so they can comment.

Steve.

Comment 9 Christos Triantafyllidis 2011-02-24 15:31:19 UTC
The updated spec is in the same location:
http://svn.hellasgrid.gr/svn/code.grid.auth.gr/mpiexec/trunk/mpiexec.spec

Updated SRPM:
http://koji.afroditi.hellasgrid.gr/packages/mpiexec/0.84/2.el5/src/mpiexec-0.84-2.el5.src.rpm

Changes since last release:
- Updated License (GPLv2 and BSD)
- Added documentation files (including licenses)
- Fixed the PBS path (it should refer to the installation path, not the var folder, and thus is not distribution-specific)
- Removed the torque version from release number

The only outstanding issue is whether to install into /usr/bin or into something like /usr/libexec/mpiexec/ and include a module file to load it into PATH; a rough sketch of such a modulefile follows.
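
If we go the modules route, the modulefile could look roughly like this (the /usr/libexec/mpiexec prefix is illustrative; the real paths depend on where the spec actually installs things):

#%Module1.0
## Hypothetical modulefile for mpiexec; adjust paths to the packaged layout.
module-whatis "mpiexec: PBS-aware MPI job launcher"
prepend-path PATH    /usr/libexec/mpiexec/bin
prepend-path MANPATH /usr/libexec/mpiexec/man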

I've contacted people who already maintain an MPI library package (either mpich2 or openmpi) in Fedora/EPEL to get their opinion on this.

Christos

Comment 10 Christos Triantafyllidis 2011-03-04 17:35:53 UTC
It's been more than a week, and still no answer from the people maintaining the other MPI library packages (namely mpich2 and openmpi).

Given that the only (?) outstanding issue was the lack of environment-modules usage, I added it.

The new SPEC file is at the usual location:
http://svn.hellasgrid.gr/svn/code.grid.auth.gr/mpiexec/trunk/mpiexec.spec

Updated SRPM can be found at:
http://koji.afroditi.hellasgrid.gr/packages/mpiexec/0.84/3.el5/src/mpiexec-0.84-3.el5.src.rpm

Changes since last release:
- Added use of environment modules
- bin and man files are no longer in default locations.

Additionally, I did some scratch builds on Fedora's koji:
dist-f14: http://koji.fedoraproject.org/koji/taskinfo?taskID=2885049
dist-f15: http://koji.fedoraproject.org/koji/taskinfo?taskID=2885064
dist-rawhide: http://koji.fedoraproject.org/koji/taskinfo?taskID=2885067
dist-5E-epel: http://koji.fedoraproject.org/koji/taskinfo?taskID=2885142

Finally, I ran rpmlint on the spec and SRPMs:
[ctria@toolbox SPECS]$ rpmlint mpiexec.spec 
0 packages and 1 specfiles checked; 0 errors, 0 warnings.

[ctria@toolbox SRPMS]$ rpmlint mpiexec-0.84-3.*
mpiexec.src: W: spelling-error %description -l en_US mpirun -> Empirin, umpire, Epirus
mpiexec.src: W: spelling-error %description -l en_US rsh -> rah, rs, sh
mpiexec.src: W: spelling-error %description -l en_US mpirun -> Empirin, umpire, Epirus
mpiexec.src: W: spelling-error %description -l en_US rsh -> rah, rs, sh
mpiexec.src: W: spelling-error %description -l en_US mpirun -> Empirin, umpire, Epirus
mpiexec.src: W: spelling-error %description -l en_US rsh -> rah, rs, sh
mpiexec.src: W: spelling-error %description -l en_US mpirun -> Empirin, umpire, Epirus
mpiexec.src: W: spelling-error %description -l en_US rsh -> rah, rs, sh
4 packages and 0 specfiles checked; 0 errors, 8 warnings.

I think these are all perfectly fine.

I hope I haven't missed anything else, but if so, please point it out.

Christos

Comment 11 Doug Ledford 2011-03-16 15:53:28 UTC
Sorry to take so long, but sometimes you're just busy.

So, here's what I see from reading up on the mpiexec website.

First, this doesn't necessarily work with lots of batch systems. It works with OpenPBS, PBS Pro, and Torque. However, all three share the OpenPBS lineage: OpenPBS itself appears to be dead, PBS Pro is (I guess) some group taking OpenPBS and delivering ongoing service and support around it, and Torque is an actively developed fork of OpenPBS.

Second, it's not intended to be multi-MPI friendly, not really anyway. I know the web page talks about a lot of MPI packages, but in truth there are only a few MPI families, with lots of forks within those families. There is the mpich family, which includes mpich, mpich2, mvapich, mvapich2, Intel MPI, etc. Then there is the lam family, which includes lam and openmpi. I don't know of any other open source MPI families that are still alive. There might be other closed source MPIs out there, but we don't care about those. In any case, the mpiexec website basically calls out that lam/openmpi get things right on their own, so there is no need to use mpiexec there, and recommends that you don't. (Side note: this does not surprise me in the least; the mpich family of job-starting daemons has always been nothing more than a bunch of scripts calling rsh, hardly what I would call robust or well designed. It was more like a quick and dirty job to get things running in the early days, and they never went back and did things right later.)

So, for all the talk on the website about how many MPIs and PBSes this supports, it really only supports one MPI family and one PBS family. Given that, this *absolutely* does not belong in the main path. I know that's already been fixed, but I'm putting this here so that no one gets the idea in the future to undo that fix.

What's more, I'm not entirely certain that this will work transparently with different mpich MPIs from a single build. You have to specify the default communication method in the configure script, as well as a few other options. I haven't looked into it, but I know that mvapich and mvapich2 are different enough from mpich and mpich2 that I'm not certain the same build parameters will work for both. If they don't, you might have to build it more than once in the spec file with different options, create an mpiexec base package, and then mpiexec-%{mpiname} subpackages that carry the files specific to each MPI implementation.

As an example, if needed, you could create an mpiexec shell script and place it in the directory specified in your environment modules file. This shell script would then execute %{mpidir}/bin/mpiexec-pbs. You would place the mpiexec binary that this build process spits out into %{mpidir}/bin/mpiexec-pbs for each of the MPIs you intend to support, and put the files into subpackages of mpiexec specific to each MPI implementation. That way, if you need different options for different MPIs, it can be done. However, as I haven't tried to use this program, I don't know whether this is even necessary. Before the package goes into Fedora as is, though, this needs to be tested. It's much easier to fix this before it hits the repos than after.
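
As a very rough, untested sketch of that wrapper idea (the path is a placeholder, not the real packaged location):

#!/bin/sh
# Hypothetical wrapper on the module-provided PATH; dispatches to the
# PBS-aware binary built for this particular MPI stack.
exec /usr/lib64/mpich2/bin/mpiexec-pbs "$@"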

Comment 12 Christos Triantafyllidis 2011-03-16 16:48:19 UTC
Hi Doug,
   I would agree with most of what you write, but I have the following comments:
a) As stated in the package description, this package supports only the PBS batch system family. Given that Fedora has only torque in its repositories, I would say that this package supports all the PBS batch systems available there.

b) Regarding MPI libraries, I totally agree that different implementations use different parameters. The packages available in Fedora at the moment provide the following libraries:
- MPICH2
- OpenMPI

Of those, OpenMPI (as already noted) provides an mpiexec that works as long as OpenMPI is built against the batch system (namely torque). So I find it much more convenient to default to the parameters that MPICH needs, as MPICH is actually the one that creates the need for "yet another mpiexec".

Note that this is the DEFAULT communication method, meaning it can be overridden at any time. I may not know the Fedora packaging internals that well, but I find it a clearly bad idea to have the same code installed multiple times just to avoid the use of a parameter. Are you sure this is actually needed for inclusion in the distribution?

c) The installation should be totally transparent to anyone who doesn't intend to use it, as it is enabled by the relevant environment module. For those who do intend to use it, I guess it is not that inconvenient to add runtime parameters if needed.

d) This build (specifically the 0.83 el5 one) has been used extensively in our grid cluster, in production, without a single error report.

Finally, I'm not an mpiexec fan... we (as a grid site) are using it, and with the current config it seems to make our users/community happy. Given that I'm building this package for our infrastructure anyway, I thought it might be useful for others. If the resulting package is not usable for us, I don't see any reason to maintain it for Fedora :).

Regards,
Christos

Comment 13 Christos Triantafyllidis 2011-06-03 07:04:16 UTC
Hi,
   it's been a long time since the last reply on this bug. I understand that there is no interest in or need for this package from the Fedora/EPEL community.

   Should I close this bug as "won't fix", or leave it open until someone takes an interest?

Christos

Comment 14 Matthias Runge 2012-03-02 20:20:06 UTC
Christos, are you still interested?

If yes, you should do some informal reviews and list the Bugzilla numbers here (as a reference for your potential sponsor).
You need to do something to demonstrate your packaging knowledge to a sponsor.
The corresponding links are in comment 2 and are still valid.

Comment 15 Christos Triantafyllidis 2012-03-02 21:04:27 UTC
Hi Matthias,
   I'm still interested, although I'm not sure whether the latest version here still works, given the major changes on the torque side (the upgrade from 2.3 to 2.5). Given that my setup is still at 2.3, I don't see any urgent reason to put effort into updating this if there is no reviewer for it. If you are willing to review it and sponsor me, I would happily make the needed changes (if any) and rebuild it.

   I have done a few reviews, or simply commented on other package requests, in the past:
https://bugzilla.redhat.com/show_bug.cgi?id=682553
https://bugzilla.redhat.com/show_bug.cgi?id=683587
https://bugzilla.redhat.com/show_bug.cgi?id=772485

   And I have submitted another package too:
https://bugzilla.redhat.com/show_bug.cgi?id=772406

   Let me know if you think I need to do more reviews.

Regards,
Christos

Comment 16 Matthias Runge 2012-03-06 09:30:11 UTC
Christos,

OK, great. Since I'm not a sponsor, we need to wait (or ask sponsors) to see whether your comments in those reviews are sufficient for them.

What's your Fedora account name?

Comment 17 Christos Triantafyllidis 2012-03-06 09:43:30 UTC
Hi Matthias,
   I'll try to do some more reviews (time permitting) in the meantime. My FAS username is ctria.

Christos

Comment 18 Susi Lehtola 2012-03-25 17:40:28 UTC
Hmm, this seems a rather nontrivial review, at least if done properly.

As noted above by Doug in comment #11, the applicability of the package is rather limited. I find it hard to imagine that anyone would run a cluster with current Fedora (or RHEL) and use an antique queueing system that doesn't have out-of-the-box support for MPI, as there are other queue managers (and MPI libraries) which handle this without problems.

This being said, if there is enough interest, the package can of course be included in Fedora. But for a good review to take place, the reviewer should also test if the program works (which, admittedly, doesn't always happen).

Although I'm a heavy user of clusters and queue systems (and have had a hand in writing the Fedora MPI guidelines), I lack the knowledge to properly review this package.

IMHO, this package is a curiosity, and can be said to be obsolete.

If you (or someone else) end up packaging this in Fedora, the name needs to be changed; 'mpiexec' is just too general. To reflect the use case of the package, something like pbs-mpiexec or torque-mpiexec (or -mpich) would be far more suitable and less prone to cause problems.

Comment 19 Christos Triantafyllidis 2012-03-25 22:37:55 UTC
Hi Jussi,
   thanks for your comments.

   Well, I'm not as interested in the Fedora branches as I am in the EPEL ones. I totally understand your point, but I can definitely say that there are many (grid) sites running torque, so there would be some use for this. And Fedora still ships this "obsolete" torque.

   Regarding whether this package is obsolete: as long as torque and MPICH are maintained and used in production, I wouldn't say that this is the case.

   Now, regarding naming, I really don't care; functionality is what matters on my side. If you think that renaming it to "something"-mpiexec or mpiexec-"something" will help, I'm with you. I just used mpiexec because this is the name the vendor used. But for sure, "something"-mpiexec will not be the first guess someone tries when looking to install this package, since the vendor uses plain "mpiexec" (wrongly, in my opinion, too).

    Anyway, I see that there is no will to push this forward, although there is no policy against it (at least none that I'm aware of), so feel free to close this ticket.

Regards,
Christos