Bug 509764

Summary: "mpirun -gdb" produces a time out
Product: Fedora
Component: mpich2
Version: 11
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: low
Reporter: Laurent Aguerreche <laurent.aguerreche+redhat>
Assignee: Deji Akingunola <dakingun>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: buntinas, dakingun
Doc Type: Bug Fix
Last Closed: 2010-06-28 13:29:47 UTC

Description Laurent Aguerreche 2009-07-05 23:34:04 UTC
Description of problem:
The command "mpirun -gdb" does not work and returns these messages:
$ mpirun -gdb my-application
0: /usr/bin/mpdgdbdrv.py:20: DeprecationWarning: The popen2 module is deprecated.  Use the subprocess module.
0:   from popen2 import Popen4
0:  mpdgdbdrv (<module> 107): timed out waiting for initial Breakpoint response

So, it is impossible to debug anything.


How reproducible:
Always.


Any idea?

Comment 1 Deji Akingunola 2009-07-08 19:15:34 UTC
What version of mpich2 has this problem?

Comment 2 Laurent Aguerreche 2009-07-08 21:15:58 UTC
These are the mpich2 packages installed on my system:
$ rpm -qa | grep mpich2
mpich2-libs-1.1-1.fc11.x86_64
mpich2-1.1-1.fc11.x86_64
mpich2-devel-1.1-1.fc11.x86_64

Comment 3 Darius 2009-07-08 21:25:34 UTC
Laurent, 

Are you sure your app is an MPI application and has been compiled with the mpicc from mpich2-devel-1.1-1.fc11.x86_64?

Try this:

  cd /tmp
  mpicc /usr/share/mpich2/examples_graphics/cpi.c -o cpi
  mpiexec -gdb ./cpi

You should get a gdb prompt (without the timeout message).

If this doesn't work, can you post the output of

  which mpicc
  rpm -qf `which mpicc`

Thanks!

Comment 4 Laurent Aguerreche 2009-07-09 13:06:26 UTC
My program was not built with mpicc but directly with g++ and flags from `pkg-config mpich2-ch3` so I rebuilt it.

Unfortunately I do not see any difference. I am sure that my program uses MPI since it is able to run 3 processes on my machine and let them communicate together. My program is quite big and loads many plugins when it starts so would it be possible that the timeout value is too low?

By the way, I built cpi.c and I have been able to run it under gdb...

Comment 5 Deji Akingunola 2009-07-09 13:16:57 UTC
(In reply to comment #4)
> My program was not built with mpicc but directly with g++ and flags from
> `pkg-config mpich2-ch3` so I rebuilt it.
> 
Can you please rebuild your program with mpicxx, and try running 'mpirun -gdb' again?
The Cflags from `pkg-config mpich2-ch3` are not exactly the same flags you'll get from building with mpicxx (check out 'mpicxx -show').
Maybe we need to rework the pkgconfig flags.


> Unfortunately I do not see any difference. I am sure that my program uses MPI
> since it is able to run 3 processes on my machine and let them communicate
> together. My program is quite big and loads many plugins when it starts so
> would it be possible that the timeout value is too low?
> 
> By the way, I built cpi.c and I have been able to run it under gdb...

Comment 6 Darius 2009-07-09 15:42:07 UTC
(In reply to comment #4)
> My program was not built with mpicc but directly with g++ and flags from
> `pkg-config mpich2-ch3` so I rebuilt it.
> 
> Unfortunately I do not see any difference. I am sure that my program uses MPI
> since it is able to run 3 processes on my machine and let them communicate
> together. My program is quite big and loads many plugins when it starts so
> would it be possible that the timeout value is too low?

That is possible.  Can you try moving MPI_Init() to the very beginning of your program (before loading any plugins)?  That may help if it's a timeout problem.

Alternatively, you can edit line 100 of src/pm/mpd/mpdgdbdrv.py and change the "3" at the end of the line to something larger (10 should be plenty), then run "make install" from the top of the build directory again.
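
For readers unfamiliar with this kind of timeout, here is a minimal sketch of waiting a fixed number of seconds for a child process to answer. It is not the code from mpdgdbdrv.py: the function name read_with_timeout and the use of select on a pipe are assumptions made for illustration, and the sketch is POSIX-only.

```python
import os
import select


def read_with_timeout(fd, timeout_seconds):
    """Return whatever text is available on fd, or None if nothing arrives in time.

    Hypothetical helper for illustration; not taken from mpdgdbdrv.py.
    """
    ready, _, _ = select.select([fd], [], [], timeout_seconds)
    if not ready:
        return None  # the "timed out waiting" case
    return os.read(fd, 4096).decode()


# A pipe stands in for the connection to gdb.
read_end, write_end = os.pipe()

silent = read_with_timeout(read_end, 0.1)  # nothing has been written yet
os.write(write_end, b"Breakpoint 1 at 0x413a96\n")
answered = read_with_timeout(read_end, 0.1)  # data is now waiting

os.close(read_end)
os.close(write_end)

print(repr(silent))
print(repr(answered))
```

A larger timeout only helps when the expected reply really is slow to arrive. It does nothing if the reply arrives on time but is not recognised by the caller.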

> By the way, I built cpi.c and I have been able to run it under gdb...  

OK, that's a good sign.

-d

Comment 7 Laurent Aguerreche 2009-08-12 15:52:18 UTC
Sorry for the (very) long delay but I think I found the problem!

I tried raising the timeout value and moving the MPI::Init() call to the beginning of my program, without any success.

Then I tried printing some messages from mpdgdbdrv.py, but they were lost somewhere (the output is probably redirected). So I ran the script directly after adding the following lines:

            stdout.write("b \"" + gdb_line + "\"")
            stdout.flush()

I added them around line 204, after the line:
            while not gdb_line.startswith('Breakpoint'):

but before the "try".

This is the output I got:

$ /usr/bin/mpdgdbdrv.py my_program
b ""b "
"b "    Breakpoint 1 at 0x413a96: file /home/me/my_program/Source/main.cpp, line 50.
"b "(gdb)
"mpdgdbdrv (<module> 115): timed out waiting for initial Breakpoint response
$


Note the spaces before the word "Breakpoint": because of them, this line does not start with "Breakpoint", so the check never matches! Consequently, I added the line:

            gdb_line = gdb_line.strip()

after:

            gdb_line = gdb_sout_serr.readline()  # drain breakpoint response



Now, "mpirun -gdb" seems to work...
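
The effect of the extra strip() call can be reproduced outside mpdgdbdrv.py. The sketch below is only an illustration: the list gdb_output mimics the lines printed in the session above, and find_breakpoint_line is a hypothetical helper, not a function from the real script.

```python
# Simulated gdb output, modelled on the lines shown in the debugging session.
gdb_output = [
    "\n",
    "    Breakpoint 1 at 0x413a96: file /home/me/my_program/Source/main.cpp, line 50.\n",
    "(gdb)\n",
]


def find_breakpoint_line(lines, strip_first):
    """Return the first line recognised as a breakpoint response, or None.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        candidate = line.strip() if strip_first else line
        if candidate.startswith("Breakpoint"):
            return candidate
    return None


# Without strip(), the leading spaces hide the keyword and nothing matches.
without_strip = find_breakpoint_line(gdb_output, strip_first=False)

# With strip(), the same line is recognised.
with_strip = find_breakpoint_line(gdb_output, strip_first=True)

print(without_strip)
print(with_strip)
```

When no line is ever recognised, a loop that waits for a line starting with "Breakpoint" can only end by timing out, which matches the error message in the original report.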


One more thing: text completion does not work (for file names, for instance). Is it possible to fix that?


Rgds,
Laurent.

Comment 8 Darius 2009-08-14 18:21:09 UTC
Nice catch!  I'll add that fix to the repo.

Unfortunately, text completion would not be an easy thing to add.  The text is being read by the front-end mpiexec program and just forwarded to the back-end gdb instances.  The front-end knows nothing about the symbols in your program, it just forwards commands.  So, it would be a major undertaking to add that feature.
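
To make the forwarding model concrete, here is a rough sketch under stated assumptions. The function forward_command and the in-memory streams standing in for the gdb instances are hypothetical and are not taken from mpiexec.

```python
import io


def forward_command(command, backends):
    """Write the same command line, unchanged, to every back-end input stream.

    Hypothetical illustration of a front end that only relays text.
    """
    for backend in backends:
        backend.write(command + "\n")


# Three in-memory streams stand in for the stdin of three gdb instances.
backends = [io.StringIO() for _ in range(3)]
forward_command("break main", backends)

received = [backend.getvalue() for backend in backends]
print(received)
```

In a design like this, the relaying side never parses the command or inspects the debugged program, so it has no symbol or file information from which to offer completions.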

We'd probably be better off spending time getting mpich2 to work with the debugger in eclipse/ptp.

Sorry.

-d

Comment 9 Bug Zapper 2010-04-27 15:30:25 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 Bug Zapper 2010-06-28 13:29:47 UTC
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.