Bug 738576 - add --with-tcp-retry-limit at compile time
Summary: add --with-tcp-retry-limit at compile time
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: torque
Version: el5
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Steve Traylen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-09-15 08:50 UTC by Arnau
Modified: 2014-06-03 12:52 UTC
CC List: 4 users

Fixed In Version: torque-2.5.7-3.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-03 12:52:19 UTC



Description Arnau 2011-09-15 08:50:48 UTC
Description of problem:

From the CHANGELOG:
Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on pbs_server. We recommend --with-tcp-retry-limit=2 (backported from 3.0.1)

And from the torqueusers mailing list:

http://www.supercluster.org/pipermail/torqueusers/2011-March/012477.html


Version-Release number of selected component (if applicable):

all?

How reproducible:

Kill a mom daemon and let the server (pbs_server) wait for it.

Comment 1 Fedora Update System 2011-09-18 23:45:47 UTC
torque-3.0.2-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16

Comment 2 Fedora Update System 2011-09-18 23:45:55 UTC
torque-2.5.7-2.el5.2 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el5.2

Comment 3 Fedora Update System 2011-09-18 23:46:03 UTC
torque-2.5.7-3.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/torque-2.5.7-3.el6

Comment 4 Fedora Update System 2011-09-18 23:47:11 UTC
torque-2.5.7-2.el4.2 has been submitted as an update for Fedora EPEL 4.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el4.2

Comment 5 Fedora Update System 2011-09-19 18:31:03 UTC
Package torque-3.0.2-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing torque-3.0.2-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
then log in and leave karma (feedback).

Comment 6 Fedora Update System 2011-10-01 18:44:38 UTC
torque-3.0.2-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 7 Fedora Update System 2011-10-08 19:23:15 UTC
torque-2.5.7-2.el4.2 has been pushed to the Fedora EPEL 4 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 8 Fedora Update System 2011-10-08 19:23:23 UTC
torque-2.5.7-2.el5.2 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2011-10-08 19:23:52 UTC
torque-2.5.7-3.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 10 Fabrice Bellet 2013-04-16 08:51:54 UTC
Hi!

I am reopening this bug because the fix seems to cause a side effect on torque-2.5.7-9.el6: pbs_server occasionally fails to bind to a local port twice in a row and, because of the tcp-retry-limit value, then gives up connecting to the mom port on the exec_host node. I assume that two randomly selected local ports in the reserved range have a non-zero statistical risk of both being busy, so maybe the tcp-retry-limit value could be increased a bit to make this case less likely?

The visible consequence is that some jobs stay in the queue forever, in the queued state, even though an exec_host has been selected for them.

Here is the relevant debug information from the log file:

04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to 86d6cd4c port 15002
04/15/2013 20:06:48;0002;PBS_Server;Job;20214.linux1.grid.creatis.insa-lyon.fr;child reported success for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=0
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20214.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0002;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;child reported failure for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=1
04/15/2013 20:06:48;0008;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;unable to run job, MOM rejected/rc=1
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing nodes for job 20217.linux1.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing node linux7.grid.creatis.insa-lyon.fr/9 from job 20217.linux1.grid.creatis.insa-lyon.fr (nsnfree=5)
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to QUEUED-STAGECMP (1-16)
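
The statistical argument above can be sketched numerically. This is only a rough model, not TORQUE's actual port-selection code: it assumes each bind attempt picks a reserved port independently and uniformly at random, that the retry limit equals the number of attempts made, and that a fixed fraction of reserved ports is unavailable. The function name and the 5% figure are hypothetical, chosen for illustration.

```python
# Hypothetical model of the bind-retry behaviour described in this comment;
# it does not reproduce TORQUE's real implementation.
def give_up_probability(busy_fraction: float, retry_limit: int) -> float:
    """Chance that all `retry_limit` independent bind attempts hit an
    unavailable port, so the connection attempt is abandoned."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if retry_limit < 1:
        raise ValueError("retry_limit must be at least 1")
    # Independent attempts: every one of them has to fail.
    return busy_fraction ** retry_limit


if __name__ == "__main__":
    busy = 0.05  # assumed: 5% of reserved ports unavailable (illustrative only)
    for limit in (2, 3, 5):
        p = give_up_probability(busy, limit)
        print(f"retry limit {limit}: P(give up) = {p:.2e}")
```

Under these assumptions the failure chance shrinks geometrically with the limit, which is why a slightly larger value might make stuck jobs much rarer while still bounding the hang that the option was introduced to prevent.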

Comment 11 Steve Traylen 2014-06-03 12:52:19 UTC
No longer being maintained in EPEL by me.

