Description of problem:
From the CHANGELOG: "Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on pbs_server. We recommend --with-tcp-retry-limit=2" (backported from 3.0.1), and from the torque users list: http://www.supercluster.org/pipermail/torqueusers/2011-March/012477.html

Version-Release number of selected component (if applicable):
all?

How reproducible:
Kill a mom daemon and leave the pbs_server waiting for it.
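For reference, the fix is enabled at build time; a minimal configure invocation using the value recommended in the changelog would look like this (the prefix is an illustrative assumption, not taken from the actual spec file):

```shell
# Build TORQUE with a bounded TCP connect retry count, as recommended
# in the changelog; --prefix here is an illustrative assumption.
./configure --prefix=/usr --with-tcp-retry-limit=2
make
```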
torque-3.0.2-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
torque-2.5.7-2.el5.2 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/torque-2.5.7-2.el5.2
torque-2.5.7-3.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/torque-2.5.7-3.el6
torque-2.5.7-2.el4.2 has been submitted as an update for Fedora EPEL 4. https://admin.fedoraproject.org/updates/torque-2.5.7-2.el4.2
Package torque-3.0.2-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing torque-3.0.2-3.fc16'
as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
then log in and leave karma (feedback).
torque-3.0.2-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-2.el4.2 has been pushed to the Fedora EPEL 4 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-2.el5.2 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-3.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
Hi! I am reopening this bug, as it seems that this fix causes a side effect on torque-2.5.7-9.el6: pbs_server randomly fails to bind to a local port twice in a row because of the tcp-retry-count value, and consequently gives up connecting to the mom port on the exec_host node. I assume that selecting two random local ports in the reserved range carries a statistically non-null risk of both being busy, so maybe the tcp-retry-count value could be increased a bit to mitigate this case? The visible consequence is that some jobs stay in the queue forever, in the queued state, while having an exec_host selected. Here is the relevant debug information from the log file:

04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to 86d6cd4c port 15002
04/15/2013 20:06:48;0002;PBS_Server;Job;20214.linux1.grid.creatis.insa-lyon.fr;child reported success for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=0
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20214.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0002;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;child reported failure for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=1
04/15/2013 20:06:48;0008;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;unable to run job, MOM rejected/rc=1
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing nodes for job 20217.linux1.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing node linux7.grid.creatis.insa-lyon.fr/9 from job 20217.linux1.grid.creatis.insa-lyon.fr (nsnfree=5)
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to QUEUED-STAGECMP (1-16)
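The statistical argument above can be sketched numerically. Assuming the server draws its local port uniformly from the 512-port reserved range (512-1023) and that the retry limit bounds the number of independent draws, the probability that every attempt lands on a busy port is roughly (busy/total)^retries. The busy-port count used below is an illustrative assumption, not a measurement from the affected server:

```python
# Rough model of the reopened issue: with a small tcp-retry-limit,
# the chance that every randomly chosen reserved local port is busy
# is about (busy/total) ** retries, assuming independent uniform draws.
RESERVED_PORTS = 512  # privileged range 512-1023 used for reserved-port binds

def p_all_attempts_busy(busy: int, retries: int) -> float:
    """Probability that `retries` independent uniform draws all hit busy ports."""
    return (busy / RESERVED_PORTS) ** retries

# Illustrative assumption: 64 of the 512 reserved ports are in use.
p2 = p_all_attempts_busy(64, 2)  # retry limit 2 (the recommended value)
p5 = p_all_attempts_busy(64, 5)  # a slightly higher limit
print(f"retries=2: {p2:.6f}")   # -> retries=2: 0.015625
print(f"retries=5: {p5:.8f}")
```

Under these assumptions a limit of 2 fails about 1.6% of the time, while a modestly higher limit drives the failure rate down by orders of magnitude, which supports raising tcp-retry-count a little rather than reverting the fix.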
No longer being maintained in EPEL by me.