Bug 738576

Summary: add --with-tcp-retry-limit at compile time
Product: [Fedora] Fedora EPEL
Reporter: Arnau <arnaubria>
Component: torque
Assignee: Steve Traylen <steve.traylen>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: el5
CC: fabrice, fotis, garrick, steve.traylen
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: torque-2.5.7-3.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-03 12:52:19 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Arnau 2011-09-15 08:50:48 UTC
Description of problem:

From the CHANGELOG:
Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on pbs_server. We recommend --with-tcp-retry-limit=2 (backported from 3.0.1)

And from the torque users list:

http://www.supercluster.org/pipermail/torqueusers/2011-March/012477.html


Version-Release number of selected component (if applicable):

all?

How reproducible:

Kill a mom daemon and leave the master waiting for it.

Comment 1 Fedora Update System 2011-09-18 23:45:47 UTC
torque-3.0.2-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16

Comment 2 Fedora Update System 2011-09-18 23:45:55 UTC
torque-2.5.7-2.el5.2 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el5.2

Comment 3 Fedora Update System 2011-09-18 23:46:03 UTC
torque-2.5.7-3.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/torque-2.5.7-3.el6

Comment 4 Fedora Update System 2011-09-18 23:47:11 UTC
torque-2.5.7-2.el4.2 has been submitted as an update for Fedora EPEL 4.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el4.2

Comment 5 Fedora Update System 2011-09-19 18:31:03 UTC
Package torque-3.0.2-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing torque-3.0.2-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
then log in and leave karma (feedback).

Comment 6 Fedora Update System 2011-10-01 18:44:38 UTC
torque-3.0.2-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 7 Fedora Update System 2011-10-08 19:23:15 UTC
torque-2.5.7-2.el4.2 has been pushed to the Fedora EPEL 4 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 8 Fedora Update System 2011-10-08 19:23:23 UTC
torque-2.5.7-2.el5.2 has been pushed to the Fedora EPEL 5 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2011-10-08 19:23:52 UTC
torque-2.5.7-3.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 10 Fabrice Bellet 2013-04-16 08:51:54 UTC
Hi!

I am reopening this bug, as it seems that this fix causes a side effect on torque-2.5.7-9.el6: pbs_server randomly fails to bind to a local port twice in a row, exhausts the tcp-retry-limit value, and consequently gives up connecting to the mom port on the exec_host node. I assume that two randomly selected local ports in the reserved range have a non-zero statistical risk of both being busy, so maybe the tcp-retry-limit value could be increased a bit to make this case less likely?
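
To put rough numbers on that risk, here is a minimal sketch. It assumes each attempt picks a reserved port uniformly and independently, and that a fixed fraction of those ports is busy. The function name and the busy fractions are hypothetical illustrations, not measurements from this cluster or values taken from the torque source.

```python
def all_attempts_fail(busy_fraction: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent random picks
    lands on a busy port, given the fraction of ports that are busy."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return busy_fraction ** attempts


if __name__ == "__main__":
    # Hypothetical busy fractions, chosen only to show the trend.
    for busy in (0.01, 0.05, 0.20):
        for attempts in (2, 3, 5):
            p = all_attempts_fail(busy, attempts)
            print(f"busy={busy:.2f} attempts={attempts} p(all fail)={p:.2e}")
```

Under these assumptions the failure probability drops geometrically with each extra attempt, so even a small increase in the limit should make the double collision much rarer, while the probability is never exactly zero.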

The visible consequence is that some jobs stay in the queue forever, in the queued state, even though an exec_host has been selected.

Here is the relevant debug information from the log file:

04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to 86d6cd4c port 15002
04/15/2013 20:06:48;0002;PBS_Server;Job;20214.linux1.grid.creatis.insa-lyon.fr;child reported success for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=0
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20214.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0002;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;child reported failure for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=1
04/15/2013 20:06:48;0008;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;unable to run job, MOM rejected/rc=1
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing nodes for job 20217.linux1.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing node linux7.grid.creatis.insa-lyon.fr/9 from job 20217.linux1.grid.creatis.insa-lyon.fr (nsnfree=5)
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to QUEUED-STAGECMP (1-16)
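
To check how often this happens on a given server, the log could be scanned for these connect failures. The following is a rough sketch, assuming the six semicolon-separated fields seen in the excerpt above (timestamp, event code, server name, object type, object name, message). The field names and helper functions are my own labels, not part of torque.

```python
import re
from collections import Counter

# Field names are assumed labels for the layout seen in the excerpt above.
FIELDS = ("timestamp", "code", "server", "obj_type", "obj_name", "message")
ERRNO_RE = re.compile(r"errno:(\d+)")


def parse_line(line):
    """Split one log line into a dict, or return None if it does not match."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def count_errnos(lines):
    """Count 'errno:N' occurrences in the message field, keyed by N."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        match = ERRNO_RE.search(record["message"])
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two lines copied from the excerpt above.
SAMPLE = [
    "04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002",
    "04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds",
]

if __name__ == "__main__":
    for errno_value, n in count_errnos(SAMPLE).items():
        print(f"errno {errno_value}: {n} occurrence(s)")
```

Feeding it the lines of a real server log instead of SAMPLE would give a count of errno 99 failures to compare before and after changing the retry limit.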

Comment 11 Steve Traylen 2014-06-03 12:52:19 UTC
No longer being maintained in EPEL by me.