Description of problem:
From the CHANGELOG: "Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on pbs_server. We recommend --with-tcp-retry-limit=2" (backported from 3.0.1), and from the torque users list: http://www.supercluster.org/pipermail/torqueusers/2011-March/012477.html

Version-Release number of selected component (if applicable):
all?

How reproducible:
Kill a mom daemon and leave the pbs_server waiting for it.
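For reference, the fix is enabled at build time; a minimal configure invocation using the value recommended in the changelog would look like this (the prefix is an illustrative assumption, not taken from the actual spec file):

```shell
# Build TORQUE with a bounded TCP connect retry count, as recommended
# in the changelog; --prefix here is an illustrative assumption.
./configure --prefix=/usr --with-tcp-retry-limit=2
make
```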
torque-3.0.2-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
torque-2.5.7-2.el5.2 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/torque-2.5.7-2.el5.2
torque-2.5.7-3.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/torque-2.5.7-3.el6
torque-2.5.7-2.el4.2 has been submitted as an update for Fedora EPEL 4. https://admin.fedoraproject.org/updates/torque-2.5.7-2.el4.2
Package torque-3.0.2-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing torque-3.0.2-3.fc16'
as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
then log in and leave karma (feedback).
torque-3.0.2-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-2.el4.2 has been pushed to the Fedora EPEL 4 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-2.el5.2 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.
torque-2.5.7-3.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
Hi! I am reopening this bug, as it seems that this fix causes a side effect on torque-2.5.7-9.el6: pbs_server randomly fails to bind to a local port twice in a row because of the tcp-retry-count value, and consequently gives up connecting to the mom port on the exec_host node. I assume that selecting two random local ports in the reserved range carries a statistically non-null risk of both being busy, so maybe the tcp-retry-count value could be increased a bit to mitigate this case? The visible consequence is that some jobs stay in the queue forever, in the queued state, while having an exec_host selected. Here is the relevant debug information from the log file:

04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to 86d6cd4c port 15002
04/15/2013 20:06:48;0002;PBS_Server;Job;20214.linux1.grid.creatis.insa-lyon.fr;child reported success for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=0
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20214.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0002;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;child reported failure for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=1
04/15/2013 20:06:48;0008;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;unable to run job, MOM rejected/rc=1
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing nodes for job 20217.linux1.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing node linux7.grid.creatis.insa-lyon.fr/9 from job 20217.linux1.grid.creatis.insa-lyon.fr (nsnfree=5)
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to QUEUED-STAGECMP (1-16)
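The statistical argument above can be sketched numerically. Assuming the server draws its local port uniformly from the 512-port reserved range (512-1023) and that the retry limit bounds the number of independent draws, the probability that every attempt lands on a busy port is roughly (busy/total)^retries. The busy-port count used below is an illustrative assumption, not a measurement from the affected server:

```python
# Rough model of the reopened issue: with a small tcp-retry-limit,
# the chance that every randomly chosen reserved local port is busy
# is about (busy/total) ** retries, assuming independent uniform draws.
RESERVED_PORTS = 512  # privileged range 512-1023 used for reserved-port binds

def p_all_attempts_busy(busy: int, retries: int) -> float:
    """Probability that `retries` independent uniform draws all hit busy ports."""
    return (busy / RESERVED_PORTS) ** retries

# Illustrative assumption: 64 of the 512 reserved ports are in use.
p2 = p_all_attempts_busy(64, 2)  # retry limit 2 (the recommended value)
p5 = p_all_attempts_busy(64, 5)  # a slightly higher limit
print(f"retries=2: {p2:.6f}")   # -> retries=2: 0.015625
print(f"retries=5: {p5:.8f}")
```

Under these assumptions a limit of 2 fails about 1.6% of the time, while a modestly higher limit drives the failure rate down by orders of magnitude, which supports raising tcp-retry-count a little rather than reverting the fix.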
No longer being maintained in EPEL by me.