Bug 738576
| Summary: | add --with-tcp-retry-limit at compile time | ||
|---|---|---|---|
| Product: | [Fedora] Fedora EPEL | Reporter: | Arnau <arnaubria> |
| Component: | torque | Assignee: | Steve Traylen <steve.traylen> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | el5 | CC: | fabrice, fotis, garrick, steve.traylen |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | torque-2.5.7-3.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-03 12:52:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Arnau
2011-09-15 08:50:48 UTC
torque-3.0.2-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16

torque-2.5.7-2.el5.2 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el5.2

torque-2.5.7-3.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/torque-2.5.7-3.el6

torque-2.5.7-2.el4.2 has been submitted as an update for Fedora EPEL 4.
https://admin.fedoraproject.org/updates/torque-2.5.7-2.el4.2

Package torque-3.0.2-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing torque-3.0.2-3.fc16'
as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/torque-3.0.2-3.fc16
then log in and leave karma (feedback).

torque-3.0.2-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.

torque-2.5.7-2.el4.2 has been pushed to the Fedora EPEL 4 stable repository. If problems still persist, please make note of it in this bug report.

torque-2.5.7-2.el5.2 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

torque-2.5.7-3.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.

Hi! I am reopening this bug because the fix seems to cause a side effect on torque-2.5.7-9.el6: pbs_server randomly fails to bind to a local port twice in a row and, because of the tcp-retry-count value, gives up connecting to the mom port on the exec_host node.

I assume that selecting two random local ports in the reserved range carries a non-zero statistical risk that both are busy, so maybe the tcp-retry-count value could be increased a bit to make this case less likely? The visible consequence is that some jobs stay in the queue forever, in the queued state, even though an exec_host has been selected. Here is the relevant debug information from the log file:

04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-STAGEGO to RUNNING-PRERUN (4-40)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to 86d6cd4c port 15002
04/15/2013 20:06:48;0002;PBS_Server;Job;20214.linux1.grid.creatis.insa-lyon.fr;child reported success for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=0
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20214.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to RUNNING-RUNNING (4-42)
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;attempting connect to host 134.214.205.76 port 15002
04/15/2013 20:06:48;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (cannot connect to port 985 in client_to_svr - errno:99 Cannot assign requested address) - time=0 seconds
04/15/2013 20:06:48;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node linux7.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0002;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;child reported failure for job after 0 seconds (dest=linux7.grid.creatis.insa-lyon.fr), rc=1
04/15/2013 20:06:48;0008;PBS_Server;Job;20217.linux1.grid.creatis.insa-lyon.fr;unable to run job, MOM rejected/rc=1
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing nodes for job 20217.linux1.grid.creatis.insa-lyon.fr
04/15/2013 20:06:48;0040;PBS_Server;Req;free_nodes;freeing node linux7.grid.creatis.insa-lyon.fr/9 from job 20217.linux1.grid.creatis.insa-lyon.fr (nsnfree=5)
04/15/2013 20:06:48;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20217.linux1.grid.creatis.insa-lyon.fr state from RUNNING-PRERUN to QUEUED-STAGECMP (1-16)

No longer being maintained in EPEL by me.
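
The statistical argument in the reopening comment above can be put into rough numbers. This is only a sketch under simplifying assumptions that are not confirmed by the report: each attempt picks a local port uniformly at random from the reserved pool, attempts are independent, and a pick fails exactly when the port is busy. The function name, pool size, and busy-port count are hypothetical, and TORQUE's real port selection may work differently.

```python
from fractions import Fraction


def all_attempts_fail(busy_ports: int, pool_size: int, attempts: int) -> Fraction:
    """Probability that every one of `attempts` independent, uniform picks
    from a pool of `pool_size` ports lands on one of `busy_ports` busy ports.

    Hypothetical model for illustration only, not TORQUE's actual logic.
    """
    if pool_size <= 0 or not 0 <= busy_ports <= pool_size or attempts < 1:
        raise ValueError("need pool_size > 0, 0 <= busy_ports <= pool_size, attempts >= 1")
    return Fraction(busy_ports, pool_size) ** attempts


if __name__ == "__main__":
    # Illustrative numbers only; neither value comes from the bug report.
    pool_size = 512
    busy_ports = 50
    for attempts in (1, 2, 3, 5):
        p = all_attempts_fail(busy_ports, pool_size, attempts)
        print(f"{attempts} attempt(s): P(all fail) = {float(p):.6f}")
```

Under this model the failure probability shrinks geometrically with the number of attempts, which matches the suggestion that a slightly larger retry count would make the stuck-job case much rarer without eliminating it.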