Bug 144033
Summary: | [RHEL3] poll() seems to ignore large timeout | |
---|---|---|---
Product: | Red Hat Enterprise Linux 3 | Reporter: | Issue Tracker <tao>
Component: | kernel | Assignee: | Peter Staubach <staubach>
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 3.0 | CC: | aviro, ezannoni, jakub, jbaron, lwang, peterm, petrides, riel, roland
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHSA-2006-0144 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-03-15 15:48:24 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 168424 | |
Attachments: | | |
Description
Issue Tracker
2005-01-03 22:17:04 UTC
poll() seems to ignore a large (say 6 hours) timeout. Run the attached program: it never returns. This is on AS2.1 and 3.0.

----------------------------------------------------------------
dmair

I was able to successfully reproduce the issue after allowing the attached program to run overnight. It never did return. I've escalated the issue up to our engineering team.

Regards,
David Mair

----------------------------------------------------------------
fhirtz

This is a problem with the conversion to ticks. The kernel code does...

    if (timeout) {
        /* Careful about overflow in the intermediate values */
        if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
            timeout = (unsigned long)(timeout*HZ+999)/1000+1;
        else /* Negative or overflow */
            timeout = MAX_SCHEDULE_TIMEOUT;
    }

...and from sched.h...

    #define MAX_SCHEDULE_TIMEOUT LONG_MAX

...therefore (assuming HZ==100, 32bit long) anything >= 21,474,836 milliseconds (about 5.9 hours) means forever.

While it's a bug, it will be difficult to correct. Some options on our side might be to keep it looping inside sys_poll(), capping to the "max" non-forever value, or even just returning EINVAL.

----------------------------------------------------------------
fhirtz

The problem is that changing this is a major ABI change: anything relying on poll() not returning EINVAL is going to be very surprised. The only saving grace is that very few people/code use "large" poll() values ... it's usually none, seconds or forever.

It's unlikely to be included in a current release until we can get it changed upstream. We're likely looking at an upstream fix in 2.6.x and then that being in the next RHELx release.
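The threshold arithmetic in the comment above can be checked with a small sketch. It is an illustration only, assuming HZ==100 and a 32-bit long as stated in the comment; `SKETCH_HZ`, `SKETCH_MAX_SCHEDULE_TIMEOUT` and `sketch_msecs_to_ticks()` are hypothetical names that mirror the quoted check, not kernel code.

```c
#include <stdint.h>

/* Assumptions for illustration: HZ == 100 and a 32-bit long,
 * so INT32_MAX stands in for LONG_MAX. */
#define SKETCH_HZ 100
#define SKETCH_MAX_SCHEDULE_TIMEOUT INT32_MAX

/* Mirrors the quoted sys_poll() conversion using fixed-width types.
 * Returns the tick count, or SKETCH_MAX_SCHEDULE_TIMEOUT ("forever"). */
static int32_t sketch_msecs_to_ticks(int32_t timeout_ms)
{
    if (timeout_ms == 0)
        return 0;   /* poll() returns immediately */

    /* Careful about overflow in the intermediate values */
    if ((uint32_t)timeout_ms < (uint32_t)(SKETCH_MAX_SCHEDULE_TIMEOUT / SKETCH_HZ))
        return (int32_t)(((uint32_t)timeout_ms * SKETCH_HZ + 999) / 1000 + 1);

    /* Negative, or at least 21,474,836 ms: treated as an infinite wait */
    return SKETCH_MAX_SCHEDULE_TIMEOUT;
}
```

Under these assumptions, `sketch_msecs_to_ticks(18000000)` (5 hours) yields 1800001 ticks, while `sketch_msecs_to_ticks(21600000)` (6 hours, the value used by the test case) yields the "forever" sentinel, which matches the reported behaviour.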
You should be able to do something like the following:

    #include <sys/poll.h>

    #define MY_POLL_MAX_MSECS (2 * 1000 * 1000) /* close enough */

    int my_poll(struct pollfd *ufds, unsigned int nfds, int timeout)
    {
        int tmout = timeout;
        int ret = 0;

        /* Wait in chunks the kernel can handle; stop on an event or an error */
        while ((tmout > MY_POLL_MAX_MSECS) &&
               !(ret = poll(ufds, nfds, MY_POLL_MAX_MSECS)))
            tmout -= MY_POLL_MAX_MSECS;

        /* Wait out the remainder (also covers zero and negative timeouts) */
        if (!ret)
            ret = poll(ufds, nfds, tmout);

        return (ret);
    }

...which is basically what the good fix will be (and it might well be done inside glibc, so will look almost identical to the above). We're passing this up to see if this sort of change might be acceptable in the current glibc.

testcase:

    #include <stdio.h>
    #include <sys/poll.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        // 6 hours
        const int timeout = 6*60*60*1000;
        int ret = poll(NULL, 0, timeout);

        if (ret < 0)
            perror("poll failed");
        return 0;
    }

Doing this in userland is a bad idea IMHO. Why should we punish e.g. 32-bit programs running on 64-bit kernels, where the kernel will handle the maximum timeout (~ 25 days, due to using int rather than struct timeval or something like that) just fine? I think this should be fixed in the kernel.

By my reading of POSIX, poll is not allowed to put any maximum on the useful timeout values. Notably, select/pselect is specified to return EINVAL when the timeout exceeds an implementation maximum, while poll's specification does not have this clause. select/pselect is required to support a timeout of at least 31 days, which will no longer be true when using HZ=1000 with 32-bit longs.

I think this needs to be fixed in the kernel, both for select and for poll; i.e., they should loop to count down the whole specified timeout. However, for RHEL3 and earlier I think we can reasonably call this a known limitation and leave it as it is. No Linux kernel has ever done any better before. We should look at fixing this upstream in 2.6 and for RHEL4.

*** Bug 160065 has been marked as a duplicate of this bug.
***

Reassigning this to PeterS and removing bug 160065 as a dup (since that one is against RHEL4).

The timeout for poll(2) is limited to 2^31-1 milliseconds, because the timeout is stored in an int. The value 0 causes poll(2) to return immediately. Any other (that is, negative) value is treated as infinite.

Internally, the kernel uses a long to store the timeout value, converted to clock ticks. This should mean that as long as the value of HZ is 1000 or less, the kernel should be able to correctly handle the full range of timeout values which can be expressed. Any value of HZ larger than 1000, on 32-bit platforms, will reduce the range of timeout values that the kernel can correctly handle.

I think that the kernel should be able to correctly handle any valid timeout value, but that would start to involve massive changes in the associated kernel infrastructure, and this would probably not be considered a worthwhile change to make. The benefits would not be considered large enough to offset the risks and the costs of the changes.

I can correct the math used to convert the timeout in milliseconds to the appropriate number of clock ticks. This will work fully as long as the value of HZ stays 1000 or less.

Created attachment 118350 [details]
Proposed patch
Created attachment 118768 [details]
Proposed patch
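For illustration, a corrected milliseconds-to-ticks conversion of the kind described above might look like the following sketch. This is not the content of either attached patch; `sketch_poll_timeout_to_ticks()` and its `hz` and `max_ticks` parameters are hypothetical, and the 64-bit intermediate is just one way to avoid the overflow.

```c
#include <stdint.h>

/* Hypothetical helper (not the attached patch): convert a poll() timeout in
 * milliseconds to clock ticks without overflowing the intermediate product.
 * max_ticks plays the role of MAX_SCHEDULE_TIMEOUT, the "wait forever" value. */
static long sketch_poll_timeout_to_ticks(int timeout_ms, long hz, long max_ticks)
{
    int64_t ticks;

    if (timeout_ms == 0)
        return 0;           /* return immediately */
    if (timeout_ms < 0)
        return max_ticks;   /* negative means an infinite wait */

    /* Same rounding as the original formula, computed in 64 bits so that
     * timeout_ms * hz cannot wrap. */
    ticks = ((int64_t)timeout_ms * hz + 999) / 1000 + 1;

    /* Clamp anything that would reach the sentinel value. */
    return ticks >= max_ticks ? max_ticks : (long)ticks;
}
```

With `hz` set to 100 and `max_ticks` set to `INT32_MAX` (standing in for a 32-bit long), the 6-hour timeout from the test case converts to 2160001 ticks instead of being treated as infinite.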
A fix for this problem has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.3.EL).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html