Bug 144033 - [RHEL3] poll() seems to ignore large timeout
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Peter Staubach
QA Contact: Brian Brock
Depends On:
Blocks: 168424
Reported: 2005-01-03 17:17 EST by Issue Tracker
Modified: 2007-11-30 17:07 EST (History)
CC List: 9 users

See Also:
Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Last Closed: 2006-03-15 10:48:24 EST


Attachments
Proposed patch (864 bytes, patch)
2005-09-01 11:45 EDT, Peter Staubach
Proposed patch (1.43 KB, patch)
2005-09-13 13:52 EDT, Peter Staubach

Description Issue Tracker 2005-01-03 17:17:04 EST
Escalated to Bugzilla from IssueTracker
Comment 1 Elena Zannoni 2005-01-03 17:19:39 EST
poll() seems to ignore a large (say, 6 hours) timeout.
Run the attached program: it never returns.

This is on AS2.1 and 3.0.

----------------------------------------------------------------
dmair: I was able to reproduce the issue after allowing the attached
program to run overnight.  It never returned.  I've escalated the issue to
our engineering team.

Regards,

David Mair

----------------------------------------------------------------------
fhirtz: This is a problem with the conversion to ticks. The kernel code does...

      if (timeout) {
              /* Careful about overflow in the intermediate values */
              if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
                      timeout = (unsigned long)(timeout*HZ+999)/1000+1;
              else /* Negative or overflow */
                      timeout = MAX_SCHEDULE_TIMEOUT;
      }

...and from sched.h...

#define        MAX_SCHEDULE_TIMEOUT    LONG_MAX

...therefore (assuming HZ==100 and a 32-bit long) any timeout >= 21,474,836
milliseconds (about 5.97 hours) is treated as "wait forever".
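
The threshold can be checked with a small user-space model of the conversion.
This is only a sketch, not kernel code: the names sim_poll_ticks,
sim_forever_threshold_ms and SIM_MAX_SCHEDULE_TIMEOUT are hypothetical, HZ is
passed in as a parameter, and "long" is pinned to 32 bits with int32_t so the
result is the same when the model is built on a 64-bit host.

```c
#include <stdint.h>

/* Stand-in for MAX_SCHEDULE_TIMEOUT (LONG_MAX) on a kernel with 32-bit long. */
#define SIM_MAX_SCHEDULE_TIMEOUT INT32_MAX

/* Smallest millisecond timeout that the model treats as "forever". */
int32_t sim_forever_threshold_ms(int32_t hz)
{
    return SIM_MAX_SCHEDULE_TIMEOUT / hz;
}

/*
 * Hypothetical model of the quoted conversion. Returns the tick count,
 * or SIM_MAX_SCHEDULE_TIMEOUT when the timeout is treated as infinite.
 */
int32_t sim_poll_ticks(int32_t timeout_ms, int32_t hz)
{
    if (timeout_ms == 0)
        return 0;

    /* Negative values look huge when viewed as unsigned, as in the kernel. */
    if ((uint32_t)timeout_ms < (uint32_t)sim_forever_threshold_ms(hz))
        return (int32_t)(((uint32_t)timeout_ms * (uint32_t)hz + 999u)
                         / 1000u + 1u);

    /* Negative or overflow */
    return SIM_MAX_SCHEDULE_TIMEOUT;
}
```

Under these assumptions sim_forever_threshold_ms(100) is 21474836, so the
6 hour timeout from the test case (21,600,000 ms) maps to the "forever" value,
while 21,474,835 ms still converts to a finite tick count.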

While it's a bug, it will be difficult to correct. Some options on our side
might be to keep looping inside sys_poll(), to cap the timeout at the largest
value that does not mean "forever", or even just to return EINVAL.

------------------------------------------------------------------------
fhirtz: The problem is that changing this is a major ABI change; anything
relying on poll() not returning EINVAL is going to be very surprised. The only
saving grace is that very little code passes "large" poll() timeouts ... it's
usually none, seconds or forever. A fix is unlikely to be included in a current
release until we can get it changed upstream. We're likely looking at an
upstream fix in 2.6.x, which would then land in the next RHELx release.

You should be able to do something like the following:

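#include <poll.h>

#define MY_POLL_MAX_MSECS (2 * 1000 * 1000) /* close enough */

/* Split a long timeout into slices the kernel converts correctly. */
int my_poll(struct pollfd *ufds, nfds_t nfds, int timeout)
{
        int tmout = timeout;
        int ret = 0;

        /* Wait in capped slices while more than one slice remains. */
        while ((tmout > MY_POLL_MAX_MSECS) &&
               !(ret = poll(ufds, nfds, MY_POLL_MAX_MSECS)))
                tmout -= MY_POLL_MAX_MSECS;

        /* Final slice; also covers zero and negative (infinite) timeouts. */
        if (!ret)
                ret = poll(ufds, nfds, tmout);

        return ret;
}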

...which is basically what the good fix will be (and it might well be done
inside glibc, so will look almost identical to the above). We're passing this up
to see if this sort of change might be acceptable in the current glibc.
Comment 2 Elena Zannoni 2005-01-03 17:21:19 EST
testcase:

#include <stdio.h>
#include <sys/poll.h>

int main(int argc, char **argv)
{
    /* 6 hours, in milliseconds */
    const int timeout = 6*60*60*1000;

    int ret = poll(NULL, 0, timeout);
    if (ret < 0)
        perror("poll failed");

    return 0;
}
Comment 3 Jakub Jelinek 2005-01-03 17:33:30 EST
Doing this in userland is a bad idea IMHO.  Why should we punish e.g. 32-bit
programs running on 64-bit kernels where the kernel will handle the maximum
(~ 25 days due to using int, not struct timeval or something like that) timeout just fine?
I think this should be fixed in the kernel.
Comment 4 Roland McGrath 2005-01-03 18:49:34 EST
By my reading of POSIX, poll is not allowed to put any maximum on the useful
timeout values.  Notably select/pselect is specified to return EINVAL when the
timeout exceeds an implementation maximum, while poll's specification does not
have this clause.  select/pselect is required to support a timeout of at least
31 days, which will no longer be true when using HZ=1000 with 32-bit longs.
I think this needs to be fixed in the kernel, both for select and for poll;
i.e., they should loop to count down the whole specified timeout.
However, for RHEL3 and earlier I think we can reasonably call this a known
limitation and leave it as it is.  No Linux kernel has ever done any better
before.  We should look at fixing this upstream in 2.6 and for RHEL4.
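
The "loop to count down the whole specified timeout" idea can be illustrated in
user space. The sketch below is an assumption-laden illustration, not the
kernel change discussed here: poll_countdown, next_chunk_ms, now_ms and
CHUNK_MAX_MS are hypothetical names, and the deadline is tracked with
CLOCK_MONOTONIC so that time spent inside earlier slices is not counted twice.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <time.h>

/* Hypothetical slice cap, well below the ~21.4 million ms limit at HZ=100. */
#define CHUNK_MAX_MS 2000000

/* Current monotonic time in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Length of the next poll() slice for the given remaining time. */
int next_chunk_ms(long long remaining_ms)
{
    if (remaining_ms <= 0)
        return 0;
    return remaining_ms > CHUNK_MAX_MS ? CHUNK_MAX_MS : (int)remaining_ms;
}

/* poll() replacement that counts the full timeout down in capped slices. */
int poll_countdown(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    long long deadline;

    /* 0 returns at once and negative waits forever; neither needs slicing. */
    if (timeout_ms <= 0)
        return poll(fds, nfds, timeout_ms);

    deadline = now_ms() + timeout_ms;
    for (;;) {
        long long remaining = deadline - now_ms();
        int ret = poll(fds, nfds, next_chunk_ms(remaining));

        /* Stop on events, on error, or once the last slice has been used. */
        if (ret != 0 || remaining <= CHUNK_MAX_MS)
            return ret;
    }
}
```

A caller would use poll_countdown(fds, nfds, 6*60*60*1000) exactly like poll().
The sketch does not retry on EINTR; an interrupted slice is returned to the
caller as an error, as plain poll() would do.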
Comment 5 Jason Baron 2005-06-21 17:17:29 EDT
*** Bug 160065 has been marked as a duplicate of this bug. ***
Comment 6 Ernie Petrides 2005-08-31 17:33:43 EDT
Reassigning this to PeterS and removing bug 160065 as a dup (since that
one is against RHEL4).
Comment 7 Peter Staubach 2005-09-01 11:44:53 EDT
The timeout for poll(2) is limited to 2^31-1 milliseconds, because it is
stored in an int.  A value of 0 causes poll(2) to return immediately.
Any negative value is treated as infinite.

Internally, the kernel uses a long to store the timeout value, converted to
clock ticks.  This should mean that as long as the value of HZ is 1000 or
less, then the kernel should be able to correctly handle the full range of
timeout values which can be expressed.  Any values of HZ which are larger
than 1000, on 32 bit platforms, will reduce the range of timeout values that
the kernel can correctly handle.

I think that the kernel should be able to correctly handle any valid
timeout values, but that would start to involve massive changes in the
associated kernel infrastructures and this would probably not be considered
to be a worthwhile change to make.  The benefits would not be considered
to be large enough to offset the risks and the costs of the changes.

I can correct the math used to convert the timeout in milliseconds to the
appropriate number of clock ticks.  This will work fully as long as the
value of HZ stays 1000 or less.
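
One way such a conversion could avoid overflow in the intermediate values is to
convert whole seconds and the sub-second remainder separately. The sketch below
is not the attached patch, which is not reproduced in this report; it is a
user-space model under stated assumptions: ms_to_ticks_split and
SIM_MAX_SCHEDULE_TIMEOUT are hypothetical names, HZ is a parameter, and "long"
is modelled as 32 bits with fixed-width types.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MAX_SCHEDULE_TIMEOUT (LONG_MAX) with a 32-bit long. */
#define SIM_MAX_SCHEDULE_TIMEOUT INT32_MAX

/*
 * Convert a positive millisecond timeout to ticks, rounding up and adding
 * one tick as the original expression does, without overflowing for
 * 0 < hz <= 1000.
 */
int32_t ms_to_ticks_split(int32_t timeout_ms, uint32_t hz)
{
    uint32_t secs, rem_ms, ticks;

    assert(timeout_ms > 0);
    assert(hz > 0 && hz <= 1000);

    secs = (uint32_t)timeout_ms / 1000u;
    rem_ms = (uint32_t)timeout_ms % 1000u;

    /* Whole seconds convert exactly; only the remainder is rounded up. */
    ticks = secs * hz + (rem_ms * hz + 999u) / 1000u + 1u;

    /* Stay below the value reserved for "wait forever". */
    if (ticks >= (uint32_t)SIM_MAX_SCHEDULE_TIMEOUT)
        ticks = (uint32_t)SIM_MAX_SCHEDULE_TIMEOUT - 1u;

    return (int32_t)ticks;
}
```

With hz = 100 the 6 hour timeout (21,600,000 ms) becomes 2,160,001 ticks
instead of "forever". At hz = 1000 the largest int timeouts land on the clamp,
one tick below the reserved value.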
Comment 8 Peter Staubach 2005-09-01 11:45:53 EDT
Created attachment 118350 [details]
Proposed patch
Comment 9 Peter Staubach 2005-09-13 13:52:13 EDT
Created attachment 118768 [details]
Proposed patch
Comment 10 Ernie Petrides 2005-09-21 20:43:24 EDT
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.3.EL).
Comment 14 Red Hat Bugzilla 2006-03-15 10:48:24 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html
