Bug 451494

Summary: Listdir function fails randomly due to mislocated errno assignment [patch included]
Product: [Fedora] Fedora Reporter: Philip Spencer <pspencer>
Component: pythonAssignee: James Antill <james.antill>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 8CC: dcantrell, james.antill, jonathansteffan, katzj, matt_domsch
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-08-12 14:20:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Patch to fix the problem of listdir mistakenly thinking error occurred none

Description Philip Spencer 2008-06-14 22:28:28 UTC
Python's os.listdir function will fail randomly when trying to list large
directories. As a result, commands like createrepo and repomanage will often
fail when invoked on a 64-machine on, for example, the current Fedora 8 updates
directory.

To reproduce: 

    Copy the current Fedora 8 updates directory to a location /tmp/f8updates.
    Be on a 64-bit machine running the latest kernel.
    Run "repomanage -c --old /tmp/f8updates" 20 times in a row.
    On all 5 machines I tried this on, at least one time out of the 20
    the command would fail with a "cannot read directory" error message.

    With createrepo there would be a silent failure at least one time out of
    20 (the repo files would be created listing no packages at all!)

The cause:

Python's os.listdir function has the following code in Modules/posixmodule.c
beginning at line 2163:

   errno = 0
   ...
    for (;;) {
         Py_BEGIN_ALLOW_THREADS
         ep = readdir(dirp);
         Py_END_ALLOW_THREADS
         if (ep == NULL)
              break;
         ...
         a bunch of other stuff, including PyString_FromStringAndSize
         which calls malloc
         ...
    }
    if (errno != 0 && d != NULL) {

The assumption is that errno will be nonzero only if readdir failed.
However, this is not the case. GLibc's malloc will, in some rare cases,
set errno even when it succeeds. So, during one pass through the loop
errno gets set to ENOMEM and then it is still set to ENOMEM when the
final readdir returns null at the end of the directory listing.

The fix is to move the line "errno = 0" from outside the loop to
right before the readdir inside the loop. That is the only way to
ensure that, when the loop is exited with readdir returning null,
the condition "errno != 0" is equivalent to "readdir actually failed."

The attached patch does this. With the attached patch, we have experienced
no more of the failures from createrepo or repomanage.



Version-Release number of selected component (if applicable):

Python-2.5.1-15.fc8, but Python 2.5.2 is affected as well.

How reproducible:

Reproducible somwhat randomly but in our testing we have never
been able to run the commands above more than 20 times without
seeing the problem. 

It seems to be only on 64-bit machines, and only in certain circumstances,
that glibc's malloc returns ENOMEM even when it succeeds -- I suspect because
it first tries to meet large memory requests with mmap and after many requests
the processes mmap limit may be exceeded (mmap fails, setting errno) and malloc
then consolidates its maps and allocates the memory a different way --
succeeding this time, but not clearing errno. I don't know for sure. I do know
that createrepo/repomanage work fine on 32-bit machines and work fine on
repositories under about 5000 packages.

Steps to Reproduce:

see above.

Comment 1 Philip Spencer 2008-06-14 22:28:28 UTC
Created attachment 309374 [details]
Patch to fix the problem of listdir mistakenly thinking error occurred

Comment 2 James Antill 2008-06-15 05:13:46 UTC
 Can you log this bug upstream ... as it isn't "fixed" in 2.6.x.
 I'm also unsure about the patch:

1. Just before the loop the code is:

        if ((d = PyList_New(0)) == NULL) {
                closedir(dirp);
                PyMem_Free(name);
                return NULL;
        }

...and as far as I can see d is _only_ set to NULL again, when one of the Py_*
functions fails (from a memory related problem).

2. Given #1 the end test:

        if (errno != 0 && d != NULL) {

...will not trigger, due to a spurious errno ... because the check on d won't
trigger.


 The problem is that I don't see how Python does the right thing if readdir()
_does_ return an error (I'm guessing it doesn't, but that's hard to test so
noone knew?) ... which makes me worry about NFS.


Comment 3 James Antill 2008-06-15 05:20:43 UTC
 In createrepo. Can you try changing __init__.py:_os_path_walk():

        try:
            names = os.listdir(top)
        except os.error:
            return

...into:

        try:
          try:
            names = os.listdir(top)
          except os.error:
            names = os.listdir(top)
        except os.error:
            return


Comment 4 James Antill 2008-06-15 05:26:48 UTC
 Also, I just tried:

% for i in $(seq 1 128); repomanage -c --old /tmp/f8updates > /dev/null
% du -sh /tmp/f8updates/packages 
404M    /tmp/f8updates/packages

...and got nothing.


Comment 5 Philip Spencer 2008-06-15 19:19:47 UTC
I have logged it upstream as issue #3115 at bugs.python.org.

Re "will not trigger ... because the check on d won't trigger ... I don't see
how Python does the right thing if readdir() _does_ return an error"

The rest of the code is correct apart from the misplaced errno initialization.

Before the loop, d is set to PyList_New(0), a new python list.
If d is null (list init failed), the routine returns immediately.
So, by the time the loop is entered, d is known to be non-null.
The loop is exited in one of two ways:
 (1) readdir returns null, in which case d is still non-null upon loop exit,
 (2) a failure occurs during attempts to copy or convert a file name or to
     append it to the list, in which case d is explicitly set to null before
     exiting the loop.

At the end of the loop, a test is made for
    errno != 0  && d != NULL
The "d != NULL" means that the loop exited for reason #1 (by readdir
returning null instead of by a failure in another part of the code)
and the "errno != 0" means readdir encountered an error not just the
normal end of directory.

Therefore, the test "if (errno != 0 && d != NULL)" is a correct test for
"readdir encountered an error" -- as long as errno was set to 0 before
readdir.

Re: "In createrepo, can you try..." I can't, because I've already patched
python. But I do know that if I take out the "try:" altogether it will fail
with "Errno=12 (unable to allocate memory)".

Re "I just tried ... and got nothing" --- were you on an x86_64 machine?
It is only on x86_64 that I've seen glibc's malloc set errno=12 despite
succeeding. And it only happens sporadically. Memory allocation may be set
slightly differently on your machine.

In any event, given the possibility that malloc (or other system calls) can set
errno even when they succeed, the python code as it stands is obviously
incorrect because if errno is NOT initialized to 0 right before calling readdir,
then the test for errno != 0 upon loop exit for reason #1 is no longer an
accurate test for readdir returning an error.




   




Comment 6 James Antill 2008-06-15 22:22:24 UTC
"""
At the end of the loop, a test is made for
    errno != 0  && d != NULL
The "d != NULL" means that the loop exited for reason #1 (by readdir
returning null instead of by a failure in another part of the code)
"""

...you're right, for some reason I read that as "d == NULL" ... so I can see how
it could happen, and I don't see any possible problems with the fix. I'm
building the fix for Fed-9 now:

http://koji.fedoraproject.org/koji/taskinfo?taskID=663001

"""
Re "I just tried ... and got nothing" --- were you on an x86_64 machine?
"""

...yes, I'm on x86_64.


Comment 7 Fedora Update System 2008-06-15 23:23:34 UTC
python-2.5.1-26.fc9 has been submitted as an update for Fedora 9

Comment 8 Fedora Update System 2008-06-16 04:16:02 UTC
python-2.5.1-26.fc8 has been submitted as an update for Fedora 8

Comment 9 Fedora Update System 2008-06-16 23:31:43 UTC
python-2.5.1-26.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update python'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-5366

Comment 10 Fedora Update System 2008-07-04 03:39:21 UTC
python-2.5.1-26.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 11 Fedora Update System 2008-07-04 03:39:37 UTC
python-2.5.1-26.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Matt Domsch 2008-07-06 04:14:09 UTC
my FTBFS rebuilds in mock suffered a few failures with -25; will try now with -26.

Comment 13 Fedora Update System 2008-07-26 06:10:17 UTC
python-2.5.1-26.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Matt Domsch 2008-08-12 14:20:29 UTC
F-8 and F-9 updates pushed to stable, closing.