Python's os.listdir function will fail randomly when trying to list large directories. As a result, commands like createrepo and repomanage will often fail when invoked on a 64-machine on, for example, the current Fedora 8 updates directory. To reproduce: Copy the current Fedora 8 updates directory to a location /tmp/f8updates. Be on a 64-bit machine running the latest kernel. Run "repomanage -c --old /tmp/f8updates" 20 times in a row. On all 5 machines I tried this on, at least one time out of the 20 the command would fail with a "cannot read directory" error message. With createrepo there would be a silent failure at least one time out of 20 (the repo files would be created listing no packages at all!) The cause: Python's os.listdir function has the following code in Modules/posixmodule.c beginning at line 2163: errno = 0 ... for (;;) { Py_BEGIN_ALLOW_THREADS ep = readdir(dirp); Py_END_ALLOW_THREADS if (ep == NULL) break; ... a bunch of other stuff, including PyString_FromStringAndSize which calls malloc ... } if (errno != 0 && d != NULL) { The assumption is that errno will be nonzero only if readdir failed. However, this is not the case. GLibc's malloc will, in some rare cases, set errno even when it succeeds. So, during one pass through the loop errno gets set to ENOMEM and then it is still set to ENOMEM when the final readdir returns null at the end of the directory listing. The fix is to move the line "errno = 0" from outside the loop to right before the readdir inside the loop. That is the only way to ensure that, when the loop is exited with readdir returning null, the condition "errno != 0" is equivalent to "readdir actually failed." The attached patch does this. With the attached patch, we have experienced no more of the failures from createrepo or repomanage. Version-Release number of selected component (if applicable): Python-2.5.1-15.fc8, but Python 2.5.2 is affected as well. How reproducible: Reproducible somwhat randomly but in our testing we have never been able to run the commands above more than 20 times without seeing the problem. It seems to be only on 64-bit machines, and only in certain circumstances, that glibc's malloc returns ENOMEM even when it succeeds -- I suspect because it first tries to meet large memory requests with mmap and after many requests the processes mmap limit may be exceeded (mmap fails, setting errno) and malloc then consolidates its maps and allocates the memory a different way -- succeeding this time, but not clearing errno. I don't know for sure. I do know that createrepo/repomanage work fine on 32-bit machines and work fine on repositories under about 5000 packages. Steps to Reproduce: see above.
Created attachment 309374 [details] Patch to fix the problem of listdir mistakenly thinking error occurred
Can you log this bug upstream ... as it isn't "fixed" in 2.6.x. I'm also unsure about the patch: 1. Just before the loop the code is: if ((d = PyList_New(0)) == NULL) { closedir(dirp); PyMem_Free(name); return NULL; } ...and as far as I can see d is _only_ set to NULL again, when one of the Py_* functions fails (from a memory related problem). 2. Given #1 the end test: if (errno != 0 && d != NULL) { ...will not trigger, due to a spurious errno ... because the check on d won't trigger. The problem is that I don't see how Python does the right thing if readdir() _does_ return an error (I'm guessing it doesn't, but that's hard to test so noone knew?) ... which makes me worry about NFS.
In createrepo. Can you try changing __init__.py:_os_path_walk(): try: names = os.listdir(top) except os.error: return ...into: try: try: names = os.listdir(top) except os.error: names = os.listdir(top) except os.error: return
Also, I just tried: % for i in $(seq 1 128); repomanage -c --old /tmp/f8updates > /dev/null % du -sh /tmp/f8updates/packages 404M /tmp/f8updates/packages ...and got nothing.
I have logged it upstream as issue #3115 at bugs.python.org. Re "will not trigger ... because the check on d won't trigger ... I don't see how Python does the right thing if readdir() _does_ return an error" The rest of the code is correct apart from the misplaced errno initialization. Before the loop, d is set to PyList_New(0), a new python list. If d is null (list init failed), the routine returns immediately. So, by the time the loop is entered, d is known to be non-null. The loop is exited in one of two ways: (1) readdir returns null, in which case d is still non-null upon loop exit, (2) a failure occurs during attempts to copy or convert a file name or to append it to the list, in which case d is explicitly set to null before exiting the loop. At the end of the loop, a test is made for errno != 0 && d != NULL The "d != NULL" means that the loop exited for reason #1 (by readdir returning null instead of by a failure in another part of the code) and the "errno != 0" means readdir encountered an error not just the normal end of directory. Therefore, the test "if (errno != 0 && d != NULL)" is a correct test for "readdir encountered an error" -- as long as errno was set to 0 before readdir. Re: "In createrepo, can you try..." I can't, because I've already patched python. But I do know that if I take out the "try:" altogether it will fail with "Errno=12 (unable to allocate memory)". Re "I just tried ... and got nothing" --- were you on an x86_64 machine? It is only on x86_64 that I've seen glibc's malloc set errno=12 despite succeeding. And it only happens sporadically. Memory allocation may be set slightly differently on your machine. In any event, given the possibility that malloc (or other system calls) can set errno even when they succeed, the python code as it stands is obviously incorrect because if errno is NOT initialized to 0 right before calling readdir, then the test for errno != 0 upon loop exit for reason #1 is no longer an accurate test for readdir returning an error.
""" At the end of the loop, a test is made for errno != 0 && d != NULL The "d != NULL" means that the loop exited for reason #1 (by readdir returning null instead of by a failure in another part of the code) """ ...you're right, for some reason I read that as "d == NULL" ... so I can see how it could happen, and I don't see any possible problems with the fix. I'm building the fix for Fed-9 now: http://koji.fedoraproject.org/koji/taskinfo?taskID=663001 """ Re "I just tried ... and got nothing" --- were you on an x86_64 machine? """ ...yes, I'm on x86_64.
python-2.5.1-26.fc9 has been submitted as an update for Fedora 9
python-2.5.1-26.fc8 has been submitted as an update for Fedora 8
python-2.5.1-26.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update python'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-5366
python-2.5.1-26.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.
python-2.5.1-26.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
my FTBFS rebuilds in mock suffered a few failures with -25; will try now with -26.
F-8 and F-9 updates pushed to stable, closing.