Bug 692888
| Summary: | gofer becomes unresponsive w/ "Too many open files" error | | |
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | dgao |
| Component: | user-experience | Assignee: | Jeff Ortel <jortel> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | dgao |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | unspecified | CC: | pmatilai, pthomas, skarmark, tsanders |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | Sprint 25 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-08-16 14:02:30 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 641987 | | |
Here's a better paste:

```
[root@pulp-qe-rhel6 ~]# cat /tmp/goferd.18867.lsof|sort|cut -c45-120
253,0 0 1046108 /tmp/goferd.18867.lsof
/dev/null
/dev/null
/dev/null
/dev/null
/dev/urandom
/
/
pipe
pipe
pipe
pipe
pipe
pipe
unused:54114->unused:amqp (ESTABLISHED)
unused:54115->unused:amqp (ESTABLISHED)
unused:39891->a23-1-44-218.deploy.akamaitechnologies.com:https (CLOSE_WAIT)
/tmp/tmprdE8y6 (deleted)
/tmp/tmpj5Fz2b (deleted)
/tmp/ffiUnpDvC
/var/lib/rpm/Triggername
/var/lib/rpm/Triggername
/var/lib/rpm/Providename
/var/lib/rpm/Providename
/var/lib/rpm/Providename
/var/lib/rpm/Providename
/var/lib/rpm/Providename
/var/lib/rpm/__db.003
/var/lib/rpm/Requireversion
/var/lib/rpm/Requireversion
/var/lib/yum/history/history-2011-01-17.sqlite
/var/lib/yum/history/history-2011-01-17.sqlite
/var/lib/rpm/Basenames
/var/lib/rpm/Basenames
/var/lib/rpm/Basenames
/var/lib/rpm/Basenames
/var/lib/rpm/Basenames
/var/cache/yum/x86_64/6Server/epel/8633060596fe26eb3061583aa8fde96a3f69910ef
/var/cache/yum/x86_64/6Server/epel/8633060596fe26eb3061583aa8fde96a3f69910ef
/var/cache/yum/x86_64/6Server/epel/8633060596fe26eb3061583aa8fde96a3f69910ef
/var/lib/rpm/Group
/var/lib/rpm/Group
/var/lib/rpm/Installtid
/var/lib/rpm/Installtid
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Packages
/var/lib/rpm/Requirename
/var/lib/rpm/Requirename
/var/cache/yum/x86_64/6Server/package_test_repo/primary.sqlite
/var/cache/yum/x86_64/6Server/errata_pkg_upload_repo/primary.sqlite
/var/log/gofer/agent.log
/var/cache/yum/x86_64/6Server/package_group_test_repo/primary.sqlite
/var/lib/rpm/__db.002
/var/cache/yum/x86_64/6Server/rhel-x86_64-server-6.0.z/8c584af3a1bc5d9a36833
/var/cache/yum/x86_64/6Server/rhel-x86_64-server-6.0.z/8c584af3a1bc5d9a36833
/var/cache/yum/x86_64/6Server/rhel-x86_64-server-6.0.z/8c584af3a1bc5d9a36833
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/Name
/var/lib/rpm/__db.001
/var/lib/rpm/Filedigests
/var/lib/rpm/Filedigests
/var/log/pulp/client.log
/var/cache/yum/x86_64/6Server/f13test/primary.xml.gz.sqlite
/var/cache/yum/x86_64/6Server/f13test/primary.xml.gz.sqlite
/var/cache/yum/x86_64/6Server/f13test/primary.xml.gz.sqlite
/var/cache/yum/x86_64/6Server/fedora-pulp-testing/primary.xml.gz.sqlite
/var/cache/yum/x86_64/6Server/fedora-pulp-testing/primary.xml.gz.sqlite
/var/cache/yum/x86_64/6Server/fedora-pulp-testing/primary.xml.gz.sqlite
/var/lib/rpm/Dirnames
/var/lib/rpm/Dirnames
/tmp/ffiUnpDvC (deleted)
/var/lib/rpm/Sigmd5
/var/lib/rpm/Sigmd5
/var/lib/rpm/Provideversion
/var/lib/rpm/Provideversion
/var/log/yum.log
/var/log/yum.log
/var/log/yum.log
/var/lib/rpm/__db.004
/var/lib/rpm/Sha1header
/var/lib/rpm/Sha1header
/usr/bin/python
/usr/lib/locale/locale-archive
socket
```

I think what happens here is that the bind invokes the RepoLib code, which is concurrency-protected by a lockfile. I suspect that when we get "too many open files", this lock acquisition hangs. I'll verify/fix this. But since gofer executes RMI calls in a single thread, when the bind hangs it blocks subsequent RMI calls like Package.install().

Looks like yum is leaking ~10 file descriptors per yum run:
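The single-thread dispatch behavior described above can be modeled in a few lines — a toy sketch, not gofer's actual code (the class and method names are illustrative):

```python
from collections import deque

class SingleThreadDispatcher(object):
    """Toy model of gofer's single-threaded RMI dispatch: requests run
    strictly in arrival order, so a call that never returns (like the
    hung bind) starves everything queued behind it."""

    def __init__(self):
        self.pending = deque()
        self.completed = []

    def submit(self, name, fn):
        # Requests only queue up; nothing runs concurrently.
        self.pending.append((name, fn))

    def run(self):
        while self.pending:
            name, fn = self.pending.popleft()
            fn()  # if this blocks forever, later calls never start
            self.completed.append(name)
```

If `bind` hangs inside `fn()`, `Package.install` never even begins, which matches the observed behavior.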
/var/lib/rpm
/var/yum/cache
I assume (python) rpm is responsible for /var/lib/rpm. I'll get with Florian Festi about the RPM leaks and Seth and/or James about yum.
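A leak like this can also be measured without lsof by counting entries in /proc/&lt;pid&gt;/fd between yum runs — a minimal sketch (Linux-only; the helper name is mine, not from the bug):

```python
import os

def open_fd_count(pid=None):
    """Count open file descriptors for a process via /proc (Linux-only)."""
    pid = pid or os.getpid()
    return len(os.listdir('/proc/%d/fd' % pid))

# Sample this before and after each YumBase use; a number that grows
# by ~10 per run confirms the descriptor leak described above.
```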
===========================
Test program:
```python
import os
from time import sleep
from yum import YumBase

def install(packageinfo=['zsh']):
    installed = []
    yb = YumBase()
    for info in packageinfo:
        if isinstance(info, list):
            pkgs = yb.pkgSack.returnNewestByNameArch(tuple(info))
        else:
            pkgs = yb.pkgSack.returnNewestByName(info)
        for p in pkgs:
            installed.append(str(p))
            yb.tsInfo.addInstall(p)
    yb.resolveDeps()
    yb.processTransaction()
    # NOTE: reboot_suggested, cfg and self below come from the agent code
    # this was copied from; they are undefined in this standalone script.
    if reboot_suggested:
        cfg_assumeyes = cfg.client.assumeyes
        if cfg_assumeyes in ["True", "False"]:
            assumeyes = eval(cfg_assumeyes)
        else:
            assumeyes = assumeyes
        if assumeyes is True:
            self.__schedule_reboot()
            return (installed, {'reboot_performed': True})
        else:
            return (installed, {'reboot_performed': False})
    return (installed, None)

if __name__ == '__main__':
    print os.getpid()
    for i in range(0, 10):
        print i
        sleep(1)
    for i in range(0, 10):
        try:
            install()
        except Exception, e:
            print e
        print 'lsof now'
        sleep(5)
```
RESULTS:
```
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
      7      63     601
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     18     162    1840
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     29     261    3129
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     40     360    4418
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     40     360    4418
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     51     459    5707
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     73     657    8285
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
     95     855   10863
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
    106     954   12152
[jortel@localhost tmp]$ lsof | grep 6755 |grep -v "\.so" |wc
    117    1053   13441
```
This appears to cure the leak in the test program:
```diff
--- pyleak.py.orig	2011-04-05 16:32:00.000000000 +0300
+++ pyleak.py	2011-04-05 16:32:04.000000000 +0300
@@ -17,6 +17,7 @@
             yb.tsInfo.addInstall(p)
     yb.resolveDeps()
     yb.processTransaction()
+    yb.closeRpmDB()
     if reboot_suggested:
         cfg_assumeyes = cfg.client.assumeyes
         if cfg_assumeyes in ["True", "False"]:
```
But that shouldn't really be necessary as yb should fall out of scope and take any rpm(db) references down with it. Needs further investigation...
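Until that scoping question is resolved, the defensive pattern is to release the rpmdb in a finally block, so closeRpmDB() runs even when the transaction raises — a sketch using a stand-in class (FakeYumBase is hypothetical; substitute yum.YumBase in real code):

```python
class FakeYumBase(object):
    """Stand-in for yum.YumBase that records whether the rpmdb was closed."""
    def __init__(self):
        self.db_open = True

    def processTransaction(self):
        pass

    def closeRpmDB(self):
        self.db_open = False

def run_install(yb):
    try:
        yb.processTransaction()
    finally:
        # Release the rpmdb file descriptors even if the transaction fails.
        yb.closeRpmDB()
```

This way a failed resolveDeps/processTransaction can't strand the open rpmdb handles that lsof shows accumulating.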
Looks like after adding:

```diff
     yb.processTransaction()
+    yb.closeRpmDB()
```
Yum is still leaking one file descriptor, to /var/log/yum.log, which I would assume is a yum issue.
The remaining leak is in the logging setup. The function setFileLog() at loggingLevels.py:240 adds a FileHandler every time it's called, which means every time we use YumBase(), another FileHandler is added. Added a hack to purge yum FileHandlers until yum fixes this.

build: 0.161

On F13:

```
[root@pulp-nightly pulp]# lsof | grep "python" | grep "yum.log" | wc -l
22
[root@pulp-nightly pulp]# rpm -e patb
[root@pulp-nightly pulp]# pulp-admin package install --consumerid=fred -n patb
Created task id: 85749b75-7b34-11e0-9720-5452004fdd6d
Task is scheduled for: 2011-05-10 14:37
Waiting: [\]
[[[u'patb-0.1-1.x86_64'], None]] installed on fred
[root@pulp-nightly pulp]# lsof | grep "python" | grep "yum.log" | wc -l
23
```

The leaked file descriptor:

```
python 24692 root 32w REG 253,0 1210293 531303 /var/log/yum.log
[root@pulp-nightly pulp]# ps -elf | grep "24692"
0 S root 1581 32178 0 80 0 - 25800 pipe_w 14:34 pts/0 00:00:00 grep 24692
5 S root 24692 1 0 80 0 - 213854 futex_ 10:25 ? 00:02:24 python /usr/bin/goferd
```

So it seems like the agent is still leaking 1 file descriptor per install command.

When goferd starts, there are no file descriptors for /var/log/yum.log. After the first package install, lsof will show one open file for /var/log/yum.log. But subsequent package installs don't result in additional open files for /var/log/yum.log beyond the initial one; they are cleaned up by our hack/patch. So we don't seem to be leaking one per package install (beyond the first time).
build: 0.200

verified:

```
[root@preethi ~]# rpm -q pulp
pulp-0.0.213-1.fc14.noarch
[root@preethi upload]# lsof | grep "python" | grep "yum.log" | wc -l
0
[root@preethi upload]#
[root@preethi upload]# pulp-admin package install -n zsh --consumerid=preethi
Created task id: 72d8d48c-b87a-11e0-a230-002564a85a58
Task is scheduled for: None
Waiting: [\]
[[['zsh-4.3.10-6.fc14.x86_64'], None]] installed on preethi
[root@preethi upload]# lsof | grep "python" | grep "yum.log" | wc -l
1
[root@preethi upload]#
[root@preethi upload]# pulp-admin package install -n screen --consumerid=preethi
Created task id: e9da265c-b87f-11e0-b2b6-002564a85a58
Task is scheduled for: None
Waiting: [\]
[[['screen-4.0.3-15.fc12.x86_64', 'screen-4.0.3-15.fc12.x86_64'], None]] installed on preethi
[root@preethi upload]# lsof | grep "python" | grep "yum.log" | wc -l
1
[root@preethi upload]# ps -elf | grep "16724"
5 S root 16724 1 0 80 0 - 229033 futex_ 14:01 ? 00:00:22 python /usr/bin/goferd
0 S root 17258 14908 0 80 0 - 25844 pipe_w 14:45 pts/0 00:00:00 grep --color=auto 16724
```

Closing with Community Release 15, pulp-0.0.223-4.
This was caused by a CLI-initiated install command:

```
2011-03-31 14:24:57,627 [ERROR][test_consumer_remote] __call__() @ dispatcher.py:357 - [Errno 24] Too many open files: '/var/run/subsys/pulp/repolib.pid'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/gofer/messaging/dispatcher.py", line 354, in __call__
    retval = method(*args, **keywords)
  File "/usr/lib/gofer/plugins/pulp.py", line 183, in bind
    bind_data['repo'], bind_data['host_urls'], bind_data['gpg_keys'])
  File "/usr/lib/python2.6/site-packages/pulp/client/repolib.py", line 81, in bind
  File "/usr/lib/python2.6/site-packages/pulp/client/lock.py", line 169, in acquire
  File "/usr/lib/python2.6/site-packages/pulp/client/lock.py", line 57, in open
IOError: [Errno 24] Too many open files: '/var/run/subsys/pulp/repolib.pid'
```

```
[root@pulp-qe-rhel6-client gofer]# lsof |grep 3378|grep "/var/lib/rpm"
python 3378 root mem REG 253,0 753664 135488 /var/lib/rpm/__db.004
python 3378 root mem REG 253,0 1318912 135487 /var/lib/rpm/__db.003
python 3378 root mem REG 253,0 229376 135486 /var/lib/rpm/__db.002
python 3378 root mem REG 253,0 24576 135485 /var/lib/rpm/__db.001
python 3378 root 18r REG 253,0 20103168 130824 /var/lib/rpm/Packages
python 3378 root 19r REG 253,0 24576 130825 /var/lib/rpm/Name
python 3378 root 20r REG 253,0 1290240 130826 /var/lib/rpm/Providename
python 3378 root 22u REG 253,0 20103168 130824 /var/lib/rpm/Packages
python 3378 root 23u REG 253,0 24576 130825 /var/lib/rpm/Name
python 3378 root 28u REG 253,0 20103168 130824 /var/lib/rpm/Packages
python 3378 root 29r REG 253,0 20103168 130824 /var/lib/rpm/Packages
python 3378 root 30r REG 253,0 24576 130825 /var/lib/rpm/Name
python 3378 root 31r REG 253,0 1290240 130826 /var/lib/rpm/Providename
python 3378 root 32r REG 253,0 1536000 130838 /var/lib/rpm/Basenames
python 3378 root 33u REG 253,0 20103168 130824 /var/lib/rpm/Packages
python 3378 root 34u REG 253,0 24576 130825 /var/lib/rpm/Name
python 3378 root 35u REG 253,0 1536000 130838 /var/lib/rpm/Basenames
python 3378 root 43u REG 253,0 12288 131382 /var/lib/rpm/Triggername
python 3378 root 44u REG 253,0 16384 131379 /var/lib/rpm/Group
python 3378 root 45u REG 253,0 204800 131380 /var/lib/rpm/Requirename
python 3378 root 46u REG 253,0 1290240 130826 /var/lib/rpm/Providename
python 3378 root 47u REG 253,0 413696 131383 /var/lib/rpm/Dirnames
python 3378 root 48u REG 253,0 139264 131384 /var/lib/rpm/Requireversion
python 3378 root 49u REG 253,0 520192 131385 /var/lib/rpm/Provideversion
python 3378 root 50u REG 253,0 16384 131386 /var/lib/rpm/Installtid
python 3378 root 51u REG 253,0 49152 131387 /var/lib/rpm/Sigmd5
python 3378 root 52u REG 253,0 81920 131388 /var/lib/rpm/Sha1header
python 3378 root 53u REG 253,0 2621440 131389 /var/lib/rpm/Filedigests
. . .
[root@pulp-qe-rhel6-client gofer]# lsof |grep 3378|grep "/var/lib/rpm"| wc -l
693
[root@pulp-qe-rhel6-client gofer]# lsof | grep "3378"| wc -l
1137
```