Created attachment 1144243
lsof in the middle of this run, after I increased the limit to 4096

Description of problem:
HTTP connections create many open files until we get "IOError: [Errno 24] Too many open files"

Version-Release number of selected component (if applicable):
HTTP connections create many open files until we get "IOError: [Errno 24] Too many open files"

How reproducible:
100%

Steps to Reproduce:
1. Run our Tier1 test
2. Make sure with "ulimit -a" that the open files limit is: open files (-n) 1024

Actual results:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/4.0-GE-Tier1-x86/5/consoleFull

03:07:18 ETraceback (most recent call last):
03:07:18   File "/usr/bin/py.test", line 9, in <module>
03:07:18     load_entry_point('pytest==2.8.6', 'console_scripts', 'py.test')()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/config.py", line 48, in main
03:07:18     return config.hook.pytest_cmdline_main(config=config)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 724, in __call__
03:07:18     return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 338, in _hookexec
03:07:18     return self._inner_hookexec(hook, methods, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 333, in <lambda>
03:07:18     _MultiCall(methods, kwargs, hook.spec_opts).execute()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 596, in execute
03:07:18     res = hook_impl.function(*args)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/main.py", line 115, in pytest_cmdline_main
03:07:18     return wrap_session(config, _main)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/main.py", line 110, in wrap_session
03:07:18     exitstatus=session.exitstatus)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 724, in __call__
03:07:18     return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 338, in _hookexec
03:07:18     return self._inner_hookexec(hook, methods, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 333, in <lambda>
03:07:18     _MultiCall(methods, kwargs, hook.spec_opts).execute()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 595, in execute
03:07:18     return _wrapped_call(hook_impl.function(*args), self.execute)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 249, in _wrapped_call
03:07:18     wrap_controller.send(call_outcome)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/terminal.py", line 361, in pytest_sessionfinish
03:07:18     outcome.get_result()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 279, in get_result
03:07:18     _reraise(*ex) # noqa
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 264, in __init__
03:07:18     self.result = func()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 596, in execute
03:07:18     res = hook_impl.function(*args)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/junitxml.py", line 361, in pytest_sessionfinish
03:07:18     logfile = open(self.logfile, 'w', encoding='utf-8')
03:07:18   File "/usr/lib64/python2.7/codecs.py", line 881, in open
03:07:18     file = __builtin__.open(filename, mode, buffering)
03:07:18 IOError: [Errno 24] Too many open files: '/var/lib/jenkins/workspace/4.0-GE-Tier1-x86/xunit_output.xml'

See the attached file with the lsof output.

Expected results:
These files are closed after some timeout.

Additional info:
In 3.6 we use the same infrastructure and did not get these errors.
Seems to be a Jenkins issue.
After investigating with Juan, we found that the problem is that in 4.0 the server doesn't send the "Connection: close" header for HEAD requests.

We tried running API calls:
- against a 3.6 engine: not reproduced
- from a RHEL 6 + Python 2.6 machine against a remote 4.0 engine: reproduced
- from a RHEL 7 + Python 2.7 machine, locally or remotely, against a 4.0 engine: reproduced
It is true that the server doesn't send the "Connection: close" header like it used to in version 3.6. We should probably change that, to avoid other similar issues. But after studying the issue I believe that it can be solved in the client, by making sure that it consumes the (empty) body of the HEAD response. As the client is using the Python "httplib" module, I'd suggest always doing the following for HEAD requests:

  connection.request('HEAD', ...)
  response = connection.getresponse()
  response.read()

That call to "read" should make sure that the body is consumed and the connection released.
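For reference, the suggested HEAD pattern can be tried end to end with a small sketch. It assumes Python 3, where "httplib" is named "http.client", and Python 3.7 or newer for ThreadingHTTPServer. The local server, the HeadHandler class and the head_and_drain helper are hypothetical names made up for this illustration; they are not part of the engine or of the testing framework.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HeadHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the server: answers HEAD over keep-alive."""

    protocol_version = "HTTP/1.1"  # persistent connections are allowed

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def head_and_drain(host, port, path="/"):
    """Send a HEAD request, read the (empty) body, then close the socket."""
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        connection.request("HEAD", path)
        response = connection.getresponse()
        body = response.read()  # empty for HEAD, but finishes the response
        return response.status, body
    finally:
        connection.close()  # releases the file descriptor explicitly


server = ThreadingHTTPServer(("127.0.0.1", 0), HeadHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, body = head_and_drain("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()
```

The explicit close() in the "finally" block is an addition of this sketch: it releases the socket even if the request raises, independently of what the server sends.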
Note that my analysis in comment 3 wasn't correct. The problem wasn't related to the consumption of the response body. It was a connection leak in the testing framework.

This leak wasn't problematic with version 3.6 of the engine, as the connections were leaked, but closed, so they didn't consume any resource other than memory. But with version 4 of the engine the connections are leaked, but they stay open, because the engine doesn't send the "Connection: close" response header for failed requests. This means that the leaked connections consume file descriptors and sockets, thus generating a real problem.

That leak in the testing framework has been fixed.

We also want to modify the engine so that it sends the "Connection: close" response header for failed requests, which is why we are keeping this bug open. However, that may be difficult, or even impossible, because that header is managed by the application server, not by the application. We are investigating it, but we may eventually close the bug as CANTFIX.
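The difference between the two behaviours can be reproduced with a minimal sketch, assuming Python 3.7 or newer ("http.client" is the Python 3 name of "httplib"). The local server, make_handler and leaked_socket_is_open are hypothetical names invented for this illustration; the real engine and testing framework are not involved.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(send_close):
    """Build a hypothetical handler that fails every GET with 404."""

    class FailingHandler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"

        def do_GET(self):
            self.send_response(404)
            self.send_header("Content-Length", "0")
            if send_close:
                # Mimics the 3.6 setup: the response asks to close.
                self.send_header("Connection", "close")
            self.end_headers()

        def log_message(self, *args):
            pass  # keep the example quiet

    return FailingHandler


def leaked_socket_is_open(send_close):
    """Do one request without closing; report if the socket is still held."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(send_close))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    connection = http.client.HTTPConnection(
        "127.0.0.1", server.server_port, timeout=5
    )
    try:
        connection.request("GET", "/missing")
        connection.getresponse().read()
        # The caller "forgets" to close here. http.client drops the socket
        # on its own only when the response said "Connection: close".
        return connection.sock is not None
    finally:
        connection.close()  # clean up so the example itself does not leak
        server.shutdown()
        server.server_close()


open_with_close_header = leaked_socket_is_open(send_close=True)
open_with_keep_alive = leaked_socket_is_open(send_close=False)
```

In this sketch a forgotten connection only keeps its file descriptor in the keep-alive case, which matches the description above: the same client-side leak is harmless with "Connection: close" and exhausts the open files limit without it.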
(In reply to Juan Hernández from comment #7)
> Note that my analysis in comment 3 wasn't correct. The problem wasn't
> related to the consumption of the response body. It was a connection leak
> in the testing framework.
>
> This leak wasn't problematic with version 3.6 of the engine, as the
> connections were leaked, but closed, so they didn't consume any resource
> other than memory. But with version 4 of the engine the connections are
> leaked, but they stay open, because the engine doesn't send the "Connection:
> close" response header for failed requests. This means that the leaked
> connections consume file descriptors and sockets, thus generating a real
> problem.

AFAIU, a 3.x engine does send a "Connection: close" response header for failed requests, which makes this issue a regression in the behaviour we had before. I'm marking this issue as a Regression, based on this info.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Moving from 4.0 alpha to 4.0 beta, since 4.0 alpha has already been released and the bug is not ON_QA.
Looking at this more deeply, I see that the "Connection: close" response header is added by the Apache web server, not by the application server, and this happens when running on EL6. The difference between EL6 and the other distributions is that the version of Apache used there is 2.2 instead of 2.4. The EL6 packaging of that version of Apache includes the following configuration:

  KeepAlive Off

This completely disables the use of persistent connections, so the "Connection: close" header is sent for all responses, not only failed ones. Newer versions of Apache (2.4 and newer) don't include this directive, so persistent connections are enabled.

We could explicitly disable persistent connections by adding "KeepAlive Off" as part of the changes that engine-setup makes to the system, but this would affect all the applications deployed to the web server. We can also disable it for specific locations, for example only for the API, with something like this inside /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:

  SetEnvIf Request_URI "^/(ovirt-engine/)?api(/.*)?$" nokeepalive

But doing this would actually mean a change in behavior for users that are already using EL7. As persistent connections improve performance and are a good thing, I'm in favor of not changing this configuration, and of adding a release note explaining that this has changed and how to restore the previous behavior for those users who may run into an issue.

As there will be no change to the source, I'm moving this to ON_QA.
This should be documented in the release notes.