Bug 1324549 - Response to HEAD requests does not contain a "Connection: close" header which leads to "IOError: [Errno 24] Too many open files"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RestAPI
Version: 4.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.0.0-beta
Target Release: 4.0.0
Assignee: Juan Hernández
QA Contact: Kobi Hakimi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-06 15:20 UTC by Kobi Hakimi
Modified: 2016-06-16 12:23 UTC (History)

Fixed In Version:
Doc Type: Release Note
Doc Text:
The default configuration of the web server in EL6 disables persistent connections by adding the following parameter to the /etc/httpd/conf/httpd.conf file:

  KeepAlive Off

This means that programs using the API will always receive the following response header, for all requests:

  Connection: close

In EL7 persistent connections are enabled by default, which is good for performance in general, but may cause issues for programs that expect the "Connection: close" header. We encourage users to update those programs so that they don't require the header, but if that isn't possible then the previous behavior of the server can be restored by adding the "KeepAlive Off" parameter to the web server configuration.
Clone Of:
Environment:
Last Closed: 2016-06-16 12:23:10 UTC
oVirt Team: Infra
rule-engine: ovirt-4.0.0+
rule-engine: blocker+
mgoldboi: planning_ack+
juan.hernandez: devel_ack+
pstehlik: testing_ack+


Attachments
lsof in the middle of this run after I increase the limit to 4096 (227.57 KB, text/plain)
2016-04-06 15:20 UTC, Kobi Hakimi

Description Kobi Hakimi 2016-04-06 15:20:07 UTC
Created attachment 1144243 [details]
lsof in the middle of this run after I increase the limit to 4096

Description of problem:
HTTP connections create many open files until we get "IOError: [Errno 24] Too many open files".

Version-Release number of selected component (if applicable):
http connection create many open files until we got "IOError: [Errno 24] Too many open files"

How reproducible:
100%

Steps to Reproduce:
1. Run our Tier1 test
2. Make sure with "ulimit -a" the open files are:
open files                      (-n) 1024
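
The limit from step 2 can also be read from inside a Python process. The following is a minimal sketch, not part of the original test suite: the function names are hypothetical, and the descriptor count relies on /proc, so it assumes a Linux host.

```python
import os
import resource


def open_file_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process (Linux only, via /proc)."""
    return len(os.listdir("/proc/self/fd"))


soft, hard = open_file_limit()
# The soft limit is the value that "ulimit -n" reports for the shell.
print("open files: soft=%d hard=%d in use=%d" % (soft, hard, open_fd_count()))
```

Logging the "in use" value periodically during a run shows whether descriptors accumulate before the limit is reached.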

Actual results:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/4.0-GE-Tier1-x86/5/consoleFull

03:07:18 ETraceback (most recent call last):
03:07:18   File "/usr/bin/py.test", line 9, in <module>
03:07:18     load_entry_point('pytest==2.8.6', 'console_scripts', 'py.test')()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/config.py", line 48, in main
03:07:18     return config.hook.pytest_cmdline_main(config=config)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 724, in __call__
03:07:18     return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 338, in _hookexec
03:07:18     return self._inner_hookexec(hook, methods, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 333, in <lambda>
03:07:18     _MultiCall(methods, kwargs, hook.spec_opts).execute()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 596, in execute
03:07:18     res = hook_impl.function(*args)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/main.py", line 115, in pytest_cmdline_main
03:07:18     return wrap_session(config, _main)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/main.py", line 110, in wrap_session
03:07:18     exitstatus=session.exitstatus)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 724, in __call__
03:07:18     return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 338, in _hookexec
03:07:18     return self._inner_hookexec(hook, methods, kwargs)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 333, in <lambda>
03:07:18     _MultiCall(methods, kwargs, hook.spec_opts).execute()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 595, in execute
03:07:18     return _wrapped_call(hook_impl.function(*args), self.execute)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 249, in _wrapped_call
03:07:18     wrap_controller.send(call_outcome)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/terminal.py", line 361, in pytest_sessionfinish
03:07:18     outcome.get_result()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 279, in get_result
03:07:18     _reraise(*ex)  # noqa
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 264, in __init__
03:07:18     self.result = func()
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/vendored_packages/pluggy.py", line 596, in execute
03:07:18     res = hook_impl.function(*args)
03:07:18   File "/usr/lib/python2.7/site-packages/_pytest/junitxml.py", line 361, in pytest_sessionfinish
03:07:18     logfile = open(self.logfile, 'w', encoding='utf-8')
03:07:18   File "/usr/lib64/python2.7/codecs.py", line 881, in open
03:07:18     file = __builtin__.open(filename, mode, buffering)
03:07:18 IOError: [Errno 24] Too many open files: '/var/lib/jenkins/workspace/4.0-GE-Tier1-x86/xunit_output.xml'

See the attached file with the lsof output.

Expected results:
These files should be closed after some timeout.

Additional info:
In 3.6 we use the same infrastructure and didn't get these errors.

Comment 1 Gil Klein 2016-04-06 15:25:47 UTC
Seems to be a Jenkins issue.

Comment 2 Kobi Hakimi 2016-04-20 11:02:52 UTC
After investigating with Juan we found that the problem is that in 4.0 the server doesn't send the "Connection: close" header for HEAD requests.

We tried to run API calls:
 - against a 3.6 engine - not reproduced
 - from a rhel6 + python 2.6 machine against a remote 4.0 engine - reproduced
 - from a rhel7 + python 2.7 machine, locally or against a remote 4.0 engine - reproduced

Comment 3 Juan Hernández 2016-04-20 16:28:00 UTC
It is true that the server doesn't send the "Connection: close" header like it used to do in version 3.6. We should probably change that, to avoid other similar issues. But after studying the issue I believe that it can be solved in the client, by making sure that it consumes the (empty) body of the HEAD response. As the client is using the Python "httplib" module, I'd suggest always doing the following for HEAD requests:

  connection.request('HEAD', ...)
  response = connection.getresponse()
  response.read()

That call to "read" should make sure that the body is consumed, and the connection released.

Comment 7 Juan Hernández 2016-04-25 11:03:44 UTC
Note that my analysis in comment 3 wasn't correct. The problem wasn't related to the consumption of the response body. It was a connection leak in the testing framework.

This leak wasn't problematic with version 3.6 of the engine, as the connections were leaked, but closed, so they didn't consume any resource other than memory. But with version 4 of the engine the connections are leaked, but they stay open, because the engine doesn't send the "Connection: close" response header for failed requests. This means that the leaked connections consume file descriptors and sockets, thus generating a real problem.

That leak in the testing framework has been fixed. We also want to modify the engine so that it sends the "Connection: close" response header for failed requests, which is why we are keeping this bug open. However, that may be difficult, or even impossible, because that header is managed by the application server, not by the application. We are investigating it, but we may eventually close the bug as CANTFIX.
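
A client can avoid this kind of leak by closing each connection itself instead of relying on the server to do so. The following is a minimal sketch, not the actual fix made in the testing framework: it uses "http.client" (the Python 3 name of "httplib"), and the function name and the 10 second timeout are hypothetical.

```python
import contextlib
import http.client


def head_status(host, port, path):
    """Send one HEAD request and always close the connection afterwards."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    # closing() calls connection.close() even if the request raises, so the
    # socket is released whether or not the server keeps the connection alive.
    with contextlib.closing(connection):
        connection.request("HEAD", path)
        response = connection.getresponse()
        response.read()
        return response.status
```

With this pattern the file descriptor is released as soon as the function returns, regardless of whether the response carries "Connection: close".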

Comment 8 Gil Klein 2016-05-02 06:35:37 UTC
(In reply to Juan Hernández from comment #7)
> Note that my analysis in comment 3 wasn't correct. The problem wasn't
> related to the consumption of the response body. It was a connection leak in
> in the testing framework.
> 
> This leak wasn't problematic with version 3.6 of the engine, as the
> connections were leaked, but closed, so they didn't consume any resource
> other than memory. But with version 4 of the engine the connections are
> leaked, but they stay open, because the engine doesn't send the "Connection:
> close" response header for failed requests. This means that the leaked
> connections consume file descriptors and sockets, thus generating a real
> problem.
AFAIU, a 3.x engine does send a "Connection: close" response header for failed requests, which makes this issue a regression from the behaviour we had before.

I'm marking this issue as a Regression, based on this info.

Comment 9 Red Hat Bugzilla Rules Engine 2016-05-02 06:35:42 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 10 Sandro Bonazzola 2016-05-02 10:06:07 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 11 Juan Hernández 2016-05-03 10:32:18 UTC
Looking at this more deeply, I see that the "Connection: close" response header is added by the Apache web server, not by the application server, and only when running on EL6. The difference between EL6 and the other distributions is that the version of Apache used there is 2.2 instead of 2.4. The EL6 packaging of that version of Apache includes the following configuration:

  KeepAlive Off

This completely disables the use of persistent connections, so the "Connection: close" header is sent for all responses, not only failed ones.

Newer versions of Apache (2.4 and newer) don't include this directive, so persistent connections are enabled.
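
Whether a given server still sends the header can be checked from a client. The following is a minimal sketch under stated assumptions: the function name and default path are hypothetical, it uses "http.client" (the Python 3 name of "httplib"), and it only detects an explicit "Connection: close" response header.

```python
import contextlib
import http.client


def server_closes_connection(host, port, path="/"):
    """Return True if the HEAD response carries an explicit "Connection: close"."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    with contextlib.closing(connection):
        connection.request("HEAD", path)
        response = connection.getresponse()
        response.read()
        # The Connection header is a comma-separated list of tokens.
        header = response.getheader("Connection", "")
        tokens = [token.strip().lower() for token in header.split(",")]
        return "close" in tokens
```

Based on the description above, a server configured with "KeepAlive Off" would be expected to return True here, and one with persistent connections enabled would return False.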

We could explicitly disable persistent connections by adding "KeepAlive Off" as part of the changes that engine-setup makes to the system, but this would affect all the applications deployed to the web server.

We can also disable it for specific locations, for example only for the API, with something like this inside /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:

  SetEnvIf Request_URI "^/(ovirt-engine/)?api(/.*)?$" nokeepalive

But doing this would actually mean a change in behavior for users that are already using EL7.

As persistent connections improve performance and are a good thing, I'm in favor of not changing this configuration, and adding a release note explaining that this has changed, and how to restore the previous behavior for those users that may find an issue.

As there will be no change to the source I'm moving to ON_QA.

Comment 12 Kobi Hakimi 2016-06-16 12:23:10 UTC
This should be documented in the release notes.

