Bug 1391375 - Please revisit the proxy timeout related settings
Summary: Please revisit the proxy timeout related settings
Keywords:
Status: CLOSED DUPLICATE of bug 1289315
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assignee: Chris Jones
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Duplicates: 1593811
Depends On:
Blocks:
 
Reported: 2016-11-03 08:13 UTC by Attila Fazekas
Modified: 2019-02-21 10:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-23 12:28:18 UTC
Target Upstream Version:
Embargoed:



Description Attila Fazekas 2016-11-03 08:13:24 UTC
Description of problem:

Currently we are using these proxy timeout settings for all services:
defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
  timeout  http-request 10s
  timeout  queue 1m
  timeout  connect 10s
  timeout  client 1m
  timeout  server 1m
  timeout  check 10s


Both glance and heat have API calls that cause the service to connect to an external service, for example:
 
time glance --os-image-api-version 1  image-create --disk-format=qcow2  --container-format=bare --copy-from=http://example.com
504 Gateway Time-out: The server didn't respond in time. (HTTP N/A)

real	1m1.020s
user	0m0.531s
sys	0m0.095s


glance does not need to save the full URL content before it returns, but it wants to do at least a HEAD request first.


BTW, long-running API calls are also possible in other cases, for example when you ask for something involving ~10k items.


In my case the server did not respond in time because it first tried IPv6 (I have no external IPv6 connectivity, only local), but the response can be delayed for other reasons as well (iptables DROP, ...).

I would not have noticed the issue, apart from it being slow, if the proxy timeout were 3 minutes.

Actual results:
haproxy is impatient, or the services do not have built-in response deadlines.
 

haproxy disconnects from the backend server before it generates the response,
and returns a 504 to the client instead of the real response.


The Glance API log has this kind of traceback:
2016-11-03 07:33:35.043 12918 INFO eventlet.wsgi.server [req-749fa599-f55a-432b-aac7-966fcb5381c4 cab7f2d4f87e46a493775b3fc87be0d4 2ab73c577de146f2816b311485b63340 - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 512, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 453, in write
    wfile.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 385, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 379, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 366, in _send_loop
    return send_method(data, *args)
error: [Errno 104] Connection reset by peer


When I connected directly to the backend server, it worked in 130 sec.

Expected results:

haproxy should not break the connection to the backend server while the server is still able to respond. One minute without a response does not mean the server is dead.
Three minutes does not mean it either, but in the above case it would have been enough.

Alternatively, all services must somehow be convinced to ALWAYS respond within 50 sec (less than the proxy timeout), even if the response is a 503. It is not haproxy's responsibility to kill a (potentially) valid in-progress request.
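
For illustration, raising the client/server timeouts in the defaults section could look roughly like this (3m is an assumed value based on the suggestion above, not a tested or shipped setting):

defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
  timeout  http-request 10s
  timeout  queue 1m
  timeout  connect 10s
  # assumed values: raised from 1m so slow-but-valid backends are not cut off
  timeout  client 3m
  timeout  server 3m
  timeout  check 10s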

Comment 1 Fabio Massimo Di Nitto 2016-11-03 15:23:22 UTC
Ryan,

can you please check current OSP10 haproxy.conf and see what changes could/should be done?

Comment 2 Ryan O'Hara 2016-11-04 16:22:37 UTC
The timeout that is coming into play here is the 'timeout  server 1m'. We've had other requests to increase this timeout in the past, but ultimately we chose not to because it never seems to be enough. If we increase the timeout to 2 minutes, next month somebody will want it to be 3 minutes, and next year 5 minutes.

If glance and heat need longer timeouts, I suggest we set them in the proxy definition (listen or frontend block). This will override the timeout from defaults, but only for those proxies.
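
For example, a per-service override might look roughly like this (the 'glance_api' block name, the addresses, and the 3m value are illustrative placeholders, not the actual tripleo-generated configuration):

listen glance_api
  bind 10.0.0.10:9292
  # these override the defaults section for this proxy only
  timeout  client 3m
  timeout  server 3m
  server overcloud-controller-0 172.16.2.11:9292 check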

Comment 3 Attila Fazekas 2016-11-11 08:57:16 UTC
Another 504 hit:
https://bugzilla.redhat.com/show_bug.cgi?id=1394155

What would happen if you just deleted that option?

Comment 4 Attila Fazekas 2016-11-11 10:11:34 UTC
Think about it in a different way for a minute: what if I say the proxy has to be at least as patient as the client?

Comment 5 Ryan O'Hara 2016-11-14 14:11:40 UTC
(In reply to Attila Fazekas from comment #4)
> Think about it in a different way for a minute: what if I say the proxy has
> to be at least as patient as the client?

I don't understand this question. Patient in what way? Are you talking about the 'timeout client'?

Comment 6 Attila Fazekas 2016-11-14 17:43:10 UTC
If the client itself is not giving up on waiting for the server, why would the intermediate proxy break the connection? Why doesn't the proxy wait as long as the client is willing to wait?

Why does the proxy think it can judge an OpenStack API service and punish it by breaking the client connection just because the server is slow?

<joking>
If we want to keep following the impatient pattern we have already implemented, I would also recommend automatically deleting services when they do not respond in time, and decreasing the time limit to 5 sec instead of 60 sec.
</joking>

I read the haproxy doc; it looks like the doc favors the limited-timeout approach we are using now, but I have doubts that it is really the right way for any OpenStack service.

Comment 7 Fabio Massimo Di Nitto 2016-11-23 12:28:18 UTC

*** This bug has been marked as a duplicate of bug 1289315 ***

Comment 8 Lukas Bezdicka 2019-02-21 10:33:08 UTC
*** Bug 1593811 has been marked as a duplicate of this bug. ***

