Bug 1504279 - Metadata service takes a long time to respond
Summary: Metadata service takes a long time to respond
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-eventlet
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 12.0 (Pike)
Assignee: Jon Schlueter
QA Contact: Noam Manos
URL:
Whiteboard:
Duplicates: 1608873
Depends On:
Blocks: 1612971 1628772
 
Reported: 2017-10-19 20:51 UTC by Itzik Brown
Modified: 2021-12-10 15:32 UTC (History)
CC List: 15 users

Fixed In Version: python-eventlet-0.20.1-5.1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1612971 1628772
Environment:
Last Closed: 2018-12-14 11:34:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1696094 0 None None None 2017-10-24 09:54:49 UTC
Launchpad 1785615 0 None None None 2018-08-06 14:24:52 UTC
Red Hat Issue Tracker OSP-11325 0 None None None 2021-12-10 15:32:45 UTC

Description Itzik Brown 2017-10-19 20:51:24 UTC
Description of problem:
When launching an instance in an OpenDaylight bare metal setup, it takes more than 30 seconds to get the metadata of the instance.
One result is that it's not possible to SSH to the instance with a key.

The setup is a containerized setup (the Neutron services are not containerized).
This behavior was not observed on a virt setup.

A Trace from /var/log/neutron/metadata-agent.log:

2017-10-19 13:05:39.828 528628 INFO eventlet.wsgi.server [-] 192.168.99.7,<local> "GET / HTTP/1.1" status: 200  len: 0 time: 90.0983951
2017-10-19 13:09:58.159 528628 INFO eventlet.wsgi.server [-] 192.168.99.7,<local> "GET / HTTP/1.1" status: 200  len: 234 time: 30.0548701
2017-10-19 13:10:12.974 528640 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 32] Broken pipe


Version-Release number of selected component (if applicable):
openstack-neutron-11.0.2-0.20171014013905.9c11f8a.el7ost.noarch

How reproducible:
It was seen twice

Steps to Reproduce:
1. Deploy a bare metal setup with containers and OpenDaylight
2. Launch an instance 
3. Connect to the instance through a console
4. Run curl http://169.254.169.254
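
(Not part of the original steps, just a timing note: from inside the instance, wrapping step 4 in time makes the delay obvious. Illustrative command only; any HTTP client will do.)

$ time curl -s http://169.254.169.254/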


Actual results:


Expected results:


Additional info:

Comment 1 Daniel Alvarez Sanchez 2017-10-23 10:49:27 UTC
The TCP connection gets established right away, but there is a 30-second delay before the response is sent back to the instance:


10:40:42.802174  In fa:16:3e:91:80:5f ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 25959, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.99.7.58092 > 169.254.169.254.http: Flags [S], cksum 0xa1bb (correct), seq 2545290520, win 14600, options [mss 1460,sackOK,TS val 84568191 ecr 0,nop,wscale 2], length 0

10:40:42.802227 Out fa:16:3e:2d:28:37 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    169.254.169.254.http > 192.168.99.7.58092: Flags [S.], cksum 0x77db (incorrect -> 0x8dd4), seq 741284838, ack 2545290521, win 28960, options [mss 1460,sackOK,TS val 351108790 ecr 84568191,nop,wscale 7], length 0
10:40:42.802572  In fa:16:3e:91:80:5f ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 25960, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.99.7.58092 > 169.254.169.254.http: Flags [.], cksum 0x1f7e (correct), seq 1, ack 1, win 3650, options [nop,nop,TS val 84568192 ecr 351108790], length 0

10:40:42.802602  In fa:16:3e:91:80:5f ethertype IPv4 (0x0800), length 210: (tos 0x0, ttl 64, id 25961, offset 0, flags [DF], proto TCP (6), length 194)
    192.168.99.7.58092 > 169.254.169.254.http: Flags [P.], cksum 0x332d (correct), seq 1:143, ack 1, win 3650, options [nop,nop,TS val 84568192 ecr 351108790], length 142: HTTP, length: 142
        GET / HTTP/1.1
        User-Agent: curl/7.24.0 (x86_64-pc-linux-gnu) libcurl/7.24.0 OpenSSL/1.0.0j zlib/1.2.6
        Host: 169.254.169.254
        Accept: */*

10:40:42.841733 Out fa:16:3e:2d:28:37 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 25937, offset 0, flags [DF], proto TCP (6), length 52)
    169.254.169.254.http > 192.168.99.7.58092: Flags [.], cksum 0x77d3 (incorrect -> 0x2c1f), seq 1, ack 143, win 235, options [nop,nop,TS val 351108830 ecr 84568192], length 0

[******** 30 seconds delay ********]

10:41:12.860156 Out fa:16:3e:2d:28:37 ethertype IPv4 (0x0800), length 283: (tos 0x0, ttl 64, id 25938, offset 0, flags [DF], proto TCP (6), length 267)
    169.254.169.254.http > 192.168.99.7.58092: Flags [P.], cksum 0x78aa (incorrect -> 0xd935), seq 1:216, ack 143, win 235, options [nop,nop,TS val 351138848 ecr 84568192], length 215: HTTP, length: 215
        HTTP/1.1 200 OK
        Content-Type: text/plain; charset=UTF-8
        Content-Length: 98
        Date: Mon, 23 Oct 2017 10:41:12 GMT

        1.0
        2007-01-19
        2007-03-01
        2007-08-29
        2007-10-10
        2007-12-15
        2008-02-01
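
(For reference, a capture along these lines is what produces the view above; the interface is an assumption and depends on where the metadata proxy traffic is visible in your deployment.)

# tcpdump -e -v -i IFACE 'host 169.254.169.254 and tcp port 80'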


I'll try to check whether it's a connection problem between the metadata agent and nova and update this BZ.

Comment 2 Daniel Alvarez Sanchez 2017-10-23 15:57:07 UTC
To isolate the issue, I tried manually issuing the request to the nova metadata API using curl, and the response came back right away without any delay at all:

[heat-admin@controller-0 ~]$ curl http://overcloud.internalapi.localdomain:8775 -H "X-Forwarded-For: 192.168.99.7" -H "X-Neutron-Network-Id: a773e94f-31c4-4850-b840-d97c10af9cbd"
1.0
2007-01-19
2007-03-01
2007-08-29
2007-10-10
2007-12-15
2008-02-01
2008-09-01
2009-04-04


Then I enabled some traces and confirmed that the 30-second delay happens in the following call, when the metadata agent tries to send the answer back to the instance [0].

I remember having run into this 30-second issue in the past [1]; it had to do with eventlet having trouble resolving hostnames when there's an entry in /etc/hosts:

[root@controller-0 log]# rpm -qa | grep eventlet                                                                                                                                              
python-eventlet-0.20.1-2.el7ost.noarch

[root@controller-0 log]# cat /etc/neutron/metadata_agent.ini  | grep nova_metadata_host                                                                                                       
#nova_metadata_host = 127.0.0.1
nova_metadata_host=overcloud.internalapi.localdomain

[root@controller-0 log]# cat /etc/hosts | grep overcloud.internal                                                                                                                             
172.17.0.11  overcloud.internalapi.localdomain
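
(Side note, not from the original report: for comparison, glibc's getaddrinfo, which is what curl uses, stops at the /etc/hosts entry and returns immediately; it is eventlet's own resolver that behaves differently. An illustrative check:)

[root@controller-0 log]# getent ahosts overcloud.internalapi.localdomain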


After changing nova_metadata_host to 172.17.0.11, the 30-second delay is no longer seen:

[root@controller-0 log]# cat /etc/neutron/metadata_agent.ini  | grep nova_metadata_host                                                                                                       
nova_metadata_host = 172.17.0.11  


15:49:49.522382  In fa:16:3e:91:80:5f ethertype IPv4 (0x0800), length 210: (tos 0x0, ttl 64, id 21252, offset 0, flags [DF], proto TCP (6), length 194)
    192.168.99.7.58235 > 169.254.169.254.http: Flags [P.], cksum 0x0054 (correct), seq 1:143, ack 1, win 3650, options [nop,nop,TS val 89204866 ecr 369655510], length 142: HTTP, length: 142
        GET / HTTP/1.1
        User-Agent: curl/7.24.0 (x86_64-pc-linux-gnu) libcurl/7.24.0 OpenSSL/1.0.0j zlib/1.2.6
        Host: 169.254.169.254
        Accept: */*

        0x0000:  4500 00c2 5304 4000 4006 6f85 c0a8 6307  E...S.@.@.o...c.
        0x0010:  a9fe a9fe e37b 0050 50bc 7b81 c46e ccff  .....{.PP.{..n..
        0x0020:  8018 0e42 0054 0000 0101 080a 0551 2882  ...B.T.......Q(.
        0x0030:  1608 7ed6 4745 5420 2f20 4854 5450 2f31  ..~.GET./.HTTP/1
        0x0040:  2e31 0d0a 5573 6572 2d41 6765 6e74 3a20  .1..User-Agent:.
        0x0050:  6375 726c 2f37 2e32 342e 3020 2878 3836  curl/7.24.0.(x86
        0x0060:  5f36 342d 7063 2d6c 696e 7578 2d67 6e75  _64-pc-linux-gnu
        0x0070:  2920 6c69 6263 7572 6c2f 372e 3234 2e30  ).libcurl/7.24.0
        0x0080:  204f 7065 6e53 534c 2f31 2e30 2e30 6a20  .OpenSSL/1.0.0j.
        0x0090:  7a6c 6962 2f31 2e32 2e36 0d0a 486f 7374  zlib/1.2.6..Host
        0x00a0:  3a20 3136 392e 3235 342e 3136 392e 3235  :.169.254.169.25
        0x00b0:  340d 0a41 6363 6570 743a 202a 2f2a 0d0a  4..Accept:.*/*..
        0x00c0:  0d0a 0000 0000 0000 0000 0000 0000 0000  ................
        0x00d0:  0000                                     ..
15:49:49.561737 Out fa:16:3e:2d:28:37 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 45609, offset 0, flags [DF], proto TCP (6), length 52)
    169.254.169.254.http > 192.168.99.7.58235: Flags [.], cksum 0x77d3 (incorrect -> 0xf945), seq 1, ack 143, win 235, options [nop,nop,TS val 369655550 ecr 89204866], length 0
        0x0000:  4500 0034 b229 4000 4006 10ee a9fe a9fe  E..4.)@.@.......
        0x0010:  c0a8 6307 0050 e37b c46e ccff 50bc 7c0f  ..c..P.{.n..P.|.
        0x0020:  8010 00eb 77d3 0000 0101 080a 1608 7efe  ....w.........~.
        0x0030:  0551 2882 0000 0000 0000 0000 0000 0000  .Q(.............
        0x0040:  0000 0000                                ....
15:49:49.586618 Out fa:16:3e:2d:28:37 ethertype IPv4 (0x0800), length 283: (tos 0x0, ttl 64, id 45610, offset 0, flags [DF], proto TCP (6), length 267)
    169.254.169.254.http > 192.168.99.7.58235: Flags [P.], cksum 0x78aa (incorrect -> 0x0f7c), seq 1:216, ack 143, win 235, options [nop,nop,TS val 369655574 ecr 89204866], length 215: HTTP, length: 215
        HTTP/1.1 200 OK
        Content-Type: text/plain; charset=UTF-8
        Content-Length: 98
        Date: Mon, 23 Oct 2017 15:49:49 GMT

        1.0
        2007-01-19


With this change, the issue is solved, except that a few times it still times out because the agent takes exactly 1 minute to get the instance port details from neutron-server over an RPC call:

$ curl http://169.254.169.254
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

The agent spends 1 minute in [2] even though the server always returns the expected results. When this happens, the RPC call always takes exactly 1 minute:

2017-10-23 15:52:26.789 378361 DEBUG neutron.agent.metadata.agent [-] XXX A _get_instance_and_tenant_id /usr/lib/python2.7/site-packages/neutron/agent/metadata/agent.py:164
2017-10-23 15:52:26.790 378361 DEBUG neutron.agent.metadata.agent [-] XXX a _get_ports_from_server /usr/lib/python2.7/site-packages/neutron/agent/metadata/agent.py:107
2017-10-23 15:53:26.883 378361 DEBUG neutron.agent.metadata.agent [-] XXX B _get_instance_and_tenant_id /usr/lib/python2.7/site-packages/neutron/agent/metadata/agent.py:166


Before changing the metadata host from the hostname to its IP address, the RPC call used to get stuck more often, but I'm not sure how the two are related.


[0] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/agent.py#L197
[1] https://bugs.launchpad.net/neutron/+bug/1696094
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/agent.py#L105

Comment 3 Daniel Alvarez Sanchez 2017-10-24 09:54:49 UTC
Ok, now things are clearer:

We were seeing two main delays here:

1. Agent trying to contact nova metadata service (30 secs) [0]
2. Agent trying to do an RPC call to neutron-server (60 secs) [1]

In both cases, hostnames are used:

1. nova_metadata_host=overcloud.internalapi.localdomain
2. transport_url=rabbit://guest:****@controller-0.internalapi.localdomain:5672/?ssl=0


Even though both overcloud.internalapi.localdomain and controller-0.internalapi.localdomain show up in the /etc/hosts file, it looks like resolution still goes through the DNS servers, since the names don't have an IPv6 entry there.

For 2), Ihar sent a patch upstream which is not yet merged [2]; it would reduce the 1-minute delay seen in 2) because, once an IPv4 entry is found, py-amqp wouldn't try to contact the DNS servers for the IPv6 resolution. For 1), I don't think we have any solution.

# nslookup overcloud.internalapi.localdomain
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

However, if we add a default route so that the configured nameservers in /etc/resolv.conf are either reachable or fail fast, the problem disappears and the metadata is fetched by the instance right away:

$ sudo route add default gw 192.0.5.1
$ nslookup overcloud.internalapi.localdomain
Server:         10.35.28.28
Address:        10.35.28.28#53

** server can't find overcloud.internalapi.localdomain: NXDOMAIN

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/agent.py#L197
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/agent.py#L105
[2] https://github.com/celery/py-amqp/pull/153

Comment 4 Daniel Alvarez Sanchez 2017-10-24 13:07:11 UTC
Closing as NOTABUG.
Since nameservers were configured in /etc/resolv.conf, they should be reachable (at the very least, a default gateway should have been configured in this case) to avoid these resolution timeouts, even when the entries exist in /etc/hosts.

Comment 5 Tim Rozet 2018-08-01 14:36:53 UTC
I also hit this with Queens. Thanks to Daniel for helping me debug and find the root cause: the HTTP request from the neutron metadata agent to the nova metadata agent takes 30 seconds before responding. This is because a DNS lookup happens first and times out, and only *then* is /etc/hosts consulted to resolve the overcloud internal API hostname. The code that takes 30 seconds is this line:

https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/agent.py#L197

When running this request manually with curl, or in a Python shell inside the metadata agent container, it returns in about 1 second. Daniel figured out that the difference was running under eventlet, which is what causes the issue:
https://github.com/eventlet/eventlet/issues/387

That bug was fixed, but we are still using the old version in RDO:
https://review.rdoproject.org/r/gitweb?p=rdoinfo.git;a=blob;f=deps.yml;h=a890de38e0694ad496adc4f85d9e0b994c7b20d8;hb=HEAD#l2031

Comment 6 Daniel Alvarez Sanchez 2018-08-01 16:04:02 UTC
After some debugging with Tim: the installed versions of both eventlet and amqp are correct.
The problem is DNS resolution in eventlet.

Here is a reproducer:

import eventlet
eventlet.monkey_patch()

import socket
print socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0, socket.SOCK_STREAM)


If we don't monkey patch, everything is fine; otherwise, the call takes 30 seconds.
The reason is that eventlet tries to resolve both IPv6 and IPv4 [0], so even with the patch in [1], it will first try to resolve the IPv6 entry through the hosts file (where it doesn't exist), then through the nameservers, and only then resolve the IPv4 entry through the hosts file, where it is found and returned.

As fixing this means changing the default behavior of socket.getaddrinfo, the fix should go into eventlet.

[0] https://github.com/eventlet/eventlet/blob/master/eventlet/support/greendns.py#L472
[1] https://github.com/eventlet/eventlet/commit/62ccb9cae17e4dc360379d1a0eec6a9157cd99b9
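
A small timing variant of the reproducer above (same hostname and environment assumptions) makes the difference easy to measure:

import time

import eventlet
eventlet.monkey_patch()  # comment this line out to compare with stock stdlib resolution

import socket

# Assumptions (as in this environment): /etc/hosts has an IPv4-only entry for the
# name below and the configured DNS servers are unreachable.
start = time.time()
addrs = socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0, socket.SOCK_STREAM)
print('resolved in %.1f s: %r' % (time.time() - start, addrs))

With the monkey patch in place this takes ~30 seconds here; without it, it returns immediately, matching the behavior described above.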

Comment 7 Daniel Alvarez Sanchez 2018-08-02 12:54:57 UTC
I created a new issue and a pull request for eventlet:

https://github.com/eventlet/eventlet/pull/512

Comment 8 Daniel Alvarez Sanchez 2018-08-02 12:58:24 UTC
@Tim, thanks for the setup and for giving me a hand with the patch. I tried to add you as a co-author but apparently that's not possible on pull requests. Sorry :(

Comment 9 Daniel Alvarez Sanchez 2018-08-03 08:40:54 UTC
*** Bug 1608873 has been marked as a duplicate of this bug. ***

Comment 10 Daniel Alvarez Sanchez 2018-08-06 09:16:14 UTC
Patch got merged upstream in eventlet master:
https://github.com/eventlet/eventlet/commit/10ff67f84a1e2f53c8eb254944ebf81762041b45

Comment 13 Daniel Alvarez Sanchez 2018-08-06 16:04:14 UTC
Koji build URL: https://cbs.centos.org/koji/taskinfo?taskID=506034

Comment 30 Lon Hohberger 2018-12-06 11:47:57 UTC
According to our records, this should be resolved by python-eventlet-0.20.1-5.1.el7ost.  This build is available now.
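
(For anyone verifying on a node, a plain rpm query is enough to confirm the build; per this bug, the fix is in python-eventlet-0.20.1-5.1.el7ost.)

# rpm -q python-eventlet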

