Bug 1308693 - 30 seconds response time to ironic node-list on a fresh Red Hat OSP 7 Director undercloud installation
Summary: 30 seconds response time to ironic node-list on a fresh Red Hat OSP 7 Director undercloud installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Lucas Alvares Gomes
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-15 18:42 UTC by Andreas Karis
Modified: 2019-10-10 11:12 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-04 16:01:14 UTC
Target Upstream Version:
Embargoed:



Description Andreas Karis 2016-02-15 18:42:52 UTC
Description of problem:
Very slow response times to ironic node-list on Red Hat OSP 7 Director after undercloud installation.

This is a freshly installed, running Director. OpenStack Ironic is slow after the undercloud installation. The undercloud is a VMware ESXi virtual machine with 8 cores, 16 GB of RAM, and 40 GB of disk space.

Director is up and running and everything seems to work. However, the customer noticed that Horizon was too slow to be usable.

We narrowed the issue down to Ironic. We can reproduce it consistently by calling "ironic node-list"; these calls take approximately 30 seconds to complete.

Version-Release number of selected component (if applicable):
RH OSP 7

How reproducible:
Run the following:
time ironic node-list

Steps to Reproduce:
1. su - stack
2. source stackrc
3. time ironic node-list

Actual results:
Approximately 30 seconds to get an answer from Ironic.

Expected results:
Ironic should return an answer within a couple of seconds.


Additional info:
Ironic appears to hang during the following request:

------------------
DEBUG (session:195) REQ: curl -g -i -X GET http://172.24.252.205:6385/v1/nodes -H "X-OpenStack-Ironic-API-Version: 1.6" -H "User-Agent: python-ironicclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}d99aea7657237e6bbe306b68b74b96269438941c"
------------------

The logs don't show anything helpful. We see a Keystone request, then a long pause (10+ seconds), then another Keystone request that returns.

We can reproduce this using the curl command.

------------------
[stack@director log]$ time !!
time curl -g -i -X GET http://172.24.252.205:6385/v1/nodes -H "X-OpenStack-Ironic-API-Version: 1.6" -H "User-Agent: python-ironicclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}6ace843a8ecfc102b92f87c47c80344c14abc5d4"
HTTP/1.0 401 Unauthorized
Date: Mon, 15 Feb 2016 16:56:17 GMT
Server: WSGIServer/0.1 Python/2.7.5
WWW-Authenticate: Keystone uri='http://172.24.252.205:5000/'
Content-type: text/plain
Content-Length: 23

Authentication required
real    0m16.317s
user    0m0.002s
sys     0m0.007s
------------------

Comment 2 Jack Waterworth 2016-02-15 18:54:09 UTC
Additionally, ironic-conductor does not report anything during this time. ironic-api shows nothing for the first 10 seconds, followed by a keystonemiddleware auth_token message:

ironic-api.log
--------------------------
<pause>
2016-02-15 11:53:11.127 7494 DEBUG keystonemiddleware.auth_token [-] Removing headers from request environment: X-Service-Catalog,X-Identity-Status,X-Service-Identity-Status,X-Roles,X-Service-Roles,X-Domain-Name,X-Service-Domain-Name,X-Project-Id,X-Service-Project-Id,X-Project-Domain-Name,X-Service-Project-Domain-Name,X-User-Id,X-Service-User-Id,X-User-Name,X-Service-User-Name,X-Project-Name,X-Service-Project-Name,X-User-Domain-Id,X-Service-User-Domain-Id,X-Domain-Id,X-Service-Domain-Id,X-User-Domain-Name,X-Service-User-Domain-Name,X-Project-Domain-Id,X-Service-Project-Domain-Id,X-Role,X-User,X-Tenant-Name,X-Tenant-Id,X-Tenant _remove_auth_headers /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:672
--------------------------

Comment 3 Steve Baker 2016-02-16 20:40:35 UTC
I've seen this when the undercloud has entries in /etc/resolv.conf for name servers that cannot be reached. Can you confirm this is causing your issue?

If so, the workaround is obvious, but I wonder if Ironic could be fixed to behave better when there is a bogus name server configuration.
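
For reference, a quick way to check for that condition looks roughly like this (a sketch; 172.24.252.205 stands in for the undercloud's own IP from the curl output above):

------------------
# list the configured name servers
cat /etc/resolv.conf
# time a reverse lookup of the undercloud's own IP; with an unreachable
# name server this hangs until the resolver times out
time getent hosts 172.24.252.205
------------------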

Comment 4 Andreas Karis 2016-02-16 20:55:33 UTC
Does this happen only when the name server cannot be reached, or also when it cannot resolve the node's own reverse DNS entries? I see lots of reverse DNS lookups for the node's own IP address in the strace output. I will check with the customer.
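
For what it's worth, the lookups can be observed with something along these lines (a sketch; the process selection and interface are illustrative):

------------------
# watch DNS traffic from the undercloud while running "ironic node-list"
tcpdump -ni any port 53
# or trace network syscalls of one ironic-api worker
strace -f -e trace=network -p $(pgrep -f ironic-api | head -1)
------------------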

Also, I see lots of these messages. However, in a lab I can stop rabbitmq completely and ironic list commands still work fine.

2016-02-11 11:52:45.819 20412 DEBUG keystonemiddleware.auth_token [-] Authenticating user token __call__ /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:581
2016-02-11 11:52:45.824 20412 DEBUG oslo_messaging._drivers.impl_rabbit [-] Received recoverable error from kombu: on_error /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:810
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit Traceback (most recent call last):
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/utils/__init__.py", line 243, in retry_over_time
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return fun(*args, **kwargs)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return self.connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self._connection = self._establish_connection()
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     conn = self.transport.establish_connection()
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 112, in establish_connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     conn = self.Connection(**opts)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 183, in __init__
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return self._x_open(virtual_host)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 625, in _x_open
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self._send_method((10, 40), args)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self.channel_id, method_sig, args, content,
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     write_frame(1, channel, payload)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     frame_type, channel, size, payload, 0xce,
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     tail = self.send(data, flags)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     total_sent += fd.send(data[total_sent:], flags)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit error: [Errno 32] Broken pipe
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit
2016-02-11 11:52:45.827 20412 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.252.205:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 2 seconds.

Comment 5 Steve Baker 2016-02-16 21:00:26 UTC
I only saw this with unreachable name servers, specifically when my VPN connection went down.

Comment 6 Andreas Karis 2016-02-16 22:28:03 UTC
Hello, adding the node's br-ctlplane IP address to /etc/hosts fixed the issue for 1 out of 2 customers. I will verify with the second customer as well. It seems that Ironic is trying to resolve the PTR record for its own address.
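
For reference, the workaround boils down to something like this (a sketch; the hostname is illustrative, the IP is the node's br-ctlplane address):

------------------
# /etc/hosts on the undercloud
172.24.252.205   director.localdomain director

# verify the node now resolves its own address locally, then re-test
getent hosts 172.24.252.205
time ironic node-list
------------------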

If this is the solution for both customers, then we should definitely update the documentation.

Let me get back to you to see if it works for the second customer as well.

Thanks.

Comment 7 Dan Yocum 2016-04-04 15:53:22 UTC
Long pauses in keystone requests indicate that the keystone.tokens table has grown without bound.  Have you enabled a cron job to clear this out every 24 hours?  This is in the tuning section of the Director Install and Config guide.

You should also clear out the unused Heat raw templates table, since it can grow without bound too:

# Flush keystone tokens
0 04 * * * /bin/keystone-manage token_flush
# Remove unused raw_templates
0 04 * * * /bin/heat-manage purge_deleted -g days 1
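
To confirm whether the token table is actually the culprit, you can run the flush once by hand and compare the row count before and after (a sketch; the table name assumes the default keystone schema):

------------------
# row count in keystone's token table
mysql keystone -e 'SELECT COUNT(*) FROM token;'
# flush expired tokens immediately instead of waiting for the cron job
keystone-manage token_flush
------------------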

Comment 8 Andreas Karis 2016-04-04 16:01:14 UTC
We resolved this issue a long time ago. As far as I remember, it was due to DNS issues.

Thanks,

Andreas

