Description of problem:

Very slow response times to "ironic node-list" on a Red Hat OSP 7 director after undercloud installation. This is a new, running director on a VMware ESXi virtual machine with 8 cores, 16 GB of RAM and 40 GB of disk space. The director is up and everything seems to work, but the customer noticed that Horizon was too slow to be usable. We narrowed the issue down to ironic: every call to "ironic node-list" takes ca. 30 seconds to complete.

Version-Release number of selected component (if applicable):
RH OSP 7

How reproducible:
Always; run: time ironic node-list

Steps to Reproduce:
1. su - stack
2. source stackrc
3. time ironic node-list

Actual results:
ca. 30 seconds to get an answer from ironic

Expected results:
Should return an answer within a couple of seconds

Additional info:
ironic appears to be hanging during the following request:
------------------
DEBUG (session:195) REQ: curl -g -i -X GET http://172.24.252.205:6385/v1/nodes -H "X-OpenStack-Ironic-API-Version: 1.6" -H "User-Agent: python-ironicclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}d99aea7657237e6bbe306b68b74b96269438941c"
------------------
The logs don't show anything helpful. We see a keystone request, a long pause (10+ seconds), then another keystone request that returns. We can reproduce this using the curl command directly:
------------------
[stack@director log]$ time !!
time curl -g -i -X GET http://172.24.252.205:6385/v1/nodes -H "X-OpenStack-Ironic-API-Version: 1.6" -H "User-Agent: python-ironicclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}6ace843a8ecfc102b92f87c47c80344c14abc5d4"
HTTP/1.0 401 Unauthorized
Date: Mon, 15 Feb 2016 16:56:17 GMT
Server: WSGIServer/0.1 Python/2.7.5
WWW-Authenticate: Keystone uri='http://172.24.252.205:5000/'
Content-type: text/plain
Content-Length: 23

Authentication required

real    0m16.317s
user    0m0.002s
sys     0m0.007s
------------------
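As a quick check of whether the pause comes from name resolution rather than from ironic itself, the reverse (PTR) lookup of the undercloud's own API address can be timed on the director node. This is only a diagnostic sketch; the 172.24.252.205 address is taken from the request above, and a multi-second result would point at DNS:

# Reverse lookup via the normal resolver path (nsswitch / hosts / DNS)
time getent hosts 172.24.252.205

# The same lookup through the Python call that eventlet-based services end up using
time python -c 'import socket; print(socket.gethostbyaddr("172.24.252.205"))'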
Additionally, ironic-conductor does not report anything during this time. ironic-api shows nothing for the first 10 seconds, followed by a token message:

ironic-api.log
--------------------------
<pause>
2016-02-15 11:53:11.127 7494 DEBUG keystonemiddleware.auth_token [-] Removing headers from request environment: X-Service-Catalog,X-Identity-Status,X-Service-Identity-Status,X-Roles,X-Service-Roles,X-Domain-Name,X-Service-Domain-Name,X-Project-Id,X-Service-Project-Id,X-Project-Domain-Name,X-Service-Project-Domain-Name,X-User-Id,X-Service-User-Id,X-User-Name,X-Service-User-Name,X-Project-Name,X-Service-Project-Name,X-User-Domain-Id,X-Service-User-Domain-Id,X-Domain-Id,X-Service-Domain-Id,X-User-Domain-Name,X-Service-User-Domain-Name,X-Project-Domain-Id,X-Service-Project-Domain-Id,X-Role,X-User,X-Tenant-Name,X-Tenant-Id,X-Tenant _remove_auth_headers /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:672
--------------------------
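To narrow down whether the pause sits in keystone itself or in the auth_token middleware in front of ironic-api, it may help to time a request to keystone directly from the undercloud. This is only a sketch; it assumes the openstack client is installed and the stackrc credentials from the reproduction steps are loaded:

# Unauthenticated version-discovery request straight to keystone
time curl -s http://172.24.252.205:5000/ > /dev/null

# Full token issue round trip as the stack user
source ~/stackrc
time openstack token issue > /dev/null

If both of these return quickly while "ironic node-list" still hangs, the delay is happening inside ironic-api's request handling rather than in keystone.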
I've seen this when the undercloud has entries in /etc/resolv.conf for name servers that cannot be reached. Can you confirm this is causing your issue? If so, the workaround is obvious, but I wonder if ironic could be fixed to behave better when there is bogus name server config.
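For reference, one way to confirm that hypothesis is to test every nameserver listed in /etc/resolv.conf directly; this is just a sketch and the timeouts are arbitrary:

cat /etc/resolv.conf

# Query each configured nameserver once with a short timeout; servers that
# time out here are the ones that make every lookup hang until the resolver gives up.
for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
    echo "== testing $ns"
    time dig +time=2 +tries=1 @"$ns" example.com > /dev/null
done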
Only when the nameserver cannot be reached, or also when it cannot resolve its own reverse DNS entries? I see lots of reverse DNS lookups in the strace for its own IP address (a sketch of how to capture this follows the traceback below). I will check with the customer.

I also see lots of the messages below. However, in a lab I can completely stop rabbitmq and the ironic list commands still work fine, so the AMQP errors do not appear to be the cause of the delay:

2016-02-11 11:52:45.819 20412 DEBUG keystonemiddleware.auth_token [-] Authenticating user token __call__ /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:581
2016-02-11 11:52:45.824 20412 DEBUG oslo_messaging._drivers.impl_rabbit [-] Received recoverable error from kombu: on_error /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:810
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit Traceback (most recent call last):
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/utils/__init__.py", line 243, in retry_over_time
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return fun(*args, **kwargs)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return self.connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self._connection = self._establish_connection()
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     conn = self.transport.establish_connection()
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 112, in establish_connection
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     conn = self.Connection(**opts)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 183, in __init__
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     return self._x_open(virtual_host)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 625, in _x_open
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self._send_method((10, 40), args)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     self.channel_id, method_sig, args, content,
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     write_frame(1, channel, payload)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     frame_type, channel, size, payload, 0xce,
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     tail = self.send(data, flags)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit     total_sent += fd.send(data[total_sent:], flags)
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit error: [Errno 32] Broken pipe
2016-02-11 11:52:45.824 20412 TRACE oslo_messaging._drivers.impl_rabbit
2016-02-11 11:52:45.827 20412 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.252.205:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 2 seconds.
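For completeness, the kind of strace referred to above can be reproduced on the undercloud roughly like this. It is a sketch only; the pgrep pattern for the ironic-api workers is an assumption:

# Attach to the running ironic-api processes and watch network syscalls while
# "ironic node-list" is re-run in another shell; connect() calls to port 53
# (sin_port=htons(53)) are the DNS lookups in question.
strace -f -e trace=network -p "$(pgrep -d ' ' -f ironic-api)" 2>&1 | grep 'htons(53)'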
I only saw this with unreachable name servers, specifically when my VPN connection went down.
Hello, adding the node's br-ctlplane IP address to /etc/hosts fixed the issue for 1 out of 2 customers. I will verify with the second customer as well. It seems that ironic is trying to resolve the PTR record for its own address. If this turns out to be the solution for both customers, then we should definitely update the documentation. Let me get back to you once I know whether it works for the second customer as well. Thanks.
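For anyone hitting the same symptom, the workaround amounts to giving the undercloud a local answer for its own address. The hostname below is a made-up example; the address is the br-ctlplane IP from the comments above:

# /etc/hosts on the undercloud (hypothetical hostname)
172.24.252.205   director.example.com   director

# Afterwards the reverse lookup should return instantly from the hosts file:
time getent hosts 172.24.252.205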
Long pauses in keystone requests indicate that the keystone token table has grown without bound. Have you enabled a cron job to clear it out every 24 hours? This is covered in the tuning section of the Director Installation and Configuration guide. You should also clear out the unused heat raw_templates table, since it can grow without bound too:

# Flush keystone tokens
0 04 * * * /bin/keystone-manage token_flush

# Remove unused raw_templates
0 04 * * * /bin/heat-manage purge_deleted -g days 1
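To check whether this is actually the problem before setting up the cron jobs, one can look at how large the token table has grown and flush it once by hand. This is a sketch, assuming it is run as root on the undercloud with access to the local database socket, and the table name "token" is an assumption:

# Number of rows in keystone's token table
mysql keystone -e 'SELECT COUNT(*) FROM token;'

# One-off flush; the cron job above keeps it from growing again
keystone-manage token_flush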
We resolved this issue a long time ago. As far as I remember, it was due to DNS issues. Thanks, Andreas