Created attachment 1076657 [details] Debug info at the point the deployment hangs Description of problem: Deploying an overcloud with 3 controllers and 1 compute node on baremetal hangs. After four hours the the deployment fails with: "HTTPUnauthorized: ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1 Authentication required" However this error is misleading and is only thrown because the deploy hangs for four hours and hits the auth token timeout. The real issue shows that one controller node completes successfully and the other two hang: From the difference between the os-collect-config log output ($ sudo journalctl -u os-collect-config) on the two systems: <bnemec> The working one looks up its instance-id through os-apply-config, finds it, and continues to report back. <bnemec> The stuck one looks up its instance-id, doesn't find it, and doesn't report back. <bnemec> ec2.json is missing from /run/os-collect-config/os_config_files.json Compare: [heat-admin@overcloud-controller-1 ~]$ sudo cat /var/run/os-collect-config/os_config_files.json ["/var/lib/os-collect-config/heat_local.json", "/var/lib/os-collect-config/cfn.json", "/var/lib/os-collect-config/overcloud-Controller-glvonlfw4cyr-1-fwjzxtscidpo-ControllerConfig-eadn3ts46kbd.json", "/var/lib/os-collect-config/overcloud-Controller-glvonlfw4cyr-1-fwjzxtscidpo-NetworkConfig-cewcrmrguc72-OsNetConfigImpl-dikf7dmjxwei.json"][heat-admin@overcloud-controller-1 ~]$ [heat-admin@overcloud-controller-2 ~]$ sudo cat /var/run/os-collect-config/os_config_files.json ["/var/lib/os-collect-config/heat_local.json", "/var/lib/os-collect-config/ec2.json", "/var/lib/os-collect-config/cfn.json", "/var/lib/os-collect-config/overcloud-Controller-glvonlfw4cyr-2-zpqyf3pxsvzh-ControllerConfig-rlywsxrszpgb.json", "/var/lib/os-collect-config/overcloud-Controller-glvonlfw4cyr-2-zpqyf3pxsvzh-NetworkConfig-6kanbsanfcl6-OsNetConfigImpl-hyqst4v27pq5.json"][heat-admin@overcloud-controller-2 ~]$ If the stalled stack is deleted and the exact deploy command is rerun, the stack goes to complete. Version-Release number of selected component (if applicable): [stack@host15 ~]$ rpm -qa | grep openstack openstack-nova-common-2015.1.1-1.el7ost.noarch openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch openstack-heat-api-cfn-2015.1.1-4.el7ost.noarch openstack-ceilometer-api-2015.1.1-1.el7ost.noarch openstack-ironic-api-2015.1.1-4.el7ost.noarch openstack-swift-plugin-swift3-1.7-3.el7ost.noarch openstack-tuskar-ui-extras-0.0.4-1.el7ost.noarch openstack-ironic-conductor-2015.1.1-4.el7ost.noarch openstack-tempest-kilo-20150708.2.el7ost.noarch openstack-neutron-ml2-2015.1.1-5.el7ost.noarch openstack-nova-scheduler-2015.1.1-1.el7ost.noarch openstack-puppet-modules-2015.1.8-17.el7ost.noarch openstack-nova-cert-2015.1.1-1.el7ost.noarch python-django-openstack-auth-1.2.0-5.el7ost.noarch openstack-dashboard-2015.1.1-2.el7ost.noarch openstack-tripleo-heat-templates-0.8.6-64.el7ost.noarch openstack-heat-templates-0-0.6.20150605git.el7ost.noarch python-openstackclient-1.0.3-2.el7ost.noarch openstack-utils-2014.2-1.el7ost.noarch openstack-heat-api-cloudwatch-2015.1.1-4.el7ost.noarch openstack-ceilometer-notification-2015.1.1-1.el7ost.noarch openstack-ceilometer-collector-2015.1.1-1.el7ost.noarch openstack-tuskar-0.4.18-4.el7ost.noarch openstack-neutron-openvswitch-2015.1.1-5.el7ost.noarch openstack-swift-container-2.3.0-1.el7ost.noarch openstack-nova-api-2015.1.1-1.el7ost.noarch redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch openstack-tripleo-common-0.0.1.dev6-2.git49b57eb.el7ost.noarch openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch openstack-nova-console-2015.1.1-1.el7ost.noarch openstack-neutron-2015.1.1-5.el7ost.noarch openstack-heat-engine-2015.1.1-4.el7ost.noarch openstack-ceilometer-common-2015.1.1-1.el7ost.noarch openstack-ironic-discoverd-1.1.0-6.el7ost.noarch openstack-selinux-0.6.37-1.el7ost.noarch openstack-nova-compute-2015.1.1-1.el7ost.noarch openstack-nova-conductor-2015.1.1-1.el7ost.noarch openstack-swift-account-2.3.0-1.el7ost.noarch openstack-swift-proxy-2.3.0-1.el7ost.noarch openstack-tuskar-ui-0.4.0-3.el7ost.noarch openstack-neutron-common-2015.1.1-5.el7ost.noarch openstack-heat-common-2015.1.1-4.el7ost.noarch openstack-ironic-common-2015.1.1-4.el7ost.noarch openstack-ceilometer-alarm-2015.1.1-1.el7ost.noarch openstack-keystone-2015.1.1-1.el7ost.noarch openstack-dashboard-theme-2015.1.1-2.el7ost.noarch openstack-glance-2015.1.1-1.el7ost.noarch openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch openstack-heat-api-2015.1.1-4.el7ost.noarch openstack-ceilometer-central-2015.1.1-1.el7ost.noarch openstack-swift-2.3.0-1.el7ost.noarch openstack-nova-novncproxy-2015.1.1-1.el7ost.noarch openstack-swift-object-2.3.0-1.el7ost.noarch [stack@host15 ~]$ rpm -qa | grep oscplugin python-rdomanager-oscplugin-0.0.10-5.el7ost.noarch How reproducible: Deploy always fails the first time on baremetal and passes on rerun Steps to Reproduce: 1. Install ops-director from latest puddle on baremetal 2. Deploy an HA overcloud (3 controllers 1 compute) 3. Watch the deployment stall Actual results: The deployment will error out after 4 hours Expected results: The deployment should reach COMPLETE Additional info:
this was introduced b/c of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1228862 there is an issue with the patch for that, which i've fixed in: https://code.engineering.redhat.com/gerrit/#/c/58392/ it's built into instack-undercloud-2.1.2-28.el7ost
going to close this one as a dupe of the other bug so we dont have to chase a bunch of acks on it *** This bug has been marked as a duplicate of bug 1228862 ***
the issue here was that keystone users are getting recreated now in 98-undercloud-setup, and nova-api must be holding onto an old token. This causes the following traceback in the nova-api log when the overcloud nodes query for metadata: _handle_fault_response /usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py:176 2015-09-24 16:47:50.925 2968 ERROR nova.api.openstack [req-ef1347fd-554d-4ff7-9bcf-9b47501bf52f 71514baf617747dd816985fdebe313e8 f13b7755370e4875ae107532d2815a51 - - -] Caught error: 404 Not Found The resource could not be found. 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack Traceback (most recent call last): 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/api/openstack/__init__.py", line 125, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return req.get_response(self.application) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack application, catch_exc_info=False) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack app_iter = application(self.environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return resp(environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 634, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return self._call_app(env, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 554, in _call_app 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return self._app(env, _fake_start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return resp(environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return resp(environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/routes/middleware.py", line 131, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack response = self.app(environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return resp(environ, start_response) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack resp = self.call_func(req, *args, **self.kwargs) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return self.func(req, *args, **kwargs) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 756, in __call__ 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack content_type, body, accept) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 821, in _process_stack 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack action_result = self.dispatch(meth, request, action_args) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 911, in dispatch 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack return method(req=request, **action_args) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/contrib/floating_ips.py", line 103, in index 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack floating_ips = self.network_api.get_floating_ips_by_project(context) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 1252, in get_floating_ips_by_project 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack fips = client.list_floatingips(tenant_id=project_id)['floatingips'] 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 102, in with_params 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack ret = self.function(instance, *args, **kwargs) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 691, in list_floatingips 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack **_params) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 307, in list 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack for r in self._pagination(collection, path, **params): 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 320, in _pagination 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack res = self.get(path, params=params) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 293, in get 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack headers=headers, params=params) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 270, in retry_request 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack headers=headers, params=params) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 211, in do_request 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack self._handle_fault_response(status_code, replybody) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 185, in _handle_fault_response 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack exception_handler_v20(status_code, des_error_body) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 83, in exception_handler_v20 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack message=message) 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack NeutronClientException: 404 Not Found 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack The resource could not be found. 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack 2015-09-24 16:47:50.925 2968 TRACE nova.api.openstack the corresponding error from neutron/server.log: 2015-09-24 16:47:53.688 29876 DEBUG keystoneclient.session [-] Request returned failure status: 404 request /usr/lib/python2.7/site-packages/keystoneclient/session.py:396 2015-09-24 16:47:53.688 29876 WARNING keystonemiddleware.auth_token [-] Authorization failed for token