Bug 1326883

Summary: After compute node scaling in a mixed UC8-OC7 environment, nova-compute service cannot start
Product: Red Hat OpenStack
Reporter: Dan Yasny <dyasny>
Component: openstack-nova
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED NOTABUG
QA Contact: Prasanth Anbalagan <panbalag>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.0 (Kilo)
CC: berrange, brad, dasmith, eglynn, kchamart, ndipanov, sbauza, sferdjao, sgordon, vromanso, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-15 13:32:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Dan Yasny 2016-04-13 16:02:28 UTC
Description of problem:
I am testing the support for managing overcloud 7.3 from undercloud 8
The flow:
1. deploy a standard setup - 3 controllers, 1 compute and 1 ceph, network isolation and SSL using 7.3 GA
2. populate the overcloud with instances, tenants, objects, volumes, etc
3. upgrade undercloud to 8
4. fix the known issues from 1325702 and 1326644
5. bring up vlan10, to restore connectivity
6. verify the populated objects are alive and still exist in the overcloud
7. update tripleo-overcloud-passwords with "OVERCLOUD_RABBITMQ_PASSWORD=guest" (BZ1320333)
8. rerun the deploy command, pointing to a local tht dir that contains the Kilo templates, and changing the compute count from 1 to 2
9. wait for the deployment to complete. It actually failed with UPDATE_FAILED; stack_status_reason: Engine went down during stack UPDATE
10. check Ironic and Nova on the UC - the second compute node appears to have been added fine
11. tried to stop the instances and restart them so they would spread across the two compute nodes, and found the VMs cannot be stopped
12. checked the compute node, and saw a repeating message in the nova-compute log:
2016-04-13 15:11:30.687 30346 ERROR oslo_messaging._drivers.impl_rabbit [req-377bb6b2-4ed0-47d8-aa4e-a4c85a8dc8d1 - - - - -] AMQP server 192.168.100.13:5672 closed the connection. Check login credentials: Socket closed
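The AMQP error above points at a credential mismatch between nova-compute and the controller's RabbitMQ. A minimal way to see what nova-compute is presenting to the broker is to pull the rabbit settings out of its config (a sketch; on a real compute node the file is /etc/nova/nova.conf, and the sample values below are illustrative, not taken from this setup):

```shell
# Create a small sample config to illustrate the check; on a real
# compute node you would grep /etc/nova/nova.conf directly.
cat > /tmp/nova.conf <<'EOF'
[oslo_messaging_rabbit]
rabbit_hosts=192.168.100.13:5672
rabbit_userid=guest
rabbit_password=guest
EOF

# Extract the credentials nova-compute will present to the broker.
grep -E '^rabbit_(hosts|userid|password)' /tmp/nova.conf
```

The extracted rabbit_password can then be compared against the OVERCLOUD_RABBITMQ_PASSWORD value in the tripleo-overcloud-passwords file that was used for the deploy; a mismatch produces exactly the "Check login credentials" loop seen in the log.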

Version-Release number of selected component (if applicable):
7.3 GA on the overcloud

Undercloud on 8 puddle:
python-django-openstack-auth-2.0.1-1.2.el7ost.noarch
openstack-dashboard-8.0.1-2.el7ost.noarch
openstack-heat-engine-5.0.1-5.el7ost.noarch
openstack-nova-scheduler-12.0.2-5.el7ost.noarch
openstack-neutron-ml2-7.0.1-15.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
openstack-ceilometer-collector-5.0.2-2.el7ost.noarch
openstack-ironic-inspector-2.2.5-2.el7ost.noarch
openstack-selinux-0.6.58-1.el7ost.noarch
openstack-tuskar-0.4.18-5.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-1.el7ost.noarch
openstack-swift-2.5.0-2.el7ost.noarch
openstack-ceilometer-notification-5.0.2-2.el7ost.noarch
openstack-neutron-common-7.0.1-15.el7ost.noarch
python-openstackclient-1.7.2-1.el7ost.noarch
openstack-dashboard-theme-8.0.1-2.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-5.el7ost.noarch
openstack-tempest-liberty-20160317.1.el7ost.noarch
openstack-nova-console-12.0.2-5.el7ost.noarch
openstack-nova-novncproxy-12.0.2-5.el7ost.noarch
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
openstack-glance-11.0.1-4.el7ost.noarch
openstack-keystone-8.0.1-1.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
openstack-tripleo-0.0.7-1.el7ost.noarch
openstack-nova-cert-12.0.2-5.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
openstack-ceilometer-alarm-5.0.2-2.el7ost.noarch
openstack-swift-object-2.5.0-2.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-tuskar-ui-0.4.0-5.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-7.el7ost.noarch
openstack-ceilometer-common-5.0.2-2.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-heat-common-5.0.1-5.el7ost.noarch
openstack-heat-api-cfn-5.0.1-5.el7ost.noarch
openstack-nova-conductor-12.0.2-5.el7ost.noarch
openstack-ceilometer-central-5.0.2-2.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch
openstack-ceilometer-polling-5.0.2-2.el7ost.noarch
openstack-tripleo-common-0.3.1-1.el7ost.noarch
openstack-heat-api-5.0.1-5.el7ost.noarch
openstack-nova-api-12.0.2-5.el7ost.noarch
openstack-swift-proxy-2.5.0-2.el7ost.noarch
openstack-swift-container-2.5.0-2.el7ost.noarch
openstack-nova-common-12.0.2-5.el7ost.noarch
openstack-nova-compute-12.0.2-5.el7ost.noarch
openstack-neutron-7.0.1-15.el7ost.noarch
openstack-ceilometer-api-5.0.2-2.el7ost.noarch
openstack-swift-account-2.5.0-2.el7ost.noarch
openstack-tuskar-ui-extras-0.0.4-2.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.5-1.el7ost.noarch
openstack-swift-plugin-swift3-1.9-1.el7ost.noarch


How reproducible:
once so far

Steps to Reproduce:
1. See the flow in the description above.

Actual results:
On the pre-existing compute node, the nova-compute service is stuck starting.

Expected results:
The Heat stack scale-out completes and the Nova services keep working.
Additional info:

setup is available for investigation

Comment 2 Brad P. Crochet 2016-04-15 13:32:36 UTC
It appears that on the system in question, the overcloud deploy was rerun from a directory different from the one used for the initial run. As a result, the tripleo-overcloud-passwords file was regenerated (as were all of the passwords), which put the stack into an indeterminate state. If this can be reproduced using the same password file, please reopen. Otherwise, closing this.
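A simple guard against this failure mode (purely illustrative; this check is not part of the TripleO tooling) is to confirm the existing password file is present in the working directory before rerunning the deploy, since an update run from a directory without one generates fresh passwords:

```shell
# Hypothetical pre-flight check before rerunning "openstack overcloud deploy".
# DEPLOY_DIR is a stand-in for the directory used on the initial run.
DEPLOY_DIR=/tmp/deploy-demo
mkdir -p "$DEPLOY_DIR"
cd "$DEPLOY_DIR"

if [ ! -f tripleo-overcloud-passwords ]; then
    echo "WARNING: no tripleo-overcloud-passwords in $PWD;"
    echo "a stack update from here would regenerate all passwords."
fi
```

Running the update from the original directory (or copying the original password file into the new one) keeps the stack's credentials stable across scale-out runs.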