Bug 1290572 - Scaling out an updated overcloud (7.1->7.2) fails
Summary: Scaling out an updated overcloud (7.1->7.2) fails
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assignee: Jiri Stransky
QA Contact: Marius Cornea
Depends On:
Reported: 2015-12-10 20:28 UTC by Marius Cornea
Modified: 2015-12-21 16:54 UTC
7 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-93.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2015-12-21 16:54:38 UTC
Target Upstream Version:

Attachments (Terms of Use)
deployment error (10.39 KB, text/plain)
2015-12-10 20:28 UTC, Marius Cornea
os-collect-config (3.99 MB, text/x-vhdl)
2015-12-10 22:16 UTC, Marius Cornea
/var/log/messages (11.57 MB, text/plain)
2015-12-10 22:18 UTC, Marius Cornea

System ID Private Priority Status Summary Last Updated
OpenStack gerrit 245093 0 None None None Never
Red Hat Product Errata RHBA-2015:2651 0 normal SHIPPED_LIVE Red Hat Enterprise Linux OSP 7 director Bug Fix Advisory 2015-12-21 21:50:26 UTC

Description Marius Cornea 2015-12-10 20:28:06 UTC
Created attachment 1104490 [details]
deployment error

Description of problem:
I'm trying to add an additional node to an overcloud that was updated from 7.1 to 7.2, but the stack update fails because puppet fails to restart openstack-nova-scheduler on one of the controller nodes. If I rerun the update command, the stack update completes successfully.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy 7.1 by using 7.1 templates:
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml 

2. Update the undercloud to 7.2 and run the update procedure to 7.2 with 7.2 templates:
/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
         --templates ~/templates/my-overcloud \
         -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
         -e ~/templates/my-overcloud/environments/network-isolation.yaml \
         -e ~/templates/network-environment.yaml \
         -e ~/templates/firstboot-environment.yaml \
         -e ~/templates/ceph.yaml \
         -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
         -e ~/templates/ctrlport.yaml

Wait for the update to complete

3. Try to scale out with an additional node:

openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml \
    -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
    -e ~/templates/ctrlport.yaml

Actual results:
Stack update fails. I'm attaching the deployment output. When I log in to the controller, nova-scheduler appears to be running:

[root@overcloud-controller-1 heat-admin]# systemctl status openstack-nova-scheduler
● openstack-nova-scheduler.service - OpenStack Nova Scheduler Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-scheduler.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2015-12-10 14:45:55 EST; 12min ago
 Main PID: 11340 (nova-scheduler)
   CGroup: /system.slice/openstack-nova-scheduler.service
           └─11340 /usr/bin/python /usr/bin/nova-scheduler

Dec 10 14:45:53 overcloud-controller-1.localdomain systemd[1]: Starting OpenStack Nova Scheduler Server...
Dec 10 14:45:55 overcloud-controller-1.localdomain systemd[1]: Started OpenStack Nova Scheduler Server.
Warning: openstack-nova-scheduler.service changed on disk. Run 'systemctl daemon-reload' to reload units.

Expected results:
The stack update completes ok.

Additional info:

If I rerun the deploy command a second time, the stack update finishes successfully.

Comment 1 Emilien Macchi 2015-12-10 21:39:53 UTC
This is obvious: Puppet fails because systemd returns 1 when trying to restart nova-scheduler.

Looking at the few logs you provided, it looks like we might need to run 'systemctl daemon-reload' after yum update sometimes.

"If I rerun the deploy command for a 2nd time the stack update finishes successfully." > I'm perplexed by this; it would mean my statement is wrong and that a second run provides what we missed during the first run.

Can you try to run 'systemctl daemon-reload' at the end of step 2 and then run step 3?
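The suggested workaround can be sketched as a loop over the controller nodes. The IP addresses below are placeholders (on a real undercloud they would come from `nova list`), and the loop only prints each command rather than running it; drop the `echo` to actually execute them.

```shell
# Hedged sketch of the suggested workaround: run `systemctl daemon-reload`
# on every controller after the 7.2 update (step 2) and before the
# scale-out (step 3).
CONTROLLERS="192.0.2.10 192.0.2.11 192.0.2.12"   # placeholder controller IPs
for ip in $CONTROLLERS; do
  cmd="ssh heat-admin@$ip sudo systemctl daemon-reload"
  echo "$cmd"   # dry run: remove the `echo` to execute for real
done
```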

Comment 2 Marius Cornea 2015-12-10 22:16:31 UTC
Created attachment 1104509 [details]

Attaching /var/log/messages and the os-collect-config journal in case they provide any helpful info for this. I'll run a fresh environment tomorrow and run systemctl daemon-reload after step 2 and let you know how it goes.

Comment 3 Marius Cornea 2015-12-10 22:18:43 UTC
Created attachment 1104510 [details]

Comment 7 Jiri Stransky 2015-12-11 17:23:03 UTC
The issue was that pacemaker wasn't getting into maintenance mode for the duration of the puppet run on the second and subsequent `openstack overcloud deploy` calls.

The heat resource executing this action didn't receive DeployIdentifier/UpdateIdentifier as an input, which caused it to be executed only the first time it was introduced into the stack, and never again.

The resource now receives the same re-apply trigger as the Puppet runs, which means it's going to run together with Puppet as intended.
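The shape of the fix can be illustrated with a hedged template sketch. The resource and parameter names below are illustrative, not a quote of the actual patch (gerrit 245093): the key point is that the deployment's input_values include the DeployIdentifier parameter, which changes on every `overcloud deploy`, so Heat detects an updated input and re-runs the maintenance-mode action alongside the Puppet runs.

```yaml
# Illustrative only -- not the literal change from gerrit 245093.
ControllerPrePuppetMaintenanceMode:
  type: OS::Heat::SoftwareDeployments
  properties:
    servers: {get_param: servers}
    config: {get_resource: MaintenanceModeConfig}
    input_values:
      # Re-apply trigger: DeployIdentifier gets a new value on each
      # `overcloud deploy`, so Heat re-executes this deployment every time.
      update_identifier: {get_param: DeployIdentifier}
```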

Comment 8 Jiri Stransky 2015-12-11 18:41:30 UTC
Tested by deploying and then scaling up by one compute node. This is not the exact triggering scenario described by Marius, but I saw pacemaker get in and out of maintenance mode during the scale-up.

Comment 11 Marius Cornea 2015-12-16 20:31:24 UTC

I was able to successfully scale out according to the reproduce steps.

Comment 13 errata-xmlrpc 2015-12-21 16:54:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

