Bug 1290572

Summary: Scaling out an updated overcloud (7.1->7.2) fails
Product: Red Hat OpenStack
Component: rhosp-director
Version: 7.0 (Kilo)
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: y2
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Reporter: Marius Cornea <mcornea>
Assignee: Jiri Stransky <jstransk>
QA Contact: Marius Cornea <mcornea>
CC: emacchi, jcoufal, jstransk, mburns, mcornea, rhel-osp-director-maint, yeylon
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-93.el7ost
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-12-21 16:54:38 UTC
Attachments:
  deployment error (flags: none)
  os-collect-config (flags: none)
  /var/log/messages (flags: none)

Description Marius Cornea 2015-12-10 20:28:06 UTC
Created attachment 1104490
deployment error

Description of problem:
I'm trying to add an additional node to an overcloud that was updated from 7.1 to 7.2, but the stack update fails because Puppet fails to restart openstack-nova-scheduler on one of the controller nodes. If I rerun the deploy command, the stack update completes ok.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-91.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.1 by using 7.1 templates:
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml 

2. Update the undercloud to 7.2 and run the update procedure to 7.2 with 7.2 templates:
/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
         --templates ~/templates/my-overcloud \
         -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
         -e ~/templates/my-overcloud/environments/network-isolation.yaml \
         -e ~/templates/network-environment.yaml \
         -e ~/templates/firstboot-environment.yaml \
         -e ~/templates/ceph.yaml \
         -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
         -e ~/templates/ctrlport.yaml

Wait for the update to complete

3. Try to scale out with an additional node:

openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml \
    -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \
    -e ~/templates/ctrlport.yaml


Actual results:
Stack update fails. I'm attaching the deployment output. When I log in to the controller the nova-scheduler appears to be running:

[root@overcloud-controller-1 heat-admin]# systemctl status openstack-nova-scheduler
● openstack-nova-scheduler.service - OpenStack Nova Scheduler Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-scheduler.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2015-12-10 14:45:55 EST; 12min ago
 Main PID: 11340 (nova-scheduler)
   CGroup: /system.slice/openstack-nova-scheduler.service
           └─11340 /usr/bin/python /usr/bin/nova-scheduler

Dec 10 14:45:53 overcloud-controller-1.localdomain systemd[1]: Starting OpenStack Nova Scheduler Server...
Dec 10 14:45:55 overcloud-controller-1.localdomain systemd[1]: Started OpenStack Nova Scheduler Server.
Warning: openstack-nova-scheduler.service changed on disk. Run 'systemctl daemon-reload' to reload units.


Expected results:
The stack update completes ok.

Additional info:

If I rerun the deploy command for a 2nd time the stack update finishes successfully.

Comment 1 Emilien Macchi 2015-12-10 21:39:53 UTC
The cause of the failure is obvious: Puppet fails because systemd returns 1 when trying to restart nova-scheduler.

Looking at the few logs you provided, it looks like we might sometimes need to run 'systemctl daemon-reload' after a yum update.

"If I rerun the deploy command for a 2nd time the stack update finishes successfully." > This perplexes me: it would mean my statement is wrong and that a second run provides whatever was missed during the first run.

Can you try to run 'systemctl daemon-reload' at the end of step 2 and then run step 3?
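
The suggested check can be scripted. This is a minimal sketch, not part of the director tooling: the function names unit_needs_reload and reload_if_stale are hypothetical, and it assumes that the NeedDaemonReload unit property printed by 'systemctl show' corresponds to the "changed on disk" warning seen in the status output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any OpenStack or systemd package.

# Succeeds when systemd reports that the unit file changed on disk
# since it was last loaded. $1 is the unit name.
unit_needs_reload() {
    [ "$(systemctl show -p NeedDaemonReload "$1")" = "NeedDaemonReload=yes" ]
}

# Runs 'systemctl daemon-reload' only when the given unit is stale,
# so that a later restart uses the updated unit definition.
reload_if_stale() {
    if unit_needs_reload "$1"; then
        systemctl daemon-reload
    fi
}
```

On a controller node this could be called as 'reload_if_stale openstack-nova-scheduler' at the end of step 2, before attempting step 3.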

Comment 2 Marius Cornea 2015-12-10 22:16:31 UTC
Created attachment 1104509
os-collect-config

Attaching /var/log/messages and the os-collect-config journal in case they provide any helpful info for this. I'll run a fresh environment tomorrow and run systemctl daemon-reload after step 2 and let you know how it goes.

Comment 3 Marius Cornea 2015-12-10 22:18:43 UTC
Created attachment 1104510
/var/log/messages

Comment 7 Jiri Stransky 2015-12-11 17:23:03 UTC
The issue was that pacemaker wasn't getting into maintenance mode for the duration of the puppet run on the second and subsequent `openstack overcloud deploy` calls.

The Heat resource executing this action didn't receive DeployIdentifier/UpdateIdentifier as an input, so it was executed only the first time it was introduced into the stack, and never again.

The resource now receives the same re-apply trigger as the Puppet runs, which means it's going to run together with Puppet as intended.
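
The behaviour described above can be illustrated with a toy model. This is not Heat code: apply_if_changed and the state directory are hypothetical, and the sketch only assumes that a step is re-applied when its inputs differ from those recorded on the previous run, which is why a step with static inputs runs once while a step fed a changing identifier runs every time.

```shell
#!/bin/sh
# Toy model only: mimics "re-apply when inputs change", not real Heat logic.
state_dir=$(mktemp -d)

# $1: step name, $2: input string for this run.
# Prints whether the step would be applied or skipped.
apply_if_changed() {
    stamp="$state_dir/$1"
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$2" ]; then
        echo "$1: skipped"
    else
        # Record the inputs so that an identical rerun is skipped.
        printf '%s' "$2" > "$stamp"
        echo "$1: applied"
    fi
}
```

Calling 'apply_if_changed maintenance static' twice prints "applied" and then "skipped", which matches the buggy behaviour. Calling it with a different identifier on each run (for example 'id=1', then 'id=2') prints "applied" both times, which matches the behaviour after the fix.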

Comment 8 Jiri Stransky 2015-12-11 18:41:30 UTC
Tested by deploying and then scaling up +1 compute. This is not the exact triggering scenario described by Marius, but I saw Pacemaker get in and out of maintenance mode during the scale up.
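
One way to observe this during a scale up is to poll the cluster property from a controller node. This is a rough sketch with hypothetical function names. It assumes that 'pcs property show maintenance-mode' prints a line of the form "maintenance-mode: true" while maintenance mode is on; the exact output may differ between pcs versions.

```shell
#!/bin/sh
# Hypothetical helpers; the pcs output format is an assumption.

# Succeeds when the cluster reports maintenance mode as enabled.
in_maintenance_mode() {
    pcs property show maintenance-mode | grep -q 'maintenance-mode: true'
}

# Prints one timestamped sample per iteration.
# $1: number of samples, $2: seconds to sleep between samples.
watch_maintenance_mode() {
    i=0
    while [ "$i" -lt "$1" ]; do
        if in_maintenance_mode; then state=on; else state=off; fi
        echo "$(date -u +%H:%M:%S) maintenance-mode=$state"
        i=$((i + 1))
        sleep "$2"
    done
}
```

Running 'watch_maintenance_mode 120 5' in a second terminal while the deploy command runs would sample the property every five seconds for ten minutes.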

Comment 11 Marius Cornea 2015-12-16 20:31:24 UTC
openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch

I was able to scale out successfully by following the reproduction steps.
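
Before retesting, the installed templates package can be compared against the Fixed In Version build. This is a minimal sketch with hypothetical function names. It relies on GNU 'sort -V' for ordering, which is an approximation and not the same algorithm as RPM's own version comparison.

```shell
#!/bin/sh
# Build listed in the "Fixed In Version" field of this bug.
FIXED_BUILD="0.8.6-93.el7ost"

# Prints VERSION-RELEASE of the installed templates package.
installed_build() {
    rpm -q --qf '%{VERSION}-%{RELEASE}\n' openstack-tripleo-heat-templates
}

# Succeeds when build $1 sorts at or after the fixed build.
has_fix() {
    lowest=$(printf '%s\n%s\n' "$FIXED_BUILD" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_BUILD" ]
}
```

On the undercloud, 'has_fix "$(installed_build)" && echo "fix present"' would then report whether the installed build is at least the fixed one.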

Comment 13 errata-xmlrpc 2015-12-21 16:54:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2651