Bug 1464456

Summary: Upgrade 8 to 9 failed, customer skipped Updating the Configuration Agent step.
Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: rhosp-director
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED NOTABUG
QA Contact: Amit Ugol <augol>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.0 (Mitaka)
CC: dbecker, ebarrera, mburns, mcornea, morazi, mschuppe, rhel-osp-director-maint, sathlang
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-30 09:51:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Eduard Barrera 2017-06-23 13:09:40 UTC
Description of problem:

An overcloud update failed with the following error, as in [1]:

Error: (<unknown>): mapping values are not allowed in this context at line 334 column 42 at /var/lib/heat-config/heat-config-puppet/2c674b79-51d1-44f9-81d9-691f6227ac81.pp:16 on node overcloud-test-controller-1.localdomain
Wrapped exception:
(<unknown>): mapping values are not allowed in this context at line 334 column 42
Error: (<unknown>): mapping values are not allowed in this context at line 334 column 42 at /var/lib/heat-config/heat-config-puppet/2c674b79-51d1-44f9-81d9-691f6227ac81.pp:16 on node overcloud-test-controller-1.localdomain
We traced the error and concluded that it is caused by line 334 of the file /etc/puppet/hieradata/controller.yaml:
Psych::SyntaxError: (controller.yaml): mapping values are not allowed in this context at line 334 column 42
	from /usr/share/ruby/psych.rb:205:in `parse'
	from /usr/share/ruby/psych.rb:205:in `parse_stream'
	from /usr/share/ruby/psych.rb:153:in `parse'
	from /usr/share/ruby/psych.rb:129:in `load'
	from /usr/share/ruby/psych.rb:299:in `block in load_file'
	from /usr/share/ruby/psych.rb:299:in `open'
	from /usr/share/ruby/psych.rb:299:in `load_file'
	from (irb):7:in `block in irb_binding'
	from (irb):5:in `foreach'
	from (irb):5
	from /bin/irb:12:in `<main>'

Line 334 reads: ceilometer::dispatcher::gnocchi::url: ://:
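
Lines of this kind can also be located without loading the whole file through a YAML parser. The following is a minimal sketch, not part of the original report: it flags "key: value" lines whose unquoted value ends in a colon or contains a colon followed by a space, which is a pattern that YAML parsers typically reject with "mapping values are not allowed in this context". The function name find_suspect_lines and the sample data are assumptions made for illustration, and the check is a heuristic rather than a YAML validator.

```python
def find_suspect_lines(text):
    """Return (line_number, line) pairs for 'key: value' lines whose
    unquoted value ends with ':' or contains ': '."""
    suspects = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.rstrip()
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Hiera keys contain '::', so split on the first colon-space pair.
        key, sep, value = line.partition(": ")
        value = value.strip()
        if not sep or not value:
            continue
        # Quoted scalars, block scalars and flow collections are left alone.
        if value[0] in "'\"|>[{":
            continue
        if value.endswith(":") or ": " in value:
            suspects.append((number, line))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample; only the second line is taken from the report.
    sample = (
        "example::service::url: http://host.example:8041\n"
        "ceilometer::dispatcher::gnocchi::url: ://:\n"
    )
    for number, line in find_suspect_lines(sample):
        print(number, line)
```

On a node, the text passed in would be the contents of /etc/puppet/hieradata/controller.yaml.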

The error is probably caused by skipping the step "Updating the Configuration Agent" [2].

The director is now at version 9, so it is no longer possible to perform that step as documented; if we do it now, it will correspond to the step for updating from 9 to 10.


What steps should be done now to continue with the upgrade to 9?



[1]https://bugzilla.redhat.com/show_bug.cgi?id=1443638

[2]https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/upgrading_red_hat_openstack_platform/sect-updating_the_environment#sect-Updating_the_Configuration_Agent


2.3. Updating the Configuration Agent


Comment 1 Sofer Athlan-Guyot 2017-06-23 14:06:19 UTC
Hi,

If I understand this correctly, you have managed to upgrade to OSP9 but skipped the step in the 2.4 documentation.

This step makes sure that all previous runs of the heat agent are remembered after a reboot of the overcloud nodes.

So can you check the status of /var/run/heat-config and /var/lib/heat-config?

Basically you want the /var/lib/heat-config to be populated.  You can review the script that makes the copy: https://github.com/openstack/heat-templates/blob/master/hot/software-config/elements/heat-config/bin/heat-config-rebuild-deployed 

Then you can check that the heat agent is using the right directory /var/lib/...
in this script /usr/libexec/os-refresh-config/configure.d/55-heat-config on the overcloud.
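
The directory check described above can be scripted. This is a rough sketch under assumptions: the helper name heat_config_state and its root parameter (which lets it run against a copied tree, such as an extracted sosreport, instead of a live node) are illustrative and not part of any heat tooling.

```python
import os


def heat_config_state(root="/"):
    """List the entries of the old and new heat-config directories.

    Returns a dict mapping each directory to a sorted list of entry
    names, or to None when the directory does not exist under root.
    """
    state = {}
    for rel in ("var/run/heat-config", "var/lib/heat-config"):
        path = os.path.join(root, rel)
        if os.path.isdir(path):
            state["/" + rel] = sorted(os.listdir(path))
        else:
            state["/" + rel] = None
    return state


if __name__ == "__main__":
    for path, entries in heat_config_state().items():
        if entries is None:
            print(path, "is missing")
        else:
            print(path, "contains", ", ".join(entries) or "nothing")
```

A populated /var/lib/heat-config is the state the step is meant to produce.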

Comment 2 Sofer Athlan-Guyot 2017-06-23 14:06:55 UTC
Tell me if that's enough information to get you going.

Comment 3 Martin Schuppert 2017-06-26 09:35:35 UTC
(In reply to Sofer Athlan-Guyot from comment #1)
> Hi,
> 
> If I understand this correctly, you have managed to upgrade to OSP9 but
> skipped the step in the 2.4 documentation.
> 
> This step makes sure that all previous runs of the heat agent are
> remembered after a reboot of the overcloud nodes.
> 
> So can you check the status of /var/run/heat-config and /var/lib/heat-config?

The nodes are still using /var/run instead of /var/lib:

[mschuppe@collab-shell var]$ ll run/heat-config/
total 688
drwxrwxrwx+ 2 mschuppe mschuppe   8192 Jun 22 16:10 deployed
-rwxrwxrwx+ 1 mschuppe mschuppe 678141 Jun 22 16:09 heat-config
drwxrwxrwx+ 2 mschuppe mschuppe   4096 Jun 22 08:49 heat-config-script

[mschuppe@collab-shell var]$ ll lib/heat-config/
total 24
drwxrwxrwx+ 2 mschuppe mschuppe 4096 Jun 22 16:10 heat-config-puppet
drwxrwxrwx+ 3 mschuppe mschuppe 4096 Jun 22 15:22 heat-config-script
drwxrwxrwx+ 2 mschuppe mschuppe   44 Jun  3  2016 hooks

> 
> Basically you want the /var/lib/heat-config to be populated.  You can review
> the script that makes the copy:
> https://github.com/openstack/heat-templates/blob/master/hot/software-config/elements/heat-config/bin/heat-config-rebuild-deployed
> 
> Then you can check that the heat agent is using the right directory
> /var/lib/...
> in this script /usr/libexec/os-refresh-config/configure.d/55-heat-config on
> the overcloud.

/usr/libexec/os-refresh-config/configure.d/55-heat-config still uses the old /var/run:

HOOKS_DIR = os.environ.get('HEAT_CONFIG_HOOKS',
                           '/var/lib/heat-config/hooks')
CONF_FILE = os.environ.get('HEAT_SHELL_CONFIG',
                           '/var/run/heat-config/heat-config')
DEPLOYED_DIR = os.environ.get('HEAT_CONFIG_DEPLOYED',
                              '/var/run/heat-config/deployed')
HEAT_CONFIG_NOTIFY = os.environ.get('HEAT_CONFIG_NOTIFY',
                                    'heat-config-notify')
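
The same inspection can be done mechanically across several nodes. Below is a small sketch, assuming the hook keeps the `NAME = os.environ.get('ENV_VAR', 'default')` layout quoted above; the function name defaults_under_var_run is hypothetical. It only reports which defaults point under /var/run and makes no claim about which of them should change.

```python
import re

# Matches: NAME = os.environ.get('ENV_VAR', 'default'), possibly split
# across two lines as in the hook quoted above.
ASSIGNMENT_RE = re.compile(
    r"(\w+)\s*=\s*os\.environ\.get\(\s*'[^']*'\s*,\s*'([^']*)'\s*\)"
)


def defaults_under_var_run(source):
    """Return {variable_name: default} for defaults located under /var/run."""
    return {
        name: default
        for name, default in ASSIGNMENT_RE.findall(source)
        if default.startswith("/var/run/")
    }


if __name__ == "__main__":
    hook = (
        "HOOKS_DIR = os.environ.get('HEAT_CONFIG_HOOKS',\n"
        "                           '/var/lib/heat-config/hooks')\n"
        "DEPLOYED_DIR = os.environ.get('HEAT_CONFIG_DEPLOYED',\n"
        "                              '/var/run/heat-config/deployed')\n"
    )
    print(defaults_under_var_run(hook))
```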

The remaining question is whether it is OK to run the documented step [1] from an undercloud that is already upgraded to OSP9 instead of from an OSP8 undercloud (the overcloud is still OSP8):
 
1) from OSP9 undercloud copy the /usr/share/openstack-heat-templates/software-config/elements/heat-config/os-refresh-config/configure.d/55-heat-config to the overcloud nodes
2) on the overcloud nodes create /var/lib/heat-config/deployed
3) copy heat-config-rebuild-deployed from OSP9 undercloud to the overcloud nodes
4) run heat-config-rebuild-deployed (or manually move /var/run/heat-config/deployed to /var/lib/heat-config/deployed )

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/upgrading_red_hat_openstack_platform/sect-updating_the_environment#sect-Updating_the_Configuration_Agent
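
The manual alternative mentioned in step 4 can be sketched as follows. This is an illustration only, written for Python 3, and it is not a replacement for heat-config-rebuild-deployed, whose actual behavior is defined by the script itself. The function name copy_deployed is hypothetical, and the sketch copies rather than moves so that the old directory stays intact until the result has been checked.

```python
import os
import shutil
import tempfile


def copy_deployed(old_dir, new_dir):
    """Copy deployment records from old_dir into new_dir.

    Existing files in new_dir are never overwritten. Returns the sorted
    list of file names that were copied.
    """
    os.makedirs(new_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(old_dir)):
        src = os.path.join(old_dir, name)
        dst = os.path.join(new_dir, name)
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(name)
    return copied


if __name__ == "__main__":
    # Demonstration against a throwaway tree, not the real directories.
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "run", "heat-config", "deployed")
        new = os.path.join(tmp, "lib", "heat-config", "deployed")
        os.makedirs(old)
        with open(os.path.join(old, "example.json"), "w") as handle:
            handle.write("{}")
        print(copy_deployed(old, new))
```

On an overcloud node the arguments would be /var/run/heat-config/deployed and /var/lib/heat-config/deployed.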

Comment 4 Sofer Athlan-Guyot 2017-06-26 10:08:20 UTC
Hi,

As discussed on IRC, you can directly apply the steps from the documentation.

We have cross-checked that 55-heat-config from OSP8 and OSP9 are the same, so everything applies.

Comment 5 Eduard Barrera 2017-06-28 13:18:20 UTC
During the upgrade from 8 to 9, at step 3.4.3 "Installing Aodh":
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/9/html-single/upgrading_red_hat_openstack_platform/#sect-Major-Upgrading_the_Overcloud-Aodh

The deployment finishes as follows:


2017-06-28 07:41:26 [overcloud-test-AllNodesExtraConfig-7maxstusxiwk-NetworkMidonetDeploymentComputes-x5tposkcnizj]: UPDATE_COMPLETE Stack UPDATE completed successfully
2017-06-28 07:41:26 [overcloud-test-AllNodesExtraConfig-7maxstusxiwk-NetworkMidonetDeploymentControllers-ptyl3bafm72q]: UPDATE_COMPLETE Stack UPDATE completed successfully
2017-06-28 07:41:27 [NetworkMidonetDeploymentControllers]: UPDATE_COMPLETE state changed
2017-06-28 07:41:27 [NetworkMidonetDeploymentComputes]: UPDATE_COMPLETE state changed
2017-06-28 07:41:29 [overcloud-test-AllNodesExtraConfig-7maxstusxiwk]: UPDATE_COMPLETE Stack UPDATE completed successfully
2017-06-28 07:41:30 [AllNodesExtraConfig]: UPDATE_COMPLETE state changed
Stack overcloud-test UPDATE_COMPLETE
/home/stack/.ssh/known_hosts updated.
Original contents retained as /home/stack/.ssh/known_hosts.old
Authorization Failed: Unable to establish connection to https://api-test.heicloud.uni-heidelberg.de:13000/v2.0/tokens

I am not sure what the error is here. Is the update finishing correctly, with only some extra step failing afterwards?

In any case, this step is supposed to remove ceilometer and install aodh, but the alarm evaluator is still present. I also do not understand why heat is not started:




[root@overcloud-test-controller-0 heat-admin]# pcs status |grep -i stopped -B 1
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 overcloud-test-controller-2 ]

Failed Actions:
* openstack-ceilometer-alarm-evaluator_start_0 on overcloud-test-controller-0 'not installed' (5): call=242, status=Not installed, exitreason='none', last-rc-change='Mon Jun 26 15:55:18 2017', queued=0ms, exec=123ms
* openstack-ceilometer-alarm-evaluator_start_0 on overcloud-test-controller-1 'not installed' (5): call=239, status=Not installed, exitreason='none', last-rc-change='Mon Jun 26 15:55:18 2017', queued=0ms, exec=124ms
* openstack-ceilometer-alarm-evaluator_start_0 on overcloud-test-controller-2 'not installed' (5): call=234, status=Not installed, exitreason='none', last-rc-change='Mon Jun 26 15:55:18 2017', queued=0ms, exec=134ms
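
For larger clusters, the stopped clone sets can be collected from saved `pcs status` output with a short script. This is a minimal sketch that assumes the "Clone Set: NAME [RESOURCE]" and "Stopped: [ nodes ]" line layout shown above; the function name stopped_clone_sets is hypothetical and the parser has only been shaped around this excerpt.

```python
import re

CLONE_RE = re.compile(r"Clone Set:\s+(\S+)\s+\[")
STOPPED_RE = re.compile(r"Stopped:\s+\[\s*(.*?)\s*\]")


def stopped_clone_sets(text):
    """Map each clone set name to the list of nodes where it is stopped."""
    result = {}
    current = None
    for line in text.splitlines():
        match = CLONE_RE.search(line)
        if match:
            # Remember the clone set that following lines belong to.
            current = match.group(1)
            continue
        match = STOPPED_RE.search(line)
        if match and current is not None:
            result[current] = match.group(1).split()
    return result


if __name__ == "__main__":
    sample = (
        " Clone Set: openstack-heat-api-clone [openstack-heat-api]\n"
        "     Stopped: [ overcloud-test-controller-0 overcloud-test-controller-1 ]\n"
    )
    for clone, nodes in stopped_clone_sets(sample).items():
        print(clone, "stopped on", len(nodes), "node(s)")
```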


I checked that stonith is off, and apparently there is no constraint that prevents heat from starting when the ceilometer alarm service is not started.

These are the entries in corosync.log:


Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-0 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-0 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-0 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator:0 has failed INFINITY times on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-1 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-1 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-1 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator:0 has failed INFINITY times on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-2 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away from overcloud-test-controller-2 after 1000000 failures (max=1000000)
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: get_failcount_full:        openstack-ceilometer-alarm-evaluator-clone has failed INFINITY times on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:  warning: check_migration_threshold: Forcing openstack-ceilometer-alarm-evaluator-clone away f


Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-engine:1 on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-engine:2 on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api:0 on overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api:1 on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api:2 on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cloudwatch:0 on overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cloudwatch:1 on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cloudwatch:2 on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cfn:0 on overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cfn:1 on overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: RecurringOp:        Start recurring monitor (60s) for openstack-heat-api-cfn:2 on overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:1 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:2 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:1 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-ceilometer-notification:2 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: native_deallocate: Deallocating openstack-heat-api:0 from overcloud-test-controller-0
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:1 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: native_deallocate: Deallocating openstack-heat-api:1 from overcloud-test-controller-1
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:2 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: native_deallocate: Deallocating openstack-heat-api:2 from overcloud-test-controller-2
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:1 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:2 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:1 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:2 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:0 from being active
Jun 28 10:42:33 [3644] overcloud-test-controller-2.localdomain    pengine:     info: clone_update_actions_interleave:   Inhibiting openstack-heat-api:1 from being active

Comment 6 Sofer Athlan-Guyot 2017-06-30 09:51:57 UTC
Hi,

Closing this one. The new issue is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1465939

thanks,