Bug 1366392

Summary: AODH migration fails because puppet-aodh module cannot be found by Puppet
Product: Red Hat OpenStack Reporter: Alexander Chuzhoy <sasha>
Component: openstack-tripleo-heat-templates    Assignee: Jiri Stransky <jstransk>
Status: CLOSED CURRENTRELEASE QA Contact: Arik Chernetsky <achernet>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 9.0 (Mitaka)    CC: athomas, dbecker, emacchi, jstransk, mburns, morazi, ohochman, rhel-osp-director-maint, sasha, sclewis, srevivo, tvignaud
Target Milestone: async   
Target Release: 9.0 (Mitaka)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-19 15:01:39 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1337794    

Description Alexander Chuzhoy 2016-08-11 20:40:42 UTC
rhel-osp-director: 8->9 upgrade of overcloud with network manager being disabled on OC nodes fails

Environment:
openstack-tripleo-heat-templates-2.0.0-30.el7ost.noarch
openstack-puppet-modules-8.1.8-1.el7ost.noarch
instack-undercloud-4.0.0-11.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-30.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-16.el7ost.noarch

Steps to reproduce:
1. Have an overcloud with Network Manager disabled on nodes.
2. Attempt to upgrade to 9

Result:
2016-06-29 13:29:21 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:31 [1]: SIGNAL_IN_PROGRESS Signal: deployment failed (1)
2016-06-29 13:29:31 [1]: CREATE_FAILED Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-06-29 13:29:32 [overcloud-UpdateWorkflow-ehdq5lr3lzcp-AodhUpgradeConfigDeployment-c4nltoaweugv]: CREATE_FAILED Resource CREATE failed: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-06-29 13:29:33 [AodhUpgradeConfigDeployment]: CREATE_FAILED Error: resources.AodhUpgradeConfigDeployment.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-06-29 13:29:34 [overcloud-UpdateWorkflow-ehdq5lr3lzcp]: UPDATE_FAILED Error: resources.AodhUpgradeConfigDeployment.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-06-29 13:29:35 [UpdateWorkflow]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.AodhUpgradeConfigDeployment.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-06-29 13:29:35 [1]: SIGNAL_IN_PROGRESS Signal: deployment succeeded
2016-06-29 13:29:36 [1]: UPDATE_COMPLETE state changed
2016-06-29 13:29:37 [overcloud-ControllerAllNodesValidationDeployment-szghjo6u4hqw]: UPDATE_COMPLETE Stack UPDATE completed successfully
2016-06-29 13:29:38 [ControllerAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
2016-06-29 13:29:38 [overcloud]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.AodhUpgradeConfigDeployment.resources[2]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-06-29 13:29:40 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:41 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:42 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:43 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:44 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:44 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:45 [1]: SIGNAL_COMPLETE Unknown
2016-06-29 13:29:46 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED
Deployment failed:  Heat Stack update failed.




    "deploy_stderr": "Error: NetworkManager is not running.\nCould not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass\nCould not retrieve fact='apache_version', resolution='<anonymous>': undefined method `[]' for nil:NilClass\n\u001b[1;31mError: Puppet::Parser::AST::Resource failed with error ArgumentError: Could not find declared class ::aodh at /var/lib/heat-config/heat-config-puppet/d9c22157-4dac-4144-9be0-1e4e3606866f.pp:30 on node overcloud-controller-1.localdomain\nWrapped exception:\nCould not find declared class ::aodh\u001b[0m\n\u001b[1;31mError: Puppet::Parser::AST::Resource failed with error ArgumentError: Could not find declared class ::aodh at /var/lib/heat-config/heat-config-puppet/d9c22157-4dac-4144-9be0-1e4e3606866f.pp:30 on node overcloud-controller-1.localdomain\u001b[0m\n",
    "deploy_status_code": 1

Comment 2 Omri Hochman 2016-08-11 21:01:36 UTC
This Bz blocks the path for Upgrade from:  7.3  ->  8.0 ->  9.0 

This came after applying the workaround for Bz#1364583, which suggests disabling NetworkManager on the overcloud nodes in order to successfully upgrade from 7.3 to 8.0.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1364583#c14

Comment 4 Emilien Macchi 2016-08-12 00:10:51 UTC
The error to take into account is:

Could not find declared class ::aodh

Note: don't worry about "Error: NetworkManager is not running"; this is not critical.

So if TripleO can't find ::aodh, it's because puppet-aodh is not installed. You need to make sure OPM was upgraded during the process and that the aodh module is present.

Comment 5 Alexander Chuzhoy 2016-08-12 00:49:55 UTC
[heat-admin@overcloud-controller-0 ~]$ rpm -q puppet
puppet-3.6.2-2.el7.noarch
[heat-admin@overcloud-controller-0 ~]$ logout
Connection to 192.168.0.11 closed.
[stack@undercloud72 ~]$ rpm -qa|grep puppet
openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch
puppet-3.6.2-4.el7sat.noarch
openstack-puppet-modules-8.1.8-1.el7ost.noarch

Comment 6 Alexander Chuzhoy 2016-08-12 00:54:09 UTC
The setup was successfully upgraded from 7 to 8.

When I came to upgrade it to 9, the reported issue occurred during the AODH migration step.

Comment 7 Emilien Macchi 2016-08-12 01:00:48 UTC
I'm pretty sure OPM is not updated on the overcloud at the stage of Aodh migration.

Again: the error is that OPM is not updated. Please ignore the NetworkManager thing.

Comment 10 Jiri Stransky 2016-08-12 14:09:47 UTC
Debugged the deployment. Moving the BZ to openstack-puppet-modules component and listing a workaround below.

Emilien is right that the root cause is the missing aodh module. It's not in fact completely missing: it's present in /usr/share/openstack-puppet/modules, but it's not symlinked from /etc/puppet/modules, where Puppet looks for modules.

The symlinking is currently (probably incorrectly?) part of the overcloud image building process rather than the openstack-puppet-modules RPM itself. The effect is that whenever a new module is added during an RPM update of openstack-puppet-modules, it cannot be found by Puppet, because the symlink in /etc/puppet/modules is not created.

We should probably move the symlinking onto the RPM level. Re-running the DIB elements on the overcloud to create the symlinks is probably not a realistic solution.


The workaround could be to run the following on every overcloud node:

ln -f -s /usr/share/openstack-puppet/modules/* /etc/puppet/modules/

prior to triggering the AODH migration. (The script line is taken from what the DIB element does when building an image [1].)


[1] https://github.com/openstack/tripleo-puppet-elements/blob/627b949430f9124181d4470abd908e25a9bfa760/elements/puppet-modules/install.d/puppet-modules-package-install/75-puppet-modules-package#L7
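The failure mode and the workaround can be sketched in a sandbox. This is a minimal illustration only, with temporary directories under /tmp standing in for the real /usr/share/openstack-puppet/modules and /etc/puppet/modules on an overcloud node; the module names used are arbitrary examples:

```shell
set -e
base=$(mktemp -d)
share="$base/usr/share/openstack-puppet/modules"
etc="$base/etc/puppet/modules"
mkdir -p "$share/keystone" "$etc"

# Image build time: the DIB element symlinks the modules present at that point.
ln -f -s "$share"/* "$etc"/

# Later, an RPM update of openstack-puppet-modules ships a new module (aodh),
# but nothing re-runs the symlinking, so Puppet cannot see it.
mkdir "$share/aodh"
[ -e "$etc/aodh" ] || echo "aodh is not on the Puppet module path"

# The workaround from comment 10: re-link everything, which picks up aodh.
ln -f -s "$share"/* "$etc"/
[ -e "$etc/aodh" ] && echo "aodh is now visible to Puppet"
```

On a real node the same single `ln -f -s` line is idempotent, so re-running it before the AODH migration is safe even on nodes whose symlinks are already current.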

Comment 11 Mike Burns 2016-08-12 15:20:24 UTC
(In reply to Jiri Stransky from comment #10)
> Debugged the deployment. Moving the BZ to openstack-puppet-modules component
> and listing a workaround below.
> 
> Emilien is right that the root cause is missing aodh module. It's not in
> fact completely missing, it's present in
> /usr/share/openstack-puppet/modules, but it's not symlinked from
> /etc/puppet/modules, where Puppet looks for modules.
> 
> The symlinking is currently (probably incorrectly?) part of the overcloud
> image building process rather than the openstack-puppet-module RPM itself.
> The effect is that whenever we add a new module during RPM update of
> openstack-puppet-modules, it cannot be found by Puppet, because the symlink
> in /etc/puppet/modules is not created.
> 
> We should probably move the symlinking onto the RPM level. Re-running the
> DIB elements on overcloud to create the symlinks is probably not a realistic
> solution.
> 
> 
> The workaround could to run this on every overcloud node:
> 
> ln -f -s /usr/share/openstack-puppet/modules/* /etc/puppet/modules/
> 
> prior to triggering the AODH migration. (The script line is taken from what
> the DIB element does when building an image [1].)
> 
> 
> [1]
> https://github.com/openstack/tripleo-puppet-elements/blob/
> 627b949430f9124181d4470abd908e25a9bfa760/elements/puppet-modules/install.d/
> puppet-modules-package-install/75-puppet-modules-package#L7

Given the move that DIB makes, I'm not sure that it makes sense to have the RPM do this automatically.  I think I'd rather see the upgrade do this symlink.  

In theory, we're only going to add new modules between releases.  (There is a rare case, possibly, where we add one within a release in which case I'd say do this on update and upgrade).  

Doing this in the RPM could have significant issues for people who use OPM outside of director or packstack. They might install the modules on a Foreman server and import them for other hosts, but not want them in /etc/puppet on that machine.

We also have to consider the packaging changes around OPM (separate rpms per module).

Comment 12 Jiri Stransky 2016-08-12 15:56:10 UTC
(In reply to Mike Burns from comment #11)
> Doing this in the rpm could have significant issues for people who use OPM
> outside of director or packstack.  The might install them on a foreman
> server and import them for other hosts, but not want them in /etc/puppet on
> that machine.  

Ok that's fair.

The solution on the upgrade side will probably be a bit hacky, because we'll need to put it into the AODH migration (as there's nothing prior to that running on the cloud during upgrade) to fix up existing OSP 8 deployments which are already in the wrong state.

The above would be just a Mitaka-specific fix. We'd deal with this properly upstream in a slightly different way -- we should probably trigger the symlinking both on minor updates and major upgrades to keep a consistent state at all times, though the chance of breakage during a minor update is low.

Moving back to t-h-t then :)

Comment 15 Jiri Stransky 2016-08-16 16:11:14 UTC
Submitted a Mitaka-only patch to fixup existing Liberty deployments before doing AODH migration:

https://review.openstack.org/#/c/355446

And also a patch that should prevent getting the deployment into a bad state in the future:

https://review.openstack.org/#/c/356028

I'd still like to investigate whether we can move away from the symlinks altogether, but such a solution has some minor conflict potential (e.g. custom roles in TripleO, and probably an unclean backport to Mitaka), so putting the fix into the updates/upgrades first could still be the way to go.

Comment 16 Jiri Stransky 2016-08-17 14:30:46 UTC
Submitted another patch for TripleO to not depend on the /etc/puppet/modules symlinks.

https://review.openstack.org/#/c/356457/
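One conceivable way to stop depending on the /etc/puppet/modules symlinks (a sketch only; the actual patch above may take a different approach) is to point Puppet directly at the install location via the modulepath setting in puppet.conf, which is valid for the Puppet 3.x installed in this environment:

```ini
# /etc/puppet/puppet.conf -- illustrative fragment, not the merged patch.
[main]
modulepath = /usr/share/openstack-puppet/modules:/etc/puppet/modules
```

An ad-hoc run could equivalently pass --modulepath on the puppet apply command line.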


The mitaka-only fixup already got merged and solves our immediate problem. The other two are more forward-looking to prevent such issues in the future, but we don't necessarily need them for OSP 7->8->9 upgrade.

I suggest backporting only the mitaka-only fixup for now, as it should fix the problem during AODH migration. I'll also change the BZ title accordingly.

Comment 17 Jiri Stransky 2016-09-19 15:01:39 UTC
The short-term fix went into Mitaka and a long-term solution to prevent similar issues from happening has been merged into Newton. Closing this BZ, as it is predominantly about the fixed Mitaka issue. Adding the Newton patch to the external trackers as well.