Bug 1321132
Summary: | rhel-osp-director: [7.3->8.0] The major-upgrade-pacemaker-converge.yaml step fails due to: Error: Could not restart Service[neutron-server]: Execution of '/usr/bin/systemctl restart neutron-server' returned 1 | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
Component: | openstack-tripleo-heat-templates | Assignee: | Jiri Stransky <jstransk> |
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | | |
Version: | 8.0 (Liberty) | CC: | akaris, dbecker, emacchi, gfidente, jcoufal, jeckersb, jguiditt, jslagle, mandreou, mburns, mcornea, morazi, rhel-osp-director-maint, yeylon |
Target Milestone: | ga | | |
Target Release: | 8.0 (Liberty) | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | openstack-tripleo-heat-templates-0.8.14-4.el7ost | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-04-07 21:50:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Alexander Chuzhoy
2016-03-24 18:15:19 UTC
Investigating. Puppet is not supposed to ever try to restart neutron on its own. The restart is attempted because on upgrades the rabbit password changes from our old default 'guest' to a new randomly generated password. In that case puppet goes through the following:

Notice: /Stage[main]/Neutron/Neutron_config[oslo_messaging_rabbit/rabbit_password]/value: value changed '[old secret redacted]' to '[new secret redacted]'
Notice: /Stage[main]/Main/Exec[galera-ready]/returns: executed successfully
Notice: /Stage[main]/Neutron::Agents::Dhcp/Service[neutron-dhcp-service]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Neutron::Agents::L3/Service[neutron-l3]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Neutron::Agents::Metadata/Service[neutron-metadata]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Pacemaker::Corosync/Exec[enable-not-start-tripleo_cluster]/returns: executed successfully
Notice: /Stage[main]/Pacemaker::Corosync/Exec[Set password for hacluster user on tripleo_cluster]/returns: executed successfully
Notice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]/returns: executed successfully
Notice: Pacemaker has reported quorum achieved
Notice: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]/message: defined 'message' as 'Pacemaker has reported quorum achieved'
Notice: Finished catalog run in 146.15 seconds

The neutron-server service receives a refresh action because of [1]; puppet shouldn't take any action because of [2], but for reasons which are still unclear it tries a systemctl restart instead. We need to continue investigating why that is and how we can avoid it.

1. https://github.com/openstack/puppet-neutron/blob/master/manifests/server.pp#L357
2. https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/manifests/overcloud_controller_pacemaker.pp#L758-L762

The problem seems to be in the puppet module, which defines the services outside the manage_service conditional.

This is how it should be (puppetlabs-rabbitmq):
https://github.com/puppetlabs/puppetlabs-rabbitmq/blob/master/manifests/service.pp#L22-L37

This is how it is (puppet-neutron):
https://github.com/openstack/puppet-neutron/blob/master/manifests/server.pp#L509-L525

We might need to apply such a change for each service in each puppet/openstack module.

(In reply to Giulio Fidente from comment #5)
> The problem seems to be in the puppet module which is defining the services
> outside the manage_service conditional.
>
> This is how it should be (puppetlabs-rabbitmq)
> https://github.com/puppetlabs/puppetlabs-rabbitmq/blob/master/manifests/service.pp#L22-L37
>
> This is how it is (puppet-neutron)
> https://github.com/openstack/puppet-neutron/blob/master/manifests/server.pp#L509-L525
>
> We might need to apply such a change for each service into each
> puppet/openstack module.

update - worked on this today trying to find a workaround after ^^^; I tried two things, unsuccessfully:

1. Moving the service "{ 'neutron-server': definition " into the manage_service conditional at https://github.com/openstack/puppet-neutron/blob/034f0d0779b57d3e53f9fd420818c3343d651938/manifests/server.pp#L516-L524 just means the neutron-server service is never defined, so you get "Error: Could not find resource 'Service[neutron-server]' for relationship from 'Neutron_config[DEFAULT/verbose]'"

2. Setting 'service_ensure' explicitly to 'undef' inside the neutron-server definition (rather than stopped/started), to try to ensure that puppet never does anything to the service, still results in the same behaviour reported in the bug here, with "Error: /Stage[main]/Neutron::Server/Service[neutron-server]: Failed to call refresh: Could not restart Service[neutron-server]:"

(In reply to marios from comment #6)
> update - worked on this today trying to find a workaround after ^^^; I tried
> two things, unsuccessfully:
>
> 1. Moving the service "{ 'neutron-server': definition " into the
> manage_service conditional at
> https://github.com/openstack/puppet-neutron/blob/034f0d0779b57d3e53f9fd420818c3343d651938/manifests/server.pp#L516-L524
> just means the neutron-server service is never defined so you get "Error: Could
> not find resource 'Service[neutron-server]' for relationship from
> 'Neutron_config[DEFAULT/verbose]'"
>
> 2. Setting 'service_ensure' explicitly to 'undef' inside the neutron-server
> definition (rather than stopped/started) to try ensure that puppet never
> does anything to the service, still results in the same behaviour reported
> in the bug here, with "Error:
> /Stage[main]/Neutron::Server/Service[neutron-server]: Failed to call
> refresh: Could not restart Service[neutron-server]:"

https://github.com/openstack/puppet-neutron/blob/034f0d0779b57d3e53f9fd420818c3343d651938/manifests/server.pp#L358

Neutron_config<||> ~> Service['neutron-server']

is likely to be where the config change triggers the refresh for neutron-server

So I've read a lot of things in different comments from marios and gfidente; let me explain some bits here, so we avoid any confusion:

* ~> is a triggering symbol in Puppet that will notify a resource if the resource on the left changes. In the case of a service, the service will be restarted only if the resource has "enabled => true", which is not the case in TripleO with Pacemaker, where "enabled => false" is set, so the service won't be managed by puppet-neutron at all, even when a notify occurs. If you look at this pastebin: http://paste.openstack.org/show/492099/ you'll see that we create an Exec that notifies a neutron-server service that is not enabled; the Puppet catalog just ignores the notify and does not try to restart it.

* puppet-neutron has the right parameters to disable a service. "ensure" and "enabled" are the right ones, so nothing is missing on this side. Moreover, the initial deployment worked, so ~> already ran during the initial deployment.

* we need to investigate puppet-pacemaker, which manages the actual SystemD resource for neutron-server: https://github.com/openstack/puppet-pacemaker/blob/master/manifests/resource/systemd.pp It might be possible that Pacemaker re-enabled the systemd resource (during the upgrade or any time before), so Puppet thinks the resource is enabled and then the ~> will try to restart the service, which we don't want.

So we need to investigate:
* does Pacemaker enable the neutron-server systemd resource again?
* does Puppet check whether the resource is enabled (this needs to be investigated in the Puppet code: whether or not Puppet checks the actual systemd resource before applying the restart trigger).

I don't think the bug is in puppet-neutron anyway. It might be some parameters in puppet-pacemaker or something in the resource agent.

Bug reported in puppet-pacemaker in the meantime, until we find the exact root cause:
https://bugs.launchpad.net/puppet-pacemaker/+bug/1562922

> * does Puppet checks if the resource is enabled (need to be investigated in Puppet code, if whether or not Puppet checks the actual systemd resource before applying the restart trigger).

Yes:
https://github.com/puppetlabs/puppet/blob/3.6.2/lib/puppet/provider/service/systemd.rb#L13

    output = systemctl('list-unit-files', '--type', 'service', '--full', '--all', '--no-pager')
    output.scan(/^(\S+)\s+(disabled|enabled)\s*$/i).each do |m|
      i << new(:name => m[0])
    end

Puppet checks whether the systemd resource exists to determine the status of the resource. This means that even if THT sets "enabled => false" for neutron::server, if the systemd resource exists for some reason (managed by something else), the service will be restarted by a ~> if Neutron_config changes.

Now we need to investigate whether or not pacemaker creates a systemd resource for neutron-server.

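To make the enumeration snippet above easier to reason about, here is a minimal Ruby sketch that applies the same regular expression to made-up `systemctl list-unit-files` style output. The helper name `listed_units`, the sample text, and the two `example-*` unit names are assumptions for illustration only; they are not part of Puppet or of this deployment. It only shows which lines the regex picks up, not how Puppet decides whether to restart a service.

```ruby
# Regex copied from the Puppet 3.6.2 systemd provider snippet quoted above.
UNIT_STATE_RE = /^(\S+)\s+(disabled|enabled)\s*$/i

# Hypothetical helper: return the unit names the regex would pick up
# from `systemctl list-unit-files` style output.
def listed_units(output)
  output.scan(UNIT_STATE_RE).map { |match| match[0] }
end

# Made-up sample output (an assumption, not captured from a real host).
# 'static' is a valid systemd unit file state that the regex does not cover.
SAMPLE_OUTPUT = <<~EOS
  neutron-dhcp-agent.service    disabled
  neutron-l3-agent.service      disabled
  example-static.service        static
  example-enabled.service       enabled
EOS

puts listed_units(SAMPLE_OUTPUT)
```

Units whose state is anything other than `enabled` or `disabled` are simply skipped by this scan, so they would not appear in the list it builds.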
At first glance, we can see that pcs checks whether the systemd resource exists in the same way as Puppet:
https://github.com/feist/pcs/blob/master/pcs/resource.py#L284

Also, for the record, Puppet 3.6 and Puppet 4.x (current master) do not check the systemd resource status the same way.

In 3.6:
https://github.com/puppetlabs/puppet/blob/3.6.2/lib/puppet/provider/service/systemd.rb#L30

    systemctl("is-enabled", @resource[:name])

In master:
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/service/systemd.rb#L58-L63

    systemctl_info = systemctl(
      'show',
      @resource[:name],
      '--property', 'LoadState',
      '--property', 'UnitFileState',
      '--no-pager'
    )

Also, this is how Pacemaker creates the systemd resource:
https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/systemd.c#L556-L561

I'm not sure how the Puppet 3.6 provider interprets it. Does Puppet think the resource is enabled, so that it tries to restart it?

(In reply to Emilien Macchi from comment #10)
> > * does Puppet checks if the resource is enabled (need to be investigated in Puppet code, if whether or not Puppet checks the actual systemd resource before applying the restart trigger).
>
> Yes:
> https://github.com/puppetlabs/puppet/blob/3.6.2/lib/puppet/provider/service/systemd.rb#L13
>
> output = systemctl('list-unit-files', '--type', 'service', '--full',
> '--all', '--no-pager')
> output.scan(/^(\S+)\s+(disabled|enabled)\s*$/i).each do |m|
> i << new(:name => m[0])
> end

After talking to some pacemaker folks, it sounds like the output of the above command for a pacemaker-managed service would be 'static', so it would match neither disabled nor enabled.

> Puppet will check if the systemd resource does exist to determine the status
> of the resource. Which means, even if THT is setting "enabled => false" for
> neutron::server, if the systemd resource exists for some reasons (managed by
> something else), the service will be restarted by a ~> if Neutron_config
> changes.
>
> Now, we need to investigate if whether or not pacemaker creates a systemd
> resource for neutron-server.
> In a first glance, we can see that pcs is checking if the systemd resource
> is existing by using the same way as Puppet:
> https://github.com/feist/pcs/blob/master/pcs/resource.py#L284

Omri checked the following for me on an overcloud controller (7.3ga):

[root@overcloud-controller-0 systemd]# systemctl list-unit-files | grep neutron
neutron-bsn-lldp.service disabled
neutron-cisco-cfg-agent.service disabled
neutron-dhcp-agent.service disabled
neutron-l3-agent.service disabled
neutron-lbaas-agent.service disabled
neutron-lbaasv2-agent.service disabled
neutron-metadata-agent.service disabled
neutron-metering-agent.service disabled
neutron-netns-cleanup.service disabled
neutron-openvswitch-agent.service disabled

Also, for more data:

[root@overcloud-controller-0 systemd]# systemctl is-enabled neutron-server.service
disabled
[root@overcloud-controller-0 systemd]# echo $?
1

(In reply to marios from comment #7)
> https://github.com/openstack/puppet-neutron/blob/034f0d0779b57d3e53f9fd420818c3343d651938/manifests/server.pp#L358
>
> Neutron_config<||> ~> Service['neutron-server']
>
> is likely to be where the config change triggers the refresh for
> neutron-server

that's probably it; the reason why it works for rabbitmq is that in the rabbitmq module notifications are not sent to the service resource but rather to the service class, which eventually defines the resource (based on manage_service) [1]

1. https://github.com/puppetlabs/puppetlabs-rabbitmq/blob/master/manifests/config.pp#L172

*** Bug 1322509 has been marked as a duplicate of this bug. ***

Verified:
Environment: openstack-tripleo-heat-templates-0.8.14-5.el7ost.noarch

Was able to complete the upgrade with no issues.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html