Bug 1475128

Summary: OSP11 -> OSP12 upgrade: openstack-ceilometer-collector service remains running on the baremetal host during upgrade
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: high    
Version: 12.0 (Pike) CC: dbecker, fbaudin, mandreou, mbultel, mburns, morazi, ohochman, pkilambi, rhel-osp-director-maint, tvignaud
Target Milestone: rc Keywords: Triaged
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.1-0.20170927205937.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-13 21:44:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1399762    
Attachments:
Description Flags
messages none
ansible upgrade_tasks step1 and step2 none

Description Marius Cornea 2017-07-26 06:41:29 UTC
Description of problem:
OSP11 -> OSP12 upgrade: openstack-ceilometer-collector remains running on the baremetal host during upgrade. I would expect the service to be stopped and disabled on the baremetal host during the upgrade so that it can run inside a container. If it is no longer needed, it should still be stopped and disabled, and its rpm removed.

[root@controller-0 heat-admin]# systemctl status openstack-ceilometer-collector
● openstack-ceilometer-collector.service - OpenStack ceilometer collection service
   Loaded: loaded (/usr/lib/systemd/system/openstack-ceilometer-collector.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-07-25 23:43:37 UTC; 6h ago
 Main PID: 414639 (ceilometer-coll)
   Memory: 16.4M
   CGroup: /system.slice/openstack-ceilometer-collector.service
           ├─414639 ceilometer-collector: master process [/usr/bin/ceilometer-collector --logfile /var/log/ceilometer/collector.log]
           └─414853 ceilometer-collector: CollectorService worker(0)

Jul 25 23:43:37 controller-0 systemd[1]: Started OpenStack ceilometer collection service.
Jul 25 23:43:37 controller-0 systemd[1]: Starting OpenStack ceilometer collection service...

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170718190543.el7ost.noarch

How reproducible:
100%

Comment 1 Pradeep Kilambi 2017-07-26 14:55:55 UTC
The collector should be disabled by default. The collector is deprecated in OSP12 and we did not containerize it. It should load this upgrade yaml:

https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/disabled/ceilometer-collector-disabled.yaml

Comment 2 Marios Andreou 2017-07-27 10:27:09 UTC
The openstack-ceilometer-collector service was deprecated in https://review.openstack.org/#/c/450885/. Since that merged, the default resource registry [1] disables the service by pointing to puppet/services/disabled/ceilometer-collector-disabled.yaml, as per pradk's comment #1 (and the service has not been containerized).

So we must be missing something. As for the expectation in comment #0, the service *is* already being stopped and disabled at [2], unless we are still enabling it somehow. (For removing the package we have https://review.openstack.org/479886 still in review, but it shouldn't affect us here.)

Mcornea, can you sanity check the templates you used? I already checked OSP12 and can't see anything missing from upstream, i.e. everything from the review that deprecated this seems to be there. In particular, check that you aren't pointing to the non-disabled ceilometer-collector. In theory you'd have to include [3] nowadays to get ceilometer-collector.

Alternatively, if you have /var/log/messages from the controller, I can go through the upgrade_tasks and check whether the 'stop and disable' ceilometer-collector tasks are there as they should be.

thanks

[1] https://github.com/openstack/tripleo-heat-templates/blob/c2b2cc555a7d6d447e2e33b7d9f29801eb740b03/overcloud-resource-registry-puppet.j2.yaml#L202

[2] https://github.com/openstack/tripleo-heat-templates/blob/c2b2cc555a7d6d447e2e33b7d9f29801eb740b03/puppet/services/disabled/ceilometer-collector-disabled.yaml#L39

[3] https://github.com/openstack/tripleo-heat-templates/blob/c2b2cc555a7d6d447e2e33b7d9f29801eb740b03/environments/services/ceilometer-collector.yaml

Comment 3 Marius Cornea 2017-07-27 11:54:21 UTC
This is the deploy command that I used for the docker composable upgrade:

#!/bin/bash

timeout 100m openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
-e /home/stack/virt/network/network-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/docker-osp12.yaml \

Attaching /var/log/messages from the controller node

Comment 4 Marius Cornea 2017-07-27 11:56:12 UTC
Created attachment 1305304 [details]
messages

Comment 5 Marios Andreou 2017-07-28 09:52:23 UTC
Created attachment 1305877 [details]
ansible upgrade_tasks step1 and step2

Picked out steps 1 and 2 from the attached /var/log/messages for easier debugging.

Comment 6 Marios Andreou 2017-07-28 10:12:47 UTC
@mcornea I checked the log you attached and picked out step1 and 2 of the upgrade_tasks into a new attachment.

Indeed I do not see the expected 'stop and disable ceilometer-collector' so something else must be going on.

I guess the next step is to sanity check the templates, unless I am still missing something.

Are you by any chance using environments/disable-telemetry.yaml in your deployment? I don't see it in the command above, but it is worth checking. That file contains "OS::TripleO::Services::CeilometerCollector: OS::Heat::None", which, if used, would explain why the (now) default mapping OS::TripleO::Services::CeilometerCollector: puppet/services/disabled/ceilometer-collector-disabled.yaml (from the resource registry) is being overridden.

Otherwise can you check/grep against your templates for 
"OS::TripleO::Services::CeilometerCollector" to see if there is some other mapping there.

Comment 7 Marius Cornea 2017-08-03 15:23:22 UTC
(In reply to marios from comment #6)
> @mcornea I checked the log you attached and picked out step1 and 2 of the
> upgrade_tasks into a new attachment.
> 
> Indeed I do not see the expected 'stop and disable ceilometer-collector' so
> something else must be going on.
> 
> I guess next step is to sanity check the templates, or I missed something
> still. 
> 
> Are you by any chance using environments/disable-telemetry in your
> deployment (I don't see it in the templates above but worth checking)... in
> that file I see "OS::TripleO::Services::CeilometerCollector: OS::Heat::None"
> which if used would explain why the (now) default 
> OS::TripleO::Services::CeilometerCollector:
> puppet/services/disabled/ceilometer-collector-disabled.yaml (from the
> resource registry) is being overruled.

Nope, these are the environments used for the deploy command:

-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/docker-osp12.yaml \

> Otherwise can you check/grep against your templates for 
> "OS::TripleO::Services::CeilometerCollector" to see if there is some other
> mapping there.

(undercloud) [stack@undercloud-0 openstack-tripleo-heat-templates]$ grep -Ri OS::TripleO::Services::CeilometerCollector
deployed-server/deployed-server-roles-data.yaml:    - OS::TripleO::Services::CeilometerCollector
environments/contrail/roles_data_contrail.yaml:    - OS::TripleO::Services::CeilometerCollector
environments/disable-telemetry.yaml:  OS::TripleO::Services::CeilometerCollector: OS::Heat::None
environments/services/ceilometer-collector.yaml:  OS::TripleO::Services::CeilometerCollector: ../../puppet/services/ceilometer-collector.yaml
overcloud-resource-registry-puppet.j2.yaml:  OS::TripleO::Services::CeilometerCollector: puppet/services/disabled/ceilometer-collector-disabled.yaml

[stack@undercloud-0 ~]$ grep -Ri OS::TripleO::Services::CeilometerCollector /home/stack/virt/
[stack@undercloud-0 ~]$
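
The grep output above lists every file that mentions the key, but not which mapping takes effect. As a rough sketch, assuming that environment files passed later with -e override earlier ones and that any of them overrides the default registry, the helper below prints the last mapping found in an ordered list of files. The name effective_mapping is made up for illustration, and the function matches plain text rather than parsing YAML, so treat it as a quick check only.

```shell
# effective_mapping KEY FILE...
# Prints the value of the last "KEY: value" line found in the given files,
# scanning them in the order given (default registry first, then each -e file).
# Plain text matching only: it does not parse YAML or resolve relative paths.
effective_mapping() {
    local key="$1"
    shift
    local file line result=""
    for file in "$@"; do
        [ -r "$file" ] || continue
        # Keep the last matching line in this file, if any.
        line=$(grep -E "^[[:space:]]*${key}:[[:space:]]" "$file" | tail -n 1)
        if [ -n "$line" ]; then
            # Strip the "KEY:" prefix and any surrounding blanks.
            result=$(printf '%s\n' "$line" |
                sed -E "s/^[[:space:]]*${key}:[[:space:]]*//; s/[[:space:]]*\$//")
        fi
    done
    printf '%s\n' "$result"
}
```

Called with the registry file first and then the environment files in the same order as on the deploy command line, it prints the mapping that the last matching file sets. Roles files, which list the service as a "- OS::TripleO::Services::..." entry, are ignored because they contain no "KEY: value" line.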

Comment 8 Marios Andreou 2017-08-17 13:44:39 UTC
Current theory: ceilometer-collector is now disabled by default, since the registry points to services/disabled, AND it has been removed entirely from roles_data.yaml ( https://github.com/openstack/tripleo-heat-templates/blob/master/roles_data.yaml ). So even though the registry points to services/disabled, the service is not included in any role, and the tasks in that file are never executed.

If you have ceilometer-collector, it means your roles_data used when you deployed has that service. So you should include it again in the roles data you use on upgrade. But this also seems counterintuitive. Will discuss on scrum today.

We can also confirm this by checking the stack, like:

openstack stack output show overcloud EnabledServices > EnabledServices

grep ceilometer ./EnabledServices

You shouldn't have the ceilometercollector there.
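
The same theory can be checked from the roles file side. The sketch below assumes the roles data lists services as "- OS::TripleO::Services::<Name>" entries, as in the grep output in comment 7. The function names are hypothetical, and the match is plain text, so it does not say which role an entry belongs to.

```shell
# service_in_roles SERVICE ROLES_FILE
# Succeeds if ROLES_FILE contains SERVICE as a "- SERVICE" list entry.
# Text match only; commented-out entries are not counted.
service_in_roles() {
    local service="$1" roles_file="$2"
    grep -Eq "^[[:space:]]*-[[:space:]]+${service}[[:space:]]*\$" "$roles_file"
}

# report_collector ROLES_FILE
# Prints "listed" or "not listed" for the ceilometer collector service.
report_collector() {
    local roles_file="$1"
    if service_in_roles "OS::TripleO::Services::CeilometerCollector" "$roles_file"; then
        echo "listed"
    else
        echo "not listed"
    fi
}
```

If the theory holds, "not listed" for the roles file used on upgrade would go together with the missing 'stop and disable' tasks in the log.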

Comment 9 Pradeep Kilambi 2017-08-17 14:52:59 UTC
We do something similar with CeilometerExpirer, where the service is disabled but it's still in roles_data, so it gets picked up, I guess. Marios and I had a quick chat on this, and this could be a potential solution. If there are better ways, we can discuss. I pushed a patch upstream so we can discuss further and merge if we all agree:

https://review.openstack.org/494589

Comment 10 Marios Andreou 2017-08-18 09:47:46 UTC
Adding pradk's review to the trackers. Pasting from my comment there: "it seems counterintuitive to require the service to be in roles_data even though it is disabled by default, but I can't think of another solution. It might somehow be rationalised as a deprecation period of one cycle (!) since we need it there in order to run the remaining service-specific decommission tasks before it completely disappears on the Q->R upgrade"

Comment 13 Marius Cornea 2017-11-22 13:55:11 UTC
[root@controller-0 heat-admin]# systemctl status openstack-ceilometer-collector
● openstack-ceilometer-collector.service - OpenStack ceilometer collection service
   Loaded: loaded (/usr/lib/systemd/system/openstack-ceilometer-collector.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Nov 21 22:01:39 controller-0 systemd[1]: openstack-ceilometer-collector.service stop-sigterm timed out. Killing.
Nov 21 22:01:39 controller-0 systemd[1]: openstack-ceilometer-collector.service: main process exited, code=killed, status=9/KILL
Nov 21 22:01:39 controller-0 systemd[1]: Unit openstack-ceilometer-collector.service entered failed state.
Nov 21 22:01:39 controller-0 systemd[1]: openstack-ceilometer-collector.service failed.
Nov 21 22:01:39 controller-0 systemd[1]: Started OpenStack ceilometer collection service.
Nov 21 22:01:39 controller-0 systemd[1]: Starting OpenStack ceilometer collection service...
Nov 21 22:03:11 controller-0 ceilometer-collector[71272]: /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:246: NotSupportedWarning: Configuration option(s) ['api', 'api_paste_config', 'config_dir', 'c...'] not supported
Nov 21 22:03:11 controller-0 ceilometer-collector[71272]: exception.NotSupportedWarning
Nov 22 11:07:17 controller-0 systemd[1]: Stopping OpenStack ceilometer collection service...
Nov 22 11:07:35 controller-0 systemd[1]: Stopped OpenStack ceilometer collection service.
Hint: Some lines were ellipsized, use -l to show in full.
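
To compare captured status output like the above against the output in the description, a small parser can reduce each one to two words. This is only a sketch: unit_state is a hypothetical helper, and it assumes the "Loaded:" and "Active:" line layout shown in this report.

```shell
# unit_state
# Reads `systemctl status UNIT` output on stdin and prints
# "<enablement> <active-state>", for example "disabled inactive".
unit_state() {
    awk '
        # Loaded: loaded (/path/unit.service; enabled; vendor preset: disabled)
        $1 == "Loaded:" {
            split($0, parts, ";")
            enabled = parts[2]
            gsub(/[ \t)]/, "", enabled)
        }
        # Active: active (running) since ...   or   Active: inactive (dead)
        $1 == "Active:" { active = $2 }
        END { print enabled, active }
    '
}
```

Piping `systemctl status openstack-ceilometer-collector` into unit_state gives one line per host that is easy to diff. On a live host, `systemctl is-enabled` and `systemctl is-active` answer the same two questions directly, so the parser is mainly useful for output that has already been pasted into a log or a bug report.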

Comment 16 errata-xmlrpc 2017-12-13 21:44:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462