Bug 1464464 - L->M upgrade fails on installing aodh step
Status: CLOSED DUPLICATE of bug 1451101
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 9.0 (Mitaka)
Hardware: Unspecified   OS: Linux
Priority: unspecified   Severity: high
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assigned To: Sofer Athlan-Guyot
QA Contact: Marius Cornea
Keywords: Triaged, ZStream
Depends On:
Blocks: 1451101
Reported: 2017-06-23 09:42 EDT by Yolanda Robla
Modified: 2017-07-25 05:47 EDT (History)
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-25 05:47:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
os-collect-config from controller02 (3.06 MB, text/plain)
2017-06-23 09:54 EDT, Yolanda Robla
no flags
os-collect-config from controller0 (3.22 MB, text/plain)
2017-06-23 10:25 EDT, Yolanda Robla
no flags
deployed.json on controller00 (14.83 KB, text/plain)
2017-06-23 10:42 EDT, Yolanda Robla
no flags
sosreport from controller00 (17.33 MB, application/x-xz)
2017-06-23 11:21 EDT, Yolanda Robla
no flags

Description Yolanda Robla 2017-06-23 09:42:01 EDT
Description of problem:

I'm running the first step to upgrade the overcloud nodes from Liberty to Mitaka (L->M), using 3 baremetal controllers and 1 baremetal compute node.
After running this first step:

 openstack overcloud deploy --force-postconfig --control-scale 3 --compute-scale 1 --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml  --ntp-server 135.248.16.241

I'm getting this failure:

  "status": "FAILED", 
  "server_id": "413b675e-2106-443d-9719-08e292196672", 
  "config_id": "3306a226-7a45-46ed-8d2e-142ec8b4157c", 
  "output_values": {
    "deploy_stdout": " Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]\n Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]\nERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.\n", 
    "deploy_stderr": "", 
    "deploy_status_code": 1
  }, 
  "creation_time": "2017-06-23T13:05:51", 
  "updated_time": "2017-06-23T13:06:34", 
  "input_values": {}, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", 

On the controller, I can see the ceilometer resources are not started:

[heat-admin@overcloud-controller-2 ~]$ sudo pcs status
Cluster name: tripleo_cluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jun 23 13:40:36 2017		Last change: Fri Jun 23 13:06:31 2017 by root via crm_resource on overcloud-controller-0

3 nodes and 112 resources configured: 3 resources DISABLED and 0 BLOCKED from being started due to failures

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-172.17.1.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-172.17.4.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-10.5.195.91	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-172.17.3.10	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-172.31.255.111	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-172.17.1.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     openstack-ceilometer-alarm-notifier	(systemd:openstack-ceilometer-alarm-notifier):	FAILED overcloud-controller-1
     openstack-ceilometer-alarm-notifier	(systemd:openstack-ceilometer-alarm-notifier):	FAILED overcloud-controller-2
     Started: [ overcloud-controller-0 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     openstack-ceilometer-alarm-evaluator	(systemd:openstack-ceilometer-alarm-evaluator):	FAILED overcloud-controller-1 (disabled)
     openstack-ceilometer-alarm-evaluator	(systemd:openstack-ceilometer-alarm-evaluator):	FAILED overcloud-controller-2 (disabled)
     Started: [ overcloud-controller-0 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-ceilometer-alarm-notifier_monitor_60000 on overcloud-controller-1 'not running' (7): call=488, status=complete, exitreason='none',
    last-rc-change='Fri Jun 23 13:07:30 2017', queued=0ms, exec=0ms
* openstack-ceilometer-alarm-evaluator_monitor_60000 on overcloud-controller-1 'not running' (7): call=486, status=complete, exitreason='none',
    last-rc-change='Fri Jun 23 13:07:28 2017', queued=0ms, exec=0ms
* openstack-ceilometer-alarm-evaluator_monitor_60000 on overcloud-controller-2 'not running' (7): call=507, status=complete, exitreason='none',
    last-rc-change='Fri Jun 23 13:07:28 2017', queued=0ms, exec=0ms
* openstack-ceilometer-alarm-notifier_monitor_60000 on overcloud-controller-2 'not running' (7): call=509, status=complete, exitreason='none',
    last-rc-change='Fri Jun 23 13:07:30 2017', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


sudo systemctl | grep openstack-ceilometer
  openstack-ceilometer-api.service                                                                                               loaded    active running   Cluster Controlled openstack-ceilometer-api
  openstack-ceilometer-central.service                                                                                           loaded    active running   Cluster Controlled openstack-ceilometer-central
  openstack-ceilometer-collector.service                                                                                         loaded    active running   Cluster Controlled openstack-ceilometer-collector
  openstack-ceilometer-notification.service                                                                                      loaded    active running   Cluster Controlled openstack-ceilometer-notification
Comment 3 Yolanda Robla 2017-06-23 09:54 EDT
Created attachment 1291099 [details]
os-collect-config from controller02
Comment 4 Yolanda Robla 2017-06-23 10:25 EDT
Created attachment 1291106 [details]
os-collect-config from controller0
Comment 5 Yolanda Robla 2017-06-23 10:42 EDT
Created attachment 1291108 [details]
deployed.json on controller00
Comment 6 Yolanda Robla 2017-06-23 11:21 EDT
Created attachment 1291135 [details]
sosreport from controller00
Comment 7 Sofer Athlan-Guyot 2017-06-23 13:35:38 EDT
Ok,

So we may have a race condition here that leads to a failure to disable the openstack-ceilometer-alarm-evaluator resource.

During the switch to aodh we remove the openstack-ceilometer-alarm package
on all three controller nodes.

If the package removal happens on any of the non-bootstrap controllers
before the pcs resource is disabled, the disable operation will fail on the
non-bootstrap node where the package has already been removed (as the
service can no longer be stopped there).
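
A minimal sketch of the ordering that avoids this race (an illustration of the sequencing described above, not the actual patch; resource and package names are taken from this report):

  # On the bootstrap controller only: disable the resources and wait for the
  # stop transition to complete cluster-wide.
  pcs resource disable openstack-ceilometer-alarm-evaluator
  pcs resource disable openstack-ceilometer-alarm-notifier
  crm_resource --wait

  # Only once the resources report Stopped on every controller is it safe for
  # each node (bootstrap and non-bootstrap alike) to remove the package:
  yum -y remove openstack-ceilometer-alarm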
Comment 13 Yolanda Robla 2017-06-27 09:05:54 EDT
Thanks for the patch! After applying it manually to the undercloud, I could successfully perform the aodh upgrade step.
