Bug 1289260 - rhel-osp-director: 7.1 GA openstack-heat resources are down on 1 controller in HA deployment
Status: CLOSED WORKSFORME
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware/OS: Unspecified / Unspecified
Priority: high  Severity: unspecified
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assigned To: Marios Andreou
QA Contact: yeylon@redhat.com
Reported: 2015-12-07 13:52 EST by Alexander Chuzhoy
Modified: 2016-04-18 02:52 EDT
CC List: 6 users

Doc Type: Bug Fix
Last Closed: 2015-12-09 14:00:46 EST
Type: Bug


Attachments
/var/log dir from one controller. (3.82 MB, application/x-gzip)
2015-12-07 14:00 EST, Alexander Chuzhoy
Description Alexander Chuzhoy 2015-12-07 13:52:11 EST
rhel-osp-director: 7.1 GA openstack-heat resources are down on 1 controller in HA deployment

Environment:
openstack-heat-common-2015.1.1-5.el7ost.noarch
openstack-heat-engine-2015.1.1-5.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cfn-2015.1.1-5.el7ost.noarch
openstack-heat-api-2015.1.1-5.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.1-5.el7ost.noarch
instack-undercloud-2.1.2-29.el7ost.noarch




Steps to reproduce:
1. Deploy HA overcloud with 3 ceph nodes.
2. Login to one controller.
3. Run pcs status


Result:
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]


Expected results:
The resources should be running on all controllers.
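One quick way to spot the affected resources is to filter the `pcs status` output for clone sets that have stopped members. A minimal sketch, run here against an inlined sample of the output above (on a controller you would pipe `pcs status` in instead):

```shell
# Sample `pcs status` fragment; on a controller, replace the printf with:
#   pcs status | awk '...'
pcs_status=' Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]'

# Remember the current clone set name; print it when a Stopped line follows.
printf '%s\n' "$pcs_status" | awk '
  /Clone Set:/  { clone = $3 }
  /Stopped: \[/ { print clone }'
```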
Comment 2 Alexander Chuzhoy 2015-12-07 14:00 EST
Created attachment 1103338 [details]
/var/log dir from one controller.
Comment 3 Alexander Chuzhoy 2015-12-07 14:02:07 EST
Reproduced on the same env.

Workaround:
[root@overcloud-controller-0 ~]# pcs resource cleanup openstack-heat-engine
[root@overcloud-controller-0 ~]# pcs resource cleanup openstack-heat-api
[root@overcloud-controller-0 ~]# pcs resource cleanup openstack-heat-api-cloudwatch

After that all resources appear as started on all controllers.
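The same workaround can be expressed as a loop over the heat resources; openstack-heat-api-cfn is added here on the assumption that its clone set (also shown stopped in the pcs output above) needs the same cleanup. The `echo` keeps this a dry run that only prints the commands; drop it on a real controller:

```shell
# Dry run: print a `pcs resource cleanup` invocation per heat resource.
for res in openstack-heat-engine openstack-heat-api \
           openstack-heat-api-cloudwatch openstack-heat-api-cfn; do
  echo pcs resource cleanup "$res"
done
```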
Comment 4 James Slagle 2015-12-08 12:21:08 EST
In anticipation of probably having to get the HA folks involved with this one, can you run crm_report on each controller node and either attach the results here or upload them somewhere for review?
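One hedged way to script that collection, shown as a dry run that only prints the per-controller commands (the heat-admin SSH user, the `-f`/`--from` start time, and the output path are assumptions to adjust for the real deployment window):

```shell
# Dry run: print a crm_report invocation for each controller node.
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
  echo "ssh heat-admin@$node sudo crm_report -f '2015-12-07 13:00' /tmp/crm_report-$node"
done
```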
Comment 5 Marios Andreou 2015-12-08 12:56:33 EST
Besides the heat resources, are the neutron resources behaving OK? That is, in the pcs status output, what do the neutron services/agents look like? I see a lot of restarts for neutron-server in the attached messages, and *I believe* this environment has the "older" neutron constraints, such that:

pcs constraint order show | grep neutron

would include:

  start neutron-server-clone then start neutron-ovs-cleanup-clone (kind:Mandatory)

I am getting ready to call it a day but will pick up tomorrow with any added info you can provide,

thanks
Comment 6 Marios Andreou 2015-12-09 06:06:34 EST
Hi Sasha... I just had a go at reproducing this and was somewhat successful, in that I *did* see some services stopped in pcs status after 'Overcloud Deployed' was declared. However, on my (virt) environment, after a couple of minutes I get a clean pcs status. Does pacemaker eventually manage to bring everything up on your env if you don't run the resource cleanup? (For me it was on the order of 2-4 minutes.) More info on my env below.
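Since pacemaker settles on its own here, an alternative to running the cleanup is simply to wait for a clean status. A sketch, with the check factored into a function so it can be exercised on sample text (the 15-second poll interval is arbitrary):

```shell
# Succeed only when the given status text contains no "Stopped:" members.
cluster_settled() {
  ! printf '%s\n' "$1" | grep -q 'Stopped:'
}

# On a controller (not run here):
#   until cluster_settled "$(pcs status)"; do sleep 15; done
```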

I have a 7.1 environment... my packages seem to match what you have listed - my env is the 2015-10-05.1 puddle using the 2015-10-05.1 images.

I didn't see this on my initial deploy with just 3 control and 2 compute nodes. When I added 3 ceph nodes like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 3  --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --ntp-server "0.fedora.pool.ntp.org"

As soon as I saw 'Overcloud Deployed', from controller 0 I saw quite a few things stopped (including the heat resources you mention):

[root@overcloud-controller-0 heat-admin]# pcs status | grep -ni stop -C 3
26-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
27- Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
28-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
29:     Stopped: [ overcloud-controller-0 ]
30- Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
31:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
32- Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
33:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
34- Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
35-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
36- Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
37-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
38:     Stopped: [ overcloud-controller-0 ]
39- Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
40-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
41:     Stopped: [ overcloud-controller-0 ]
42- Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
43-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
44:     Stopped: [ overcloud-controller-0 ]
45- Clone Set: openstack-heat-api-clone [openstack-heat-api]
46:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
47- Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
48-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
49- Clone Set: openstack-nova-api-clone [openstack-nova-api]
50-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
51- Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
52:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
53- Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
54-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
55- Clone Set: openstack-keystone-clone [openstack-keystone]
--
59- Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
60-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
61- Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
62:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
63- Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
64-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
65- Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
66-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
67:     Stopped: [ overcloud-controller-0 ]
68- Clone Set: openstack-glance-api-clone [openstack-glance-api]
69-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
70- Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
71-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
72:     Stopped: [ overcloud-controller-0 ]
73- Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
74-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
75- Clone Set: delay-clone [delay]
--
82- Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
83-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
84- Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
85:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
86- Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
87:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
88- openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-1
89- Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
90-     Started: [ overcloud-controller-0 overcloud-controller-2 ]
91:     Stopped: [ overcloud-controller-1 ]
92-
93-Failed Actions:
94-* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=212, status=complete, exitreason='none',



But within about 2-3 minutes this cleared up and I got a 'green' pcs status. My undercloud packages, FYI:

[stack@instack ~]$ rpm -qa | grep "heat\|instack"
openstack-tripleo-heat-templates-0.8.6-71.el7ost.noarch
openstack-heat-engine-2015.1.1-5.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.1-5.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-heat-api-2015.1.1-5.el7ost.noarch
openstack-heat-common-2015.1.1-5.el7ost.noarch
instack-undercloud-2.1.2-29.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-api-cfn-2015.1.1-5.el7ost.noarch

thanks, marios
Comment 7 Alexander Chuzhoy 2015-12-09 13:52:08 EST
I just checked another deployment untouched for a few hours.
No stopped resources.
