rhel-osp-director: failed to update 7.1->7.2 HA overcloud; the update failed at the very end with a timeout while waiting for keystone to stop during the "restart the whole cluster" step.

Environment:
openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-5.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch
instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch

Steps to reproduce:
1. Deploy an HA overcloud on 7.1 and configure instance HA on it.
2. Update the setup to 7.2.

Result:
The update fails.

Expected result:
The update should succeed.
[stack@puma33 ~]$ heat resource-show overcloud ControllerNodesPostDeployment
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                              |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {}                                                                                                                                                 |
| description            |                                                                                                                                                    |
| links                  | http://192.0.2.1:8004/v1/f468af05049d4404899b6ec8a1c7827d/stacks/overcloud/5eb7a7e6-919e-4d30-98a5-9c74b252fbca/resources/ControllerNodesPostDeployment (self) |
|                        | http://192.0.2.1:8004/v1/f468af05049d4404899b6ec8a1c7827d/stacks/overcloud/5eb7a7e6-919e-4d30-98a5-9c74b252fbca (stack)                            |
|                        | http://192.0.2.1:8004/v1/f468af05049d4404899b6ec8a1c7827d/stacks/overcloud-ControllerNodesPostDeployment-szuhtybg4hqz/a8b4d36c-0ad8-4ed6-91c5-da0d9b1245fd (nested) |
| logical_resource_id    | ControllerNodesPostDeployment                                                                                                                      |
| physical_resource_id   | a8b4d36c-0ad8-4ed6-91c5-da0d9b1245fd                                                                                                               |
| required_by            | BlockStorageNodesPostDeployment                                                                                                                    |
|                        | CephStorageNodesPostDeployment                                                                                                                     |
| resource_name          | ControllerNodesPostDeployment                                                                                                                      |
| resource_status        | UPDATE_FAILED                                                                                                                                      |
| resource_status_reason | resources.ControllerNodesPostDeployment: Error: resources.ControllerPostPuppet.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1 |
| resource_type          | OS::TripleO::ControllerPostDeployment                                                                                                              |
| updated_time           | 2015-12-14T19:32:37Z                                                                                                                               |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
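The resource_status_reason above only says the deployment exited non-zero; the actual error output lives in the failed deployment inside the nested stack. A sketch of how one might drill down with the OSP 7 era heat CLI, run as the stack user on the undercloud (the deployment ID is a placeholder, not taken from this report; commands are echoed here as a dry run):

```shell
# Dry run: print the heat CLI commands that would locate the real error.
# Drop the "echo" to run them for real on the undercloud.
echo "heat resource-list -n 5 overcloud | grep -i failed"  # walk nested stacks for FAILED resources
echo "heat deployment-show <deployment-id>"                # placeholder ID; shows deploy stdout/stderr
```

`heat deployment-show` on the failed SoftwareDeployment's ID should then print the deploy_stdout/deploy_stderr captured from the controller.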
[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Mon Dec 14 17:40:09 2015
Last change: Mon Dec 14 14:47:20 2015 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
6 nodes and 252 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
RemoteOnline: [ overcloud-compute-0 overcloud-compute-1 ]
RemoteOFFLINE: [ overcloud-compute-2 ]

Full list of resources:

 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 ip-192.0.2.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.18.0.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.17.0.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.17.0.11 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 ip-172.19.0.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-10.35.180.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     neutron-server (systemd:neutron-server): FAILED overcloud-controller-0
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: httpd-clone [httpd]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 ]
 nova-evacuate (ocf::openstack:NovaEvacuate): Stopped
 ipmilan-overcloud-controller-0 (stonith:fence_ipmilan): Started overcloud-controller-2
 ipmilan-overcloud-controller-1 (stonith:fence_ipmilan): Started overcloud-controller-2
 ipmilan-overcloud-controller-2 (stonith:fence_ipmilan): Started overcloud-controller-1
 Clone Set: neutron-openvswitch-agent-compute-clone [neutron-openvswitch-agent-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: libvirtd-compute-clone [libvirtd-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: ceilometer-compute-clone [ceilometer-compute]
     Started: [ overcloud-compute-1 ]
     Stopped: [ overcloud-compute-0 overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-compute-0 overcloud-compute-1 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ipmilan-overcloud-compute-0 (stonith:fence_ipmilan): Started overcloud-controller-2
 ipmilan-overcloud-compute-1 (stonith:fence_ipmilan): Started overcloud-controller-2
 ipmilan-overcloud-compute-2 (stonith:fence_ipmilan): Started overcloud-controller-1
 fence-nova (stonith:fence_compute): Started overcloud-controller-1
 overcloud-compute-0 (ocf::pacemaker:remote): Started overcloud-controller-1
 overcloud-compute-1 (ocf::pacemaker:remote): Started overcloud-controller-2
 overcloud-compute-2 (ocf::pacemaker:remote): FAILED overcloud-controller-1

Failed Actions:
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=476, status=complete, exitreason='none', last-rc-change='Mon Dec 14 14:46:22 2015', queued=0ms, exec=25ms
* neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=479, status=complete, exitreason='none', last-rc-change='Mon Dec 14 14:46:22 2015', queued=0ms, exec=22ms
* overcloud-compute-2_monitor_20000 on overcloud-controller-1 'not running' (7): call=347, status=complete, exitreason='none', last-rc-change='Mon Dec 14 17:39:23 2015', queued=0ms, exec=0ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=441, status=complete, exitreason='none', last-rc-change='Mon Dec 14 14:46:32 2015', queued=0ms, exec=9ms
* httpd_monitor_60000 on overcloud-controller-2 'OCF_PENDING' (196): call=247, status=complete, exitreason='none', last-rc-change='Mon Dec 14 13:10:00 2015', queued=0ms, exec=0ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=459, status=complete, exitreason='none', last-rc-change='Mon Dec 14 14:46:32 2015', queued=0ms, exec=24ms

PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
This looks similar to BZ#1291096
It's basically a combination of Bug #1275324 (missing systemd timeouts) and probably Bug #1288528 or something similar.
All the patches to set the right timeouts for bug 1275324 are merged and in use. Are you saying that the timeouts still aren't correct? Given that bug 1288528 is fixed in RHOS 8, do we deem that not a blocker in RHOS 7?
(In reply to James Slagle from comment #6)
> are you saying that the timeouts still aren't correct?

Lazy extrapolation on my part.

I see most have 100s, although it now seems that 120s is required there :-(

However, at least one systemd resource does not have them set:

[root@overcloud-controller-2 heat-admin]# pcs resource show ceilometer-compute
 Resource: ceilometer-compute (class=systemd type=openstack-ceilometer-compute)
  Operations: monitor interval=60s (ceilometer-compute-monitor-interval-60s)

These too, by the looks of it:

[root@overcloud-controller-2 heat-admin]# pcs resource show neutron-openvswitch-agent-compute
 Resource: neutron-openvswitch-agent-compute (class=systemd type=neutron-openvswitch-agent)
  Operations: monitor interval=60s (neutron-openvswitch-agent-compute-monitor-interval-60s)
[root@overcloud-controller-2 heat-admin]# pcs resource show libvirtd-compute
 Resource: libvirtd-compute (class=systemd type=libvirtd)
  Operations: monitor interval=60s (libvirtd-compute-monitor-interval-60s)

Maybe these aren't configured by director?
(In reply to Andrew Beekhof from comment #7)
> (In reply to James Slagle from comment #6)
> > are you saying that the timeouts still aren't correct?
>
> Lazy extrapolation on my part.
>
> I see most have 100s, although it now seems that 120s is required there :-(

For which resources in particular? All of them, including the resources for the services on the controllers, or just the compute resources shown below?

> However at least one systemd resource does not have them set:
>
> [root@overcloud-controller-2 heat-admin]# pcs resource show ceilometer-compute
>  Resource: ceilometer-compute (class=systemd type=openstack-ceilometer-compute)
>   Operations: monitor interval=60s (ceilometer-compute-monitor-interval-60s)
>
> These too by the looks of it:
>
> [root@overcloud-controller-2 heat-admin]# pcs resource show neutron-openvswitch-agent-compute
>  Resource: neutron-openvswitch-agent-compute (class=systemd type=neutron-openvswitch-agent)
>   Operations: monitor interval=60s (neutron-openvswitch-agent-compute-monitor-interval-60s)
> [root@overcloud-controller-2 heat-admin]# pcs resource show libvirtd-compute
>  Resource: libvirtd-compute (class=systemd type=libvirtd)
>   Operations: monitor interval=60s (libvirtd-compute-monitor-interval-60s)
>
> Maybe these aren't configured by director?

Likely. I don't think we configure these in director, given that instance HA setup is a manual process right now, AIUI. We likely need to document, as part of that process, what timeouts the user should set for these resources. Should the *-compute resources be 100s or 120s as well?
(In reply to James Slagle from comment #8)
> (In reply to Andrew Beekhof from comment #7)
> > (In reply to James Slagle from comment #6)
> > > are you saying that the timeouts still aren't correct?
> >
> > Lazy extrapolation on my part.
> >
> > I see most have 100s, although it now seems that 120s is required there :-(
>
> For which resources in particular? All of them, including the resources for
> the services on the controllers,

Unfortunately, yes. You saw https://bugzilla.redhat.com/show_bug.cgi?id=1275324#c15 ? Basically there is apparently an additional grace period before systemd kills the process.

> or just the compute resources shown below?
>
> > However at least one systemd resource does not have them set:
> >
> > [root@overcloud-controller-2 heat-admin]# pcs resource show ceilometer-compute
> >  Resource: ceilometer-compute (class=systemd type=openstack-ceilometer-compute)
> >   Operations: monitor interval=60s (ceilometer-compute-monitor-interval-60s)
> >
> > These too by the looks of it:
> >
> > [root@overcloud-controller-2 heat-admin]# pcs resource show neutron-openvswitch-agent-compute
> >  Resource: neutron-openvswitch-agent-compute (class=systemd type=neutron-openvswitch-agent)
> >   Operations: monitor interval=60s (neutron-openvswitch-agent-compute-monitor-interval-60s)
> > [root@overcloud-controller-2 heat-admin]# pcs resource show libvirtd-compute
> >  Resource: libvirtd-compute (class=systemd type=libvirtd)
> >   Operations: monitor interval=60s (libvirtd-compute-monitor-interval-60s)
> >
> > Maybe these aren't configured by director?
>
> Likely. I don't think we configure these in director given that instance HA
> setup is a manual process right now aiui. We likely need to document as part
> of the process what the user should set the timeouts for these resources.
> Should the *-compute resources be 100s or 120s as well?

120s, yes :(
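Since director doesn't configure the instance-HA resources, the agreed 120s start/stop timeouts would have to be added by hand on one controller. A hedged sketch of what that documentation step might look like (resource names from comment #7; `pcs resource update ... op <operation> <options>` syntax as in RHEL 7 pcs; echoed as a dry run so nothing is changed):

```shell
# Dry run: print the pcs commands that would add 120s start/stop
# timeouts to the manually created instance-HA resources.
# Drop the "echo" to actually apply them on one controller.
for res in ceilometer-compute neutron-openvswitch-agent-compute libvirtd-compute; do
  echo "pcs resource update $res op start timeout=120s op stop timeout=120s"
done
```

Afterwards, `pcs resource show <resource>` should list the new start/stop operations alongside the existing 60s monitor.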
Doc/workaround works for me.
*** This bug has been marked as a duplicate of bug 1295835 ***